Health & Metrics
Horizon exposes two observability endpoints: a lightweight health check for uptime monitoring and a Prometheus-compatible metrics endpoint for detailed operational telemetry. Neither endpoint requires authentication, so external monitoring systems can poll them without managing credentials.
Health Check
GET /health

Returns the current health status of the Horizon API server. No authentication required.
The health endpoint is designed for load balancer probes, container orchestrators (e.g., Kubernetes liveness and readiness probes), and uptime monitoring services. It performs a minimal internal check and responds quickly.
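For Kubernetes, the probes mentioned above might look like the following sketch. The container port and timing values here are illustrative assumptions, not documented Horizon defaults; adjust them for your deployment:

```yaml
# Illustrative probe configuration; port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 10
  timeoutSeconds: 5
```

Because `unhealthy` returns a 503 (see the status table below), a plain HTTP probe distinguishes it from `ok` and `degraded` without inspecting the response body.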
Request
```bash
curl -X GET https://api.horizonplatform.ai/health
```

```javascript
const response = await fetch('https://api.horizonplatform.ai/health');
const health = await response.json();

console.log(health.status);  // "ok"
console.log(health.version); // "2.4.1"
console.log(health.uptime);  // 86400
```

```python
import requests

response = requests.get('https://api.horizonplatform.ai/health')
health = response.json()

print(health['status'])   # "ok"
print(health['version'])  # "2.4.1"
print(health['uptime'])   # 86400
```

Response
```
// 200 OK
{
  "status": "ok",
  "version": "2.4.1",
  "uptime": 86400
}
```

Response Fields
| Field | Type | Description |
|---|---|---|
| `status` (required) | string | The health status. Returns `'ok'` when the server is healthy, or `'degraded'` if non-critical subsystems are impaired. |
| `version` (required) | string | The current API server version (semver). |
| `uptime` (required) | number | Server uptime in seconds since the last process start. |
Health Status Values
| Status | HTTP Code | Meaning |
|---|---|---|
| `ok` | 200 | All systems are operational. |
| `degraded` | 200 | The API is functional but a non-critical subsystem (e.g., metrics collection) is impaired. |
| `unhealthy` | 503 | The server cannot serve requests, typically because the database or Redis connection has failed. |
Configuring Uptime Monitoring
Most uptime monitoring services can be pointed directly at the health endpoint. Here is a typical configuration:
- URL: https://api.horizonplatform.ai/health
- Method: GET
- Expected status: 200
- Interval: 30 seconds
- Timeout: 5 seconds
- Alert on: status code not 200, or response time exceeding 2 seconds
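If you roll your own monitor, the alert conditions in this checklist reduce to a few lines. A minimal sketch in Python, assuming you fetch `/health` yourself (e.g., `requests.get(url, timeout=5)`) and time the call; the `classify_health` helper is illustrative, not part of any Horizon SDK:

```python
def classify_health(status_code: int, body: dict, elapsed_seconds: float) -> str:
    """Map a /health response to a monitoring outcome, per the thresholds above."""
    if status_code != 200:
        return "alert"  # unhealthy (503) or any unexpected status
    if elapsed_seconds > 2.0:
        return "alert"  # healthy response, but slower than the 2-second budget
    if body.get("status") == "degraded":
        return "warn"   # 200 with an impaired non-critical subsystem
    return "ok"
```

Treating a timeout or connection error the same as a non-200 response keeps the logic aligned with the 5-second timeout above.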
Prometheus Metrics
GET /metrics

Returns Prometheus-compatible metrics in text exposition format. No authentication required.
The metrics endpoint provides detailed operational telemetry for the Horizon Express server and BullMQ job queues. Metrics are exposed in the standard Prometheus text format and can be scraped by any Prometheus-compatible collector.
Request
```bash
curl -X GET https://api.horizonplatform.ai/metrics
```

```javascript
const response = await fetch('https://api.horizonplatform.ai/metrics');
const metricsText = await response.text();

// Parse with a Prometheus client library if needed
console.log(metricsText);
```

```python
import requests

response = requests.get('https://api.horizonplatform.ai/metrics')
metrics_text = response.text

# Parse with prometheus_client or feed directly into a collector
print(metrics_text)
```

Response
The endpoint returns `text/plain; version=0.0.4` content. Here is an abbreviated example:
```
# HELP horizon_http_requests_total Total number of HTTP requests
# TYPE horizon_http_requests_total counter
horizon_http_requests_total{method="GET",path="/api/conversations",status="200"} 14523
horizon_http_requests_total{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",status="202"} 892
horizon_http_requests_total{method="GET",path="/health",status="200"} 98210

# HELP horizon_http_request_duration_seconds HTTP request latency in seconds
# TYPE horizon_http_request_duration_seconds histogram
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="0.1"} 450
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="0.5"} 780
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="1"} 870
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="+Inf"} 892

# HELP horizon_queue_depth Current number of jobs waiting in the BullMQ queue
# TYPE horizon_queue_depth gauge
horizon_queue_depth{queue="skill-execution"} 3
horizon_queue_depth{queue="webhooks"} 0
horizon_queue_depth{queue="scheduled-jobs"} 1

# HELP horizon_active_jobs Current number of jobs being processed
# TYPE horizon_active_jobs gauge
horizon_active_jobs{queue="skill-execution"} 5
horizon_active_jobs{queue="webhooks"} 2
horizon_active_jobs{queue="scheduled-jobs"} 0

# HELP horizon_errors_total Total number of errors by type
# TYPE horizon_errors_total counter
horizon_errors_total{type="validation_error"} 234
horizon_errors_total{type="authentication_required"} 89
horizon_errors_total{type="skill_execution_failed"} 12
```

Available Metrics
| Metric | Type | Description |
|---|---|---|
| `horizon_http_requests_total` | Counter | Total HTTP requests, labeled by method, path, and status code. |
| `horizon_http_request_duration_seconds` | Histogram | Request latency distribution with configurable bucket boundaries. |
| `horizon_queue_depth` | Gauge | Number of jobs waiting in each BullMQ queue. |
| `horizon_active_jobs` | Gauge | Number of jobs currently being processed per queue. |
| `horizon_completed_jobs_total` | Counter | Total completed jobs per queue. |
| `horizon_failed_jobs_total` | Counter | Total failed jobs per queue. |
| `horizon_errors_total` | Counter | Total errors by error type. |
| `horizon_job_duration_seconds` | Histogram | Job processing time distribution per queue. |
| `horizon_supabase_pool_active` | Gauge | Number of active Supabase database connections. |
| `horizon_supabase_pool_idle` | Gauge | Number of idle database connections in the pool. |
| `horizon_redis_connected` | Gauge | Whether the Redis connection is active (1 = connected, 0 = disconnected). |
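As a quick illustration of the exposition format these metrics use, here is a minimal stdlib-only parser for flat samples like those in the example response. It is a sketch: it skips `# HELP`/`# TYPE` metadata and does not handle label-value escaping or spaces inside label values, so prefer the `prometheus_client` parser in real tooling:

```python
def parse_metric_samples(text: str) -> dict:
    """Parse Prometheus text-exposition samples into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        metric, value = line.rsplit(" ", 1)
        if "{" in metric:
            name, labels = metric.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = metric, ""
        samples[(name, labels)] = float(value)
    return samples

sample = '''# HELP horizon_queue_depth Current number of jobs waiting
# TYPE horizon_queue_depth gauge
horizon_queue_depth{queue="skill-execution"} 3
horizon_redis_connected 1'''
metrics = parse_metric_samples(sample)
```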
Prometheus Configuration
Add the following scrape configuration to your `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'horizon-api'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['api.horizonplatform.ai:443']
```

For self-hosted deployments behind a VPN or private network, adjust the target accordingly:
```yaml
scrape_configs:
  - job_name: 'horizon-api'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ['horizon-api.internal:3000']
```

Grafana Dashboard
Horizon metrics work well with Grafana. Below are recommended panels for a production dashboard:
| Panel | Query | Visualization |
|---|---|---|
| Request Rate | `rate(horizon_http_requests_total[5m])` | Time series |
| P95 Latency | `histogram_quantile(0.95, rate(horizon_http_request_duration_seconds_bucket[5m]))` | Time series |
| Queue Depth | `horizon_queue_depth` | Gauge / Time series |
| Active Jobs | `horizon_active_jobs` | Stat |
| Error Rate | `rate(horizon_errors_total[5m])` | Time series |
| Job Duration P99 | `histogram_quantile(0.99, rate(horizon_job_duration_seconds_bucket[5m]))` | Time series |
| DB Pool Usage | `horizon_supabase_pool_active / (horizon_supabase_pool_active + horizon_supabase_pool_idle)` | Gauge |
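To build intuition for the `histogram_quantile` panels, here is a simplified Python version of the calculation: it linearly interpolates within the target bucket of a cumulative `(upper_bound, count)` list, as Prometheus does. It is a sketch only; real Prometheus also applies `rate()`, aggregates across series, and handles edge cases around the lowest bucket differently:

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by bound and end with float('inf') (+Inf).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open +Inf bucket
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Bucket counts from the /metrics example above:
buckets = [(0.1, 450), (0.5, 780), (1.0, 870), (float("inf"), 892)]
p95 = histogram_quantile(0.95, buckets)  # ~0.874 seconds
```

With those counts, the 95th-percentile rank (0.95 × 892 ≈ 847) lands in the 0.5–1.0 s bucket, so the estimate interpolates to roughly 0.87 s.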
Alerting Recommendations
Use these PromQL expressions as starting points for alerting rules:
```yaml
groups:
  - name: horizon
    rules:
      - alert: HighErrorRate
        expr: rate(horizon_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Horizon error rate is elevated"

      - alert: QueueBacklog
        expr: horizon_queue_depth{queue="skill-execution"} > 50
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Skill execution queue backlog exceeding 50 jobs"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(horizon_http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 API latency exceeding 2 seconds"

      - alert: RedisDisconnected
        expr: horizon_redis_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Horizon lost connection to Redis"
```