Health & Metrics

Horizon exposes two observability endpoints: a lightweight health check for uptime monitoring and a Prometheus-compatible metrics endpoint for detailed operational telemetry. Neither endpoint requires authentication, making them safe to use with external monitoring systems.


GET /health

Returns the current health status of the Horizon API server. No authentication required.

The health endpoint is designed for load balancer probes, container orchestrators (e.g., Kubernetes liveness and readiness probes), and uptime monitoring services. It performs a minimal internal check and responds quickly.
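
For example, Kubernetes probes can point directly at the endpoint. This is a sketch, not a complete pod spec, and the container port 3000 is an assumed value; match it to your deployment:

```
# Sketch of Kubernetes probes against GET /health.
# Port 3000 is an assumption; use your actual container port.
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 15
  timeoutSeconds: 5
```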

curl
curl -X GET https://api.horizonplatform.ai/health
JavaScript
const response = await fetch('https://api.horizonplatform.ai/health');
const health = await response.json();
console.log(health.status); // "ok"
console.log(health.version); // "2.4.1"
console.log(health.uptime); // 86400
Python
import requests
response = requests.get('https://api.horizonplatform.ai/health')
health = response.json()
print(health['status']) # "ok"
print(health['version']) # "2.4.1"
print(health['uptime']) # 86400
// 200 OK
{
"status": "ok",
"version": "2.4.1",
"uptime": 86400
}

Response Fields

  • status (string, required): The health status. Returns 'ok' when the server is healthy, or 'degraded' if non-critical subsystems are impaired.
  • version (string, required): The current API server version (semver).
  • uptime (number, required): Server uptime in seconds since the last process start.

Status Values

  • ok (HTTP 200): All systems are operational.
  • degraded (HTTP 200): The API is functional but a non-critical subsystem (e.g., metrics collection) is impaired.
  • unhealthy (HTTP 503): The server cannot serve requests because the database or Redis connection has failed.

Most uptime monitoring services can be pointed directly at the health endpoint. Here is a typical configuration:

  • URL: https://api.horizonplatform.ai/health
  • Method: GET
  • Expected status: 200
  • Interval: 30 seconds
  • Timeout: 5 seconds
  • Alert on: Status code not 200, or response time exceeding 2 seconds
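
The alerting policy in that checklist can be sketched in a few lines of Python using only the standard library. The `probe` and `should_alert` helpers are illustrative names, not part of any Horizon SDK:

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://api.horizonplatform.ai/health"

def should_alert(status_code: int, elapsed_seconds: float) -> bool:
    """Mirror the monitor rules above: alert on non-200 or >2s responses."""
    return status_code != 200 or elapsed_seconds > 2.0

def probe(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Run one probe cycle; returns True when an alert should fire."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return should_alert(resp.status, time.monotonic() - start)
    except (urllib.error.URLError, TimeoutError):
        return True  # connection failures and timeouts are alert-worthy
```

Run `probe()` every 30 seconds from your scheduler of choice to match the interval above.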

GET /metrics

Returns Prometheus-compatible metrics in text exposition format. No authentication required.

The metrics endpoint provides detailed operational telemetry for the Horizon Express server and BullMQ job queues. Metrics are exposed in the standard Prometheus text format and can be scraped by any Prometheus-compatible collector.

curl
curl -X GET https://api.horizonplatform.ai/metrics
JavaScript
const response = await fetch('https://api.horizonplatform.ai/metrics');
const metricsText = await response.text();
// Parse with a Prometheus client library if needed
console.log(metricsText);
Python
import requests
response = requests.get('https://api.horizonplatform.ai/metrics')
metrics_text = response.text
# Parse with prometheus_client or feed directly into a collector
print(metrics_text)

The endpoint returns text/plain; version=0.0.4 content. Here is an abbreviated example:

# HELP horizon_http_requests_total Total number of HTTP requests
# TYPE horizon_http_requests_total counter
horizon_http_requests_total{method="GET",path="/api/conversations",status="200"} 14523
horizon_http_requests_total{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",status="202"} 892
horizon_http_requests_total{method="GET",path="/health",status="200"} 98210
# HELP horizon_http_request_duration_seconds HTTP request latency in seconds
# TYPE horizon_http_request_duration_seconds histogram
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="0.1"} 450
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="0.5"} 780
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="1"} 870
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="+Inf"} 892
# HELP horizon_queue_depth Current number of jobs waiting in the BullMQ queue
# TYPE horizon_queue_depth gauge
horizon_queue_depth{queue="skill-execution"} 3
horizon_queue_depth{queue="webhooks"} 0
horizon_queue_depth{queue="scheduled-jobs"} 1
# HELP horizon_active_jobs Current number of jobs being processed
# TYPE horizon_active_jobs gauge
horizon_active_jobs{queue="skill-execution"} 5
horizon_active_jobs{queue="webhooks"} 2
horizon_active_jobs{queue="scheduled-jobs"} 0
# HELP horizon_errors_total Total number of errors by type
# TYPE horizon_errors_total counter
horizon_errors_total{type="validation_error"} 234
horizon_errors_total{type="authentication_required"} 89
horizon_errors_total{type="skill_execution_failed"} 12
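
To inspect samples like these programmatically, a minimal hand-rolled parser covers the simple lines above. This is a sketch; production code should use a proper Prometheus client parser (e.g., the `prometheus_client` package) rather than this regex approach:

```python
import re

# A few lines in Prometheus text exposition format, as returned by /metrics
SAMPLE = '''\
# HELP horizon_queue_depth Current number of jobs waiting in the BullMQ queue
# TYPE horizon_queue_depth gauge
horizon_queue_depth{queue="skill-execution"} 3
horizon_queue_depth{queue="webhooks"} 0
horizon_queue_depth{queue="scheduled-jobs"} 1
'''

LINE_RE = re.compile(r'^(\w+)(?:\{([^}]*)\})?\s+(\S+)$')

def parse_metrics(text: str) -> list[tuple[str, dict, float]]:
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and HELP/TYPE comment lines
        m = LINE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```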
  • horizon_http_requests_total (counter): Total HTTP requests, labeled by method, path, and status code.
  • horizon_http_request_duration_seconds (histogram): Request latency distribution with configurable bucket boundaries.
  • horizon_queue_depth (gauge): Number of jobs waiting in each BullMQ queue.
  • horizon_active_jobs (gauge): Number of jobs currently being processed per queue.
  • horizon_completed_jobs_total (counter): Total completed jobs per queue.
  • horizon_failed_jobs_total (counter): Total failed jobs per queue.
  • horizon_errors_total (counter): Total errors by error type.
  • horizon_job_duration_seconds (histogram): Job processing time distribution per queue.
  • horizon_supabase_pool_active (gauge): Number of active Supabase database connections.
  • horizon_supabase_pool_idle (gauge): Number of idle database connections in the pool.
  • horizon_redis_connected (gauge): Whether the Redis connection is active (1 = connected, 0 = disconnected).

Add the following scrape configuration to your prometheus.yml:

prometheus.yml
scrape_configs:
  - job_name: 'horizon-api'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['api.horizonplatform.ai:443']

For self-hosted deployments behind a VPN or private network, adjust the target accordingly:

prometheus.yml (self-hosted)
scrape_configs:
  - job_name: 'horizon-api'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ['horizon-api.internal:3000']

Horizon metrics work well with Grafana. Below are recommended panels for a production dashboard:

  • Request Rate: rate(horizon_http_requests_total[5m]) (time series)
  • P95 Latency: histogram_quantile(0.95, rate(horizon_http_request_duration_seconds_bucket[5m])) (time series)
  • Queue Depth: horizon_queue_depth (gauge or time series)
  • Active Jobs: horizon_active_jobs (stat)
  • Error Rate: rate(horizon_errors_total[5m]) (time series)
  • Job Duration P99: histogram_quantile(0.99, rate(horizon_job_duration_seconds_bucket[5m])) (time series)
  • DB Pool Usage: horizon_supabase_pool_active / (horizon_supabase_pool_active + horizon_supabase_pool_idle) (gauge)
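
To build intuition for the histogram_quantile() queries above, here is a stdlib-only Python sketch that linearly interpolates a quantile from cumulative (upper_bound, count) buckets, the way Prometheus does, using the bucket values from the sample /metrics output earlier in this page:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate quantile q from cumulative (upper_bound, count) buckets,
    using linear interpolation within the target bucket."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # fall back to the last finite bound
            # interpolate linearly between the bucket's lower and upper bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Bucket values taken from the sample /metrics output above
buckets = [(0.1, 450), (0.5, 780), (1.0, 870), (float('inf'), 892)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.874
```

With 892 total observations, the 95th-percentile rank (847.4) lands in the 0.5-1s bucket, so the estimate interpolates to roughly 0.87 seconds.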

Use these PromQL expressions as starting points for alerting rules:

alert-rules.yml
groups:
  - name: horizon
    rules:
      - alert: HighErrorRate
        expr: rate(horizon_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Horizon error rate is elevated"
      - alert: QueueBacklog
        expr: horizon_queue_depth{queue="skill-execution"} > 50
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Skill execution queue backlog exceeding 50 jobs"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(horizon_http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 API latency exceeding 2 seconds"
      - alert: RedisDisconnected
        expr: horizon_redis_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Horizon lost connection to Redis"