Health & Metrics
Horizon exposes two observability endpoints: a lightweight health check for uptime monitoring and a Prometheus-compatible metrics endpoint for detailed operational telemetry. Neither endpoint requires authentication, so external monitoring systems can poll them without managing credentials.
Health Check
GET /health

Returns the current health status of the Horizon API server. No authentication required.
The health endpoint is designed for load balancer probes, container orchestrators (e.g., Kubernetes liveness and readiness probes), and uptime monitoring services. It performs a minimal internal check and responds quickly.
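For Kubernetes, the probes mentioned above might look like the following sketch. The container port and timing values here are illustrative assumptions, not documented Horizon defaults; adjust them for your deployment:

```yaml
# Illustrative probe configuration; port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 3000
  periodSeconds: 10
  timeoutSeconds: 5
```

Because `unhealthy` returns a 503 (see the status table below), a plain HTTP probe distinguishes it from `ok` and `degraded` without inspecting the response body.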
Request
```bash
curl -X GET https://api.horizonplatform.ai/health
```

```javascript
const response = await fetch('https://api.horizonplatform.ai/health');
const health = await response.json();

console.log(health.status);  // "ok"
console.log(health.version); // "2.4.1"
console.log(health.uptime);  // 86400
```

```python
import requests

response = requests.get('https://api.horizonplatform.ai/health')
health = response.json()

print(health['status'])   # "ok"
print(health['version'])  # "2.4.1"
print(health['uptime'])   # 86400
```

Response
```
// 200 OK
{
  "status": "ok",
  "version": "2.4.1",
  "uptime": 86400
}
```

Response Fields
| Field | Type | Description |
|---|---|---|
| `status` (required) | string | The health status. Returns `'ok'` when the server is healthy, or `'degraded'` if non-critical subsystems are impaired. |
| `version` (required) | string | The current API server version (semver). |
| `uptime` (required) | number | Server uptime in seconds since the last process start. |
Health Status Values
| Status | HTTP Code | Meaning |
|---|---|---|
| `ok` | 200 | All systems are operational. |
| `degraded` | 200 | The API is functional but a non-critical subsystem (e.g., metrics collection) is impaired. |
| `unhealthy` | 503 | The server cannot serve requests, typically because the database or Redis connection has failed. |
Configuring Uptime Monitoring
Most uptime monitoring services can be pointed directly at the health endpoint. Here is a typical configuration:
- URL: https://api.horizonplatform.ai/health
- Method: GET
- Expected status: 200
- Interval: 30 seconds
- Timeout: 5 seconds
- Alert on: status code not 200, or response time exceeding 2 seconds
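If you roll your own monitor, the alert conditions in this checklist reduce to a few lines. A minimal sketch in Python, assuming you fetch `/health` yourself (e.g., `requests.get(url, timeout=5)`) and time the call; the `classify_health` helper is illustrative, not part of any Horizon SDK:

```python
def classify_health(status_code: int, body: dict, elapsed_seconds: float) -> str:
    """Map a /health response to a monitoring outcome, per the thresholds above."""
    if status_code != 200:
        return "alert"  # unhealthy (503) or any unexpected status
    if elapsed_seconds > 2.0:
        return "alert"  # healthy response, but slower than the 2-second budget
    if body.get("status") == "degraded":
        return "warn"   # 200 with an impaired non-critical subsystem
    return "ok"
```

Treating a timeout or connection error the same as a non-200 response keeps the logic aligned with the 5-second timeout above.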
Prometheus Metrics
GET /metrics

Returns Prometheus-compatible metrics in text exposition format. No authentication required.
The metrics endpoint provides detailed operational telemetry for the Horizon Express server and BullMQ job queues. Metrics are exposed in the standard Prometheus text format and can be scraped by any Prometheus-compatible collector.
Request
```bash
curl -X GET https://api.horizonplatform.ai/metrics
```

```javascript
const response = await fetch('https://api.horizonplatform.ai/metrics');
const metricsText = await response.text();

// Parse with a Prometheus client library if needed
console.log(metricsText);
```

```python
import requests

response = requests.get('https://api.horizonplatform.ai/metrics')
metrics_text = response.text

# Parse with prometheus_client or feed directly into a collector
print(metrics_text)
```

Response
The endpoint returns `text/plain; version=0.0.4` content. Here is an abbreviated example:
```
# HELP horizon_http_requests_total Total number of HTTP requests
# TYPE horizon_http_requests_total counter
horizon_http_requests_total{method="GET",path="/api/conversations",status="200"} 14523
horizon_http_requests_total{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",status="202"} 892
horizon_http_requests_total{method="GET",path="/health",status="200"} 98210

# HELP horizon_http_request_duration_seconds HTTP request latency in seconds
# TYPE horizon_http_request_duration_seconds histogram
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="0.1"} 450
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="0.5"} 780
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="1"} 870
horizon_http_request_duration_seconds_bucket{method="POST",path="/api/quickbooks/v1/profit-and-loss-report",le="+Inf"} 892

# HELP horizon_queue_depth Current number of jobs waiting in the BullMQ queue
# TYPE horizon_queue_depth gauge
horizon_queue_depth{queue="skill-execution"} 3
horizon_queue_depth{queue="webhooks"} 0
horizon_queue_depth{queue="scheduled-jobs"} 1

# HELP horizon_active_jobs Current number of jobs being processed
# TYPE horizon_active_jobs gauge
horizon_active_jobs{queue="skill-execution"} 5
horizon_active_jobs{queue="webhooks"} 2
horizon_active_jobs{queue="scheduled-jobs"} 0

# HELP horizon_errors_total Total number of errors by type
# TYPE horizon_errors_total counter
horizon_errors_total{type="validation_error"} 234
horizon_errors_total{type="authentication_required"} 89
horizon_errors_total{type="skill_execution_failed"} 12
```

Available Metrics
| Metric | Type | Description |
|---|---|---|
| `horizon_http_requests_total` | Counter | Total HTTP requests, labeled by method, path, and status code. |
| `horizon_http_request_duration_seconds` | Histogram | Request latency distribution with configurable bucket boundaries. |
| `horizon_queue_depth` | Gauge | Number of jobs waiting in each BullMQ queue. |
| `horizon_active_jobs` | Gauge | Number of jobs currently being processed per queue. |
| `horizon_completed_jobs_total` | Counter | Total completed jobs per queue. |
| `horizon_failed_jobs_total` | Counter | Total failed jobs per queue. |
| `horizon_errors_total` | Counter | Total errors by error type. |
| `horizon_job_duration_seconds` | Histogram | Job processing time distribution per queue. |
| `horizon_supabase_pool_active` | Gauge | Number of active Supabase database connections. |
| `horizon_supabase_pool_idle` | Gauge | Number of idle database connections in the pool. |
| `horizon_redis_connected` | Gauge | Whether the Redis connection is active (1 = connected, 0 = disconnected). |
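As a quick illustration of the exposition format these metrics use, here is a minimal stdlib-only parser for flat samples like those in the example response. It is a sketch: it skips `# HELP`/`# TYPE` metadata and does not handle label-value escaping or spaces inside label values, so prefer the `prometheus_client` parser in real tooling:

```python
def parse_metric_samples(text: str) -> dict:
    """Parse Prometheus text-exposition samples into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        metric, value = line.rsplit(" ", 1)
        if "{" in metric:
            name, labels = metric.split("{", 1)
            labels = labels.rstrip("}")
        else:
            name, labels = metric, ""
        samples[(name, labels)] = float(value)
    return samples

sample = '''# HELP horizon_queue_depth Current number of jobs waiting
# TYPE horizon_queue_depth gauge
horizon_queue_depth{queue="skill-execution"} 3
horizon_redis_connected 1'''
metrics = parse_metric_samples(sample)
```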
Prometheus Configuration
Add the following scrape configuration to your `prometheus.yml`:
```yaml
scrape_configs:
  - job_name: 'horizon-api'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets: ['api.horizonplatform.ai:443']
```

For self-hosted deployments behind a VPN or private network, adjust the target accordingly:
```yaml
scrape_configs:
  - job_name: 'horizon-api'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets: ['horizon-api.internal:3000']
```

Grafana Dashboard
Horizon metrics work well with Grafana. Below are recommended panels for a production dashboard:
| Panel | Query | Visualization |
|---|---|---|
| Request Rate | `rate(horizon_http_requests_total[5m])` | Time series |
| P95 Latency | `histogram_quantile(0.95, rate(horizon_http_request_duration_seconds_bucket[5m]))` | Time series |
| Queue Depth | `horizon_queue_depth` | Gauge / Time series |
| Active Jobs | `horizon_active_jobs` | Stat |
| Error Rate | `rate(horizon_errors_total[5m])` | Time series |
| Job Duration P99 | `histogram_quantile(0.99, rate(horizon_job_duration_seconds_bucket[5m]))` | Time series |
| DB Pool Usage | `horizon_supabase_pool_active / (horizon_supabase_pool_active + horizon_supabase_pool_idle)` | Gauge |
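To build intuition for the `histogram_quantile` panels, here is a simplified Python version of the calculation: it linearly interpolates within the target bucket of a cumulative `(upper_bound, count)` list, as Prometheus does. It is a sketch only; real Prometheus also applies `rate()`, aggregates across series, and handles edge cases around the lowest bucket differently:

```python
def histogram_quantile(q: float, buckets: list) -> float:
    """Approximate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by bound and end with float('inf') (+Inf).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the open +Inf bucket
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Bucket counts from the /metrics example above:
buckets = [(0.1, 450), (0.5, 780), (1.0, 870), (float("inf"), 892)]
p95 = histogram_quantile(0.95, buckets)  # ~0.874 seconds
```

With those counts, the 95th-percentile rank (0.95 × 892 ≈ 847) lands in the 0.5–1.0 s bucket, so the estimate interpolates to roughly 0.87 s.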
Alerting Recommendations
Use these PromQL expressions as starting points for alerting rules:
```yaml
groups:
  - name: horizon
    rules:
      - alert: HighErrorRate
        expr: rate(horizon_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Horizon error rate is elevated"

      - alert: QueueBacklog
        expr: horizon_queue_depth{queue="skill-execution"} > 50
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Skill execution queue backlog exceeding 50 jobs"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(horizon_http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 API latency exceeding 2 seconds"

      - alert: RedisDisconnected
        expr: horizon_redis_connected == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Horizon lost connection to Redis"
```