This repository now includes a baseline observability stack for the production runtime that exists in-repo: the NestJS indexer service and its PostgreSQL dependency.
In-scope services:
- indexer-prod: the long-running TeachLink indexer API and event processor
- postgres: the indexer datastore
- prometheus: metrics collection and rule evaluation
- alertmanager: alert routing and alert state inspection
- grafana: operational dashboards
- postgres-exporter: PostgreSQL metrics exporter
- blackbox-exporter: HTTP health probing for the indexer
- alert-webhook: local webhook sink used to verify end-to-end alert delivery
Out of scope:
- Soroban contracts themselves are not long-running processes in this repository and therefore are not scraped directly.
- Managed Stellar infrastructure outside this repository is only checked indirectly through the indexer's Horizon dependency health check.
Architecture:
- indexer-prod exposes GET /health, GET /metrics, and GET /metrics/json.
- Prometheus scrapes the indexer metrics endpoint, PostgreSQL exporter, and the monitoring stack itself.
- Blackbox Exporter probes http://indexer-prod:3000/health for user-facing availability.
- Alertmanager receives Prometheus alerts and forwards them to the local webhook sink for verifiable testing.
- Grafana provisions dashboards from JSON files under observability/grafana/dashboards/.
The indexer now exports Prometheus metrics from /metrics, including:
- HTTP request volume by route, method, and status code
- HTTP request latency
- Dashboard cache hits and misses
- Dashboard generation latency
- Persisted indexer totals for processed events and errors
- Last processed ledger
- Last processed event timestamp
- Event-processing lag in seconds
- Dependency health for PostgreSQL and Horizon
- Default process and Node.js runtime metrics from prom-client
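For orientation, the request-volume series in the list above could look roughly like the output of the sketch below, which renders a labeled counter in the Prometheus text exposition format. It is a hand-rolled illustration and does not use prom-client; the metric name http_requests_total and the label names route, method, and status are assumptions, not the names the indexer actually exports.

```typescript
// Hypothetical label set; the real indexer may use different label names.
type HttpLabels = { route: string; method: string; status: string };
type Series = { labels: HttpLabels; value: number };

// Escape backslash, double quote, and newline as the text format requires.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function createCounter(name: string, help: string) {
  const series = new Map<string, Series>();
  return {
    // Increment the series identified by this label combination.
    inc(labels: HttpLabels, by: number = 1): void {
      const key = [labels.route, labels.method, labels.status].join("|");
      const entry: Series = series.get(key) ?? { labels, value: 0 };
      entry.value += by;
      series.set(key, entry);
    },
    // Render HELP and TYPE headers followed by one sample line per series.
    render(): string {
      const lines: string[] = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
      series.forEach((entry) => {
        const l = entry.labels;
        lines.push(
          `${name}{route="${escapeLabelValue(l.route)}",` +
            `method="${escapeLabelValue(l.method)}",` +
            `status="${escapeLabelValue(l.status)}"} ${entry.value}`,
        );
      });
      return lines.join("\n") + "\n";
    },
  };
}

const requests = createCounter("http_requests_total", "Total HTTP requests.");
requests.inc({ route: "/health", method: "GET", status: "200" });
console.log(requests.render());
```

Grouping by route, method, and status code is what lets a 5xx ratio be derived later from the same counter.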
The /health endpoint now returns a structured readiness payload with:
- overall service status
- PostgreSQL connectivity status
- Horizon reachability status
- indexer state freshness based on INDEXER_STALE_AFTER_SECONDS
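As a rough illustration of how such a readiness payload might be assembled, the sketch below combines the two dependency checks with a freshness test. The field names (status, postgres, horizon, indexer.stale, indexer.lagSeconds) and the "ok"/"degraded" values are hypothetical; the actual payload shape is defined by the indexer code and is not reproduced here.

```typescript
type DependencyStatus = "up" | "down";

interface HealthInputs {
  postgresUp: boolean;
  horizonUp: boolean;
  lastProcessedAt: Date | null; // timestamp of the last processed event, if any
  now: Date;
  staleAfterSeconds: number; // value of INDEXER_STALE_AFTER_SECONDS
}

// Hypothetical payload shape, for illustration only.
interface HealthPayload {
  status: "ok" | "degraded";
  postgres: DependencyStatus;
  horizon: DependencyStatus;
  indexer: { stale: boolean; lagSeconds: number | null };
}

function buildHealthPayload(input: HealthInputs): HealthPayload {
  const lagSeconds =
    input.lastProcessedAt === null
      ? null
      : Math.max(0, (input.now.getTime() - input.lastProcessedAt.getTime()) / 1000);

  // Assumption: an indexer that has never processed an event counts as stale.
  const stale = lagSeconds === null || lagSeconds > input.staleAfterSeconds;
  const healthy = input.postgresUp && input.horizonUp && !stale;

  return {
    status: healthy ? "ok" : "degraded",
    postgres: input.postgresUp ? "up" : "down",
    horizon: input.horizonUp ? "up" : "down",
    indexer: { stale, lagSeconds },
  };
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the freshness logic easy to test.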
postgres-exporter supplies database availability and core PostgreSQL metrics such as pg_up.
blackbox-exporter records probe_success against the indexer health endpoint, which catches failures that a plain scrape may miss.
Alert rules are defined in observability/prometheus/alerts.yml.
Critical alerts:
- TeachLinkIndexerDown: Prometheus cannot scrape the indexer metrics endpoint
- TeachLinkIndexerHealthcheckFailed: the indexer health endpoint fails blackbox probing
- TeachLinkIndexerDatabaseUnavailable: PostgreSQL is unreachable from the app or exporter
- TeachLinkIndexerHorizonUnavailable: the indexer cannot reach Stellar Horizon
- TeachLinkIndexerStaleProcessing: no event has been processed within the configured freshness window
Warning alerts:
- TeachLinkIndexerHighHttpErrorRate: 5xx rate above 5% over 10 minutes
- TeachLinkIndexerHighLatency: average HTTP latency above 1 second over 10 minutes
- TeachLinkGrafanaDown: Grafana scrape target unavailable
- TeachLinkAlertmanagerDown: Alertmanager scrape target unavailable
Thresholds are intentionally conservative to keep noise low while still surfacing real failures.
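To make the two HTTP warning thresholds concrete, the sketch below mirrors their arithmetic over a single 10-minute window. The real rules are Prometheus expressions in observability/prometheus/alerts.yml and are not reproduced here; the WindowSample shape and function name are hypothetical.

```typescript
// Aggregated HTTP activity over one evaluation window (hypothetical shape).
interface WindowSample {
  totalRequests: number;
  serverErrors: number; // responses with a 5xx status code
  latencySumSeconds: number; // sum of request durations in the window
}

const ERROR_RATE_THRESHOLD = 0.05; // 5% of requests
const LATENCY_THRESHOLD_SECONDS = 1; // average latency of 1 second

function evaluateHttpWarnings(sample: WindowSample): {
  highErrorRate: boolean;
  highLatency: boolean;
} {
  // With no traffic there is nothing to divide by, so neither warning applies.
  if (sample.totalRequests === 0) {
    return { highErrorRate: false, highLatency: false };
  }
  const errorRate = sample.serverErrors / sample.totalRequests;
  const averageLatency = sample.latencySumSeconds / sample.totalRequests;
  return {
    highErrorRate: errorRate > ERROR_RATE_THRESHOLD,
    highLatency: averageLatency > LATENCY_THRESHOLD_SECONDS,
  };
}
```

For example, 11 errors out of 200 requests is a 5.5% error rate and would cross the threshold, while exactly 10 out of 200 (5%) would not, because the comparison is strictly "above".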
Provisioned dashboards:
- TeachLink Service Overview
- TeachLink Platform Dependencies
They show:
- service uptime and probe health
- indexer ledger lag and processing totals
- throughput, latency, and active alerts
- database reachability and dependency health
- runtime memory and scrape health
From indexer/:
cp .env.example .env
docker compose --profile production --profile observability up -d
Endpoints:
- Prometheus: http://localhost:${PROMETHEUS_PORT:-9090}
- Alertmanager: http://localhost:${ALERTMANAGER_PORT:-9093}
- Grafana: http://localhost:${GRAFANA_PORT:-3001}
- Indexer health: http://localhost:3000/health
- Indexer metrics: http://localhost:3000/metrics
- Alert webhook sink: http://localhost:5001/alerts
Grafana credentials come from GRAFANA_ADMIN_USER and GRAFANA_ADMIN_PASSWORD. Set them in .env rather than editing compose files.
Run:
./observability/test-alerting.sh
What it does:
- Starts the production and observability profiles.
- Stops indexer-prod.
- Polls Alertmanager until TeachLinkIndexerDown is firing.
- Fetches the webhook payload from the local receiver.
- Starts indexer-prod again before exiting.
Expected result:
- the script prints that TeachLinkIndexerDown is firing
- http://localhost:9093 shows the alert as active
- http://localhost:5001/alerts shows the delivered payload
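The polling step can be pictured with the sketch below. It is not the contents of test-alerting.sh, whose implementation is not shown here; it assumes the alerts come from Alertmanager's v2 API (GET /api/v2/alerts), where each alert carries a labels map and a status.state field, and the function names are hypothetical.

```typescript
// Subset of the fields returned per alert by Alertmanager's /api/v2/alerts.
interface AlertmanagerAlert {
  labels: Record<string, string>;
  status: { state: string }; // "active", "suppressed", or "unprocessed"
}

// True when an alert with the given name is currently active.
function isAlertActive(alerts: AlertmanagerAlert[], alertName: string): boolean {
  return alerts.some(
    (alert) => alert.labels["alertname"] === alertName && alert.status.state === "active",
  );
}

// Poll until the alert is active or the attempts run out.
// fetchAlerts is injected so the loop can be exercised without a live stack.
async function waitForAlert(
  fetchAlerts: () => Promise<AlertmanagerAlert[]>,
  alertName: string,
  attempts: number,
  delayMs: number,
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    if (isAlertActive(await fetchAlerts(), alertName)) {
      return true;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

In a local run, fetchAlerts would typically wrap an HTTP GET against http://localhost:9093/api/v2/alerts and parse the JSON body.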
- Start the stack with the production and observability profiles.
- Open Grafana and confirm both dashboards are loaded.
- Open Prometheus and verify targets for teachlink-indexer, postgres-exporter, grafana, alertmanager, and teachlink-indexer-health are UP.
- Stop indexer-prod with docker compose stop indexer-prod.
- Wait at least 2 minutes for the TeachLinkIndexerDown and TeachLinkIndexerHealthcheckFailed alerts to fire.
- Confirm Alertmanager lists the alerts and the webhook sink received a payload.
- Start the service again with docker compose start indexer-prod and verify the alerts resolve.
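The target check in the steps above can also be done against the JSON that Prometheus returns from GET /api/v1/targets rather than by eye. The sketch below is one way to do that, assuming the job names listed in the steps match the job label on each target; the function name is hypothetical.

```typescript
// Subset of the Prometheus /api/v1/targets response used here.
interface ActiveTarget {
  labels: Record<string, string>;
  health: string; // "up", "down", or "unknown"
}
interface TargetsResponse {
  status: string;
  data: { activeTargets: ActiveTarget[] };
}

// Job names taken from the verification steps above.
const EXPECTED_JOBS = [
  "teachlink-indexer",
  "postgres-exporter",
  "grafana",
  "alertmanager",
  "teachlink-indexer-health",
];

// Return the expected jobs that are missing or have any target not "up".
function jobsNotUp(response: TargetsResponse, expectedJobs: string[]): string[] {
  return expectedJobs.filter((job) => {
    const targets = response.data.activeTargets.filter((t) => t.labels["job"] === job);
    return targets.length === 0 || targets.some((t) => t.health !== "up");
  });
}
```

An empty result means every expected job has at least one target and all of them are healthy; after stopping indexer-prod, the indexer jobs would be expected to appear in the returned list.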
Recommended checks:
docker compose config
docker run --rm -v "$PWD/observability/prometheus:/workspace" prom/prometheus:v2.54.1 promtool check config /workspace/prometheus.yml
docker run --rm -v "$PWD/observability/prometheus:/workspace" prom/prometheus:v2.54.1 promtool check rules /workspace/alerts.yml
npm install
npm run build
- No alerting secrets or Grafana credentials are hardcoded in the configs.
- Metrics remain on the private compose network unless you intentionally publish the relevant ports.
- The local alert-webhook service is for validation and can be replaced with an external notification integration in real deployments.
- Monitoring does not log wallet secrets or contract private material.
- The repository does not manage host-level infrastructure, so host CPU or disk alerts are not included.
- Horizon is checked as an external dependency via HTTP reachability only.
- The local webhook sink is meant for baseline verification; production teams should swap it for Slack, PagerDuty, Opsgenie, or email routing appropriate to their environment.