TeachLink Indexer Monitoring

This repository includes a baseline observability stack for the production runtime it contains: the NestJS indexer service and its PostgreSQL dependency.

Scope and architecture

In-scope services:

indexer-prod: the long-running TeachLink indexer API and event processor
postgres: the indexer datastore
prometheus: metrics collection and rule evaluation
alertmanager: alert routing and alert state inspection
grafana: operational dashboards
postgres-exporter: PostgreSQL metrics exporter
blackbox-exporter: HTTP health probing for the indexer
alert-webhook: local webhook sink used to verify end-to-end alert delivery

Out of scope:

Soroban contracts themselves are not long-running processes in this repository and therefore are not scraped directly.
Managed Stellar infrastructure outside this repository is only checked indirectly through the indexer's Horizon dependency health check.

Architecture:

indexer-prod exposes GET /health, GET /metrics, and GET /metrics/json.
Prometheus scrapes the indexer metrics endpoint, the PostgreSQL exporter, and the monitoring stack's own components (Grafana and Alertmanager).
Blackbox Exporter probes http://indexer-prod:3000/health for user-facing availability.
Alertmanager receives Prometheus alerts and forwards them to the local webhook sink for verifiable testing.
Grafana provisions dashboards from JSON files under observability/grafana/dashboards/.

Metrics sources

Indexer metrics

The indexer now exports Prometheus metrics from /metrics, including:

HTTP request volume by route, method, and status code
HTTP request latency
Dashboard cache hits and misses
Dashboard generation latency
Persisted indexer totals for processed events and errors
Last processed ledger
Last processed event timestamp
Event-processing lag in seconds
Dependency health for PostgreSQL and Horizon
Default process and Node.js runtime metrics from prom-client
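
To make the shape of this output concrete, the sketch below renders a gauge in the Prometheus text exposition format and derives an event-processing lag from a last-event timestamp. It is illustrative only: the indexer itself relies on prom-client, and the function names and the metric name used in the usage note are hypothetical, not the names the indexer actually exports.

```typescript
// Hypothetical helpers for illustration; the real indexer uses prom-client.

// Escape a label value as required by the Prometheus text exposition format.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one gauge sample with its HELP and TYPE header lines.
function renderGauge(
  name: string,
  help: string,
  value: number,
  labels: Record<string, string> = {},
): string {
  const labelText = Object.keys(labels)
    .sort()
    .map((key) => `${key}="${escapeLabelValue(labels[key] ?? "")}"`)
    .join(",");
  const series = labelText ? `${name}{${labelText}}` : name;
  return [`# HELP ${name} ${help}`, `# TYPE ${name} gauge`, `${series} ${value}`].join("\n") + "\n";
}

// Lag is the wall-clock time since the last processed event, never negative.
function processingLagSeconds(lastEventAtMs: number, nowMs: number): number {
  return Math.max(0, (nowMs - lastEventAtMs) / 1000);
}
```

For example, `renderGauge("teachlink_demo_lag_seconds", "Demo lag", processingLagSeconds(lastEventAtMs, Date.now()))` would produce three lines that Prometheus can parse, where `teachlink_demo_lag_seconds` is a made-up name.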

The /health endpoint now returns a structured readiness payload with:

overall service status
PostgreSQL connectivity status
Horizon reachability status
indexer state freshness based on INDEXER_STALE_AFTER_SECONDS
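
A minimal sketch of how such a readiness payload could be assembled is shown below. The field names, the "ok"/"degraded" values, and the choice to treat an indexer that has never processed an event as stale are all assumptions for illustration; only the four components and INDEXER_STALE_AFTER_SECONDS come from this document.

```typescript
// Hypothetical payload shape; not the indexer's actual response schema.
type DependencyStatus = "up" | "down";

interface HealthPayload {
  status: "ok" | "degraded";
  postgres: DependencyStatus;
  horizon: DependencyStatus;
  indexer: { stale: boolean; secondsSinceLastEvent: number | null };
}

interface HealthInput {
  postgresUp: boolean;
  horizonUp: boolean;
  lastEventAtMs: number | null; // epoch millis of the last processed event, null if none yet
  nowMs: number;
  staleAfterSeconds: number; // value of INDEXER_STALE_AFTER_SECONDS
}

function buildHealthPayload(input: HealthInput): HealthPayload {
  const secondsSinceLastEvent =
    input.lastEventAtMs === null
      ? null
      : Math.max(0, Math.floor((input.nowMs - input.lastEventAtMs) / 1000));
  // Assumption: an indexer with no processed events counts as stale.
  const stale =
    secondsSinceLastEvent === null || secondsSinceLastEvent > input.staleAfterSeconds;
  const healthy = input.postgresUp && input.horizonUp && !stale;
  return {
    status: healthy ? "ok" : "degraded",
    postgres: input.postgresUp ? "up" : "down",
    horizon: input.horizonUp ? "up" : "down",
    indexer: { stale, secondsSinceLastEvent },
  };
}
```

Keeping the freshness check in the same payload as the dependency checks means a single probe of /health can distinguish "process is up" from "process is up and making progress".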

PostgreSQL metrics

postgres-exporter supplies database availability and core PostgreSQL metrics such as pg_up.

Probe-based metrics

blackbox-exporter records probe_success against the indexer health endpoint, which catches failures that a metrics scrape alone may miss, such as the process still serving /metrics while the health check fails.

Alert rules

Alert rules are defined in observability/prometheus/alerts.yml.

Critical alerts:

TeachLinkIndexerDown: Prometheus cannot scrape the indexer metrics endpoint
TeachLinkIndexerHealthcheckFailed: the indexer health endpoint fails blackbox probing
TeachLinkIndexerDatabaseUnavailable: PostgreSQL is unreachable from the app or exporter
TeachLinkIndexerHorizonUnavailable: the indexer cannot reach Stellar Horizon
TeachLinkIndexerStaleProcessing: no event has been processed within the configured freshness window

Warning alerts:

TeachLinkIndexerHighHttpErrorRate: 5xx rate above 5% over 10 minutes
TeachLinkIndexerHighLatency: average HTTP latency above 1 second over 10 minutes
TeachLinkGrafanaDown: Grafana scrape target unavailable
TeachLinkAlertmanagerDown: Alertmanager scrape target unavailable

Thresholds are intentionally conservative to keep noise low while still surfacing real failures.
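
The arithmetic behind the two HTTP warning thresholds can be sketched as follows. This is not how the rules are written (they are PromQL expressions in alerts.yml); it only illustrates the comparison, assuming the counters have already been reduced to totals over a 10-minute window. The `HttpWindow` type and function names are hypothetical.

```typescript
// Hypothetical summary of HTTP traffic observed over one 10-minute window.
interface HttpWindow {
  totalRequests: number;
  serverErrors: number; // responses with a 5xx status code
  latencySumSeconds: number; // sum of all request durations
}

const ERROR_RATIO_THRESHOLD = 0.05; // 5% of requests
const AVG_LATENCY_THRESHOLD_SECONDS = 1;

function errorRatio(w: HttpWindow): number {
  return w.totalRequests === 0 ? 0 : w.serverErrors / w.totalRequests;
}

function averageLatencySeconds(w: HttpWindow): number {
  return w.totalRequests === 0 ? 0 : w.latencySumSeconds / w.totalRequests;
}

// Returns the names of the warning alerts whose condition holds for this window.
function httpWarnings(w: HttpWindow): string[] {
  const firing: string[] = [];
  if (errorRatio(w) > ERROR_RATIO_THRESHOLD) {
    firing.push("TeachLinkIndexerHighHttpErrorRate");
  }
  if (averageLatencySeconds(w) > AVG_LATENCY_THRESHOLD_SECONDS) {
    firing.push("TeachLinkIndexerHighLatency");
  }
  return firing;
}
```

Both comparisons are strict, so a window sitting exactly at 5% errors or exactly 1 second average latency does not trigger a warning in this sketch; an idle window with no requests triggers nothing.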

Dashboards

Provisioned dashboards:

TeachLink Service Overview
TeachLink Platform Dependencies

They show:

service uptime and probe health
indexer ledger lag and processing totals
throughput, latency, and active alerts
database reachability and dependency health
runtime memory and scrape health

Running the stack

From indexer/:

cp .env.example .env
docker compose --profile production --profile observability up -d

Endpoints:

Prometheus: http://localhost:${PROMETHEUS_PORT:-9090}
Alertmanager: http://localhost:${ALERTMANAGER_PORT:-9093}
Grafana: http://localhost:${GRAFANA_PORT:-3001}
Indexer health: http://localhost:3000/health
Indexer metrics: http://localhost:3000/metrics
Alert webhook sink: http://localhost:5001/alerts

Grafana credentials come from GRAFANA_ADMIN_USER and GRAFANA_ADMIN_PASSWORD. Set them in .env rather than editing compose files.

Alert test procedure

Automated test

Run:

./observability/test-alerting.sh

What it does:

Starts the production and observability profiles.
Stops indexer-prod.
Polls Alertmanager until TeachLinkIndexerDown is firing.
Fetches the webhook payload from the local receiver.
Starts indexer-prod again before exiting.
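
The polling step can be sketched as below. This is not the contents of test-alerting.sh; it is a minimal illustration that assumes the alert list comes from Alertmanager's v2 API (GET /api/v2/alerts), where each alert carries a `labels` map and a `status.state` field. The fetch and sleep functions are injected so the sketch has no runtime dependencies, and all function names are hypothetical.

```typescript
// Subset of the fields returned by Alertmanager's GET /api/v2/alerts.
interface AlertmanagerAlert {
  labels: Record<string, string>;
  status: { state: "active" | "suppressed" | "unprocessed" };
}

// True when an alert with the given name is present and in the active state.
function isAlertActive(alerts: AlertmanagerAlert[], alertName: string): boolean {
  return alerts.some(
    (alert) => alert.labels["alertname"] === alertName && alert.status.state === "active",
  );
}

// Poll until the alert is active or the attempts run out.
// fetchAlerts and sleep are supplied by the caller (hypothetical wiring).
async function waitForAlert(
  fetchAlerts: () => Promise<AlertmanagerAlert[]>,
  sleep: (ms: number) => Promise<void>,
  alertName: string,
  attempts: number,
  intervalMs: number,
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    if (isAlertActive(await fetchAlerts(), alertName)) {
      return true;
    }
    if (attempt < attempts - 1) {
      await sleep(intervalMs);
    }
  }
  return false;
}
```

A caller would typically pass a `fetchAlerts` that requests the Alertmanager alerts endpoint and parses the JSON body, and a `sleep` built on a timer. Because scrape and evaluation intervals add delay, the total polling budget should comfortably exceed the time the rule needs before it fires.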

Expected result:

the script prints that TeachLinkIndexerDown is firing
Alertmanager at http://localhost:9093 (or the port set in ALERTMANAGER_PORT) shows the alert as active
the webhook sink at http://localhost:5001/alerts shows the delivered payload

Manual verification

Start the stack with the production and observability profiles.
Open Grafana and confirm both dashboards are loaded.
Open Prometheus and verify targets for teachlink-indexer, postgres-exporter, grafana, alertmanager, and teachlink-indexer-health are UP.
Stop indexer-prod with docker compose stop indexer-prod.
Wait at least 2 minutes for the TeachLinkIndexerDown and TeachLinkIndexerHealthcheckFailed alerts to fire.
Confirm Alertmanager lists the alerts and the webhook sink received a payload.
Start the service again with docker compose start indexer-prod and verify the alerts resolve.

Validation

Recommended checks:

docker compose config
docker run --rm --entrypoint promtool -v "$PWD/observability/prometheus:/workspace" prom/prometheus:v2.54.1 check config /workspace/prometheus.yml
docker run --rm --entrypoint promtool -v "$PWD/observability/prometheus:/workspace" prom/prometheus:v2.54.1 check rules /workspace/alerts.yml
npm install
npm run build

Security notes

No alerting secrets or Grafana credentials are hardcoded in the configs.
Metrics remain on the private compose network unless you intentionally publish the relevant ports.
The local alert-webhook service is for validation and can be replaced with an external notification integration in real deployments.
Monitoring does not log wallet secrets or contract private material.

Limitations

The repository does not manage host-level infrastructure, so host CPU or disk alerts are not included.
Horizon is checked as an external dependency via HTTP reachability only.
The local webhook sink is meant for baseline verification; production teams should swap it for Slack, PagerDuty, Opsgenie, or email routing appropriate to their environment.
