
feat: Observability and monitoring for Anticapture#1683

Open
PedroBinotto wants to merge 45 commits into dev from
feat/otel-instrumentation

Conversation

@PedroBinotto
Collaborator

@PedroBinotto PedroBinotto commented Mar 3, 2026

Observability Infrastructure (packages/observability, infra/monitoring/)

A new shared @anticapture/observability package centralizes all telemetry
setup, used by every service.

Telemetry data flow

Services (API, Gateway, Indexer)
  │
  ├── METRICS (pull)
  │   └── GET /metrics (Prometheus text format)
  │           │
  │           └── Prometheus scrapes every 15s
  │                   │
  │                   └── TSDB storage (:9090)
  │                           │
  │                           ├── Grafana dashboards (:3000)
  │                           └── alerts.yml → Alertmanager (:9093) → Slack
  │
  └── TRACES (push)
      └── POST :4318/v1/traces (OTLP HTTP)
              │
              └── OTel Collector
                      │
                      └── Tempo (gRPC :4317)
                              │
                              └── Grafana (Tempo datasource)
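
On the pull side, each service is scraped by a job along these lines (a sketch; the job name matches the `up{job="anticapture-api"}` example later in this description, and the target address and port are assumptions for local development):

```yaml
scrape_configs:
  - job_name: "anticapture-api"
    metrics_path: "/metrics"          # Prometheus text format, exposed by the service
    scrape_interval: 15s              # matches the 15s cadence in the diagram
    static_configs:
      - targets: ["host.docker.internal:42069"]  # assumed local target address
```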

Currently configured Prometheus alert rules

Alert          Condition                   Severity
ServiceDown    up == 0 for 1 min           critical
HighCPUUsage   CPU > 80% for 5 min         warning
HighLatency    p99 latency > 2s for 5 min  warning
HighErrorRate  5xx rate > 1% for 2 min     critical

Currently configured Grafana dashboards

Unified Anticapture dashboard

14 panels, 15s refresh:

  • Process CPU usage % (by job + state);
  • Process memory usage (by type);
  • System CPU usage %;
  • System memory usage;
  • Network I/O (by direction);
  • Network errors & dropped packets;
  • Request rate — HTTP req/s by route + method;
  • Error rate — 5xx errors/s (red-highlighted);
  • Latency — P99/P95/P50 via histogram_quantile();
  • Latency by route — per-route P99 breakdown;
  • Status code breakdown — pie chart of last 5m;
  • Trace search — link panel to Tempo explorer;
  • RPC Request counter;
  • RPC Request rate (time series);

Overview Dashboard (overview.json)

Focused on database metrics, with panels backed by Prometheus queries. Refresh is set to auto.

Host Metrics Dashboard (host-metrics.json)

6 panels, 15s refresh:

  • Process CPU usage % (by job + state);
  • Process memory usage (by type);
  • System CPU usage %;
  • System memory usage;
  • Network I/O (by direction);
  • Network errors & dropped packets;

Tracing Dashboard (tracing.json)

6 panels, 15s refresh:

  • Request rate — HTTP req/s by route + method;
  • Error rate — 5xx errors/s (red-highlighted);
  • Latency — P99/P95/P50 via histogram_quantile();
  • Latency by route — per-route P99 breakdown;
  • Status code breakdown — pie chart of last 5m;
  • Trace search — link panel to Tempo explorer;

Adding new rules and dashboards

Alert rules live in two files that must be kept in sync:

  • prometheus/alerts.yml — used locally via compose.yaml
  • railway/alerts.yml — deployed to Railway

Adding an alert

Add a new entry under the rules list in both files:

- alert: MyAlert
  expr: <promql expression>
  for: <duration>          # how long condition must hold before firing
  labels:
    severity: warning      # warning | critical
  annotations:
    summary: "Short description"
    description: "Detail with {{ $labels.job }} and {{ $value }}"

Severity convention:

  • critical — requires immediate action (service down, high error rate)
  • warning — degraded but not broken (high latency, elevated CPU)
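
As a concrete instance, the HighErrorRate rule from the table above could be expressed like this (a sketch; the expression follows the 5xx-ratio pattern used elsewhere in this setup, and only the 1% / 2m threshold comes from the table):

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[2m])) by (job)
      /
    sum(rate(http_server_request_duration_seconds_count[2m])) by (job) > 0.01
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High 5xx error rate"
    description: "5xx ratio is {{ $value }} on {{ $labels.job }}"
```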
Useful PromQL patterns
# Request rate per job
sum(rate(http_server_request_duration_seconds_count[1m])) by (job)

# p99 latency
histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le))

# 5xx error rate
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[2m]))
/
sum(rate(http_server_request_duration_seconds_count[2m]))

# Service up/down
up{job="anticapture-api"}
Reloading rules without restart
curl -X POST http://localhost:9090/-/reload
Testing an expression

Open http://localhost:9090/graph, paste the expr, and check the result before committing.


Grafana dashboards

Dashboard JSON files live in grafana/dashboards/. Grafana auto-provisions them on startup.

File               Contents
overview.json      Database metrics overview
host-metrics.json  CPU, memory, network per service
tracing.json       HTTP request rate, latency, error rate, trace search
Editing a dashboard
  1. Open Grafana at http://localhost:3000 and navigate to the dashboard.
  2. Make your changes in the UI.
  3. Click Share → Export → Save to file.
  4. Replace the corresponding JSON file in grafana/dashboards/.
  5. Commit both the JSON file and any related alert changes together.
Adding a new dashboard
  1. Build the dashboard in Grafana UI.
  2. Export it: Share → Export → toggle "Export for sharing externally" on → Save to file.
  3. Save the file to grafana/dashboards/<name>.json.
  4. Restart Grafana so provisioning picks up the new file:
docker compose --profile monitoring restart grafana

All metrics are labeled with http_request_method, http_route, and
http_response_status_code. Trace IDs are injected into HTTP responses via the
X-Trace-Id header, linking requests to their Tempo traces.

Deployment: All monitoring components have Railway-ready Dockerfiles and .railway.toml configs under infra/monitoring/railway/. A local compose.yaml is also provided for development. API services previously built with Railpack must be configured to use the provided Dockerfiles.

@vercel

vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project                Deployment  Updated (UTC)
anticapture            Ready       Mar 12, 2026 6:02pm
anticapture-storybook  Skipped     Mar 12, 2026 6:02pm

@PedroBinotto
Collaborator Author

@claude

This comment was marked as resolved.

@vercel vercel bot temporarily deployed to Preview – anticapture-storybook March 3, 2026 21:23 Inactive
@PedroBinotto
Collaborator Author

@claude

This comment was marked as resolved.

@PedroBinotto
Collaborator Author

@claude

@railway-app railway-app bot temporarily deployed to anticapture-infra / dev March 9, 2026 20:18 Inactive
@PedroBinotto
Collaborator Author

@claude

This comment was marked as resolved.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4aed2e954d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +71 to +72
process.once("SIGTERM", shutdown);
process.once("SIGINT", shutdown);


P1 Badge Preserve default SIGTERM exit after telemetry shutdown

Registering SIGTERM/SIGINT handlers here overrides Node’s default signal-exit behavior, but the handler only shuts down telemetry providers and never closes the HTTP server or exits the process; in services that keep sockets open, a normal SIGTERM (for deploy/stop) will no longer terminate promptly and will require a forced kill. This affects every app that calls createObservabilityProvider.
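
One way to restore prompt termination, sketched under the assumption that createObservabilityProvider could accept the caller's cleanup callbacks (the makeShutdownHandler name and its parameters are hypothetical, injected here so the behavior is testable):

```typescript
// Shut down telemetry first, then restore default signal semantics:
// close the caller's server and exit, even if the flush fails.
type ShutdownDeps = {
  flushTelemetry: () => Promise<void>;   // e.g. meterProvider/tracerProvider shutdown
  closeServer?: () => Promise<void>;     // caller-supplied HTTP server close
  exit?: (code: number) => void;         // injectable for tests; defaults to process.exit
};

export function makeShutdownHandler(deps: ShutdownDeps): () => Promise<void> {
  const exit = deps.exit ?? ((code: number) => process.exit(code));
  return async () => {
    try {
      await deps.flushTelemetry();
    } catch (err) {
      console.error("telemetry shutdown failed", err);
    } finally {
      // Runs even when the telemetry flush rejects, so a plain SIGTERM
      // still terminates the process promptly.
      await deps.closeServer?.().catch(() => {});
      exit(0);
    }
  };
}

// Registration stays as in the diff:
//   process.once("SIGTERM", makeShutdownHandler(deps));
//   process.once("SIGINT", makeShutdownHandler(deps));
```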



const server = createServer((req, res) => {
const server = createServer(async (req, res) => {
if (!validateAuthToken(req, res)) return;


P1 Badge Exempt metrics route from gateway token validation

The gateway now enforces validateAuthToken before routing /metrics, so any deployment with BLOCKFUL_API_TOKEN set will return 401 to Prometheus scrapes; both monitoring scrape configs target /metrics without auth headers (infra/monitoring/prometheus/prometheus.yml and infra/monitoring/railway/prometheus.yml), which makes gateway observability fail in secured environments.
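
A minimal way to carve out the scrape endpoint, sketched with a hypothetical requiresAuth helper rather than the PR's actual gateway code:

```typescript
// Routes that must stay reachable without BLOCKFUL_API_TOKEN so that
// Prometheus can scrape them. Kept as an explicit allowlist.
const UNAUTHENTICATED_ROUTES = new Set(["/metrics"]);

export function requiresAuth(method: string, url: string): boolean {
  // Normalize away any query string before matching the path.
  const path = url.split("?")[0];
  return !(method === "GET" && UNAUTHENTICATED_ROUTES.has(path));
}

// In the gateway handler:
//   const server = createServer(async (req, res) => {
//     if (requiresAuth(req.method ?? "GET", req.url ?? "/") &&
//         !validateAuthToken(req, res)) return;
//     ...
//   });
```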


Member

this should probably be fixed

Member

@pikonha pikonha left a comment

great implementation, just some minor things to consider. the only thing that needs clarification is the requirement of OTEL during dev on the API

Member

why does this have to be manually implemented? In fact, what does it do?

Collaborator Author

I'm pretty sure it does (?)

Member

why do we have a monitoring/railway package? what does it have to do with Railway specifically?

Collaborator Author

We could rename it to remote or something like that (though I think railway is fine); it's just a separate package that contains the configs for deploying to Railway (as opposed to the config files in monitoring, which are for running the setup locally during dev)

Member

could we move all of these to packages/observability?

Collaborator Author

why should it? packages/observability is a library package

Member

how can we differentiate the alerts from dev and prod? I don't wanna get pinged with urgency for dev alerts

Collaborator Author

Dev and prod will each have their own observability environment; the simplest way to differentiate them is to have dev and prod point to separate webhooks for the Slack integration
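
Concretely, each environment's Alertmanager config would carry its own webhook (a sketch; the channel names are made up, and since Alertmanager does not expand environment variables itself, the URL would be templated into the file at deploy time):

```yaml
receivers:
  - name: "slack"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."  # dev and prod use different webhooks
        channel: "#alerts-dev"                           # e.g. "#alerts-prod" in production
        title: "{{ .CommonAnnotations.summary }}"
```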

"build": "tsup",
"build:watch": "tsc --watch",
"dev:watch": "tsx watch src/index.ts",
"dev:watch": "tsx watch --import tsx/esm src/index.ts",
Member

same

Comment on lines +21 to +26
export const httpRequestTotal: Counter = meter.createCounter(
"http_server_requests_total",
{
description: "Total number of HTTP requests",
},
);
Member

we could have a counter for actual RPC requests sent from the API to a RPC node
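
Sketching that suggestion: a wrapper around the API's outbound RPC calls that bumps a counter per method and outcome. The RpcCounter interface mirrors the add(value, attributes) shape of an OTel Counter; the wrapper name and attribute keys are assumptions, not code from this PR:

```typescript
// Minimal shape of an OTel-style counter: add(value, attributes).
interface RpcCounter {
  add(value: number, attributes?: Record<string, string>): void;
}

// Wrap a single RPC call so each invocation is counted by method and outcome.
export function withRpcCounter<T>(
  counter: RpcCounter,
  rpcMethod: string,
  call: () => Promise<T>,
): Promise<T> {
  return call().then(
    (result) => {
      counter.add(1, { rpc_method: rpcMethod, outcome: "success" });
      return result;
    },
    (err) => {
      counter.add(1, { rpc_method: rpcMethod, outcome: "error" });
      throw err;
    },
  );
}
```

With the real meter this counter would live next to httpRequestTotal, created via meter.createCounter (e.g. "rpc_client_requests_total").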

const observability = createObservabilityProvider("anticapture-api");

export const exporter = observability.exporter;
export const meterProvider = observability.meterProvider;
Member

there it is

Comment on lines +12 to +21
const start = performance.now();
await next();
const duration = (performance.now() - start) / 1000;

const spanContext = trace.getActiveSpan()?.spanContext();
const traceId = spanContext?.traceId;

if (traceId) {
c.res.headers.set("X-Trace-Id", traceId);
}
Member

fire 🔥


const server = createServer((req, res) => {
const server = createServer(async (req, res) => {
if (!validateAuthToken(req, res)) return;
Member

this should probably be fixed

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f25cd8302


serviceName: string,
): ObservabilityProvider {
const collectorEndpoint =
process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318";

This comment was marked as resolved.


const labels = {
[ATTR_HTTP_REQUEST_METHOD]: c.req.method,
[ATTR_HTTP_ROUTE]: c.req.routePath ?? c.req.path,

This comment was marked as resolved.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45891cd101



export const metricsMiddleware: MiddlewareHandler = async (c, next) => {
const start = performance.now();
await next();

This comment was marked as resolved.

metrics_path: "/metrics"
scrape_interval: 15s
static_configs:
- targets: ["172.17.0.1:42069"]


P2 Badge Point API and indexer scrape jobs at distinct targets

This local Prometheus config scrapes both anticapture-ens-indexer and anticapture-api from the same endpoint (172.17.0.1:42069), so one running service is counted as two healthy jobs and the other service's outage is masked. In practice this duplicates metrics under different job labels and makes ServiceDown/latency/error alerts for API vs indexer unreliable unless the targets are separated.
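
i.e. split the jobs onto their own endpoints, something like (the indexer keeps 42069 from the existing config; the API port here is an assumption):

```yaml
- job_name: "anticapture-ens-indexer"
  metrics_path: "/metrics"
  static_configs:
    - targets: ["172.17.0.1:42069"]  # indexer metrics port from the existing config
- job_name: "anticapture-api"
  metrics_path: "/metrics"
  static_configs:
    - targets: ["172.17.0.1:3001"]   # assumed API port, distinct from the indexer
```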


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e04ff90f8


metrics_path: "/metrics"
scrape_interval: 15s
static_configs:
- targets: ["172.17.0.1:42069"]


P2 Badge Use stable host alias for local scrape targets

All three local scrape jobs are pinned to 172.17.0.1, which only works when the Docker bridge gateway happens to use that address; on Docker Desktop and many custom bridge subnets, Prometheus cannot reach host-run services and each job stays up=0. Since infra/monitoring/docker-compose.yml already configures host.docker.internal via extra_hosts, these fixed IP targets make the local monitoring setup fail in common environments.
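
i.e. swap the fixed bridge IP for the alias that the compose file already maps via extra_hosts:

```yaml
static_configs:
  - targets: ["host.docker.internal:42069"]  # resolves on Docker Desktop and custom bridges alike
```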


@PedroBinotto
Collaborator Author

@claude

This comment was marked as resolved.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a26971620


"clean": "rm -rf node_modules generated .ponder dump *.tsbuildinfo"
},
"dependencies": {
"@anticapture/observability": "workspace:*",


P1 Badge Replace workspace dependency in indexer install path

Adding "@anticapture/observability": "workspace:*" makes the indexer container build fail, because infra/indexer/Dockerfile.indexer installs dependencies with plain npm install after copying only apps/indexer (no workspace root/package source). In that context npm cannot resolve workspace:* (EUNSUPPORTEDPROTOCOL), so the image no longer builds and the compose indexer service cannot start.


Collaborator Author

plain wrong

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96312824a4


Comment on lines +73 to +74
await meterProvider.shutdown();
await tracerProvider.shutdown();


P2 Badge Run caller shutdown hook even if telemetry flush fails

If meterProvider.shutdown() or tracerProvider.shutdown() rejects (for example during exporter flush timeouts), execution aborts before onShutdown?.() runs. Services that rely on this hook to close HTTP servers or exit the process will then skip their cleanup path during SIGTERM/SIGINT, which can leave the process hanging until forcibly killed.
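
The suggested ordering can be sketched as follows (shutdownAll and the hook signature are hypothetical; the point is the try/finally plus Promise.allSettled, so a failed flush cannot skip the caller's cleanup):

```typescript
// Flush telemetry providers, but guarantee the caller's cleanup hook always
// runs, even when an exporter flush times out and rejects.
export async function shutdownAll(
  providers: Array<{ shutdown: () => Promise<void> }>,
  onShutdown?: () => void | Promise<void>,
): Promise<void> {
  try {
    // Settle every provider so one failure doesn't skip the others.
    const results = await Promise.allSettled(providers.map((p) => p.shutdown()));
    const failed = results.find((r) => r.status === "rejected");
    if (failed) console.error("telemetry flush failed", failed);
  } finally {
    // Close HTTP servers / exit the process regardless of flush outcome.
    await onShutdown?.();
  }
}
```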


