feat: Observability and monitoring for Anticapture by PedroBinotto · Pull Request #1683 · blockful/anticapture

PedroBinotto · 2026-03-03T17:57:25Z

Observability Infrastructure (packages/observability, infra/monitoring/)

A new shared @anticapture/observability package centralizes all telemetry
setup, used by every service.

Telemetry data flow

Services (API, Gateway, Indexer)
  │
  ├── METRICS (pull)
  │   └── GET /metrics (Prometheus text format)
  │           │
  │           └── Prometheus scrapes every 15s
  │                   │
  │                   └── TSDB storage (:9090)
  │                           │
  │                           ├── Grafana dashboards (:3000)
  │                           └── alerts.yml → Alertmanager (:9093) → Slack
  │
  └── TRACES (push)
      └── POST :4318/v1/traces (OTLP HTTP)
              │
              └── OTel Collector
                      │
                      └── Tempo (gRPC :4317)
                              │
                              └── Grafana (Tempo datasource)

Currently configured Prometheus alert rules

Alert	Condition	Severity
ServiceDown	up == 0 for 1 min	critical
HighCPUUsage	CPU > 80% for 5 min	warning
HighLatency	p99 latency > 2s for 5 min	warning
HighErrorRate	5xx rate > 1% for 2 min	critical

Currently configured Grafana dashboards

Unified Anticapture dashboad

14 panels, 15s refresh:

Process CPU usage % (by job + state);
Process memory usage (by type);
System CPU usage %;
System memory usage;
Network I/O (by direction);
Network errors & dropped packets;
Request rate — HTTP req/s by route + method;
Error rate — 5xx errors/s (red-highlighted);
Latency — P99/P95/P50 via histogram_quantile();
Latency by route — per-route P99 breakdown;
Status code breakdown — pie chart of last 5m;
Trace search — link panel to Tempo explorer;
RPC Request counter;
RPC Request rate (time series);

Overview Dashboard (overview.json)

~~Focused on database metrics, with panels served from Prometheus. Refreshes on auto.~~

Host Metrics Dashboard (host-metrics.json)

~~6 panels, 15s refresh:~~

~~Process CPU usage % (by job + state);~~
~~Process memory usage (by type);~~
~~System CPU usage %;~~
~~System memory usage;~~
~~Network I/O (by direction);~~
~~Network errors & dropped packets;~~

Tracing Dashboard (tracing.json)

~~6 panels, 15s refresh:~~

~~Request rate — HTTP req/s by route + method;~~
~~Error rate — 5xx errors/s (red-highlighted);~~
~~Latency — P99/P95/P50 via histogram_quantile();~~
~~Latency by route — per-route P99 breakdown;~~
~~Status code breakdown — pie chart of last 5m;~~
~~Trace search — link panel to Tempo explorer;~~

Adding new rules and dashboards

Alert rules live in two files that must be kept in sync:

prometheus/alerts.yml — used locally via compose.yaml
railway/alerts.yml — deployed to Railway

Adding an alert

Add a new entry under the rules list in both files:

- alert: MyAlert
  expr: <promql expression>
  for: <duration>          # how long condition must hold before firing
  labels:
    severity: warning      # warning | critical
  annotations:
    summary: "Short description"
    description: "Detail with {{ $labels.job }} and {{ $value }}"

Severity convention:

critical — requires immediate action (service down, high error rate)
warning — degraded but not broken (high latency, elevated CPU)

Useful PromQL patterns

# Request rate per job
sum(rate(http_server_request_duration_seconds_count[1m])) by (job)

# p99 latency
histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le))

# 5xx error rate
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[2m]))
/
sum(rate(http_server_request_duration_seconds_count[2m]))

# Service up/down
up{job="anticapture-api"}

Reloading rules without restart

curl -X POST http://localhost:9090/-/reload

Testing an expression

Open http://localhost:9090/graph, paste the expr, and check the result before committing.

Grafana dashboards

Dashboard JSON files live in grafana/dashboards/. Grafana auto-provisions them on startup.

File	Contents
`overview.json`	Database metrics overview
`host-metrics.json`	CPU, memory, network per service
`tracing.json`	HTTP request rate, latency, error rate, trace search

Editing a dashboard

Open Grafana at http://localhost:3000 and navigate to the dashboard.
Make your changes in the UI.
Click Share → Export → Save to file.
Replace the corresponding JSON file in grafana/dashboards/.
Commit both the JSON file and any related alert changes together.

Adding a new dashboard

Build the dashboard in Grafana UI.
Export it: Share → Export → toggle Export for sharing externally on → Save to file.
Save the file to grafana/dashboards/<name>.json.
Restart Grafana (or wait — provisioning picks up new files on restart):

docker compose --profile monitoring restart grafana

All metrics are labeled with http_request_method, http_route, and
http_response_status_code. Trace IDs are injected into HTTP responses via the
X-Trace-Id header, linking requests to their Tempo traces.

Deployment: All monitoring components have Railway-ready Dockerfiles and .railway.toml configs under infra/monitoring/railway/. A local compose.yaml is also provided for development. API services previously built with Railpack must be configured to used the provided Dockerfiles.

vercel · 2026-03-03T17:57:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
anticapture	Ready	Preview, Comment	Mar 12, 2026 6:02pm

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
anticapture-storybook	Skipped		Mar 12, 2026 6:02pm

PedroBinotto · 2026-03-03T20:48:34Z

@claude

PedroBinotto · 2026-03-03T21:24:33Z

@claude

feat: enable tracing

PedroBinotto · 2026-03-05T19:42:17Z

@claude

PedroBinotto · 2026-03-09T20:19:49Z

@claude

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4aed2e954d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-09T20:32:51Z

packages/observability/src/index.ts

+  process.once("SIGTERM", shutdown);
+  process.once("SIGINT", shutdown);


Preserve default SIGTERM exit after telemetry shutdown

Registering SIGTERM/SIGINT handlers here overrides Node’s default signal-exit behavior, but the handler only shuts down telemetry providers and never closes the HTTP server or exits the process; in services that keep sockets open, a normal SIGTERM (for deploy/stop) will no longer terminate promptly and will require a forced kill. This affects every app that calls createObservabilityProvider.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-03-09T20:32:51Z

apps/api-gateway/src/index.ts


-  const server = createServer((req, res) => {
+  const server = createServer(async (req, res) => {
    if (!validateAuthToken(req, res)) return;


Exempt metrics route from gateway token validation

The gateway now enforces validateAuthToken before routing /metrics, so any deployment with BLOCKFUL_API_TOKEN set will return 401 to Prometheus scrapes; both monitoring scrape configs target /metrics without auth headers (infra/monitoring/prometheus/prometheus.yml and infra/monitoring/railway/prometheus.yml), which makes gateway observability fail in secured environments.

Useful? React with 👍 / 👎.

this should probably be fixed

pikonha

great implementation, just some minor things to consider. the only thing that needs clarification is the requirement of OTEL during dev on the API

pikonha · 2026-03-09T23:10:50Z

packages/observability/src/index.ts

why does this have to be manually implemented? In fact, what does it do?

I'm pretty sure it does (?)

Dockerfile.api

pikonha · 2026-03-09T23:12:34Z

infra/monitoring/railway/datasources/prometheus.yml

why do we have a monitoring/railway package? what does it have to do with Railway specifically?

We could rename it to remote or something like that (though I think railwayis fine); its just a separate package that contains the configs for deploying to Railway (as opposed to the config files in monitoring which are for running the setup locally for dev)

pikonha · 2026-03-09T23:12:58Z

infra/monitoring/prometheus/alerts.yml

could we move all of these to packages/observability?

why should it? packages/observability is a library package

pikonha · 2026-03-09T23:13:32Z

infra/monitoring/alertmanager/config.yml

how can we differentiate the alerts from dev and prod? I don't wanna get pinged with urgency for dev alerts

Dev and prod will have each their own instances of observability environments - the simples way to differentiate them is to have dev and prod pointing to separate webhooks for slack integration

pikonha · 2026-03-09T23:18:47Z

apps/api/package.json

    "build": "tsup",
    "build:watch": "tsc --watch",
-    "dev:watch": "tsx watch src/index.ts",
+    "dev:watch": "tsx watch --import tsx/esm src/index.ts",


pikonha · 2026-03-09T23:19:26Z

apps/api/src/metrics.ts

+export const httpRequestTotal: Counter = meter.createCounter(
+  "http_server_requests_total",
+  {
+    description: "Total number of HTTP requests",
+  },
+);


we could have a counter for actual RPC requests sent from the API to a RPC node

pikonha · 2026-03-09T23:20:50Z

apps/api/src/instrumentation.ts

+const observability = createObservabilityProvider("anticapture-api");
+
+export const exporter = observability.exporter;
+export const meterProvider = observability.meterProvider;


pikonha · 2026-03-09T23:21:31Z

apps/api/src/middlewares/metricsMiddleware.ts

+  const start = performance.now();
+  await next();
+  const duration = (performance.now() - start) / 1000;
+
+  const spanContext = trace.getActiveSpan()?.spanContext();
+  const traceId = spanContext?.traceId;
+
+  if (traceId) {
+    c.res.headers.set("X-Trace-Id", traceId);
+  }


pikonha · 2026-03-09T23:23:14Z

apps/api-gateway/src/index.ts


-  const server = createServer((req, res) => {
+  const server = createServer(async (req, res) => {
    if (!validateAuthToken(req, res)) return;


this should probably be fixed

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f25cd8302

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

packages/observability/src/index.ts

+  serviceName: string,
+): ObservabilityProvider {
+  const collectorEndpoint =
+    process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318";


apps/api/src/middlewares/metricsMiddleware.ts

+
+  const labels = {
+    [ATTR_HTTP_REQUEST_METHOD]: c.req.method,
+    [ATTR_HTTP_ROUTE]: c.req.routePath ?? c.req.path,


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45891cd101

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

apps/api/src/middlewares/metricsMiddleware.ts

+
+export const metricsMiddleware: MiddlewareHandler = async (c, next) => {
+  const start = performance.now();
+  await next();


chatgpt-codex-connector · 2026-03-10T21:33:55Z

infra/monitoring/prometheus/prometheus.yml

+    metrics_path: "/metrics"
+    scrape_interval: 15s
+    static_configs:
+      - targets: ["172.17.0.1:42069"]


Point API and indexer scrape jobs at distinct targets

This local Prometheus config scrapes both anticapture-ens-indexer and anticapture-api from the same endpoint (172.17.0.1:42069), so one running service is counted as two healthy jobs and the other service's outage is masked. In practice this duplicates metrics under different job labels and makes ServiceDown/latency/error alerts for API vs indexer unreliable unless the targets are separated.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e04ff90f8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-11T00:26:17Z

infra/monitoring/prometheus/prometheus.yml

+    metrics_path: "/metrics"
+    scrape_interval: 15s
+    static_configs:
+      - targets: ["172.17.0.1:42069"]


Use stable host alias for local scrape targets

All three local scrape jobs are pinned to 172.17.0.1, which only works when the Docker bridge gateway happens to use that address; on Docker Desktop and many custom bridge subnets, Prometheus cannot reach host-run services and each job stays up=0. Since infra/monitoring/docker-compose.yml already configures host.docker.internal via extra_hosts, these fixed IP targets make the local monitoring setup fail in common environments.

Useful? React with 👍 / 👎.

PedroBinotto · 2026-03-11T17:22:15Z

@claude

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a26971620

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-11T17:32:41Z

apps/indexer/package.json

    "clean": "rm -rf node_modules generated .ponder dump *.tsbuildinfo"
  },
  "dependencies": {
+    "@anticapture/observability": "workspace:*",


Replace workspace dependency in indexer install path

Adding "@anticapture/observability": "workspace:*" makes the indexer container build fail, because infra/indexer/Dockerfile.indexer installs dependencies with plain npm install after copying only apps/indexer (no workspace root/package source). In that context npm cannot resolve workspace:* (EUNSUPPORTEDPROTOCOL), so the image no longer builds and the compose indexer service cannot start.

Useful? React with 👍 / 👎.

plain wrong

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96312824a4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-11T21:26:40Z

packages/observability/src/index.ts

+    await meterProvider.shutdown();
+    await tracerProvider.shutdown();


Run caller shutdown hook even if telemetry flush fails

If meterProvider.shutdown() or tracerProvider.shutdown() rejects (for example during exporter flush timeouts), execution aborts before onShutdown?.() runs. Services that rely on this hook to close HTTP servers or exit the process will then skip their cleanup path during SIGTERM/SIGINT, which can leave the process hanging until forcibly killed.

Useful? React with 👍 / 👎.

feat: vibeslopped init commit for OTEL instrumentation

4997ffb

vercel bot deployed to Preview – anticapture March 3, 2026 17:58 View deployment

chore: move observability config to separate package

d596f9a

vercel bot deployed to Preview – anticapture-storybook March 3, 2026 20:46 View deployment

vercel bot deployed to Preview – anticapture March 3, 2026 20:47 View deployment

This comment was marked as resolved.

Sign in to view

chore: swapping prom-api to OTEL

5f81c04

vercel bot temporarily deployed to Preview – anticapture-storybook March 3, 2026 21:23 Inactive

vercel bot deployed to Preview – anticapture March 3, 2026 21:24 View deployment

This comment was marked as resolved.

Sign in to view

PedroBinotto added 3 commits March 3, 2026 18:44

feat: enable tracing

4c6cd7f

WIP

8eec2a9

fix: silent noop

6099945

vercel bot temporarily deployed to Preview – anticapture-storybook March 5, 2026 18:06 Inactive

vercel bot deployed to Preview – anticapture March 5, 2026 18:07 View deployment

github-actions bot added the API label Mar 5, 2026

Merge branch 'feat/otel-instrumentation' into feat/otel-tracing

a48890c

github-actions bot added Indexer Gateway labels Mar 5, 2026

PedroBinotto added 4 commits March 5, 2026 15:14

fix: swap out deprecated method being used

fd3d23d

feat: lfg

b81244e

feat: gateway instrumentation and config file cleanup

3be9ad7

Merge pull request #1685 from blockful/feat/otel-tracing

b50d418

feat: enable tracing

vercel bot deployed to Preview – anticapture-storybook March 5, 2026 19:39 View deployment

vercel bot deployed to Preview – anticapture March 5, 2026 19:40 View deployment

Merge branch 'dev' into feat/otel-instrumentation

a9bdc22

vercel bot deployed to Preview – anticapture-storybook March 9, 2026 19:57 View deployment

vercel bot deployed to Preview – anticapture March 9, 2026 19:58 View deployment

fix: claude review

8dbae75

railway-app bot temporarily deployed to anticapture-infra / dev March 9, 2026 20:07 Inactive

vercel bot deployed to Preview – anticapture-storybook March 9, 2026 20:08 View deployment

vercel bot deployed to Preview – anticapture March 9, 2026 20:09 View deployment

fix: message composition

4aed2e9

railway-app bot temporarily deployed to anticapture-infra / dev March 9, 2026 20:18 Inactive

vercel bot deployed to Preview – anticapture-storybook March 9, 2026 20:19 View deployment

This comment was marked as resolved.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 9, 2026

View reviewed changes

pikonha requested changes Mar 9, 2026

View reviewed changes

PedroBinotto added 2 commits March 10, 2026 18:12

feat: add RPC counter

4f25cd8

Merge branch 'dev' into feat/otel-instrumentation

d295103

chatgpt-codex-connector bot reviewed Mar 10, 2026

View reviewed changes

feat: add rpc counter to dashboard

45891cd

chatgpt-codex-connector bot reviewed Mar 10, 2026

View reviewed changes

refactor: unify dashboards and remove bloat

9e04ff9

chatgpt-codex-connector bot reviewed Mar 11, 2026

View reviewed changes

PedroBinotto added 3 commits March 11, 2026 13:37

fix: grafana process metric panels

6fd7550

fix: metrics middleware

07a1ea0

fix: noop exporter

8a26971

This comment was marked as resolved.

Sign in to view

chatgpt-codex-connector bot reviewed Mar 11, 2026

View reviewed changes

fix: rpc client wrapper typing

9631282

chatgpt-codex-connector bot reviewed Mar 11, 2026

View reviewed changes

pikonha approved these changes Mar 12, 2026

View reviewed changes

fix: remove unncecessary cast

e7e8890

		process.once("SIGTERM", shutdown);
		process.once("SIGINT", shutdown);

		await meterProvider.shutdown();
		await tracerProvider.shutdown();

Conversation

PedroBinotto commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Observability Infrastructure (packages/observability, infra/monitoring/)

Telemetry data flow

Currently configured Prometheus alert rules

Currently configured Grafana dashboards

Unified Anticapture dashboad

Overview Dashboard (overview.json)

Host Metrics Dashboard (host-metrics.json)

Tracing Dashboard (tracing.json)

Adding new rules and dashboards

Adding an alert

Useful PromQL patterns

Reloading rules without restart

Testing an expression

Grafana dashboards

Editing a dashboard

Adding a new dashboard

Uh oh!

vercel bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PedroBinotto commented Mar 3, 2026

Uh oh!

This comment was marked as resolved.

PedroBinotto commented Mar 3, 2026

Uh oh!

This comment was marked as resolved.

PedroBinotto commented Mar 5, 2026

Uh oh!

PedroBinotto commented Mar 9, 2026

Uh oh!

This comment was marked as resolved.

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pikonha left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

PedroBinotto commented Mar 3, 2026 •

edited

Loading

vercel bot commented Mar 3, 2026 •

edited

Loading