feat: Observability and monitoring for Anticapture #1683
PedroBinotto wants to merge 45 commits into dev
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4aed2e954d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
process.once("SIGTERM", shutdown);
process.once("SIGINT", shutdown);
Preserve default SIGTERM exit after telemetry shutdown
Registering SIGTERM/SIGINT handlers here overrides Node’s default signal-exit behavior, but the handler only shuts down telemetry providers and never closes the HTTP server or exits the process; in services that keep sockets open, a normal SIGTERM (for deploy/stop) will no longer terminate promptly and will require a forced kill. This affects every app that calls createObservabilityProvider.
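A sketch of one possible fix, keeping the telemetry flush but restoring process termination. The `registerShutdown` helper and its arguments are illustrative, not the package's actual API:

```typescript
import { createServer, type Server } from "node:http";

// Hypothetical helper: flush telemetry first, then restore the default
// effect of the signal by closing the server and exiting the process.
function registerShutdown(
  server: Server,
  shutdownTelemetry: () => Promise<void>,
): void {
  const shutdown = async () => {
    try {
      await shutdownTelemetry(); // flush pending metrics/traces
    } finally {
      // Without this, a handled SIGTERM/SIGINT no longer terminates the
      // process when sockets are still open.
      server.close(() => process.exit(0));
      setTimeout(() => process.exit(1), 5_000).unref(); // forced fallback
    }
  };
  process.once("SIGTERM", shutdown);
  process.once("SIGINT", shutdown);
}
```

The key point is that the handler must eventually exit the process, however the server handle is threaded in.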
apps/api-gateway/src/index.ts (Outdated)

- const server = createServer((req, res) => {
+ const server = createServer(async (req, res) => {
+   if (!validateAuthToken(req, res)) return;
Exempt metrics route from gateway token validation
The gateway now enforces validateAuthToken before routing /metrics, so any deployment with BLOCKFUL_API_TOKEN set will return 401 to Prometheus scrapes; both monitoring scrape configs target /metrics without auth headers (infra/monitoring/prometheus/prometheus.yml and infra/monitoring/railway/prometheus.yml), which makes gateway observability fail in secured environments.
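One way to address this, sketched under the assumption that the gateway's `validateAuthToken` checks a bearer token against `BLOCKFUL_API_TOKEN`; both helper bodies here are illustrative stand-ins, not the gateway's actual code:

```typescript
import {
  createServer,
  type IncomingMessage,
  type ServerResponse,
} from "node:http";

// Stand-in for the gateway's real helper: 401 unless the token matches.
function validateAuthToken(req: IncomingMessage, res: ServerResponse): boolean {
  const token = process.env.BLOCKFUL_API_TOKEN;
  if (!token) return true; // auth disabled
  if (req.headers.authorization === `Bearer ${token}`) return true;
  res.writeHead(401).end();
  return false;
}

// Exempt the Prometheus scrape path from token validation.
function isMetricsRequest(req: IncomingMessage): boolean {
  return new URL(req.url ?? "/", "http://localhost").pathname === "/metrics";
}

const server = createServer(async (req, res) => {
  if (!isMetricsRequest(req) && !validateAuthToken(req, res)) return;
  // ...route the request as before...
  res.end();
});
```

Alternatively, the scrape configs could send credentials via Prometheus's `authorization` scrape setting; exempting /metrics just keeps both prometheus.yml files unchanged.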
pikonha left a comment:
great implementation, just some minor things to consider. the only thing that needs clarification is the requirement of OTEL during dev on the API
why does this have to be manually implemented? In fact, what does it do?
I'm pretty sure it does (?)
why do we have a monitoring/railway package? what does it have to do with Railway specifically?
We could rename it to remote or something like that (though I think railway is fine); it's just a separate package that contains the configs for deploying to Railway (as opposed to the config files in monitoring, which are for running the setup locally for dev)
could we move all of these to packages/observability?
why should it? packages/observability is a library package
how can we differentiate the alerts from dev and prod? I don't wanna get pinged with urgency for dev alerts
Dev and prod will each have their own observability environment instances; the simplest way to differentiate them is to have dev and prod pointing to separate webhooks for the Slack integration
apps/api/package.json (Outdated)

  "build": "tsup",
  "build:watch": "tsc --watch",
- "dev:watch": "tsx watch src/index.ts",
+ "dev:watch": "tsx watch --import tsx/esm src/index.ts",
export const httpRequestTotal: Counter = meter.createCounter(
  "http_server_requests_total",
  {
    description: "Total number of HTTP requests",
  },
);
we could have a counter for actual RPC requests sent from the API to an RPC node
const observability = createObservabilityProvider("anticapture-api");

export const exporter = observability.exporter;
export const meterProvider = observability.meterProvider;
const start = performance.now();
await next();
const duration = (performance.now() - start) / 1000;

const spanContext = trace.getActiveSpan()?.spanContext();
const traceId = spanContext?.traceId;

if (traceId) {
  c.res.headers.set("X-Trace-Id", traceId);
}
apps/api-gateway/src/index.ts (Outdated)

- const server = createServer((req, res) => {
+ const server = createServer(async (req, res) => {
+   if (!validateAuthToken(req, res)) return;
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f25cd8302
packages/observability/src/index.ts (Outdated)

  serviceName: string,
): ObservabilityProvider {
  const collectorEndpoint =
    process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? "http://localhost:4318";
const labels = {
  [ATTR_HTTP_REQUEST_METHOD]: c.req.method,
  [ATTR_HTTP_ROUTE]: c.req.routePath ?? c.req.path,
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 45891cd101
export const metricsMiddleware: MiddlewareHandler = async (c, next) => {
  const start = performance.now();
  await next();
  metrics_path: "/metrics"
  scrape_interval: 15s
  static_configs:
    - targets: ["172.17.0.1:42069"]
Point API and indexer scrape jobs at distinct targets
This local Prometheus config scrapes both anticapture-ens-indexer and anticapture-api from the same endpoint (172.17.0.1:42069), so one running service is counted as two healthy jobs and the other service's outage is masked. In practice this duplicates metrics under different job labels and makes ServiceDown/latency/error alerts for API vs indexer unreliable unless the targets are separated.
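A sketch of the separation; only the indexer's port 42069 appears in the diff, so the API port below is illustrative:

```yaml
scrape_configs:
  - job_name: "anticapture-ens-indexer"
    metrics_path: "/metrics"
    scrape_interval: 15s
    static_configs:
      - targets: ["172.17.0.1:42069"]
  - job_name: "anticapture-api"
    metrics_path: "/metrics"
    scrape_interval: 15s
    static_configs:
      # Hypothetical port; point at whatever port serves the API's /metrics.
      - targets: ["172.17.0.1:42070"]
```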
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e04ff90f8
  metrics_path: "/metrics"
  scrape_interval: 15s
  static_configs:
    - targets: ["172.17.0.1:42069"]
Use stable host alias for local scrape targets
All three local scrape jobs are pinned to 172.17.0.1, which only works when the Docker bridge gateway happens to use that address; on Docker Desktop and many custom bridge subnets, Prometheus cannot reach host-run services and each job stays up=0. Since infra/monitoring/docker-compose.yml already configures host.docker.internal via extra_hosts, these fixed IP targets make the local monitoring setup fail in common environments.
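A sketch using the compose-provided alias instead of the fixed bridge IP (same job shape as the diff, illustrative port):

```yaml
scrape_configs:
  - job_name: "anticapture-api"
    metrics_path: "/metrics"
    scrape_interval: 15s
    static_configs:
      # host.docker.internal is mapped via extra_hosts in
      # infra/monitoring/docker-compose.yml, so it works on Docker Desktop
      # and custom bridge subnets alike.
      - targets: ["host.docker.internal:42069"]
```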
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8a26971620
  "clean": "rm -rf node_modules generated .ponder dump *.tsbuildinfo"
},
"dependencies": {
  "@anticapture/observability": "workspace:*",
Replace workspace dependency in indexer install path
Adding "@anticapture/observability": "workspace:*" makes the indexer container build fail, because infra/indexer/Dockerfile.indexer installs dependencies with plain npm install after copying only apps/indexer (no workspace root/package source). In that context npm cannot resolve workspace:* (EUNSUPPORTEDPROTOCOL), so the image no longer builds and the compose indexer service cannot start.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96312824a4
await meterProvider.shutdown();
await tracerProvider.shutdown();
Run caller shutdown hook even if telemetry flush fails
If meterProvider.shutdown() or tracerProvider.shutdown() rejects (for example during exporter flush timeouts), execution aborts before onShutdown?.() runs. Services that rely on this hook to close HTTP servers or exit the process will then skip their cleanup path during SIGTERM/SIGINT, which can leave the process hanging until forcibly killed.
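A sketch of the suggested ordering, with the flush moved into a try/finally; the provider names mirror the diff, but the function shape is illustrative:

```typescript
// Sketch: guarantee the caller's onShutdown hook runs even when a telemetry
// flush rejects (e.g. an exporter flush timeout).
interface Shutdownable {
  shutdown(): Promise<void>;
}

async function shutdownTelemetry(
  meterProvider: Shutdownable,
  tracerProvider: Shutdownable,
  onShutdown?: () => void | Promise<void>,
): Promise<void> {
  try {
    // allSettled: a failed metrics flush no longer skips the tracer flush.
    await Promise.allSettled([
      meterProvider.shutdown(),
      tracerProvider.shutdown(),
    ]);
  } finally {
    await onShutdown?.();
  }
}
```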
Observability Infrastructure (packages/observability, infra/monitoring/)
A new shared @anticapture/observability package centralizes all telemetry setup, used by every service.
Telemetry data flow
Currently configured Prometheus alert rules
Currently configured Grafana dashboards
Unified Anticapture dashboard
14 panels, 15s refresh.

Overview Dashboard (overview.json)
Focused on database metrics, with panels served from Prometheus. Refreshes on auto.

Host Metrics Dashboard (host-metrics.json)
6 panels, 15s refresh:
- Process CPU usage % (by job + state)
- Process memory usage (by type)
- System CPU usage %
- System memory usage
- Network I/O (by direction)
- Network errors & dropped packets

Tracing Dashboard (tracing.json)
6 panels, 15s refresh:
- Request rate: HTTP req/s by route + method
- Error rate: 5xx errors/s (red-highlighted)
- Latency: P99/P95/P50 via histogram_quantile()
- Latency by route: per-route P99 breakdown
- Status code breakdown: pie chart of the last 5m
- Trace search: link panel to the Tempo explorer

Adding new rules and dashboards
Alert rules live in two files that must be kept in sync:
- prometheus/alerts.yml: used locally via compose.yaml
- railway/alerts.yml: deployed to Railway

Adding an alert
Add a new entry under the rules list in both files.

Severity convention:
- critical: requires immediate action (service down, high error rate)
- warning: degraded but not broken (high latency, elevated CPU)

Useful PromQL patterns
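As a hedged illustration of both a new rules entry and a typical error-rate PromQL pattern (the alert name, threshold, and grouping are made up; the metric and labels come from the HTTP middleware in this PR):

```yaml
groups:
  - name: anticapture
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_total{http_response_status_code=~"5.."}[5m]))
            / sum(rate(http_server_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API 5xx rate above 5% for 5 minutes"
```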
Reloading rules without restart
Prometheus can pick up rule changes without a full restart (for example via its reload endpoint, when lifecycle management is enabled).

Testing an expression
Open http://localhost:9090/graph, paste the expr, and check the result before committing.

Grafana dashboards
Dashboard JSON files live in grafana/dashboards/; Grafana auto-provisions them on startup:
- overview.json
- host-metrics.json
- tracing.json

Editing a dashboard
Save the updated dashboard JSON back to grafana/dashboards/.

Adding a new dashboard
Add the dashboard JSON as grafana/dashboards/<name>.json.

All metrics are labeled with http_request_method, http_route, and http_response_status_code. Trace IDs are injected into HTTP responses via the X-Trace-Id header, linking requests to their Tempo traces.

Deployment: All monitoring components have Railway-ready Dockerfiles and .railway.toml configs under infra/monitoring/railway/. A local compose.yaml is also provided for development. API services previously built with Railpack must be configured to use the provided Dockerfiles.