Sync to upstream (ws keepalive fix) + re-apply OTLP gRPC export by tnsardesai · Pull Request #2 · kernel/locomotive

tnsardesai · 2026-07-01T19:26:32Z

Why

The deployed forwarder (rgarcia/otel-support) branched from upstream before 2026-03-09 and is missing upstream's subscription-reliability rewrite. The gap has a concrete production impact: the fork's websocket layer has no keepalive. SafeConnRead calls conn.Read(ctx) with the process-lifetime context and no read deadline, and coder/websocket does not auto-ping. When a long-lived Railway subscription goes half-open (connection dies without a close frame reaching the client — exactly what an edge/network change causes), the read blocks forever with no error:

the canvas-invalidation subscription parks permanently → deployment discovery never re-queries → new deployments are never picked up;
the per-deployment HTTP-log reader stays pinned to a drained deployment ("waiting for cargo");
because the goroutine hangs rather than erroring, the process never exits and never restarts, so it never gets the startup re-seed that would recover it.

The result: a healthy-looking process forwarding nothing for days after an edge cutover, with no crash or error to signal it.

What

Sync rgarcia/otel-support up to upstream aaefe1d and re-apply the fork's OTLP export on top of the rewritten base.

Upstream fixes included (most important first):

Websocket ping keepalive (internal/railway/subscribe/helpers.go): 30s jittered ping / 10s timeout; a failed ping closes the connection to force a resubscribe. This is the direct fix for the stall above.
Subscription.Run refactor, exponential backoff + jitter on resubscribes, and domain-filtered deployment tailing.

Re-applied OTel support (adapted to the new base):

internal/otel/ — OTLP gRPC log exporter + record transforms (service.name from OTEL_SERVICE_NAME, railway.* attributes, status-derived severity).
internal/pipeline/otel.go — an OTel pipeline variant that emits via the SDK instead of the webhook dispatcher, reusing the same subscription retry/backoff.
internal/config — OTEL_* vars; WEBHOOK_URL is no longer required when OTEL_ENABLED=true.
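
The relaxed `WEBHOOK_URL` requirement can be sketched as below. This is an assumption-laden illustration: the `Config` struct, `loadConfig`, and the `getenv` injection are hypothetical and do not mirror the real `internal/config` package, which the PR does not show.

```go
package main

import "fmt"

// Config holds only the two fields this sketch needs (hypothetical struct).
type Config struct {
	OTelEnabled bool   // from OTEL_ENABLED
	WebhookURL  string // from WEBHOOK_URL
}

// loadConfig reads variables through getenv so it can be exercised without
// touching the process environment. WEBHOOK_URL is only mandatory when the
// OTel path is off.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		OTelEnabled: getenv("OTEL_ENABLED") == "true",
		WebhookURL:  getenv("WEBHOOK_URL"),
	}
	if !cfg.OTelEnabled && cfg.WebhookURL == "" {
		return Config{}, fmt.Errorf("WEBHOOK_URL is required unless OTEL_ENABLED=true")
	}
	return cfg, nil
}

func main() {
	env := map[string]string{"OTEL_ENABLED": "true"}
	cfg, err := loadConfig(func(k string) string { return env[k] })
	fmt.Println(cfg.OTelEnabled, err)
}
```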

Behavior is unchanged for the deployed config

The port keeps the gRPC transport and the synthetic service.name from OTEL_SERVICE_NAME (rather than upstream's native otel_http mode, which sets service.name to the real Railway service name). That preserves the ServiceName value the uptime pipeline's ClickHouse query filters on — no downstream query change needed.

Merge note

This branch is a merge commit descending from rgarcia/otel-support with upstream/main as a second parent, so it fast-forward-merges cleanly and brings the full upstream history with it.

Test plan

go build ./..., go vet ./..., go test ./... all pass.
Deploy to a non-prod environment with the current OTEL_* vars and confirm http_logs_processed increments and rows land in ClickHouse under ServiceName='railway-http-logs'.
Verify recovery: force a resubscribe / deployment change and confirm the forwarder re-points instead of going quiet.

Map the Railway environment name onto Sentry's reserved `sentry.environment` log attribute so logs are filterable by environment in the Sentry UI, instead of only surfacing it as a generic `_metadata__environment_name` attribute.

The reserved key is declared (empty) in the Item template alongside the other `sentry.*` defaults and filled per-log from the `environment_name` metadata for both environment and HTTP logs. The backslash in the sjson path escapes the dot so the flat key `sentry.environment` is addressed rather than a nested path.

Also add the reserved `sentry.origin` attribute (`auto.log.locomotive`, a constant) to label the logs' source in Sentry, matching the `auto.log.<integration>` convention used by the official SDK integrations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(sentry): forward Railway environment as native sentry.environment

HTTP-log subscriptions could resubscribe many times per second. When a log subscription ended with a `complete` message, locomotive tore down the WebSocket and dialed a brand-new one (TLS handshake + connection_init/ack + subscribe) with zero delay, once per deployment and in parallel, producing a reconnect storm.

Changes:

- Rate-limit resubscribes: a constant 1s delay before every resubscribe attempt (including the first), so a subscription that ends immediately can no longer hot-loop.
- Reuse the socket on `complete`: send a fresh `subscribe` frame over the existing connection instead of redialing (no new TLS handshake or connection_init/ack). Falls back to a full redial only if the socket is actually dead. Read errors still redial, since the connection is broken.
- Cap concurrent subscription initializations with a global semaphore, and bound the handshake with the existing 10s timeout so a stuck open can't hold a slot. Established streams remain unlimited.
- Forward-tail on resubscribe: resume from the last-seen log timestamp via the subscription's `beforeDate` argument so each reconnect fetches only new logs instead of re-requesting the backlog window. Locomotive dedups by last-seen timestamp, so reconnects neither drop nor duplicate logs.
- Refactor all three subscriptions (http, env, canvas) onto a shared subscribe.Subscription that owns the connection and a payload provider, so resume state lives in the caller's closure rather than per-call arguments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The http, env, and canvas subscriptions each carried a near-identical read/redial/reuse/type-check loop. Move that policy into Subscription.Run, which reads messages and drives resubscribe (redial on connection error, reuse on a non-`next` message, skip malformed frames) and hands each `next` payload to a per-subscription callback.

- Centralizes the resubscribe policy in one place, so future backoff changes touch one function instead of three.
- Makes malformed-message handling uniform and resilient: env and canvas previously tore down and restarted the whole subscription on a single unparseable frame; now all three log and skip it.
- http: replace the activeDeployments membership polling with per-deployment context cancellation. SubscribeToHttpLogs tracks a deployment->cancel map (single-writer, no mutex) and cancels a deployment's context when it leaves the wanted set, so getHttpLogs simply exits on ctx.Done(). The function also derives a cancelable context for everything it starts (per-deployment goroutines, the deployment-changes subscription, the flush loop) and cancels it on return, fixing a goroutine leak across subscription retries.

No change to delivered logs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Codebase cleanup pass:

- Remove dead code: unused axiom/betterstack JSON-lines reconstructors, the sentry_attribute Builder and unused Value accessors, Dispatcher.QueueDepth, slice.Sync.Length/Clear.
- Replace deployment_changes.DeploymentIdWithInfo with uuid.UUID. CreatedAt was no longer read (the HTTP-log cursor now seeds from a fixed backlog window) and had become an accidental part of the struct's equality key; keying the tracked set on the ID alone removes that footgun. The add/remove diff is now a simple map-set comparison.
- Pass initTime as a parameter to getHttpLogs instead of smuggling it through a context value, dropping the magic key and its can't-happen error path.
- config: rename per-mode Headers to DefaultHeaders (a plain map, distinct from the user's AdditionalHeaders) and drop the empty-map boilerplate; replace the inline header-lookup closure with a helper and the magic len(attrs) > 2 checks with explicit flags; sort AdditionalHeaders.Keys() output; pass the mode's default headers into SendRawWebhook instead of reading the global mid-call.
- Misc: fix ReserverdAttributes typo, drop redundant parens, unalias sync in the slice package, guard ErrorsAttr against nil, and remove commented-out debug lines and a few restatement comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The periodic status line now shows only the processed counts at info level; the per-pipeline failed counts move into the debug-only block alongside the mem stats. A creeping "failed" number on normal output tends to alarm people, and real failures already surface via the dispatcher's warn/error logs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The host-validation self-match branch was missing the len(ExpectedHostContains) > 0 guard its sibling else branch has, so any mode without an expected-host list (otel_http, and the default json/jsonl) logged a spurious "possible webhook misconfiguration" on every startup with an empty expected_host_contains. Forwarding was never affected — this only gated the warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…scribe-hardening Stop resubscribe storms; harden and consolidate the subscription/pipeline layer

… retries

Replace the flat/constant retry intervals across the subscription and retry paths with one exponential-backoff-with-jitter primitive, so locomotive no longer resubscribes in a tight loop when the backend repeatedly ends a stream, and instead backs off (and de-syncs) under sustained failures.

- queue: add RetryBackoff (exponential, jittered, optional unlimited retries); remove RetryConstant and the now-unused RetryInterval param. Jitter is symmetric but capped at maxBackoff so the ceiling is never exceeded.
- subscribe: drive (re)subscribe through RetryBackoff. A stream that delivers data resets the backoff; one that ends without delivering anything grows it. Track socket reuse-vs-redial on the Conn itself; surface genuine connection-establishment failures at error level.
- apply the same backoff to the environment-data fetch and the pipeline's subscription-restart wrapper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Previously the first attempt ran immediately, so a caller that loops over RetryBackoff (re-invoking it after each success) could re-run fn with no delay — e.g. a subscription whose stream stays productive but ends immediately would resubscribe in the same second, never pacing. Move the backoff wait to the top of the loop so initialBackoff is a floor under every attempt; consecutive retries still escalate from there. Removes the need for a separate delay in the subscribe loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…env logs

beforeLimit is a per-poll ceiling (SQL LIMIT), so the backend returns at most that many of the newest logs in each poll window and drops any older tail. At 500 a deployment emitting more than ~125 logs/sec could silently lose the oldest logs in a window. Raise both http and environment payloads to 5000 (the backend's validated maximum), lifting the threshold ~10x. It only fetches more when a service actually produces more, so low-traffic streams are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
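
The ~10x figure follows from simple arithmetic, sketched here. The 4-second poll window is an inference from the commit's own numbers (500 rows at ~125 logs/sec), not something the commit states, and `maxSustainedRate` is a hypothetical helper.

```go
package main

import "fmt"

// maxSustainedRate is the highest log rate (logs/sec) a stream can sustain
// before one poll window holds more rows than limit.
func maxSustainedRate(limit int, windowSeconds float64) float64 {
	return float64(limit) / windowSeconds
}

func main() {
	// Assumed window: 500 rows / 125 logs per second = 4 seconds.
	const window = 4.0
	fmt.Println(maxSustainedRate(500, window), maxSustainedRate(5000, window))
}
```

Under that assumed window the loss threshold moves from about 125 to about 1250 logs/sec.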

feat(subscribe): exponential backoff + jitter for resubscribes and retries

…ubscribes

- Only open an HTTP-log subscription for services that have a domain; a domainless service receives no HTTP traffic, so it produces no HTTP logs. The set is recomputed on each environment change, so a domain added later is picked up on the next change/redeploy.
- Even-stagger the start of each deployment's HTTP-log stream across a short window so deployments coming up together don't issue requests in lockstep.
- Let HTTP-log resubscribes back off further (up to 130s) than other streams, which can take longer to become available again.
- Scrub backend-internal details from a few code comments.
- Document the domain requirement and its caveat in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The empty deployment set after the domain filter logged "initial deployment IDs received" with deployment_ids=null, which was misleading. Log a distinct info line ("no services with domains, skipping HTTP logs") with environment/service context instead, so it's clear why no HTTP logs are being tailed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ions feat(http-logs): tail only services with a domain, stagger + tune resubscribes

Brings in upstream's subscription reliability rewrite, most importantly a websocket ping/keepalive (30s interval, 10s timeout) that closes a half-open connection to force a resubscribe. The prior fork lacked any keepalive, so a half-open Railway subscription blocked the read forever: deployment discovery silently stalled and HTTP-log forwarding went dry with no error and no restart.

Re-applies the fork's native OTLP gRPC export (OTEL_ENABLED path) on top of the rewritten base:

- internal/otel: OTLP gRPC log exporter + record transforms (service.name from OTEL_SERVICE_NAME, railway.* attributes, status-derived severity).
- internal/pipeline: OTel pipeline variant that emits via the SDK instead of the webhook dispatcher, reusing the same subscription retry/backoff.
- config: OTEL_* vars; WEBHOOK_URL no longer required when OTEL_ENABLED=true.

Output behavior is unchanged from the deployed forwarder (same transport, same synthetic service.name), so the uptime pipeline's ServiceName filter is preserved.

brody192 and others added 30 commits February 12, 2026 14:13

fixes

6d395ec

Merge pull request brody192#7 from brody192/brody/small-fixes

ccdca55

Small fixes

feat: add otel_http mode

7a4e27b

Merge pull request brody192#6 from meistrari/main

cb9ade5

feat: add otel_http mode

Add service_namespace functionality

860253d

Address review feedback

9c9e41c

Scope service namespace to just Grafana & Otel

f179557

Per review feedback

Remove the test

3ebf916

YOLO!

Merge pull request brody192#8 from case/grafana-service-namespace

8856db5

Add service_namespace for Grafana

rename otel

e9c0bc3

Merge pull request brody192#9 from brody192/brody/rename-otel

19eb2d4

rename otel

rename folder

4ae8c60

Merge pull request brody192#10 from brody192/brody/rename-otel-folder

1641bab

Rename otel_http to otel

Container publish workflow - add arm64 support

c76dcbe

Merge pull request brody192#11 from case/add-arm64

a2158d2

Container publish workflow - add arm64 support

better ci

4770496

parallel builds

169d918

consolidate build logic

e388e05

Merge pull request brody192#12 from brody192/brody/release-ci

5966d14

Better CI

Update docker-build.yml

961824c

Merge pull request brody192#13 from brody192/brody/release-ci

f57f326

Update docker-build.yml

Update release.yml

775971d

Merge pull request brody192#14 from brody192/brody/release-ci

3ecc672

Update release.yml

first pass

efafb4e

errgroup context

76e68c0

bug fixes

032a03c

bug fixes and improvements

54705f4

Add beta image workflow

bce5065

Build and push a beta-tagged Docker image when a PR has the `beta` label. Rebuilds on each push while the label is present. Tags as `beta_<branch>`. Made-with: Cursor

Merge pull request brody192#15 from brody192/brody/beta-workflow

04b8741

Add beta image workflow

Consolidate duplicated subscription retry logic and replace metadata …

4fadd05

…magic strings with constants Made-with: Cursor

brody192 and others added 30 commits April 10, 2026 19:03

Merge branch 'main' into brody/queue

8166604

Merge pull request brody192#16 from brody192/brody/queue

6cdcd64

Add dispatch queue with retry/backoff, unified log pipeline, and reconstructor optimizations

docs: replace angle bracket placeholders with curly braces in README

5dbf3e0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'main' of https://github.com/brody192/locomotive

25ea278

feat: support victoria logs

bb024e3

Merge pull request brody192#20 from xLanStar/main

397a554

Add VictoriaLogs webhook mode

Merge pull request brody192#21 from brody192/brody/sentry-environment

5ec618c

feat(sentry): forward Railway environment as native sentry.environment

refactor: name the log-pipeline strings instead of passing literals

02cb3c9

Replace the "deploy-logs"/"http-logs" magic strings at the runLogPipeline call sites with pipelineDeployLogs/pipelineHTTPLogs constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refactor: name the pipeline sub-component name suffixes

61a1f93

Replace the "-webhook"/"-subscription" literals appended to a pipeline name with dispatcherNameSuffix/subscriptionNameSuffix constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clean funcs

0c6f12e

Merge pull request brody192#22 from brody192/brody/subscription-resub…

7bab457

…scribe-hardening Stop resubscribe storms; harden and consolidate the subscription/pipeline layer

chore(subscribe): include the subscription type in debug logs

8847945

Tag the shared resubscribe / connection-error debug lines with which subscription they came from (http / environment / environment_invalidation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Revert "chore(subscribe): include the subscription type in debug logs"

c26fb38

This reverts commit 8847945.

chore(subscribe): include the subscription type in debug logs

c10fe65

Tag the shared resubscribe / connection-error debug lines with which subscription they came from (http / environment / environment_invalidation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge pull request brody192#23 from brody192/brody/debug-log-subscrip…

f2db784

…tion-type Add the subscription type to resubscribe debug logs

better logging

1eefaab

Merge pull request brody192#24 from brody192/brody/retry-backoff

b9f1446

feat(subscribe): exponential backoff + jitter for resubscribes and retries

Merge pull request brody192#25 from brody192/brody/http-log-subscript…

aaefe1d

…ions feat(http-logs): tail only services with a domain, stagger + tune resubscribes

tnsardesai wants to merge 81 commits into rgarcia/otel-support from hypeship/upstream-sync-otel-pr

6 participants
