Skip to content

Sync to upstream (ws keepalive fix) + re-apply OTLP gRPC export#2

Draft
tnsardesai wants to merge 81 commits into
rgarcia/otel-supportfrom
hypeship/upstream-sync-otel-pr
Draft

Sync to upstream (ws keepalive fix) + re-apply OTLP gRPC export#2
tnsardesai wants to merge 81 commits into
rgarcia/otel-supportfrom
hypeship/upstream-sync-otel-pr

Conversation

@tnsardesai

Copy link
Copy Markdown

Why

The deployed forwarder (rgarcia/otel-support) branched from upstream before 2026-03-09 and is missing upstream's subscription-reliability rewrite. The gap has a concrete production impact: the fork's websocket layer has no keepalive. SafeConnRead calls conn.Read(ctx) with the process-lifetime context and no read deadline, and coder/websocket does not auto-ping. When a long-lived Railway subscription goes half-open (connection dies without a close frame reaching the client — exactly what an edge/network change causes), the read blocks forever with no error:

  • the canvas-invalidation subscription parks permanently → deployment discovery never re-queries → new deployments are never picked up;
  • the per-deployment HTTP-log reader stays pinned to a drained deployment ("waiting for cargo");
  • because the goroutine hangs rather than erroring, the process never exits and never restarts, so it never gets the startup re-seed that would recover it.

The result: a healthy-looking process forwarding nothing for days after an edge cutover, with no crash or error to signal it.

What

Sync rgarcia/otel-support up to upstream aaefe1d and re-apply the fork's OTLP export on top of the rewritten base.

Upstream fixes included (most important first):

  • Websocket ping keepalive (internal/railway/subscribe/helpers.go): 30s jittered ping / 10s timeout; a failed ping closes the connection to force a resubscribe. This is the direct fix for the stall above.
  • Subscription.Run refactor, exponential backoff + jitter on resubscribes, and domain-filtered deployment tailing.

Re-applied OTel support (adapted to the new base):

  • internal/otel/ — OTLP gRPC log exporter + record transforms (service.name from OTEL_SERVICE_NAME, railway.* attributes, status-derived severity).
  • internal/pipeline/otel.go — an OTel pipeline variant that emits via the SDK instead of the webhook dispatcher, reusing the same subscription retry/backoff.
  • internal/configOTEL_* vars; WEBHOOK_URL is no longer required when OTEL_ENABLED=true.

Behavior is unchanged for the deployed config

The port keeps the gRPC transport and the synthetic service.name from OTEL_SERVICE_NAME (rather than upstream's native otel_http mode, which sets service.name to the real Railway service name). That preserves the ServiceName value the uptime pipeline's ClickHouse query filters on — no downstream query change needed.

Merge note

This branch is a merge commit descending from rgarcia/otel-support with upstream/main as a second parent, so it fast-forward-merges cleanly and brings the full upstream history with it.

Test plan

  • go build ./..., go vet ./..., go test ./... all pass.
  • Deploy to a non-prod environment with the current OTEL_* vars and confirm http_logs_processed increments and rows land in ClickHouse under ServiceName='railway-http-logs'.
  • Verify recovery: force a resubscribe / deployment change and confirm the forwarder re-points instead of going quiet.

brody192 and others added 30 commits February 12, 2026 14:13
Container publish workflow - add arm64 support
Build and push a beta-tagged Docker image when a PR has the `beta` label.
Rebuilds on each push while the label is present. Tags as `beta_<branch>`.

Made-with: Cursor
…magic strings with constants

Made-with: Cursor
brody192 and others added 30 commits April 10, 2026 19:03
Add dispatch queue with retry/backoff, unified log pipeline, and reconstructor optimizations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add VictoriaLogs webhook mode
Map the Railway environment name onto Sentry's reserved
`sentry.environment` log attribute so logs are filterable by environment
in the Sentry UI, instead of only surfacing it as a generic
`_metadata__environment_name` attribute.

The reserved key is declared (empty) in the Item template alongside the
other `sentry.*` defaults and filled per-log from the
`environment_name` metadata for both environment and HTTP logs. The
backslash in the sjson path escapes the dot so the flat key
`sentry.environment` is addressed rather than a nested path.

Also add the reserved `sentry.origin` attribute (`auto.log.locomotive`,
a constant) to label the logs' source in Sentry, matching the
`auto.log.<integration>` convention used by the official SDK
integrations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(sentry): forward Railway environment as native sentry.environment
HTTP-log subscriptions could resubscribe many times per second. When a log
subscription ended with a `complete` message, locomotive tore down the
WebSocket and dialed a brand-new one (TLS handshake + connection_init/ack +
subscribe) with zero delay — once per deployment, in parallel — producing a
reconnect storm.

Changes:

- Rate-limit resubscribes: a constant 1s delay before every resubscribe
  attempt (including the first), so a subscription that ends immediately can
  no longer hot-loop.

- Reuse the socket on `complete`: send a fresh `subscribe` frame over the
  existing connection instead of redialing (no new TLS handshake or
  connection_init/ack). Falls back to a full redial only if the socket is
  actually dead. Read errors still redial, since the connection is broken.

- Cap concurrent subscription initializations with a global semaphore, and
  bound the handshake with the existing 10s timeout so a stuck open can't hold
  a slot. Established streams remain unlimited.

- Forward-tail on resubscribe: resume from the last-seen log timestamp via the
  subscription's `beforeDate` argument so each reconnect fetches only new logs
  instead of re-requesting the backlog window. Locomotive dedups by last-seen
  timestamp, so reconnects neither drop nor duplicate logs.

- Refactor all three subscriptions (http, env, canvas) onto a shared
  subscribe.Subscription that owns the connection and a payload provider, so
  resume state lives in the caller's closure rather than per-call arguments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The http, env, and canvas subscriptions each carried a near-identical
read/redial/reuse/type-check loop. Move that policy into Subscription.Run,
which reads messages and drives resubscribe (redial on connection error, reuse
on a non-`next` message, skip malformed frames) and hands each `next` payload
to a per-subscription callback.

- Centralizes the resubscribe policy in one place, so future backoff changes
  touch one function instead of three.

- Makes malformed-message handling uniform and resilient: env and canvas
  previously tore down and restarted the whole subscription on a single
  unparseable frame; now all three log and skip it.

- http: replace the activeDeployments membership polling with per-deployment
  context cancellation. SubscribeToHttpLogs tracks a deployment->cancel map
  (single-writer, no mutex) and cancels a deployment's context when it leaves
  the wanted set, so getHttpLogs simply exits on ctx.Done(). The function also
  derives a cancelable context for everything it starts (per-deployment
  goroutines, the deployment-changes subscription, the flush loop) and cancels
  it on return, fixing a goroutine leak across subscription retries.

No change to delivered logs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the "deploy-logs"/"http-logs" magic strings at the runLogPipeline call
sites with pipelineDeployLogs/pipelineHTTPLogs constants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the "-webhook"/"-subscription" literals appended to a pipeline name
with dispatcherNameSuffix/subscriptionNameSuffix constants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codebase cleanup pass:

- Remove dead code: unused axiom/betterstack JSON-lines reconstructors, the
  sentry_attribute Builder and unused Value accessors, Dispatcher.QueueDepth,
  slice.Sync.Length/Clear.

- Replace deployment_changes.DeploymentIdWithInfo with uuid.UUID. CreatedAt was
  no longer read (the HTTP-log cursor now seeds from a fixed backlog window) and
  had become an accidental part of the struct's equality key; keying the tracked
  set on the ID alone removes that footgun. The add/remove diff is now a simple
  map-set comparison.

- Pass initTime as a parameter to getHttpLogs instead of smuggling it through a
  context value, dropping the magic key and its can't-happen error path.

- config: rename per-mode Headers to DefaultHeaders (a plain map, distinct from
  the user's AdditionalHeaders) and drop the empty-map boilerplate; replace the
  inline header-lookup closure with a helper and the magic len(attrs) > 2 checks
  with explicit flags; sort AdditionalHeaders.Keys() output; pass the mode's
  default headers into SendRawWebhook instead of reading the global mid-call.

- Misc: fix ReserverdAttributes typo, drop redundant parens, unalias sync in the
  slice package, guard ErrorsAttr against nil, and remove commented-out debug
  lines and a few restatement comments.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The periodic status line now shows only the processed counts at info level; the
per-pipeline failed counts move into the debug-only block alongside the mem
stats. A creeping "failed" number on normal output tends to alarm people, and
real failures already surface via the dispatcher's warn/error logs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The host-validation self-match branch was missing the len(ExpectedHostContains)
> 0 guard its sibling else branch has, so any mode without an expected-host list
(otel_http, and the default json/jsonl) logged a spurious "possible webhook
misconfiguration" on every startup with an empty expected_host_contains. Forwarding
was never affected — this only gated the warning.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scribe-hardening

Stop resubscribe storms; harden and consolidate the subscription/pipeline layer
Tag the shared resubscribe / connection-error debug lines with which
subscription they came from (http / environment / environment_invalidation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tag the shared resubscribe / connection-error debug lines with which
subscription they came from (http / environment / environment_invalidation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion-type

Add the subscription type to resubscribe debug logs
… retries

Replace the flat/constant retry intervals across the subscription and retry
paths with one exponential-backoff-with-jitter primitive, so locomotive no
longer resubscribes in a tight loop when the backend repeatedly ends a stream,
and instead backs off (and de-syncs) under sustained failures.

- queue: add RetryBackoff (exponential, jittered, optional unlimited retries);
  remove RetryConstant and the now-unused RetryInterval param. Jitter is
  symmetric but capped at maxBackoff so the ceiling is never exceeded.
- subscribe: drive (re)subscribe through RetryBackoff. A stream that delivers
  data resets the backoff; one that ends without delivering anything grows it.
  Track socket reuse-vs-redial on the Conn itself; surface genuine
  connection-establishment failures at error level.
- apply the same backoff to the environment-data fetch and the pipeline's
  subscription-restart wrapper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously the first attempt ran immediately, so a caller that loops over
RetryBackoff (re-invoking it after each success) could re-run fn with no delay
— e.g. a subscription whose stream stays productive but ends immediately would
resubscribe in the same second, never pacing. Move the backoff wait to the top
of the loop so initialBackoff is a floor under every attempt; consecutive
retries still escalate from there. Removes the need for a separate delay in the
subscribe loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…env logs

beforeLimit is a per-poll ceiling (SQL LIMIT), so the backend returns at most
that many of the newest logs in each poll window and drops any older tail. At
500 a deployment emitting more than ~125 logs/sec could silently lose the
oldest logs in a window. Raise both http and environment payloads to 5000 (the
backend's validated maximum), lifting the threshold ~10x. It only fetches more
when a service actually produces more, so low-traffic streams are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(subscribe): exponential backoff + jitter for resubscribes and retries
…ubscribes

- Only open an HTTP-log subscription for services that have a domain — a
  domainless service receives no HTTP traffic, so it produces no HTTP logs. The
  set is recomputed on each environment change, so a domain added later is
  picked up on the next change/redeploy.
- Even-stagger the start of each deployment's HTTP-log stream across a short
  window so deployments coming up together don't issue requests in lockstep.
- Let HTTP-log resubscribes back off further (up to 130s) than other streams,
  which can take longer to become available again.
- Scrub backend-internal details from a few code comments.
- Document the domain requirement and its caveat in the README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The empty deployment set after the domain filter logged "initial deployment IDs
received" with deployment_ids=null, which was misleading. Log a distinct info
line ("no services with domains, skipping HTTP logs") with environment/service
context instead, so it's clear why no HTTP logs are being tailed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions

feat(http-logs): tail only services with a domain, stagger + tune resubscribes
Brings in upstream's subscription reliability rewrite — most importantly a
websocket ping/keepalive (30s interval, 10s timeout) that closes a half-open
connection to force a resubscribe. The prior fork lacked any keepalive, so a
half-open Railway subscription blocked the read forever: deployment discovery
silently stalled and HTTP-log forwarding went dry with no error and no restart.

Re-applies the fork's native OTLP gRPC export (OTEL_ENABLED path) on top of the
rewritten base:
- internal/otel: OTLP gRPC log exporter + record transforms (service.name from
  OTEL_SERVICE_NAME, railway.* attributes, status-derived severity).
- internal/pipeline: OTel pipeline variant that emits via the SDK instead of the
  webhook dispatcher, reusing the same subscription retry/backoff.
- config: OTEL_* vars; WEBHOOK_URL no longer required when OTEL_ENABLED=true.

Output behavior is unchanged from the deployed forwarder (same transport, same
synthetic service.name), so the uptime pipeline's ServiceName filter is preserved.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants