Sync to upstream (ws keepalive fix) + re-apply OTLP gRPC export#2
Draft
tnsardesai wants to merge 81 commits into
Draft
Sync to upstream (ws keepalive fix) + re-apply OTLP gRPC export#2tnsardesai wants to merge 81 commits into
tnsardesai wants to merge 81 commits into
Conversation
feat: add otel_http mode
Per review feedback
YOLO!
Add service_namespace for Grafana
Rename otel_http to otel
Container publish workflow - add arm64 support
Update docker-build.yml
Update release.yml
Build and push a beta-tagged Docker image when a PR has the `beta` label. Rebuilds on each push while the label is present. Tags as `beta_<branch>`. Made-with: Cursor
Add beta image workflow
…magic strings with constants Made-with: Cursor
Add dispatch queue with retry/backoff, unified log pipeline, and reconstructor optimizations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add VictoriaLogs webhook mode
Map the Railway environment name onto Sentry's reserved `sentry.environment` log attribute so logs are filterable by environment in the Sentry UI, instead of only surfacing it as a generic `_metadata__environment_name` attribute. The reserved key is declared (empty) in the Item template alongside the other `sentry.*` defaults and filled per-log from the `environment_name` metadata for both environment and HTTP logs. The backslash in the sjson path escapes the dot so the flat key `sentry.environment` is addressed rather than a nested path. Also add the reserved `sentry.origin` attribute (`auto.log.locomotive`, a constant) to label the logs' source in Sentry, matching the `auto.log.<integration>` convention used by the official SDK integrations. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(sentry): forward Railway environment as native sentry.environment
HTTP-log subscriptions could resubscribe many times per second. When a log subscription ended with a `complete` message, locomotive tore down the WebSocket and dialed a brand-new one (TLS handshake + connection_init/ack + subscribe) with zero delay — once per deployment, in parallel — producing a reconnect storm. Changes: - Rate-limit resubscribes: a constant 1s delay before every resubscribe attempt (including the first), so a subscription that ends immediately can no longer hot-loop. - Reuse the socket on `complete`: send a fresh `subscribe` frame over the existing connection instead of redialing (no new TLS handshake or connection_init/ack). Falls back to a full redial only if the socket is actually dead. Read errors still redial, since the connection is broken. - Cap concurrent subscription initializations with a global semaphore, and bound the handshake with the existing 10s timeout so a stuck open can't hold a slot. Established streams remain unlimited. - Forward-tail on resubscribe: resume from the last-seen log timestamp via the subscription's `beforeDate` argument so each reconnect fetches only new logs instead of re-requesting the backlog window. Locomotive dedups by last-seen timestamp, so reconnects neither drop nor duplicate logs. - Refactor all three subscriptions (http, env, canvas) onto a shared subscribe.Subscription that owns the connection and a payload provider, so resume state lives in the caller's closure rather than per-call arguments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The http, env, and canvas subscriptions each carried a near-identical read/redial/reuse/type-check loop. Move that policy into Subscription.Run, which reads messages and drives resubscribe (redial on connection error, reuse on a non-`next` message, skip malformed frames) and hands each `next` payload to a per-subscription callback. - Centralizes the resubscribe policy in one place, so future backoff changes touch one function instead of three. - Makes malformed-message handling uniform and resilient: env and canvas previously tore down and restarted the whole subscription on a single unparseable frame; now all three log and skip it. - http: replace the activeDeployments membership polling with per-deployment context cancellation. SubscribeToHttpLogs tracks a deployment->cancel map (single-writer, no mutex) and cancels a deployment's context when it leaves the wanted set, so getHttpLogs simply exits on ctx.Done(). The function also derives a cancelable context for everything it starts (per-deployment goroutines, the deployment-changes subscription, the flush loop) and cancels it on return, fixing a goroutine leak across subscription retries. No change to delivered logs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the "deploy-logs"/"http-logs" magic strings at the runLogPipeline call sites with pipelineDeployLogs/pipelineHTTPLogs constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the "-webhook"/"-subscription" literals appended to a pipeline name with dispatcherNameSuffix/subscriptionNameSuffix constants. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codebase cleanup pass: - Remove dead code: unused axiom/betterstack JSON-lines reconstructors, the sentry_attribute Builder and unused Value accessors, Dispatcher.QueueDepth, slice.Sync.Length/Clear. - Replace deployment_changes.DeploymentIdWithInfo with uuid.UUID. CreatedAt was no longer read (the HTTP-log cursor now seeds from a fixed backlog window) and had become an accidental part of the struct's equality key; keying the tracked set on the ID alone removes that footgun. The add/remove diff is now a simple map-set comparison. - Pass initTime as a parameter to getHttpLogs instead of smuggling it through a context value, dropping the magic key and its can't-happen error path. - config: rename per-mode Headers to DefaultHeaders (a plain map, distinct from the user's AdditionalHeaders) and drop the empty-map boilerplate; replace the inline header-lookup closure with a helper and the magic len(attrs) > 2 checks with explicit flags; sort AdditionalHeaders.Keys() output; pass the mode's default headers into SendRawWebhook instead of reading the global mid-call. - Misc: fix ReserverdAttributes typo, drop redundant parens, unalias sync in the slice package, guard ErrorsAttr against nil, and remove commented-out debug lines and a few restatement comments. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The periodic status line now shows only the processed counts at info level; the per-pipeline failed counts move into the debug-only block alongside the mem stats. A creeping "failed" number on normal output tends to alarm people, and real failures already surface via the dispatcher's warn/error logs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The host-validation self-match branch was missing the len(ExpectedHostContains) > 0 guard its sibling else branch has, so any mode without an expected-host list (otel_http, and the default json/jsonl) logged a spurious "possible webhook misconfiguration" on every startup with an empty expected_host_contains. Forwarding was never affected — this only gated the warning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scribe-hardening Stop resubscribe storms; harden and consolidate the subscription/pipeline layer
Tag the shared resubscribe / connection-error debug lines with which subscription they came from (http / environment / environment_invalidation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This reverts commit 8847945.
Tag the shared resubscribe / connection-error debug lines with which subscription they came from (http / environment / environment_invalidation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion-type Add the subscription type to resubscribe debug logs
… retries Replace the flat/constant retry intervals across the subscription and retry paths with one exponential-backoff-with-jitter primitive, so locomotive no longer resubscribes in a tight loop when the backend repeatedly ends a stream, and instead backs off (and de-syncs) under sustained failures. - queue: add RetryBackoff (exponential, jittered, optional unlimited retries); remove RetryConstant and the now-unused RetryInterval param. Jitter is symmetric but capped at maxBackoff so the ceiling is never exceeded. - subscribe: drive (re)subscribe through RetryBackoff. A stream that delivers data resets the backoff; one that ends without delivering anything grows it. Track socket reuse-vs-redial on the Conn itself; surface genuine connection-establishment failures at error level. - apply the same backoff to the environment-data fetch and the pipeline's subscription-restart wrapper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Previously the first attempt ran immediately, so a caller that loops over RetryBackoff (re-invoking it after each success) could re-run fn with no delay — e.g. a subscription whose stream stays productive but ends immediately would resubscribe in the same second, never pacing. Move the backoff wait to the top of the loop so initialBackoff is a floor under every attempt; consecutive retries still escalate from there. Removes the need for a separate delay in the subscribe loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…env logs beforeLimit is a per-poll ceiling (SQL LIMIT), so the backend returns at most that many of the newest logs in each poll window and drops any older tail. At 500 a deployment emitting more than ~125 logs/sec could silently lose the oldest logs in a window. Raise both http and environment payloads to 5000 (the backend's validated maximum), lifting the threshold ~10x. It only fetches more when a service actually produces more, so low-traffic streams are unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(subscribe): exponential backoff + jitter for resubscribes and retries
…ubscribes - Only open an HTTP-log subscription for services that have a domain — a domainless service receives no HTTP traffic, so it produces no HTTP logs. The set is recomputed on each environment change, so a domain added later is picked up on the next change/redeploy. - Even-stagger the start of each deployment's HTTP-log stream across a short window so deployments coming up together don't issue requests in lockstep. - Let HTTP-log resubscribes back off further (up to 130s) than other streams, which can take longer to become available again. - Scrub backend-internal details from a few code comments. - Document the domain requirement and its caveat in the README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The empty deployment set after the domain filter logged "initial deployment IDs
received" with deployment_ids=null, which was misleading. Log a distinct info
line ("no services with domains, skipping HTTP logs") with environment/service
context instead, so it's clear why no HTTP logs are being tailed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ions feat(http-logs): tail only services with a domain, stagger + tune resubscribes
Brings in upstream's subscription reliability rewrite — most importantly a websocket ping/keepalive (30s interval, 10s timeout) that closes a half-open connection to force a resubscribe. The prior fork lacked any keepalive, so a half-open Railway subscription blocked the read forever: deployment discovery silently stalled and HTTP-log forwarding went dry with no error and no restart. Re-applies the fork's native OTLP gRPC export (OTEL_ENABLED path) on top of the rewritten base: - internal/otel: OTLP gRPC log exporter + record transforms (service.name from OTEL_SERVICE_NAME, railway.* attributes, status-derived severity). - internal/pipeline: OTel pipeline variant that emits via the SDK instead of the webhook dispatcher, reusing the same subscription retry/backoff. - config: OTEL_* vars; WEBHOOK_URL no longer required when OTEL_ENABLED=true. Output behavior is unchanged from the deployed forwarder (same transport, same synthetic service.name), so the uptime pipeline's ServiceName filter is preserved.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The deployed forwarder (
rgarcia/otel-support) branched from upstream before 2026-03-09 and is missing upstream's subscription-reliability rewrite. The gap has a concrete production impact: the fork's websocket layer has no keepalive.SafeConnReadcallsconn.Read(ctx)with the process-lifetime context and no read deadline, andcoder/websocketdoes not auto-ping. When a long-lived Railway subscription goes half-open (connection dies without a close frame reaching the client — exactly what an edge/network change causes), the read blocks forever with no error:The result: a healthy-looking process forwarding nothing for days after an edge cutover, with no crash or error to signal it.
What
Sync
rgarcia/otel-supportup to upstreamaaefe1dand re-apply the fork's OTLP export on top of the rewritten base.Upstream fixes included (most important first):
internal/railway/subscribe/helpers.go): 30s jittered ping / 10s timeout; a failed ping closes the connection to force a resubscribe. This is the direct fix for the stall above.Subscription.Runrefactor, exponential backoff + jitter on resubscribes, and domain-filtered deployment tailing.Re-applied OTel support (adapted to the new base):
internal/otel/— OTLP gRPC log exporter + record transforms (service.namefromOTEL_SERVICE_NAME,railway.*attributes, status-derived severity).internal/pipeline/otel.go— an OTel pipeline variant that emits via the SDK instead of the webhook dispatcher, reusing the same subscription retry/backoff.internal/config—OTEL_*vars;WEBHOOK_URLis no longer required whenOTEL_ENABLED=true.Behavior is unchanged for the deployed config
The port keeps the gRPC transport and the synthetic
service.namefromOTEL_SERVICE_NAME(rather than upstream's nativeotel_httpmode, which setsservice.nameto the real Railway service name). That preserves theServiceNamevalue the uptime pipeline's ClickHouse query filters on — no downstream query change needed.Merge note
This branch is a merge commit descending from
rgarcia/otel-supportwithupstream/mainas a second parent, so it fast-forward-merges cleanly and brings the full upstream history with it.Test plan
go build ./...,go vet ./...,go test ./...all pass.OTEL_*vars and confirmhttp_logs_processedincrements and rows land in ClickHouse underServiceName='railway-http-logs'.