
feat: inject W3C trace context into TrainJob annotations #447

Open
Rajneesh180 wants to merge 2 commits into kubeflow:main from Rajneesh180:feat/inject-w3c-trace-context


@Rajneesh180

Addresses #446.

When opentelemetry-api is installed, train() now injects the active W3C trace context (traceparent, tracestate) into TrainJob annotations under the opentelemetry.io/ prefix. This lets traces started by the SDK propagate to the training controller and worker pods.

The implementation uses a lazy import with a fallback — when the package isn't installed (the common case today), the function is a no-op that returns annotations unchanged with zero overhead. The opentelemetry.io/ annotation prefix follows the same convention Tekton Pipelines uses for CRD-level context propagation.

Changes:

  • inject_trace_context() in utils.py — uses the global OTel propagator to write trace headers into a carrier dict, then merges them as prefixed annotations
  • One-line call in KubernetesBackend.train() right before TrainJob construction
  • 6 unit tests covering all branches (no OTel, no active context, injection, preservation of existing annotations)
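For illustration, the annotations this change produces might look like the following. This is a minimal sketch, not the PR's code: `to_annotations` and `TRACEPARENT_RE` are hypothetical names, the prefix is the one named above, and the sample header values are the ones used as examples in the W3C Trace Context specification.

```python
import re

# Prefix named in the PR description.
PREFIX = "opentelemetry.io/"

# Version "00" traceparent layout: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def to_annotations(carrier):
    """Hypothetical helper: prefix each propagated header name."""
    return {f"{PREFIX}{name}": value for name, value in carrier.items()}


# Carrier as a propagator might fill it when a span is active.
carrier = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "congo=t61rcWkgMzE",
}
annotations = to_annotations(carrier)
```

Under these assumptions the TrainJob would carry two extra annotation keys, `opentelemetry.io/traceparent` and `opentelemetry.io/tracestate`, alongside any user-supplied ones.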

No new required dependencies. A follow-up could add opentelemetry-api as an optional extra (pip install kubeflow[telemetry]).

Relates to #164

Copilot AI review requested due to automatic review settings April 5, 2026 12:01
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

Adds optional OpenTelemetry W3C trace-context propagation from the SDK into Kubernetes TrainJob CRD annotations so downstream controller/pods can continue the trace.

Changes:

  • Introduces inject_trace_context() utility to inject trace headers into annotations under the opentelemetry.io/ prefix.
  • Calls inject_trace_context() in KubernetesBackend.train() before TrainJob construction.
  • Adds unit tests covering injection and no-op paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| kubeflow/trainer/backends/kubernetes/utils.py | Adds annotation injection helper and constant prefix for trace context propagation. |
| kubeflow/trainer/backends/kubernetes/backend.py | Wires trace context injection into TrainJob creation flow. |
| kubeflow/trainer/backends/kubernetes/utils_test.py | Adds tests for the new injection helper across multiple branches. |

Comment on lines +670 to +692
    Uses the global OTel propagator to inject ``traceparent`` / ``tracestate``
    headers into a carrier dict, then merges them into *annotations* under the
    ``opentelemetry.io/`` key prefix.

    Returns *annotations* unchanged when the ``opentelemetry`` package is not
    installed or when no active span context exists.
    """
    try:
        from opentelemetry.propagate import inject
    except ImportError:
        return annotations

    carrier: dict[str, str] = {}
    inject(carrier)

    if not carrier:
        return annotations

    if annotations is None:
        annotations = {}

    for key, value in carrier.items():
        annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value
Copilot AI Apr 5, 2026

inject_trace_context() merges every key produced by the global OTel propagator, so it can add non-trace headers like baggage (and potentially large/PII values) even though the docstring says only W3C traceparent/tracestate are injected; consider filtering to just those keys or using the trace-context propagator explicitly to avoid leaking/oversizing Kubernetes annotations.
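Both options mentioned in this comment can be sketched as follows. This is a hedged illustration rather than the PR's code: `filter_w3c_keys` and `inject_w3c_only` are hypothetical names, and the prefix constant is assumed to match the one referenced in the diff. `TraceContextTextMapPropagator` is the W3C-only propagator shipped in `opentelemetry-api`.

```python
from typing import Optional

# Assumption: mirrors the prefix constant referenced in the diff above.
_TRACE_ANNOTATION_PREFIX = "opentelemetry.io/"
# Header names defined by the W3C Trace Context specification.
_W3C_TRACE_KEYS = ("traceparent", "tracestate")


def filter_w3c_keys(carrier: dict) -> dict:
    """Option 1: allowlist, dropping baggage and any other injected keys."""
    return {key: carrier[key] for key in _W3C_TRACE_KEYS if key in carrier}


def inject_w3c_only(annotations: Optional[dict]) -> Optional[dict]:
    """Option 2: bypass the global propagator and use the W3C one directly."""
    try:
        from opentelemetry.trace.propagation.tracecontext import (
            TraceContextTextMapPropagator,
        )
    except ImportError:
        # Package absent: behave as a no-op, as the PR describes.
        return annotations

    carrier: dict = {}
    TraceContextTextMapPropagator().inject(carrier)
    if not carrier:
        # No active (valid) span context, so there is nothing to add.
        return annotations

    if annotations is None:
        annotations = {}
    for key, value in carrier.items():
        annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value
    return annotations
```

The allowlist keeps the global propagator configuration in play, while the explicit propagator never produces non-W3C keys in the first place; which trade-off is preferable is a design decision for the maintainers.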

        annotations = {}

    for key, value in carrier.items():
        annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value
Copilot AI Apr 5, 2026

The assignment annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value will overwrite user-provided opentelemetry.io/* annotations (e.g., an explicit opentelemetry.io/traceparent), which contradicts the intent to preserve existing annotations; use a non-overwriting merge (e.g., only set when the key is absent) or document/rename the behavior.

Suggested change
-        annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value
+        annotation_key = f"{_TRACE_ANNOTATION_PREFIX}{key}"
+        if annotation_key not in annotations:
+            annotations[annotation_key] = value

Comment on lines +868 to +877
class TestInjectTraceContext:
    def test_passthrough_without_otel(self):
        """Annotations returned unchanged when opentelemetry is not installed."""
        existing = {"user-key": "user-value"}
        result = utils.inject_trace_context(existing)
        assert result is existing

    def test_none_passthrough_without_otel(self):
        """None returned when annotations is None and opentelemetry is absent."""
        assert utils.inject_trace_context(None) is None
Copilot AI Apr 5, 2026

test_passthrough_without_otel / test_none_passthrough_without_otel depend on opentelemetry not being installed in the test environment, which makes the tests brittle for contributors; patch the import (or sys.modules) to force the ImportError branch so these tests are deterministic.
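One way to force the ImportError branch regardless of the environment is to place `None` entries in `sys.modules`, which makes the import system refuse to load those names. The sketch below is illustrative only: `otel_inject_available` and `probe_with_otel_blocked` are hypothetical stand-ins for the helper's lazy import and a test body.

```python
import sys
from unittest.mock import patch


def otel_inject_available() -> bool:
    """Hypothetical probe that mirrors the helper's lazy import."""
    try:
        from opentelemetry.propagate import inject  # noqa: F401
    except ImportError:
        return False
    return True


def probe_with_otel_blocked() -> bool:
    """Run the probe while opentelemetry is made unimportable."""
    # A None entry in sys.modules makes `import` raise ModuleNotFoundError
    # (a subclass of ImportError), even if the package is installed.
    blocked = {"opentelemetry": None, "opentelemetry.propagate": None}
    with patch.dict(sys.modules, blocked):
        return otel_inject_available()
```

`patch.dict` restores `sys.modules` on exit, so other tests still see the real package if it is installed.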

Comment on lines +917 to +928
    def test_preserves_existing_annotations(self):
        """User-supplied annotations are not overwritten."""
        mock_mod = _mock_otel_propagate({"traceparent": SAMPLE_TRACEPARENT})
        original = {"team": "ml-platform", "created-by": "sdk"}
        with patch.dict(
            "sys.modules",
            {"opentelemetry": MagicMock(), "opentelemetry.propagate": mock_mod},
        ):
            result = utils.inject_trace_context(original)
        assert result["team"] == "ml-platform"
        assert result["created-by"] == "sdk"
        assert result["opentelemetry.io/traceparent"] == SAMPLE_TRACEPARENT
Copilot AI Apr 5, 2026

test_preserves_existing_annotations doesn't cover the actual overwrite risk for trace annotations (it only checks unrelated keys); add a case where the input already contains opentelemetry.io/traceparent/tracestate and assert the function does not replace them (or update the test/name to match intended behavior).
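A test for the overwrite case could look roughly like this. It is a self-contained sketch under stated assumptions: the helper here is a simplified stand-in that uses a non-overwriting merge (`setdefault`), and `fake_propagate` and `run_with_fake_otel` are hypothetical names, not the PR's `_mock_otel_propagate`.

```python
import sys
import types
from unittest.mock import patch

PREFIX = "opentelemetry.io/"  # assumed to match the PR's prefix constant


def inject_trace_context(annotations):
    """Simplified stand-in for the helper, with a non-overwriting merge."""
    try:
        from opentelemetry.propagate import inject
    except ImportError:
        return annotations
    carrier = {}
    inject(carrier)
    if not carrier:
        return annotations
    if annotations is None:
        annotations = {}
    for key, value in carrier.items():
        # setdefault leaves a user-supplied value in place.
        annotations.setdefault(f"{PREFIX}{key}", value)
    return annotations


def fake_propagate(headers):
    """Build a fake opentelemetry.propagate whose inject() writes headers."""
    module = types.ModuleType("opentelemetry.propagate")
    module.inject = lambda carrier: carrier.update(headers)
    return module


def run_with_fake_otel(headers, annotations):
    fake = fake_propagate(headers)
    parent = types.ModuleType("opentelemetry")
    parent.propagate = fake
    modules = {"opentelemetry": parent, "opentelemetry.propagate": fake}
    with patch.dict(sys.modules, modules):
        return inject_trace_context(annotations)
```

Calling `run_with_fake_otel` with an input that already contains `opentelemetry.io/traceparent` exercises exactly the branch this comment asks about.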

@Rajneesh180
Author

Opened this as a concrete follow-up to #446 since that issue has been sitting without any design feedback for a while now.

The core question is really about the annotation key format — I went with opentelemetry.io/traceparent and opentelemetry.io/tracestate to stay consistent with what Tekton Pipelines does for CRD-level propagation, but if there's a preference for something else (env vars on the trainer container, labels, different prefix) I can rework it quickly.

@andreyvelich since you scoped the observability work in #164 and have context on where the SDK-controller boundary should be drawn for tracing — would appreciate your eyes on whether this annotation-based approach makes sense as the propagation mechanism, or if there's a different direction you had in mind. The implementation is intentionally minimal (lazy import, no new deps) so it's easy to iterate on.

Also worth noting this is complementary to #401 — that PR handles SDK-internal spans while this one handles the cross-boundary propagation into the CRD itself.

@Rajneesh180 force-pushed the feat/inject-w3c-trace-context branch from ee38abd to 14f9d10 on April 5, 2026 12:35
When opentelemetry-api is installed, inject the active W3C trace
context (traceparent, tracestate) into TrainJob CRD annotations
under the opentelemetry.io/ prefix before creation.

Uses the global OTel propagator via lazy import — zero overhead
when the package is absent. Follows the same annotation convention
used by Tekton Pipelines for CRD-level context propagation.

Relates to kubeflow#446, kubeflow#164

Signed-off-by: Rajneesh Chaudhary <rajneeshrehsaan48@gmail.com>
- Only inject traceparent/tracestate into annotations, skip baggage
  and other propagator-injected keys that could carry PII or
  unbounded values into K8s annotations.
- Don't overwrite existing opentelemetry.io/* annotations — if the
  user explicitly set them, respect that.
- Mock sys.modules in no-otel tests so they're deterministic
  regardless of whether opentelemetry is installed in the env.
- Add tests for overwrite protection and baggage filtering.

Signed-off-by: Rajneesh Chaudhary <rajneeshrehsaan48@gmail.com>
@Rajneesh180 force-pushed the feat/inject-w3c-trace-context branch from 14f9d10 to 308716c on April 15, 2026 07:41
@Rajneesh180
Author

@andreyvelich Quick ping — this is the SDK-side complement to the OTel integration in #164. The approach is intentionally minimal (lazy import, no new required deps). Would appreciate your take on whether annotations are the right propagation mechanism before I invest more time refining the implementation. Happy to adjust the approach based on your feedback.

@andreyvelich
Member

Sorry for the late reply @Rajneesh180!
@kramaranya @astefanutti @abhijeet-dhumal @Fiona-Waters @szaher @dhanishaphadate @XploY04 Would you be able to check this OpenTelemetry integration since you plan to propose this as a GSoC project?
I know we have an open KEP for that too: #382

@XploY04
Contributor

XploY04 commented May 13, 2026

Sure @andreyvelich I will review it.

@XploY04
Contributor

XploY04 commented May 15, 2026

@andreyvelich should we discuss the KEP first before moving forward with this PR? This implementation is quite different from my KEP (#382), and I think we should align on the design before we lock in the code.

A few differences I noticed:

  • The KEP propagates TRACEPARENT and TRACESTATE by injecting them as environment variables into pods, so the training code can pick up the parent span. This PR adds them as CRD annotations, which helps with controller-side correlation but does not pass the context into the training containers.
  • The KEP uses TraceContextTextMapPropagator directly, so propagation is W3C-only. This PR uses the global composite propagator through opentelemetry.propagate.inject, then filters the result with an allowlist. That works, but it feels like a workaround.
  • The KEP puts shared telemetry code in kubeflow/common/telemetry/propagation.py, so other backends can reuse it. This PR puts the helper in kubeflow/trainer/backends/kubernetes/utils.py, which makes it Kubernetes-specific.
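To make the first difference concrete, the environment-variable route might be sketched as below. Only the variable names `TRACEPARENT` and `TRACESTATE` come from this comment; `carrier_to_env` and `env_to_carrier` are hypothetical helpers, and how the KEP actually builds the pod spec is not shown in this thread.

```python
# Header-to-env-var mapping; the env var names are the ones named above.
_ENV_NAMES = {"traceparent": "TRACEPARENT", "tracestate": "TRACESTATE"}


def carrier_to_env(carrier):
    """SDK side: turn W3C headers into Kubernetes-style env entries."""
    return [
        {"name": env_name, "value": carrier[header]}
        for header, env_name in _ENV_NAMES.items()
        if header in carrier
    ]


def env_to_carrier(environ):
    """Worker side: rebuild a carrier that a propagator could extract from."""
    return {
        header: environ[env_name]
        for header, env_name in _ENV_NAMES.items()
        if env_name in environ
    }
```

In a training container, the result of `env_to_carrier(os.environ)` could then be handed to a W3C propagator's `extract` call to resume the parent span, which is the step that CRD annotations alone do not provide.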

My concern is that if we merge this shape now, we may need to rework the propagation code once the KEP lands. I think it would be better to agree on the KEP design first, then make the PR follow that structure. What do you think?

@andreyvelich
Member

Yes, let's first discuss the implementation in the KEP.

I added this item to the next Kubeflow SDK call: https://docs.google.com/document/d/1jH2WAX2ePxOfI4JuiVK9nPlesDMiyg67xzLwhpR7wTQ/edit?tab=t.0

cc @kramaranya @Fiona-Waters
