feat: inject W3C trace context into TrainJob annotations#447
feat: inject W3C trace context into TrainJob annotations#447Rajneesh180 wants to merge 2 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Adds optional OpenTelemetry W3C trace-context propagation from the SDK into Kubernetes TrainJob CRD annotations so downstream controller/pods can continue the trace.
Changes:
- Introduces
inject_trace_context()utility to inject trace headers into annotations under theopentelemetry.io/prefix. - Calls
inject_trace_context()inKubernetesBackend.train()before TrainJob construction. - Adds unit tests covering injection and no-op paths.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
kubeflow/trainer/backends/kubernetes/utils.py |
Adds annotation injection helper and constant prefix for trace context propagation. |
kubeflow/trainer/backends/kubernetes/backend.py |
Wires trace context injection into TrainJob creation flow. |
kubeflow/trainer/backends/kubernetes/utils_test.py |
Adds tests for the new injection helper across multiple branches. |
| Uses the global OTel propagator to inject ``traceparent`` / ``tracestate`` | ||
| headers into a carrier dict, then merges them into *annotations* under the | ||
| ``opentelemetry.io/`` key prefix. | ||
|
|
||
| Returns *annotations* unchanged when the ``opentelemetry`` package is not | ||
| installed or when no active span context exists. | ||
| """ | ||
| try: | ||
| from opentelemetry.propagate import inject | ||
| except ImportError: | ||
| return annotations | ||
|
|
||
| carrier: dict[str, str] = {} | ||
| inject(carrier) | ||
|
|
||
| if not carrier: | ||
| return annotations | ||
|
|
||
| if annotations is None: | ||
| annotations = {} | ||
|
|
||
| for key, value in carrier.items(): | ||
| annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value |
There was a problem hiding this comment.
inject_trace_context() merges every key produced by the global OTel propagator, so it can add non-trace headers like baggage (and potentially large/PII values) even though the docstring says only W3C traceparent/tracestate are injected; consider filtering to just those keys or using the trace-context propagator explicitly to avoid leaking/oversizing Kubernetes annotations.
| annotations = {} | ||
|
|
||
| for key, value in carrier.items(): | ||
| annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value |
There was a problem hiding this comment.
The assignment annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value will overwrite user-provided opentelemetry.io/* annotations (e.g., an explicit opentelemetry.io/traceparent), which contradicts the intent to preserve existing annotations; use a non-overwriting merge (e.g., only set when the key is absent) or document/rename the behavior.
| annotations[f"{_TRACE_ANNOTATION_PREFIX}{key}"] = value | |
| annotation_key = f"{_TRACE_ANNOTATION_PREFIX}{key}" | |
| if annotation_key not in annotations: | |
| annotations[annotation_key] = value |
| class TestInjectTraceContext: | ||
| def test_passthrough_without_otel(self): | ||
| """Annotations returned unchanged when opentelemetry is not installed.""" | ||
| existing = {"user-key": "user-value"} | ||
| result = utils.inject_trace_context(existing) | ||
| assert result is existing | ||
|
|
||
| def test_none_passthrough_without_otel(self): | ||
| """None returned when annotations is None and opentelemetry is absent.""" | ||
| assert utils.inject_trace_context(None) is None |
There was a problem hiding this comment.
test_passthrough_without_otel / test_none_passthrough_without_otel depend on opentelemetry not being installed in the test environment, which makes the tests brittle for contributors; patch the import (or sys.modules) to force the ImportError branch so these tests are deterministic.
| def test_preserves_existing_annotations(self): | ||
| """User-supplied annotations are not overwritten.""" | ||
| mock_mod = _mock_otel_propagate({"traceparent": SAMPLE_TRACEPARENT}) | ||
| original = {"team": "ml-platform", "created-by": "sdk"} | ||
| with patch.dict( | ||
| "sys.modules", | ||
| {"opentelemetry": MagicMock(), "opentelemetry.propagate": mock_mod}, | ||
| ): | ||
| result = utils.inject_trace_context(original) | ||
| assert result["team"] == "ml-platform" | ||
| assert result["created-by"] == "sdk" | ||
| assert result["opentelemetry.io/traceparent"] == SAMPLE_TRACEPARENT |
There was a problem hiding this comment.
test_preserves_existing_annotations doesn't cover the actual overwrite risk for trace annotations (it only checks unrelated keys); add a case where the input already contains opentelemetry.io/traceparent/tracestate and assert the function does not replace them (or update the test/name to match intended behavior).
|
Opened this as a concrete follow-up to #446 since that issue has been sitting without any design feedback for a while now. The core question is really about the annotation key format — I went with @andreyvelich since you scoped the observability work in #164 and have context on where the SDK-controller boundary should be drawn for tracing — would appreciate your eyes on whether this annotation-based approach makes sense as the propagation mechanism, or if there's a different direction you had in mind. The implementation is intentionally minimal (lazy import, no new deps) so it's easy to iterate on. Also worth noting this is complementary to #401 — that PR handles SDK-internal spans while this one handles the cross-boundary propagation into the CRD itself. |
ee38abd to
14f9d10
Compare
When opentelemetry-api is installed, inject the active W3C trace context (traceparent, tracestate) into TrainJob CRD annotations under the opentelemetry.io/ prefix before creation. Uses the global OTel propagator via lazy import — zero overhead when the package is absent. Follows the same annotation convention used by Tekton Pipelines for CRD-level context propagation. Relates to kubeflow#446, kubeflow#164 Signed-off-by: Rajneesh Chaudhary <rajneeshrehsaan48@gmail.com>
- Only inject traceparent/tracestate into annotations, skip baggage and other propagator-injected keys that could carry PII or unbounded values into K8s annotations. - Don't overwrite existing opentelemetry.io/* annotations — if the user explicitly set them, respect that. - Mock sys.modules in no-otel tests so they're deterministic regardless of whether opentelemetry is installed in the env. - Add tests for overwrite protection and baggage filtering. Signed-off-by: Rajneesh Chaudhary <rajneeshrehsaan48@gmail.com>
14f9d10 to
308716c
Compare
|
@andreyvelich Quick ping — this is the SDK-side complement to the OTel integration in #164. The approach is intentionally minimal (lazy import, no new required deps). Would appreciate your take on whether annotations are the right propagation mechanism before I invest more time refining the implementation. Happy to adjust the approach based on your feedback. |
|
Sorry for the late reply @Rajneesh180! |
|
Sure @andreyvelich I will review it. |
|
@andreyvelich should we discuss the KEP first before moving forward with this PR? This implementation is quite different from my KEP (#382), and I think we should align on the design before we lock in the code. A few differences I noticed:
My concern is that if we merge this shape now, we may need to rework the propagation code once the KEP lands. I think it would be better to agree on the KEP design first, then make the PR follow that structure. What do you think? |
|
Yes, let's firstly discuss the implementation in the KEP. I added this item to the next Kubeflow SDK call: https://docs.google.com/document/d/1jH2WAX2ePxOfI4JuiVK9nPlesDMiyg67xzLwhpR7wTQ/edit?tab=t.0 |
Addresses #446.
When
opentelemetry-apiis installed,train()now injects the active W3C trace context (traceparent,tracestate) into TrainJob annotations under theopentelemetry.io/prefix. This lets traces started by the SDK propagate to the training controller and worker pods.The implementation uses a lazy import with a fallback — when the package isn't installed (the common case today), the function is a no-op that returns annotations unchanged with zero overhead. The
opentelemetry.io/annotation prefix follows the same convention Tekton Pipelines uses for CRD-level context propagation.Changes:
inject_trace_context()inutils.py— uses the global OTel propagator to write trace headers into a carrier dict, then merges them as prefixed annotationsKubernetesBackend.train()right before TrainJob constructionNo new required dependencies. A follow-up could add
opentelemetry-apias an optional extra (pip install kubeflow[telemetry]).Relates to #164