feat: add OpenTelemetry observability with GenAI semantic conventions #49

Open
kirang89 wants to merge 9 commits into main from feat/observability

kirang89 commented Mar 21, 2026

Summary

Add OpenTelemetry instrumentation to ask-forge so consumers can observe LLM interactions using any OTel-compatible backend (Langfuse, Jaeger, Honeycomb, etc.).

Design decisions

  • Library depends only on @opentelemetry/api — zero overhead no-op when no SDK is installed. No backend coupling.
  • GenAI semantic conventions for span and attribute naming (gen_ai.chat, gen_ai.execute_tool, etc.)
  • Pure functions in src/tracing.ts — no classes, no state (aside from the idiomatic module-level tracer)
  • One trace per ask() call, correlated across multi-turn conversations via ask_forge.session.id — matches the standard pattern used by Langfuse, LangSmith, and OpenLLMetry
  • All error paths record exceptions with full stack traces

Trace structure

Each session.ask() produces a trace:

ask (root)
├── compaction
├── gen_ai.chat (iteration 1)
├── gen_ai.execute_tool (rg)
├── gen_ai.execute_tool (read)
├── gen_ai.chat (iteration 2)
└── gen_ai.chat (iteration 3, final response)
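The parent-child layout above can be checked by rebuilding the tree from a flat list of finished spans. The following is a minimal sketch, assuming a hypothetical `FlatSpan` record (not the OTel `ReadableSpan` type) with an `id`, an optional `parentId`, and a `name`:

```typescript
// Hypothetical flat span record for illustration; not the OTel ReadableSpan type.
interface FlatSpan {
  id: string;
  parentId?: string;
  name: string;
}

// Render spans as an indented tree, keeping children in their original order.
function renderTrace(spans: FlatSpan[]): string {
  const root = spans.find((s) => s.parentId === undefined);
  if (!root) return "";
  const lines: string[] = [`${root.name} (root)`];
  const walk = (parentId: string, indent: string): void => {
    const children = spans.filter((s) => s.parentId === parentId);
    children.forEach((child, i) => {
      const last = i === children.length - 1;
      lines.push(`${indent}${last ? "└── " : "├── "}${child.name}`);
      walk(child.id, indent + (last ? "    " : "│   "));
    });
  };
  walk(root.id, "");
  return lines.join("\n");
}

// Example data shaped like the diagram: every span is a direct child of ask.
const exampleSpans: FlatSpan[] = [
  { id: "1", name: "ask" },
  { id: "2", parentId: "1", name: "compaction" },
  { id: "3", parentId: "1", name: "gen_ai.chat" },
  { id: "4", parentId: "1", name: "gen_ai.execute_tool" },
  { id: "5", parentId: "1", name: "gen_ai.chat" },
];
const renderedTrace = renderTrace(exampleSpans);
```

A helper like this is mainly useful in tests, where the exported spans from an in-memory provider can be compared against the expected shape.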

Instrumentation points

| Span | Attributes | Events |
| --- | --- | --- |
| ask (root) | gen_ai.operation.name, gen_ai.request.model, ask_forge.session.id, ask_forge.repo.url, ask_forge.repo.commitish, token usage, iteration/tool counts, link stats | gen_ai.system_instructions, gen_ai.input.messages (question) |
| compaction | was_compacted, tokens_before, tokens_after | exception on error |
| gen_ai.chat | model, provider, iteration, token usage (incl. cache), stop reason | gen_ai.input.messages, gen_ai.output.messages, exception on error |
| gen_ai.execute_tool | tool name, call ID | gen_ai.tool.call.arguments, gen_ai.tool.call.result |
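To illustrate the pure-function style, the root-span attributes could be assembled by plain functions that return attribute records. This is only a sketch: the `AskSpanInput` shape, the function names, and the sample values are assumptions for illustration and are not taken from src/tracing.ts. Only the attribute keys come from this PR.

```typescript
// Subset of the value types OTel accepts for span attributes.
type AttrValue = string | number | boolean;

// Hypothetical input shape; field names are illustrative only.
interface AskSpanInput {
  operationName: string;
  model: string;
  sessionId: string;
  repoUrl: string;
  commitish: string;
}

// Pure function: map inputs to the root-span attribute keys named in this PR.
function askStartAttributes(input: AskSpanInput): Record<string, AttrValue> {
  return {
    "gen_ai.operation.name": input.operationName,
    "gen_ai.request.model": input.model,
    "ask_forge.session.id": input.sessionId,
    "ask_forge.repo.url": input.repoUrl,
    "ask_forge.repo.commitish": input.commitish,
  };
}

// Totals are only known once the iteration loop finishes, so they are built separately.
function askEndAttributes(iterations: number, toolCalls: number): Record<string, AttrValue> {
  return {
    "ask_forge.total_iterations": iterations,
    "ask_forge.total_tool_calls": toolCalls,
  };
}

// Placeholder values; none of these are real ask-forge defaults.
const startAttrs = askStartAttributes({
  operationName: "chat",
  model: "example-model",
  sessionId: "session-123",
  repoUrl: "https://example.com/repo.git",
  commitish: "main",
});
const endAttrs = askEndAttributes(3, 2);
```

Because both functions are pure, they can be unit-tested without a TracerProvider; the resulting records would then be passed to the span via `setAttributes`.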

Error handling

  • Tool execution errors: tool span ends with ERROR, exception propagates and ask span also ends with ERROR
  • Stream/API errors: generation span ends with ERROR, ask span ends normally (returns error result)
  • Max iterations: ask span ends with ERROR (error.type = "max_iterations_reached")
  • Compaction failures: compaction span ends with ERROR, execution continues

All spans are guaranteed to end (no orphans) via try/catch guards.
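The "no orphans" guarantee can be pictured as a try/catch/finally wrapper around each instrumented step. The sketch below is not the PR's code: `withSpan` and `makeRecordingSpan` are hypothetical helpers, `SpanLike` is a structural stand-in for the subset of the OTel `Span` interface used here (so the snippet runs without @opentelemetry/api), and the wrapper is synchronous for brevity even though tool execution would normally be async.

```typescript
// Structural stand-in for the subset of the OTel Span interface used below.
interface SpanLike {
  recordException(err: Error): void;
  setStatus(status: { code: number; message?: string }): void;
  end(): void;
}

// Mirrors SpanStatusCode.ERROR in @opentelemetry/api (UNSET = 0, OK = 1, ERROR = 2).
const STATUS_ERROR = 2;

// Run fn, record any failure on the span, rethrow it, and always end the span.
function withSpan<T>(span: SpanLike, fn: () => T): T {
  try {
    return fn();
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    span.recordException(error);
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    throw err; // propagate, as the tool-error path does
  } finally {
    span.end(); // runs on both success and failure
  }
}

// Hypothetical recording span that logs which methods were called, in order.
function makeRecordingSpan(): { span: SpanLike; calls: string[] } {
  const calls: string[] = [];
  const span: SpanLike = {
    recordException: () => { calls.push("recordException"); },
    setStatus: (status) => { calls.push(`setStatus:${status.code}`); },
    end: () => { calls.push("end"); },
  };
  return { span, calls };
}
```

For paths where execution continues after a failure (compaction, stream/API errors), the same shape applies with the rethrow replaced by returning an error result.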

Multi-turn conversations

Each ask() call creates an independent trace. Multi-turn conversations are correlated via ask_forge.session.id on the root span. This matches the industry standard (Langfuse sessions, LangSmith threads, OpenLLMetry association properties).

Follow-up: #56 — adopt gen_ai.conversation.id from the OTel GenAI semantic conventions (v1.40.0) for spec-compliant conversation tracking.
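Since each ask() produces its own trace, reconstructing a conversation means grouping root spans by ask_forge.session.id. Backends such as Langfuse do this on their side; the sketch below only illustrates the idea, using a hypothetical `RootSpanRecord` shape rather than a real exporter type.

```typescript
// Hypothetical exported root-span record; real exporters use richer types.
interface RootSpanRecord {
  traceId: string;
  attributes: Record<string, string | number | boolean>;
}

const SESSION_KEY = "ask_forge.session.id";

// Group independent traces into conversations keyed by session ID.
function groupBySession(roots: RootSpanRecord[]): Map<string, string[]> {
  const sessions = new Map<string, string[]>();
  for (const root of roots) {
    const sessionId = root.attributes[SESSION_KEY];
    if (typeof sessionId !== "string") continue; // skip spans without a session ID
    const traces = sessions.get(sessionId) ?? [];
    traces.push(root.traceId);
    sessions.set(sessionId, traces);
  }
  return sessions;
}

// Two turns in session "s1", one turn in session "s2" (placeholder IDs).
const grouped = groupBySession([
  { traceId: "t1", attributes: { [SESSION_KEY]: "s1" } },
  { traceId: "t2", attributes: { [SESSION_KEY]: "s2" } },
  { traceId: "t3", attributes: { [SESSION_KEY]: "s1" } },
]);
```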

Files changed

  • src/tracing.ts (NEW): OTel span helpers with GenAI semantic conventions
  • src/session.ts (MODIFIED): instrumented at 4 integration points
  • test/tracing.test.ts (NEW): 20 tests with in-memory TracerProvider
  • README.md (MODIFIED): added Observability section with setup guide and metrics table
  • package.json (MODIFIED): added @opentelemetry/api dependency

Consumer setup

By default, tracing is a zero-overhead no-op (no console output, no network calls). To enable, the consumer installs an OTel SDK and registers an exporter before calling ask():

import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseSpanProcessor } from "@langfuse/otel";

const sdk = new NodeSDK({ spanProcessors: [new LangfuseSpanProcessor()] });
sdk.start();

// ask-forge spans now flow to Langfuse automatically

Uses only @opentelemetry/api (no-op without consumer SDK).
Pure functions — no classes, no state, no custom interfaces.
…points

- Root ask span with system prompt event
- Compaction span (success + error paths)
- Generation span per LLM iteration with input/output message events
- Tool span per execution with arguments/result events
- All error paths record exceptions with full stack traces

16 tests covering: root ask span attributes/events, compaction spans,
generation spans with usage/messages, tool spans with args/results,
parent-child relationships, and error path exception recording.

Covers: setup example (Langfuse via OTel), trace structure diagram,
captured metrics table for all span types, and error handling behavior.

…ool call count

- gen_ai.provider.name on generation spans (standard semconv)
- gen_ai.response.finish_reason on generation spans (end_turn, tool_use, etc.)
- ask_forge.total_iterations on root ask span
- ask_forge.total_tool_calls on root ask span
- Add try/catch around tool execution to end tool spans on error
- Wrap iteration loop in try/catch to guarantee ask span ends
- Add endToolSpanWithError helper for failed tool executions
- Remove unused params from startAskSpan (response, inferenceTimeMs)
- Add question as input event on root ask span
- Make systemPrompt optional to match upstream Context type
- Import Span type instead of inline import() expressions
- Replace console.log with this.#logger.log for compaction
- Remove stale test count from README
- Add test for tool execution error (span + ask span both end)
- Add test for API error string recorded as exception
- Add test for compaction error span lifecycle
- Fix TestSpan.addEvent signature to match OTel Span interface
- Fix TracerProvider type import (was trace.TracerProvider)
- Add addLink/addLinks stubs required by OTel Span interface
- Remove unused afterEach import