fix(bedrock): Add prompt caching support for Converse API #3390
base: main
Conversation
Caution

Changes requested ❌

Reviewed everything up to 4fa3792 in 2 minutes and 2 seconds. Click for details.

- Reviewed 156 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
1. packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py:359

   Draft comment: Good integration of prompt_caching_converse_handling in _handle_converse. In the streaming handler (lines ~400), note that if both read and write tokens are present, the span attribute may be overwritten. Ensure this is the intended behavior.

   Reason this comment was not posted: comment was on unchanged code.

2. packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py:56

   Draft comment: The test correctly validates prompt caching metrics for the Converse API. The cumulative workaround for metric values indicates the underlying counter is cumulative. Consider resetting metrics between tests to avoid cross-test interference if possible.

   Reason this comment was not posted: after close inspection, the draft comment was judged likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). The comment has two parts: 1) an observation about the cumulative nature of the metrics, which is already documented in the code comments, and 2) a speculative suggestion about resetting metrics that isn't clearly necessary since the current approach works. No actual problems or required changes were identified; resetting metrics could be valid if there were evidence of cross-test interference, but there is none, and the current workaround is intentional and functional. The comment was deleted as primarily observational and speculative.
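One way to sidestep the cumulative-counter concern, without resetting readers between tests, is to diff successive cumulative readings into per-test increments. A minimal sketch, assuming plain integer readings rather than the project's actual metric objects (the helper name is hypothetical):

```python
# Hypothetical helper (not the project's API): turn successive cumulative
# counter readings into per-interval increments, so each test can assert
# on its own delta instead of the running total.
def cumulative_to_delta(readings):
    deltas = []
    previous = 0
    for value in readings:
        deltas.append(value - previous)
        previous = value
    return deltas

# Two tests each caching 128 tokens produce cumulative readings 128, 256.
print(cumulative_to_delta([128, 256]))  # [128, 128]
```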
...pentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py
Outdated
…nstrumentation/bedrock/prompt_caching.py Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
hey @AlanPonnachan - looks like tests are failing, can you take a look?
….com/AlanPonnachan/openllmetry into feat-bedrock-converse-prompt-caching
Hi @nirga, I've resolved the lint test failures. The remaining failing test requires a recorded cassette; as I don't have access to an active AWS account, I'm unable to generate it. Thanks for your help!
Sure @AlanPonnachan, will do it - can you fix the small comment I wrote? I'll then run it locally and record a test. BTW - if you can rely on existing converse tests it might be easier
...pentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py
Thanks for the great suggestion and for your willingness to help record the test! I agree that relying on an existing test is a cleaner approach. Before I push the changes, I just want to confirm my plan sounds good to you. Does this plan look good? If so, I'll go ahead and make the changes.
@nirga any update on this?
Note

Other AI code review bot(s) detected: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Added response-based prompt caching handling for Bedrock Converse and invoked it from both synchronous and streaming Converse flows to record span attributes and emit cache read/write metrics.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant BedrockInstrumentor
    participant BedrockAPI
    participant PromptCacheHandler
    participant Span
    participant Metrics
    Client->>BedrockInstrumentor: invoke_converse(...)
    BedrockInstrumentor->>BedrockAPI: call Converse
    BedrockAPI-->>BedrockInstrumentor: response (includes usage_metadata)
    BedrockInstrumentor->>PromptCacheHandler: prompt_caching_converse_handling(response, vendor, model, metric_params)
    rect rgb(232, 244, 255)
        Note over PromptCacheHandler: extract cache_read/cache_creation from response<br/>set span attrs and gen_ai.usage.* token attributes
        PromptCacheHandler->>Span: set CACHED ("read"/"write") and token attrs
    end
    alt cache_read > 0
        PromptCacheHandler->>Metrics: emit cache read metric
    end
    alt cache_creation > 0
        PromptCacheHandler->>Metrics: emit cache write metric
    end
    BedrockInstrumentor-->>Client: return response
```
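The extraction step in the walkthrough can be sketched in isolation. This is an illustrative reconstruction, not the PR's actual handler: the `usage` key names (`cacheReadInputTokens` / `cacheWriteInputTokens`) and the returned shapes are assumptions, and real code would write to an OpenTelemetry span and counter instead of returning plain dicts.

```python
# Illustrative sketch of the cache-telemetry extraction (not the PR's code).
# Assumed usage keys: cacheReadInputTokens / cacheWriteInputTokens.
def extract_cache_telemetry(response, vendor, model):
    usage = response.get("usage", {})
    read = usage.get("cacheReadInputTokens", 0)
    write = usage.get("cacheWriteInputTokens", 0)
    base_attrs = {"gen_ai.system": vendor, "gen_ai.response.model": model}
    span_attrs = {}       # stand-in for span.set_attribute calls
    metric_points = []    # stand-in for counter.add(value, attributes) calls
    if read > 0:
        metric_points.append((read, {**base_attrs, "type": "read"}))
        span_attrs["gen_ai.usage.cache_read_input_tokens"] = read
    if write > 0:
        metric_points.append((write, {**base_attrs, "type": "write"}))
        span_attrs["gen_ai.usage.cache_creation_input_tokens"] = write
    return span_attrs, metric_points

# A response whose prompt was just written to the cache:
attrs, points = extract_cache_telemetry(
    {"usage": {"cacheReadInputTokens": 0, "cacheWriteInputTokens": 42}},
    "aws.bedrock",
    "amazon.titan-text-express-v1",
)
```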
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Suggested reviewers

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
[pre_merge_check_pass] The PR implementation addresses the core coding requirements from issue #3337.

[pre_merge_check_pass] All code changes in this PR are directly aligned with the requirements outlined in issue #3337.

✨ Finishing touches
Sorry about the delay @AlanPonnachan! I was OOO and this fell between the cracks (team missed it for some reason). I'm back now! Left a small comment reg. the test - lmk if you need any assistance
```python
assert data_point.value == usage["cache_creation_input_tokens"]
...
@pytest.mark.vcr
```
@AlanPonnachan can you record a real test case here, like the other tests we have?
Actionable comments posted: 1
♻️ Duplicate comments (1)

packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)

59-85: Span attribute overwrite when both cache operations occur.

If a response contains both `cache_read_input_tokens > 0` and `cache_creation_input_tokens > 0`, the second block (line 82) will overwrite the `CACHED` attribute set by the first block (line 68). This means only "write" will be recorded on the span, losing the "read" information. Consider one of these approaches:

- Accumulate both values in a single attribute (e.g., `"read,write"`)
- Use separate span attributes for read and write operations
- Prioritize one operation type if simultaneous occurrence is expected to be rare

Apply this diff to record both operations:

```diff
 if read_cached_tokens > 0:
     if metric_params.prompt_caching:
         metric_params.prompt_caching.add(
             read_cached_tokens,
             attributes={
                 **base_attrs,
                 CacheSpanAttrs.TYPE: "read",
             },
         )
-    span.set_attribute(CacheSpanAttrs.CACHED, "read")
+    span.set_attribute(f"{CacheSpanAttrs.CACHED}.read", True)
     span.set_attribute(
         "gen_ai.usage.cache_read_input_tokens", read_cached_tokens
     )
 if write_cached_tokens > 0:
     if metric_params.prompt_caching:
         metric_params.prompt_caching.add(
             write_cached_tokens,
             attributes={
                 **base_attrs,
                 CacheSpanAttrs.TYPE: "write",
             },
         )
-    span.set_attribute(CacheSpanAttrs.CACHED, "write")
+    span.set_attribute(f"{CacheSpanAttrs.CACHED}.write", True)
     span.set_attribute(
         "gen_ai.usage.cache_creation_input_tokens", write_cached_tokens
     )
```
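The overwrite described here is easy to demonstrate with a dict standing in for span attributes (illustrative only; the attribute keys are placeholders, not the project's `CacheSpanAttrs` values):

```python
# Dict stand-in for span attributes; the real code calls span.set_attribute.
span_attrs = {}

# Shared key: the later write clobbers the earlier one.
span_attrs["cache.cached"] = "read"
span_attrs["cache.cached"] = "write"
assert span_attrs["cache.cached"] == "write"  # "read" is lost

# Separate keys, as in the suggested diff: both operations survive.
span_attrs["cache.cached.read"] = True
span_attrs["cache.cached.write"] = True
```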
🧹 Nitpick comments (1)

packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)

14-43: Consider adding is_recording check for consistency.

The new `prompt_caching_converse_handling` function includes an `is_recording()` check (line 52 in the new function), but this existing function doesn't. For consistency, and to avoid unnecessary span operations when not recording, consider adding the same check here. Apply this diff:

```diff
 def prompt_caching_handling(headers, vendor, model, metric_params):
     base_attrs = {
         "gen_ai.system": vendor,
         "gen_ai.response.model": model,
     }
     span = trace.get_current_span()
-    if not isinstance(span, trace.Span):
+    if not isinstance(span, trace.Span) or not span.is_recording():
         return
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

Disabled knowledge base sources:

- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📒 Files selected for processing (3)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (3 hunks)
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1 hunks)
- packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py — 📄 CodeRabbit inference engine (CLAUDE.md)

- Store API keys only in environment variables/secure vaults; never hardcode secrets in code
- Use Flake8 for code linting and adhere to its rules

Files:

- packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py

🧬 Code graph analysis (2)

packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py (3)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1): PromptCaching (470-472)
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1): CacheSpanAttrs (9-11)
- packages/opentelemetry-instrumentation-bedrock/tests/metrics/conftest.py (1): test_context (35-47)

packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (2)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (2): prompt_caching_converse_handling (46-85), prompt_caching_handling (14-43)
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/guardrail.py (1): guardrail_converse (153-176)

🪛 Ruff (0.14.2)

packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py

- 30-30: Create your own exception (TRY002)
- 30-30: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (6)

packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)

46-57: Good defensive programming with is_recording check.

The addition of the `span.is_recording()` check and the conditional metric recording (`if metric_params.prompt_caching:`) are good defensive practices that prevent unnecessary work when telemetry is disabled.

packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (3)

25-28: LGTM: Clean import of prompt caching handlers.

The import statements are correctly structured and bring in both the new `prompt_caching_converse_handling` and the existing `prompt_caching_handling` functions.

357-361: LGTM: Proper integration in synchronous Converse handler.

The call to `prompt_caching_converse_handling` is correctly placed after `guardrail_converse` and receives the appropriate parameters (response, provider, model, metric_params). This mirrors the pattern used in `_handle_call` (line 332) for the invoke_model API.

399-406: LGTM: Correct metadata extraction for streaming events.

The metadata is properly extracted from the event before being passed to `prompt_caching_converse_handling`. This ensures that cache metrics are recorded when the final metadata event arrives in the stream.

packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py (2)
33-47: Verify metric assertion logic handles all scenarios.

The current assertion logic iterates through all data points and validates each one individually. However, it doesn't verify that the expected number of data points exist (e.g., exactly one for "read" and one for "write"). This could mask issues where duplicate or missing data points occur. Consider adding explicit checks:

```python
def assert_metric(reader, usage):
    metrics_data = reader.get_metrics_data()
    resource_metrics = metrics_data.resource_metrics
    assert len(resource_metrics) > 0

    m = get_metric(resource_metrics, PromptCaching.LLM_BEDROCK_PROMPT_CACHING)
    read_points = [dp for dp in m.data.data_points if dp.attributes[CacheSpanAttrs.TYPE] == "read"]
    write_points = [dp for dp in m.data.data_points if dp.attributes[CacheSpanAttrs.TYPE] == "write"]
    if usage["cache_read_input_tokens"] > 0:
        assert len(read_points) == 1
        assert read_points[0].value == usage["cache_read_input_tokens"]
    if usage["cache_creation_input_tokens"] > 0:
        assert len(write_points) == 1
        assert write_points[0].value == usage["cache_creation_input_tokens"]
```
50-69: Test structure is sound but VCR cassette needs recording.

The test correctly validates both cache creation (first call) and cache reading (second call) scenarios. The cumulative metric workaround at line 68 is necessary due to how the InMemoryMetricReader accumulates counter values across reads. Note: as mentioned in the PR description, this test requires a VCR cassette to be recorded with actual AWS credentials. Ensure the cassette is recorded before merging, as the test will fail without it.
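When the cassette is recorded, auth material must be scrubbed per the repo guideline on cassettes. A sketch of the filter configuration — in a real conftest.py this dict would be returned from pytest-recording's `vcr_config` fixture, and the exact header list is an assumption about what a SigV4-signed boto3 request carries:

```python
# Sketch of cassette scrubbing for the recording step (assumed header names).
def make_vcr_config():
    return {
        "filter_headers": [
            ("authorization", "REDACTED"),
            ("x-amz-security-token", "REDACTED"),
        ],
        # Record on the first run (with real AWS credentials), replay afterwards.
        "record_mode": "once",
    }
```

In the real conftest this would typically be wrapped as `@pytest.fixture def vcr_config(): ...` so pytest-recording picks it up automatically.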
```python
def get_metric(resource_metrics, name):
    for rm in resource_metrics:
        for sm in rm.scope_metrics:
            for metric in sm.metrics:
                if metric.name == name:
                    return metric
    raise Exception(f"No metric found with name {name}")
```
Use AssertionError for test helper.

The generic `Exception` should be replaced with `AssertionError`, since this is a test assertion helper and test failures should propagate as assertion failures (as per static analysis hints). Apply this diff:

```diff
 def get_metric(resource_metrics, name):
     for rm in resource_metrics:
         for sm in rm.scope_metrics:
             for metric in sm.metrics:
                 if metric.name == name:
                     return metric
-    raise Exception(f"No metric found with name {name}")
+    raise AssertionError(f"No metric found with name {name}")
```
+ raise AssertionError(f"No metric found with name {name}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def get_metric(resource_metrics, name):
    for rm in resource_metrics:
        for sm in rm.scope_metrics:
            for metric in sm.metrics:
                if metric.name == name:
                    return metric
    raise AssertionError(f"No metric found with name {name}")
```
🤖 Prompt for AI Agents

In packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py around lines 24 to 30, the helper raises a generic Exception when a metric is not found; change it to raise an AssertionError (or use assert False) with the same message so test failures surface as assertion failures. Update the raise statement accordingly and keep the message f"No metric found with name {name}" intact.
Actionable comments posted: 2

🧹 Nitpick comments (1)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (1)

16-22: Consider using a more specific exception type.

While the function logic is correct, using a generic `Exception` is flagged by Ruff. For test utilities this is acceptable, but you could optionally create a custom exception class or use `AssertionError` for better clarity in test failures. Apply this diff if you'd like to improve the exception handling:

```diff
-    raise Exception(f"No metric found with name {name}")
+    raise AssertionError(f"No metric found with name {name}")
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

Disabled knowledge base sources:

- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📒 Files selected for processing (1)

- packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py — 📄 CodeRabbit inference engine (CLAUDE.md)

- Store API keys only in environment variables/secure vaults; never hardcode secrets in code
- Use Flake8 for code linting and adhere to its rules

Files:

- packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

🧬 Code graph analysis (1)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (5)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1): PromptCaching (470-472)
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1): CacheSpanAttrs (9-11)
- packages/opentelemetry-instrumentation-bedrock/tests/metrics/conftest.py (1): test_context (35-47)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1): get_finished_spans (40-43)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-261)

🪛 Ruff (0.14.2)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

- 22-22: Create your own exception (TRY002)
- 22-22: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (4)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (4)

12-13: LGTM! Imports are appropriate for caching support.

The new imports are necessary for testing prompt caching functionality and are properly scoped.

25-46: LGTM! Metric assertion logic is sound.

The function correctly handles cumulative metrics by tracking both read and write data points separately and asserting the appropriate one based on the `is_read` parameter.

1076-1090: Verify VCR cassette recording.

This test requires a VCR cassette to be recorded with actual AWS credentials. According to the PR comments, you've requested that a maintainer with AWS access run and record this cassette. Ensure this is completed before merging.

1077-1130: Comprehensive caching test with good coverage.

The test thoroughly validates both cache write and read scenarios, including:

- Usage token assertions from API responses
- Span attribute validation for cache state
- Metric emission verification

The test structure is clear and follows best practices.
Actionable comments posted: 1

♻️ Duplicate comments (1)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (1)

1113-1114: Resolved: replaced hardcoded attr strings with semconv constants.

Good use of SpanAttributes for cache token attributes; the previous review concern is addressed. Also applies to lines 1121-1122.
🧹 Nitpick comments (4)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (4)

12-14: Avoid depending on instrumentation internals in tests.

Importing PromptCaching and CacheSpanAttrs from instrumentation modules couples tests to internal paths. Prefer stable semconv constants, or add local fallbacks for attribute keys, to reduce breakage risk on refactors.

16-22: Use pytest.fail and drop bare Exception; simplify lookup.

Replace the manual loops and bare `Exception` with `next(...)` and `pytest.fail` for clearer failures and to satisfy TRY002/TRY003:

```diff
 def get_metric(resource_metrics, name):
-    for rm in resource_metrics:
-        for sm in rm.scope_metrics:
-            for metric in sm.metrics:
-                if metric.name == name:
-                    return metric
-    raise Exception(f"No metric found with name {name}")
+    try:
+        return next(
+            m
+            for rm in resource_metrics
+            for sm in rm.scope_metrics
+            for m in sm.metrics
+            if m.name == name
+        )
+    except StopIteration:
+        pytest.fail(f"Metric '{name}' not found in collected metrics")
```
25-46: Make metric assertion robust to cumulative/export cycles.

Sum data points by type to tolerate multiple collections and exporter behavior, and add a clearer empty-metrics assertion:

```diff
 def assert_metric(reader, usage, is_read=False):
     metrics_data = reader.get_metrics_data()
     resource_metrics = metrics_data.resource_metrics
-    assert len(resource_metrics) > 0
-
-    m = get_metric(resource_metrics, PromptCaching.LLM_BEDROCK_PROMPT_CACHING)
-    # This check is now more specific to handle cumulative metrics
-    found_read = False
-    found_write = False
-    for data_point in m.data.data_points:
-        if data_point.attributes[CacheSpanAttrs.TYPE] == "read":
-            found_read = True
-            assert data_point.value == usage["cache_read_input_tokens"]
-        elif data_point.attributes[CacheSpanAttrs.TYPE] == "write":
-            found_write = True
-            assert data_point.value == usage["cache_creation_input_tokens"]
-
-    if is_read:
-        assert found_read
-    else:
-        assert found_write
+    assert resource_metrics, "No resource metrics collected"
+
+    m = get_metric(resource_metrics, PromptCaching.LLM_BEDROCK_PROMPT_CACHING)
+    read = sum(
+        dp.value for dp in m.data.data_points
+        if dp.attributes.get(CacheSpanAttrs.TYPE) == "read"
+    )
+    write = sum(
+        dp.value for dp in m.data.data_points
+        if dp.attributes.get(CacheSpanAttrs.TYPE) == "write"
+    )
+    if is_read:
+        assert read == usage["cache_read_input_tokens"]
+    else:
+        assert write == usage["cache_creation_input_tokens"]
```
1106-1123: Don't rely on span ordering; select by attribute.

Use CacheSpanAttrs.CACHED to locate the write and read spans and avoid flakiness:

```diff
     # Assertions for the first span (cache write)
-    span_write = spans[0]
+    span_write = next(s for s in spans if s.attributes.get(CacheSpanAttrs.CACHED) == "write")
     assert span_write.name == "bedrock.converse"
     attributes_write = span_write.attributes
@@
     # Assertions for the second span (cache read)
-    span_read = spans[1]
+    span_read = next(s for s in spans if s.attributes.get(CacheSpanAttrs.CACHED) == "read")
     assert span_read.name == "bedrock.converse"
     attributes_read = span_read.attributes
```
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

Disabled knowledge base sources:

- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📒 Files selected for processing (1)

- packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py — 📄 CodeRabbit inference engine (CLAUDE.md)

- Store API keys only in environment variables/secure vaults; never hardcode secrets in code
- Use Flake8 for code linting and adhere to its rules

Files:

- packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

🧬 Code graph analysis (1)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (5)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1): PromptCaching (470-472)
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1): CacheSpanAttrs (9-11)
- packages/opentelemetry-instrumentation-bedrock/tests/metrics/conftest.py (1): test_context (35-47)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1): get_finished_spans (40-43)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-261)

🪛 Ruff (0.14.2)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

- 22-22: Create your own exception (TRY002)
- 22-22: Avoid specifying long messages outside the exception class (TRY003)
🔇 Additional comments (1)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (1)

1124-1130: Metrics assertions look good after helper hardening.

The cumulative usage check for read/write is aligned with counter semantics; this is fine once assert_metric is updated as suggested. Optionally, assert there are exactly two data points (read/write) for the metric in this test context after moving the fixture for isolation.
Hi @nirga, I've updated the PR with the changes we discussed. I moved the logic into a new test, so it should be ready for the cassette to be recorded now. Thanks again for your help!
Thanks @AlanPonnachan! Can you record the test?
Hi @nirga, thank you again for all your guidance. I tried to run the recording command locally as you requested. As expected, since I don't have AWS credentials, the test run fails. I believe the PR is now ready from a code perspective. Would you be able to run the recording on your end when you have a moment? Thank you so much for your help!
Actionable comments posted: 1
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

Disabled knowledge base sources:

- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📒 Files selected for processing (3)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (3 hunks)
- packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml (1 hunks)
- packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py

🧰 Additional context used

📓 Path-based instructions (2)

**/cassettes/**/*.{yaml,yml,json} — 📄 CodeRabbit inference engine (CLAUDE.md)

- Never commit secrets or PII in VCR cassettes; scrub sensitive data

Files:

- packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml

**/*.py — 📄 CodeRabbit inference engine (CLAUDE.md)

- Store API keys only in environment variables/secure vaults; never hardcode secrets in code
- Use Flake8 for code linting and adhere to its rules

Files:

- packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

🧠 Learnings (1)

📚 Learning: 2025-08-17T15:06:48.109Z

Learnt from: CR
PR: traceloop/openllmetry#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T15:06:48.109Z
Learning: Applies to tests/**/*.py : Tests that make API calls must utilize VCR cassettes

Applied to files:

- packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

🧬 Code graph analysis (1)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (5)

- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1): PromptCaching (470-472)
- packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1): CacheSpanAttrs (9-11)
- packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1): reader (37-41)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1): get_finished_spans (40-43)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-245)

🪛 Ruff (0.14.2)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

- 22-22: Create your own exception (TRY002)
- 22-22: Avoid specifying long messages outside the exception class (TRY003)
- 1077-1077: Unused function argument: instrument_legacy (ARG001)
🔇 Additional comments (4)

packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (4)

12-13: LGTM!

The imports for `PromptCaching` and `CacheSpanAttrs` are properly added to support the caching test scenario.

16-22: Helper function is clear and appropriate for test utilities.

The `get_metric` helper correctly searches through resource metrics. While static analysis suggests using a custom exception class, a generic `Exception` is acceptable for test utilities where the primary goal is clear failure messages during test debugging.

25-45: LGTM!

The `assert_metric` helper correctly validates prompt caching metrics for both read and write scenarios. The logic properly handles cumulative metrics by iterating through all data points and checking that the appropriate cache type was found.

1076-1129: Test logic is well-structured and correctly validates caching behavior.

The test properly exercises both cache-write and cache-read paths:

- The first call writes to cache (`cache_creation_input_tokens > 0`)
- The second call reads from cache (`cache_read_input_tokens > 0`)
- Span attributes and metrics are validated for both scenarios

Good use of semantic convention constants (`SpanAttributes.*`) instead of hardcoded strings, addressing previous review feedback. Note: the `instrument_legacy` fixture is flagged as unused by static analysis, but it's required for its side effect of setting up the Bedrock instrumentation before the test runs. The test will work correctly once the VCR cassette is properly recorded with valid AWS credentials (see the cassette file review comment).
```yaml
interactions:
- request:
    body: '{"messages": [{"role": "user", "content": [{"text": "Hello, this is a test
      prompt for caching."}]}], "inferenceConfig": {"maxTokens": 50}, "additionalModelRequestFields":
      {"cacheControl": {"type": "ephemeral"}}}'
    headers:
      Content-Length:
      - '211'
      Content-Type:
      - !!binary |
        YXBwbGljYXRpb24vanNvbg==
      User-Agent:
      - !!binary |
        Qm90bzMvMS4zNC4xNDUgbWQvQm90b2NvcmUjMS4zNC4xNDUgdWEvMi4wIG9zL3dpbmRvd3MjMTAg
        bWQvYXJjaCNhbWQ2NCBsYW5nL3B5dGhvbiMzLjExLjggbWQvcHlpbXBsI0NQeXRob24gY2ZnL3Jl
        dHJ5LW1vZGUjbGVnYWN5IEJvdG9jb3JlLzEuMzQuMTQ1
      X-Amz-Date:
      - !!binary |
        MjAyNTEwMzBUMTY1MTI4Wg==
      amz-sdk-invocation-id:
      - !!binary |
        ZTFmM2VjZTMtM2VlMC00ZGZiLWE2MWEtYjBiYmU4M2U2NDFm
      amz-sdk-request:
      - !!binary |
        YXR0ZW1wdD0x
    method: POST
    uri: https://bedrock-runtime.us-east-1.amazonaws.com/model/amazon.titan-text-express-v1/converse
  response:
    body:
      string: '{"message":"The security token included in the request is invalid."}'
    headers:
      Connection:
      - keep-alive
      Content-Length:
      - '68'
      Content-Type:
      - application/json
      Date:
      - Thu, 30 Oct 2025 16:51:30 GMT
      x-amzn-ErrorType:
      - UnrecognizedClientException:http://internal.amazon.com/coral/com.amazon.coral.service/
      x-amzn-RequestId:
      - f0fdf86b-d85a-4f16-8713-1b9b5fccbe25
    status:
      code: 403
      message: Forbidden
version: 1
```
🧩 Analysis chain
Cassette contains authentication failure instead of valid caching response.
The cassette records a 403 Forbidden response with "The security token included in the request is invalid" rather than a successful API response with caching metadata. This means the test `test_titan_converse_with_caching` will fail because:
- Lines 1086-1088: Expects `usage_write["cache_creation_input_tokens"] > 0`, but the error response has no `usage` field.
- Lines 1091-1099: Expects a second successful call with cache read data.
Per the PR objectives, you need AWS credentials to record a valid cassette. Ask the maintainer to run:
and commit the generated cassette.
🏁 Script executed:
poetry run pytest packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py::test_titan_converse_with_caching --record-mode=once

Length of output: 44
Once I locate the test file, I'll examine the specific test to verify the assertions mentioned in the review comment.
---
🏁 Script executed:
```shell
#!/bin/bash
# Find and inspect the test_titan.py file and locate the test_titan_converse_with_caching test
fd "test_titan\.py" --type f
```

Length of output: 77
🏁 Script executed:
```shell
#!/bin/bash
# Inspect the test function test_titan_converse_with_caching
ast-grep --pattern 'def test_titan_converse_with_caching($_) {
  $$$
}'
```

Length of output: 0
🏁 Script executed:
```shell
#!/bin/bash
# Search for the test function and display context
rg -n "def test_titan_converse_with_caching" --type py -A 50
```

Length of output: 6541
Cassette is invalid and will cause test failure; AWS credentials required to regenerate.
The cassette indeed contains a 403 Forbidden response instead of a valid caching response. The test will fail at line 1086 when attempting to access `response_write["usage"]` from the error response, which lacks this field. The cassette must be regenerated with valid AWS credentials to record a successful response with cache metadata.
🤖 Prompt for AI Agents
packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml
lines 1-47: the cassette records a 403 UnrecognizedClientException (invalid AWS
credentials) instead of a successful converse response, causing tests to crash
when they expect response_write["usage"]; regenerate or replace this cassette by
re-running the test recording with valid AWS credentials so the cassette
contains a successful response that includes cache/usage metadata, and then
commit the new cassette (or remove the failing cassette and update the test to
use a properly recorded fixture).
Thanks @AlanPonnachan, I tried running it locally and the test failed:

```
FAILED tests/traces/test_titan.py::test_titan_converse_with_caching - botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the Converse operation: The model returned the following errors: Malformed input request: extraneous key [cacheControl] is not permitted, please reformat your input and try again.
```
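For reference, the Converse API exposes prompt caching through `cachePoint` content blocks rather than an Anthropic-style `cacheControl` key under `additionalModelRequestFields`, and only certain models (e.g. Anthropic Claude on Bedrock, not `amazon.titan-text-express-v1`) support it. The sketch below builds such a request; the model ID and the usage key names are assumptions to verify against the Bedrock documentation:

```python
# Hypothetical request shape for Converse prompt caching: a "cachePoint"
# content block marks the prefix to cache. The model ID below is an
# assumption (a cache-capable Claude model); Titan models reject caching
# keys, as the failing test run above showed.
request = {
    "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
    "messages": [
        {
            "role": "user",
            "content": [
                {"text": "Hello, this is a test prompt for caching."},
                {"cachePoint": {"type": "default"}},  # cache everything up to here
            ],
        }
    ],
    "inferenceConfig": {"maxTokens": 50},
}

# With valid credentials this would be sent as:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.converse(**request)
# and a cache-capable model reports cache activity in response["usage"]
# (e.g. "cacheReadInputTokens" / "cacheWriteInputTokens").
```

Re-recording the cassette against such a model, instead of Titan with `cacheControl`, should avoid the `ValidationException`.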
`feat(instrumentation): ...` or `fix(instrumentation): ...`

Description
This PR introduces prompt caching telemetry for the AWS Bedrock Converse and Converse Stream APIs, bringing feature parity with the existing `invoke_model` instrumentation.

The Converse API reports caching information in the `usage` field of the response body, rather than through HTTP headers. This implementation adds the necessary logic to parse this information and record it as metrics and span attributes.

Changes include:
- Added `prompt_caching_converse_handling` in `prompt_caching.py` to extract `cache_read_input_tokens` and `cache_creation_input_tokens` from the response body.
- `__init__.py`: the new function is now called from `_handle_converse` and `_handle_converse_stream` to process caching data for both standard and streaming calls.
- Added `test_bedrock_converse_prompt_caching_metrics.py` to validate that the `gen_ai.prompt.caching` metric is correctly emitted for the Converse API.

Fixes #3337
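A hedged sketch of what a handler like `prompt_caching_converse_handling` could look like (the `usage` key names follow the AWS Converse response shape as I understand it, and the span-attribute names and counter interface are assumptions, not the PR's actual code):

```python
# Hypothetical handler: the Converse API reports caching in the response
# body's "usage" field, so span attributes and metrics are derived from
# there rather than from HTTP headers. Attribute names are illustrative.

def prompt_caching_converse_handling(response, span, token_counter=None, attrs=None):
    usage = response.get("usage", {})
    cache_read = usage.get("cacheReadInputTokens", 0)
    cache_write = usage.get("cacheWriteInputTokens", 0)
    base_attrs = dict(attrs or {})

    if cache_read > 0:
        span.set_attribute("gen_ai.usage.cache_read_input_tokens", cache_read)
        if token_counter is not None:
            token_counter.add(cache_read, {**base_attrs, "gen_ai.cache.type": "read"})

    if cache_write > 0:
        span.set_attribute("gen_ai.usage.cache_creation_input_tokens", cache_write)
        if token_counter is not None:
            token_counter.add(cache_write, {**base_attrs, "gen_ai.cache.type": "write"})
```

Guarding on `> 0` keeps non-caching calls from emitting empty attributes, which matches how the header-based `invoke_model` path is described.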
Important
Adds prompt caching telemetry for AWS Bedrock Converse APIs, including a new function for caching data extraction and corresponding tests.
- Added `prompt_caching_converse_handling` in `prompt_caching.py` to extract caching data from the Converse API response body.
- Integrated `prompt_caching_converse_handling` into `_handle_converse` and `_handle_converse_stream` in `__init__.py`.
- Added `test_bedrock_converse_prompt_caching_metrics.py` to validate `gen_ai.prompt.caching` metric emission for the Converse API.

This description was created by Ellipsis for 4fa3792. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
New Features
Tests