Commit 0c8674b
Merge branch 'main' into feat/reasoning
2 parents 21b1481 + ec189ab

File tree

2 files changed (+96 -8 lines):

* CHANGELOG.md
* chatlas/_provider_anthropic.py

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,13 +14,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 * `ChatOpenAI()` and `ChatAzureOpenAI()` gain access to latest models, built-in tools, image generation, etc. as a result of moving to the new [Responses API](https://platform.openai.com/docs/api-reference/responses). (#192)
 * `ChatOpenAI()`, `ChatAnthropic()`, and `ChatGoogle()` gain a new `reasoning` parameter to easily opt into, and fully customize, reasoning capabilities. (#202)
 * A new `ContentThinking` content type was added and captures the "thinking" portion of a reasoning model. (#192)
+* `ChatAnthropic()` and `ChatBedrockAnthropic()` gain a new `cache` parameter to control caching. By default it is set to `"5m"`. This should (on average) reduce the cost of your chats. (#215)
 * Added support for systematic evaluation via [Inspect AI](https://inspect.aisi.org.uk/). This includes:
   * A new `.export_eval()` method for exporting conversation history as an Inspect eval dataset sample. This supports multi-turn conversations, tool calls, images, PDFs, and structured data.
   * A new `.to_solver()` method for translating chat instances into Inspect solvers that can be used with Inspect's evaluation framework.
   * A new `Turn.to_inspect_messages()` method for converting turns to Inspect's message format.
   * Comprehensive documentation in the [Evals guide](https://posit-dev.github.io/chatlas/misc/evals.html).
-
 ### Changes
 
 * `ChatOpenAI()` and `ChatAzureOpenAI()` move from OpenAI's Completions API to the [Responses API](https://platform.openai.com/docs/api-reference/responses). If this happens to break behavior, change `ChatOpenAI()` -> `ChatOpenAICompletions()` (or `ChatAzureOpenAI()` -> `ChatAzureOpenAICompletions()`). (#192)
```
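For reference, a minimal sketch of the new `cache` parameter in use (the prompts are hypothetical; `ChatAnthropic`, `cache`, and `chat.get_tokens()` are from this changeset and its docs):

```python
from chatlas import ChatAnthropic

# Caching defaults to "5m"; use "1h" for longer-lived caching
# (at a higher write cost) or "none" to disable it entirely.
chat = ChatAnthropic(
    system_prompt="You are a terse assistant.",  # hypothetical prompt
    cache="5m",
)
chat.chat("What does prompt caching buy me?")  # hypothetical user turn
print(chat.get_tokens())  # per-turn input / cached-input / output tokens
```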

chatlas/_provider_anthropic.py

Lines changed: 95 additions & 7 deletions
```diff
@@ -47,6 +47,7 @@
     ToolParam,
     ToolUseBlock,
 )
+from anthropic.types.cache_control_ephemeral_param import CacheControlEphemeralParam
 from anthropic.types.document_block_param import DocumentBlockParam
 from anthropic.types.image_block_param import ImageBlockParam
 from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
```
```diff
@@ -77,6 +78,7 @@ def ChatAnthropic(
     system_prompt: Optional[str] = None,
     model: "Optional[ModelParam]" = None,
     max_tokens: int = 4096,
+    cache: Literal["5m", "1h", "none"] = "5m",
    reasoning: Optional["int | ThinkingConfigEnabledParam"] = None,
    api_key: Optional[str] = None,
    kwargs: Optional["ChatClientArgs"] = None,
```
```diff
@@ -127,6 +129,10 @@ def ChatAnthropic(
         choosing a model for all but the most casual use.
     max_tokens
         Maximum number of tokens to generate before stopping.
+    cache
+        How long to cache inputs? Defaults to "5m" (five minutes).
+        Set to "none" to disable caching or "1h" to cache for one hour.
+        See the Caching section for details.
     reasoning
         Determines how many tokens Claude can be allocated to reasoning. Must be
         ≥1024 and less than `max_tokens`. Larger budgets can enable more
```
````diff
@@ -182,6 +188,46 @@ def ChatAnthropic(
     ```shell
     export ANTHROPIC_API_KEY=...
     ```
+
+    Caching
+    -------
+
+    Caching with Claude is a bit more complicated than with other providers,
+    but we believe that on average it will save you both money and time, so
+    we have enabled it by default. With other providers, like OpenAI and
+    Google, you only pay for cache reads, which cost 10% of the normal price.
+    With Claude, you also pay for cache writes, which cost 125% of the normal
+    price for 5-minute caching and 200% of the normal price for 1-hour
+    caching.
+
+    How does this affect the total cost of a conversation? Imagine the first
+    turn sends 1000 input tokens and receives 200 output tokens. The second
+    turn must first send both the input and output from the previous turn
+    (1200 tokens). It then sends a further 1000 tokens and receives 200
+    tokens back.
+
+    To compare the prices of these two approaches we can ignore the cost of
+    output tokens, because it is the same for both. How much will the input
+    tokens cost? If we don't use caching, we send 1000 tokens in the first
+    turn and 2200 (1000 + 200 + 1000) tokens in the second turn, for a total
+    of 3200 tokens. If we use caching, we send (the equivalent of)
+    1000 * 1.25 = 1250 tokens in the first turn. In the second turn, 1000 of
+    the input tokens will be cached, so the cost is
+    1000 * 0.1 + (200 + 1000) * 1.25 = 1600 tokens. That makes a total of
+    2850 tokens, i.e. 11% fewer tokens, decreasing the overall cost.
+
+    Obviously, the details will vary from conversation to conversation, but
+    if you have a large system prompt that you re-use many times, you should
+    expect to see larger savings. You can see exactly how many input and
+    cached input tokens each turn uses, along with the total cost, with
+    `chat.get_tokens()`. If you don't see savings for your use case, you can
+    suppress caching with `cache="none"`.
+
+    Note: Claude only caches longer prompts, with caching requiring at least
+    1024-4096 tokens, depending on the model. So don't be surprised if you
+    don't see any difference from caching if you have a short prompt.
+
+    See all the details at
+    <https://docs.claude.com/en/docs/build-with-claude/prompt-caching>.
     """
 
     if model is None:
````
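As a sanity check on the arithmetic in that docstring, here is a small self-contained sketch (plain Python, not part of this commit) that reproduces the 3200-vs-2850 comparison:

```python
# Reproduces the docstring's example: input-token cost of a two-turn chat,
# with and without 5-minute caching (write multiplier 1.25, read 0.1).

def effective_input_tokens(cached: bool) -> float:
    turn1 = 1000             # first turn sends 1000 fresh input tokens
    turn2_prev = 1000 + 200  # second turn resends turn 1's input + output
    turn2_new = 1000         # ...plus 1000 new input tokens
    if not cached:
        return turn1 + turn2_prev + turn2_new  # 3200
    write_mult, read_mult = 1.25, 0.1
    first = turn1 * write_mult                                    # 1250 (cache write)
    second = turn1 * read_mult + (200 + turn2_new) * write_mult   # 100 + 1500
    return first + second                                         # 2850

assert effective_input_tokens(cached=False) == 3200
assert effective_input_tokens(cached=True) == 2850  # ~11% fewer
```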
```diff
@@ -198,6 +244,7 @@ def ChatAnthropic(
             api_key=api_key,
             model=model,
             max_tokens=max_tokens,
+            cache=cache,
             kwargs=kwargs,
         ),
         system_prompt=system_prompt,
```
```diff
@@ -215,6 +262,7 @@ def __init__(
         model: str,
         api_key: Optional[str] = None,
         name: str = "Anthropic",
+        cache: Literal["5m", "1h", "none"] = "5m",
         kwargs: Optional["ChatClientArgs"] = None,
     ):
         super().__init__(name=name, model=model)
```
```diff
@@ -226,6 +274,7 @@ def __init__(
                 "You can install it with 'pip install anthropic'."
             )
         self._max_tokens = max_tokens
+        self._cache: Literal["5m", "1h", "none"] = cache
 
         kwargs_full: "ChatClientArgs" = {
             "api_key": api_key,
```
```diff
@@ -385,7 +434,13 @@ def _structured_tool_call(**kwargs: Any):
 
         if "system" not in kwargs_full:
             if len(turns) > 0 and turns[0].role == "system":
-                kwargs_full["system"] = turns[0].text
+                sys_param: "TextBlockParam" = {
+                    "type": "text",
+                    "text": turns[0].text,
+                }
+                if self._cache_control():
+                    sys_param["cache_control"] = self._cache_control()
+                kwargs_full["system"] = [sys_param]
 
         return kwargs_full
```

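In other words, with the default `cache="5m"` the system prompt is now sent as a list containing one text block that carries a `cache_control` marker, rather than a bare string. An illustrative sketch (not code from this commit; the prompt text is hypothetical):

```python
# Illustrative: shape of the "system" parameter after this change,
# assuming cache="5m" and a hypothetical system prompt.
kwargs_full = {
    "system": [
        {
            "type": "text",
            "text": "You are a terse assistant.",  # hypothetical prompt
            "cache_control": {"type": "ephemeral", "ttl": "5m"},
        }
    ]
}
```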
```diff
@@ -447,11 +502,16 @@ def value_turn(self, completion, has_data_model) -> Turn:
 
     def value_tokens(self, completion):
         usage = completion.usage
-        # N.B. Currently, Anthropic doesn't cache by default and we currently do not support
-        # manual caching in chatlas. Note also that this only tracks reads, NOT writes, which
-        # have their own cost. To track that properly, we would need another caching category and per-token cost.
+        input_tokens = completion.usage.input_tokens
+
+        # Account for cache writes by adjusting input tokens.
+        # Cache writes cost 125% for 5m and 200% for 1h:
+        # https://docs.claude.com/en/docs/build-with-claude/prompt-caching
+        cache_input = usage.cache_creation_input_tokens or 0
+        cache_mult = 2.0 if self._cache == "1h" else 1.25
+
         return (
-            completion.usage.input_tokens,
+            input_tokens + int(cache_input * cache_mult),
             completion.usage.output_tokens,
             usage.cache_read_input_tokens if usage.cache_read_input_tokens else 0,
         )
```
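A standalone restatement of that adjustment, using hypothetical usage numbers (the real method reads them from `completion.usage`):

```python
# Hypothetical usage for one turn with the default "5m" cache:
input_tokens = 150             # uncached input tokens this turn
cache_creation_tokens = 1000   # tokens written to the cache this turn
cache_read_tokens = 0          # tokens read from the cache this turn

cache_mult = 1.25  # would be 2.0 if the TTL were "1h"
effective_input = input_tokens + int(cache_creation_tokens * cache_mult)
assert effective_input == 1400  # 150 fresh + 1250 write-adjusted tokens
```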
```diff
@@ -539,13 +599,21 @@ def supported_model_params(self) -> set[StandardModelParamNames]:
 
     def _as_message_params(self, turns: list[Turn]) -> list["MessageParam"]:
         messages: list["MessageParam"] = []
-        for turn in turns:
+        for i, turn in enumerate(turns):
             if turn.role == "system":
                 continue  # system prompt passed as separate arg
             if turn.role not in ["user", "assistant"]:
                 raise ValueError(f"Unknown role {turn.role}")
 
             content = [self._as_content_block(c) for c in turn.contents]
+
+            # Add cache control to the last content block in the last turn
+            # https://docs.claude.com/en/docs/build-with-claude/prompt-caching#how-automatic-prefix-checking-works
+            is_last_turn = i == len(turns) - 1
+            if is_last_turn and len(content) > 0:
+                if self._cache_control():
+                    content[-1]["cache_control"] = self._cache_control()
+
             role = "user" if turn.role == "user" else "assistant"
             messages.append({"role": role, "content": content})
         return messages
```
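So the marker lands on the final content block of the final turn, which is how Anthropic's automatic prefix checking caches everything before it. A two-turn request might then serialize roughly like this (illustrative values, assuming `cache="5m"`):

```python
# Illustrative messages payload; only the last block of the last turn
# carries cache_control, marking the end of the cacheable prefix.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "First question"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "First answer"}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Follow-up question",
                "cache_control": {"type": "ephemeral", "ttl": "5m"},
            }
        ],
    },
]
```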
```diff
@@ -787,11 +855,20 @@ def batch_result_turn(self, result, has_data_model: bool = False) -> Turn | None
         message = result.result.message
         return self._as_turn(message, has_data_model)
 
+    def _cache_control(self) -> "Optional[CacheControlEphemeralParam]":
+        if self._cache == "none":
+            return None
+        return {
+            "type": "ephemeral",
+            "ttl": self._cache,
+        }
+
 
 def ChatBedrockAnthropic(
     *,
     model: Optional[str] = None,
     max_tokens: int = 4096,
+    cache: Literal["5m", "1h", "none"] = "5m",
     aws_secret_key: Optional[str] = None,
     aws_access_key: Optional[str] = None,
     aws_region: Optional[str] = None,
```
```diff
@@ -847,6 +924,10 @@ def ChatBedrockAnthropic(
         The model to use for the chat.
     max_tokens
         Maximum number of tokens to generate before stopping.
+    cache
+        How long to cache inputs? Defaults to "5m" (five minutes).
+        Set to "none" to disable caching or "1h" to cache for one hour.
+        See the Caching section of `ChatAnthropic` for details.
     aws_secret_key
         The AWS secret key to use for authentication.
     aws_access_key
```
```diff
@@ -928,6 +1009,7 @@ def ChatBedrockAnthropic(
         provider=AnthropicBedrockProvider(
             model=model,
             max_tokens=max_tokens,
+            cache=cache,
             aws_secret_key=aws_secret_key,
             aws_access_key=aws_access_key,
             aws_region=aws_region,
```
```diff
@@ -951,11 +1033,17 @@ def __init__(
         aws_profile: str | None,
         aws_session_token: str | None,
         max_tokens: int = 4096,
+        cache: Literal["5m", "1h", "none"] = "5m",
         base_url: str | None,
         name: str = "AWS/Bedrock",
         kwargs: Optional["ChatBedrockClientArgs"] = None,
     ):
-        super().__init__(name=name, model=model, max_tokens=max_tokens)
+        super().__init__(
+            name=name,
+            model=model,
+            max_tokens=max_tokens,
+            cache=cache,
+        )
 
         try:
             from anthropic import AnthropicBedrock, AsyncAnthropicBedrock
```
