
Conversation

ggerganov
Member

cont #16382
fix #16416

Create context checkpoints shortly before processing the full prompt. This allows recurrent models to make better use of the context checkpoints.
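
For reference, the number of checkpoints the server keeps is configurable at launch. A minimal sketch follows; the --ctx-checkpoints flag name is inferred from the LLAMA_ARG_CTX_CHECKPOINTS environment variable used in the test below, and the model path is a placeholder:

# Launch sketch (flag name inferred, not taken from this PR):
# keep up to 32 context checkpoints so recurrent/hybrid models have
# more positions they can restore from instead of reprocessing the prompt.
./build/bin/llama-server -m ./models/model.gguf --ctx-checkpoints 32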

@ddh0
Contributor

ddh0 commented Oct 7, 2025

I've done some testing of this branch (at 31046d48) compared to master (at 3df2244), and this branch is consistently better. On master it has to re-process the two most recent messages; on this branch it only needs to process new messages. If you click 🔄 re-generate, it only has to process 64 tokens, which is very quick.

Here is the full launch command I used (same for both runs):

exportv() { export "$1"="$2"; echo "env: $1 = $2"; }
exportv LLAMA_ARG_THREADS 8
exportv LLAMA_LOG_COLORS auto
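# number of context checkpoints to keep (cf. #16382 and this PR)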
exportv LLAMA_ARG_CTX_CHECKPOINTS 32
exportv LLAMA_ARG_CONTEXT_SHIFT 0
exportv LLAMA_ARG_CACHE_REUSE 0
exportv LLAMA_ARG_THREADS_HTTP $(nproc)
exportv LLAMA_ARG_ALIAS llama
exportv LLAMA_ARG_SSL_KEY_FILE ~/keys/server.key
exportv LLAMA_ARG_SSL_CERT_FILE ~/keys/server.crt
exportv LLAMA_API_KEY 89572
exportv LLAMA_ARG_HOST 0.0.0.0
exportv LLAMA_ARG_PORT 12800
exportv LLAMA_ARG_NO_MMAP 1
exportv LLAMA_ARG_MLOCK 0
exportv LLAMA_ARG_CHAT_TEMPLATE_FILE ~/jinja/jamba-1.7-fixed-chat-template.jinja # attached below
#exportv LLAMA_CHAT_TEMPLATE_KWARGS "{\"enable_thinking\": false}"
exportv LLAMA_ARG_JINJA 1
exportv LLAMA_ARG_MODEL ~/gguf/AI21-Jamba-Mini-1.7-Q8_0.gguf
exportv LLAMA_ARG_FLASH_ATTN auto
exportv LLAMA_ARG_CTX_SIZE 262144
exportv LLAMA_ARG_N_PARALLEL 1
exportv LLAMA_ARG_BATCH 2048
exportv LLAMA_ARG_UBATCH 2048
exportv LLAMA_ARG_N_GPU_LAYERS 999
time ~/llama.cpp/build/bin/llama-server -ot "ffn_(?:up|gate|down)_exps=CPU"

And here are the server logs from each run:
jamba-mini-master-3df2244d-fixed-jinja.log.txt
jamba-mini-gg-server-checkpoints-improve-31046d4-fixed-jinja.log.txt


One of the issues with context checkpoints in general is that the chat templates for some models include special logic that invalidates the created checkpoints at every turn. This happens with the built-in jinja templates of several models (if you use --jinja); Jamba 1.7 and Nemotron Nano 12B v2, both discussed below, are two examples.

For Jamba 1.7 specifically, I created this simplified version of the jinja template which doesn't break checkpoints. I also think that it slightly improves the quality of the output, but this is subjective and I haven't tested it rigorously.
jamba-1.7-fixed-chat-template.jinja.txt

Jamba 1.7's official jinja template (found here) is very messy and has a lot of extra whitespace.

@ggerganov
Member Author

One of the issues with context checkpoints in general is that the chat templates for some models include special logic that invalidates the created checkpoints at every turn.

Could you give an example of how the checkpoints are invalidated?

@ddh0
Contributor

ddh0 commented Oct 7, 2025

For example, as reported here: #16416 (comment)

In that case, the chat template for NVIDIA Nemotron 12B v2 looks like this:

main: chat template, chat_template: {%- set ns = namespace(enable_thinking=true) %}{%- for message in messages -%}{%- set content = message['content'] -%}{%- if message['role'] == 'user' or message['role'] == 'system' -%}{%- if '/think' in content -%}{%- set ns.enable_thinking = true -%}{%- elif '/no_think' in content -%}{%- set ns.enable_thinking = false -%}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if messages[0]['role'] != 'system' -%}{%- set ns.non_tool_system_content = '' -%}{{- '<SPECIAL_10>System
' -}}{%- else -%}{%- set ns.non_tool_system_content = messages[0]['content'].replace('/think', '').replace('/no_think', '').strip() -%}{{- '<SPECIAL_10>System
' + ns.non_tool_system_content }}{%- endif -%}{%- if tools -%}{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}{{- '

' -}}{%- endif -%}{{- 'You can use the following tools to assist the user if required:' -}}{{- '
<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>

' -}}{{- 'If you decide to call any tool(s), use the following format:
' -}}{{- '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, ' -}}{{- '{{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>

' -}}{{- 'The user will execute tool-calls and return responses from tool(s) in this format:
' -}}{{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>

' -}}{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}{%- endif -%}{{- '
' -}}{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}{%- if messages[-1]['role'] == 'assistant' -%}{%- set ns.last_turn_assistant_content = messages[-1]['content'].strip() -%}{%- set messages = messages[:-1] -%}{%- endif -%}{%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'user' -%}{{- '<SPECIAL_11>User
' + content.replace('/think', '').replace('/no_think', '').strip() + '
' }}{%- elif message['role'] == 'tool' -%}{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}{{- '<SPECIAL_11>User
' + '<TOOL_RESPONSE>[' }}{%- endif -%}{{- message['content'] -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}{{- ']</TOOL_RESPONSE>
' -}}{%- endif -%}{%- elif message['role'] == 'assistant' -%}{%- if '</think>' in content -%}{%- set content = content.split('</think>')[1].strip() %}{%- endif -%}{{- '<SPECIAL_11>Assistant
' + content.strip() }}{%- if message.tool_calls -%}{%- if content.strip() != '' -%}{{- '

' -}}{%- endif -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{"name": "' + fn.name + '", "arguments": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<SPECIAL_11>Assistant
' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>
' -}}{%- endif -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- ns.last_turn_assistant_content -}}{%- endif -%}{%- else -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- '<SPECIAL_11>Assistant
' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>
' -}}{%- endif -%}{{- ns.last_turn_assistant_content -}}{%- if continue_final_message is defined -%}{%- if continue_final_message is false -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- else -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- endif -%}{%- endif -%}, example_format: '<SPECIAL_10>System
You are a helpful assistant
<SPECIAL_11>User
Hello
<SPECIAL_11>Assistant
Hi there
<SPECIAL_12>
<SPECIAL_11>User
How are you?
<SPECIAL_11>Assistant
<think>
'

It's hard to tell exactly which part of the template is the culprit. My guess is that it's the logic that rewrites earlier messages between requests (in particular, stripping the `</think>` section from previous assistant turns), since that changes the rendered prefix and therefore the cached tokens. In any case, I found it easier to re-create a simpler jinja template from scratch that mimics the behaviour of the official template (but without some of the additional features, e.g. tool calling).

@Vaasref

Vaasref commented Oct 7, 2025

Hi, saving the checkpoint before the last 64 tokens works well in some situations. But why not use the batch size value, or add a new setting, to allow some control over it?

With only 64 tokens, no RAG can be used at the end of the context.

@ddh0
Contributor

ddh0 commented Oct 7, 2025

Here are the logs from a short multi-turn session with Nemotron Nano 12B v2, using its default jinja template. The context checkpoints are always invalid.

nemotron-default-jinja-context-reprocessing.log.txt

@ggerganov
Member Author

With only 64 tokens, no RAG can be used at the end of the context.

@Vaasref Technically, you can force the creation of a checkpoint before the RAG content by first sending a request without the RAG and with n_predict = 0. After that, you submit the full prompt, including the RAG, as usual.
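
A rough sketch of that two-step flow using curl against the server's /completion endpoint (the URL, the placeholder prompt strings, and the explicit cache_prompt flag are illustrative assumptions, not taken from the runs above):

# 1) Warm-up request: send only the stable prefix (no RAG) and generate nothing.
#    The server processes the prefix and can place a checkpoint near its end.
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' \
  -d '{"prompt": "<system prompt + conversation so far>", "n_predict": 0, "cache_prompt": true}'

# 2) Real request: the same prefix with the RAG content appended.
#    Only the RAG tokens (and whatever follows them) should need processing now.
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' \
  -d '{"prompt": "<system prompt + conversation so far><retrieved documents>", "n_predict": 128, "cache_prompt": true}'

Because the checkpoint from the first request sits just before where the RAG content goes, changing the retrieved documents between turns should only require reprocessing from that checkpoint onwards.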

I am open to suggestions about how to improve the checkpointing logic, so happy to hear ideas.

@ggerganov
Member Author

Here are the logs from a short multi-turn session with Nemotron Nano 12B v2, using its default jinja template. The context checkpoints are always invalid.

@ddh0 This is fixed in this PR. I think you are using master:

build: 6700 (3df2244d) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

@ddh0
Contributor

ddh0 commented Oct 7, 2025

This is fixed in this PR. I think you are using master

Oops, you are right. It is fixed with this PR.

ggerganov force-pushed the gg/server-checkpoints-improve branch from 31046d4 to f5f0e8e on October 7, 2025
ggerganov merged commit 7fdd16b into master on October 8, 2025
68 checks passed
ggerganov deleted the gg/server-checkpoints-improve branch on October 8, 2025
anyshu pushed a commit to anyshu/llama.cpp that referenced this pull request Oct 10, 2025
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...
Development

Successfully merging this pull request may close these issues.

Eval bug: Nemotron-H-47B-Reasoning always reprocesses prompt (even after #16382)