
Conversation

ggerganov
Member

cont #16382
fix #16416

Create context checkpoints shortly before processing the full prompt. This allows recurrent models to make better use of the context checkpoints.
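
For reference, the number of checkpoints the server keeps is configurable at launch. A minimal sketch follows; the --ctx-checkpoints flag name is inferred from the LLAMA_ARG_CTX_CHECKPOINTS environment variable used in the test below, and the model path is a placeholder:

# Launch sketch (flag name inferred, not taken from this PR):
# keep up to 32 context checkpoints so recurrent/hybrid models have
# more positions they can restore from instead of reprocessing the prompt.
./build/bin/llama-server -m ./models/model.gguf --ctx-checkpoints 32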

@ddh0
Contributor

ddh0 commented Oct 7, 2025

I've done some testing of this branch (at 31046d48) compared to master (at 3df2244), and this branch is consistently better. On master it has to re-process the two most recent messages; on this branch it only needs to process new messages. If you click 🔄 re-generate, it only has to process 64 tokens, which is very quick.

Here is the full launch command I used (same for both runs):

exportv() { export "$1"="$2"; echo "env: $1 = $2"; }
exportv LLAMA_ARG_THREADS 8
exportv LLAMA_LOG_COLORS auto
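# number of context checkpoints to keep (cf. #16382 and this PR)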
exportv LLAMA_ARG_CTX_CHECKPOINTS 32
exportv LLAMA_ARG_CONTEXT_SHIFT 0
exportv LLAMA_ARG_CACHE_REUSE 0
exportv LLAMA_ARG_THREADS_HTTP $(nproc)
exportv LLAMA_ARG_ALIAS llama
exportv LLAMA_ARG_SSL_KEY_FILE ~/keys/server.key
exportv LLAMA_ARG_SSL_CERT_FILE ~/keys/server.crt
exportv LLAMA_API_KEY 89572
exportv LLAMA_ARG_HOST 0.0.0.0
exportv LLAMA_ARG_PORT 12800
exportv LLAMA_ARG_NO_MMAP 1
exportv LLAMA_ARG_MLOCK 0
exportv LLAMA_ARG_CHAT_TEMPLATE_FILE ~/jinja/jamba-1.7-fixed-chat-template.jinja # attached below
#exportv LLAMA_CHAT_TEMPLATE_KWARGS "{\"enable_thinking\": false}"
exportv LLAMA_ARG_JINJA 1
exportv LLAMA_ARG_MODEL ~/gguf/AI21-Jamba-Mini-1.7-Q8_0.gguf
exportv LLAMA_ARG_FLASH_ATTN auto
exportv LLAMA_ARG_CTX_SIZE 262144
exportv LLAMA_ARG_N_PARALLEL 1
exportv LLAMA_ARG_BATCH 2048
exportv LLAMA_ARG_UBATCH 2048
exportv LLAMA_ARG_N_GPU_LAYERS 999
time ~/llama.cpp/build/bin/llama-server -ot "ffn_(?:up|gate|down)_exps=CPU"

And here are the server logs from each run:
jamba-mini-master-3df2244d-fixed-jinja.log.txt
jamba-mini-gg-server-checkpoints-improve-31046d4-fixed-jinja.log.txt


One of the issues with context checkpoints in general is that the chat templates for some models include special logic that invalidates the created checkpoints at every turn. This happens with the built-in jinja templates of several models (if you use --jinja); Jamba 1.7 and Nemotron Nano 12B v2, both discussed below, are two examples.

For Jamba 1.7 specifically, I created this simplified version of the jinja template which doesn't break checkpoints. I also think that it slightly improves the quality of the output, but this is subjective and I haven't tested it rigorously.
jamba-1.7-fixed-chat-template.jinja.txt

Jamba 1.7's official jinja template (found here) is very messy and has a lot of extra whitespace.

@ggerganov
Member Author

One of the issues with context checkpoints in general is that the chat templates for some models include special logic that invalidates the created checkpoints at every turn.

Could you give an example of how the checkpoints are invalidated?

@ddh0
Contributor

ddh0 commented Oct 7, 2025

For example, as reported here: #16416 (comment)

In that case, the chat template for NVIDIA Nemotron 12B v2 looks like this:

main: chat template, chat_template: {%- set ns = namespace(enable_thinking=true) %}{%- for message in messages -%}{%- set content = message['content'] -%}{%- if message['role'] == 'user' or message['role'] == 'system' -%}{%- if '/think' in content -%}{%- set ns.enable_thinking = true -%}{%- elif '/no_think' in content -%}{%- set ns.enable_thinking = false -%}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if messages[0]['role'] != 'system' -%}{%- set ns.non_tool_system_content = '' -%}{{- '<SPECIAL_10>System
' -}}{%- else -%}{%- set ns.non_tool_system_content = messages[0]['content'].replace('/think', '').replace('/no_think', '').strip() -%}{{- '<SPECIAL_10>System
' + ns.non_tool_system_content }}{%- endif -%}{%- if tools -%}{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}{{- '

' -}}{%- endif -%}{{- 'You can use the following tools to assist the user if required:' -}}{{- '
<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>

' -}}{{- 'If you decide to call any tool(s), use the following format:
' -}}{{- '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, ' -}}{{- '{{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>

' -}}{{- 'The user will execute tool-calls and return responses from tool(s) in this format:
' -}}{{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>

' -}}{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}{%- endif -%}{{- '
' -}}{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}{%- if messages[-1]['role'] == 'assistant' -%}{%- set ns.last_turn_assistant_content = messages[-1]['content'].strip() -%}{%- set messages = messages[:-1] -%}{%- endif -%}{%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'user' -%}{{- '<SPECIAL_11>User
' + content.replace('/think', '').replace('/no_think', '').strip() + '
' }}{%- elif message['role'] == 'tool' -%}{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}{{- '<SPECIAL_11>User
' + '<TOOL_RESPONSE>[' }}{%- endif -%}{{- message['content'] -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}{{- ']</TOOL_RESPONSE>
' -}}{%- endif -%}{%- elif message['role'] == 'assistant' -%}{%- if '</think>' in content -%}{%- set content = content.split('</think>')[1].strip() %}{%- endif -%}{{- '<SPECIAL_11>Assistant
' + content.strip() }}{%- if message.tool_calls -%}{%- if content.strip() != '' -%}{{- '

' -}}{%- endif -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{"name": "' + fn.name + '", "arguments": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<SPECIAL_11>Assistant
' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>
' -}}{%- endif -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- ns.last_turn_assistant_content -}}{%- endif -%}{%- else -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- '<SPECIAL_11>Assistant
' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>
' -}}{%- endif -%}{{- ns.last_turn_assistant_content -}}{%- if continue_final_message is defined -%}{%- if continue_final_message is false -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- else -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- endif -%}{%- endif -%}, example_format: '<SPECIAL_10>System
You are a helpful assistant
<SPECIAL_11>User
Hello
<SPECIAL_11>Assistant
Hi there
<SPECIAL_12>
<SPECIAL_11>User
How are you?
<SPECIAL_11>Assistant
<think>
'

It's hard to tell exactly which part of the template is the culprit. My guess is that it's the logic that rewrites earlier messages between requests (in particular, stripping the `</think>` section from previous assistant turns), since that changes the rendered prefix and therefore the cached tokens. In any case, I found it easier to re-create a simpler jinja template from scratch that mimics the behaviour of the official template (but without some of the additional features, e.g. tool calling).

@Vaasref

Vaasref commented Oct 7, 2025

Hi, saving the checkpoint before the last 64 tokens works well in some situations. But why not use the batch size value, or add a new setting, to allow some control over it?

With only 64 tokens, no RAG can be used at the end of the context.

@ddh0
Contributor

ddh0 commented Oct 7, 2025

Here are the logs from a short multi-turn session with Nemotron Nano 12B v2, using its default jinja template. The context checkpoints are always invalid.

nemotron-default-jinja-context-reprocessing.log.txt

@ggerganov
Member Author

With only 64 tokens, no RAG can be used at the end of the context.

@Vaasref Technically, you can force the creation of a checkpoint before the RAG content by first sending a request without the RAG and with n_predict = 0. After that, you submit the full prompt, including the RAG, as usual.
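
A rough sketch of that two-step flow using curl against the server's /completion endpoint (the URL, the placeholder prompt strings, and the explicit cache_prompt flag are illustrative assumptions, not taken from the runs above):

# 1) Warm-up request: send only the stable prefix (no RAG) and generate nothing.
#    The server processes the prefix and can place a checkpoint near its end.
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' \
  -d '{"prompt": "<system prompt + conversation so far>", "n_predict": 0, "cache_prompt": true}'

# 2) Real request: the same prefix with the RAG content appended.
#    Only the RAG tokens (and whatever follows them) should need processing now.
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' \
  -d '{"prompt": "<system prompt + conversation so far><retrieved documents>", "n_predict": 128, "cache_prompt": true}'

Because the checkpoint from the first request sits just before where the RAG content goes, changing the retrieved documents between turns should only require reprocessing from that checkpoint onwards.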

I am open to suggestions about how to improve the checkpointing logic, so happy to hear ideas.

@ggerganov
Member Author

Here are the logs from a short multi-turn session with Nemotron Nano 12B v2, using its default jinja template. The context checkpoints are always invalid.

@ddh0 This is fixed in this PR. I think you are using master:

build: 6700 (3df2244d) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

@ddh0
Contributor

ddh0 commented Oct 7, 2025

This is fixed in this PR. I think you are using master

Oops, you are right. It is fixed with this PR.

ggerganov force-pushed the gg/server-checkpoints-improve branch from 31046d4 to f5f0e8e on October 7, 2025
ggerganov merged commit 7fdd16b into master on October 8, 2025
68 checks passed
ggerganov deleted the gg/server-checkpoints-improve branch on October 8, 2025
anyshu pushed a commit to anyshu/llama.cpp that referenced this pull request Oct 10, 2025
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...
Development

Successfully merging this pull request may close these issues.

Eval bug: Nemotron-H-47B-Reasoning always reprocesses prompt (even after #16382)