server : improve context checkpoint logic #16440
I've done some testing of this branch (at […]). Here is the full launch command I used (same for both runs):

```sh
exportv() { export "$1"="$2"; echo "env: $1 = $2"; }
exportv LLAMA_ARG_THREADS 8
exportv LLAMA_LOG_COLORS auto
exportv LLAMA_ARG_CTX_CHECKPOINTS 32
exportv LLAMA_ARG_CONTEXT_SHIFT 0
exportv LLAMA_ARG_CACHE_REUSE 0
exportv LLAMA_ARG_THREADS_HTTP $(nproc)
exportv LLAMA_ARG_ALIAS llama
exportv LLAMA_ARG_SSL_KEY_FILE ~/keys/server.key
exportv LLAMA_ARG_SSL_CERT_FILE ~/keys/server.crt
exportv LLAMA_API_KEY 89572
exportv LLAMA_ARG_HOST 0.0.0.0
exportv LLAMA_ARG_PORT 12800
exportv LLAMA_ARG_NO_MMAP 1
exportv LLAMA_ARG_MLOCK 0
exportv LLAMA_ARG_CHAT_TEMPLATE_FILE ~/jinja/jamba-1.7-fixed-chat-template.jinja # attached below
#exportv LLAMA_CHAT_TEMPLATE_KWARGS "{\"enable_thinking\": false}"
exportv LLAMA_ARG_JINJA 1
exportv LLAMA_ARG_MODEL ~/gguf/AI21-Jamba-Mini-1.7-Q8_0.gguf
exportv LLAMA_ARG_FLASH_ATTN auto
exportv LLAMA_ARG_CTX_SIZE 262144
exportv LLAMA_ARG_N_PARALLEL 1
exportv LLAMA_ARG_BATCH 2048
exportv LLAMA_ARG_UBATCH 2048
exportv LLAMA_ARG_N_GPU_LAYERS 999
time ~/llama.cpp/build/bin/llama-server -ot "ffn_(?:up|gate|down)_exps=CPU"
```

And here are the server logs from each run: […]

One of the issues with context checkpoints in general is that the chat templates for some models include special logic that invalidates the created checkpoints at every turn. This happens with the built-in jinja templates for the following models (if you use […]).
For Jamba 1.7 specifically, I created this simplified version of the jinja template, which doesn't break checkpoints. I also think it slightly improves the quality of the output, but this is subjective and I haven't tested it rigorously. Jamba 1.7's official jinja template (found here) is very messy and has a lot of extra whitespace.
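The property a checkpoint-friendly template needs is mechanical: rendering the conversation after a new turn must produce a string that starts with the previous render. Here is a minimal sketch of that check using the `jinja2` Python package (the toy template is illustrative, not the simplified Jamba template itself):

```python
# A "prefix-stable" template renders each message from that message alone,
# so extending the conversation only appends text at the end. Illustrative
# toy template, not the actual simplified Jamba 1.7 template.
from jinja2 import Template

STABLE = Template(
    "{% for m in messages %}<|{{ m['role'] }}|>{{ m['content'] }}\n{% endfor %}"
)

turn1 = [{"role": "user", "content": "Hello"}]
turn2 = turn1 + [{"role": "assistant", "content": "Hi there"},
                 {"role": "user", "content": "How are you?"}]

r1, r2 = STABLE.render(messages=turn1), STABLE.render(messages=turn2)
assert r2.startswith(r1)  # old prefix still valid -> checkpoints survive the turn
```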
Could you give an example of how the checkpoints are invalidated?
For example, as reported here: #16416 (comment). In that case, the chat template for NVIDIA Nemotron 12B v2 looks like this:

```
main: chat template, chat_template: {%- set ns = namespace(enable_thinking=true) %}{%- for message in messages -%}{%- set content = message['content'] -%}{%- if message['role'] == 'user' or message['role'] == 'system' -%}{%- if '/think' in content -%}{%- set ns.enable_thinking = true -%}{%- elif '/no_think' in content -%}{%- set ns.enable_thinking = false -%}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if messages[0]['role'] != 'system' -%}{%- set ns.non_tool_system_content = '' -%}{{- '<SPECIAL_10>System
' -}}{%- else -%}{%- set ns.non_tool_system_content = messages[0]['content'].replace('/think', '').replace('/no_think', '').strip() -%}{{- '<SPECIAL_10>System
' + ns.non_tool_system_content }}{%- endif -%}{%- if tools -%}{%- if ns.non_tool_system_content is defined and ns.non_tool_system_content != '' -%}{{- '
' -}}{%- endif -%}{{- 'You can use the following tools to assist the user if required:' -}}{{- '
<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>
' -}}{{- 'If you decide to call any tool(s), use the following format:
' -}}{{- '<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, ' -}}{{- '{{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>
' -}}{{- 'The user will execute tool-calls and return responses from tool(s) in this format:
' -}}{{- '<TOOL_RESPONSE>[{{"tool_response1"}}, {{"tool_response2"}}]</TOOL_RESPONSE>
' -}}{{- 'Based on the tool responses, you can call additional tools if needed, correct tool calls if any errors are found, or just respond to the user.' -}}{%- endif -%}{{- '
' -}}{%- set messages = messages[1:] if messages[0]['role'] == 'system' else messages -%}{%- if messages[-1]['role'] == 'assistant' -%}{%- set ns.last_turn_assistant_content = messages[-1]['content'].strip() -%}{%- set messages = messages[:-1] -%}{%- endif -%}{%- for message in messages %}{%- set content = message['content'] %}{%- if message['role'] == 'user' -%}{{- '<SPECIAL_11>User
' + content.replace('/think', '').replace('/no_think', '').strip() + '
' }}{%- elif message['role'] == 'tool' -%}{%- if loop.first or (messages[loop.index0 - 1].role != 'tool') -%}{{- '<SPECIAL_11>User
' + '<TOOL_RESPONSE>[' }}{%- endif -%}{{- message['content'] -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == 'tool') else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != 'tool') -%}{{- ']</TOOL_RESPONSE>
' -}}{%- endif -%}{%- elif message['role'] == 'assistant' -%}{%- if '</think>' in content -%}{%- set content = content.split('</think>')[1].strip() %}{%- endif -%}{{- '<SPECIAL_11>Assistant
' + content.strip() }}{%- if message.tool_calls -%}{%- if content.strip() != '' -%}{{- '
' -}}{%- endif -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{"name": "' + fn.name + '", "arguments": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- '<SPECIAL_11>Assistant
' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>
' -}}{%- endif -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- ns.last_turn_assistant_content -}}{%- endif -%}{%- else -%}{%- if ns.last_turn_assistant_content is defined and ns.last_turn_assistant_content != '' -%}{{- '<SPECIAL_11>Assistant
' -}}{%- if ns.enable_thinking is defined and ns.enable_thinking is false -%}{{- '<think></think>' -}}{%- else -%}{{- '<think>
' -}}{%- endif -%}{{- ns.last_turn_assistant_content -}}{%- if continue_final_message is defined -%}{%- if continue_final_message is false -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- else -%}{{- '
<SPECIAL_12>
' -}}{%- endif -%}{%- endif -%}{%- endif -%}, example_format: '<SPECIAL_10>System
You are a helpful assistant
<SPECIAL_11>User
Hello
<SPECIAL_11>Assistant
Hi there
<SPECIAL_12>
<SPECIAL_11>User
How are you?
<SPECIAL_11>Assistant
<think>
'
```

It's hard to tell which part of the template is the culprit, exactly, so I found it's easier to re-create a simpler jinja template from scratch that mimics the behaviour of the official template (but without some additional features, e.g. tool calling).
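To make the failure mode concrete, here is a small self-contained demonstration (my own sketch with the `jinja2` Python package, mimicking only the `/no_think`-scanning loop from the template above):

```python
# Toy reproduction of the checkpoint-breaking pattern: the template scans
# ALL messages for '/no_think', so a marker in a NEW turn changes how the
# OLD turns are rendered, invalidating any checkpoint of the old prefix.
from jinja2 import Template

TOY = Template(
    "{% set ns = namespace(think=true) %}"
    "{% for m in messages %}"
    "{% if '/no_think' in m['content'] %}{% set ns.think = false %}{% endif %}"
    "{% endfor %}"
    "{% for m in messages %}"
    "<{{ m['role'] }}>{{ '<think>' if ns.think else '' }}{{ m['content'] }}\n"
    "{% endfor %}"
)

turn1 = [{"role": "user", "content": "Hello"}]
turn2 = turn1 + [{"role": "assistant", "content": "Hi there"},
                 {"role": "user", "content": "Summarize that /no_think"}]

r1, r2 = TOY.render(messages=turn1), TOY.render(messages=turn2)
print(r2.startswith(r1))  # False: turn 1 now renders without '<think>',
                          # so the cached/checkpointed prefix cannot be reused
```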
Hi, saving the checkpoint before the last 64 tokens works well in some situations. But why not use the value of the batch size, or create a new setting to allow some control over it? With only 64 tokens, no RAG can be used at the end of the context.
Here are the logs from a short multi-turn session with Nemotron Nano 12B v2, using its default jinja template. The context checkpoints are always invalid.
@Vaasref Technically, you can force the creation of a checkpoint before the RAG by first sending a request without the RAG. I am open to suggestions about how to improve the checkpointing logic, so I'm happy to hear ideas.
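As a concrete sketch of that workaround (hypothetical client code against the OpenAI-compatible `/v1/chat/completions` endpoint; the host, port, and API key mirror the launch command earlier in the thread):

```python
# Send the conversation once WITHOUT the RAG chunk so the server processes
# (and can checkpoint) the stable prefix, then send the real request with
# the retrieved context appended. Hypothetical client code; endpoint, port,
# and key mirror the launch command above (self-signed cert -> verify=False).
import requests

URL = "https://localhost:12800/v1/chat/completions"
HEADERS = {"Authorization": "Bearer 89572"}

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Question that needs retrieved context."},
]
rag = {"role": "user", "content": "<retrieved documents go here>"}

# 1) Warm-up: the prompt ends at the stable prefix; generate almost nothing.
requests.post(URL, headers=HEADERS, verify=False,
              json={"messages": history, "max_tokens": 1})

# 2) Real request: the RAG chunk sits after the checkpointed prefix, so the
#    server can later restore the checkpoint instead of reprocessing it all.
resp = requests.post(URL, headers=HEADERS, verify=False,
                     json={"messages": history + [rag]})
print(resp.json()["choices"][0]["message"]["content"])
```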
@ddh0 This is fixed in this PR. I think you are using […]
Oops, you are right. It is fixed with this PR.
Branch updated from 31046d4 to f5f0e8e (Compare):
* master: (113 commits)
  * webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  * cpu : optimize the ggml NORM operation (ggml-org#15953)
  * server : host-memory prompt caching (ggml-org#16391)
  * No markdown in cot (ggml-org#16483)
  * model-conversion : add support for SentenceTransformers (ggml-org#16387)
  * ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  * CANN: Improve ACL graph matching (ggml-org#16166)
  * kleidiai: kernel interface refactoring (ggml-org#16460)
  * [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  * model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  * refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  * Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  * server : fix cancel pending task (ggml-org#16467)
  * metal : mark FA blocks (ggml-org#16372)
  * server : improve context checkpoint logic (ggml-org#16440)
  * ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  * llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  * server : add `/v1/health` endpoint (ggml-org#16461)
  * webui : added download action (ggml-org#13552) (ggml-org#16282)
  * presets : fix pooling param for embedding models (ggml-org#16455)
  * ...
cont #16382
fix #16416
Create context checkpoints shortly before processing the full prompt. This allows recurrent models to better utilize the context checkpoints.
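As a rough illustration of the placement idea (a Python sketch with assumed names, not the actual C++ changes in this PR; the 64-token margin comes from the discussion above):

```python
# Checkpoint shortly before the end of the prompt so it covers the longest
# reusable prefix; a regenerated or edited tail then doesn't invalidate it.
CHECKPOINT_MARGIN = 64  # margin mentioned in the discussion; an assumption here

def process_prompt(tokens, n_batch, decode, save_checkpoint):
    """Decode `tokens` in batches of `n_batch`, checkpointing near the end."""
    checkpoint_pos = max(0, len(tokens) - CHECKPOINT_MARGIN)
    n_past, saved = 0, False
    while n_past < len(tokens):
        batch = tokens[n_past:n_past + n_batch]
        if not saved and n_past + len(batch) > checkpoint_pos:
            # Split the batch so the state is saved exactly at checkpoint_pos.
            split = checkpoint_pos - n_past
            if split > 0:
                decode(batch[:split])
                n_past += split
            save_checkpoint(n_past)  # recurrent models can restore from here
            saved = True
            continue
        decode(batch)
        n_past += len(batch)
```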