feat: add --raw-prompt flag to skip chat template wrapping #3

Open
Krigsexe wants to merge 1 commit into xigh:main from Krigsexe:add-raw-prompt-flag


Summary

Add a --raw-prompt CLI flag that passes the prompt directly to the tokenizer without ChatML or Mistral instruct template wrapping.

Motivation

Models like PleIAs/Pleias-RAG-1B use custom special tokens (<|query_start|>, <|source_start|>, <|source_analysis_start|>, etc.) for structured RAG input/output. They do not define a chat_template in their tokenizer config.

Without --raw-prompt, herbert-rs wraps the prompt in ChatML (<|im_start|>user...), which nests the model's special tokens inside the chat template. The model cannot parse its own tokens and either refuses to answer or generates garbage.
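To make the nesting concrete, here is a minimal sketch of what a single-turn ChatML wrapper does to such a prompt. The `wrap_chatml` helper is a hypothetical stand-in, not the actual herbert-rs code, and the layout of the Pleias-style prompt is illustrative only: it reuses the tokens named above but may not match the model's exact input format.

```rust
/// Hypothetical helper: wraps user text in a single-turn ChatML prompt.
/// This mirrors the standard ChatML layout; it is not the herbert-rs source.
fn wrap_chatml(user: &str) -> String {
    format!(
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        user
    )
}

fn main() {
    // Illustrative prompt using the special tokens named in this PR.
    // The exact layout expected by the model is an assumption.
    let raw = "<|query_start|>Quelle est la capitale de la France ?\
<|source_start|>Paris est la capitale de la France.";

    // Default behaviour: the model's own tokens end up inside a user turn.
    println!("--- wrapped ---\n{}", wrap_chatml(raw));

    // Raw behaviour: the prompt reaches the tokenizer unchanged.
    println!("--- raw ---\n{}", raw);
}
```

In the wrapped form, `<|query_start|>` is no longer the first thing the model sees, which is the situation described above.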

With --raw-prompt, the special tokens are passed directly and the model processes them correctly: language detection, query analysis, source analysis, and grounded answers with citations.

Changes

  • crates/cli/src/main.rs: +15/-3 lines
    • Add --raw-prompt bool flag to Cli struct
    • Skip build_prompt() wrapping in single-shot mode when flag is set
    • Skip build_prompt() wrapping in reserve hint calculation when flag is set
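The shape of the change listed above can be sketched roughly as follows. This is a std-only approximation under stated assumptions, not the actual diff: the `Cli` fields other than `raw_prompt`, the hand-rolled `parse_args`, the `effective_prompt` helper, and the signature and body of `build_prompt` are all hypothetical stand-ins, since the PR description does not show them.

```rust
/// Hypothetical subset of the CLI options; only `raw_prompt` is named in the PR.
struct Cli {
    prompt: String,
    raw_prompt: bool,
}

/// Assumed stand-in for herbert-rs's build_prompt(); the real signature
/// and template selection are not shown in the PR description.
fn build_prompt(user: &str) -> String {
    format!(
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n",
        user
    )
}

/// Hypothetical helper: the text that is actually handed to the tokenizer.
fn effective_prompt(cli: &Cli) -> String {
    if cli.raw_prompt {
        // --raw-prompt: pass the text through untouched.
        cli.prompt.clone()
    } else {
        // Default: apply the chat template wrapping.
        build_prompt(&cli.prompt)
    }
}

/// Minimal argument parsing: `--raw-prompt` sets the flag, and every
/// other argument is joined with spaces to form the prompt.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Cli {
    let mut raw_prompt = false;
    let mut words: Vec<String> = Vec::new();
    for arg in args {
        if arg == "--raw-prompt" {
            raw_prompt = true;
        } else {
            words.push(arg);
        }
    }
    Cli { prompt: words.join(" "), raw_prompt }
}

fn main() {
    // Skip argv[0], the program name.
    let cli = parse_args(std::env::args().skip(1));
    println!("{}", effective_prompt(&cli));
}
```

Routing both the single-shot path and the reserve hint calculation through one helper like `effective_prompt` is one way to keep the two in agreement, so the hint is computed from the same text that is actually tokenized.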

Testing

Tested with PleIAs/Pleias-RAG-1B (Q4 backend, llama model_type via PR #2) on Ryzen 5 3600 (AVX2):

  • Without --raw-prompt: herbert-rs wraps the prompt in ChatML and the model refuses to process the sources ("insufficient information")
  • With --raw-prompt: model correctly generates <|language_start|>French<|language_end|>, <|query_analysis_start|>, <|source_analysis_start|>, <|query_report_start|>Answerable<|query_report_end|> structured output
  • Decode: 20.2 tok/s, Prefill: 94.8 tok/s

Some models (e.g. PleIAs/Pleias-RAG-1B) use custom special tokens for
structured input/output and do not define a chat_template in their
tokenizer config. The default ChatML wrapping (<|im_start|>user...)
breaks these models because their special tokens get nested inside
the chat template instead of being parsed directly.

--raw-prompt passes the prompt text directly to the tokenizer without
any chat template wrapping, enabling models with custom token protocols
to work correctly.

Tested with PleIAs/Pleias-RAG-1B (Q4 backend) on Ryzen 5 3600 (AVX2):
the model correctly processes <|query_start|>, <|source_start|>,
<|source_analysis_start|> tokens and generates structured RAG output
with language detection, query analysis, and source grounding.