Skip to content

Modify llama tokenizer, llama modeling, qwen2 modeling.#677

Open
ZhangJinghe-AI wants to merge 2 commits into
UbiquitousLearning:mainfrom
ZhangJinghe-AI:modify_llama_tokenizer
Open

Modify llama tokenizer, llama modeling, qwen2 modeling.#677
ZhangJinghe-AI wants to merge 2 commits into
UbiquitousLearning:mainfrom
ZhangJinghe-AI:modify_llama_tokenizer

Conversation

@ZhangJinghe-AI

@ZhangJinghe-AI ZhangJinghe-AI commented May 19, 2026

Copy link
Copy Markdown

Fix implementation errors in LLaMA C++ tokenizer and runtime.
Correct issues in the LLaMA and Qwen2 modeling files in pymllm.
Additionally, improve the LLaMA calibration process for better performance.

Summary by CodeRabbit

  • New Features

    • Llama-3 tokenization with chat-template formatting and special-token handling
    • Interactive prompt mode for demos (prompt read from stdin)
  • Improvements

    • Dynamic end-of-sequence token selection based on model type
    • Updated quantization wiring and observer hookups for attention projections
    • Inference/calibration now use chat-template preprocessing and skip very short samples

@coderabbitai

coderabbitai Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3105568a-5a81-42be-9605-fb23888a473b

📥 Commits

Reviewing files that changed from the base of the PR and between 95d86fc and c0d9467.

📒 Files selected for processing (6)
  • examples/llama_qnn_aot/aot_run.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/models/llama/tokenization_llama.hpp
  • pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py
  • pymllm/mobile/backends/qualcomm/transformers/llama/runner.py
💤 Files with no reviewable changes (1)
  • examples/llama_qnn_aot/aot_run.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • pymllm/mobile/backends/qualcomm/transformers/llama/runner.py
  • pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py
  • mllm/models/llama/tokenization_llama.hpp

📝 Walkthrough

Walkthrough

This PR adds a Llama-3 tokenizer and integrates it into the AOT example, makes model-type-configurable EOS selection, and shifts attention quantization observation from input-QDQ to output-QDQ for Llama and Qwen2; runner calibration/generation wiring is updated accordingly.

Changes

Llama Tokenization and Integration

Layer / File(s) Summary
Llama tokenizer implementation
mllm/models/llama/tokenization_llama.hpp
Llama-3 tokenizer with regex splitting, byte↔unicode mapping, SentencePiece/BPE init, special-token trie, chat-template formatting, encode/decode, and convertMessage producing an id Tensor.
Model config and EOS selection
mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp, mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
Adds type to QnnAOTConfig (default "qwen3") and fills EOS id set based on config_.type ("llama3", "qwen2", fallback "qwen3").
AOT example integration
examples/llama_qnn_aot/aot_run.cpp
Uses LlamaTokenizer, sets config.type = "llama3", reads prompt from stdin, tokenizes via convertMessage, and calls runner.generate with config.context_len and a token-print callback; removes hardcoded sequence and --seq_len/--gen_len CLI args.

Quantization refactor (Llama & Qwen2)

Layer / File(s) Summary
Llama attention QDQ observation changes
pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py
Disables input QDQ modules for k_proj/v_proj, computes key/value from projections then applies output-QDQ, and switches RoPE observer wiring to use output-QDQ activation post-process hooks.
Llama runner: input/template and calibration changes
pymllm/mobile/backends/qualcomm/transformers/llama/runner.py
Formats prompts via tokenizer.apply_chat_template(...) before tokenization in infer; calibration trims short samples and gathers activation stats via direct model(**model_inputs) calls instead of generate.
Qwen2 attention QDQ restructuring
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py
Removes k_proj_input_qdq/v_proj_input_qdq usage; forms keys/values directly from projections then uses output-QDQ; updates RoPE concat observer and rotate paths to observe output-QDQ activation post-process.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • liang1232018
  • chenghuaWang
  • oreomaker

Poem

🐰 I nibble bytes and split the text with care,
BPE and trie weaving tokens in the air,
Prompts read kindly from the user's line,
EOS chosen by the model's sign,
Quant paths hop from in to out — tidy and fair.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 4.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title is too vague and uses generic terms ('Modify') without specifying what the main change is; it reads like a file list rather than a clear summary of the primary objective. Replace with a more descriptive title that highlights the main objective, such as 'Add Llama tokenizer implementation and update configuration' or 'Integrate Llama tokenizer with AOT runtime support'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The pull request description covers the main objectives but does not follow the provided template structure and lacks detailed context on why these changes were made.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp

mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp:8:10: fatal error: 'mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp' file not found
8 | #include "mllm/backends/qnn/aot_rt/QnnAOTRuntime.hpp"
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/c0d946721003afb6673922326ba92bb93df00708-aab6a1586696c8d5/tmp/clang_command_.tmp.19dabf.txt
++Contents of '/tmp/coderabbit-infer/c0d946721003afb6673922326ba92bb93df00708-aab6a1586696c8d5/tmp/clang_command_.tmp.19dabf.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRI

... [truncated 1193 characters] ...

8/include"
"-internal-isystem" "/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/aab6a1586696c8d5/file.o" "-x" "c++"
"mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp" "-O0" "-fno-builtin"
"-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (1)

281-304: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

The new observer arguments are currently a no-op.

rotate_half(...) never reads its x_observer parameter, so swapping in q_proj_output_qdq.fake_quant.activation_post_process and k_proj_output_qdq.fake_quant.activation_post_process on Line 286 and Line 298 does not change runtime behavior. If this path is meant to use the output observer, wire it up inside rotate_half; otherwise remove the unused parameter so the refactor is not misleading.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around
lines 281 - 304, rotate_half currently ignores the x_observer argument, so the
new q_proj_output_qdq.fake_quant.activation_post_process and
k_proj_output_qdq.fake_quant.activation_post_process passed from the
q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused
x_observer parameter from rotate_half and from all calls in the q_rope/key_rope
code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites,
k_rope_mul_1_output_qdq call sites,
q_proj_output_qdq.fake_quant.activation_post_process,
k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside
rotate_half so the observer is actually applied to the tensor (e.g., call the
provided activation_post_process/observer on the input or output within
rotate_half) and keep the current call sites; update the rotate_half signature
and all its call sites accordingly to match the chosen approach.
🧹 Nitpick comments (1)
mllm/models/llama/tokenization_llama.hpp (1)

98-101: ⚡ Quick win

Document the public tokenizer API surface.

LlamaMessage and multiple public LlamaTokenizer methods are exported without clear API comments (expected inputs/outputs/error behavior). Please add concise API docs to reduce integration ambiguity.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

Also applies to: 104-237

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/models/llama/tokenization_llama.hpp` around lines 98 - 101, Add concise
public API documentation for the exported symbols: add a doc comment above the
LlamaMessage struct describing its purpose and the meaning/expected format of
role and content, and add doc comments for each public LlamaTokenizer method
(e.g., any Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public
functions found in the file) that specify purpose, parameter expectations
(types, valid ranges, ownership/constness), return value semantics, and possible
error/exception behavior (including when methods may fail or return
empty/invalid results). Follow the project's docstyle (brief one-line summary +
short param/return/error bullets), mention thread-safety if relevant, and ensure
comments are placed immediately above the declaration so callers see the API
contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/llama_qnn_aot/aot_run.cpp`:
- Around line 65-69: The prompt loop currently always sends input to
tokenizer.convertMessage; add an early-exit check after reading prompt_text that
returns or breaks when the user enters "exit" or "quit" (case-insensitive and
trimming surrounding whitespace), and also handle EOF (std::cin.eof())
similarly; implement this check immediately before calling
tokenizer.convertMessage so prompt_text is not processed when the user intends
to quit.
- Around line 41-44: The comment above the tokenizer instantiation is stale —
remove or update the lines that mention "Using Qwen3 tokenizer as a placeholder"
and instead document that LlamaTokenizer is being used; specifically edit the
comment that precedes the creation of
mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it correctly
reflects the use of LlamaTokenizer (or notes to implement a Llama3Tokenizer if
different) and remove any leftover references to Qwen3 to avoid confusion.

In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 5-11: The header tokenization_llama.hpp uses std::iswspace
(references at tokenization functions around lines using iswspace) but doesn't
include the wide-char ctype header; add an explicit `#include` <cwctype> to
tokenization_llama.hpp so std::iswspace is guaranteed to be declared (update the
top-of-file includes alongside <string>, <vector>, etc.), then rebuild to
confirm portability on platforms that don't provide it via transitive includes.
- Around line 121-127: The getSystemPromptPrefix() function uses std::localtime
(unsafe for concurrent calls from applyChatTemplate()); replace it with a
thread-safe alternative — on POSIX call std::localtime_r(&t, &tm_) or on Windows
std::localtime_s(&tm_, &t) (or use std::chrono to format the current date) so
that tm_ is populated safely per-thread; update the declaration/initialization
of tm_ accordingly and keep the existing formatting with std::put_time and
oss.str() untouched.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py`:
- Around line 253-259: The calibration path must use the same chat formatting as
inference: modify the calibrate(...) implementation (referencing calibrate and
its use of calibration samples) to build messages = [{"role": "user", "content":
sample}] and call self.tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True) for each sample (the same call/invocation used in
infer()), and use the returned formatted prompt when computing activation
statistics so calibration and infer share identical preprocessing.

---

Outside diff comments:
In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py`:
- Around line 281-304: rotate_half currently ignores the x_observer argument, so
the new q_proj_output_qdq.fake_quant.activation_post_process and
k_proj_output_qdq.fake_quant.activation_post_process passed from the
q_rope/key_rope call sites are no-ops; fix by either (A) removing the unused
x_observer parameter from rotate_half and from all calls in the q_rope/key_rope
code paths (symbols: rotate_half, q_rope_mul_1_output_qdq call sites,
k_rope_mul_1_output_qdq call sites,
q_proj_output_qdq.fake_quant.activation_post_process,
k_proj_output_qdq.fake_quant.activation_post_process) or (B) wire it up inside
rotate_half so the observer is actually applied to the tensor (e.g., call the
provided activation_post_process/observer on the input or output within
rotate_half) and keep the current call sites; update the rotate_half signature
and all its call sites accordingly to match the chosen approach.

---

Nitpick comments:
In `@mllm/models/llama/tokenization_llama.hpp`:
- Around line 98-101: Add concise public API documentation for the exported
symbols: add a doc comment above the LlamaMessage struct describing its purpose
and the meaning/expected format of role and content, and add doc comments for
each public LlamaTokenizer method (e.g., any
Encode/Decode/Tokenize/Detokenize/LoadVocab or similarly named public functions
found in the file) that specify purpose, parameter expectations (types, valid
ranges, ownership/constness), return value semantics, and possible
error/exception behavior (including when methods may fail or return
empty/invalid results). Follow the project's docstyle (brief one-line summary +
short param/return/error bullets), mention thread-safety if relevant, and ensure
comments are placed immediately above the declaration so callers see the API
contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c19ed8ee-bc79-4e4d-8767-91e2ceb8db20

📥 Commits

Reviewing files that changed from the base of the PR and between 729ca4c and 95d86fc.

📒 Files selected for processing (7)
  • examples/llama_qnn_aot/aot_run.cpp
  • mllm/backends/qnn/aot_rt/QnnAOTConfig.hpp
  • mllm/backends/qnn/aot_rt/QnnAOTRuntime.cpp
  • mllm/models/llama/tokenization_llama.hpp
  • pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py
  • pymllm/mobile/backends/qualcomm/transformers/llama/runner.py
  • pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py

Comment on lines 41 to +44
// Note: Using Qwen3 tokenizer as a placeholder.
// For production use, you should implement a Llama3Tokenizer or use
// the appropriate tokenizer for your model.
auto tokenizer = mllm::models::llama::TinyLlamaTokenizer(tokenizer_path.get());
auto tokenizer = mllm::models::llama::LlamaTokenizer(tokenizer_path.get());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove stale tokenizer comment.

Lines 41–43 still say Qwen3 tokenizer placeholder, but Line 44 instantiates LlamaTokenizer. Please update/remove this comment to avoid maintenance confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llama_qnn_aot/aot_run.cpp` around lines 41 - 44, The comment above
the tokenizer instantiation is stale — remove or update the lines that mention
"Using Qwen3 tokenizer as a placeholder" and instead document that
LlamaTokenizer is being used; specifically edit the comment that precedes the
creation of mllm::models::llama::LlamaTokenizer(tokenizer_path.get()) so it
correctly reflects the use of LlamaTokenizer (or notes to implement a
Llama3Tokenizer if different) and remove any leftover references to Qwen3 to
avoid confusion.

Comment on lines +65 to 69
fmt::print("💬 Prompt text (or 'exit/quit'): ");
std::getline(std::cin, prompt_text);

auto input_tensor = tokenizer.convertMessage({{.role = "user", .content = prompt_text}});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Implement the advertised exit/quit path.

The prompt text says exit/quit (Line 65), but input is always sent to generation. Add an early return when the user enters exit or quit.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llama_qnn_aot/aot_run.cpp` around lines 65 - 69, The prompt loop
currently always sends input to tokenizer.convertMessage; add an early-exit
check after reading prompt_text that returns or breaks when the user enters
"exit" or "quit" (case-insensitive and trimming surrounding whitespace), and
also handle EOF (std::cin.eof()) similarly; implement this check immediately
before calling tokenizer.convertMessage so prompt_text is not processed when the
user intends to quit.

Comment on lines +5 to +11
#include <vector>
#include <string>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <unordered_map>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat -n mllm/models/llama/tokenization_llama.hpp | head -150

Repository: UbiquitousLearning/mllm

Length of output: 6309


Add the missing wide-char ctype header for std::iswspace.

std::iswspace is used at lines 62, 63, 75, and 77 but <cwctype> is not explicitly included. While the code may compile on some platforms due to transitive includes, this violates portability best practices and C++ idioms. All headers for functions used must be explicitly included.

Proposed fix
 `#include` <vector>
 `#include` <string>
+#include <cwctype>
 `#include` <ctime>
 `#include` <iomanip>
 `#include` <sstream>
 `#include` <unordered_map>
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <vector>
#include <string>
#include <ctime>
#include <iomanip>
#include <sstream>
#include <unordered_map>
`#include` <vector>
`#include` <string>
`#include` <cwctype>
`#include` <ctime>
`#include` <iomanip>
`#include` <sstream>
`#include` <unordered_map>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/models/llama/tokenization_llama.hpp` around lines 5 - 11, The header
tokenization_llama.hpp uses std::iswspace (references at tokenization functions
around lines using iswspace) but doesn't include the wide-char ctype header; add
an explicit `#include` <cwctype> to tokenization_llama.hpp so std::iswspace is
guaranteed to be declared (update the top-of-file includes alongside <string>,
<vector>, etc.), then rebuild to confirm portability on platforms that don't
provide it via transitive includes.

Comment on lines +121 to +127
std::string getSystemPromptPrefix() {
std::time_t t = std::time(nullptr);
std::tm tm_ = *std::localtime(&t);
std::ostringstream oss;
oss << std::put_time(&tm_, "%d %b %Y");
return "Cutting Knowledge Date: December 2023\nToday Date: " + oss.str() + "\n\n";
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

cat mllm/models/llama/tokenization_llama.hpp | head -150 | tail -50

Repository: UbiquitousLearning/mllm

Length of output: 2162


🏁 Script executed:

rg "getSystemPromptPrefix" --type hpp --type cpp -B 2 -A 2

Repository: UbiquitousLearning/mllm

Length of output: 95


🏁 Script executed:

# Check if there are any threading or concurrency patterns in the codebase
rg "std::thread|std::mutex|pthread|`#pragma` omp" mllm/models/llama/ | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Look for other usages of std::localtime or similar time functions
rg "std::localtime|localtime_r|localtime_s|std::chrono" mllm/models/llama/ --type hpp --type cpp

Repository: UbiquitousLearning/mllm

Length of output: 95


🏁 Script executed:

rg "getSystemPromptPrefix" --max-count 10

Repository: UbiquitousLearning/mllm

Length of output: 266


🏁 Script executed:

rg "applyChatTemplate" --max-count 10

Repository: UbiquitousLearning/mllm

Length of output: 1690


🏁 Script executed:

# Check if there's any concurrent usage or threading in the codebase
rg "std::thread|std::async|std::mutex|concurrent" --max-count 20

Repository: UbiquitousLearning/mllm

Length of output: 8668


🏁 Script executed:

# Check what C++ version is being used
find . -name "CMakeLists.txt" -o -name "*.cmake" | head -5 | xargs grep -l "cxx_standard\|CXX_STANDARD" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 165


🏁 Script executed:

# Look at the full tokenization_llama.hpp file to understand context better
wc -l mllm/models/llama/tokenization_llama.hpp

Repository: UbiquitousLearning/mllm

Length of output: 112


🏁 Script executed:

cat mllm/CMakeLists.txt | grep -i "cxx_standard\|CXX_STANDARD" -A 2 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check if there are any service/server implementations that might call applyChatTemplate concurrently
rg "LlamaTokenizer|applyChatTemplate" mllm/ -A 3 -B 3 | head -50

Repository: UbiquitousLearning/mllm

Length of output: 4198


🏁 Script executed:

# Check the Service.hpp to understand threading model
head -100 mllm/engine/service/Service.hpp

Repository: UbiquitousLearning/mllm

Length of output: 2221


🌐 Web query:

std::localtime thread safety C++ std::chrono alternative

💡 Result:

The C-style function std::localtime is not thread-safe because it returns a pointer to a shared, statically allocated std::tm object, which can be overwritten by concurrent calls from other threads [1][2][3]. The modern, thread-safe alternative in C++ is the C++20 library, which provides robust support for time zones and calendar conversions [3][4][5]. Recommended C++20 Approach: Use std::chrono::zoned_time to handle local time conversions in a thread-safe manner [3][4][6]. Example: #include #include void print_local_time { auto now = std::chrono::system_clock::now; // zoned_time is thread-safe and handles time zone conversions auto local_now = std::chrono::zoned_time{std::chrono::current_zone, now}; // Use std::format (C++20) for thread-safe, type-safe formatting std::cout << std::format("{:%F %T %Z}", local_now) << std::endl; } Legacy Alternatives (Pre-C++20): If you cannot use C++20, you must use platform-specific thread-safe functions or manual synchronization [1][7][8]: 1. POSIX (Linux/macOS): Use localtime_r(const std::time_t*, std::tm*), which writes the result into a user-provided buffer [1][2][3]. 2. Windows (MSVC): Use localtime_s(std::tm*, const std::time_t*), which also uses a user-provided buffer [1][3][8]. 3. Manual Synchronization: Wrap calls to std::localtime in a std::mutex to prevent concurrent access [1][7][4]. Note: Avoid std::ctime and std::gmtime for the same reasons, as they also rely on shared static buffers [2][8][4].

Citations:


Use thread-safe time functions instead of std::localtime.

std::localtime is not thread-safe as it returns a pointer to a shared static buffer. Since getSystemPromptPrefix() is called from applyChatTemplate(), which is invoked by concurrent worker threads in the Service implementation, this creates a race condition. Replace with std::localtime_r (POSIX) or std::localtime_s (Windows), or use std::chrono for the modern C++ approach.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@mllm/models/llama/tokenization_llama.hpp` around lines 121 - 127, The
getSystemPromptPrefix() function uses std::localtime (unsafe for concurrent
calls from applyChatTemplate()); replace it with a thread-safe alternative — on
POSIX call std::localtime_r(&t, &tm_) or on Windows std::localtime_s(&tm_, &t)
(or use std::chrono to format the current date) so that tm_ is populated safely
per-thread; update the declaration/initialization of tm_ accordingly and keep
the existing formatting with std::put_time and oss.str() untouched.

Comment on lines 253 to +259
def infer(self, prompt: str):
messages = [{"role": "user", "content": prompt}]
prompt = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align calibration preprocessing with inference chat formatting.

infer() now uses apply_chat_template(...) (Line 255), but calibrate() still feeds raw text (Line 324). This distribution mismatch can produce inaccurate activation statistics for deployment. Use the same chat-template path in calibration samples.

Also applies to: 322-330

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/runner.py` around lines
253 - 259, The calibration path must use the same chat formatting as inference:
modify the calibrate(...) implementation (referencing calibrate and its use of
calibration samples) to build messages = [{"role": "user", "content": sample}]
and call self.tokenizer.apply_chat_template(messages, tokenize=False,
add_generation_prompt=True) for each sample (the same call/invocation used in
infer()), and use the returned formatted prompt when computing activation
statistics so calibration and infer share identical preprocessing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant