tokenweaver formats Hugging Face datasets into tokenizer-aware chunks for fixed-length LLM training windows.
It is designed to help you:
- reduce token waste caused by badly sized document tails
- preserve row order and metadata from the source dataset
- export the formatted dataset to local disk, parquet, or the Hugging Face Hub
The CLI has three commands:
| Command | When to use it | Writes files? |
|---|---|---|
| `inspect` | Preview chunking stats before a real run | No |
| `format` | Run the full formatting pipeline | Yes, unless `--dry-run` |
| `report` | Generate a markdown or JSON summary | Only if you pass `--output-path` |
Recommended workflow:
- Run `inspect` to preview the chunking behavior.
- Run `format --dry-run` to validate the full pipeline safely.
- Run `format` again with an output flag when you are happy with the setup.
```
pip install -e .[dev]
```

After installation, you can use either command style:

```
tokenweaver --help
python -m tokenweaver.cli --help
```

If `tokenweaver` is not on your shell PATH, use `python -m tokenweaver.cli`.
See the global help:
```
python -m tokenweaver.cli --help
```

See help for the formatting command:

```
python -m tokenweaver.cli format --help
```

Run the safest smoke test:

```
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --dry-run
```

That command still loads the tokenizer, loads the dataset, and runs chunking, but it does not write any files.
These are the flags most users care about first:
| Flag | What it controls | Typical choice |
|---|---|---|
| `--dataset-id` | Source dataset on Hugging Face | Required |
| `--seq-len` | Final context window size, including special tokens | Required |
| `--tokenizer` | Tokenizer used to decide chunk boundaries | Usually your training model tokenizer |
| `--text-field` | Column that contains the source text | Usually `text` |
| `--split` | Process only one split instead of all splits | Leave empty unless needed |
| `--overlap` | Tokens repeated between chunks | `auto` is the safest default |
| `--min-tail-tokens` | Preferred minimum size of the last tail when overlap is `auto` | Keep default unless you know why to change it |
| `--max-overlap` | Upper bound for `auto` overlap | Keep default unless tuning |
| `--batch-size` | How many documents are processed together during mapping | Lower it on slower machines |
| `--dry-run` | Runs everything except writing outputs | Best first test |
| `--streaming` | Reads source data lazily instead of downloading it all upfront | Good for inspection, not for writing outputs |
The format command becomes a real write operation when you choose one of these:
| Flag | Result |
|---|---|
| `--output-dir` | Writes a local Hugging Face `save_to_disk()` directory |
| `--output-parquet-dir` | Writes one parquet file per split |
| `--push-to-hub` + `--output-dataset-id` | Publishes the formatted dataset to the Hugging Face Hub |

Use `--output-parquet-dir` when you want the easiest format to inspect, upload, or share.
Inspect a dataset without writing anything:
```
python -m tokenweaver.cli inspect \
  --dataset-id my-org/my-dataset \
  --seq-len 2048
```

Format and save as parquet:

```
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --output-parquet-dir ./out/my-dataset-cpt-2048
```

Format and save as a local datasets directory:

```
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --output-dir ./out/my-dataset-cpt-2048
```

Format and publish to the Hub:

```
python -m tokenweaver.cli format \
  --dataset-id my-org/private-dataset \
  --hf-token "$HF_TOKEN" \
  --seq-len 2048 \
  --output-dataset-id my-org/private-dataset-cpt-2048 \
  --push-to-hub
```

Write a markdown report:

```
python -m tokenweaver.cli report \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --format markdown \
  --output-path ./report.md
```

Write a JSON report:

```
python -m tokenweaver.cli report \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --format json \
  --output-path ./report.json
```

If you are unsure, keep the defaults:
```
--overlap auto --min-tail-tokens 256 --max-overlap 256
```

This means:

- the tool tries to avoid very short final tail chunks
- it can increase overlap automatically when needed
- it does not keep increasing overlap forever, because `--max-overlap` sets a cap

If you pass a manual overlap like `--overlap 128`, the CLI uses that fixed overlap directly.
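One way to picture the `auto` rule is as a search for the smallest overlap whose final chunk is long enough. The sketch below is illustrative arithmetic under that reading, not the tool's actual implementation:

```python
import math

def tail_tokens(n_tokens: int, seq_len: int, overlap: int) -> int:
    """Size of the final chunk when n_tokens are split into seq_len
    windows that each share `overlap` tokens with the previous window."""
    if n_tokens <= seq_len:
        return n_tokens  # the whole document fits in one window
    stride = seq_len - overlap
    last_start = stride * math.ceil((n_tokens - seq_len) / stride)
    return n_tokens - last_start

def pick_auto_overlap(n_tokens: int, seq_len: int,
                      min_tail_tokens: int = 256,
                      max_overlap: int = 256) -> int:
    """Smallest overlap whose tail reaches min_tail_tokens,
    capped at max_overlap (mirroring the defaults above)."""
    for overlap in range(max_overlap + 1):
        if tail_tokens(n_tokens, seq_len, overlap) >= min_tail_tokens:
            return overlap
    return max_overlap

# A 4200-token document at seq_len=2048 with no overlap leaves a
# 104-token tail; a modest overlap lifts the tail to min_tail_tokens.
```

For example, `tail_tokens(4200, 2048, 0)` is 104, so `auto` would pick a nonzero overlap there, while a 5000-token document already has a long tail and needs no overlap at all.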
Formatting large datasets can feel uneven when some source documents are much longer than others.
If progress feels jumpy or looks “stuck”, lower `--batch-size`:

```
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --batch-size 64 \
  --dry-run
```

Try even smaller values like 32 or 16 on weaker machines. This usually makes progress updates feel smoother, though total runtime can increase.
These flags change how the CLI executes the job, not how the chunks are computed.
| Flag | What it changes | Good default mental model | Trade-off |
|---|---|---|---|
| `--dry-run` | Runs the whole pipeline but skips every write step | Use this first before any long run | You validate everything except the final save/push |
| `--batch-size` | Number of source documents processed together during mapping | Lower it when the progress bar feels jumpy or memory is tight | Smaller batches feel smoother but can be slower |
| `--streaming` | Reads the source dataset lazily instead of downloading it all upfront | Use it for `inspect`, `report`, or `format --dry-run` | In the CLI, treat it as read-only mode and remove it for real writes |
Practical starting points:
- `--dry-run`: use on your first attempt for any new dataset
- `--batch-size 64`: a good first adjustment on laptops or older machines
- `--streaming`: good when you want to preview behavior without materializing the full source dataset locally
Easy rule of thumb:
- If you want safety, add `--dry-run`.
- If you want smoother progress, lower `--batch-size`.
- If you want a real output folder or parquet export, do not use `--streaming`.
Use `--hf-token` when:
- the source dataset is private
- the source dataset is gated
- you want to push the formatted result back to the Hub
Example:
```
python -m tokenweaver.cli format \
  --dataset-id my-org/private-dataset \
  --hf-token "$HF_TOKEN" \
  --seq-len 2048 \
  --output-parquet-dir ./out/private-dataset-cpt-2048
```

Useful global flags:
| Flag | What it does |
|---|---|
| `--plain` | Disables Rich formatting and prints stable plain text, useful for scripts and CI |
| `--debug` | Shows the full Python traceback after the friendly CLI error summary |
Example:
```
python -m tokenweaver.cli --plain inspect \
  --dataset-id my-org/my-dataset \
  --seq-len 2048
```

You can also call the formatter directly from Python:

```python
from tokenweaver.formatting import format_dataset

formatted = format_dataset(
    dataset_id="my-org/my-dataset",
    seq_len=2048,
    tokenizer_name_or_path="Qwen/Qwen3-8B",
    overlap="auto",
    min_tail_tokens=256,
    max_overlap=256,
    output_parquet_dir="./out/my-dataset-cpt-2048",
)
```

The formatted output preserves your original columns and adds:
- `token_count`
- `chunk_index`
- `chunk_total`
Notes:
- `seq_len` always includes the tokenizer special-token overhead.
- The output `text` field is decoded back to plain text.
- `token_count` reflects the tokenizer's normal counting behavior, including special tokens.
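Because every row carries `chunk_index` and `chunk_total`, downstream code can sanity-check that no document lost a chunk. A minimal sketch, assuming your source rows include an `id` column that the formatter preserved (the `id` field here is hypothetical; substitute whatever key your dataset actually carries):

```python
from collections import defaultdict

def check_chunks(rows):
    """Verify each document's chunks are complete: chunk_index values
    0..chunk_total-1 must all be present for every preserved id."""
    by_doc = defaultdict(list)
    for row in rows:
        by_doc[row["id"]].append(row)  # "id" is a hypothetical preserved column
    for doc_id, chunks in by_doc.items():
        indices = sorted(c["chunk_index"] for c in chunks)
        if indices != list(range(chunks[0]["chunk_total"])):
            raise ValueError(f"missing chunks for document {doc_id}")
    return True

rows = [
    {"id": "a", "chunk_index": 0, "chunk_total": 2},
    {"id": "a", "chunk_index": 1, "chunk_total": 2},
    {"id": "b", "chunk_index": 0, "chunk_total": 1},
]
```

The same grouping trick also lets you reassemble a document's chunks in order after shuffling.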