
# tokenweaver

tokenweaver formats Hugging Face datasets into tokenizer-aware chunks for fixed-length LLM training windows.

It is designed to help you:

- reduce token waste caused by badly sized document tails
- preserve row order and metadata from the source dataset
- export the formatted dataset to local disk, parquet, or the Hugging Face Hub
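To make the tail-waste problem concrete, here is a minimal, illustrative sketch (not tokenweaver's actual implementation) of naive fixed-window chunking and the padding a short final tail wastes:

```python
# Illustrative sketch, NOT tokenweaver's implementation: naive fixed-window
# chunking and the padding waste caused by a short final tail.

def naive_chunks(token_ids, seq_len):
    """Split a token sequence into consecutive windows of at most seq_len."""
    return [token_ids[i:i + seq_len] for i in range(0, len(token_ids), seq_len)]

def padding_waste(token_ids, seq_len):
    """Pad tokens needed to fill the final short window, if any."""
    tail = len(token_ids) % seq_len
    return 0 if tail == 0 else seq_len - tail

doc = list(range(5000))  # stand-in for one tokenized document
print([len(c) for c in naive_chunks(doc, 2048)])  # [2048, 2048, 904]
print(padding_waste(doc, 2048))                   # 1144 wasted pad tokens
```

Overlap-aware chunking, as tokenweaver's `--overlap` options describe, trades a little token duplication for fewer of these wasted positions.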

## What the CLI does

The CLI has three commands:

| Command | When to use it | Writes files? |
|---|---|---|
| `inspect` | Preview chunking stats before a real run | No |
| `format` | Run the full formatting pipeline | Yes, unless `--dry-run` |
| `report` | Generate a markdown or JSON summary | Only if you pass `--output-path` |

Recommended workflow:

1. Run `inspect` to preview the chunking behavior.
2. Run `format --dry-run` to validate the full pipeline safely.
3. Run `format` again with an output flag when you are happy with the setup.

## Installation

```bash
pip install -e .[dev]
```

After installation, you can use either command style:

```bash
tokenweaver --help
python -m tokenweaver.cli --help
```

If `tokenweaver` is not on your shell `PATH`, use `python -m tokenweaver.cli`.

## First commands to try

See the global help:

```bash
python -m tokenweaver.cli --help
```

See help for the formatting command:

```bash
python -m tokenweaver.cli format --help
```

Run the safest smoke test:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --dry-run
```

That command still loads the tokenizer, loads the dataset, and runs chunking, but it does not write any files.

## Most important flags

These are the flags most users care about first:

| Flag | What it controls | Typical choice |
|---|---|---|
| `--dataset-id` | Source dataset on Hugging Face | Required |
| `--seq-len` | Final context window size, including special tokens | Required |
| `--tokenizer` | Tokenizer used to decide chunk boundaries | Usually your training model's tokenizer |
| `--text-field` | Column that contains the source text | Usually `text` |
| `--split` | Process only one split instead of all splits | Leave empty unless needed |
| `--overlap` | Tokens repeated between chunks | `auto` is the safest default |
| `--min-tail-tokens` | Preferred minimum size of the last tail when overlap is `auto` | Keep the default unless you know why to change it |
| `--max-overlap` | Upper bound for `auto` overlap | Keep the default unless tuning |
| `--batch-size` | How many documents are processed together during mapping | Lower it on slower machines |
| `--dry-run` | Runs everything except writing outputs | Best first test |
| `--streaming` | Reads source data lazily instead of downloading it all upfront | Good for inspection, not for writing outputs |

## Output flags

The format command becomes a real write operation when you choose one of these:

| Flag | Result |
|---|---|
| `--output-dir` | Writes a local Hugging Face `save_to_disk()` directory |
| `--output-parquet-dir` | Writes one parquet file per split |
| `--push-to-hub` + `--output-dataset-id` | Publishes the formatted dataset to the Hugging Face Hub |

Use `--output-parquet-dir` when you want the easiest format to inspect, upload, or share.

## Common command recipes

Inspect a dataset without writing anything:

```bash
python -m tokenweaver.cli inspect \
  --dataset-id my-org/my-dataset \
  --seq-len 2048
```

Format and save as parquet:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --output-parquet-dir ./out/my-dataset-cpt-2048
```

Format and save as a local datasets directory:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --output-dir ./out/my-dataset-cpt-2048
```

Format and publish to the Hub:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/private-dataset \
  --hf-token "$HF_TOKEN" \
  --seq-len 2048 \
  --output-dataset-id my-org/private-dataset-cpt-2048 \
  --push-to-hub
```

Write a markdown report:

```bash
python -m tokenweaver.cli report \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --format markdown \
  --output-path ./report.md
```

Write a JSON report:

```bash
python -m tokenweaver.cli report \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --format json \
  --output-path ./report.json
```

## Choosing `--overlap`, `--min-tail-tokens`, and `--max-overlap`

If you are unsure, keep the defaults:

```bash
--overlap auto --min-tail-tokens 256 --max-overlap 256
```

This means:

- the tool tries to avoid very short final tail chunks
- it can increase overlap automatically when needed
- it does not keep increasing overlap forever, because `--max-overlap` sets a cap

If you pass a manual overlap like `--overlap 128`, the CLI uses that fixed overlap directly.
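One plausible reading of the auto mode described above (an illustrative sketch, not necessarily tokenweaver's exact algorithm) is: pick the smallest overlap, capped at `--max-overlap`, whose chunking leaves a final tail of at least `--min-tail-tokens`:

```python
# Sketch of one plausible "auto" overlap policy; the library's real
# algorithm may differ in details.

def chunk_spans(n_tokens, seq_len, overlap):
    """(start, end) token spans when chunking with a fixed overlap."""
    stride = seq_len - overlap
    spans, start = [], 0
    while True:
        end = min(start + seq_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            return spans
        start += stride

def auto_overlap(n_tokens, seq_len, min_tail_tokens=256, max_overlap=256):
    """Smallest overlap whose final chunk holds >= min_tail_tokens tokens,
    capped at max_overlap."""
    for overlap in range(max_overlap + 1):
        spans = chunk_spans(n_tokens, seq_len, overlap)
        tail = spans[-1][1] - spans[-1][0]
        if len(spans) == 1 or tail >= min_tail_tokens:
            return overlap
    return max_overlap

# A 4200-token document in 2048-token windows leaves a 104-token tail at
# overlap 0; overlap 76 stretches that tail to exactly 256 tokens.
print(auto_overlap(4200, 2048))  # 76
```

This also shows why the cap matters: without `max_overlap`, a pathological document length could push the search toward ever larger duplication between chunks.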

## Progress on slower machines

Formatting large datasets can feel uneven when some source documents are much longer than others.

If progress feels jumpy or looks “stuck”, lower `--batch-size`:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --batch-size 64 \
  --dry-run
```

Try even smaller values like `32` or `16` on weaker machines. This usually makes progress updates feel smoother, though total runtime can increase.
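The arithmetic behind the smoothness trade-off is simple (this assumes, as is common for batched mapping, that progress advances roughly once per processed batch; it is a mental model, not tokenweaver internals):

```python
import math

# Smaller batches -> more batches -> more frequent progress updates,
# at the cost of more per-batch overhead.

def num_progress_updates(n_docs, batch_size):
    """Number of batches (and hence rough progress ticks) for a dataset."""
    return math.ceil(n_docs / batch_size)

for bs in (1000, 64, 16):
    print(bs, num_progress_updates(10_000, bs))  # 10, 157, 625 updates
```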

## Runtime flags, without mystery

These flags change how the CLI executes the job, not how the chunks are computed.

| Flag | What it changes | Good default mental model | Trade-off |
|---|---|---|---|
| `--dry-run` | Runs the whole pipeline but skips every write step | Use this first before any long run | You validate everything except the final save/push |
| `--batch-size` | Number of source documents processed together during mapping | Lower it when the progress bar feels jumpy or memory is tight | Smaller batches feel smoother but can be slower |
| `--streaming` | Reads the source dataset lazily instead of downloading it all upfront | Use it for `inspect`, `report`, or `format --dry-run` | In the CLI, treat it as read-only mode and remove it for real writes |

Practical starting points:

- `--dry-run`: use on your first attempt with any new dataset
- `--batch-size 64`: a good first adjustment on laptops or older machines
- `--streaming`: good when you want to preview behavior without materializing the full source dataset locally

Easy rule of thumb:

- If you want safety, add `--dry-run`
- If you want smoother progress, lower `--batch-size`
- If you want a real output folder or parquet export, do not use `--streaming`

## Private or gated datasets

Use `--hf-token` when:

- the source dataset is private
- the source dataset is gated
- you want to push the formatted result back to the Hub

Example:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/private-dataset \
  --hf-token "$HF_TOKEN" \
  --seq-len 2048 \
  --output-parquet-dir ./out/private-dataset-cpt-2048
```

## Plain output and debugging

Useful global flags:

| Flag | What it does |
|---|---|
| `--plain` | Disables Rich formatting and prints stable plain text; useful for scripts and CI |
| `--debug` | Shows the full Python traceback after the friendly CLI error summary |

Example:

```bash
python -m tokenweaver.cli --plain inspect \
  --dataset-id my-org/my-dataset \
  --seq-len 2048
```

## Python API

```python
from tokenweaver.formatting import format_dataset

formatted = format_dataset(
    dataset_id="my-org/my-dataset",
    seq_len=2048,
    tokenizer_name_or_path="Qwen/Qwen3-8B",
    overlap="auto",
    min_tail_tokens=256,
    max_overlap=256,
    output_parquet_dir="./out/my-dataset-cpt-2048",
)
```

## What the formatted dataset contains

The formatted output preserves your original columns and adds:

- `token_count`
- `chunk_index`
- `chunk_total`

Notes:

- `seq_len` always includes the tokenizer special-token overhead.
- The output text field is decoded back to plain text.
- `token_count` reflects the tokenizer's normal counting behavior, including special tokens.
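As a toy illustration of how one source row fans out into chunk rows carrying these columns (a hypothetical helper using plain no-overlap chunking, not tokenweaver's code):

```python
def explode_row(row, token_ids, seq_len):
    """Fan one source row out into chunk rows, copying the original columns
    and adding token_count, chunk_index, and chunk_total (illustrative)."""
    pieces = [token_ids[i:i + seq_len] for i in range(0, len(token_ids), seq_len)]
    return [
        {**row, "token_count": len(p), "chunk_index": i, "chunk_total": len(pieces)}
        for i, p in enumerate(pieces)
    ]

rows = explode_row({"source": "doc-7", "lang": "en"}, list(range(5000)), 2048)
print(rows[-1])
# {'source': 'doc-7', 'lang': 'en', 'token_count': 904, 'chunk_index': 2, 'chunk_total': 3}
```

Note how `source` and `lang` ride along unchanged, matching the guarantee above that original columns are preserved.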

## About

Library to format Hugging Face datasets into fixed-length token sequences tailored to hardware constraints.
