
# tokenweaver

tokenweaver formats Hugging Face datasets into tokenizer-aware chunks for fixed-length LLM training windows.

It is designed to help you:

- reduce token waste caused by badly sized document tails
- preserve row order and metadata from the source dataset
- export the formatted dataset to local disk, parquet, or the Hugging Face Hub
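To make the tail-waste problem concrete, here is a minimal, illustrative sketch (not tokenweaver's actual implementation) of naive fixed-window chunking and the padding a short final tail wastes:

```python
# Illustrative sketch, NOT tokenweaver's implementation: naive fixed-window
# chunking and the padding waste caused by a short final tail.

def naive_chunks(token_ids, seq_len):
    """Split a token sequence into consecutive windows of at most seq_len."""
    return [token_ids[i:i + seq_len] for i in range(0, len(token_ids), seq_len)]

def padding_waste(token_ids, seq_len):
    """Pad tokens needed to fill the final short window, if any."""
    tail = len(token_ids) % seq_len
    return 0 if tail == 0 else seq_len - tail

doc = list(range(5000))  # stand-in for one tokenized document
print([len(c) for c in naive_chunks(doc, 2048)])  # [2048, 2048, 904]
print(padding_waste(doc, 2048))                   # 1144 wasted pad tokens
```

Overlap-aware chunking, as tokenweaver's `--overlap` options describe, trades a little token duplication for fewer of these wasted positions.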

## What the CLI does

The CLI has three commands:

| Command | When to use it | Writes files? |
|---|---|---|
| `inspect` | Preview chunking stats before a real run | No |
| `format` | Run the full formatting pipeline | Yes, unless `--dry-run` |
| `report` | Generate a markdown or JSON summary | Only if you pass `--output-path` |

Recommended workflow:

1. Run `inspect` to preview the chunking behavior.
2. Run `format --dry-run` to validate the full pipeline safely.
3. Run `format` again with an output flag when you are happy with the setup.

## Installation

```bash
pip install -e .[dev]
```

After installation, you can use either command style:

```bash
tokenweaver --help
python -m tokenweaver.cli --help
```

If `tokenweaver` is not on your shell `PATH`, use `python -m tokenweaver.cli`.

## First commands to try

See the global help:

```bash
python -m tokenweaver.cli --help
```

See help for the formatting command:

```bash
python -m tokenweaver.cli format --help
```

Run the safest smoke test:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --dry-run
```

That command still loads the tokenizer, loads the dataset, and runs chunking, but it does not write any files.

## Most important flags

These are the flags most users care about first:

| Flag | What it controls | Typical choice |
|---|---|---|
| `--dataset-id` | Source dataset on Hugging Face | Required |
| `--seq-len` | Final context window size, including special tokens | Required |
| `--tokenizer` | Tokenizer used to decide chunk boundaries | Usually your training model's tokenizer |
| `--text-field` | Column that contains the source text | Usually `text` |
| `--split` | Process only one split instead of all splits | Leave empty unless needed |
| `--overlap` | Tokens repeated between chunks | `auto` is the safest default |
| `--min-tail-tokens` | Preferred minimum size of the last tail when overlap is `auto` | Keep the default unless you know why to change it |
| `--max-overlap` | Upper bound for `auto` overlap | Keep the default unless tuning |
| `--batch-size` | How many documents are processed together during mapping | Lower it on slower machines |
| `--dry-run` | Runs everything except writing outputs | Best first test |
| `--streaming` | Reads source data lazily instead of downloading it all upfront | Good for inspection, not for writing outputs |

## Output flags

The format command becomes a real write operation when you choose one of these:

| Flag | Result |
|---|---|
| `--output-dir` | Writes a local Hugging Face `save_to_disk()` directory |
| `--output-parquet-dir` | Writes one parquet file per split |
| `--push-to-hub` + `--output-dataset-id` | Publishes the formatted dataset to the Hugging Face Hub |

Use `--output-parquet-dir` when you want the easiest format to inspect, upload, or share.

## Common command recipes

Inspect a dataset without writing anything:

```bash
python -m tokenweaver.cli inspect \
  --dataset-id my-org/my-dataset \
  --seq-len 2048
```

Format and save as parquet:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --output-parquet-dir ./out/my-dataset-cpt-2048
```

Format and save as a local datasets directory:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --output-dir ./out/my-dataset-cpt-2048
```

Format and publish to the Hub:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/private-dataset \
  --hf-token "$HF_TOKEN" \
  --seq-len 2048 \
  --output-dataset-id my-org/private-dataset-cpt-2048 \
  --push-to-hub
```

Write a markdown report:

```bash
python -m tokenweaver.cli report \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --format markdown \
  --output-path ./report.md
```

Write a JSON report:

```bash
python -m tokenweaver.cli report \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --format json \
  --output-path ./report.json
```

## Choosing `--overlap`, `--min-tail-tokens`, and `--max-overlap`

If you are unsure, keep the defaults:

```bash
--overlap auto --min-tail-tokens 256 --max-overlap 256
```

This means:

- the tool tries to avoid very short final tail chunks
- it can increase overlap automatically when needed
- it does not keep increasing overlap forever, because `--max-overlap` sets a cap

If you pass a manual overlap like `--overlap 128`, the CLI uses that fixed overlap directly.
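One plausible reading of the auto mode described above (an illustrative sketch, not necessarily tokenweaver's exact algorithm) is: pick the smallest overlap, capped at `--max-overlap`, whose chunking leaves a final tail of at least `--min-tail-tokens`:

```python
# Sketch of one plausible "auto" overlap policy; the library's real
# algorithm may differ in details.

def chunk_spans(n_tokens, seq_len, overlap):
    """(start, end) token spans when chunking with a fixed overlap."""
    stride = seq_len - overlap
    spans, start = [], 0
    while True:
        end = min(start + seq_len, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            return spans
        start += stride

def auto_overlap(n_tokens, seq_len, min_tail_tokens=256, max_overlap=256):
    """Smallest overlap whose final chunk holds >= min_tail_tokens tokens,
    capped at max_overlap."""
    for overlap in range(max_overlap + 1):
        spans = chunk_spans(n_tokens, seq_len, overlap)
        tail = spans[-1][1] - spans[-1][0]
        if len(spans) == 1 or tail >= min_tail_tokens:
            return overlap
    return max_overlap

# A 4200-token document in 2048-token windows leaves a 104-token tail at
# overlap 0; overlap 76 stretches that tail to exactly 256 tokens.
print(auto_overlap(4200, 2048))  # 76
```

This also shows why the cap matters: without `max_overlap`, a pathological document length could push the search toward ever larger duplication between chunks.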

## Progress on slower machines

Formatting large datasets can feel uneven when some source documents are much longer than others.

If progress feels jumpy or looks “stuck”, lower `--batch-size`:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/my-dataset \
  --seq-len 2048 \
  --batch-size 64 \
  --dry-run
```

Try even smaller values like `32` or `16` on weaker machines. This usually makes progress updates feel smoother, though total runtime can increase.
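The arithmetic behind the smoothness trade-off is simple (this assumes, as is common for batched mapping, that progress advances roughly once per processed batch; it is a mental model, not tokenweaver internals):

```python
import math

# Smaller batches -> more batches -> more frequent progress updates,
# at the cost of more per-batch overhead.

def num_progress_updates(n_docs, batch_size):
    """Number of batches (and hence rough progress ticks) for a dataset."""
    return math.ceil(n_docs / batch_size)

for bs in (1000, 64, 16):
    print(bs, num_progress_updates(10_000, bs))  # 10, 157, 625 updates
```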

## Runtime flags, without mystery

These flags change how the CLI executes the job, not how the chunks are computed.

| Flag | What it changes | Good default mental model | Trade-off |
|---|---|---|---|
| `--dry-run` | Runs the whole pipeline but skips every write step | Use this first before any long run | You validate everything except the final save/push |
| `--batch-size` | Number of source documents processed together during mapping | Lower it when the progress bar feels jumpy or memory is tight | Smaller batches feel smoother but can be slower |
| `--streaming` | Reads the source dataset lazily instead of downloading it all upfront | Use it for `inspect`, `report`, or `format --dry-run` | In the CLI, treat it as read-only mode and remove it for real writes |

Practical starting points:

- `--dry-run`: use on your first attempt with any new dataset
- `--batch-size 64`: a good first adjustment on laptops or older machines
- `--streaming`: good when you want to preview behavior without materializing the full source dataset locally

Easy rule of thumb:

- If you want safety, add `--dry-run`
- If you want smoother progress, lower `--batch-size`
- If you want a real output folder or parquet export, do not use `--streaming`

## Private or gated datasets

Use `--hf-token` when:

- the source dataset is private
- the source dataset is gated
- you want to push the formatted result back to the Hub

Example:

```bash
python -m tokenweaver.cli format \
  --dataset-id my-org/private-dataset \
  --hf-token "$HF_TOKEN" \
  --seq-len 2048 \
  --output-parquet-dir ./out/private-dataset-cpt-2048
```

## Plain output and debugging

Useful global flags:

| Flag | What it does |
|---|---|
| `--plain` | Disables Rich formatting and prints stable plain text; useful for scripts and CI |
| `--debug` | Shows the full Python traceback after the friendly CLI error summary |

Example:

```bash
python -m tokenweaver.cli --plain inspect \
  --dataset-id my-org/my-dataset \
  --seq-len 2048
```

## Python API

```python
from tokenweaver.formatting import format_dataset

formatted = format_dataset(
    dataset_id="my-org/my-dataset",
    seq_len=2048,
    tokenizer_name_or_path="Qwen/Qwen3-8B",
    overlap="auto",
    min_tail_tokens=256,
    max_overlap=256,
    output_parquet_dir="./out/my-dataset-cpt-2048",
)
```

## What the formatted dataset contains

The formatted output preserves your original columns and adds:

- `token_count`
- `chunk_index`
- `chunk_total`

Notes:

- `seq_len` always includes the tokenizer special-token overhead.
- The output text field is decoded back to plain text.
- `token_count` reflects the tokenizer's normal counting behavior, including special tokens.
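As a toy illustration of how one source row fans out into chunk rows carrying these columns (a hypothetical helper using plain no-overlap chunking, not tokenweaver's code):

```python
def explode_row(row, token_ids, seq_len):
    """Fan one source row out into chunk rows, copying the original columns
    and adding token_count, chunk_index, and chunk_total (illustrative)."""
    pieces = [token_ids[i:i + seq_len] for i in range(0, len(token_ids), seq_len)]
    return [
        {**row, "token_count": len(p), "chunk_index": i, "chunk_total": len(pieces)}
        for i, p in enumerate(pieces)
    ]

rows = explode_row({"source": "doc-7", "lang": "en"}, list(range(5000)), 2048)
print(rows[-1])
# {'source': 'doc-7', 'lang': 'en', 'token_count': 904, 'chunk_index': 2, 'chunk_total': 3}
```

Note how `source` and `lang` ride along unchanged, matching the guarantee above that original columns are preserved.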

## About

Library to format Hugging Face datasets into fixed-length token sequences tailored to hardware constraints.
