r-three/Tokenizers

Tokenizer-level experiments

This repository contains code for experimenting with tokenizers in the xarch project. It inherits from https://github.com/r-three/ca-merging.

Contributing

  • There are no strict restrictions, but in general add your scripts under xarch_tokenizers/scripts and create a corresponding config; see EvaluationConfig in experiment_config.
  • Please add all new dependencies with uv, e.g. uv add XXX

Set-up

We recommend using uv (install it with pip install uv if not already available).

On Killarney

On the Killarney cluster, you need to first load the following modules:

module load slurm/killarney/24.05.7 StdEnv/2023  gcc/13.3  openmpi/5.0.3 cuda/12.6 python/3.10.13

The first time you run the code, you also need to install the packages:

curl -LsSf https://astral.sh/uv/install.sh | sh

# If you don't have a virtual environment already, you can either:
# 1. Install the packages into the system Python
uv pip install -e . --system

# 2. Create a venv with uv
# make sure to load cuda first (flash-attn is built locally against cuda-12.4)
uv venv --python 3.10
source .venv/bin/activate
# First run:
uv sync --extra build
uv sync --all-extras
# On machines without CUDA:
uv sync --all-extras --all-groups --no-install-package flash-attn
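After syncing, a quick sanity check is to confirm which optional packages are importable in the active environment. This is a minimal sketch; the package names checked here are just the CUDA-dependent ones mentioned above:

```python
import importlib.util

def check_packages(names):
    """Return {name: bool} indicating which packages are importable."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

# flash-attn is skipped on machines without CUDA, so it may be missing.
status = check_packages(["torch", "flash_attn"])
for name, ok in status.items():
    print(f"{name}: {'available' if ok else 'missing'}")
```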

If you already have another uv venv, you can add this package to the original project's pyproject.toml as below and run uv sync --extra tokenizers in the main directory:

[project.optional-dependencies]
tokenizers = ["xarch-tokenizers"]

[tool.uv.sources]
xarch-tokenizers = { path = "../tokenizers", editable = true }

Sample Usage:

LM-Eval

uv run eval xarch_tokenizers/configs/mgsm/mgsm_eval_llama8B.yaml 

LM-Eval: new datasets

Add new datasets under xarch_tokenizers/lm_eval_datasets. For local datasets, follow tokenization_robustness; for HuggingFace datasets, follow the mgsm configs. For datasets with subsets, create a task/task_subset.yaml.
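A minimal task config could look like the sketch below. The field names follow lm-evaluation-harness conventions, but the dataset id, prompt template, and metric choices are placeholders, not the repo's actual configs:

```yaml
# Hypothetical task config — dataset names and prompts are placeholders.
task: my_new_task
dataset_path: org/my-dataset        # HF hub id, or a local loader
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```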

You can further override task specific settings, for an example see this config file.

You can also add custom metrics and filters for lm_eval_harness in lm_eval.py.
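A custom metric is ultimately just a function over a prediction and a reference. The sketch below shows only the metric logic as a plain function; how it is registered with lm_eval (decorator names, hook signatures) depends on the lm-eval version, so that part is omitted:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def normalized_exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if prediction matches reference after normalization."""
    return float(normalize(prediction) == normalize(reference))
```

This kind of normalization is useful for robustness evaluations, where surface variants of the same answer should score identically.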

Tokenization Robustness Dataset

We maintain a custom dataset here; follow its format when adding new examples. To run evaluation on this dataset, its formatting must first be converted (see below).

Sample evaluation configs can be found here.

python xarch_tokenizers/scripts/eval.py xarch_tokenizers/configs/tokenization_robustness/eval_llama8B.yaml
python xarch_tokenizers/scripts/eval.py xarch_tokenizers/configs/tokenization_robustness/eval_qwen_7B.yaml
python xarch_tokenizers/scripts/eval.py xarch_tokenizers/configs/tokenization_robustness/eval_qwen_05B.yaml

Convert dataset to HF format and upload to HF

python xarch_tokenizers/scripts/convert_dataset_to_hf_format.py xarch_tokenizers/configs/tokenization_robustness/convert_v102_to_hf.yaml
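The core of such a conversion is writing the records as JSON Lines, which HF's datasets library loads directly. This is a sketch of that step only; the field names below are hypothetical, and the real schema lives in the repo's tokenization_robustness dataset:

```python
import json
from pathlib import Path

def convert_to_hf_jsonl(records, out_path):
    """Write records as one JSON object per line — a format that
    datasets.load_dataset("json", data_files=...) reads directly."""
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return out

# Example rows with hypothetical fields.
rows = [
    {"question": "2+2?", "answer": "4", "variant": "canonical"},
    {"question": "2 + 2 ?", "answer": "4", "variant": "extra_spaces"},
]
path = convert_to_hf_jsonl(rows, "tokenization_robustness.jsonl")
```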

Other Functionality

  • You can upload custom datasets to Hugging Face with this script.
  • Token surgeon is our fork of Arcee's token surgeon script.

Converting Supertoken Models

model="gpt2"
tokenizer="gpt2"
model_name="craffel/supertoken_models"
model_path="$model_name/$model/"
tokenizer_name="blester125/supervocab-$tokenizer"
hf_model_path="$PROJECT/models/$model_name"
tokenizer_path="$PROJECT/tokenizers/$tokenizer"
hf_out_path="gsaltintas/supertoken_models-llama_$model"

# Create directories
mkdir -p "$hf_model_path"
mkdir -p "$tokenizer_path"

huggingface-cli download $model_name --local-dir=$hf_model_path
huggingface-cli download $tokenizer_name --local-dir=$tokenizer_path
# Convert LLaMA weights to HuggingFace format
echo "Converting model weights to HuggingFace format..."
python -m xarch_tokenizers.scripts.convert_supertoken_models \
    --input_dir "$hf_model_path/$model" \
    --model_size 1B \
    --output_dir "$hf_out_path" \
    --llama_version 3 --tokenizer_version 3 \
    --tokenizer_path "$tokenizer_path" \
    --push_to_hub \
    --only_model --public

# Run lm_eval with converted model
echo "Running lm_eval..."
lm_eval \
--model hf --model_args "pretrained=$hf_out_path,tokenizer=$tokenizer" \
--device cuda \
--tasks tokenizer_robustness_code_technical_content,tokenizer_robustness_context-dependent_ambiguities,tokenizer_robustness_mathematical_scientific_notation,tokenizer_robustness_morphological_challenges,tokenizer_robustness_multi-linguality,tokenizer_robustness_named_entities,tokenizer_robustness_orthographic_variations,tokenizer_robustness_social_media_informal_text,tokenizer_robustness_structural_text_elements,tokenizer_robustness_temporal_expressions \
--log_samples \
--verbosity DEBUG \
--output_path "results/tokenization_robustness/v102-cleaned/supertoken/$model"
