r-three/Tokenizers

Tokenizer-level experiments

This repository contains code for experimenting with tokenizers in the xarch project. It inherits from https://github.com/r-three/ca-merging.

Contributing

  • There are no strict restrictions, but in general add your scripts under xarch_tokenizers/scripts and create a corresponding config; see EvaluationConfig in experiment_config.
  • Please add all new dependencies with uv, e.g. uv add XXX

Set-up

We recommend using uv (install it with pip install uv if not already available).

On Killarney

On the Killarney cluster, you need to first load the following modules:

module load slurm/killarney/24.05.7 StdEnv/2023  gcc/13.3  openmpi/5.0.3 cuda/12.6 python/3.10.13

The first time you run the code, you also need to install the packages:

curl -LsSf https://astral.sh/uv/install.sh | sh

# If you don't have a virtual environment already, you can either:
# 1. Install the packages into the system Python
uv pip install -e . --system

# 2. Create a venv with uv
# make sure to load cuda first (flash-attn is built locally against cuda-12.4)
uv venv --python 3.10
source .venv/bin/activate
# First run:
uv sync --extra build
uv sync --all-extras
# On machines without CUDA:
uv sync --all-extras --all-groups --no-install-package flash-attn
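After syncing, a quick sanity check is to confirm which optional packages are importable in the active environment. This is a minimal sketch; the package names checked here are just the CUDA-dependent ones mentioned above:

```python
import importlib.util

def check_packages(names):
    """Return {name: bool} indicating which packages are importable."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

# flash-attn is skipped on machines without CUDA, so it may be missing.
status = check_packages(["torch", "flash_attn"])
for name, ok in status.items():
    print(f"{name}: {'available' if ok else 'missing'}")
```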

If you already have another uv venv, you can add this package to the original project's pyproject.toml as below and run uv sync --extra tokenizers in the main directory:

[project.optional-dependencies]
tokenizers = ["xarch-tokenizers"]

[tool.uv.sources]
xarch-tokenizers = { path = "../tokenizers", editable = true }

Sample Usage:

LM-Eval

uv run eval xarch_tokenizers/configs/mgsm/mgsm_eval_llama8B.yaml 

LM-Eval: new datasets

Add new datasets under xarch_tokenizers/lm_eval_datasets. For local datasets, follow tokenization_robustness; for HuggingFace datasets, follow the mgsm configs. For datasets with subsets, create a task/task_subset.yaml.
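A minimal task config could look like the sketch below. The field names follow lm-evaluation-harness conventions, but the dataset id, prompt template, and metric choices are placeholders, not the repo's actual configs:

```yaml
# Hypothetical task config — dataset names and prompts are placeholders.
task: my_new_task
dataset_path: org/my-dataset        # HF hub id, or a local loader
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
```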

You can further override task specific settings, for an example see this config file.

You can also add custom metrics and filters for lm_eval_harness in lm_eval.py.
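A custom metric is ultimately just a function over a prediction and a reference. The sketch below shows only the metric logic as a plain function; how it is registered with lm_eval (decorator names, hook signatures) depends on the lm-eval version, so that part is omitted:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def normalized_exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if prediction matches reference after normalization."""
    return float(normalize(prediction) == normalize(reference))
```

This kind of normalization is useful for robustness evaluations, where surface variants of the same answer should score identically.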

Tokenization Robustness Dataset

We maintain a custom dataset here; follow its format when adding new examples. To run evaluation on this dataset, its formatting must first be converted (see below).

Sample evaluation configs can be found here.

python xarch_tokenizers/scripts/eval.py xarch_tokenizers/configs/tokenization_robustness/eval_llama8B.yaml
python xarch_tokenizers/scripts/eval.py xarch_tokenizers/configs/tokenization_robustness/eval_qwen_7B.yaml
python xarch_tokenizers/scripts/eval.py xarch_tokenizers/configs/tokenization_robustness/eval_qwen_05B.yaml

Convert dataset to HF format and upload to HF

python xarch_tokenizers/scripts/convert_dataset_to_hf_format.py xarch_tokenizers/configs/tokenization_robustness/convert_v102_to_hf.yaml
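The core of such a conversion is writing the records as JSON Lines, which HF's datasets library loads directly. This is a sketch of that step only; the field names below are hypothetical, and the real schema lives in the repo's tokenization_robustness dataset:

```python
import json
from pathlib import Path

def convert_to_hf_jsonl(records, out_path):
    """Write records as one JSON object per line — a format that
    datasets.load_dataset("json", data_files=...) reads directly."""
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return out

# Example rows with hypothetical fields.
rows = [
    {"question": "2+2?", "answer": "4", "variant": "canonical"},
    {"question": "2 + 2 ?", "answer": "4", "variant": "extra_spaces"},
]
path = convert_to_hf_jsonl(rows, "tokenization_robustness.jsonl")
```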

Other Functionality

  • You can upload custom datasets to Hugging Face with this script.
  • Token surgeon is our fork of Arcee's token surgeon script.

Converting Supertoken Models

model="gpt2"
tokenizer="gpt2"
model_name="craffel/supertoken_models"
model_path="$model_name/$model/"
tokenizer_name="blester125/supervocab-$tokenizer"
hf_model_path="$PROJECT/models/$model_name"
tokenizer_path="$PROJECT/tokenizers/$tokenizer"
hf_out_path="gsaltintas/supertoken_models-llama_$model"

# Create directories
mkdir -p "$hf_model_path"
mkdir -p "$tokenizer_path"

huggingface-cli download $model_name --local-dir=$hf_model_path
huggingface-cli download $tokenizer_name --local-dir=$tokenizer_path
# Convert LLaMA weights to HuggingFace format
echo "Converting model weights to HuggingFace format..."
python -m xarch_tokenizers.scripts.convert_supertoken_models \
    --input_dir "$hf_model_path/$model" \
    --model_size 1B \
    --output_dir "$hf_out_path" \
    --llama_version 3 --tokenizer_version 3 \
    --tokenizer_path "$tokenizer_path" \
    --push_to_hub \
    --only_model --public

# Run lm_eval with converted model
echo "Running lm_eval..."
lm_eval \
--model hf --model_args "pretrained=$hf_out_path,tokenizer=$tokenizer" \
--device cuda \
--tasks tokenizer_robustness_code_technical_content,tokenizer_robustness_context-dependent_ambiguities,tokenizer_robustness_mathematical_scientific_notation,tokenizer_robustness_morphological_challenges,tokenizer_robustness_multi-linguality,tokenizer_robustness_named_entities,tokenizer_robustness_orthographic_variations,tokenizer_robustness_social_media_informal_text,tokenizer_robustness_structural_text_elements,tokenizer_robustness_temporal_expressions \
--log_samples \
--verbosity DEBUG \
--output_path "results/tokenization_robustness/v102-cleaned/supertoken/$model"
