A high-performance BPE (Byte Pair Encoding) tokenizer with Python bindings and a header-only C++ implementation.
- Fast C++ Core: Header-only C++ implementation for maximum performance
- Python Bindings: Easy-to-use Python API with pybind11
- Serialization: Save and load trained tokenizer models
- Header-Only: Simple integration into C++ projects
- BPE Algorithm: Efficient subword tokenization
- Cross-Platform: Works on Windows, macOS, and Linux
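BPE builds its vocabulary by repeatedly merging the most frequent adjacent token pair. The sketch below shows one training step in plain Python; it is illustrative only and independent of the C++ core (the function names `most_frequent_pair` and `merge_pair` are not part of shatokenizer's API):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# One BPE training step on a toy byte sequence
tokens = list(b"abababc")          # [97, 98, 97, 98, 97, 98, 99]
pair = most_frequent_pair(tokens)  # (97, 98) occurs three times
tokens = merge_pair(tokens, pair, 256)
print(tokens)                      # [256, 256, 256, 99]
```

A full trainer repeats this step until the target vocabulary size is reached, recording each merge so `encode` can replay it.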
Version 0.1.1 delivers significant tokenization speed improvements over the initial 0.1.0 release.
| Tokens Processed | Time (v0.1.0) | Time (v0.1.1) | Speedup |
|---|---|---|---|
| 0 | 84,157 µs | 5,502 µs | ~15× |
| 100 | 6,977,301 µs | 642,335 µs | ~10.9× |
| 200 | 14,437,683 µs | 1,370,924 µs | ~10.5× |
| 300 | 20,902,067 µs | 2,154,547 µs | ~9.7× |
| 400 | 26,554,987 µs | 2,967,434 µs | ~8.9× |
| 500 | 32,350,267 µs | 3,798,688 µs | ~8.5× |
| 600 | 38,075,928 µs | 4,630,268 µs | ~8.2× |
| 700 | 43,831,217 µs | 5,471,428 µs | ~8.0× |
| 800 | 49,559,857 µs | 6,316,320 µs | ~7.8× |
| 900 | 56,149,850 µs | 7,166,352 µs | ~7.8× |
| 1000 | 62,877,499 µs | (Pending) | (N/A) |
Overall, version 0.1.1 is 7–15× faster across the board due to internal optimizations and improved data structures.

Note: benchmark run with a vocab size of 1000 tokens. Measurements are approximate and may vary slightly based on system specs.
You can visualize the performance improvements in the chart below:
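Timings like those above can be gathered with a simple best-of-N harness. The sketch below uses `time.perf_counter` on a stand-in function so it runs without shatokenizer installed; `benchmark` and `dummy_encode` are illustrative names, not part of the library. Swap in a trained tokenizer's `encode` to measure the real thing.

```python
import time

def benchmark(fn, text, repeats=5):
    """Return the best-of-N wall-clock time for one fn(text) call, in microseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(text)
        best = min(best, time.perf_counter() - start)
    return best * 1e6

# Stand-in for tokenizer.encode so the sketch runs without shatokenizer;
# replace with a real trained tokenizer's encode to benchmark the library.
def dummy_encode(text):
    return [ord(c) for c in text]

elapsed_us = benchmark(dummy_encode, "hello this tokenizer" * 100)
print(f"best of 5 runs: {elapsed_us:.1f} µs")
```

Taking the best of several runs reduces noise from the OS scheduler, which matters when comparing versions.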
```bash
pip install shatokenizer
```

Or install from source:

```bash
git clone https://github.com/shaheen-coder/shatokenizer.git
cd shatokenizer
pip install .
```

```python
from shatokenizer import ShaTokenizer

# Create tokenizer instance
tokenizer = ShaTokenizer()

# Train on your dataset
tokenizer.train('dataset.txt', 1000)

# Encode text to token IDs
tokens = tokenizer.encode('hello this tokenizer')
print(tokens)  # [123, 43, 1211]

# Decode token IDs back to text
text = tokenizer.decode(tokens)
print(text)  # "hello this tokenizer"

# Save trained model
tokenizer.save("shatokenizer.pkl")

# Load trained model
tokenizer2 = ShaTokenizer.load("shatokenizer.pkl")
```

```cpp
#include <iostream>
#include <shatokenizer/tokenizer.hpp>

int main() {
    ShaTokenizer tokenizer;
    tokenizer.train("data.txt", 1000);
    auto ids = tokenizer.encode("hello");
    std::cout << "decode: " << tokenizer.decode(ids) << std::endl;
    return 0;
}
```

`ShaTokenizer()` — Creates a new tokenizer instance.
`train(dataset_path, vocab_size)` — Trains the tokenizer on the provided dataset.

- `dataset_path`: Path to the training text file
- `vocab_size`: Target vocabulary size

`encode(text)` — Encodes text into token IDs.

- `text`: Input text to encode
- Returns: List of token IDs

`decode(tokens)` — Decodes token IDs back to text.

- `tokens`: List of token IDs
- Returns: Decoded text string

`save(path)` — Saves the trained tokenizer model.

- `path`: Output file path

`load(path)` — Loads a trained tokenizer model.

- `path`: Path to saved model file
- Returns: Loaded tokenizer instance
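The `.pkl` extension in the quick-start example suggests the model state is pickled on the Python side. The sketch below shows that save/load round-trip pattern on a stand-in merge table; the `merges` dict is purely illustrative, since shatokenizer's real on-disk format is not documented here.

```python
import os
import pickle
import tempfile

# Stand-in for a trained tokenizer's state: a BPE merge table mapping
# token pairs to new token IDs. Illustrative only -- not the real format.
merges = {(97, 98): 256, (256, 99): 257}

# Save: serialize the state to disk (the role tokenizer.save() plays).
path = os.path.join(tempfile.mkdtemp(), "shatokenizer.pkl")
with open(path, "wb") as f:
    pickle.dump(merges, f)

# Load: restore the state (the role ShaTokenizer.load() plays).
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == merges)  # True
```

Whatever the actual format, the invariant to check after loading is that encode/decode behave identically to the tokenizer that was saved.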
- Python 3.7+
- C++17 compatible compiler
- pybind11
- CMake (for C++ development)
```bash
# Clone the repository
git clone https://github.com/yourusername/shatokenizer.git
cd shatokenizer

# Install in development mode
pip install -e .
```

We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install development dependencies:

  ```bash
  pip install -e ".[dev]"
  ```

- Run tests:

  ```bash
  pytest
  ```
- Python code should follow PEP 8
- C++ code should follow the Google C++ Style Guide
- Use `black` for Python formatting
- Use `clang-format` for C++ formatting
- Create a feature branch: `git checkout -b feature-name`
- Make your changes and add tests
- Ensure all tests pass: `pytest`
- Format your code: `black .` and `clang-format -i src/*.cpp src/*.hpp`
- Commit your changes: `git commit -m "Add feature"`
- Push to your fork: `git push origin feature-name`
- Open a Pull Request
ShaTokenizer is designed for high performance:
- C++ core implementation for speed-critical operations
- Minimal Python overhead with pybind11
- Efficient memory usage with header-only design
- Optimized BPE algorithm implementation
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with pybind11
- Inspired by modern tokenization libraries
- Thanks to all contributors
- Documentation
- Issue Tracker
- Discussions
Made with ❤️ by Shaheen