ShaTokenizer 0.1.2


A high-performance BPE (Byte Pair Encoding) tokenizer with Python bindings and a header-only C++ implementation.

Features

  • 🚀 Fast C++ Core: Header-only C++ implementation for maximum performance
  • 🐍 Python Bindings: Easy-to-use Python API with pybind11
  • 💾 Serialization: Save and load trained tokenizer models
  • 🔧 Header-Only: Simple integration into C++ projects
  • 📊 BPE Algorithm: Efficient subword tokenization
  • 🎯 Cross-Platform: Works on Windows, macOS, and Linux
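
The BPE merge loop can be illustrated with a short, self-contained sketch. This mirrors the textbook algorithm (count adjacent symbol pairs, merge the most frequent pair, repeat), not ShaTokenizer's actual C++ internals:

```python
# Minimal BPE training sketch (illustrative only, not ShaTokenizer's code).
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    out = {}
    for symbols, freq in words.items():
        new_syms, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_syms.append(merged)
                i += 2
            else:
                new_syms.append(symbols[i])
                i += 1
        out[tuple(new_syms)] = freq
    return out

# Toy corpus: word (pre-split into characters) -> frequency.
words = {tuple("lower"): 2, tuple("lowest"): 1}
for _ in range(3):  # three merge steps
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge_pair(words, pair)
print(words)  # {('lowe', 'r'): 2, ('lowe', 's', 't'): 1}
```

Each iteration grows the vocabulary by one merged symbol; training stops once the target vocabulary size is reached.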

⏱️ Tokenization Time Benchmark

🔄 Version Comparison: v0.1.1 vs v0.1.0

Version 0.1.1 delivers significant tokenization-speed improvements over the initial 0.1.0 release.

| Tokens Processed | Time (v0.1.0) | Time (v0.1.1) | Speedup |
|-----------------:|--------------:|--------------:|--------:|
| 0    | 84,157 µs     | 5,502 µs     | ~15×   |
| 100  | 6,977,301 µs  | 642,335 µs   | ~10.9× |
| 200  | 14,437,683 µs | 1,370,924 µs | ~10.5× |
| 300  | 20,902,067 µs | 2,154,547 µs | ~9.7×  |
| 400  | 26,554,987 µs | 2,967,434 µs | ~8.9×  |
| 500  | 32,350,267 µs | 3,798,688 µs | ~8.5×  |
| 600  | 38,075,928 µs | 4,630,268 µs | ~8.2×  |
| 700  | 43,831,217 µs | 5,471,428 µs | ~8.0×  |
| 800  | 49,559,857 µs | 6,316,320 µs | ~7.8×  |
| 900  | 56,149,850 µs | 7,166,352 µs | ~7.8×  |
| 1000 | 62,877,499 µs | (Pending)    | (N/A)  |

⚡ Overall, version 0.1.1 is 7–15× faster across the board due to internal optimizations and improved data structures.

💡 Benchmark run with a vocab size of 1000 tokens. Measurements are approximate and may vary slightly based on system specs.
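
Timings like those above can be collected with a simple wall-clock harness. The sketch below is illustrative, not the project's actual benchmark script; `dummy_encode` is a placeholder to swap out for a real `ShaTokenizer.encode` call:

```python
# Simple timing harness sketch; dummy_encode stands in for tokenizer.encode.
import time

def benchmark(fn, arg, repeats=5):
    """Return the best-of-N wall-clock time for fn(arg), in microseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best * 1_000_000

def dummy_encode(text):
    # Placeholder: maps each character to its code point.
    return [ord(c) for c in text]

elapsed_us = benchmark(dummy_encode, "hello this tokenizer" * 100)
print(f"{elapsed_us:.0f} µs")
```

Taking the best of several repeats reduces noise from background load; for stable numbers, pin the benchmark to one machine and report the specs alongside the results.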


📈 Visual Benchmark

You can visualize the performance improvements in the chart below:

[Chart: Tokenization Time Comparison — lower is better]


Installation

From PyPI

```shell
pip install shatokenizer
```

From Source

```shell
git clone https://github.com/shaheen-coder/shatokenizer.git
cd shatokenizer
pip install .
```

Quick Start

Python Usage

```python
from shatokenizer import ShaTokenizer

# Create tokenizer instance
tokenizer = ShaTokenizer()

# Train on your dataset
tokenizer.train('dataset.txt', 1000)

# Encode text to token IDs
tokens = tokenizer.encode('hello this tokenizer')
print(tokens)  # [123, 43, 1211]

# Decode token IDs back to text
text = tokenizer.decode(tokens)
print(text)  # "hello this tokenizer"

# Save trained model
tokenizer.save("shatokenizer.pkl")

# Load trained model
tokenizer2 = ShaTokenizer.load("shatokenizer.pkl")
```

C++ Usage

```cpp
#include <iostream>
#include <shatokenizer/tokenizer.hpp>

int main() {
    // Stack allocation; no need for new/delete.
    ShaTokenizer tokenizer;
    tokenizer.train("data.txt", 1000);

    auto ids = tokenizer.encode("hello");
    std::cout << "decode: " << tokenizer.decode(ids) << std::endl;
    return 0;
}
```

API Reference

Python API

ShaTokenizer()

Creates a new tokenizer instance.

train(dataset_path: str, vocab_size: int) -> None

Trains the tokenizer on the provided dataset.

  • dataset_path: Path to the training text file
  • vocab_size: Target vocabulary size

encode(text: str) -> List[int]

Encodes text into token IDs.

  • text: Input text to encode
  • Returns: List of token IDs

decode(tokens: List[int]) -> str

Decodes token IDs back to text.

  • tokens: List of token IDs
  • Returns: Decoded text string

save(path: str) -> None

Saves the trained tokenizer model.

  • path: Output file path

load(path: str) -> ShaTokenizer (static method)

Loads a trained tokenizer model.

  • path: Path to saved model file
  • Returns: Loaded tokenizer instance

Building from Source

Requirements

  • Python 3.7+
  • C++17 compatible compiler
  • pybind11
  • CMake (for C++ development)
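
For pure C++ use, the header-only core only needs an include path. A minimal CMake setup might look like the fragment below; the target name and include directory are assumptions, since the repository's actual build layout may differ:

```cmake
cmake_minimum_required(VERSION 3.14)
project(shatokenizer_demo CXX)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# Header-only: just point the compiler at the headers
# (the include/ path is an assumption about the repo layout).
add_executable(demo main.cpp)
target_include_directories(demo PRIVATE ${CMAKE_SOURCE_DIR}/include)
```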

Build Instructions

```shell
# Clone the repository
git clone https://github.com/shaheen-coder/shatokenizer.git
cd shatokenizer

# Install in development mode
pip install -e .
```

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

  1. Fork the repository
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install development dependencies:
    pip install -e ".[dev]"
  4. Run tests:
    pytest

Code Style

  • Python code should follow PEP 8
  • C++ code should follow Google C++ Style Guide
  • Use black for Python formatting
  • Use clang-format for C++ formatting

Submitting Changes

  1. Create a feature branch: git checkout -b feature-name
  2. Make your changes and add tests
  3. Ensure all tests pass: pytest
  4. Format your code: black . and clang-format -i src/*.cpp src/*.hpp
  5. Commit your changes: git commit -m "Add feature"
  6. Push to your fork: git push origin feature-name
  7. Open a Pull Request

Performance

ShaTokenizer is designed for high performance:

  • C++ core implementation for speed-critical operations
  • Minimal Python overhead with pybind11
  • Efficient memory usage with header-only design
  • Optimized BPE algorithm implementation

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with pybind11
  • Inspired by modern tokenization libraries
  • Thanks to all contributors

Support


Made with ❤️ by Shaheen
