Summary
This issue proposes adding encode(), decode(), and load() to BaseTokenizer as abstract methods and implementing them fully in BPETokenizer. This completes the BPE tokenizer
contract and ensures all future tokenizer implementations
are required to support these core operations.
Background
PR #17 introduced BaseTokenizer as an abstract interface and BPETokenizer as its first implementation.
The current BaseTokenizer defines:
train() ✅ defined
get_vocab_path() ✅ defined
get_merges_path() ✅ defined
encode() ❌ not defined
decode() ❌ not defined
load() ❌ not defined
Since encode(), decode() and load() are not part of the abstract interface, subclasses have no obligation to
implement them. This breaks the tokenizer contract and leaves BPETokenizer only half complete.
Problems in Current Implementation
1. BaseTokenizer does not enforce encode/decode/load
2. BPETokenizer cannot encode or decode
3. No way to reload BPETokenizer from disk
4. No input validation in BPETokenizer.train()
Proposed Changes
openverifiablellm/tokenizer/base.py
Add three abstract methods:
openverifiablellm/tokenizer/bpe_tokenizer.py
Implement encode():
Implement decode():
Implement load():
Harden train():
tests/test_bpe.py ← new file
With this PR:
- BaseTokenizer enforces full workflow
- BPETokenizer is fully usable
- encode → decode works
- Pipeline can progress to model training
Scope
This is a change touching only:
base.py
bpe_tokenizer.py
tests/test_bpe.py
No overlap with any existing open PRs.
No dependency on any other issue.
Related
Additional Context
No response
Code of Conduct
Summary
This issue proposes adding encode(), decode(), and load() to BaseTokenizer as abstract methods and implementing them fully in BPETokenizer. This completes the BPE tokenizer
contract and ensures all future tokenizer implementations
are required to support these core operations.
Background
PR #17 introduced
BaseTokenizeras an abstract interface andBPETokenizeras its first implementation.The current
BaseTokenizerdefines:Since encode(), decode() and load() are not part of the abstract interface, subclasses have no obligation to
implement them. This breaks the tokenizer contract and leaves BPETokenizer only half complete.
Problems in Current Implementation
1. BaseTokenizer does not enforce encode/decode/load
2. BPETokenizer cannot encode or decode
3. No way to reload BPETokenizer from disk
4. No input validation in BPETokenizer.train()
Proposed Changes
openverifiablellm/tokenizer/base.pyAdd three abstract methods:
openverifiablellm/tokenizer/bpe_tokenizer.pyImplement encode():
Implement decode():
Implement load():
Harden train():
tests/test_bpe.py← new fileWith this PR:
Scope
This is a change touching only:
base.pybpe_tokenizer.pytests/test_bpe.pyNo overlap with any existing open PRs.
No dependency on any other issue.
Related
Additional Context
No response
Code of Conduct