Implement a deterministic dataset encoding pipeline and a minimal LLaMA-style architecture.
The dataset has already been processed into wiki_clean.txt and a tokenizer has been trained. We now need a memory-efficient script that encodes the entire dataset into a binary format for training.
- Implement a minimal PyTorch LLaMA-style architecture to ensure deterministic behavior and full control over initialization (a seeded-initialization sketch follows the Verification Criteria).
- Read the dataset in chunks to avoid high memory usage.
- Use the trained tokenizer (BPE/SentencePiece) to convert text into token IDs.
- Stream token IDs into a binary dataset file (.bin, uint16 or similar).
- Compute a SHA256 hash of the resulting file (the full encoding flow is sketched right after this list).
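
A minimal sketch of that pipeline, assuming a SentencePiece model; the file names (tokenizer.model, wiki_clean.bin) and the chunk size are placeholders, not settled parts of the repo:

```python
import hashlib

import numpy as np
import sentencepiece as spm

# Hypothetical paths and sizes -- adjust to the actual repo layout.
TOKENIZER_PATH = "tokenizer.model"
INPUT_PATH = "wiki_clean.txt"
OUTPUT_PATH = "wiki_clean.bin"
CHUNK_LINES = 10_000  # lines encoded per batch; tune to the memory budget


def flush(sp: spm.SentencePieceProcessor, lines: list[str], dst) -> None:
    # Encode a batch of lines and append the flat token stream as uint16.
    ids = sp.encode(lines)  # list of token-ID lists
    flat = [tok for seq in ids for tok in seq]
    np.asarray(flat, dtype=np.uint16).tofile(dst)


def encode_dataset() -> str:
    sp = spm.SentencePieceProcessor(model_file=TOKENIZER_PATH)
    assert sp.vocab_size() <= 65536, "uint16 cannot hold this vocab"
    with open(INPUT_PATH, encoding="utf-8") as src, open(OUTPUT_PATH, "wb") as dst:
        batch: list[str] = []
        for line in src:  # stream line by line; never load the whole file
            batch.append(line)
            if len(batch) == CHUNK_LINES:
                flush(sp, batch, dst)
                batch = []
        if batch:
            flush(sp, batch, dst)
    # Hash the finished file in fixed-size blocks, again without loading
    # it all into memory.
    digest = hashlib.sha256()
    with open(OUTPUT_PATH, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


if __name__ == "__main__":
    print(encode_dataset())
```

Running this twice should print the same digest, which is exactly the first verification criterion below.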
Verification Criteria
- Running the dataset encoding pipeline twice should produce identical binary files and SHA256 hashes.
- Initializing the model twice with the same seed must output the exact same initial parameter hashes (see the sketch below).
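
For the parameter-hash criterion, a minimal sketch assuming CPU initialization; the class, sizes, and seed below are illustrative stand-ins, not the real model config:

```python
import hashlib

import torch
import torch.nn as nn

VOCAB, DIM, SEED = 32_000, 256, 1337  # illustrative config only


class TinyLlama(nn.Module):
    # A deliberately small LLaMA-style skeleton: token embedding, one
    # projection standing in for the transformer blocks, and an untied
    # output head -- just enough to exercise deterministic initialization.
    def __init__(self, vocab: int, dim: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.block = nn.Linear(dim, dim, bias=False)
        self.head = nn.Linear(dim, vocab, bias=False)


def param_hash(model: nn.Module) -> str:
    # Hash every parameter in a fixed (name-sorted) order so the digest
    # depends only on the initialized values.
    digest = hashlib.sha256()
    for name, param in sorted(model.named_parameters()):
        digest.update(name.encode())
        digest.update(param.detach().cpu().numpy().tobytes())
    return digest.hexdigest()


def seeded_model() -> TinyLlama:
    torch.manual_seed(SEED)  # pins every default initializer
    return TinyLlama(VOCAB, DIM)


if __name__ == "__main__":
    assert param_hash(seeded_model()) == param_hash(seeded_model())
    print("parameter hash:", param_hash(seeded_model()))
```

Seeding immediately before construction (rather than once at import time) keeps the hash independent of anything else that may have consumed random state earlier in the run.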
Additional Context
No response
Code of Conduct