A reproduction of GPT-2, built completely from scratch in pure PyTorch.
We recreated the pre-training, instruction fine-tuning, and inference code, with RLHF coming soon.
Code files:
root
|___model.py        # the GPT-2 model file
|___dataloader.py   # the dataloader for GPT pre-training
|___trainer.py      # the training code for pre-training
|___pretrain_data/
    |___fineweb_downloader.py   # downloader of the pre-training data
We follow the GPT-2 model architecture and set the model hyperparameters in the model.GPTConfig dataclass:
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024    # context window, no more than 1024 tokens
    vocab_size: int = 50257   # vocabulary size, based on the GPT-2 tokenizer
    n_layer: int = 12         # number of transformer layers
    n_head: int = 12          # number of attention heads
    n_embd: int = 768         # the hidden dimension size

With this setup, the model has 124M parameters.
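If you want to sanity-check the 124M figure, here is a rough back-of-the-envelope count. This is a sketch that assumes the standard GPT-2 layer layout with the output head tied to the token embedding; it is not code from model.py.

n_embd, n_layer, vocab_size, block_size = 768, 12, 50257, 1024

embeddings = vocab_size * n_embd + block_size * n_embd   # token + position embeddings
attn       = 4 * n_embd * n_embd + 4 * n_embd            # qkv + output projections (+ biases)
mlp        = 8 * n_embd * n_embd + 5 * n_embd            # 4x-expansion MLP, two linears (+ biases)
layernorm  = 2 * 2 * n_embd                               # two LayerNorms per block (weight + bias)
block      = attn + mlp + layernorm

total = embeddings + n_layer * block + 2 * n_embd         # plus the final LayerNorm
print(f"{total / 1e6:.1f}M parameters")                   # ~124.4M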
We downloaded the FineWeb-10B dataset and pre-trained our GPT model on 10B English tokens. We reuse the GPT-2 tokenizer from the tiktoken library to convert the English text into tokens, and we use PyTorch's DistributedDataParallel to train the model across GPUs.
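For reference, tokenizing a document with the GPT-2 encoding looks roughly like this. This is a sketch; the actual preprocessing in fineweb_downloader.py may shard and store the tokens differently.

import tiktoken

enc = tiktoken.get_encoding("gpt2")
# GPT-2's end-of-text token (id 50256), commonly used to separate documents.
eot = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

def tokenize_document(text: str) -> list[int]:
    # Prepend the end-of-text token so documents are delimited in the token stream.
    return [eot] + enc.encode_ordinary(text)

print(tokenize_document("Hello world"))   # token ids, starting with 50256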
The training config:
- Global Batch Size: 524288 (2^19) tokens per step
- GPU Count: 8 RTX-5090
- Micro Batch Size: 32 (the number of sequences each GPU processes per forward pass)
- Gradient Accumulation Steps: 2. The full batch does not fit in memory even across 8 GPUs, so we use gradient accumulation. With 8 GPUs, a micro batch size of 32, and a sequence length of 1024, the number of accumulation steps is 524288 / (32 * 8 * 1024) = 2.
You can adjust the GPU count and the micro batch size; the gradient accumulation steps are recomputed in the code, as in the sketch below. Just make sure you don't hit a CUDA out-of-memory error :)
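A minimal version of that computation (variable names are illustrative, not necessarily the ones used in trainer.py):

total_batch_size = 524288    # 2**19 tokens per optimizer step
B = 32                       # micro batch size (sequences per GPU per forward pass)
T = 1024                     # sequence length
world_size = 8               # number of GPUs

assert total_batch_size % (B * T * world_size) == 0
grad_accum_steps = total_batch_size // (B * T * world_size)
print(grad_accum_steps)      # 524288 / (32 * 1024 * 8) = 2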
During pre-training we saturate the context window, so every sequence contains exactly 1024 tokens. We trained the model for one epoch, with a linear learning-rate warmup followed by cosine decay, and got the loss graph below.
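For reference, such a schedule looks roughly like this. This is a sketch; the learning rates and warmup length below are placeholder values, not necessarily the ones in trainer.py.

import math

max_lr, min_lr = 6e-4, 6e-5            # placeholder values
warmup_steps = 700                     # placeholder value
max_steps = 19073                      # ~10e9 tokens / 524288 tokens per step, one epoch

def get_lr(step: int) -> float:
    # Linear warmup from 0 to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # After the decay window, hold at min_lr.
    if step >= max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr.
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (max_lr - min_lr)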
Code files:
root
|___finetune.py # finetune training code
|___finetune_data/
    |___alpaca_downloader.py   # downloader of the finetune data
We used the Stanford Alpaca dataset, a set of 52K instruction-response pairs, to fine-tune our pretrained GPT model, training for 2 epochs.
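If you want to inspect the data yourself, one way to pull the pairs is via the Hugging Face datasets library. This is a sketch; the repo's downloader script may fetch and store the data differently.

from datasets import load_dataset

# The Alpaca release on the Hugging Face Hub; ~52K examples.
ds = load_dataset("tatsu-lab/alpaca", split="train")
print(len(ds))

sample = ds[0]
# Each record has "instruction", "input" (possibly empty), and "output" fields.
question = sample["instruction"] + ("\n" + sample["input"] if sample["input"] else "")
answer = sample["output"]
print(question, answer, sep="\n---\n")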
During pre-training we saturate the context window by using 1024 tokens for every sequence. In fine-tuning, however, each question-answer pair has a different length, so to form a batch we pad the shorter sequences at the end with token 0. We concatenate the question and the answer into one input sequence; for its target, we shift the sequence left by 1 and mask everything except the answer tokens with -100. The data looks like this:
input: <question tokens><answer tokens><0, 0, 0>
target: <-100, -100, -100><answer tokens><-100, -100, -100>
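Concretely, building one training example could look like the sketch below (illustrative only, not necessarily the exact code in finetune.py):

import torch

IGNORE_INDEX = -100   # PyTorch's cross-entropy loss skips this label by default
PAD_ID = 0

def build_example(q_ids: list[int], a_ids: list[int], seq_len: int):
    # Sketch of turning one question/answer pair into an (input, target) pair.
    tokens = q_ids + a_ids
    assert len(tokens) <= seq_len

    # Input: question + answer, padded with 0s up to the batch sequence length.
    input_ids = tokens + [PAD_ID] * (seq_len - len(tokens))

    # Target: the input shifted left by one (position i predicts token i + 1),
    # with the question part and the padding masked with -100.
    target = ([IGNORE_INDEX] * (len(q_ids) - 1)               # shifted question part
              + a_ids                                          # the answer tokens we learn on
              + [IGNORE_INDEX] * (seq_len - len(tokens) + 1))  # padding part

    return torch.tensor(input_ids), torch.tensor(target)

# Example with toy token ids.
inp, tgt = build_example([10, 11, 12], [20, 21], seq_len=8)
print(inp.tolist())   # [10, 11, 12, 20, 21, 0, 0, 0]
print(tgt.tolist())   # [-100, -100, 20, 21, -100, -100, -100, -100]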
Setting a target position to -100 instructs PyTorch's cross-entropy loss (whose default ignore_index is -100) to skip that position when computing the loss, so the model is only trained to generate the correct answer given the question. Below is the loss graph:
After fine-tuning, our small GPT model acts like a chatbot.
Before fine-tuning:
prompt: "What are the places I can visit in New York?"
response: "The most popular locations are the South Side Pier, the Highlevard and the Manhattan Bridge: New York.\
How many places in New York are there? \
The New York City area has a population of 591,974. New York is the largest city in New York. \
What is New York City in terms of population? \
New York City ..." # [keep generating]
prompt: "Who is the first president of the United States?"
response: "The U.S. Constitution was written by Thomas Jefferson in 1790. The primary purpose of the Constitution was to secure the independence of the people. \
What are the 4 types of government? \
In its most basic form ..." # [keep generating]
After fine-tuning:
prompt: "What are the places I can visit in New York?"
response: "The most popular locations to visit in New York include the Empire \
State Building, the White House, the Statue of Liberty, and iconic New York \
skyline landmarks like the Crocker Bridge, the Empire State Building, and the \
Central Park."
prompt: "Who is the first president of the United States?"
response: "The first president of the United States was George Washington."The model will generate response for the question and generate "<|endoftext|>" token timely.
Code files:
root
|___inference.py # the inference code
The main inference function is the generate function in this file. We loosely follow the Hugging Face API design, with the generate function defined as:
def generate(
    model,             # the GPT model
    input_ids,         # the input token batch, a PyTorch tensor of shape [B, T]
    attention_masks,   # the input mask, a PyTorch tensor of shape [B, T]
    temperature,       # generation temperature
    max_steps,         # maximum number of generation steps
)

input_ids is a batch of token lists representing the prompts; shorter prompts are padded with 0. attention_masks has the same shape as input_ids and contains 1 wherever the corresponding position holds a real token and 0 for the padding tokens. You can get both by calling the tokenize_batch_input function with a list of prompt strings.
Example:
input_ids: [
[134, 567, 34, 11, 34],
[89, 32, 22, 0, 0]
]
attention_masks: [
[1, 1, 1, 1, 1],
[1, 1, 1, 0, 0]
]
The generate function invokes the model's forward pass in an auto-regressive fashion and samples the next token from the vocabulary distribution, with the logits scaled by the temperature parameter. It keeps invoking the model until every prompt has produced the '<|endoftext|>' token, or until max_steps is reached.
Since our model is small, you can run this inference directly on CPU.
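The core of one decoding step looks roughly like this sketch (the idea only, not the exact body of generate):

import torch
import torch.nn.functional as F

EOT_ID = 50256   # GPT-2's "<|endoftext|>" token id

@torch.no_grad()
def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Take the logits at the last position, scale by temperature, and sample
    # one next-token id per sequence in the batch.
    logits = logits[:, -1, :] / max(temperature, 1e-8)   # [B, vocab_size]
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # [B, 1]

# Tiny demo with random logits standing in for a model forward pass.
fake_logits = torch.randn(2, 5, 50257)                   # [B=2, T=5, vocab]
print(sample_next_token(fake_logits, temperature=0.8))

The full loop appends each sampled token to input_ids, extends attention_masks with a 1 for the new position, and stops once every row has emitted EOT_ID or max_steps forward passes have been made.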
RLHF: coming soon.

