A transformer-based language model built from scratch in PyTorch, with a custom tokenization and training pipeline.
Built a complete GPT-style language model from scratch, implementing the core transformer components: multi-head attention, positional encoding, and causal masking. The model supports both text generation and classification tasks.
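As a sketch of how generation works, autoregressive sampling repeatedly feeds the running sequence back through the model and samples the next token. The `generate` helper below is illustrative rather than part of the repository's API; it assumes the `GPTModel` class shown further down, and the `temperature` and `max_new_tokens` defaults are arbitrary choices.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0):
    """Hypothetical sampling loop; prompt_ids is a (1, seq_len) tensor of token ids."""
    model.eval()
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature        # logits for the last position
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token
        ids = torch.cat([ids, next_id], dim=1)              # append and continue
    return ids
```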
The model follows a standard transformer decoder architecture: 12 layers, a hidden dimension of 768, and 12 attention heads. It was trained on a diverse text corpus with the AdamW optimizer.
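A concrete (hypothetical) setup with those hyperparameters might look like the following, assuming the `GPTModel` class defined in the code below; the vocabulary size, batch size, and sequence length here are assumptions, not values from the repository.

```python
import torch

# 12 layers, 768 hidden dimensions, 12 attention heads as described above;
# the GPT-2 style vocab_size of 50257 and the input shape are illustrative.
model = GPTModel(vocab_size=50257, d_model=768, n_heads=12, n_layers=12)
tokens = torch.randint(0, 50257, (2, 128))   # (batch, seq_len) token ids
logits = model(tokens)                       # -> shape (2, 128, 50257)
```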
Implemented a custom training loop with gradient clipping, learning-rate scheduling, and checkpointing, and used cross-entropy loss with label smoothing for better generalization.
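A minimal sketch of such a loop, assuming the `GPTModel` shown below and a `train_loader` that yields `(inputs, targets)` batches of token ids; the learning rate, clipping value, smoothing factor, scheduler choice, and checkpoint interval are all illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_steps, device="cpu"):
    # model: the GPTModel defined below; train_loader yields (inputs, targets)
    # batches of token ids. All hyperparameter values here are assumptions.
    model.to(device)
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # cross-entropy with label smoothing

    for step, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)                                    # (batch, seq_len, vocab_size)
        loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()                                          # learning-rate schedule
        if step % 1000 == 0:                                      # periodic checkpointing
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict()},
                       f"checkpoint_{step}.pt")
```

The core model definition is shown next.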
```python
import torch
import torch.nn as nn

class GPTModel(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers):
        super().__init__()
        # Token embeddings plus positional encodings
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model)
        # Stack of decoder blocks with causal self-attention
        self.transformer_blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads)
            for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)            # final layer norm
        self.head = nn.Linear(d_model, vocab_size)   # projection to vocabulary logits

    def forward(self, x):
        # x: (batch, seq_len) token ids
        x = self.embedding(x) + self.pos_encoding(x)
        for block in self.transformer_blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.head(x)                          # (batch, seq_len, vocab_size) logits
```
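The `TransformerBlock` and `PositionalEncoding` modules referenced above are not shown here; the following is a minimal sketch of one possible implementation, using `nn.MultiheadAttention` with a boolean causal mask and sinusoidal positional encodings. The pre-norm layout, 4x feed-forward expansion, dropout rate, and `max_len` are assumptions rather than details taken from the repository.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal encodings; returns a (1, seq_len, d_model) tensor that the
    model adds to its token embeddings. max_len is an assumed context limit."""
    def __init__(self, d_model, max_len=1024):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len) token ids; only the sequence length is used.
        return self.pe[:, :x.size(1)]

class TransformerBlock(nn.Module):
    """Pre-norm decoder block: causal multi-head self-attention + feed-forward."""
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # 4x expansion, as in GPT-2
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                  # residual connection
        x = x + self.ff(self.ln2(x))      # residual connection
        return x
```

The boolean mask marks future positions as disallowed, which is what gives the block its causal, left-to-right attention pattern.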