llm.c is a minimalist implementation of Large Language Models (LLMs) in pure C and CUDA, designed to be lightweight and efficient without the overhead of heavyweight dependencies such as PyTorch or CPython. The project focuses on pretraining, specifically reproducing the GPT-2 and GPT-3 models, and includes a parallel PyTorch reference implementation for comparison.
Key Features:
- Pure C/CUDA Implementation: No dependency on heavyweight packages such as PyTorch (245MB) or CPython (107MB).
- Performance: Training currently runs about 7% faster than PyTorch Nightly.
- Educational Focus: Well-documented code and kernels in the `dev/cuda` folder for learning purposes.
- Multi-GPU Support: Includes scripts for multi-node training using OpenMPI, shared file systems, or TCP sockets (see the launch sketch after this list).
- Flash Attention: Optional cuDNN integration for improved performance, disabled by default due to longer compile times (see the build sketch below).
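As a concrete sketch of the multi-GPU launch path, assuming the CUDA trainer `train_gpt2cu` has been built and OpenMPI/NCCL are available (exact flags depend on your cluster setup):

```bash
# build the CUDA training binary
make train_gpt2cu
# single-node, multi-GPU: launch one rank per GPU (e.g. 8 GPUs)
mpirun -np 8 ./train_gpt2cu
```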
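Enabling the optional cuDNN flash-attention path is a build-time switch; a minimal sketch, assuming the `USE_CUDNN=1` make flag and an installed cuDNN frontend:

```bash
# default build: hand-written attention kernels, no cuDNN needed
make train_gpt2cu
# opt-in build: cuDNN flash attention (longer compile, faster attention)
make train_gpt2cu USE_CUDNN=1
```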
Getting Started:
- Reproducing GPT-2 (124M): Detailed steps are provided in Discussion #481.
- CPU/GPU Training: Includes simple reference implementations for both CPU (fp32) and GPU (CUDA) training (see the quick-start sketch after this list).
- Starter Pack: A script (`download_starter_pack.sh`) to quickly download necessary files like pretrained weights and tokenized datasets.
- Unit Testing: Compare the C/CUDA implementations with PyTorch for accuracy (see the test sketch after this list).
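A rough quick-start sketch, assuming the starter-pack script lives under `dev/` and the default make targets; adjust paths to your checkout:

```bash
# download pretrained weights, tokenizer, and a small tokenized dataset
chmod u+x dev/download_starter_pack.sh
./dev/download_starter_pack.sh

# CPU reference training run (fp32, OpenMP-parallelized)
make train_gpt2
OMP_NUM_THREADS=8 ./train_gpt2

# single-GPU CUDA training run
make train_gpt2cu
./train_gpt2cu
```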
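And a sketch of the corresponding unit tests, which check the C/CUDA implementations against a PyTorch reference (binary names are assumed to follow the repository's `test_*` pattern):

```bash
# CPU fp32 implementation vs. PyTorch reference
make test_gpt2
./test_gpt2

# CUDA implementation vs. PyTorch reference
make test_gpt2cu
./test_gpt2cu
```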
Use Cases:
- Education: Learn how LLMs are implemented from scratch in C/CUDA.
- Research: Reproduce GPT-2/GPT-3 training runs with minimal overhead.
- Performance Optimization: Experiment with hand-written kernels and compare against library implementations like cuBLAS.
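As an illustration of that workflow, the kernels under `dev/cuda` are built and timed individually; a minimal sketch, assuming the Makefile there exposes per-kernel targets and that each binary takes a kernel-version number as its argument:

```bash
cd dev/cuda
# build one kernel family, e.g. the matmul forward pass
make matmul_forward
# time individual kernel versions (version 1 is typically the naive baseline)
./matmul_forward 1
./matmul_forward 3
```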
Community Contributions:
The project welcomes ports to other languages (C++, Rust, Java, etc.) and encourages discussions via GitHub Issues, PRs, or Discord channels (`#llmc` on Zero to Hero, or `#llmdotc` on GPU MODE).
License:
MIT