llm.c

LLMs in simple, pure C/CUDA

2024-04-18

llm.c is a minimalist implementation of Large Language Models (LLMs) in pure C and CUDA, designed to be lightweight and efficient without the overhead of large dependencies such as PyTorch or CPython. The project focuses on pretraining, specifically reproducing the GPT-2 and GPT-3 models, and includes a parallel PyTorch reference implementation for comparison.

Key Features:

  • Pure C/CUDA Implementation: No dependency on heavyweight packages such as PyTorch (245MB) or the CPython runtime (107MB); a sketch of the plain-C kernel style follows this list.
  • Performance: Currently about 7% faster than PyTorch Nightly.
  • Educational Focus: Well-documented code and kernels in the dev/cuda folder for learning purposes.
  • Multi-GPU Support: Includes scripts for multi-node training using OpenMPI, shared file systems, or TCP sockets (a gradient-averaging sketch also follows this list).
  • Flash Attention: Optional cuDNN integration for improved performance (disabled by default due to longer compile times).
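
A minimal sketch of the plain-C kernel style the project is built from: a GELU forward pass using the tanh approximation that GPT-2 uses. The function name and signature here are illustrative, not copied from the repository.

    #include <math.h>

    // GELU activation, tanh approximation as used by GPT-2:
    // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    void gelu_forward(float* out, const float* inp, int n) {
        const float s = sqrtf(2.0f / 3.14159265f);
        for (int i = 0; i < n; i++) {
            float x = inp[i];
            float cube = 0.044715f * x * x * x;
            out[i] = 0.5f * x * (1.0f + tanhf(s * (x + cube)));
        }
    }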

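Data-parallel multi-GPU training boils down to averaging gradients across ranks at each step. Below is a hedged sketch of that idea using plain MPI; the function and buffer names are hypothetical, and the project's actual collective code may use different primitives.

    #include <mpi.h>

    // Average a gradient buffer across all ranks (data-parallel training).
    // grads/num_params are illustrative names; call once per optimizer step.
    void allreduce_gradients(float* grads, int num_params) {
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        // Sum each gradient element across ranks in place, then average.
        MPI_Allreduce(MPI_IN_PLACE, grads, num_params, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < num_params; i++) {
            grads[i] /= (float)world_size;
        }
    }
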
Getting Started:

  1. Reproducing GPT-2 (124M): Detailed steps are provided in Discussion #481.
  2. CPU/GPU Training: Includes simple reference implementations for both CPU (fp32) and GPU (CUDA) training; an optimizer-step sketch follows this list.
  3. Starter Pack: A script (download_starter_pack.sh) to quickly download necessary files like pretrained weights and tokenized datasets.
  4. Unit Testing: Compare the C/CUDA implementations against the PyTorch reference for numerical accuracy (see the tolerance-check sketch after this list).
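
To give a feel for what the CPU reference training loop does each step, here is a sketch of an AdamW parameter update in plain C. The function and buffer names are illustrative and the hyperparameter handling is simplified, so treat it as a reading aid rather than the repository's exact code.

    #include <math.h>

    // One AdamW update over a parameter buffer.
    // m/v hold the first/second moment estimates; t is the 1-based step count.
    void adamw_update(float* params, const float* grads, float* m, float* v,
                      long n, int t, float lr, float beta1, float beta2,
                      float eps, float weight_decay) {
        for (long i = 0; i < n; i++) {
            float g = grads[i];
            m[i] = beta1 * m[i] + (1.0f - beta1) * g;       // first moment
            v[i] = beta2 * v[i] + (1.0f - beta2) * g * g;   // second moment
            float m_hat = m[i] / (1.0f - powf(beta1, (float)t));  // bias correction
            float v_hat = v[i] / (1.0f - powf(beta2, (float)t));
            // decoupled weight decay (the "W" in AdamW)
            params[i] -= lr * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * params[i]);
        }
    }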

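The unit tests amount to running the C/CUDA code and the PyTorch reference on the same inputs and comparing the resulting tensors element by element within a tolerance. A hedged sketch of such a check (not the repository's actual test code):

    #include <stdio.h>
    #include <math.h>

    // Return 1 if two float buffers agree within an absolute tolerance, else 0.
    int check_tensor(const float* a, const float* b, long n, float tol, const char* label) {
        int ok = 1;
        for (long i = 0; i < n; i++) {
            if (fabsf(a[i] - b[i]) > tol) {
                if (ok) printf("MISMATCH in %s at %ld: %f vs %f\n", label, i, a[i], b[i]);
                ok = 0;
            }
        }
        printf("%s: %s\n", label, ok ? "OK" : "NOT OK");
        return ok;
    }
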
Use Cases:

  • Education: Learn how LLMs are implemented from scratch in C/CUDA.
  • Research: Reproduce GPT-2/GPT-3 training runs with minimal overhead.
  • Performance Optimization: Experiment with hand-written kernels and compare them against library implementations like cuBLAS (see the timing-harness sketch below).
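
Kernel work follows a simple loop: write a candidate kernel, time it against a baseline, and keep the faster one. A minimal wall-clock timing harness in plain C is sketched below; the kernels under dev/cuda are benchmarked in a similar spirit against library implementations such as cuBLAS. The function-pointer signature here is illustrative.

    #include <time.h>

    // Time a candidate kernel over several repetitions; returns average seconds per call.
    double time_kernel(void (*fn)(float*, const float*, int),
                       float* out, const float* in, int n, int reps) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++) {
            fn(out, in, n);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return elapsed / reps;
    }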

Community Contributions:

The project welcomes ports to other languages (C++, Rust, Java, etc.) and encourages discussions via GitHub Issues, PRs, or Discord channels (#llmc on Zero to Hero or #llmdotc on GPU MODE).

License:

MIT

Tags: Artificial Intelligence, Machine Learning, CUDA, C Programming, GPT