llm.c

LLMs in simple, pure C/CUDA

2024-04-18

llm.c is a minimalist implementation of Large Language Models (LLMs) in pure C and CUDA, designed to be lightweight and efficient without the overhead of large dependencies such as PyTorch or CPython. The project focuses on pretraining, specifically reproducing the GPT-2 and GPT-3 models, and includes a parallel PyTorch reference implementation for comparison.

Key Features:

  • Pure C/CUDA Implementation: No dependency on heavyweight packages such as PyTorch (245MB) or the CPython runtime (107MB); a sketch of the plain-C kernel style follows this list.
  • Performance: Currently about 7% faster than PyTorch Nightly.
  • Educational Focus: Well-documented code and kernels in the dev/cuda folder for learning purposes.
  • Multi-GPU Support: Includes scripts for multi-node training using OpenMPI, shared file systems, or TCP sockets (a gradient-averaging sketch also follows this list).
  • Flash Attention: Optional cuDNN integration for improved performance (disabled by default due to longer compile times).
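
A minimal sketch of the plain-C kernel style the project is built from: a GELU forward pass using the tanh approximation that GPT-2 uses. The function name and signature here are illustrative, not copied from the repository.

    #include <math.h>

    // GELU activation, tanh approximation as used by GPT-2:
    // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    void gelu_forward(float* out, const float* inp, int n) {
        const float s = sqrtf(2.0f / 3.14159265f);
        for (int i = 0; i < n; i++) {
            float x = inp[i];
            float cube = 0.044715f * x * x * x;
            out[i] = 0.5f * x * (1.0f + tanhf(s * (x + cube)));
        }
    }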

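Data-parallel multi-GPU training boils down to averaging gradients across ranks at each step. Below is a hedged sketch of that idea using plain MPI; the function and buffer names are hypothetical, and the project's actual collective code may use different primitives.

    #include <mpi.h>

    // Average a gradient buffer across all ranks (data-parallel training).
    // grads/num_params are illustrative names; call once per optimizer step.
    void allreduce_gradients(float* grads, int num_params) {
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        // Sum each gradient element across ranks in place, then average.
        MPI_Allreduce(MPI_IN_PLACE, grads, num_params, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < num_params; i++) {
            grads[i] /= (float)world_size;
        }
    }
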
Getting Started:

  1. Reproducing GPT-2 (124M): Detailed steps are provided in Discussion #481.
  2. CPU/GPU Training: Includes simple reference implementations for both CPU (fp32) and GPU (CUDA) training; an optimizer-step sketch follows this list.
  3. Starter Pack: A script (download_starter_pack.sh) to quickly download necessary files like pretrained weights and tokenized datasets.
  4. Unit Testing: Compare the C/CUDA implementations against the PyTorch reference for numerical accuracy (see the tolerance-check sketch after this list).
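
To give a feel for what the CPU reference training loop does each step, here is a sketch of an AdamW parameter update in plain C. The function and buffer names are illustrative and the hyperparameter handling is simplified, so treat it as a reading aid rather than the repository's exact code.

    #include <math.h>

    // One AdamW update over a parameter buffer.
    // m/v hold the first/second moment estimates; t is the 1-based step count.
    void adamw_update(float* params, const float* grads, float* m, float* v,
                      long n, int t, float lr, float beta1, float beta2,
                      float eps, float weight_decay) {
        for (long i = 0; i < n; i++) {
            float g = grads[i];
            m[i] = beta1 * m[i] + (1.0f - beta1) * g;       // first moment
            v[i] = beta2 * v[i] + (1.0f - beta2) * g * g;   // second moment
            float m_hat = m[i] / (1.0f - powf(beta1, (float)t));  // bias correction
            float v_hat = v[i] / (1.0f - powf(beta2, (float)t));
            // decoupled weight decay (the "W" in AdamW)
            params[i] -= lr * (m_hat / (sqrtf(v_hat) + eps) + weight_decay * params[i]);
        }
    }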

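The unit tests amount to running the C/CUDA code and the PyTorch reference on the same inputs and comparing the resulting tensors element by element within a tolerance. A hedged sketch of such a check (not the repository's actual test code):

    #include <stdio.h>
    #include <math.h>

    // Return 1 if two float buffers agree within an absolute tolerance, else 0.
    int check_tensor(const float* a, const float* b, long n, float tol, const char* label) {
        int ok = 1;
        for (long i = 0; i < n; i++) {
            if (fabsf(a[i] - b[i]) > tol) {
                if (ok) printf("MISMATCH in %s at %ld: %f vs %f\n", label, i, a[i], b[i]);
                ok = 0;
            }
        }
        printf("%s: %s\n", label, ok ? "OK" : "NOT OK");
        return ok;
    }
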
Use Cases:

  • Education: Learn how LLMs are implemented from scratch in C/CUDA.
  • Research: Reproduce GPT-2/GPT-3 training runs with minimal overhead.
  • Performance Optimization: Experiment with hand-written kernels and compare them against library implementations like cuBLAS (see the timing-harness sketch below).
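
Kernel work follows a simple loop: write a candidate kernel, time it against a baseline, and keep the faster one. A minimal wall-clock timing harness in plain C is sketched below; the kernels under dev/cuda are benchmarked in a similar spirit against library implementations such as cuBLAS. The function-pointer signature here is illustrative.

    #include <time.h>

    // Time a candidate kernel over several repetitions; returns average seconds per call.
    double time_kernel(void (*fn)(float*, const float*, int),
                       float* out, const float* in, int n, int reps) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++) {
            fn(out, in, n);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return elapsed / reps;
    }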

Community Contributions:

The project welcomes ports to other languages (C++, Rust, Java, etc.) and encourages discussions via GitHub Issues, PRs, or Discord channels (#llmc on Zero to Hero or #llmdotc on GPU MODE).

License:

MIT

Tags: Artificial Intelligence, Machine Learning, CUDA, C Programming, GPT