Grok-1 is a state-of-the-art open-weights large language model developed by xAI. At 314 billion parameters, it is one of the largest models released openly to the public. It uses a Mixture-of-Experts (MoE) architecture with 8 experts, of which 2 are active per token, allowing the parameter count to scale without a proportional increase in per-token compute; a routing sketch follows below.
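To make the 8-expert / top-2 routing concrete, here is a minimal JAX sketch of the idea. It is not the repository's implementation; the names (`expert_mlp`, `moe_layer`, `router_w`) and the dense per-expert computation are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

NUM_EXPERTS = 8   # experts in the MoE layer
TOP_K = 2         # experts activated per token


def expert_mlp(params, x):
    """One expert as a plain two-layer MLP (hypothetical parameter layout)."""
    h = jax.nn.gelu(x @ params["w_in"])
    return h @ params["w_out"]


def moe_layer(router_w, expert_params, x):
    """Route each token to its top-2 experts and mix their outputs.

    x: [tokens, d_model], router_w: [d_model, NUM_EXPERTS],
    expert_params: list of NUM_EXPERTS parameter dicts.
    """
    logits = x @ router_w                                # [tokens, 8]
    top_vals, top_idx = jax.lax.top_k(logits, TOP_K)     # [tokens, 2]
    gates = jax.nn.softmax(top_vals, axis=-1)            # renormalize over the chosen experts

    # Dense reference computation: every expert processes every token and the
    # gate weights zero out all but the two selected ones. Correct but wasteful,
    # in the spirit of the repo's "no custom kernels" choice described below.
    all_out = jnp.stack(
        [expert_mlp(expert_params[e], x) for e in range(NUM_EXPERTS)], axis=1
    )                                                    # [tokens, 8, d_model]
    mask = jax.nn.one_hot(top_idx, NUM_EXPERTS)          # [tokens, 2, 8]
    weights = (gates[..., None] * mask).sum(axis=1)      # [tokens, 8]
    return jnp.einsum("te,ted->td", weights, all_out)    # [tokens, d_model]
```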
Key technical specifications include:
- Architecture: 64 layers with 48 attention heads for queries and 8 for keys/values.
- Embedding Size: 6,144 dimensions for rich token representations.
- Tokenization: SentencePiece tokenizer with a 131,072-token vocabulary.
- Advanced Features: Rotary positional embeddings (RoPE), activation sharding, and 8-bit quantization for optimized performance (a RoPE sketch follows this list).
- Context Length: Handles sequences up to 8,192 tokens.
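Rotary embeddings rotate pairs of query/key channels by a position-dependent angle so that relative positions show up directly in the attention scores. The sketch below shows the standard formulation in JAX; the repository's exact variant (channel pairing, frequency base) may differ, so treat it as an assumption-laden illustration.

```python
import jax.numpy as jnp


def rotary_embedding(x, positions, base=10000.0):
    """Apply rotary position embeddings (RoPE) to query or key vectors.

    x: [tokens, heads, head_dim] with an even head_dim; positions: [tokens].
    Each pair of channels is rotated by an angle proportional to the token's
    position, so relative offsets are encoded in the q·k dot product.
    """
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (jnp.arange(half) / half))   # [half]
    angles = positions[:, None] * freqs[None, :]        # [tokens, half]
    cos = jnp.cos(angles)[:, None, :]                   # broadcast over heads
    sin = jnp.sin(angles)[:, None, :]

    x1, x2 = x[..., :half], x[..., half:]
    return jnp.concatenate([x1 * cos - x2 * sin,
                            x1 * sin + x2 * cos], axis=-1)
```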
The repository provides JAX-based example code for loading and running the Grok-1 model. Because of the model's size, the example code requires a machine with enough GPU memory to hold the 314B-parameter checkpoint. The MoE layer in particular is implemented inefficiently on purpose: the implementation avoids custom kernels so that the model's correctness can be validated more easily.
To get started, users need to download the model weights either via a provided magnet link or through the HuggingFace Hub. The project is licensed under Apache 2.0, covering both the source code and the model weights, making it accessible for a wide range of applications and further development.
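As an illustration, the weights can be fetched from the Hub with the `huggingface_hub` Python client. The `xai-org/grok-1` repository id comes from the project; the `ckpt-0/*` pattern and the local `checkpoints` directory are assumptions about the layout the example code expects, so check them against the current README.

```python
from huggingface_hub import snapshot_download

# Download only the checkpoint directory from the xai-org/grok-1 repository.
# The "ckpt-0/*" pattern and "checkpoints" target directory are assumed to
# match the layout the example code expects; verify before relying on them.
snapshot_download(
    repo_id="xai-org/grok-1",
    repo_type="model",
    allow_patterns="ckpt-0/*",
    local_dir="checkpoints",
)
```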