
Four unique takeaways from DeepSeek-V3
A detailed explainer of the innovations behind the recent DeepSeek model
- 14.8 trillion tokens of training data
- Post-training: SFT followed by RL (GRPO)
- Strong performance on common benchmarks
1. FP8 mixed-precision training
2. Multi-head latent attention (MLA) compression
3. Mixture of Experts (MoE)
4. Multi-token prediction (MTP)
- Fine-Grained Quantization: DeepSeek-V3 introduces a fine-grained quantization strategy to address common FP8 challenges such as overflows and underflows caused by its limited dynamic range (a minimal sketch of this tile/block scaling appears after these bullets). Specifically:
  - Activations are grouped and scaled on a 1x128 tile basis (per token per 128 channels).
  - Weights are grouped and scaled on a 128x128 block basis.
  - This granularity handles outliers better than standard tensor-wise quantization.
- Increased Accumulation Precision: The model improves the precision of FP8 General Matrix Multiplication (GEMM) by promoting partial results to FP32 registers at fixed intervals during accumulation, mitigating errors caused by the limited bit-width accumulation in Tensor Cores (a conceptual sketch appears at the end of this FP8 section).
- Unified E4M3 Format: Unlike prior work that uses hybrid FP8 formats (e.g., E4M3 for the forward pass and E5M2 for the backward pass), DeepSeek-V3 adopts the E4M3 format universally. This is enabled by its fine-grained quantization, which effectively shares exponent bits among grouped elements.
- Online Quantization: DeepSeek-V3 computes scaling factors dynamically for each 1x128 activation tile or 128x128 weight block during training. This eliminates the need for delayed quantization methods that rely on historical maximum values, simplifying the framework and improving accuracy.
- Low-Precision Storage and Communication: Cached activations and optimizer states are stored in lower-precision formats (e.g., BF16 for optimizer states and FP8 for activations). Special care is taken for sensitive operators, such as attention layers, where customized formats like E5M6 are used.
- General Matrix Multiplication (GEMM) Operations: All three GEMMs associated with the Linear operator, namely Fprop (forward pass), Dgrad (activation backward pass), and Wgrad (weight backward pass), are implemented in FP8 precision. This design significantly accelerates training and reduces memory consumption; in particular, FP8 Wgrad GEMM allows activations to be stored in FP8 for the backward pass.
- Cached Activations: Activations are cached in FP8 format for the backward pass of the Linear operator to reduce memory consumption.
- MoE Training Communication: Activations before the MoE up-projections are quantized into FP8 for the dispatch components, aligning with the FP8 Fprop in the MoE up-projections. Activation gradients before the MoE down-projections are also quantized into FP8.
- Embedding Module and Output Head: These components retain their original precision (e.g., BF16 or FP32) due to their sensitivity to low-precision computation.
- MoE Gating Modules: The gating modules are kept in higher precision for stability.
- Normalization Operators: Normalization operations, such as RMSNorm, remain in BF16 or FP32 to ensure numerical stability.
- Attention Operators: Attention operations are also performed in higher-precision formats (BF16 or FP32), as they are sensitive to precision loss.
- Optimizer States: The first and second moments in the AdamW optimizer are stored in BF16, while master weights and weight gradients are retained in FP32 for numerical stability.
- Special Cases for Linear Inputs: Inputs of the Linear operator after attention are stored in a customized E5M6 format instead of FP8 due to their sensitivity. Inputs of the SwiGLU operator in MoE are cached in FP8 but recomputed during the backward pass for accuracy.
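To make the tile- and block-wise scaling concrete, here is a minimal PyTorch sketch of the online scaling and clamping logic. It is an illustration under stated assumptions, not DeepSeek's kernel code: the actual cast to an FP8 dtype is omitted, the helper names (`quantize_tiles`, `quantize_blocks`) are made up, and 448 is the largest finite value of the E4M3 (FN) format assumed here.

```python
import torch

E4M3_MAX = 448.0  # largest finite value in the E4M3 (FN) format

def quantize_tiles(x: torch.Tensor, tile: int = 128):
    """Activations: one scale per 1x128 tile (per token, per 128 channels)."""
    tokens, channels = x.shape
    xt = x.reshape(tokens, channels // tile, tile)
    scale = xt.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (xt / scale).clamp(-E4M3_MAX, E4M3_MAX)    # here q would be cast to an FP8 dtype
    return q.reshape(tokens, channels), scale      # per-tile scales stay in FP32

def quantize_blocks(w: torch.Tensor, block: int = 128):
    """Weights: one scale per 128x128 block."""
    rows, cols = w.shape
    wb = w.reshape(rows // block, block, cols // block, block)
    scale = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (wb / scale).clamp(-E4M3_MAX, E4M3_MAX)
    return q.reshape(rows, cols), scale

# "Online quantization": scales are recomputed from the current tensor each time,
# rather than tracked from historical maxima.
x = torch.randn(4, 256) * 10
xq, xs = quantize_tiles(x)
x_hat = (xq.reshape(4, 2, 128) * xs).reshape(4, 256)   # dequantize with the stored scales
print((x - x_hat).abs().max())  # near zero here, since the 8-bit rounding step is omitted
```

The per-group FP32 scales travel with the quantized data and are multiplied back in on dequantization, which is how a narrow format like E4M3 can cover tensors whose outliers would otherwise overflow it.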
- Previous studies on FP8 training, such as those referenced in the paper (e.g., Dettmers et al., 2022; Kalamkar et al., 2019), primarily focused on inference quantization or tensor-wise scaling methods.
- NVIDIA’s H100 GPUs introduced support for FP8 but faced challenges with limited accumulation precision and sensitivity to outliers. DeepSeek-V3 addresses these issues through its fine-grained quantization strategy and improved GEMM precision.
- Hybrid FP8 formats like E4M3/E5M2 were commonly used in earlier frameworks to balance dynamic range and precision across different stages of training. DeepSeek-V3’s unified use of E4M3 represents a departure from this convention.
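The increased-accumulation-precision idea described above can be sketched as follows. This is a conceptual illustration only: FP16 stands in for the low-precision Tensor Core accumulator (plain PyTorch exposes no FP8 arithmetic), and the 128-element interval mirrors the paper's description of periodically promoting partial results to FP32 registers.

```python
import torch

def chunked_accumulate(products: torch.Tensor, interval: int = 128) -> torch.Tensor:
    """Accumulate in low precision, promoting the partial sum to FP32 every `interval` terms."""
    acc32 = torch.zeros((), dtype=torch.float32)        # high-precision accumulator
    for chunk in products.split(interval):
        partial = torch.zeros((), dtype=torch.float16)  # limited-precision accumulator
        for p in chunk:
            partial = partial + p                       # stays in FP16 within the chunk
        acc32 = acc32 + partial.float()                 # promotion step
    return acc32

products = (torch.randn(4096) * 1e-3).half()
print(chunked_accumulate(products), products.float().sum())
```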
- In standard MHA, each attention head maintains its own set of key and value vectors, so the KV-cache memory requirement grows in proportion to the number of heads. MLA reduces this by compressing keys and values into a shared latent representation, lowering the memory requirement (see the sketch after these MLA bullets).
- Only the compressed latent vectors need to be cached during inference.
- The smaller KV cache allows processing longer sequences or larger batches without exceeding hardware memory limits.
- Despite compression, MLA achieves comparable performance to standard MHA by dynamically reconstructing keys and values during computation (which is the whole point!)
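Below is a simplified sketch of the latent-KV idea: keys and values are reconstructed from a small cached latent vector instead of being cached per head. It ignores DeepSeek-V3's decoupled RoPE path, query-side compression, and real dimensions; the module names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 1024, 8, 64, 128

class LatentKV(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress to latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct V

    def forward(self, h):                      # h: (batch, seq, d_model)
        c_kv = self.down(h)                    # (batch, seq, d_latent) -> this is the KV cache
        k = self.up_k(c_kv).unflatten(-1, (n_heads, d_head))
        v = self.up_v(c_kv).unflatten(-1, (n_heads, d_head))
        return c_kv, k, v

h = torch.randn(2, 16, d_model)
c_kv, k, v = LatentKV()(h)
print(c_kv.shape, k.shape)
```

Only `c_kv` (128 floats per token in this toy setup) would be cached during inference, versus 2 * n_heads * d_head = 1024 floats per token for the uncompressed keys and values.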
- DeepSeekMoE builds upon GShard but introduces several innovations to improve efficiency and performance. One key difference is its auxiliary-loss-free load balancing strategy: instead of relying on auxiliary losses, DeepSeekMoE dynamically adjusts per-expert bias terms during training to achieve balanced utilization without impairing model performance (a minimal routing sketch follows these MoE bullets). This approach both simplifies training and avoids the potential degradation caused by auxiliary losses.
- Another significant improvement is DeepSeekMoE’s fine-grained expert segmentation. Unlike traditional architectures where each expert operates independently, DeepSeekMoE divides experts into smaller units and allows more flexible combinations of activated experts per token. This segmentation enhances diversity and specialization among experts while maintaining computational efficiency. Additionally, DeepSeekMoE introduces shared experts that capture common knowledge across tasks, reducing redundancy and improving overall efficiency.
- The sparse activation strategy in DeepSeekMoE is particularly noteworthy for its efficiency. In DeepSeek-V3, only 37 billion parameters out of a total 671 billion are activated per token. This sparse activation drastically reduces memory usage and computational costs compared to dense models or traditional MoE frameworks like GShard. By activating only the most relevant parameters for each token, DeepSeekMoE achieves high scalability while maintaining state-of-the-art performance across various benchmarks.
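Here is a minimal sketch of the auxiliary-loss-free balancing idea: a per-expert bias nudges the top-k selection toward underloaded experts, while the gating weights themselves still use the unbiased affinities. The `gamma` value, the sigmoid affinities, and the sign-based update are simplifications for illustration, not DeepSeek's exact recipe.

```python
import torch

n_experts, top_k, gamma = 16, 4, 0.001
bias = torch.zeros(n_experts)     # per-expert routing bias, updated between batches

def route(scores: torch.Tensor):
    """scores: (tokens, n_experts) affinities. The bias influences selection only."""
    topk_idx = (scores + bias).topk(top_k, dim=-1).indices
    gate = torch.zeros_like(scores).scatter(1, topk_idx, scores.gather(1, topk_idx))
    gate = gate / gate.sum(dim=-1, keepdim=True)   # gating weights use the unbiased scores
    return gate, topk_idx

def update_bias(topk_idx: torch.Tensor, n_tokens: int):
    """Nudge the bias down for overloaded experts and up for underloaded ones."""
    global bias
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    target = n_tokens * top_k / n_experts          # perfectly balanced load
    bias = bias - gamma * torch.sign(load - target)

affinities = torch.randn(32, n_experts).sigmoid()  # stand-in for token-expert affinities
gate, idx = route(affinities)
update_bias(idx, n_tokens=32)
```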
- Enhanced Data Efficiency: By predicting multiple tokens at once, MTP densifies the training signal. For each input sequence, more predictions are made, leading to better utilization of the training data. This contrasts with standard next-token prediction, where only one prediction is made per position, leaving much of the sequence's predictive power untapped (a toy version of this objective follows these bullets).
- Improved Representation Planning: MTP encourages the model to pre-plan its internal representations to account for longer-term dependencies. By requiring predictions for multiple future tokens simultaneously, MTP pushes the model to encode richer contextual information at each position. This aligns more closely with how humans process language, as studies suggest that human cognition often anticipates multiple upcoming words when reading or listening.
- Broader Generalization: The ability to predict multiple tokens improves generalization on tasks that require reasoning over longer contexts or generating coherent sequences. For example, DeepSeek-V3 demonstrates superior performance on benchmarks like HumanEval (coding) and GSM8K (math reasoning), where long-term planning and multi-step reasoning are critical.
- Speculative Decoding Potential: The MTP objective can also be repurposed at inference time for speculative decoding, where predictions for multiple tokens are generated in parallel rather than strictly sequentially, significantly reducing latency.
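A toy version of the densified training signal looks like this: alongside the usual next-token cross-entropy, an extra head is trained to predict the token two positions ahead. DeepSeek-V3's actual MTP uses sequential modules that share the embedding and output head; the second head, the tiny trunk, and the `lam` weight below are illustrative stand-ins, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, seq = 1000, 256, 32
trunk = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, d_model), nn.GELU())
head_next = nn.Linear(d_model, vocab)    # predicts token t+1 (standard objective)
head_ahead = nn.Linear(d_model, vocab)   # predicts token t+2 (the extra MTP signal)

tokens = torch.randint(0, vocab, (4, seq))
h = trunk(tokens)                        # (batch, seq, d_model)

# Next-token loss over positions 0..seq-2, lookahead loss over positions 0..seq-3:
loss_next = F.cross_entropy(head_next(h[:, :-1]).flatten(0, 1), tokens[:, 1:].flatten())
loss_mtp = F.cross_entropy(head_ahead(h[:, :-2]).flatten(0, 1), tokens[:, 2:].flatten())
lam = 0.3                                # MTP loss weight (illustrative value)
loss = loss_next + lam * loss_mtp
```

Every position now contributes two supervised predictions instead of one, which is the "densified training signal" described above.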
- DeepSeek-V3/DeepSeek_V3.pdf at main · deepseek-ai/DeepSeek-V3 https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
- Efficient Post-training Quantization with FP8 Formats https://arxiv.org/abs/2309.14592
- Performance Improvements in Quantization Aware Training and Appreciation of Low Precision Computation in Deep Learning https://www.semanticscholar.org/paper/de439b972009b2920465c1c0e01b9405f74f8d89
- Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs https://arxiv.org/abs/2411.087196
- Standalone 16-bit Training: Missing Study for Hardware-Limited Deep Learning Practitioners https://arxiv.org/abs/2305.10947
- Better Schedules for Low Precision Training of Deep Neural Networks https://arxiv.org/abs/2403.02243
- Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point https://arxiv.org/abs/2407.02610
- Efficient low-precision training for deep learning accelerators https://www.semanticscholar.org/paper/0351a9f6702a05cd34afc100f3488edc1ff4921a
- Future Token Prediction — Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction https://arxiv.org/abs/2410.181603
- RankingGPT: Empowering Large Language Models in Text Ranking with Progressive Enhancement https://www.semanticscholar.org/paper/5f90745ddcff92520fd87bc2181676c19b8e3dea
- Better & Faster Large Language Models via Multi-token Prediction https://arxiv.org/abs/2404.19737
- MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models https://arxiv.org/abs/2402.01620
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model https://arxiv.org/abs/2408.11039
- Understanding the Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation https://arxiv.org/abs/2402.032689
- Transformers Can Navigate Mazes With Multi-Step Prediction https://arxiv.org/abs/2412.05117
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.