Four unique takeaways from DeepSeek-V3
A detailed explainer of the innovations behind the recent DeepSeek model
Shreyas Subramanian
Amazon Employee
Published Jan 7, 2025
Last Modified Jan 10, 2025
The DeepSeek-V3 (https://www.deepseek.com/) technical report (https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf) contains several things that were not surprising:
- several trillion tokens of training data
- SFT followed by RL (GRPO) as the training sequence
- strong performance on common benchmarks
However, the following four innovations stand out:
1. FP8 training
2. Multi-Head Latent Attention (MLA) compression
3. Mixture of Experts (DeepSeekMoE)
4. Multi-Token Prediction
Here is a detailed walkthrough of these innovations to keep in mind when you are training your next LLM.
The DeepSeek-V3 paper highlights several novel uses of FP8 (8-bit floating point) in training large-scale language models, distinguishing its approach from prior work. Here’s why these uses are new, based on the paper’s claims and comparisons with existing practices:
- Fine-Grained Quantization: DeepSeek-V3 introduces a fine-grained quantization strategy to address common challenges like overflows and underflows in FP8 due to its limited dynamic range (a short sketch of this tiling scheme follows this list). Specifically:
  - Activations are grouped and scaled on a 1x128 tile basis (per token, per 128 channels).
  - Weights are grouped and scaled on a 128x128 block basis.
  - This granularity allows better handling of outliers compared to standard tensor-wise quantization.
- Increased Accumulation Precision: The model improves the precision of FP8 General Matrix Multiplication (GEMM) by promoting partial results to FP32 registers at specific intervals during accumulation. This mitigates errors caused by limited bit-width accumulation in Tensor Cores.
- Unified E4M3 Format: Unlike prior work that uses hybrid FP8 formats (e.g., E4M3 for the forward pass and E5M2 for the backward pass), DeepSeek-V3 adopts the E4M3 format universally. This is enabled by its fine-grained quantization, which effectively shares exponent bits among grouped elements.
- Online Quantization: DeepSeek-V3 employs online quantization, calculating scaling factors dynamically for each 1x128 activation tile or 128x128 weight block during training. This eliminates the need for delayed quantization methods that rely on historical maximum values, simplifying the framework and improving accuracy.
- Low-Precision Storage and Communication: Cached activations and optimizer states are stored in lower-precision formats (e.g., BF16 for optimizer states and FP8 for activations). Special care is taken for sensitive operators, such as attention layers, where customized formats like E5M6 are used.
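To make the tiling and online-scaling ideas concrete, here is a minimal PyTorch sketch. This is not DeepSeek's kernel code: the function names and shapes are my own, and the FP8 cast is only simulated by clamping to the E4M3 range (on FP8-capable hardware you would cast to an FP8 dtype and run the GEMM with periodic FP32 accumulation, as the report describes).

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in the E4M3 format

def quantize_activations_1x128(x: torch.Tensor, tile: int = 128):
    """Per-(token, 128-channel) tile quantization of a (tokens, channels) tensor."""
    tokens, channels = x.shape
    x_tiles = x.view(tokens, channels // tile, tile)
    # "Online" scaling: scales come from the current tile maxima, not a running history.
    scales = x_tiles.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (x_tiles / scales).clamp(-E4M3_MAX, E4M3_MAX)  # real code would cast to an FP8 dtype here
    return q, scales  # quantized tiles plus their FP32 scaling factors

def quantize_weights_128x128(w: torch.Tensor, block: int = 128):
    """Per-128x128-block quantization of an (out_features, in_features) weight matrix."""
    out_f, in_f = w.shape
    w_blocks = w.view(out_f // block, block, in_f // block, block)
    scales = w_blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / E4M3_MAX
    q = (w_blocks / scales).clamp(-E4M3_MAX, E4M3_MAX)
    return q, scales

# Toy usage: an outlier in one tile no longer forces a tiny scale onto the whole tensor.
x = torch.randn(4, 256)
x[0, 0] = 1000.0
xq, xs = quantize_activations_1x128(x)
print(xs.squeeze(-1))  # only the tile containing the outlier gets a large scale
```

The point of the fine granularity is that each 1x128 tile (or 128x128 block) carries its own FP32 scale computed from the values it actually contains, which is also what makes a single E4M3 format workable for both forward and backward passes.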
The choice of operations using or avoiding FP8 is guided by a trade-off between computational efficiency and numerical stability: components that are sensitive to low-precision computation or critical for maintaining training dynamics are kept at higher precision, while the rest leverage FP8 for its efficiency benefits. Below is a breakdown of both groups, based on the technical report (and summarized in a short code sketch after the breakdown).
Operations that use FP8:
- General Matrix Multiplication (GEMM) Operations: All three GEMMs associated with the Linear operator (Fprop for the forward pass, Dgrad for the activation backward pass, and Wgrad for the weight backward pass) are implemented in FP8 precision. This design significantly accelerates training and reduces memory consumption. The FP8 Wgrad GEMM also allows activations to be stored in FP8 for the backward pass, further optimizing memory usage.
- Cached Activations: Activations are cached in FP8 format for the backward pass of the Linear operator to reduce memory consumption.
- MoE Training Communication: Activations before MoE up-projections are quantized into FP8 for the dispatch components, aligning with the FP8 Fprop in MoE up-projections. Activation gradients before MoE down-projections are also quantized into FP8.
Operations kept at higher precision (or handled specially):
- Embedding Module and Output Head: These components retain their original precision (e.g., BF16 or FP32) due to their sensitivity to low-precision computation.
- MoE Gating Modules: The gating modules are maintained in higher precision for stability.
- Normalization Operators: Normalization operations, such as RMSNorm, remain in BF16 or FP32 to ensure numerical stability.
- Attention Operators: Attention operations are also performed in higher-precision formats (BF16 or FP32), as they are sensitive to precision loss.
- Optimizer States: The first and second moments in the AdamW optimizer are stored in BF16, while master weights and weight gradients are retained in FP32 for numerical stability.
- Special Cases for Linear Inputs: Inputs of the Linear operator after attention are stored in a customized E5M6 format instead of FP8 due to their sensitivity. Inputs of the SwiGLU operator in MoE are cached in FP8 but recomputed during the backward pass for accuracy.
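Pulling the breakdown above together, here is an illustrative precision map of the kind you might keep next to a similar mixed-precision recipe. The keys and value strings are my own shorthand for the components described in the report, not identifiers from DeepSeek's codebase.

```python
# Shorthand summary of the precision assignments described above (not DeepSeek code).
PRECISION_POLICY = {
    # FP8 paths
    "linear.fprop_gemm":            "FP8 E4M3 (1x128 activation tiles, 128x128 weight blocks)",
    "linear.dgrad_gemm":            "FP8 E4M3",
    "linear.wgrad_gemm":            "FP8 E4M3",
    "linear.cached_activations":    "FP8 E4M3",
    "moe.dispatch_activations":     "FP8 E4M3 (before up-projections)",
    "moe.grad_before_down_proj":    "FP8 E4M3",
    # Higher-precision or special-cased paths
    "embedding":                    "BF16/FP32",
    "output_head":                  "BF16/FP32",
    "moe.gating":                   "BF16/FP32",
    "rmsnorm":                      "BF16/FP32",
    "attention":                    "BF16/FP32",
    "linear_input_after_attention": "E5M6 (customized format)",
    "swiglu_inputs":                "FP8 cached, recomputed in the backward pass",
    "optimizer.adamw_moments":      "BF16",
    "optimizer.master_weights":     "FP32",
    "optimizer.weight_gradients":   "FP32",
}
```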
How do these choices compare with prior FP8 efforts?
- Previous studies on FP8 training, such as those referenced in the paper (e.g., Dettmers et al., 2022; Kalamkar et al., 2019), primarily focused on inference quantization or tensor-wise scaling methods.
- NVIDIA’s H100 GPUs introduced support for FP8 but faced challenges with limited accumulation precision and sensitivity to outliers. DeepSeek-V3 addresses these issues through its fine-grained quantization strategy and improved GEMM precision.
- Hybrid FP8 formats like E4M3/E5M2 were commonly used in earlier frameworks to balance dynamic range and precision across different stages of training. DeepSeek-V3’s unified use of E4M3 represents a departure from this convention.
The Multi-Head Latent Attention (MLA) mechanism in DeepSeek-V3 introduces a low-rank joint compression for attention keys and values, significantly reducing the Key-Value (KV) cache size during inference while maintaining performance comparable to standard Multi-Head Attention (MHA). At a high level, it works as follows (a minimal code sketch appears after the list):
- In standard MHA, each attention head maintains its own set of key-value pairs, which results in a memory requirement proportional to the number of heads. MLA reduces this by compressing keys and values into a shared, lower-dimensional latent representation, lowering the memory requirement.
- Only the compressed latent vectors need to be cached during inference.
- The smaller KV cache allows processing longer sequences or larger batches without exceeding hardware memory limits.
- Despite compression, MLA achieves comparable performance to standard MHA by dynamically reconstructing keys and values during computation (which is the whole point!)
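To make the compression concrete, here is a minimal, self-contained sketch of the low-rank KV idea. It is deliberately simplified: the class name, layer names, and dimensions are mine, and it omits DeepSeek-V3's decoupled RoPE keys and query-side compression.

```python
import torch
import torch.nn as nn

class LowRankKVAttention(nn.Module):
    """Toy attention layer that caches a small latent instead of full keys/values."""
    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q    = nn.Linear(d_model, n_heads * d_head, bias=False)
        self.w_down = nn.Linear(d_model, d_latent, bias=False)           # compress to latent
        self.w_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
        self.w_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values
        self.w_o    = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, h):                      # h: (batch, seq, d_model)
        b, s, _ = h.shape
        latent = self.w_down(h)                # (b, s, d_latent) -- the only thing to cache
        q = self.w_q(h).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_up_k(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_up_v(latent).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, -1)
        return self.w_o(out)

# For this toy configuration, a standard MHA cache stores 2 * 8 * 64 = 1024 values per
# token, while only the 128-dimensional latent needs to be cached here (an 8x reduction).
y = LowRankKVAttention()(torch.randn(2, 16, 1024))
```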
But wait, aren’t the new projection matrices learned? Don’t they introduce additional learnable parameters?
Yes, the down-projection matrix in Multi-Head Latent Attention (MLA) is learned, and it introduces additional learnable parameters. As discussed above, the down-projection matrix is used to compress the input embeddings into a lower-dimensional latent space. This compression reduces the dimensionality of the key-value cache but still requires learning the parameters of the projection matrix.
The introduction of this matrix does add learnable parameters to the model. However, the overall increase is relatively small compared to the total number of parameters, because the compression dimension is much smaller than the total dimension of keys and values in standard multi-head attention. This trade-off allows for significant memory savings during inference without substantially increasing the model’s complexity or training cost.
The GShard framework is one of the most foundational and widely used Mixture-of-Experts (MoE) architectures, designed to enable sparse activation in large-scale models. This architecture intersperses Transformer layers with MoE layers, where each MoE layer contains multiple feed-forward networks (FFNs), referred to as “experts.” The key innovation in GShard is its gating mechanism, which determines which experts are activated for processing each token. The gating network computes an affinity score for each expert and token pair, based on the token’s hidden representation and a learnable gating weight matrix. Using these scores, only the top-K experts with the highest affinity are selected for activation. The outputs of the selected experts are then combined, weighted by their gating values.
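As a rough illustration of that gating step, here is a simplified top-K routing function. The names and shapes are mine, and a production MoE layer would add capacity limits, the auxiliary load-balancing loss, and expert-parallel communication.

```python
import torch
import torch.nn.functional as F

def top_k_gating(hidden, gate_weight, experts, k=2):
    """hidden: (tokens, d_model); gate_weight: (d_model, n_experts); experts: list of FFNs."""
    scores = F.softmax(hidden @ gate_weight, dim=-1)     # affinity of each token for each expert
    top_vals, top_idx = scores.topk(k, dim=-1)           # keep only the top-K experts per token
    out = torch.zeros_like(hidden)
    for slot in range(k):
        for e, expert in enumerate(experts):
            routed = top_idx[:, slot] == e               # tokens whose slot-th choice is expert e
            if routed.any():
                out[routed] += top_vals[routed, slot, None] * expert(hidden[routed])
    return out

# Toy usage: 8 tiny "experts", 16 tokens, 2 experts active per token.
experts = [torch.nn.Linear(64, 64) for _ in range(8)]
y = top_k_gating(torch.randn(16, 64), torch.randn(64, 8), experts, k=2)
```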
While GShard introduced sparse activation and efficient scaling, it has limitations such as reliance on auxiliary losses for load balancing, which can degrade model performance, and challenges in ensuring expert specialization. These limitations have motivated advancements like DeepSeekMoE.
- DeepSeekMoE builds upon GShard but introduces several innovations to improve efficiency and performance. One key difference lies in its auxiliary-loss-free load-balancing strategy: instead of relying on auxiliary losses, DeepSeekMoE dynamically adjusts per-expert bias terms during training to achieve balanced utilization without impairing model performance (a sketch of this idea follows this list). This approach not only simplifies training but also avoids potential degradation caused by auxiliary losses.
- Another significant improvement is DeepSeekMoE’s fine-grained expert segmentation. Unlike traditional architectures where each expert operates independently, DeepSeekMoE divides experts into smaller units and allows more flexible combinations of activated experts per token. This segmentation enhances diversity and specialization among experts while maintaining computational efficiency. Additionally, DeepSeekMoE introduces shared experts that capture common knowledge across tasks, reducing redundancy and improving overall efficiency.
- The sparse activation strategy in DeepSeekMoE is particularly noteworthy for its efficiency. In DeepSeek-V3, only 37 billion parameters out of a total 671 billion are activated per token. This sparse activation drastically reduces memory usage and computational costs compared to dense models or traditional MoE frameworks like GShard. By activating only the most relevant parameters for each token, DeepSeekMoE achieves high scalability while maintaining state-of-the-art performance across various benchmarks.
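The sketch below illustrates the auxiliary-loss-free balancing idea referenced in the list above, as I read it from the report: a per-expert bias shifts only the top-K selection (not the gating weights that scale expert outputs) and is nudged after each step so that overloaded experts become less likely to be picked. The fixed-step update and the function names here are my simplification, not DeepSeek's exact rule.

```python
import torch

def select_experts(scores, bias, k=2):
    """scores: (tokens, n_experts) affinities; bias: (n_experts,) routing-only bias."""
    top_idx = (scores + bias).topk(k, dim=-1).indices    # bias influences which experts are picked
    gates = torch.gather(scores, -1, top_idx)            # ...but gating weights stay unbiased
    return top_idx, gates

def update_bias(bias, top_idx, n_experts, gamma=1e-3):
    """After each step, push expert load back toward uniform without an auxiliary loss."""
    load = torch.bincount(top_idx.flatten(), minlength=n_experts).float()
    # Overloaded experts get their bias decreased, underloaded experts increased.
    return bias - gamma * torch.sign(load - load.mean())

# Toy usage: 32 tokens routed over 8 experts.
scores = torch.rand(32, 8)
bias = torch.zeros(8)
top_idx, gates = select_experts(scores, bias)
bias = update_bias(bias, top_idx, n_experts=8)
```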
Traditional architectures like GPT or LLaMA rely solely on a next-token prediction objective. While effective for many tasks, this approach has limitations: it focuses narrowly on immediate dependencies without explicitly modeling longer-term relationships; training signals are sparse, since only one token is predicted per input position; and representations at each position are highly localized and do not capture broader sequence-level meaning.
The Multi-Token Prediction (MTP) objective introduced in DeepSeek-V3 represents a significant departure from the traditional next-token prediction paradigm commonly used in large language models like GPT. Instead of predicting only the immediate next token, MTP trains the model to predict multiple future tokens at each position in the sequence. This approach is unique because it densifies training signals, improves data efficiency, and enables the model to better pre-plan its representations for future predictions. Below is a detailed explanation of why this method is novel and how it differs from standard approaches, followed by a small sketch of the objective.
- Enhanced Data Efficiency — By predicting multiple tokens at once, MTP densifies training signals. This means that for each input sequence, more predictions are made, leading to better utilization of training data. This contrasts with next-token prediction, where only one prediction per token is made, leaving much of the sequence’s potential predictive power untapped.
- Improved Representation Planning — MTP encourages the model to pre-plan its internal representations to account for longer-term dependencies. By requiring predictions for multiple future tokens simultaneously, MTP forces the model to encode richer contextual information at each position. This aligns more closely with how humans process language, as studies suggest that human cognition often anticipates multiple upcoming words when reading or listening.
- Broader Generalization — The ability to predict multiple tokens improves generalization across tasks that require reasoning over longer contexts or generating coherent sequences. For example, DeepSeek-V3 demonstrates superior performance on benchmarks like HumanEval (coding tasks) and GSM8K (math reasoning), where long-term planning and multi-step reasoning are critical.
- Speculative Decoding Potential — The MTP objective can also be repurposed during inference for speculative decoding, where predictions for multiple tokens are generated in parallel instead of sequentially. This can significantly accelerate inference by reducing latency.
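Here is the small sketch promised above: a cut-down multi-token prediction loss in which a few extra heads each predict the token d steps ahead, so every position contributes several training signals instead of one. Note that DeepSeek-V3's actual MTP chains small sequential modules that preserve the causal chain between prediction depths, so this parallel-heads version is only an illustration, and all names and sizes below are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden, targets, heads):
    """hidden: (batch, seq, d_model) final hidden states; targets: (batch, seq) token ids."""
    losses = []
    for depth, head in enumerate(heads, start=1):    # head number `depth` predicts token t+depth
        logits = head(hidden[:, :-depth])            # (batch, seq-depth, vocab)
        labels = targets[:, depth:]                  # targets shifted `depth` steps into the future
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1)))
    return torch.stack(losses).mean()                # densified signal: one loss term per depth

# Toy usage with made-up sizes: 2 prediction depths on top of a 64-dim trunk.
d_model, vocab = 64, 1000
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(2))
hidden, targets = torch.randn(2, 16, d_model), torch.randint(0, vocab, (2, 16))
print(mtp_loss(hidden, targets, heads))
```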
Try it out on https://chat.deepseek.com/ !
References:
- DeepSeek-V3 Technical Report (DeepSeek-V3/DeepSeek_V3.pdf) https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
- Efficient Post-training Quantization with FP8 Formats https://arxiv.org/abs/2309.14592
- Performance Improvements in Quantization Aware Training and Appreciation of Low Precision Computation in Deep Learning https://www.semanticscholar.org/paper/de439b972009b2920465c1c0e01b9405f74f8d89
- Balancing Speed and Stability: The Trade-offs of FP8 vs. BF16 Training in LLMs https://arxiv.org/abs/2411.087196
- Standalone 16-bit Training: Missing Study for Hardware-Limited Deep Learning Practitioners https://arxiv.org/abs/2305.10947
- Better Schedules for Low Precision Training of Deep Neural Networks https://arxiv.org/abs/2403.02243
- Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point https://arxiv.org/abs/2407.02610
- Efficient low-precision training for deep learning accelerators https://www.semanticscholar.org/paper/0351a9f6702a05cd34afc100f3488edc1ff4921a
- Future Token Prediction — Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction https://arxiv.org/abs/2410.181603
- RankingGPT: Empowering Large Language Models in Text Ranking with Progressive Enhancement https://www.semanticscholar.org/paper/5f90745ddcff92520fd87bc2181676c19b8e3dea
- Better & Faster Large Language Models via Multi-token Prediction https://arxiv.org/abs/2404.19737
- MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models https://arxiv.org/abs/2402.01620
- Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model https://arxiv.org/abs/2408.11039
- Understanding the Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation https://arxiv.org/abs/2402.032689
- Transformers Can Navigate Mazes With Multi-Step Prediction https://arxiv.org/abs/2412.05117
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.