DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Abstract
The article introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with an additional 6 trillion tokens, substantially enhancing its coding and mathematical reasoning capabilities while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K.
Q&A
[01] Introduction
1. What are the key contributions of the DeepSeek-Coder-V2 series?
- DeepSeek-Coder-V2 is built upon the DeepSeekMoE framework and comes in 16B and 236B parameter versions with only 2.4B and 21B active parameters respectively, efficiently supporting diverse computational and application needs (a minimal loading sketch follows this list).
- DeepSeek-Coder-V2 supports 338 programming languages and a maximum context length of 128K tokens.
- DeepSeek-Coder-V2 is the first attempt to develop an open-source hundred-billion-parameter code model, outperforming state-of-the-art closed-source models in both coding and mathematics tasks.
- DeepSeek-Coder-V2 models are released publicly under a permissive license, allowing for both research and unrestricted commercial use.
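A minimal loading sketch, assuming the released checkpoints follow the usual deepseek-ai Hugging Face naming (here `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct`) and ship custom MoE modeling code behind `trust_remote_code`; the repo name, dtype, and generation settings are illustrative rather than taken from the article:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo name for the 16B (Lite) variant; ~2.4B parameters are active per token.
model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,   # the MoE architecture uses custom modeling code
)

prompt = "# Return the n-th Fibonacci number\ndef fib(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```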
2. How does the performance of DeepSeek-Coder-V2 compare to other open-source and closed-source models?
- In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.
- DeepSeek-Coder-V2 achieves a 90.2% score on HumanEval, a 76.2% score on MBPP, and a 43.4% score on LiveCodeBench, outperforming other open-source models.
- On the MATH benchmark, DeepSeek-Coder-V2 attains an accuracy of 75.7%, nearly matching the state-of-the-art accuracy of 76.6% achieved by GPT-4o.
[02] Data Collection
1. How was the code corpus for DeepSeek-Coder-V2 collected and filtered?
- The code corpus was collected from public repositories on GitHub, applying the same rule-based filtering and near-deduplication used for DeepSeek-Coder to remove low-quality and duplicated source code (a simplified deduplication sketch follows this list).
- The final code corpus consists of 1,170B code-related tokens sourced from GitHub and CommonCrawl, covering 338 programming languages.
- Ablation studies show that the new code corpus is superior to the one used to train DeepSeek-Coder, leading to improvements of 6.7% and 9.4% in accuracy on the HumanEval and MBPP benchmarks, respectively.
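The article does not spell out the near-deduplication algorithm, so the sketch below only illustrates the general idea: Jaccard similarity over token 5-gram shingles, dropping later files that closely match an earlier one. The shingle size and the 0.7 threshold are assumptions, and a production pipeline would typically use MinHash/LSH rather than quadratic pairwise comparison:

```python
from itertools import combinations

def shingles(code: str, n: int = 5) -> set:
    """Overlapping n-grams of lowercased whitespace tokens."""
    toks = code.lower().split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def near_dedup(files: dict, threshold: float = 0.7) -> set:
    """Return file names to keep; later near-duplicates of kept files are dropped."""
    sigs = {name: shingles(code) for name, code in files.items()}
    dropped = set()
    for (n1, s1), (n2, s2) in combinations(sigs.items(), 2):
        if n1 in dropped or n2 in dropped:
            continue
        if jaccard(s1, s2) >= threshold:
            dropped.add(n2)  # keep the first occurrence, drop the near-duplicate
    return set(files) - dropped

corpus = {
    "a.py": "def add(x, y):\n    '''Add two numbers.'''\n    result = x + y\n    return result\n",
    "b.py": "def add(x, y):\n    '''Add two numbers.'''\n    result = x + y\n    return result\n# trailing note\n",
    "c.py": "def mul(x, y):\n    return x * y\n",
}
print(near_dedup(corpus))  # b.py is a near-duplicate of a.py and gets dropped
```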
2. How was the math corpus for DeepSeek-Coder-V2 collected?
- The math corpus was collected from CommonCrawl using the same pipeline as DeepSeekMath, sourcing 221B math-related tokens (an illustrative recall sketch follows this list).
- This math corpus is approximately double the size of the 120B DeepSeekMath corpus.
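The DeepSeekMath recall pipeline is referenced but not described in the article; the sketch below shows one illustrative round of classifier-based retrieval with the `fasttext` Python package, where the file names (`seed_math_pages.txt`, `random_cc_pages.txt`) and the 0.5 score threshold are hypothetical placeholders:

```python
import fasttext

# Hypothetical inputs: known math pages as positives, random CommonCrawl pages as negatives.
SEED_POSITIVES = "seed_math_pages.txt"
RANDOM_NEGATIVES = "random_cc_pages.txt"

def build_training_file(pos_path: str, neg_path: str, out_path: str) -> None:
    """Write a fastText supervised training file with __label__math / __label__other lines."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path, label in [(pos_path, "__label__math"), (neg_path, "__label__other")]:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    text = " ".join(line.split())  # fastText expects one example per line
                    if text:
                        out.write(f"{label} {text}\n")

def is_math_page(model, page_text: str, threshold: float = 0.5) -> bool:
    labels, probs = model.predict(" ".join(page_text.split()))
    return labels[0] == "__label__math" and probs[0] >= threshold

build_training_file(SEED_POSITIVES, RANDOM_NEGATIVES, "fasttext_train.txt")
model = fasttext.train_supervised(input="fasttext_train.txt", epoch=5, wordNgrams=2)
# Pages that score as math would be added to the positives and the classifier
# retrained for another recall round, following the DeepSeekMath recipe.
```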
[03] Training Policy
1. What are the key training objectives and strategies used for DeepSeek-Coder-V2?
- DeepSeek-Coder-V2 16B uses both Next-Token-Prediction and Fill-In-the-Middle (FIM) training objectives, while DeepSeek-Coder-V2 236B uses only the Next-Token-Prediction objective.
- FIM is applied at a rate of 0.5 in PSM mode: each selected document is split into three parts and reassembled in the order Prefix, Suffix, Middle, so the model learns to reconstruct the middle span from its surrounding context (see the sketch after this list).
- The model architecture aligns with DeepSeek-V2; the hyperparameter settings of the 16B and 236B models correspond to those of DeepSeek-V2-Lite and DeepSeek-V2, respectively.
- The training uses the AdamW optimizer with a cosine decay learning rate schedule.
- DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2, resulting in a total exposure of 10.2T high-quality tokens.
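A minimal sketch of how a PSM-ordered FIM example can be assembled from a raw document. The sentinel strings (`<fim_begin>`, `<fim_hole>`, `<fim_end>`) and the character-level split are illustrative placeholders rather than the tokenizer's actual special tokens; only the 0.5 rate and the Prefix-Suffix-Middle ordering come from the article:

```python
import random

# Placeholder sentinels; the real tokenizer defines its own FIM special tokens.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def make_fim_example(doc: str, fim_rate: float = 0.5, rng=random.Random(0)) -> str:
    """With probability fim_rate, rewrite a document into PSM order:
    the model reads prefix and suffix first, then predicts the middle span."""
    if rng.random() >= fim_rate:
        return doc  # plain next-token-prediction example
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))  # two cut points
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

doc = "def square(x):\n    return x * x\n"
print(make_fim_example(doc))
```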
2. How was the context length extended for DeepSeek-Coder-V2?
- The context length of DeepSeek-Coder-V2 is extended to 128K tokens using YaRN, following the approach used in DeepSeek-V2 (a frequency-scaling sketch follows this list).
- The model is further trained in two stages, first with a sequence length of 32K and then with a sequence length of 128K, to enhance its capability for handling long contexts.
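A minimal sketch of YaRN-style RoPE inverse-frequency interpolation: dimensions that rotate many times within the original window keep their frequencies, slowly rotating dimensions are interpolated by the scale factor, and a linear ramp blends the two regimes. The base, scale, ramp bounds (`beta_fast`, `beta_slow`), and original context length below are generic defaults for illustration, not DeepSeek-Coder-V2's actual configuration:

```python
import math
import numpy as np

def yarn_inv_freq(dim: int, base: float = 10000.0, scale: float = 4.0,
                  orig_ctx: int = 32768, beta_fast: float = 32.0, beta_slow: float = 1.0):
    """YaRN-style adjustment of RoPE inverse frequencies for context extension."""
    half = dim // 2
    freq_orig = 1.0 / (base ** (np.arange(half) * 2.0 / dim))  # standard RoPE frequencies
    freq_interp = freq_orig / scale                            # position-interpolated frequencies

    def correction_dim(num_rotations: float) -> float:
        # dimension index whose wavelength completes `num_rotations` turns over orig_ctx
        return dim * math.log(orig_ctx / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), half - 1)
    ramp = np.clip((np.arange(half) - low) / max(high - low, 1e-3), 0.0, 1.0)

    # ramp == 0: keep the original (high-frequency) dimension; ramp == 1: fully interpolate
    inv_freq = freq_orig * (1.0 - ramp) + freq_interp * ramp
    mscale = 0.1 * math.log(scale) + 1.0  # YaRN's mild attention-temperature adjustment
    return inv_freq, mscale

inv_freq, mscale = yarn_inv_freq(dim=128)
print(inv_freq[:4], mscale)
```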
[04] Experimental Results
1. How does DeepSeek-Coder-V2 perform on code generation benchmarks?
- On the HumanEval benchmark, DeepSeek-Coder-V2 achieves a pass@1 score of 90.2%, outperforming all open-source models and performing on par with leading closed-source models (see the estimator sketch after this list).
- On the MBPP benchmark, DeepSeek-Coder-V2 achieves a score of 76.2%, establishing a new state-of-the-art result.
- On the LiveCodeBench benchmark, DeepSeek-Coder-V2 achieves a score of 43.4%, surpassing all other models.
- DeepSeek-Coder-V2 is the first open-source model to surpass a score of 10% on the SWE-bench benchmark.
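For context on how such scores are computed: HumanEval-style benchmarks report pass@k, the probability that at least one of k sampled completions passes the unit tests. The standard unbiased estimator from the original HumanEval evaluation is sketched below; the per-problem sample counts are made up for illustration and are not the paper's setup:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of which pass all unit tests.
    pass@k = 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Toy illustration: (n, c) pairs for four problems -- made-up numbers.
results = [(20, 14), (20, 0), (20, 20), (20, 3)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1%}")
```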
2. How does DeepSeek-Coder-V2 perform on mathematical reasoning benchmarks?
- On the MATH benchmark, DeepSeek-Coder-V2 achieves an accuracy of 75.7%, nearly matching the state-of-the-art accuracy of 76.6% achieved by GPT-4o.
- On the AIME 2024 competition, DeepSeek-Coder-V2 surpasses the performance of closed-source models like GPT-4o, Gemini 1.5 Pro, and Claude 3 Opus.
- DeepSeek-Coder-V2 also exhibits strong performance on the GSM8K, AIME, and Math Odyssey benchmarks, rivaling top closed-source models.
3. How does DeepSeek-Coder-V2 perform on general natural language tasks?
- DeepSeek-Coder-V2 maintains comparable general language performance to DeepSeek-V2, achieving 79.2% on the MMLU benchmark and scores comparable to GPT-4 on subjective evaluation tasks like Arena-Hard, MT-Bench, and AlignBench.
- The 236B DeepSeek-Coder-V2 model exhibits greater strength in reasoning benchmarks, particularly in Arena-Hard, while the DeepSeek-V2 Chat model demonstrates slightly better results in some general-purpose benchmarks.