Apple Intelligence Foundation Language Models
Abstract
The article introduces the foundation language models Apple developed to power its Apple Intelligence features: a roughly 3 billion parameter on-device model (AFM-on-device) and a larger server-based model (AFM-server). It covers model architecture, the training process, inference optimizations, and evaluation results, and highlights how Apple's Responsible AI principles are applied throughout model development.
Q&A
[01] Architecture
1. What are the key design choices for the AFM base models? The AFM base models are dense decoder-only models built on the Transformer architecture, with the following key design choices (a minimal sketch follows the list):
- A shared input/output embedding matrix to reduce memory usage
- Query/key normalization to improve training stability
- Grouped-query attention (GQA) with 8 key-value heads to reduce the KV-cache memory footprint
- The SwiGLU activation for higher efficiency
- RoPE positional embeddings with the base frequency set to 500k for long-context support
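To make the attention choices concrete, here is a minimal PyTorch sketch of grouped-query attention with query/key normalization. It is an illustration, not Apple's implementation: the class name is hypothetical, `nn.RMSNorm` (PyTorch 2.4+) is an assumed choice of normalization, and the dimensions are picked to be consistent with the 8 key/value heads described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    """Sketch of grouped-query attention (GQA) with query/key normalization."""

    def __init__(self, d_model=3072, n_q_heads=24, n_kv_heads=8, head_dim=128):
        super().__init__()
        self.n_q_heads, self.n_kv_heads, self.head_dim = n_q_heads, n_kv_heads, head_dim
        self.wq = nn.Linear(d_model, n_q_heads * head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_q_heads * head_dim, d_model, bias=False)
        # Query/key normalization: RMS-normalize q and k per head to keep
        # attention logits well-scaled and improve training stability.
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_norm(self.wq(x).view(B, T, self.n_q_heads, self.head_dim))
        k = self.k_norm(self.wk(x).view(B, T, self.n_kv_heads, self.head_dim))
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
        # Each group of query heads shares one key/value head, so the KV
        # cache only stores n_kv_heads (8) heads instead of n_q_heads (24).
        rep = self.n_q_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=2)
        v = v.repeat_interleave(rep, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, heads, T, hd)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))
```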
2. What are the dimensions of the AFM-on-device model? The AFM-on-device model has 3 billion non-embedding parameters and 0.5 billion embedding parameters, with 24 query heads and 8 key/value heads.
[02] Pre-training
1. What are the key components of the AFM pre-training dataset? The AFM pre-training dataset consists of a diverse mixture of data, including:
- Web pages crawled by Applebot, with safety and quality filtering
- Licensed datasets from publishers
- Code data from open source repositories
- Math data from web pages and Q&A sites
- Selected public datasets
2. How is the pre-training process broken down into different stages? The AFM pre-training is broken down into three stages, sketched as a schedule after this list:
- Core pre-training on 6.3T tokens
- Continued pre-training at longer sequence length, with more focus on math and code data
- Context-lengthening pre-training, including synthetic long-context data
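As a rough picture of the staged recipe, the stages can be written as a schedule like the one below. Only the 6.3T core-stage token budget comes from the article; the remaining token counts and sequence lengths are illustrative placeholders, not the paper's reported numbers.

```python
# Hypothetical stage schedule; only the 6.3T core budget is stated above,
# the other token counts and sequence lengths are illustrative placeholders.
PRETRAIN_STAGES = [
    {"name": "core",             "tokens": 6.3e12, "seq_len": 4096,
     "notes": "full data mixture"},
    {"name": "continued",        "tokens": 1.0e12, "seq_len": 8192,
     "notes": "math and code data upweighted"},
    {"name": "context-lengthen", "tokens": 1.0e11, "seq_len": 32768,
     "notes": "includes synthetic long-context data"},
]
```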
3. What optimizer and training infrastructure are used for pre-training? The pre-training uses a variant of RMSProp with momentum as the optimizer, and is conducted on v4 and v5p Cloud TPU clusters using the AXLearn framework.
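Since the article names only "a variant of RMSProp with momentum" without giving the exact update rule, the following is a generic textbook formulation, a sketch only; the function name and hyperparameter values are placeholder assumptions.

```python
import numpy as np

def rmsprop_momentum(param, grad, v, m, lr=1e-2, alpha=0.95, mu=0.9, eps=1e-8):
    """One generic RMSProp-with-momentum step (not Apple's exact variant)."""
    v = alpha * v + (1 - alpha) * grad**2   # running second-moment estimate
    update = grad / (np.sqrt(v) + eps)      # gradient normalized by RMS scale
    m = mu * m + update                     # momentum accumulation
    return param - lr * m, v, m
```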
[03] Post-Training
1. What are the key components of the post-training process? The post-training process consists of two main stages:
- Supervised fine-tuning (SFT) on a mixture of human-annotated and synthetic data
- Reinforcement learning from human feedback (RLHF), using two novel algorithms: an iterative teaching committee (iTeC) for rejection sampling and preference-data collection, and mirror descent policy optimization with a leave-one-out advantage estimator (MDLOO)
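The "leave-one-out" in MDLOO refers to the baseline used for advantage estimation: each sampled response to a prompt is scored against the mean reward of the other samples for the same prompt. The sketch below shows only that estimator; the function name and shapes are assumptions, and the mirror-descent policy update itself is not shown.

```python
import numpy as np

def loo_advantages(rewards):
    """Leave-one-out advantages for k sampled responses to one prompt.

    rewards: array of shape (k,) holding the reward-model score of each sample.
    """
    k = len(rewards)
    total = rewards.sum()
    # Baseline for sample i is the mean reward of the other k-1 samples.
    baselines = (total - rewards) / (k - 1)
    return rewards - baselines

# Example: rewards [1.0, 0.0, 0.5, 0.5] -> advantages [0.67, -0.67, 0.0, 0.0]
```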
2. How is safety and alignment with Apple's Responsible AI principles addressed in post-training? Safety and alignment are incorporated throughout the post-training process, including:
- Curating adversarial data for SFT and RLHF
- Developing feature-specific safety policies and evaluation
- Conducting extensive red teaming to identify potential safety issues
[04] Powering Apple Intelligence features
1. How do the foundation models enable specialized features? The foundation models use a runtime-swappable adapter architecture: small task-specific LoRA-style adapters are fine-tuned and dynamically loaded to specialize the shared base model for features such as summarization.
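A minimal sketch of what a runtime-swappable low-rank adapter can look like; the class name, rank, and scaling here are illustrative assumptions rather than Apple's actual adapter code. Because only the small A and B matrices differ per feature, swapping them at runtime re-specializes the frozen backbone cheaply.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a swappable low-rank adapter."""

    def __init__(self, base: nn.Linear, rank=16, alpha=32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone weights stay frozen and shared
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Loading a different (A, B) pair specializes the same backbone for
        # a different feature, e.g. summarization vs. mail replies.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```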
2. What optimizations are applied to enable efficient on-device deployment? Techniques such as low-bit model quantization (averaging under 4 bits per weight), accuracy-recovery adapters trained to restore quality lost to quantization, and mixed-precision quantization reduce the memory footprint and inference cost of the on-device model while maintaining output quality.
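For intuition, here is a hedged sketch of symmetric per-group weight quantization. The group size, 4-bit width, and round-to-nearest scheme are simplifications; the article's mixed-precision approach assigns different bit-widths to different parts of the model.

```python
import numpy as np

def quantize_groups(w, bits=4, group=32):
    """Symmetric round-to-nearest quantization of a weight vector whose
    length is assumed divisible by `group` (an illustrative simplification)."""
    w = w.reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # per-group scale
    scale = np.maximum(scale, 1e-12)                     # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruction is lossy; accuracy-recovery adapters are trained to
    # compensate for exactly this kind of quantization error.
    return q.astype(np.float32) * scale
```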
[05] Evaluation
1. How do the AFM models perform on general capability benchmarks? On benchmarks such as MMLU, GSM8K, and OpenBookQA, the AFM models perform competitively with, and often outperform, comparable open-source and commercial models; notably, the smaller AFM-on-device model outperforms several models with substantially more parameters.
2. How do the AFM models perform on safety evaluations? In safety-focused human evaluations, the AFM models show significantly lower violation rates than comparable models, and human raters prefer their responses for both safety and helpfulness.