Introducing Apple’s On-Device and Server Foundation Models
🌈 Abstract
The article discusses the development of Apple's foundation language models that power the new Apple Intelligence system, which is integrated into iOS 18, iPadOS 18, and macOS Sequoia. It covers the key aspects of model development, including:
- Pre-training: Using the open-source AXLearn framework to train the models on licensed and publicly available data, with filters to remove personally identifiable information and low-quality content.
- Post-training: Utilizing novel algorithms like rejection sampling fine-tuning and reinforcement learning from human feedback to improve the models' instruction-following quality.
- Optimization: Applying techniques like grouped-query attention, low-bit palettization, and activation/embedding quantization to optimize the models for speed and efficiency on-device and on Apple's private cloud.
- Model Adaptation: Using adapters to fine-tune the foundation models for specific user tasks and activities, allowing dynamic specialization.
- Performance Evaluation: Focusing on human evaluation to assess the models' capabilities, safety, and helpfulness across a range of tasks and prompts, and comparing them to open-source and commercial models.
The article also discusses Apple's principles for responsible AI development, which guide the design and deployment of these models.
🙋 Q&A
[01] Pre-Training
1. What framework does Apple use to train its foundation models? Apple uses the open-source AXLearn framework, which builds on JAX and XLA, to train its foundation models with high efficiency and scalability on various hardware and cloud platforms.
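The article doesn't show AXLearn's actual APIs. As a rough illustration of the JAX style it builds on, here is a minimal XLA-compiled training step; the model, loss, and shapes are invented for the example and are not AXLearn code:

```python
import jax
import jax.numpy as jnp

# Toy linear model and loss; AXLearn's real layer/trainer APIs differ.
def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@jax.jit  # XLA-compiles the step for the target hardware
def train_step(params, x, y, lr=1e-3):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss

key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (16, 4)), "b": jnp.zeros((4,))}
x, y = jax.random.normal(key, (32, 16)), jnp.zeros((32, 4))
params, loss = train_step(params, x, y)
```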
2. What data sources does Apple use to train the models? Apple trains the models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by their web crawler, Applebot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training.
3. How does Apple ensure privacy in the training data? Apple never uses users' private personal data or user interactions when training the foundation models. They also apply filters to remove personally identifiable information and profanity or other low-quality content from the training corpus.
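Apple doesn't publish its filtering pipeline. As a minimal sketch of the general technique, a corpus filter might drop documents matching simple PII patterns; the patterns and document list below are illustrative only:

```python
import re

# Illustrative patterns only; production PII filters are far more extensive.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),      # email addresses
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # US-style phone numbers
]

def keep_document(text: str) -> bool:
    """Return False if the document appears to contain PII."""
    return not any(p.search(text) for p in PII_PATTERNS)

docs = ["The sky is blue.", "Contact me at jane@example.com."]
clean = [d for d in docs if keep_document(d)]  # -> ["The sky is blue."]
```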
[02] Post-Training
1. What novel algorithms does Apple use in the post-training phase? Apple has developed two novel algorithms in the post-training phase: (1) a rejection sampling fine-tuning algorithm with a teacher committee, and (2) a reinforcement learning from human feedback (RLHF) algorithm with mirror descent policy optimization and a leave-one-out advantage estimator. These algorithms lead to significant improvements in the models' instruction-following quality.
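The article doesn't give formulas, but the leave-one-out advantage estimator is a standard construction: with K sampled responses per prompt, each sample's baseline is the mean reward of the other K−1 samples. A minimal sketch, with the reward values invented:

```python
import numpy as np

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    """A_i = r_i - mean(r_j for j != i), computed for all i at once."""
    k = rewards.shape[0]
    total = rewards.sum()
    baselines = (total - rewards) / (k - 1)  # mean of the other k-1 rewards
    return rewards - baselines

rewards = np.array([0.2, 0.9, 0.5, 0.4])  # rewards for K=4 samples of one prompt
print(leave_one_out_advantages(rewards))  # higher-reward samples get positive advantage
```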
2. Why is data quality essential to model success? Apple finds that data quality is essential to model success, so they utilize a hybrid data strategy in their training pipeline, incorporating both human-annotated and synthetic data, and conduct thorough data curation and filtering procedures.
[03] Optimization
1. What techniques does Apple use to optimize the models for speed and efficiency? Apple has applied an extensive set of optimizations for both first-token and extended-token inference performance, including:
- Grouped-query-attention
- Shared input and output vocab embedding tables
- Low-bit palettization (3.7 bits-per-weight on average; see the sketch after this list)
- Activation quantization and embedding quantization
- Efficient Key-Value (KV) cache update on neural engines
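Apple's exact palettization scheme isn't described beyond the average bit width. As a minimal sketch of the idea, weights can be clustered to a small shared palette and stored as palette indices; the 16-entry palette (4 bits per index) and k-means iteration count here are arbitrary choices for illustration:

```python
import numpy as np

def palettize(weights: np.ndarray, n_colors: int = 16, iters: int = 10):
    """Cluster weights to a small palette; store small indices instead of floats."""
    flat = weights.ravel()
    # Initialize the palette from evenly spaced quantiles of the weights.
    palette = np.quantile(flat, np.linspace(0, 1, n_colors))
    for _ in range(iters):  # plain 1-D k-means
        idx = np.abs(flat[:, None] - palette[None, :]).argmin(axis=1)
        for c in range(n_colors):
            members = flat[idx == c]
            if members.size:
                palette[c] = members.mean()
    idx = np.abs(flat[:, None] - palette[None, :]).argmin(axis=1)
    return palette, idx.astype(np.uint8).reshape(weights.shape)

w = np.random.randn(64, 64).astype(np.float32)
palette, indices = palettize(w)              # dequantize via palette[indices]
err = np.abs(palette[indices] - w).mean()    # mean reconstruction error
```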
2. What kind of performance do the optimized models achieve on-device? On the iPhone 15 Pro, the on-device model achieves a time-to-first-token latency of about 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second, before employing token speculation techniques.
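Those two figures are enough for a back-of-the-envelope latency estimate; the prompt and output lengths below are invented for illustration:

```python
TTFT_MS_PER_PROMPT_TOKEN = 0.6  # from the article
TOKENS_PER_SECOND = 30          # from the article

prompt_tokens, output_tokens = 1000, 150  # hypothetical request
ttft_s = prompt_tokens * TTFT_MS_PER_PROMPT_TOKEN / 1000
total_s = ttft_s + output_tokens / TOKENS_PER_SECOND
print(f"time to first token ~ {ttft_s:.1f}s, total ~ {total_s:.1f}s")
# time to first token ~ 0.6s, total ~ 5.6s
```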
[04] Model Adaptation
1. How do Apple's foundation models adapt to specific user tasks and activities? Apple uses adapters, small neural network modules that can be plugged into various layers of the pre-trained model, to fine-tune the models for specific tasks. By fine-tuning only the adapter layers, the original parameters of the base pre-trained model remain unchanged, preserving the general knowledge while tailoring the adapter layers to support specific tasks.
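The article doesn't show the adapter math, but the standard low-rank adapter formulation computes h = Wx + (α/r)·BAx, with W frozen and only the small A and B matrices trained. A minimal sketch, with dimensions, rank, and scaling invented:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, w: np.ndarray, rank: int = 8, alpha: float = 16.0):
        d_out, d_in = w.shape
        self.w = w                                   # frozen base weights
        self.a = np.random.randn(rank, d_in) * 0.01  # trainable
        self.b = np.zeros((d_out, rank))             # trainable, starts at zero
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w.T + self.scale * (x @ self.a.T @ self.b.T)

layer = LoRALinear(np.random.randn(64, 32))
y = layer(np.random.randn(4, 32))  # (4, 64); equals the base output until B is trained
```

Because B starts at zero, a freshly attached adapter leaves the base model's behavior unchanged, which is what lets the general knowledge be preserved.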
2. How does Apple manage the memory and responsiveness of the adapted models? The adapter models can be dynamically loaded, temporarily cached in memory, and swapped, allowing the foundation model to specialize itself on the fly for the task at hand while efficiently managing memory and guaranteeing the operating system's responsiveness.
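The article doesn't describe the loading machinery. One common pattern for "dynamically loaded, temporarily cached, and swapped" is a small LRU cache keyed by task; the task names, capacity, and stand-in loader below are hypothetical:

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most `capacity` adapters in memory, evicting the least-recently-used."""
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self._cache: OrderedDict[str, bytes] = OrderedDict()

    def get(self, task: str) -> bytes:
        if task in self._cache:
            self._cache.move_to_end(task)         # mark as recently used
        else:
            self._cache[task] = self._load(task)  # load from disk on a miss
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)   # evict the LRU adapter
        return self._cache[task]

    @staticmethod
    def _load(task: str) -> bytes:
        return f"adapter-weights-for-{task}".encode()  # stand-in for real I/O

cache = AdapterCache()
for task in ["summarize", "proofread", "summarize", "reply"]:
    cache.get(task)  # "proofread" is evicted when "reply" loads
```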
[05] Performance and Evaluation
1. What is Apple's focus when benchmarking the models? Apple's focus is on delivering generative models that can enable users to communicate, work, express themselves, and get things done across their Apple products. They focus on human evaluation, as they find these results are highly correlated to user experience in their products.
2. How does Apple evaluate the performance of the summarization adapter? Apple fine-tunes accuracy-recovery low-rank adaptation (LoRA) adapters on top of the palettized model to meet the specific requirements for summarizing emails, messages, and notifications. They use a set of 750 responses carefully sampled for each use case to evaluate the product-specific summarization performance.
3. How do Apple's models perform compared to open-source and commercial models? Apple's on-device model (∼3B parameters) outperforms larger open-source models, including Phi-3-mini, Mistral-7B, Gemma-7B, and Llama-3-8B, on a comprehensive evaluation set of real-world prompts. The server model compares favorably to DBRX-Instruct, Mixtral-8x22B, GPT-3.5, and Llama-3-70B while being highly efficient.
4. How do Apple's models perform on safety and adversarial prompts? Both the on-device and server models are robust when faced with adversarial prompts, achieving lower violation rates than open-source and commercial models. However, Apple acknowledges the limitation of their safety benchmark and is actively conducting manual and automatic red-teaming to continue evaluating the models' safety.