Apple Intelligence Foundation Language Models
Abstract
The article introduces the foundation language models Apple developed to power its Apple Intelligence features: a roughly 3 billion parameter on-device model (AFM-on-device) and a larger server-based model (AFM-server). It covers model architecture, the training process, inference optimizations, and evaluation results, and highlights how Apple's Responsible AI principles are applied throughout model development.
Q&A
[01] Architecture
1. What are the key design choices for the AFM base models? The AFM base models are dense decoder-only models built on the Transformer architecture, with the following key design choices (a minimal sketch follows the list):
- A shared input/output embedding matrix to reduce memory usage
- Query/key normalization to improve training stability
- Grouped-query attention (GQA) with 8 key-value heads to reduce the KV-cache memory footprint
- The SwiGLU activation for higher efficiency
- RoPE positional embeddings with the base frequency set to 500k for long-context support
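To make the attention choices concrete, here is a minimal PyTorch sketch of grouped-query attention with query/key normalization. It is an illustration, not Apple's implementation: the class name is hypothetical, `nn.RMSNorm` (PyTorch 2.4+) is an assumed choice of normalization, and the dimensions are picked to be consistent with the 8 key/value heads described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    """Sketch of grouped-query attention (GQA) with query/key normalization."""

    def __init__(self, d_model=3072, n_q_heads=24, n_kv_heads=8, head_dim=128):
        super().__init__()
        self.n_q_heads, self.n_kv_heads, self.head_dim = n_q_heads, n_kv_heads, head_dim
        self.wq = nn.Linear(d_model, n_q_heads * head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_q_heads * head_dim, d_model, bias=False)
        # Query/key normalization: RMS-normalize q and k per head to keep
        # attention logits well-scaled and improve training stability.
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_norm(self.wq(x).view(B, T, self.n_q_heads, self.head_dim))
        k = self.k_norm(self.wk(x).view(B, T, self.n_kv_heads, self.head_dim))
        v = self.wv(x).view(B, T, self.n_kv_heads, self.head_dim)
        # Each group of query heads shares one key/value head, so the KV
        # cache only stores n_kv_heads (8) heads instead of n_q_heads (24).
        rep = self.n_q_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=2)
        v = v.repeat_interleave(rep, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, heads, T, hd)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(B, T, -1))
```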
2. What are the dimensions of the AFM-on-device model? The AFM-on-device model has 3 billion non-embedding parameters and 0.5 billion embedding parameters, with 24 query heads and 8 key/value heads.
[02] Pre-training
1. What are the key components of the AFM pre-training dataset? The AFM pre-training dataset consists of a diverse mixture of data, including:
- Web pages crawled by Applebot, with safety and quality filtering
- Licensed datasets from publishers
- Code data from open source repositories
- Math data from web pages and Q&A sites
- Selected public datasets
2. How is the pre-training process broken down into different stages? The AFM pre-training is broken down into three stages, sketched as a schedule after this list:
- Core pre-training on 6.3T tokens
- Continued pre-training at longer sequence length, with more focus on math and code data
- Context-lengthening pre-training, including synthetic long-context data
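As a rough picture of the staged recipe, the stages can be written as a schedule like the one below. Only the 6.3T core-stage token budget comes from the article; the remaining token counts and sequence lengths are illustrative placeholders, not the paper's reported numbers.

```python
# Hypothetical stage schedule; only the 6.3T core budget is stated above,
# the other token counts and sequence lengths are illustrative placeholders.
PRETRAIN_STAGES = [
    {"name": "core",             "tokens": 6.3e12, "seq_len": 4096,
     "notes": "full data mixture"},
    {"name": "continued",        "tokens": 1.0e12, "seq_len": 8192,
     "notes": "math and code data upweighted"},
    {"name": "context-lengthen", "tokens": 1.0e11, "seq_len": 32768,
     "notes": "includes synthetic long-context data"},
]
```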
3. What optimizer and training infrastructure are used for pre-training? The pre-training uses a variant of RMSProp with momentum as the optimizer, and is conducted on v4 and v5p Cloud TPU clusters using the AXLearn framework.
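Since the article names only "a variant of RMSProp with momentum" without giving the exact update rule, the following is a generic textbook formulation, a sketch only; the function name and hyperparameter values are placeholder assumptions.

```python
import numpy as np

def rmsprop_momentum(param, grad, v, m, lr=1e-2, alpha=0.95, mu=0.9, eps=1e-8):
    """One generic RMSProp-with-momentum step (not Apple's exact variant)."""
    v = alpha * v + (1 - alpha) * grad**2   # running second-moment estimate
    update = grad / (np.sqrt(v) + eps)      # gradient normalized by RMS scale
    m = mu * m + update                     # momentum accumulation
    return param - lr * m, v, m
```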
[03] Post-Training
1. What are the key components of the post-training process? The post-training process consists of two main stages:
- Supervised fine-tuning (SFT) on a mixture of human-annotated and synthetic data
- Reinforcement learning from human feedback (RLHF), using two novel algorithms: an iterative teaching committee (iTeC) for rejection sampling and preference-data collection, and mirror descent policy optimization with a leave-one-out advantage estimator (MDLOO)
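The "leave-one-out" in MDLOO refers to the baseline used for advantage estimation: each sampled response to a prompt is scored against the mean reward of the other samples for the same prompt. The sketch below shows only that estimator; the function name and shapes are assumptions, and the mirror-descent policy update itself is not shown.

```python
import numpy as np

def loo_advantages(rewards):
    """Leave-one-out advantages for k sampled responses to one prompt.

    rewards: array of shape (k,) holding the reward-model score of each sample.
    """
    k = len(rewards)
    total = rewards.sum()
    # Baseline for sample i is the mean reward of the other k-1 samples.
    baselines = (total - rewards) / (k - 1)
    return rewards - baselines

# Example: rewards [1.0, 0.0, 0.5, 0.5] -> advantages [0.67, -0.67, 0.0, 0.0]
```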
2. How is safety and alignment with Apple's Responsible AI principles addressed in post-training? Safety and alignment are incorporated throughout the post-training process, including:
- Curating adversarial data for SFT and RLHF
- Developing feature-specific safety policies and evaluation
- Conducting extensive red teaming to identify potential safety issues
[04] Powering Apple Intelligence features
1. How do the foundation models enable specialized features? The foundation models use a runtime-swappable adapter architecture: small task-specific LoRA-style adapters are fine-tuned and dynamically loaded to specialize the shared base model for features such as summarization.
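A minimal sketch of what a runtime-swappable low-rank adapter can look like; the class name, rank, and scaling here are illustrative assumptions rather than Apple's actual adapter code. Because only the small A and B matrices differ per feature, swapping them at runtime re-specializes the frozen backbone cheaply.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a swappable low-rank adapter."""

    def __init__(self, base: nn.Linear, rank=16, alpha=32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone weights stay frozen and shared
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Loading a different (A, B) pair specializes the same backbone for
        # a different feature, e.g. summarization vs. mail replies.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```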
2. What optimizations are applied to enable efficient on-device deployment? Techniques such as low-bit model quantization (averaging under 4 bits per weight), accuracy-recovery adapters trained to restore quality lost to quantization, and mixed-precision quantization reduce the memory footprint and inference cost of the on-device model while maintaining output quality.
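For intuition, here is a hedged sketch of symmetric per-group weight quantization. The group size, 4-bit width, and round-to-nearest scheme are simplifications; the article's mixed-precision approach assigns different bit-widths to different parts of the model.

```python
import numpy as np

def quantize_groups(w, bits=4, group=32):
    """Symmetric round-to-nearest quantization of a weight vector whose
    length is assumed divisible by `group` (an illustrative simplification)."""
    w = w.reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # per-group scale
    scale = np.maximum(scale, 1e-12)                     # avoid divide-by-zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruction is lossy; accuracy-recovery adapters are trained to
    # compensate for exactly this kind of quantization error.
    return q.astype(np.float32) * scale
```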
[05] Evaluation
1. How do the AFM models perform on general capability benchmarks? On benchmarks such as MMLU, GSM8K, and OpenBookQA, the AFM models perform competitively with, and often outperform, comparable open-source and commercial models; notably, the smaller AFM-on-device model outperforms several models with substantially more parameters.
2. How do the AFM models perform on safety evaluations? In safety-focused human evaluations, the AFM models show significantly lower violation rates than comparable models, and human raters prefer their responses for both safety and helpfulness.