The Platonic Representation Hypothesis
Abstract
The article argues that representations in AI models, particularly deep networks, are converging. It surveys many examples from the literature where, over time and across multiple domains, the ways in which different neural networks represent data have become more aligned. The article demonstrates convergence across data modalities, showing that as vision models and language models grow larger, they measure distance between datapoints in increasingly similar ways. The authors hypothesize that this convergence is driving toward a shared statistical model of reality, which they term the "platonic representation". They discuss several possible selective pressures toward this representation and the implications of these trends.
Q&A
[01] Representations are converging
1. What evidence is provided that representations in different neural network models are converging?
- Studies have shown that vision models trained on different datasets (e.g., ImageNet and Places365) can be aligned by learning a simple stitching layer, indicating that their representations are compatible.
- Larger models exhibit greater alignment with each other than smaller models do, suggesting that alignment increases with scale.
- Models with high transfer performance form a tightly clustered set of representations, while weaker-performing models have more variable representations.
- Representations are converging across modalities as well: language models can be aligned with vision models through simple linear projections.
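A minimal sketch of how this kind of alignment can be quantified. Work in this literature scores two models as aligned when the same inputs have the same nearest neighbors in both representation spaces; the function name, the cosine-similarity choice, and the value of k below are illustrative assumptions, not the exact metric from any particular study.

```python
import numpy as np

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Average fraction of shared k-nearest neighbors between two
    representation spaces. feats_a and feats_b hold features for the
    SAME n inputs, shapes (n, d_a) and (n, d_b); d_a and d_b may
    differ. A higher score means the two kernels rank neighbors
    more similarly."""
    def knn_indices(feats):
        # Cosine-similarity nearest neighbors, excluding each point itself.
        x = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)   # self is never a neighbor
        return np.argsort(-sim, axis=1)[:, :k]

    nn_a, nn_b = knn_indices(feats_a), knn_indices(feats_b)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))
```

Because the score depends only on each space's own neighborhood structure, it can compare models with different dimensionalities, architectures, or even modalities, which is what the cross-model comparisons above require.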
2. How does the alignment between models increase as they become larger and more capable? The article suggests that as models scale up in size and performance, the set of solutions that satisfy all the data constraints becomes relatively small. This is because each training datapoint and objective places an additional constraint on the model, and as data and tasks scale, the volume of representations that satisfy these constraints must proportionately grow smaller. This leads to convergence toward a shared representation.
3. What is the evidence that representations are converging across modalities like vision and language? The article cites several works showing that vision and language models can be stitched together using simple linear projections, indicating alignment of their representations. It also discusses how language models trained only on text can exhibit rich knowledge of visual structures, and how joint training of vision and language models can improve performance on language tasks.
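The "simple linear projection" stitching described above can be sketched as an ordinary least-squares fit between paired features. Everything below is a synthetic stand-in: the variable names and the random features merely illustrate the shape of the procedure, in which real experiments would use embeddings from actual language and vision models.

```python
import numpy as np

def fit_linear_stitch(src_feats, tgt_feats):
    """Fit a linear map W minimizing ||src_feats @ W - tgt_feats||^2,
    a stand-in for the stitching layer connecting two models."""
    W, *_ = np.linalg.lstsq(src_feats, tgt_feats, rcond=None)
    return W

# Synthetic paired features for the same 100 inputs (hypothetical
# text embeddings and noisy linearly-related "vision" embeddings).
rng = np.random.default_rng(0)
lang = rng.normal(size=(100, 32))
vis = lang @ rng.normal(size=(32, 16)) + 0.01 * rng.normal(size=(100, 16))

W = fit_linear_stitch(lang[:80], vis[:80])      # fit on a train split
err = np.linalg.norm(lang[80:] @ W - vis[80:])  # held-out stitching error
```

The point of the sketch is that if such a low-capacity map suffices to translate one model's features into another's, the two representations must already encode largely the same structure.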
[02] Why are representations converging?
1. What are the potential factors driving the convergence of representations? The article discusses three potential factors:

Convergence via the training data and objectives: As the amount of training data and number of tasks scale, the set of solutions that satisfy all the constraints becomes smaller, leading to convergence.

Convergence via model capacity: Larger models are better able to find the globally optimal representation, even if trained with different architectures and objectives.

Convergence via simplicity bias: Deep networks naturally adhere to Occam's razor and prefer simpler representations, which can drive convergence.
2. How does the simplicity bias in deep networks contribute to representational convergence? The article suggests that even in the absence of explicit regularization, deep networks naturally prefer simpler representations that fit the data. This simplicity bias can drive different models toward similar, less complex solutions, leading to convergence.
[03] What representation are we converging to?
1. What is the authors' hypothesis about the "platonic representation" that models are converging to? The authors hypothesize that models are converging to a statistical model of the underlying reality that generates the observed data. They formalize this as a representation that captures the pointwise mutual information between co-occurring observations, which can be shown to exactly represent the true underlying distribution under certain assumptions.
2. How does the contrastive learning objective relate to the platonic representation? The authors show that certain contrastive learning objectives, such as those based on noise contrastive estimation, are minimized by a representation whose kernel is proportional to the pointwise mutual information. This suggests that contrastive learners are recovering the platonic representation.
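The result being summarized can be sketched as follows; the exact notation is an assumption about the paper's conventions. Writing $p(x_a, x_b)$ for the probability that observations $x_a$ and $x_b$ co-occur, the pointwise mutual information kernel is

```latex
K_{\mathrm{PMI}}(x_a, x_b) \;=\; \log \frac{p(x_a, x_b)}{p(x_a)\,p(x_b)}
```

and the claim is that a representation $f$ minimizing a noise-contrastive objective satisfies, up to an additive constant $c$,

```latex
\big\langle f(x_a),\, f(x_b) \big\rangle \;=\; K_{\mathrm{PMI}}(x_a, x_b) + c ,
```

so the inner-product kernel of the learned features recovers the PMI structure of the data, which is what licenses calling it a model of the underlying distribution.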
[04] Implications of convergence
1. What are some of the key implications of representational convergence discussed in the article?
- Scaling is sufficient but not necessarily efficient for reaching high levels of intelligence; different methods can scale with different levels of efficiency.
- Training data can be shared across modalities, since the platonic representation should be useful to both vision and language models.
- Translation and adaptation across modalities become easier, since aligned representations should admit simple mappings between them.
- Hallucination and bias may be reduced as models converge to a more accurate model of reality.
2. How does the convergence of representations impact the transferability and adaptability of models across modalities? The article suggests that when representations are aligned across modalities, translating from one to the other should be a simple function that is easy to obtain. This could underlie the success of techniques like unpaired translation, where models find mappings between domains without paired data. The authors also note that language models may achieve some notion of grounding in the visual domain even without explicit cross-modal training.