Sapiens: Foundation for Human Vision Models
Abstract
The paper presents Sapiens, a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. The models are pretrained on a large-scale dataset of over 300 million in-the-wild human images, which significantly boosts their performance on these tasks compared to existing methods. The resulting models exhibit strong generalization to real-world data, even with limited labeled data or synthetic supervision. The paper also introduces new, more detailed annotations for pose estimation and body-part segmentation to better capture the nuances of human appearance and movement.
Q&A
[01] Pretraining at Scale
1. What are the key insights from prior work that the authors leverage for their approach? The authors build on two lessons from prior work: large datasets combined with scalable model architectures are key to generalization, and the pretrain-then-finetune paradigm allows a general pretrained model to be adapted to specific tasks with minimal adjustments.
2. What is the critical question the authors aim to address regarding pretraining data? Given computational limits, the authors investigate whether it is more effective to pretrain on a large number of general images or a curated set of human-specific images for downstream human-centric tasks.
3. How does the authors' pretraining dataset, Humans-300M, differ from prior large-scale datasets? Humans-300M is a proprietary dataset of roughly 300 million in-the-wild human images used to pretrain the Sapiens models. Unlike the general-purpose image collections used in prior work, it focuses exclusively on humans.
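To make the idea of a curated, human-specific pretraining corpus concrete, the sketch below shows one simple way such filtering could be done with an off-the-shelf person detector. The detect_persons callable and the 0.9 confidence threshold are assumptions for illustration, not details reported in the paper.

```python
def curate_human_images(image_paths, detect_persons, min_confidence=0.9):
    """Sketch of curating a human-centric pretraining set from a general corpus.

    `detect_persons` is a hypothetical callable wrapping an off-the-shelf person
    detector; for one image path it returns a list of (box, score) pairs. The
    0.9 threshold is an illustrative choice, not the paper's documented setting.
    """
    kept = []
    for path in image_paths:
        detections = detect_persons(path)
        # Keep an image only if at least one person is detected with high confidence.
        if any(score >= min_confidence for _, score in detections):
            kept.append(path)
    return kept
```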
[02] Model Architecture and Training
1. What pretraining approach do the authors use, and why? The authors use the masked autoencoder (MAE) approach for pretraining because it needs only a single forward pass per image, unlike contrastive or multi-inference strategies, which lets them process a larger volume of images with the same computational resources (see the masking sketch below).
2. How do the authors' models differ from prior work in terms of input resolution and model scale? The models are pretrained at a native resolution of 1K pixels, a significant increase over the 224-256 pixels typical in prior work, and are scaled up to 2 billion parameters, larger than most commonly used vision backbones.
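The following is a minimal PyTorch sketch of the MAE-style random masking described above, applied at 1K input resolution. The patch size of 16, the 75% mask ratio, and the embedding width are standard MAE-style assumptions used here for illustration; the paper's exact settings may differ.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """MAE-style random masking: keep a random subset of patch tokens.

    patch_tokens: (B, N, D) tensor of embedded patches.
    Returns the visible tokens, the binary mask (1 = masked), and the indices
    needed to restore the original patch order for the decoder.
    """
    B, N, D = patch_tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))

    # Random permutation per sample; the first n_keep indices are "visible".
    noise = torch.rand(B, N, device=patch_tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask over all patches: 0 = kept/visible, 1 = masked (to be reconstructed).
    mask = torch.ones(B, N, device=patch_tokens.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore

# At 1K resolution with 16x16 patches (an assumption consistent with the paper's
# "native 1K" description), a 1024x1024 image yields 64 * 64 = 4096 patches.
B, D = 2, 1024                       # batch size and embedding width are illustrative
tokens = torch.randn(B, 64 * 64, D)
visible, mask, ids_restore = random_masking(tokens, mask_ratio=0.75)
print(visible.shape)                 # (2, 1024, 1024): only 25% of patches are encoded
```

Because only the visible quarter of the tokens passes through the encoder, each pretraining step is far cheaper than a dense forward pass, which is what makes single-pass MAE attractive at this scale.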
[03] Human-Centric Vision Tasks
1. What are the four key human-centric vision tasks the authors focus on? The four tasks are: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.
2. How do the authors' annotations for pose estimation and body-part segmentation differ from prior datasets? For pose estimation, the authors introduce a more comprehensive set of keypoints, including the body, hands, feet, surface, and face. For body-part segmentation, they expand the class vocabulary to 100 classes, covering more detailed parts such as hair, tongue, teeth, and upper/lower lip.
3. How do the authors leverage synthetic data for depth and normal estimation? The authors render synthetic images, together with ground-truth depth maps and surface normals, from high-resolution photogrammetry scans of humans in the RenderPeople dataset, and use them to train their depth and normal estimation models.
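As a rough illustration of how the two synthetic targets relate geometrically, the sketch below derives approximate surface normals from a depth map by treating it as a height field. This is not the authors' pipeline (which renders ground-truth normals directly from the scans); it is only a minimal sketch of the depth-normal relationship, ignoring camera intrinsics.

```python
import numpy as np

def normals_from_depth(depth):
    """Crude pixel-space approximation of surface normals from a depth map.

    Treats depth as a height field z(x, y): the unnormalized normal is
    (-dz/dx, -dz/dy, 1). A real pipeline would account for camera intrinsics;
    this sketch only illustrates the geometric link between the two targets.
    """
    dz_dy, dz_dx = np.gradient(depth)               # finite differences along rows, cols
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)  # unit-length normals
    return n

# Illustrative usage: a synthetic planar ramp sloping along x.
depth = np.tile(np.linspace(1.0, 2.0, 256, dtype=np.float32), (256, 1))
normals = normals_from_depth(depth)
print(normals.shape)  # (256, 256, 3)
```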
[04] Experimental Results
1. What are the key findings from the authors' benchmarking across the four tasks? The authors' Sapiens models consistently outperform existing state-of-the-art methods across the 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction tasks. They attribute this to the combination of large-scale pretraining on human-specific data and the use of high-quality or synthetic annotations.
2. How do the authors' models demonstrate generalization to in-the-wild settings? The authors show that their models, despite being fine-tuned on studio captures or synthetic data, generalize robustly to real-world, unconstrained environments.
3. What are the limitations of the authors' models, and how do they envision future improvements? The authors note that their models still struggle with complex or rare poses, heavy occlusion, and crowded scenes. They suggest that a detect-and-crop strategy and human-in-the-loop supervision could help address these limitations in future work.
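The detect-and-crop strategy mentioned above can be sketched as follows: run a person detector, pad and crop each box, and apply a single-person model to every crop. Both callables (detect_people, run_sapiens_task) are hypothetical placeholders for illustration, not APIs from the paper or its code release.

```python
def detect_and_crop_inference(image, detect_people, run_sapiens_task, pad=0.1):
    """Sketch of a detect-and-crop pipeline for crowded scenes.

    image: (H, W, 3) array. `detect_people` is a hypothetical person detector
    returning boxes as (x0, y0, x1, y1); `run_sapiens_task` runs a single-person
    model (e.g. pose or segmentation) on one cropped image.
    """
    H, W = image.shape[:2]
    results = []
    for (x0, y0, x1, y1) in detect_people(image):
        # Pad each box so limbs near the boundary stay inside the crop.
        dw, dh = pad * (x1 - x0), pad * (y1 - y0)
        x0, y0 = max(0, int(x0 - dw)), max(0, int(y0 - dh))
        x1, y1 = min(W, int(x1 + dw)), min(H, int(y1 + dh))
        crop = image[y0:y1, x0:x1]
        # Per-person prediction in crop coordinates; map back using the (x0, y0) offset.
        results.append({"box": (x0, y0, x1, y1), "prediction": run_sapiens_task(crop)})
    return results
```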