
Taming the tail utilization of ads inference at Meta scale

🌈 Abstract

The article discusses the challenges and solutions involved in improving tail utilization in Meta's ads delivery system, which relies on sophisticated machine learning models. It covers the infrastructure requirements, system optimizations, and best practices implemented to rein in tail utilization and improve the overall performance and reliability of the ads inference service.

🙋 Q&A

[01] Inference Platforms and Infrastructure Requirements

1. What are the key infrastructure components required to serve Meta's sophisticated machine learning models for ads delivery?

  • The inference platforms require significant infrastructure capacity across CPUs, GPUs, storage, networking, and databases.

2. Why is improving tail utilization (the utilization level of the top 5% most-utilized servers) important for Meta's ads delivery system?

  • Improving tail utilization is imperative to operate the infrastructure efficiently and sustainably: the growing complexity and computational intensity of the models, along with strict latency and throughput requirements, place significant pressure on the serving system.

3. What were the key outcomes of the solutions implemented for the ads inference service?

  • The solutions implemented led to a 35% increase in work output without additional resources, a two-thirds decrease in timeout error rates, and a 50% reduction in tail latency at p99.

[02] Ads Inference Service Architecture and Load Balancing

1. How does the ads inference service architecture work?

  • Client requests are routed to the inference service to get predictions, and a single request typically fans out into multiple model inferences.
  • The inference service leverages Meta's infrastructure capabilities, such as ServiceRouter for service discovery and load balancing, and is set up as a sharded service where each model is a shard and multiple models are hosted on a single host (a simplified sketch of this fan-out follows below).
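
To make the fan-out concrete, here is a minimal, hypothetical sketch in Python. The shard map, model names, hosts, and function names are invented for illustration; in the real service, replica selection is handled by ServiceRouter rather than a random pick.

```python
import random

# Hypothetical shard map: each model is a shard, a shard can have replicas on
# several hosts, and each host serves several models at once.
SHARD_MAP = {
    "ctr_model": ["host-1", "host-3"],
    "cvr_model": ["host-2", "host-3"],
}

def run_inference(host, model, features):
    # Stand-in for the RPC made to the chosen inference host.
    return f"{model} scored on {host}"

def handle_request(models_needed, features):
    """One client request typically fans out into several model inferences,
    each routed to a replica of the shard that hosts the model."""
    results = {}
    for model in models_needed:
        host = random.choice(SHARD_MAP[model])  # stand-in for ServiceRouter's replica pick
        results[model] = run_inference(host, model, features)
    return results

print(handle_request(["ctr_model", "cvr_model"], features={"age": 30}))
```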

2. What are the two main approaches to load balancing in the ads inference service?

  • The first approach involves leveraging the service mesh (ServiceRouter) and its load balancing capabilities, such as the "power of two choices" algorithm, along with tuning various load balancing parameters (see the sketch after this list).
  • The second approach focuses on placement load balancing, where the Shard Manager's load balancing configurations are tuned to make the utilization distribution tighter across the fleet.
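
The "power of two choices" technique itself is simple: sample two replicas at random and send the request to the less loaded one. The following is a minimal illustrative sketch, not ServiceRouter's actual implementation; the load dictionary is an invented stand-in for whatever load counter the mesh reports.

```python
import random

def pick_replica(replicas, load):
    """Power of two choices: sample two candidate replicas uniformly at random
    and route the request to the one that currently reports lower load."""
    a, b = random.sample(replicas, 2)
    return a if load[a] <= load[b] else b

# Example: the less loaded of the two sampled replicas is returned.
replicas = ["r1", "r2", "r3", "r4"]
load = {"r1": 0.9, "r2": 0.2, "r3": 0.5, "r4": 0.7}
print(pick_replica(replicas, load))
```

Compared with always picking the globally least-loaded replica, sampling only two candidates avoids herding every request onto the same host when load reports lag behind reality.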

3. What were some of the key challenges and insights related to load balancing in the ads inference service?

  • Challenges included CPU spikes due to memory latency, imbalanced replica load caused by the host-level consolidated load counter, and the impact of snapshot transitions on utilization.
  • Insights included the importance of treating memory bandwidth as a resource during replica placement, the benefits of a per-model load counter, and the promise of "outstanding examples CPU" as a load counter (a toy version of such a counter is sketched below).
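
The sketch below shows one plausible reading of a per-model "outstanding examples CPU" counter: track, per model, the estimated CPU cost of the examples currently in flight rather than a single host-level aggregate. The class and method names are hypothetical and not Meta's API.

```python
from collections import defaultdict

class PerModelLoadCounter:
    """Tracks estimated CPU-seconds of in-flight examples, per model, so the
    load balancer can see each model's load instead of one host-wide number."""

    def __init__(self):
        self._outstanding_cpu = defaultdict(float)  # model_id -> CPU-seconds in flight

    def on_request_start(self, model_id, num_examples, cpu_per_example):
        self._outstanding_cpu[model_id] += num_examples * cpu_per_example

    def on_request_end(self, model_id, num_examples, cpu_per_example):
        self._outstanding_cpu[model_id] -= num_examples * cpu_per_example

    def load(self, model_id):
        return self._outstanding_cpu[model_id]

counter = PerModelLoadCounter()
counter.on_request_start("ctr_model", num_examples=64, cpu_per_example=0.002)
print(counter.load("ctr_model"))  # 0.128 CPU-seconds currently outstanding
```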

[03] Extending Load Balancing to Multiple Services

1. How did the team extend the load balancing optimizations to multiple services?

  • The team changed the load balancing calculation to use the compute capacity of the hosts instead of the number of hosts, which helped achieve a more balanced load across different hardware tiers.
  • They also added a utilization balancing feedback controller that adjusts traffic routing percentages to achieve balance between the different hardware tiers (a toy controller is sketched after this list).
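
A minimal sketch of how such a controller might work, under the assumption that it behaves like a simple proportional controller: seed the routing split from aggregate compute capacity, then periodically nudge traffic toward the cooler tier until utilizations converge. The function names, gain value, and control interval are all assumptions, not details from the article.

```python
def initial_weight(capacity_a, capacity_b):
    """Seed the routing split from aggregate compute capacity, not host counts."""
    return capacity_a / (capacity_a + capacity_b)

def adjust_weight(weight_a, util_a, util_b, gain=0.05):
    """Each control interval, shift routing weight toward the less utilized tier.

    weight_a: fraction of traffic currently sent to tier A (the rest goes to tier B)
    util_a, util_b: measured utilization of tiers A and B, in [0, 1]
    gain: how aggressively to correct the imbalance per interval
    """
    error = util_b - util_a                   # positive -> tier B is hotter than tier A
    new_weight_a = weight_a + gain * error    # send more traffic to the cooler tier
    return min(max(new_weight_a, 0.0), 1.0)   # clamp to a valid routing fraction

w = initial_weight(capacity_a=1200, capacity_b=800)  # 0.6 of traffic to tier A
w = adjust_weight(w, util_a=0.55, util_b=0.75)       # tier B is hotter, so w rises to 0.61
print(w)
```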

2. What other optimization did the team implement to address the challenges of reactive auto-scaling?

  • The team designed a simple predictive replica estimation system that forecasts future resource usage from current and past usage patterns, which yielded significant reductions in failure rate during peak periods; a minimal sketch of the idea follows.
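
The sketch below is one simple way to realize such a predictor, assuming a linear extrapolation of recent usage; the actual system's model and parameters are not described in that detail, so everything here (window, horizon, capacity figure) is illustrative.

```python
import math

def predict_replicas(usage_history, per_replica_capacity, horizon=3):
    """Extrapolate recent resource usage a few intervals ahead and size the
    replica count for the predicted demand, so scaling starts before load arrives."""
    current = usage_history[-1]
    # Average change per interval over the recorded window.
    slope = (usage_history[-1] - usage_history[0]) / max(len(usage_history) - 1, 1)
    predicted = max(current + slope * horizon, current)  # never scale below present demand
    return math.ceil(predicted / per_replica_capacity)

# Usage climbing toward a peak: provision for where it is heading, not where it is.
print(predict_replicas([400, 460, 520, 600], per_replica_capacity=50))  # 16 replicas
```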

[04] Applying Learnings to New System Architectures

1. How does the team plan to apply the learnings around tail utilization to new system architectures and platforms?

  • The team is actively working to apply these utilization optimizations to IPnext, Meta's next-generation unified platform for managing the entire lifecycle of machine learning model deployments, so that the benefits reach a broader range of expanding machine learning inference use cases at Meta.