Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads
Meta continues to lead the industry in utilizing groundbreaking AI Recommendation Systems (RecSys) to deliver better experiences for people, and better results for advertisers. To reach the next frontier of performance, we are scaling Meta's Ads Recommender runtime models to LLM-scale & complexity to further a deeper understanding of people's interests and intent. This increase [...] The post Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads appeared first on Engineering at Meta.

Meta has been at the forefront of using advanced AI Recommendation Systems (RecSys) to enhance user experiences and improve advertising outcomes. To push performance further, Meta is scaling its Ads Recommender runtime models to LLM-scale complexity. This increase in scale and complexity presents a significant challenge known as the "inference trilemma": balancing greater model complexity, and the compute and memory it demands, against the low latency and cost efficiency required for a global service that serves billions of people.
To address this challenge, Meta has developed the Meta Adaptive Ranking Model, which effectively bends the inference scaling curve with high return on investment (ROI) and industry-leading efficiency. The Adaptive Ranking Model replaces the traditional "one-size-fits-all" inference approach with intelligent request routing. By dynamically aligning model complexity with a rich understanding of a person's context and intent, the system ensures that every request is served by the most effective and efficient model. This approach allows Meta Ads to maintain the strict, sub-second latency that the platform relies on while providing a high-quality experience for every user.
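The routing idea described above can be illustrated with a small sketch. The source does not describe how routing is actually implemented, so everything here is an assumption made for illustration: the tier names, the latency and cost figures, the `context_richness` score, and the `route_request` function are all hypothetical. The sketch only shows the general shape of picking the most capable model that a request's context justifies and that still fits a latency budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    est_latency_ms: float  # assumed per-request latency estimate
    relative_cost: float   # assumed relative compute cost


# Hypothetical tiers, ordered from cheapest to most capable.
TIERS = [
    ModelTier("compact", est_latency_ms=20.0, relative_cost=1.0),
    ModelTier("standard", est_latency_ms=80.0, relative_cost=4.0),
    ModelTier("llm_scale", est_latency_ms=400.0, relative_cost=20.0),
]


def route_request(context_richness: float, latency_budget_ms: float) -> ModelTier:
    """Pick the most capable tier the context justifies and the budget allows.

    context_richness is a hypothetical score in [0, 1] summarizing how much
    signal is available about the person's context and intent.
    """
    if not 0.0 <= context_richness <= 1.0:
        raise ValueError("context_richness must be in [0, 1]")

    # Map the richness score to the highest tier index worth considering.
    max_index = min(int(context_richness * len(TIERS)), len(TIERS) - 1)

    # Walk down from the most capable candidate until one fits the budget.
    for tier in reversed(TIERS[: max_index + 1]):
        if tier.est_latency_ms <= latency_budget_ms:
            return tier

    # Nothing fits: fall back to the cheapest tier rather than fail the request.
    return TIERS[0]
```

With a sub-second budget of 1000 ms, a request with sparse context stays on the cheapest tier while a rich-context request reaches the largest one; tightening the budget pushes the same rich-context request down a tier. A production router would presumably use learned signals rather than a single score, which this sketch does not attempt to model.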
Serving LLM-scale models at Meta's scale required a fundamental rethinking of the inference stack. Three key innovations have driven this transformation:
1. **Inference-Efficient Model Scaling**: By shifting to a request-centric architecture, the Adaptive Ranking Model serves a model of LLM scale and complexity at sub-second latency. This enables a more sophisticated understanding of a person's interests and intent without compromising the user experience.
2. **Model/System Co-Design**: By developing hardware-aware model architectures that align model design with the capabilities and limitations of the underlying hardware and silicon, the Adaptive Ranking Model significantly improves hardware utilization in heterogeneous hardware environments.
3. **Reimagining the Inference Pipeline**: The inference pipeline itself is redesigned for performance and efficiency, with advancements in model quantization, parallelization, and runtime optimization techniques.
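Of the techniques named in the list above, quantization is the easiest to show in a few lines. The source does not say which quantization scheme is used, so the following is a generic sketch of symmetric per-tensor int8 quantization, chosen only as a common textbook example. The function names `quantize_int8` and `dequantize_int8` are hypothetical, and plain Python lists stand in for real tensors.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative, not Meta's scheme).

    Returns the quantized integers and the scale needed to recover
    approximate float values.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; use scale 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from int8 codes and their scale."""
    return [q * scale for q in quantized]
```

Each weight is stored in one byte instead of four, and the round trip through `dequantize_int8` reproduces every value to within half a quantization step (`scale / 2`). That trade of a small, bounded error for lower memory and bandwidth is the general reason quantization helps inference cost.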
The Meta Adaptive Ranking Model represents a significant leap forward in the development of AI Recommendation Systems. By addressing the inference trilemma and leveraging innovative architectural and system design approaches, Meta is able to serve LLM-scale models with high efficiency and low latency, ultimately delivering a better experience for users and advertisers alike. This development underscores Meta's commitment to pushing the boundaries of AI technology and its impact on industries and people's lives.