Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads
Meta continues to lead the industry in utilizing groundbreaking AI Recommendation Systems (RecSys) to deliver better experiences for people, and better results for advertisers. To reach the next frontier of performance, we are scaling Meta's Ads Recommender runtime models to LLM scale and complexity to further a deeper understanding of people's interests and intent. This increase [...] The post Meta Adaptive Ranking Model: Bending the Inference Scaling Curve to Serve LLM-Scale Models for Ads appeared first on Engineering at Meta.

Meta has been at the forefront of using advanced AI Recommendation Systems (RecSys) to enhance user experiences and improve advertising outcomes. To push performance further, Meta is scaling its Ads Recommender runtime models to LLM scale and complexity, allowing for a deeper understanding of users' interests and intent. This scaling, however, presents a significant challenge known as the "inference trilemma": balancing increased model complexity, and the compute and memory it demands, against the low latency and cost efficiency required of a global service used by billions of people.
To address this challenge, Meta has developed the Meta Adaptive Ranking Model, which effectively bends the inference scaling curve with high return on investment (ROI) and industry-leading efficiency. The Adaptive Ranking Model replaces the traditional "one-size-fits-all" inference approach with intelligent request routing. By dynamically aligning model complexity with a rich understanding of a person's context and intent, the system ensures that every request is served by the most effective and efficient model. This approach allows Meta Ads to maintain the strict, sub-second latency required for the platform while providing a high-quality experience for every user.
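The routing idea described above can be illustrated with a minimal sketch. The source does not say how requests are routed, so everything here is an assumption made for illustration: the tier names, the latency and quality figures, the `min_signals` threshold, and the `route` function are all hypothetical and are not Meta's implementation. The sketch simply picks the highest-quality model tier that fits a request's latency budget and whose context requirement the request meets, falling back to the cheapest tier otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    """A hypothetical model variant; all numbers are illustrative only."""
    name: str
    est_latency_ms: float  # assumed estimated serving latency
    quality: float         # assumed relative quality score
    min_signals: int       # assumed context needed for this tier to pay off


# Hypothetical tiers, not taken from the source.
TIERS = [
    ModelTier("small", est_latency_ms=20.0, quality=0.70, min_signals=0),
    ModelTier("medium", est_latency_ms=80.0, quality=0.85, min_signals=50),
    ModelTier("large", est_latency_ms=400.0, quality=0.95, min_signals=200),
]


def route(num_signals: int, latency_budget_ms: float, tiers=TIERS) -> ModelTier:
    """Return the highest-quality tier that fits the budget and the context.

    A tier is eligible when its estimated latency is within the budget and
    the request carries at least `min_signals` context signals. If no tier
    is eligible, fall back to the lowest-latency tier.
    """
    eligible = [
        t for t in tiers
        if t.est_latency_ms <= latency_budget_ms and t.min_signals <= num_signals
    ]
    if not eligible:
        return min(tiers, key=lambda t: t.est_latency_ms)
    return max(eligible, key=lambda t: t.quality)


if __name__ == "__main__":
    # A context-rich request with a generous budget gets the largest tier.
    print(route(num_signals=500, latency_budget_ms=1000.0).name)
    # A sparse-context request stays on the smallest tier despite the budget.
    print(route(num_signals=5, latency_budget_ms=1000.0).name)
```

In this sketch a tight latency budget or sparse context keeps a request on a cheaper tier, which is one simple way to align model complexity with each request rather than serving every request with the same model.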
Serving LLM-scale models at Meta's scale required a fundamental rethinking of the inference stack. Three key innovations have driven this transformation:
1. **Inference-Efficient Model Scaling**: By shifting to a request-centric architecture, the Adaptive Ranking Model serves a model of LLM scale and complexity at sub-second latency. This enables a more sophisticated understanding of a person's interests and intent without compromising the user experience.
2. **Model/System Co-Design**: Hardware-aware model architectures align model design with the capabilities and limitations of the underlying hardware and silicon, which has significantly improved hardware utilization in heterogeneous hardware environments.
3. **Reimagining the Inference Pipeline**: The Adaptive Ranking Model introduces a new approach to the inference pipeline, optimizing resource allocation and ensuring that the most suitable model variant is selected for each request based on the user's context and intent.
These innovations have allowed Meta to overcome the inference trilemma and serve LLM-scale models with the required latency and efficiency. By leveraging the Meta Adaptive Ranking Model, the company continues to push the boundaries of AI-driven recommendation systems, delivering better experiences for users and more effective advertising solutions for advertisers. This commitment to innovation underscores Meta's dedication to leading the industry in harnessing the power of AI for global-scale applications.










