
Building better AI benchmarks: How many raters are enough?

Algorithms & Theory

7 April 2026 at 09:37 am

In recent years, artificial intelligence (AI) has advanced rapidly, driven by increasingly sophisticated models and algorithms. As these models grow more complex, robust and reliable benchmarks for evaluating them have become crucial. One of the key challenges in building such benchmarks is deciding how many raters are needed to produce accurate and consistent evaluations. The question matters most in domains that still depend on human judgment, such as natural language processing (NLP) tasks in which model outputs are scored by human annotators.

The problem of determining the right number of raters stems from the inherent subjectivity in human evaluation. Unlike automated metrics, human raters may have varying levels of expertise, biases, or interpretations of the task at hand. This variability can lead to inconsistent ratings and, consequently, unreliable benchmark results. To address this, researchers have explored different strategies for determining the optimal number of raters needed to achieve a desired level of agreement and reliability.

One common approach is to use inter-rater reliability (IRR) measures, such as Cohen's kappa (for a pair of raters) or Fleiss' kappa (for a fixed number of raters per item), to assess the degree of agreement among raters. These metrics correct for agreement that would occur by chance, providing a more nuanced picture of the raters' consistency than raw percent agreement. By tracking IRR and the stability of the aggregated ratings as raters are added, researchers can identify the point at which additional raters no longer meaningfully improve reliability.
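As a rough illustration of the chance-corrected agreement described above, the following is a minimal sketch of Fleiss' kappa in plain Python. The function name `fleiss_kappa` and the input layout (one list of labels per item, one label per rater) are assumptions made for this example, not something the article specifies.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for categorical ratings.

    `ratings` is a list of items; each item is a list of category labels,
    one per rater. Every item must have the same number of raters (>= 2).
    This input layout is an assumption made for this sketch.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from the overall category proportions.
    total = n_items * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        # Only one category was ever used; kappa is undefined in this case.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Three raters label four model outputs as "good" or "bad".
    example = [
        ["good", "good", "good"],
        ["good", "bad", "good"],
        ["bad", "bad", "bad"],
        ["bad", "good", "bad"],
    ]
    print(f"Fleiss' kappa: {fleiss_kappa(example):.3f}")
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.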

However, the number of raters required to achieve a satisfactory level of IRR can vary depending on the specific task and the nature of the evaluation. For example, in tasks that are more straightforward or have clear-cut criteria, a smaller number of raters may suffice. In contrast, tasks that are more subjective or require specialized knowledge might necessitate a larger number of raters to ensure a diverse range of perspectives and reduce individual biases.

Another factor to consider is the cost and time associated with involving more raters. In many cases, especially in academic research, the resources available for human evaluation are limited. Therefore, there is a need to balance the desire for high reliability with the practical constraints of the evaluation process. Some researchers have proposed using crowdsourcing platforms, such as Amazon Mechanical Turk, to recruit a larger pool of raters at a lower cost. However, this approach also introduces its own set of challenges, such as ensuring the quality of the ratings and managing the potential for low-effort or spam responses.
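One way to reason about this trade-off is the Spearman-Brown prediction formula, which estimates how the reliability of an averaged rating grows as raters are added. The sketch below is not a method the article prescribes. It assumes raters are interchangeable, that scores are averaged, and that a single-rater reliability estimate (for example from a pilot study) is available. The function names are hypothetical.

```python
import math


def reliability_of_mean(single_rater_reliability, n_raters):
    """Spearman-Brown: predicted reliability of the mean of n_raters ratings."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(single_rater_reliability, target_reliability):
    """Smallest number of raters whose averaged rating reaches the target.

    Both arguments must lie strictly between 0 and 1.
    """
    r, t = single_rater_reliability, target_reliability
    if not (0 < r < 1 and 0 < t < 1):
        raise ValueError("reliabilities must be strictly between 0 and 1")
    k = t * (1 - r) / (r * (1 - t))
    # Subtract a tiny tolerance so floating-point noise does not add a rater.
    return max(1, math.ceil(k - 1e-9))


if __name__ == "__main__":
    pilot_reliability = 0.5  # assumed value from a hypothetical pilot study
    for target in (0.7, 0.8, 0.9):
        k = raters_needed(pilot_reliability, target)
        print(f"target {target}: {k} raters "
              f"(predicted {reliability_of_mean(pilot_reliability, k):.3f})")
```

Because the predicted gain flattens as raters are added, a calculation like this can help decide where extra annotation budget stops paying off, under the stated assumptions.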

In addition to the number of raters, the training and calibration of the raters play a significant role in achieving reliable benchmark results. Providing clear guidelines, training materials, and opportunities for discussion among raters can help reduce variability and improve the consistency of the ratings. Some benchmarking initiatives have adopted a multi-stage evaluation process, where raters are first trained and then progressively involved in more complex tasks, ensuring that only those with a solid understanding of the criteria participate in the final evaluation.

Automated evaluation metrics can also supplement or even replace human raters in certain cases. Metrics such as BLEU or ROUGE in NLP provide quick and scalable assessments of model performance. However, because they rely largely on surface overlap with reference texts, they often struggle to capture the nuances of human language and may not fully align with human judgments. As a result, there is ongoing debate about the role of human raters in AI benchmarking and the extent to which automated metrics can be trusted.
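A common way to check how well an automated metric aligns with human judgments is to correlate metric scores with mean human ratings over the same outputs. The sketch below uses Spearman rank correlation in plain Python. The function names and the sample numbers are illustrative assumptions, not data from any benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Made-up scores for five model outputs, for illustration only.
    metric_scores = [0.21, 0.35, 0.30, 0.62, 0.48]
    mean_human_ratings = [2.0, 3.5, 2.5, 4.5, 3.0]
    print(f"Spearman: {spearman(metric_scores, mean_human_ratings):.3f}")
```

A low correlation on a given task suggests the metric should not stand in for human raters there, while a high one only shows agreement on the sampled outputs.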

In conclusion, determining the optimal number of raters for AI benchmarks is a complex task that requires careful consideration of various factors, including the nature of the evaluation task, the resources available, and the desired level of reliability. While inter-rater reliability measures provide valuable insights, the ultimate goal is to strike a balance between accuracy, practicality, and cost-effectiveness. As AI continues to evolve, so too must the methods used to evaluate its performance, ensuring that benchmarks remain relevant and meaningful in the pursuit of progress in this rapidly changing field.
