The Challenges of Building Effective LLM Benchmarks
The current state of LLM evaluation and the gaps that comprehensive, high-quality leaderboards still need to fill
Evaluating Large Language Models (LLMs) is a critical part of their development. With new models being released at a rapid pace, keeping track of their performance across various tasks is challenging. Benchmarks like LMSYS's Chatbot Arena have been instrumental, but they face growing criticism for their limitations.
Hot off the press, Scale AI's SEAL, a new private leaderboard, aims to provide a more robust, leak-proof evaluation system. In this article, we look at the current state of LLM evaluation, discuss the gaps that need filling, and explore how SEAL could address them.
1. The Challenges of Building Effective Benchmarks
Creating robust benchmarks is challenging. They must be comprehensive, representative, and capable of measuring a model's quality in a quantitative way that matches the model’s “goodness”. Moreover, preventing test sets from leaking into training data is a significant challenge, particularly as models grow more advanced and multimodal.
Complexity and Time Investment
Developing good evaluations is a time-consuming and complex process. Tesla, for instance, reportedly spends a significant portion of its development time on data collection, evaluation design, and on aligning qualitative and quantitative assessments.
Evaluations need to be:
Comprehensive: Covering a wide range of tasks and capabilities.
Representative: Reflecting real-world use cases and challenges.
High Quality: Ensuring that the evaluation criteria are robust and fair.
Data Leakage and Memorization
Preventing test set seepage into training data is difficult. Even with best efforts to filter out exact matches and n-gram overlaps, synthetic rewrites of test questions and online discussions about them can still leak into the training data, underscoring how hard it is to keep an evaluation truly leak-proof.
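To make this concrete, here is a minimal sketch of word-level n-gram overlap filtering, using toy strings and hypothetical helper functions rather than any lab's actual decontamination pipeline. A verbatim copy of a test question is caught, but a synthetic rewrite of the same problem slips straight through:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(candidate: str, test_set: list[str], n: int = 8) -> bool:
    """Flag a training document if it shares any n-gram with a test item."""
    test_grams = set().union(*(ngrams(item, n) for item in test_set))
    return bool(ngrams(candidate, n) & test_grams)


test_set = ["A farmer has 17 sheep and all but 9 run away. How many sheep are left?"]

# A verbatim copy is caught ...
print(is_contaminated(test_set[0], test_set))  # True

# ... but a paraphrased rewrite of the same problem sails through the filter.
rewrite = "If a rancher owns 17 sheep and every one except 9 escapes, how many remain?"
print(is_contaminated(rewrite, test_set))      # False
```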
Human Evaluation Challenges
Not all LLM tasks can be evaluated automatically (summarization, for example). Human involvement introduces variables such as the rater's attention to detail, answer length, style, and the treatment of refusals. Controlling for these variables is crucial but challenging.
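As a toy illustration of one such confound, the sketch below uses made-up pairwise votes (not real Arena data) with a built-in preference for longer answers, then stratifies the win rate by which response was longer. A large gap between the two strata is exactly the kind of signal an evaluation needs to detect and control for:

```python
import random

random.seed(0)

# Simulated pairwise judgments: (len_a, len_b, a_wins), with a built-in
# verbosity bias standing in for real human votes.
votes = []
for _ in range(1000):
    len_a, len_b = random.randint(50, 400), random.randint(50, 400)
    p_a_wins = 0.5 + (0.25 if len_a > len_b else -0.25)
    votes.append((len_a, len_b, random.random() < p_a_wins))


def win_rate(subset):
    return sum(a_wins for _, _, a_wins in subset) / max(len(subset), 1)


longer_a = [v for v in votes if v[0] > v[1]]
shorter_a = [v for v in votes if v[0] < v[1]]

print(f"A's win rate when A is longer:  {win_rate(longer_a):.2f}")   # roughly 0.75
print(f"A's win rate when A is shorter: {win_rate(shorter_a):.2f}")  # roughly 0.25
```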
The idea is for leaderboards to evolve, learning from past shortcomings with each new version. Expecting a perfect leaderboard on the very first attempt would be a pipe dream; perfect is the enemy of good.
Implementation Details
As shown in [5], the same models can end up with completely different quantitative scores. This can be partially attributed to how the evaluation is set up in the first place: implementation details can play a crucial role in separating the winners from the losers.
Imagine you want an LLM to answer a question with four answer choices; here we look at an example from [5]. For the same question, different implementations extract the model's answer in different ways, anywhere from comparing the probabilities of the four choice letters A, B, C, and D to something more involved, such as scoring the full answer texts. If you would like to see how much this can affect the evaluation, I encourage you to check out the original post [5]; a toy illustration follows below.
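To see how much latitude "implementation details" leaves, here is a toy sketch with hard-coded mock log-probabilities standing in for a real model, and hypothetical scoring functions rather than any specific harness's code. Scheme 1 compares the log-probabilities of the bare letters A to D; Scheme 2 compares length-normalized log-probabilities of the full answer texts. The same mock model picks different answers under the two schemes:

```python
# Mock next-token log-probabilities a hypothetical model assigns to the four
# choice letters after "Question: ... Answer:". A real harness would query an LLM.
letter_logprobs = {"A": -2.1, "B": -1.4, "C": -2.8, "D": -3.0}

# Mock summed log-probabilities of the full answer strings; longer answers
# accumulate more negative terms, which is why normalization matters.
full_answer_logprobs = {"A": -9.5, "B": -14.2, "C": -7.9, "D": -12.6}
answer_token_counts = {"A": 3, "B": 4, "C": 4, "D": 4}


def pick_by_letter(logprobs):
    """Scheme 1: choose the letter with the highest log-probability."""
    return max(logprobs, key=logprobs.get)


def pick_by_full_answer(logprobs, lengths):
    """Scheme 2: choose the answer with the highest per-token log-probability."""
    return max(logprobs, key=lambda k: logprobs[k] / lengths[k])


print("letter-based choice:     ", pick_by_letter(letter_logprobs))         # B
print("full-answer-based choice:", pick_by_full_answer(full_answer_logprobs,
                                                       answer_token_counts))  # C
```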
2. The State of LLM Evaluation: LMSYS
What is LMSYS?
LMSYS is a crowdsourced platform designed to evaluate LLM-based chatbots. It uses pairwise comparisons and Elo-style ratings to rank models based on their performance across various tasks. The Chatbot Arena, part of LMSYS, has collected over a million human comparisons to rank these models.
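For intuition, here is a minimal sketch of an Elo-style update on a toy battle log with hypothetical model names. The actual Arena rating computation has its own details (newer versions use a Bradley-Terry style fit), so treat this purely as an illustration of how pairwise votes turn into a ranking:

```python
from collections import defaultdict

K = 32  # update step size; purely illustrative


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings, model_a, model_b, winner):
    """Apply one pairwise battle result; winner is 'a', 'b', or 'tie'."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))


ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating

# Toy battle log: (model_a, model_b, winner) from hypothetical user votes.
battles = [("model-x", "model-y", "a"), ("model-x", "model-z", "tie"),
           ("model-z", "model-y", "a"), ("model-x", "model-y", "a")]
for a, b, w in battles:
    update(ratings, a, b, w)

print(dict(ratings))  # model-x ends up highest, model-y lowest
```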
Limitations of LMSYS
While LMSYS has been widely adopted, it has notable limitations:
User Bias: The benchmark heavily relies on user-generated questions and answers, which introduces variability based on users' ability to differentiate model capabilities. This can result in inconsistent evaluations.
Gaming the System: Models can be optimized to produce direct, uncensored responses that score well on usability metrics but may not reflect true intelligence or capability; that is, they focus on producing fun, direct answers rather than demonstrating deep understanding or advanced capabilities.
Human Constraints: The reliance on human judgments means that the benchmark is limited by human biases and subjectivity. The better models become, the less useful the benchmark gets, as it primarily measures which model provides the most pleasing and direct answers rather than the most capable and intelligent ones.
Scalability Issues: As models improve and more of them are added, human evaluation becomes slow and costly, potentially affecting the timeliness and accuracy of leaderboard updates.
3. LMSYS’s Arena-Hard
LMSYS’s Arena-Hard is a more recent benchmark designed to improve model evaluation.
Better Separability: It distinguishes between models more confidently, and its quantitative results line up better with qualitative impressions.
Higher Agreement: It shows high agreement with Chatbot Arena rankings.
Cost-Effective: Runs are fast and cheap.
Frequent Updates: Uses live data for frequent updates.
Arena-Hard uses bootstrapped confidence intervals (a short sketch follows the list below) to measure:
Agreement with Human Preference: High agreement indicates the benchmark aligns with human judgment.
Separability: Measures the ability to confidently distinguish between models.
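The bootstrapping idea can be illustrated with a short, self-contained sketch using toy win/loss records against a fixed baseline and a hypothetical bootstrap_ci helper, not Arena-Hard's actual pipeline. Resample the battles with replacement, recompute the win rate each time, and report a percentile interval; two models are "separable" when their intervals do not overlap:

```python
import random

random.seed(42)


def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for a win rate over 0/1 outcomes."""
    stats = []
    for _ in range(n_boot):
        resample = random.choices(outcomes, k=len(outcomes))
        stats.append(sum(resample) / len(resample))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


# Toy judge verdicts against a fixed baseline model (1 = win, 0 = loss).
model_a = [1] * 140 + [0] * 60   # about a 70% win rate
model_b = [1] * 100 + [0] * 100  # about a 50% win rate

ci_a, ci_b = bootstrap_ci(model_a), bootstrap_ci(model_b)
print(f"model_a win-rate CI: [{ci_a[0]:.2f}, {ci_a[1]:.2f}]")
print(f"model_b win-rate CI: [{ci_b[0]:.2f}, {ci_b[1]:.2f}]")

# Non-overlapping intervals mean the benchmark can confidently separate the two.
print("separable:", ci_a[0] > ci_b[1] or ci_b[0] > ci_a[1])
```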
Despite these improvements, Arena-Hard still faces challenges similar to other open benchmarks, such as data leakage and the inherent biases of human evaluators.
4. SEAL: Scale AI's Approach
What is SEAL?
SEAL is a private, expert-driven evaluation platform designed to address the shortcomings of existing benchmarks like LMSYS. It aims to provide robust, trustworthy evaluations that are resistant to the overfitting and gaming documented in the GSM1k work [4].
5. Head-to-Head Battle: LMSYS vs. SEAL
Advantages of SEAL
Leak-Proof Evaluations: SEAL's private datasets mitigate the risk of data leakage, which is a significant issue with open benchmarks.
High-Quality Evaluations: By employing domain “experts”, SEAL ensures that evaluations are consistent and reliable.
Regular Updates: Continuous updates with fresh data prevent the benchmark from becoming stale or gamed.
Disadvantages of SEAL
While SEAL addresses many of the issues present in LMSYS, it also has its own set of disadvantages:
Lack of Transparency: The private nature of SEAL's evaluations means that the specific datasets and evaluation criteria are not publicly available. This lack of transparency can lead to questions about the fairness and reproducibility of the results.
Trust Issues: Because SEAL is managed by a company that may have ties to various LLM developers, there could be perceived or actual biases in the evaluations. This centralization of control can raise concerns about impartiality.
Exclusivity: SEAL decides which models to evaluate, potentially overlooking significant models from other regions, such as China. Leaving such models out limits the comprehensiveness of the leaderboard.
Static Benchmarks: Although SEAL promises continuous updates, there is still a risk that the benchmarks may not evolve quickly enough to keep pace with the rapid development of new models and capabilities. This can result in outdated evaluations that do not reflect current model performance.
Reproducibility Concerns: Without public access to the evaluation datasets and criteria, other researchers cannot reproduce the results independently. This reduces the scientific rigor and community trust in the evaluation outcomes.
6. Looking Forward In The Fast-Moving World of LLMs
While SEAL represents a significant step forward, the landscape of LLM evaluation will continue to evolve. High-quality, trustworthy benchmarks are essential for advancing the field and ensuring that the best models are recognized and utilized.
Outro
The introduction of SEAL by Scale AI marks a promising development in the realm of LLM evaluation. By addressing key limitations of existing benchmarks like LMSYS, SEAL aims to provide more accurate, reliable, and unbiased assessments of LLM capabilities. However, it is essential to balance transparency and trustworthiness to ensure the evaluations are both credible and reproducible. As the field of LLMs continues to grow, ongoing investment in high-quality benchmarks will be crucial to understanding and harnessing the full potential of these powerful models.
References
1. LMSYS Chatbot Arena: https://chat.lmsys.org/
2. LMSYS Arena-Hard: https://lmsys.org/blog/2024-04-19-arena-hard/
3. Scale AI SEAL Leaderboards announcement: https://scale.com/leaderboard
4. GSM1k (A Careful Examination of Large Language Model Performance on Grade School Arithmetic): https://arxiv.org/abs/2405.00332
5. Hugging Face blog on MMLU evaluation discrepancies in the Open LLM Leaderboard: https://github.com/huggingface/blog/blob/main/open-llm-leaderboard-mmlu.md