Understanding CRAG: Meta's Comprehensive Benchmark for Retrieval Augmented Generation


CRAG, or the Comprehensive RAG Benchmark, is Meta’s newest benchmark for evaluating the performance of retrieval augmented generation (RAG) systems. Its role and significance in the field are still up for debate. In our latest episode of RAG Masters, we explore what CRAG is, its components, its importance, and how it is evaluated.

Is Meta's CRAG Any Good? We Dissect the new RAG Benchmark for AI Engineers

Watch the latest episode of RAG Masters where we dive into Meta's new benchmark: CRAG

First: What is RAG?

Quick refresher: RAG (Retrieval Augmented Generation) overview with an example question. Credit: Meta

Let's start with a brief overview of Retrieval Augmented Generation (RAG). Think of RAG as a two-step process:

  1. Retrieval: Imagine you have a digital librarian who searches through a vast library of documents to find the most relevant information related to your question.
  2. Generation: Once the relevant documents are found, RAG summarizes this information to generate a coherent and accurate response to your question.

For example, you ask a question like "What is the significance of CRAG?" The RAG system first retrieves the most relevant documents on CRAG. Then, it summarizes the key points from these documents to create a detailed and informative answer.

By combining retrieval and generation, RAG ensures that the responses are both accurate and contextually relevant.
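
As a rough illustration of the two steps above, here is a minimal Python sketch. It assumes a toy word-overlap retriever and a placeholder `llm` callable; the function names (`retrieve`, `generate`) and the sample documents are hypothetical, and a production system would more likely use embedding search and a real language model.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, documents, k=2):
    """Step 1 (retrieval): rank documents by word overlap with the question."""
    q_counts = Counter(tokenize(question))

    def overlap(doc):
        d_counts = Counter(tokenize(doc))
        return sum(min(q_counts[w], d_counts[w]) for w in q_counts)

    return sorted(documents, key=overlap, reverse=True)[:k]


def generate(question, passages, llm=None):
    """Step 2 (generation): build a grounded prompt and pass it to a model.

    `llm` is an assumed callable (prompt -> answer). When it is None, the
    prompt itself is returned so the sketch runs without any model access.
    """
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer using only the context.\n"
        f"{context}\n"
        f"Question: {question}"
    )
    return prompt if llm is None else llm(prompt)


# Illustrative documents only; not taken from any real corpus.
docs = [
    "CRAG is a benchmark for RAG systems from Meta.",
    "Bananas are rich in potassium.",
    "The significance of CRAG is that it standardizes RAG evaluation.",
]
question = "What is the significance of CRAG?"
top = retrieve(question, docs, k=2)
print(generate(question, top))
```

Keeping retrieval and generation as separate functions mirrors the two-step description and makes it easy to swap either half for something stronger.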

What is CRAG?

CRAG (Comprehensive RAG Benchmark) was developed by Meta, which introduced the concept of RAG in its original 2020 research paper. CRAG aims to address some of the existing problems in RAG approaches, including the tendency of large language models to hallucinate.

CRAG consists of more than 4,000 question-answer pairs spread across various domains and question types. Its primary objective is to target known issues with RAG systems and push the state of the art forward by providing a comprehensive dataset for performance evaluation.

Components of CRAG

CRAG's structure is designed to challenge RAG systems by using different types of questions that are known to cause problems. When Meta was developing CRAG, they looked at a host of data, then created a series of questions and answers around that data. Now, they have a significant source of data that can be used to answer those questions.

The benchmark is broken down into sections where engineers are given various retrieval tasks:

  1. In the first section, five HTML pages are given for every question as potential source material.
  2. In the second section, 50 HTML pages are provided.
  3. The third section includes a knowledge graph with several million entities.

This comprehensive approach ensures that CRAG can test a wide range of retrieval scenarios, pushing the limits of current RAG technology.

Here's what's in CRAG:

  • 4,409 Question and Answer Pairs
  • 5 Domains: Finance, Sports, Music, Movie and Open
  • 8 Question types: Simple, Simple with condition, Set, Comparison, Aggregation, Multi-hop, Post-processing-heavy, and False-premise
  • Data: Mock APIs, 50 HTML pages per question, 2.6M entity knowledge graph
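
To make the shape of the benchmark easier to picture, here is a hedged sketch of how one might model a single item in Python. The `BenchmarkItem` class, its field names, and `count_by_domain` are hypothetical and do not reflect the actual schema of the released dataset; only the five domain names come from the list above.

```python
from dataclasses import dataclass, field

# The five domains listed above, lowercased for use as labels.
DOMAINS = {"finance", "sports", "music", "movie", "open"}


@dataclass
class BenchmarkItem:
    """Hypothetical record for one question-answer pair."""

    question: str
    answer: str
    domain: str
    question_type: str
    pages: list = field(default_factory=list)  # raw HTML of candidate pages

    def __post_init__(self):
        # Reject labels outside the five known domains.
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


def count_by_domain(items):
    """Return how many items fall into each domain (zero if none)."""
    counts = {d: 0 for d in sorted(DOMAINS)}
    for item in items:
        counts[item.domain] += 1
    return counts


# Placeholder item: the answer text is a stand-in, not a real reference answer.
example = BenchmarkItem(
    question="who started performing earlier, Adele or Ed Sheeran?",
    answer="<reference answer>",
    domain="music",
    question_type="comparison",
)
print(count_by_domain([example]))
```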

Importance of CRAG in the RAG Community

CRAG is the latest benchmark for both researchers and engineers in the RAG community who are working to improve the effectiveness and accuracy of RAG. It takes the challenges a RAG system can encounter when searching for accurate answers and standardizes them into a holistic question set that can serve as a common benchmark.

The primary significance of CRAG lies in its ability to provide a structured and holistic dataset that addresses the robustness and performance of RAG systems. This lets the RAG community benchmark and improve their systems effectively.

Evaluation Methods in CRAG

Evaluating the performance of RAG systems with CRAG relies on automatic evaluation. Humans first curate the question-answer pairs, and the RAG system then comes up with its own answer. The evaluation then asks a language model, 'Hey, is this answer right?'

This approach typically uses multiple language models to verify the accuracy of answers, which makes the assessment more reliable. It avoids something called the ‘self-preference’ problem, which is where a language model outputs something, and then you ask the same language model, “Did you do a good job?” These models have a tendency to just like what they have said. So bringing multiple models into the mix is critical for an effective assessment.
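
The multi-judge idea can be sketched as a simple majority vote. This is an illustrative assumption, not CRAG's actual scoring procedure: in practice each judge would wrap a different language model, while the stand-in judges below are plain string checks so the example runs without model access. All names here (`exact_match_judge`, `containment_judge`, `majority_verdict`) are hypothetical.

```python
def exact_match_judge(question, reference, candidate):
    """Stand-in judge: correct only if the answers match exactly."""
    return reference.strip().lower() == candidate.strip().lower()


def containment_judge(question, reference, candidate):
    """Stand-in judge: correct if the reference appears in the candidate."""
    return reference.strip().lower() in candidate.strip().lower()


def majority_verdict(question, reference, candidate, judges):
    """Ask every judge and accept the answer only on a strict majority.

    A tie counts as incorrect, which is a deliberately conservative choice.
    """
    votes = [judge(question, reference, candidate) for judge in judges]
    return sum(votes) * 2 > len(votes)


# Two judges disagree here, so the tie is scored as incorrect.
print(majority_verdict(
    "What is the capital of France?",
    "Paris",
    "It is Paris.",
    [exact_match_judge, containment_judge],
))
```

Because no single judge decides the outcome, a model that tends to approve of its own output cannot pass an answer on its own.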

One of the most powerful parts of the CRAG evaluation method is the diversity of question types. Meta has created question types designed to trip up RAG in a variety of ways, whether with temporal questions whose answers change frequently, or multi-hop questions that need several retrieval steps to answer a single question. There are also false-premise questions that inject a false statement into the question, along with post-processing-heavy questions that need reasoning or processing to reach the correct answer.

Here's a table with additional examples of CRAG question types:

Table 2: Definition of CRAG question types.

| Question type | Definition |
| --- | --- |
| Simple | Questions asking for simple facts that are unlikely to change over time, such as the birth date of a person and the authors of a book. |
| Simple w. Condition | Questions asking for simple facts with some given conditions, such as stock prices on a certain date and a director’s recent movies in a certain genre. |
| Set | Questions that expect a set of entities or objects as the answer (e.g., “what are the continents in the southern hemisphere?”). |
| Comparison | Questions that compare two entities (e.g., “who started performing earlier, Adele or Ed Sheeran?”). |
| Aggregation | Questions that require aggregation of retrieval results to answer (e.g., “how many Oscar awards did Meryl Streep win?”). |
| Multi-hop | Questions that require chaining multiple pieces of information to compose the answer (e.g., “who acted in Ang Lee’s latest movie?”). |
| Post-processing heavy | Questions that need reasoning or processing of the retrieved information to obtain the answer (e.g., “how many days did Thurgood Marshall serve as a Supreme Court justice?”). |
| False Premise | Questions that have a false preposition or assumption (e.g., “What’s the name of Taylor Swift’s rap album before she transitioned to pop?” (Taylor Swift has not yet released any rap album)). |

Credit: Meta
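
One practical way to use these question types is to report accuracy per type instead of a single overall number, since a system can do well on simple questions while failing on multi-hop or false-premise ones. The sketch below assumes you already have a list of `(question_type, is_correct)` pairs from your own evaluation run; `accuracy_by_type` is a hypothetical helper and the sample values are made up for illustration, not real benchmark results.

```python
from collections import defaultdict


def accuracy_by_type(results):
    """Compute the fraction of correct answers for each question type."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


# Illustrative values only; these are not measured results.
sample_results = [
    ("simple", True),
    ("simple", True),
    ("multi-hop", True),
    ("multi-hop", False),
    ("false premise", False),
]
for question_type, score in accuracy_by_type(sample_results).items():
    print(f"{question_type}: {score:.0%}")
```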

Real-World Applications and Challenges

Applying CRAG in real-world scenarios brings challenges along with its benefits. Industry and academia have different perspectives on RAG, which makes real-world applications more challenging.

Real-world applications often involve diverse and unstructured data sources, such as random PDF files and YouTube videos. CRAG, while comprehensive, may not fully encompass the messiness of real-world data. However, it does provide a solid foundation for engineers to experiment and benchmark their RAG systems in a structured environment.

Future of CRAG and RAG

Looking ahead, the future of CRAG and RAG benchmarking is promising. The continuous development of benchmarks like CRAG is essential for advancing the field. As co-host of the RAG Masters show Daniel Warfield notes, "I'm just looking forward to whatever they name this next paper. Maybe DRAG, that would probably be a good one, but it's going to be something bad like FRAG."

The focus will likely shift towards more complex and realistic data scenarios, addressing the practical challenges faced by engineers. The evolution of CRAG and similar benchmarks is a key step towards enhancing the robustness and intelligence of RAG systems.


CRAG stands as a significant milestone in the field of retrieval augmented generation. With its release, engineers now have a comprehensive and challenging benchmark. This framework will enable researchers and engineers to push the boundaries of current RAG systems. 

You can watch the full episode of RAG Masters and stay tuned for more updates and discussions in our community and on future posts.
