RAG: Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is the use of retrieval methods (e.g. via search and vector stores) to provide generative models with additional context, or grounding. Grounding ensures that your LLM has access to information beyond its knowledge cutoff, or information specific to your organization. Some companies loosely refer to RAG as "finetuning," since it lets you customize the LLM to your data ("chat with your data"), even though no model weights are actually updated.

A production RAG system might look something like this:

Schema of a Production RAG System

In this diagram, we start by loading our data into the retrieval system. This step usually consists of splitting documents into chunks small enough to fit in the model's context window. The retrieval system can consist of a vector store alone, or a combination of several components, such as a traditional inverted index and/or a reranking model. At inference time, we send the user query to the retrieval system, which passes the retrieved chunks to the generation model (i.e. the LLM), which uses them as context when generating a response.
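
As a rough illustration, the inference path described above could be sketched as follows. The `retriever`, `reranker`, and `llm` arguments are hypothetical placeholders for whatever retrieval and generation components your stack uses; this is a sketch, not a specific library's API.

```python
def answer_query(query: str, retriever, llm, reranker=None, top_k: int = 5) -> str:
    # 1. Send the user query to the retrieval system.
    chunks = retriever.search(query, top_k=top_k)

    # 2. Optionally rerank the candidates with a dedicated reranking model.
    if reranker is not None:
        chunks = reranker.rerank(query, chunks)

    # 3. Use the retrieved chunks as context for the generation model.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.generate(prompt)
```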

For our purposes, we’ll simplify the diagram into these basic components:

Simplified Schema of a Production RAG System

Retrieval-Only LLM Testing

The testing process for RAG systems differs slightly from that of other LLM use cases. In the following, we share tips and give guidance on how to implement retrieval-only LLM testing.

Retrieval-Only LLM Testing in a RAG System

Creating a Ground Truth Dataset

To create a dataset for retrieval-only testing, first define the queries that you want to test the retrieval model on. Then, for each query, gather the documents or chunks that a correct retrieval should return and annotate them as relevant.
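
For instance, a minimal ground truth dataset could be a list of queries, each paired with the IDs of the chunks a correct retrieval should return. The field names below are purely illustrative, not a required schema:

```python
# Illustrative ground truth dataset for retrieval-only testing.
# Each entry pairs a test query with the chunk IDs annotated as relevant.
ground_truth = [
    {
        "query": "What is our refund policy for annual plans?",
        "relevant_chunk_ids": ["policies.md#refunds-1", "policies.md#refunds-2"],
    },
    {
        "query": "How do I rotate an API key?",
        "relevant_chunk_ids": ["security.md#api-keys-3"],
    },
]
```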

Define Retrieval-Only Metrics

In retrieval-only tests, precision and recall are commonly used metrics to evaluate the performance of a retrieval model. Precision measures the fraction of retrieved items that are actually relevant to the query, while recall measures the fraction of relevant items in the dataset that the retrieval model returned. Both metrics can be combined into an F1 score, the harmonic mean of precision and recall.
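
Concretely, if we treat the retrieved chunk IDs and the annotated relevant chunk IDs as sets, the three metrics can be computed per query like this (a simple sketch that ignores ranking order):

```python
def precision_recall_f1(retrieved_ids, relevant_ids):
    """Compute set-based precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    return precision, recall, f1
```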

Another commonly used metric for retrieval models is Mean Reciprocal Rank (MRR), which measures the rank of the first relevant item in the retrieved list. The reciprocal of this rank is averaged over all queries in the Testset.
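
A minimal sketch of MRR, assuming each retrieved list is ordered from best to worst rank:

```python
def mean_reciprocal_rank(results):
    """`results` is a list of (retrieved_ids, relevant_ids) pairs,
    with retrieved_ids ordered from best to worst rank."""
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in results:
        relevant = set(relevant_ids)
        rr = 0.0
        for rank, chunk_id in enumerate(retrieved_ids, start=1):
            if chunk_id in relevant:
                rr = 1.0 / rank  # reciprocal rank of the first relevant item
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```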

Normalized Discounted Cumulative Gain (NDCG) is another metric that takes both the relevance and the position of the retrieved items into account. It measures the quality of the ranked list of retrieved items by comparing it to an ideal ranking of the relevant items.
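
Here is a minimal sketch of NDCG@k with binary relevance labels (1 if a chunk was annotated as relevant, 0 otherwise):

```python
import math


def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance labels."""
    relevant = set(relevant_ids)

    # DCG of the ranking actually produced by the retriever.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved_ids[:k], start=1)
        if chunk_id in relevant
    )

    # DCG of the ideal ranking: all relevant chunks placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return dcg / idcg if idcg > 0 else 0.0
```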

Overall, the choice of metrics used to evaluate a retrieval model depends on the specific application and the goals of the model.

Running the Tests

Now that we have a dataset and metrics defined, we can test the retrieval performance of the RAG system.
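
As a framework-agnostic sketch, such a test run could loop over the ground truth dataset, call your retrieval system, and aggregate the metrics defined above. The `retrieve` argument is a hypothetical stand-in for your retriever, and `precision_recall_f1` is the helper sketched earlier:

```python
def run_retrieval_tests(ground_truth, retrieve, top_k=5):
    """Evaluate a retrieval function against a ground truth dataset.

    `retrieve(query, top_k)` should return a ranked list of chunk IDs.
    """
    per_query = []
    for example in ground_truth:
        retrieved_ids = retrieve(example["query"], top_k=top_k)
        p, r, f1 = precision_recall_f1(retrieved_ids, example["relevant_chunk_ids"])
        per_query.append({"query": example["query"], "precision": p, "recall": r, "f1": f1})

    if not per_query:
        return {}

    n = len(per_query)
    return {
        "precision": sum(m["precision"] for m in per_query) / n,
        "recall": sum(m["recall"] for m in per_query) / n,
        "f1": sum(m["f1"] for m in per_query) / n,
        "per_query": per_query,
    }
```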

🚧 Under construction…

Generation-Only Tests & End-to-End Tests

The testing process for RAG systems differs slightly from that of other LLM use cases. In the following, we share tips and give guidance on how to test the quality of the generation step on its own, and how to implement generation-only and end-to-end tests.
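
As a rough sketch, a generation-only test fixes the retrieval step by passing the annotated ground truth chunks straight to the model, while an end-to-end test exercises the full pipeline. The `retrieve`, `generate`, and `score` arguments are hypothetical placeholders for your retriever, your generation model, and however you grade responses (e.g. an LLM-as-judge or human review):

```python
def generation_only_test(example, generate, score):
    # Fix the retrieval step: feed the annotated ground truth chunks
    # directly to the generation model, isolating generation quality.
    context = "\n\n".join(example["ground_truth_chunks"])
    answer = generate(f"Context:\n{context}\n\nQuestion: {example['query']}")
    return score(answer, example.get("reference_answer"))


def end_to_end_test(example, retrieve, generate, score, top_k=5):
    # Exercise the full pipeline: retrieve, then generate, then score.
    # Here `retrieve` is assumed to return chunk texts.
    chunks = retrieve(example["query"], top_k=top_k)
    context = "\n\n".join(chunks)
    answer = generate(f"Context:\n{context}\n\nQuestion: {example['query']}")
    return score(answer, example.get("reference_answer"))
```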

Let’s see what all these different types of testing look like altogether:

Different Types of Testing in a RAG System

We can implement all of these testing types using Scorecard!

Step Up Your RAG-Testing Game With Scorecard 🚀

When testing your RAG LLM application, Scorecard helps you in three ways:

  1. Scorecard Specialists are available to provide ground truth labels for both retrieval and generation tests. These labels can be used in a continuous process or to bootstrap an automated evaluation process. If you would like to have your results labeled by a Scorecard Specialist, please contact us at support@getscorecard.ai.

  2. The Scorecard UI provides a customizable dashboard to monitor and track the performance of your RAG models, including retrieval-only, generation-only, and end-to-end tests.

  3. Scorecard allows you to easily compare different RAG models and configurations in a production setting, so you can identify the best performing models and make informed decisions about which ones to deploy. Scorecard also provides insights into the specific areas where your RAG models may need improvement, so you can finetune your models and optimize their performance.

Scorecard works alongside RAG frameworks such as LlamaIndex and LangChain.