Automate LLM Evaluation with AI-Powered Scoring
Increasing Complexity When Manually Evaluating Large Testsets
When introducing a framework to evaluate a Large Language Model (LLM), the process typically begins on a small scale: a few Testcases are evaluated manually. These initial Testcases form the first Testset, and subject-matter experts (SMEs) evaluate the LLM outputs for each of them. As the evaluation process iterates, the number of Testcases grows, allowing testing across more dimensions with different metrics. However, the time required to score those metrics grows proportionally with the number of Testcases. Because SME time is limited, evaluating large Testsets can quickly become a bottleneck that slows down the entire LLM evaluation process.
AI-Supported Scoring With Scorecard
Scorecard addresses this problem by delegating evaluation to an AI model. Scorecard’s AI-powered scoring saves valuable time and lets SMEs focus on complex edge cases instead of well-defined, easy-to-score metrics. This streamlines the LLM evaluation process and enables teams to iterate and deliver faster.
Define Metrics to Score With AI
When defining a new metric for your LLM evaluation, you can specify its evaluation type. Choose the evaluation type “AI” to have the metric scored automatically by an AI model.
In the metric guidelines, make sure to accurately describe how the AI model should score the metric.
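To make this concrete, the sketch below shows what defining such a metric could look like programmatically. The endpoint path, field names, and authentication scheme are assumptions made purely for illustration, not Scorecard’s documented API or SDK; in practice you can create the metric directly in the Scorecard UI.

```python
import os
import requests

# NOTE: the endpoint path, payload fields, and bearer-token auth below are
# illustrative assumptions, not Scorecard's documented API schema.
API_BASE = "https://api.scorecard.example"   # placeholder base URL
API_KEY = os.environ["SCORECARD_API_KEY"]    # assumed API key in an env variable

# A metric with evaluation type "AI" and guidelines that tell the
# scoring model exactly how to grade each response.
metric = {
    "name": "Answer Relevance",
    "evaluation_type": "AI",
    "guidelines": (
        "Score from 1 to 5 how directly the response answers the question in the prompt. "
        "5 = fully answers with no irrelevant content, 1 = off-topic or missing."
    ),
}

resp = requests.post(
    f"{API_BASE}/v1/metrics",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=metric,
    timeout=30,
)
resp.raise_for_status()
print("Created metric:", resp.json())
```

The key point is the guidelines string: the more precisely it describes the scoring rubric, the more consistent the AI scores will be.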
Score Your Testset With a Single Click
After you have run a Testset against your LLM and the model has generated responses for each Testcase, the Scorecard UI displays the status “Awaiting Scoring”.
Simply click the “Run Scoring” button and Scorecard will use AI to score all metrics that have AI scoring enabled.
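If you prefer to trigger scoring from a script instead of the UI, the flow might look like the following sketch. The endpoint names, status values, and response fields are assumptions for illustration only; the “Run Scoring” button is the documented path.

```python
import os
import time
import requests

# Illustrative only: endpoint paths, status strings, and response fields are assumed.
API_BASE = "https://api.scorecard.example"
HEADERS = {"Authorization": f"Bearer {os.environ['SCORECARD_API_KEY']}"}

run_id = "run_123"  # hypothetical ID of a Testset run that is "Awaiting Scoring"

# Kick off AI scoring for every metric in the run that has AI scoring enabled.
requests.post(
    f"{API_BASE}/v1/runs/{run_id}/score", headers=HEADERS, timeout=30
).raise_for_status()

# Poll until scoring finishes.
while True:
    run = requests.get(f"{API_BASE}/v1/runs/{run_id}", headers=HEADERS, timeout=30).json()
    if run["status"] != "Scoring in Progress":
        break
    time.sleep(5)

print("Run status:", run["status"])
```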
Understand the AI Scores With Transparent Explanations
Scorecard not only automates the scoring process, it also explains why the AI model scored each metric the way it did. This helps you understand the reasoning behind the AI scores and lets you verify whether the evaluations make sense.
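For example, once scoring completes you could retrieve each score together with its explanation and review it alongside an SME’s judgment. The endpoint and the shape of the score records below are again assumptions made for illustration.

```python
import os
import requests

# Illustrative only: the endpoint and the fields of each score record are assumed.
API_BASE = "https://api.scorecard.example"
HEADERS = {"Authorization": f"Bearer {os.environ['SCORECARD_API_KEY']}"}

run_id = "run_123"  # hypothetical run ID

scores = requests.get(
    f"{API_BASE}/v1/runs/{run_id}/scores", headers=HEADERS, timeout=30
).json()

# Print each AI score next to its explanation so a reviewer can quickly
# judge whether the automated evaluation makes sense.
for score in scores:
    print(f"Testcase {score['testcase_id']} | {score['metric']}: {score['value']}")
    print(f"  Explanation: {score['explanation']}")
```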