A/B Comparison of Evaluation Runs

During the development of LLM applications, it is common practice to iteratively adjust the system in search of the setup that produces the best results. For example, tweaking model parameters, trying different model versions, and refining prompts all affect the quality of the model’s responses. With multiple iterations and improvements, however, it becomes difficult to accurately quantify and compare the effectiveness of the changes you made. Relying on gut feeling alone is not enough; you need a way to gain confidence in your results.

Compare Runs Easily With Scorecard’s A/B Comparison Feature 📊

Scorecard addresses this need by providing an A/B Comparison feature for runs. This feature allows you to easily compare different runs using the same metrics, ensuring a clear understanding of the impact of your changes.

Only runs that use the same set of evaluation metrics, or the same Scoring Config, can be compared with each other and are eligible for the A/B Comparison feature.
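
As a rough illustration of this eligibility rule, the sketch below checks whether two runs share a Scoring Config before allowing a comparison. It is plain Python with made-up data; the `Run` structure and its field names are assumptions for illustration, not Scorecard’s SDK or API.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    run_id: int
    scoring_config_id: str
    # metric name -> per-test-case scores (illustrative structure)
    scores: dict = field(default_factory=dict)

def comparable(base: Run, other: Run) -> bool:
    """Runs are eligible for A/B comparison only if they share a Scoring Config."""
    return base.scoring_config_id == other.scoring_config_id

base = Run(101, "cfg-rag-v1", {"faithfulness": [4, 5, 3], "relevance": [5, 4, 4]})
candidate = Run(117, "cfg-rag-v1", {"faithfulness": [5, 5, 4], "relevance": [4, 4, 5]})

assert comparable(base, candidate)  # same Scoring Config -> the runs can be compared
```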

To use this feature:

  1. Check the results of the run you would like to compare.
Overview of Results of a Run
  2. Click the “Add Comparison” button.
Overview of Results of a Run
  3. Select the Run ID of the run you want to compare with. If you do not know the Run ID, go to the “Runs & Results” tab in the Scorecard UI and retrieve the ID from the run overview table.
Select Comparison Run

Compare Metric Performance

After specifying a comparison run, the graphs are updated to show the aggregated results for both the base run and the compared test run side by side. Investigate which run, and therefore which LLM setup, provides better results for which metrics.

A/B Comparison: Compare Metric Performance
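
For intuition about what the metric comparison graphs aggregate, here is a minimal sketch of a side-by-side comparison: each run’s per-test-case scores are averaged per metric and the delta is reported. The score data is fabricated for illustration and does not come from Scorecard’s API.

```python
from statistics import mean

# Per-metric, per-test-case scores for two runs scored with the same metrics.
base_scores = {"faithfulness": [4, 5, 3, 4], "relevance": [5, 4, 4, 5]}
comparison_scores = {"faithfulness": [5, 5, 4, 4], "relevance": [4, 4, 5, 4]}

for metric in base_scores:
    base_avg = mean(base_scores[metric])
    comp_avg = mean(comparison_scores[metric])
    print(f"{metric:>12}: base={base_avg:.2f}  comparison={comp_avg:.2f}  "
          f"delta={comp_avg - base_avg:+.2f}")
```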

Compare Run Performance

The same side-by-side view is available for run performance: after specifying a comparison run, the graphs show the aggregated results for both the base run and the compared test run. Investigate which run, and therefore which LLM setup, performs better in terms of run performance, such as cost or latency.

A/B Comparison: Compare Run Performance
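
A similar aggregation can be sketched for run performance. The example below compares mean latency and total cost between two runs using made-up per-test-case records, again independent of Scorecard’s API.

```python
from statistics import mean

# Per-test-case latency and cost records for two runs (fabricated sample data).
base_records = [
    {"latency_ms": 820, "cost_usd": 0.004},
    {"latency_ms": 910, "cost_usd": 0.005},
]
comparison_records = [
    {"latency_ms": 640, "cost_usd": 0.006},
    {"latency_ms": 700, "cost_usd": 0.007},
]

def summarize(records):
    """Aggregate run performance: mean latency and total cost."""
    return mean(r["latency_ms"] for r in records), sum(r["cost_usd"] for r in records)

for name, records in [("base", base_records), ("comparison", comparison_records)]:
    latency, cost = summarize(records)
    print(f"{name:>10}: mean latency {latency:.0f} ms, total cost ${cost:.3f}")
```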

Benefits of A/B Comparisons

Using A/B comparisons helps you make data-driven decisions, optimize the performance of your LLM application, and continually improve its capabilities with confidence. Among other benefits, A/B comparisons give you the ability to:

  • Easily compare experiments and gain deeper insights into your results.
  • Evaluate how changes to production systems, models, and prompts result in different outputs and metrics.
  • Ensure that iterative improvements are truly effective.