Review Your Evaluation Runs and Analyze Results

You have defined individual Testcases and grouped them into a Testset. You have also set up several metrics that evaluate your LLM application along different dimensions, and either selected them manually for scoring or grouped them into a Scoring Config. Automated scoring has now run, so what’s next? Let’s review our runs and analyze the results!

Inspect Scoring Results in Scorecard

You can find an overview of all your past runs in the “Runs & Results” tab of the Scorecard UI. The following information is displayed for each run:

- Run ID
- Timestamp of Run Creation
- Run Status
  - Awaiting Scoring: The model has generated a response for each Testcase, but the responses have not yet been scored.
  - Awaiting Human Scoring: AI-powered scoring is complete, but subject-matter experts still need to score the manually scored metrics.
  - Completed: All Testcases have been scored for all metrics.
- Testset used
- Model parameter set

Runs & Results Overview in the Scorecard UI
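To make the three run statuses concrete, here is a minimal Python sketch of how a status could be derived from the scoring state of a run. It is not Scorecard's implementation or API: the `MetricScore` record, the `"ai"`/`"human"` labels, and the `run_status` function are hypothetical names chosen for illustration, and the rule that pending AI scores take precedence over pending human scores is an assumption based on the descriptions above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetricScore:
    """One metric score for one Testcase (hypothetical structure)."""
    metric: str
    scored_by: str                 # "ai" or "human" (assumed labels)
    value: Optional[float] = None  # None means "not scored yet"


def run_status(scores: List[MetricScore]) -> str:
    """Map the scoring state of a run to one of the three statuses."""
    ai_pending = any(s.value is None for s in scores if s.scored_by == "ai")
    human_pending = any(s.value is None for s in scores if s.scored_by == "human")
    if ai_pending:
        return "Awaiting Scoring"
    if human_pending:
        return "Awaiting Human Scoring"
    return "Completed"
```

For example, a run whose AI-scored metrics all have values but whose human-scored metrics are still empty would map to "Awaiting Human Scoring" under these assumptions.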

Inspect Metric Results

Clicking the “Results” button of a run displays the results of the scored metrics as individual metric visualizations. Alongside bar charts showing the distribution of the scores, out-of-the-box statistics such as the mean and median are calculated.

Visualized Metric Results in Run Details
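If you export scores and want to reproduce these summaries yourself, the calculation is straightforward. The following is a minimal sketch using only the Python standard library; the `summarize` function and the sample scores are hypothetical and not part of Scorecard.

```python
from collections import Counter
from statistics import mean, median
from typing import Dict, List


def summarize(scores: List[int]) -> Dict[str, object]:
    """Return the score distribution plus mean and median for one metric."""
    return {
        # Count of Testcases per score value, i.e. the bars of the bar chart
        "distribution": dict(sorted(Counter(scores).items())),
        "mean": mean(scores),
        "median": median(scores),
    }


# Hypothetical scores for a single metric on a 1-5 scale
example_scores = [1, 3, 4, 4, 5, 5, 5, 2]
summary = summarize(example_scores)
```

The `distribution` entry corresponds to the heights of the bars, while `mean` and `median` give a quick read on the central tendency of the metric.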

Filter Metric Scores

To examine Testcases that performed poorly or very well, click a bar in a bar chart. The Testcases are then filtered to those with the selected score.
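Conceptually, clicking a bar selects the Testcases whose score for that metric equals the bar's value. A minimal sketch of the same filter in Python, assuming each result is a plain dictionary with a `scores` mapping (a hypothetical structure, not Scorecard's data model):

```python
from typing import Any, Dict, List


def filter_by_score(results: List[Dict[str, Any]], metric: str, score: int) -> List[Dict[str, Any]]:
    """Keep only the results whose score for `metric` equals `score`."""
    return [r for r in results if r["scores"].get(metric) == score]


# Hypothetical results for three Testcases
results = [
    {"testcase_id": "tc-1", "scores": {"relevance": 1}},
    {"testcase_id": "tc-2", "scores": {"relevance": 5}},
    {"testcase_id": "tc-3", "scores": {"relevance": 1}},
]

# Equivalent to clicking the "1" bar of the relevance chart
poor = filter_by_score(results, "relevance", 1)
```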

Inspect Individual Testcase Results

The individual Testcase results show each input and output, the scores for each metric, and certain model debug information (e.g., latency and cost).

Individual Testcase Scoring Results in Run Details
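One way to picture an individual Testcase result is as a record holding the input, the output, the per-metric scores, and the debug information. The sketch below is illustrative only: the `CaseResult` class, its field names, and the units (milliseconds, US dollars) are assumptions, not Scorecard's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CaseResult:
    """One Testcase result as shown in the run details (hypothetical fields)."""
    testcase_id: str
    input: str
    output: str
    scores: Dict[str, float] = field(default_factory=dict)
    latency_ms: float = 0.0  # assumed unit
    cost_usd: float = 0.0    # assumed unit


def slowest(results: List[CaseResult]) -> CaseResult:
    """Return the result with the highest latency."""
    return max(results, key=lambda r: r.latency_ms)


def total_cost(results: List[CaseResult]) -> float:
    """Sum the cost over all Testcase results in a run."""
    return sum(r.cost_usd for r in results)
```

Helpers like `slowest` and `total_cost` show how the debug information can be combined with the scores, for example to check whether the slowest responses are also the lowest-scoring ones.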

Inspect Run Performance

In addition to the metrics, the “Run performance” tab visually displays the performance of each run.

View Run Details

Click the “Show Details” button to inspect the details of the selected run.