Review Your Evaluation Runs and Analyze Results
You have defined individual Testcases and grouped them into a Testset. You have also established several metrics that evaluate your LLM application along different dimensions, and either selected them manually for scoring or grouped them into a Scoring Config. After running the automated scoring, the question is: what’s next? Let’s review the runs and analyze the results!
Inspect Scoring Results in Scorecard
You can find an overview of all your past runs in the “Runs & Results” tab of the Scorecard UI. The following information is displayed for each run:
- Run ID
- Timestamp of Run Creation
- Run Status
  - Awaiting Scoring: The model has generated a response for each Testcase, but the responses have not yet been scored.
  - Awaiting Human Scoring: AI-powered scoring is complete, but subject-matter experts still need to score the manually scored metrics.
  - Completed: All Testcases have been scored for all metrics.
- Used Testset
- Model Parameter Set
Figure: Runs & Results Overview in the Scorecard UI
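The three run statuses described above can be read as a simple decision rule. The following is a minimal sketch of that rule, not Scorecard's actual implementation: the function name `run_status`, the data shape (one dict of metric name to score per Testcase, with `None` meaning "not yet scored"), and the `human_metrics` set are all hypothetical and chosen only for illustration.

```python
def run_status(testcase_scores, human_metrics):
    """Derive a run status from per-Testcase metric scores.

    testcase_scores: list of dicts mapping metric name -> score, or None
                     if that metric has not been scored yet (assumed shape).
    human_metrics:   set of metric names that are scored manually by
                     subject-matter experts (assumed input).
    """
    ai_pending = False
    human_pending = False
    for scores in testcase_scores:
        for metric, score in scores.items():
            if score is None:
                if metric in human_metrics:
                    human_pending = True
                else:
                    ai_pending = True

    # Assumption: automated scoring is checked first, human scoring second.
    if ai_pending:
        return "Awaiting Scoring"
    if human_pending:
        return "Awaiting Human Scoring"
    return "Completed"
```

For example, under these assumptions a run whose only unscored metric is a manually scored one would be reported as "Awaiting Human Scoring".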
Inspect Metric Results
Click the “Results” button of a run to display the results of the scored metrics as individual metric visualizations. In addition to bar charts showing the distribution of the scores, out-of-the-box statistics such as the mean and median are calculated.
Figure: Visualized Metric Results in Run Details
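To make the statistics concrete, here is a minimal sketch of how a score distribution, mean, and median could be computed for one metric. It uses only the Python standard library; the function name `summarize_scores` and the sample scores are made up for illustration and do not come from Scorecard.

```python
from collections import Counter
from statistics import mean, median


def summarize_scores(scores):
    """Summarize one metric's scores across all Testcases of a run."""
    distribution = Counter(scores)  # score value -> number of Testcases
    return {
        "mean": mean(scores),
        "median": median(scores),
        # Sorted by score so it reads like the bars of a bar chart.
        "distribution": dict(sorted(distribution.items())),
    }


# Hypothetical scores on a 1-5 scale, one per Testcase.
summary = summarize_scores([1, 3, 3, 4, 5, 5, 5])
print(summary)
```

Each key of `distribution` corresponds to one bar in the chart, and its value to the height of that bar.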
Filter Metric Scores
To examine Testcases that performed poorly or very well, click the corresponding bar in a bar chart; the Testcases are filtered automatically.
Figure: Filtered Results in Run Details
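Conceptually, this filter selects the Testcases whose score falls into the clicked bar. The sketch below illustrates the idea under the assumption that each bar represents a single score value; the record shape (`testcase_id`, `scores`) and the function `filter_by_score` are hypothetical, not part of any Scorecard API.

```python
def filter_by_score(results, metric, score):
    """Return the Testcase results whose score for `metric` equals `score`."""
    return [r for r in results if r["scores"].get(metric) == score]


# Hypothetical per-Testcase results for a single run.
results = [
    {"testcase_id": 1, "scores": {"relevance": 5}},
    {"testcase_id": 2, "scores": {"relevance": 1}},
    {"testcase_id": 3, "scores": {"relevance": 5}},
]

# Equivalent to clicking the bar for score 1 on the "relevance" chart.
low = filter_by_score(results, "relevance", 1)
print([r["testcase_id"] for r in low])
```

Starting from the lowest-scoring bar is usually the quickest way to find the Testcases that need attention.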
Inspect Individual Testcase Results
The individual Testcase results show each input and output, the scores for each metric, and certain model debug information (e.g., latency and cost).
Figure: Individual Testcase Scoring Results in Run Details
Inspect Run Performance
In addition to the metrics, the “Run performance” tab visually displays the performance of each run.
Figure: Run Performance Visualized in Scorecard
View Run Details
Click the “Show Details” button to inspect the details of the selected run.
Figure: Run Details