Create and Manage Testsets Easily

A Scorecard Testset is a collection of Testcases used to evaluate the performance of an LLM application across a variety of inputs and scenarios. A Testset usually belongs to a central theme, e.g. “Core Functionality”, “Edge Cases”, or “Adversarial Tests”. Testsets allow developer teams to systematically assess the functionality, accuracy, and reliability of their LLM applications before deployment. By grouping related Testcases into Testsets, Scorecard enables a structured approach to testing and improving LLM applications. But what are Testcases exactly?

A Scorecard Testcase is an individual input to an LLM that is used for scoring. It consists of:

  • User Query: The input sent to the LLM application. Inputs can range from simple text prompts to complex structured data.
  • Context (optional): A document or additional context (e.g. dialogue history) provided alongside the user query.
  • Ideal Response (optional): Also referred to as the ground-truth response, this defines the correct response the LLM application should produce for the given user query. Ideal responses can be specific text, structured data, or criteria that the output needs to meet.
  • Further Custom Fields (see the API reference for more information)

In addition to the above, you can add further fields to each individual Testcase by defining a custom_schema object (see the corresponding API call).
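To make this concrete, here is what a single Testcase might look like as a payload. The field names below (user_query, context, ideal, and the custom fields) are illustrative assumptions; the API reference defines the exact schema, and custom fields must first be declared in the Testset’s custom_schema.

    # Illustrative Testcase payload. Field names are assumptions; see the API
    # reference for the exact schema your Testset expects.
    testcase = {
        "user_query": "What is your return policy for damaged items?",
        "context": "Order #1042 arrived with a cracked mug.",  # optional
        "ideal": "Apologize, confirm the order, and offer a replacement or refund.",  # optional
        # Hypothetical custom fields, declared beforehand via custom_schema:
        "custom_inputs": {
            "persona": "frustrated_customer",
            "product_category": "kitchenware",
        },
    }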

Easy Testset Creation and Modification via Scorecard

Follow our step-by-step guide to creating and modifying a Testset! Head over to the Scorecard Guides and find the guides for creating a Testset via the Scorecard UI or via the Scorecard SDK; a rough SDK sketch also follows below.

Testsets View in the Scorecard UI
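If you want to create Testsets programmatically, the flow with the Python SDK looks roughly like the sketch below. The import path, method names (testset.create, testcase.create), and parameters are assumptions for illustration only; the SDK guide linked above documents the exact calls.

    # Rough sketch of creating a Testset and adding a Testcase with the Python SDK.
    # Import path, method names, and parameters are assumptions; consult the
    # Scorecard SDK guide for the exact API.
    import os

    from scorecard.client import Scorecard  # assumed import path

    client = Scorecard(api_key=os.environ["SCORECARD_API_KEY"])

    # Create a Testset around a central theme.
    testset = client.testset.create(
        name="Core Functionality",
        description="Happy-path queries the support bot must handle well.",
    )

    # Add a Testcase: a user query plus an ideal (ground-truth) response.
    client.testcase.create(
        testset_id=testset.id,
        user_query="What is your return policy for damaged items?",
        ideal="Apologize, confirm the order, and offer a replacement or refund.",
    )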

The Scorecard team consists of experts in LLM evaluation who have experience evaluating and deploying large-scale AI applications at some of the world’s leading companies. Based on this experience, the Scorecard team recommends the following best practices for your Testsets:

  • Regularly Review Testcases: As your LLM application evolves, so too should your Testcases. Regularly review your Testcases and update them to ensure they remain relevant and cover new functionality. Currently, modifying an existing Testcase is only possible via the Scorecard UI: click on the Testcase and either upload new text or edit the existing text.
  • Use Diverse Inputs: Ensure your Testcases cover a wide range of inputs, including edge cases, to thoroughly test your LLM application’s capabilities.
  • Collaborate and Share: Encourage collaboration among team members when creating and reviewing Testsets and results to ensure coverage and common agreement on where it’s important to improve your LLM app’s performance.

From Small to Big: Iterate Your Testsets

With Scorecard, you can operationalize the testing process so that you don’t have to remember your favorite queries or copy and paste responses into a spreadsheet. Depending on the maturity of your LLM application (whether it is still a demo or already deployed in production) and on the urgency of the evaluation, you will want to use different types of Testsets. Here is an overview:

Hillclimbing Testsets

  • Description: We recommend this as the first Testset you create, seeded with your favorite prompts.
  • Purpose: Hillclimbing Testsets are powerful for making incremental improvements. This type of Testset focuses on targeted examples of prompts seen in production and can be used to iteratively improve your LLM app.
  • Size: Small set (5-20 Testcases)

Regression Testsets

  • Description: Regression Testsets capture Testcases with known-good performance and are run regularly (typically nightly).
  • Purpose: Regression Testsets are crucial to ensure new changes/features to your LLM App do not degrade existing functionalities. This type of Testset includes prompts or queries that are representative of your LLM application’s use case.
  • Size: Moderately sized (50-100 Testcases)

Launch Evaluation Testsets

  • Description: Utilized for comprehensive evaluation before a significant launch or update.
  • Purpose: Launch Evaluation Testsets are valuable for ensuring broad coverage and confidence in LLM performance.
  • Size: Large set (100+ Testcases)

Must-Pass Testsets

  • Description: Testsets that are intentionally focused only on high-precision Testcases.
  • Purpose: Must-Pass Testsets are powerful for catching major bugs or regressions. Each deployment or PR should pass this set as an early check before running more comprehensive or specialized Testsets (see the sketch below).
  • Size: N/A
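
To make the “early check” concrete, the sketch below shows one way to gate a deployment or PR on a Must-Pass Testset: run every Testcase through your application and exit non-zero if any response misses its ideal answer, so CI blocks the change. fetch_testcases and answer_query are hypothetical placeholders; in practice you would pull Testcases via the Scorecard SDK or API and rely on Scorecard’s scoring rather than a plain string check. The same loop, scheduled nightly instead of per PR, also works as a driver for Regression Testsets.

    # Hypothetical CI gate for a Must-Pass Testset. fetch_testcases() and
    # answer_query() are placeholders for the Scorecard SDK/API and your own
    # LLM application; the substring check is a deliberately simple stand-in
    # for real scoring.
    import sys

    def fetch_testcases(testset_id: int) -> list[dict]:
        """Placeholder: pull the Must-Pass Testset's Testcases from Scorecard."""
        raise NotImplementedError

    def answer_query(user_query: str, context: str | None = None) -> str:
        """Placeholder: your LLM application's entry point."""
        raise NotImplementedError

    def must_pass_gate(testset_id: int) -> int:
        """Return the number of failing Testcases."""
        failures = 0
        for tc in fetch_testcases(testset_id):
            response = answer_query(tc["user_query"], tc.get("context"))
            if tc["ideal"].strip().lower() not in response.strip().lower():
                failures += 1
                print(f"FAIL testcase {tc['id']}: expected {tc['ideal']!r}")
        return failures

    if __name__ == "__main__":
        # Non-zero exit code blocks the deployment/PR when any Testcase fails.
        sys.exit(1 if must_pass_gate(testset_id=123) else 0)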