Scorecard helps you evaluate the performance of your LLM app so you can ship faster with more confidence! In this quickstart we will:

  • Set up Scorecard
  • Create a Testset
  • Create an example LLM app with OpenAI
  • Define the evaluation setup
  • Score the LLM app with the Testset
  • Review evaluation results in the Scorecard UI

Steps

1. Setup

First, let’s create a Scorecard account and find the SCORECARD_API_KEY in the settings. Since this example creates a simple LLM application using OpenAI, you will also need an OpenAI API key. Set both API keys as environment variables as shown below, and install the Scorecard and OpenAI Python libraries:

Python Setup
export SCORECARD_API_KEY="SCORECARD_API_KEY"
export OPENAI_API_KEY="OPENAI_API_KEY"
pip install scorecard-ai
pip install openai
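
Before moving on, it can help to confirm that both keys are actually visible to Python. Here is a minimal check using only the standard library (the variable names match the ones exported above):

Check API keys
import os

# Fail fast if either key is missing from the environment.
for name in ("SCORECARD_API_KEY", "OPENAI_API_KEY"):
    assert os.environ.get(name), f"{name} is not set"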
2. Create a Testset and Add Testcases

A Testset is a collection of Testcases used to evaluate the performance of an LLM application across a variety of inputs and scenarios. A Testcase is a single input to an LLM that is used for scoring. Now let’s create a Testset and add some Testcases using the Scorecard Python SDK:

Create a Testset and Add Testcases
import os

from scorecard.client import Scorecard

client = Scorecard(
    api_key=os.environ["SCORECARD_API_KEY"]
)

# Create a Testset
demo_testset = client.testset.create(
    name="Demo Testset",
    description="Demo Testset created via Scorecard Python SDK",
)

# Retrieve the ID of the demo_testset
demo_testset_id = demo_testset.id

# Add three Testcases
client.testcase.create(
    testset_id=demo_testset_id,
    user_query="The amount of access cabinet secretaries have to the president is most likely to be controlled by the",
)
client.testcase.create(
    testset_id=demo_testset_id,
    user_query="The exclusionary rule was established to",
)
client.testcase.create(
    testset_id=demo_testset_id,
    user_query="Ruled unconstitutional in 1983, the legislative veto had allowed",
)

print("Visit the Scorecard UI to view your Testset:")
print(f"https://app.getscorecard.ai/view-dataset/{demo_testset_id}")
3. Create a Simple LLM App

Next, let’s create a simple LLM application that we will evaluate using Scorecard. The application is represented by the following function, which sends a request with a user-defined input to the OpenAI API. Here we’ve created a system that acts as a helpful assistant in response to our input user query.

Mock system
import os

from openai import OpenAI

def answer_query(user_topic: str) -> str:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_topic},
        ],
    )
    return response.choices[0].message.content
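
As a quick sanity check, you can call the function directly with one of the Testcase queries from step 2 and print the response:

Try the mock system
# Quick manual check with one of the Testcase queries from step 2.
print(answer_query("The exclusionary rule was established to"))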
4. Create Metrics

Now that we have a system that answers questions from the MMLU dataset, let’s build a metric to understand how relevant the system responses are to our user query. Let’s go to the Scoring Lab and select “New Metric”.

[Screenshot: Scoring Lab: New Metric]

From here, let’s create a metric for answer relevancy:

[Screenshot: Defining the Answer Relevancy Metric]

You can evaluate your LLM system with one or multiple metrics. A good practice is to routinely test the LLM system with the same metrics for a specific use case. For this, Scorecard lets you define Scoring Configs: collections of metrics used to consistently evaluate an LLM use case. For this quickstart, we will create a Scoring Config that includes just the previously created Answer Relevancy metric. Let’s head over to the “Scoring Config” tab in the Scoring Lab, create the Scoring Config, and grab its ID for later:

[Screenshot: Creating a Scoring Config Including the Answer Relevancy Metric]
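
Keep the Testset ID from step 2 and this Scoring Config ID handy; the test run in the next step takes them as input_testset_id and scoring_config_id. For example, you might stash them as plain constants (the values below are illustrative placeholders only):

IDs for the next step
# Illustrative placeholders; replace them with the IDs from your own account.
DEMO_TESTSET_ID = 123      # Testset created in step 2
SCORING_CONFIG_ID = 456    # Scoring Config created above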
5. Create Test System

Now let’s run our Testset against the mock system, replacing the Testset ID and Scoring Config ID below with the ones from the previous steps:

Run_tests.py
import os

from openai import OpenAI
from scorecard.client import Scorecard

# Mock system to test
def answer_query(user_topic: str) -> str:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_topic},
        ],
    )
    return response.choices[0].message.content

client = Scorecard(
    api_key=os.environ["SCORECARD_API_KEY"]
)

# Run the Testset against the system and score it with the Scoring Config
run = client.run_tests(
    input_testset_id=123,   # Replace with your Testset ID
    scoring_config_id=456,  # Replace with your Scoring Config ID
    model_invocation=lambda prompt: answer_query(prompt),  # Replace with your system
)

print("Visit the Scorecard UI to view your Run:")
print(f"https://app.getscorecard.ai/view-grades/{run.id}")
6. Run Scoring

Now let’s review the outputs of our execution in the Scorecard UI and run scoring by clicking the “Run Scoring” button.

[Screenshot: Run Scoring]
7. View Results

Finally, let’s review the results in the Scorecard UI. Here you can view and understand the performance of your LLM system:

[Screenshot: Viewing Results]