The LLM Developer’s Journey
Let’s walk through the journey of an LLM Developer, from start to finish. Buckle up! 🤠
Step 1: Build
Let’s pretend you’re a developer at DoctorGPT, an application designed to replace a primary care physician. For the MVP, you’ll send your users’ (patients’) queries to OpenAI’s GPT-3.5 API.
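As a rough sketch of that MVP (assuming the official `openai` Python client; the `answer_patient_query` helper and system prompt are illustrative, not part of the Scorecard docs):

```python
# Minimal MVP sketch: forward a patient query to OpenAI's GPT-3.5 API.
# Assumes the official `openai` Python client (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def answer_patient_query(query: str) -> str:
    """Send a single patient query to GPT-3.5 and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are DoctorGPT, a careful primary care assistant."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

print(answer_patient_query("I've had a mild headache for three days. What should I do?"))
```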
Step 2: Test
Now that your application runs, you’ll want to see whether it works. The first way you might do this is simply by playing around with the application: run a few prompts and see whether you get good responses. We call this process a “vibe check” or “Eyeball Eval”.
If you’ve been deploying an LLM application, you may already be doing Eyeball Eval: making a change to your LLM application and then manually testing a few of your favorite prompts or queries.
If you’ve been working on LLMs for a while, you might have already moved on to a more advanced version of Eyeball Eval: Spreadsheet Eval. To standardize the set of queries sent to the application, teams often keep a spreadsheet of their favorite queries and run them against the LLM application after each change, as sketched below.
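As an illustration (plain Python, not a Scorecard feature), a Spreadsheet Eval loop reads your standard queries from a CSV, runs each one through the application, and writes the responses back out for manual review. The file names and column headers below are hypothetical.

```python
# Illustrative Spreadsheet Eval: run a fixed set of queries and dump responses for manual review.
# Reuses the answer_patient_query helper from the Build sketch; file and column names are hypothetical.
import csv

with open("favorite_queries.csv", newline="") as infile:
    queries = [row["query"] for row in csv.DictReader(infile)]

with open("eyeball_eval_results.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["query", "response"])
    writer.writeheader()
    for query in queries:
        writer.writerow({"query": query, "response": answer_patient_query(query)})
```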
Step 3: Iterate
Now we’re ready to get serious. We’ll put our cowboy hats and boots on and invest in automated, reliable evaluation. Instead of manual reviews and spreadsheets, we’ll use the Scorecard platform to run our tests at the click of a button. This saves precious developer hours and lets the team ship faster. You’ll also avoid major regressions and sleep better at night (guaranteed!).
Scorecard
With Scorecard, you can operationalize this process so that you don’t have to remember your favorite queries or copy and paste responses into a spreadsheet.
Building a Sample Testset
Rather than boiling the ocean and creating a full testset from the get-go, let’s start with the simplest approach and work our way up.
To build a Sample Testset, take 5-10 of your favorite prompts or queries that you generally use for testing. A step-by-step guide can be found here.
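For DoctorGPT, a Sample Testset might start as nothing more than a short list of representative queries (the examples below are illustrative):

```python
# Illustrative Sample Testset: a handful of queries the team already uses for quick checks.
sample_testset = [
    "I've had a sore throat and a low fever for two days. Should I see a doctor?",
    "Is ibuprofen safe to take with my blood pressure medication?",
    "What are common side effects of the flu shot?",
    "My child has a rash after starting a new antibiotic. What should I do?",
    "How much water should an adult drink per day?",
]
```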
Building a Golden Testset
Though Eyeball Eval works for getting a general sense of your product quality, it doesn’t provide enough signal to be confident in a change. In order to work your way up from Eyeball Eval to a full-fledged automated evaluation, we’ll start with a Golden Testset.
To build a Golden Testset, identify roughly 100 high-quality prompts or queries that are representative of your LLM application’s use case. You can then pair these prompts with responses you know to be correct, giving you a small set of labeled test data.
From this set, you can validate Custom Metrics and eventually expand into a Large-Scale Testset. Using a Large-Scale Testset and Metrics, Scorecard can automatically evaluate the quality of your LLM application’s responses and give you actionable feedback.
Importing a Golden Testset
To import a Golden Testset, go to the Testset Builder and click on the “Import” button. From there, you can upload a CSV file containing your test data. A step-by-step guide can be found here.
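As a rough illustration, the CSV pairs each query with the response you consider correct. The column names below are hypothetical; match them to the schema shown in the Testset Builder.

```csv
query,ideal_response
"What is a normal resting heart rate for adults?","A typical resting heart rate for adults is roughly 60-100 beats per minute."
"Can I take ibuprofen if I'm allergic to aspirin?","People with an aspirin allergy may also react to ibuprofen, so check with a clinician or pharmacist first."
```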
Alternatively, you can use the Scorecard SDK.
Executing your Testset
Once you have your Golden Testset, you can start executing it in the playground. This is a fast way to iterate on prompts or try out different models on Testsets or arbitrary inputs. You can find more details in the playground guide.
Another way of kicking off execution is via your own LLM application in production. This will allow you to collect data on the quality of your application’s responses. You can do this via the Scorecard SDK.
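The exact calls depend on the Scorecard SDK, but conceptually your production code captures each query/response pair and forwards it to Scorecard for scoring. The sketch below only shows the application-side capture; `log_to_scorecard` is a hypothetical placeholder for the real SDK call described in the SDK docs, and `answer_patient_query` is the helper from the Build sketch.

```python
# Conceptual sketch of capturing production traffic for later evaluation.
import time

def log_to_scorecard(record: dict) -> None:
    """Hypothetical placeholder: swap in the real Scorecard SDK call from the SDK docs."""
    print("would log to Scorecard:", record)

def handle_patient_query(query: str) -> str:
    start = time.time()
    response = answer_patient_query(query)  # the MVP helper from the Build sketch
    log_to_scorecard({
        "query": query,
        "response": response,
        "model": "gpt-3.5-turbo",
        "latency_seconds": time.time() - start,
    })
    return response
```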
Scoring your Testset with Metrics
After executing your Testset, you can score it using Metrics. You can use default Metrics vetted by Scorecard, or create Custom Metrics: performance measures you define based on your application’s specific use case. Scoring your Testset with Metrics gives you a better understanding of the quality of your application’s responses without needing ground-truth labels.
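As a toy example of a Custom Metric (written here in plain Python rather than inside Scorecard), DoctorGPT might check that every response points the patient to a real clinician before acting on the advice:

```python
# Toy Custom Metric: does the response refer the patient to a real clinician?
# An illustrative heuristic, not a built-in Scorecard metric.
def includes_clinician_referral(response: str) -> float:
    """Return 1.0 if the response mentions seeing a doctor/clinician, else 0.0."""
    keywords = ("doctor", "physician", "clinician", "urgent care", "emergency")
    return 1.0 if any(word in response.lower() for word in keywords) else 0.0

score = includes_clinician_referral("This sounds mild, but see a doctor if the pain worsens.")  # 1.0
```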
Meta-Evaluation
Finally, you can use meta-evaluation to analyze the performance of your Testset and Metrics. Meta-evaluation lets you identify areas where your Testset or Metrics may be biased or incomplete and improve the quality of your evaluation process.
Scorecard offers two ways to do meta-evaluation: self-labeled and specialist-labeled. If you would like to have your results labeled by a Scorecard Specialist, please email us.
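Conceptually, meta-evaluation boils down to comparing your automated Metric scores against trusted human labels (your own or a Specialist’s) and measuring how often they agree. A bare-bones sketch, with made-up scores:

```python
# Bare-bones meta-evaluation sketch: how often does the automated metric agree with human labels?
# The scores below are made up for illustration.
metric_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]  # automated metric, one score per test case
human_labels = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]   # self- or specialist-labeled ground truth

agreement = sum(m == h for m, h in zip(metric_scores, human_labels)) / len(human_labels)
print(f"Metric/human agreement: {agreement:.0%}")  # low agreement points to a biased or incomplete metric
```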