The LLM Developer’s Journey

Let’s walk through the journey of an LLM Developer, from start to finish. Buckle up! 🤠

Step 1: Build

Let’s pretend you’re a developer working on DoctorGPT, an application designed to replace a primary care physician. For the MVP, we’ll send our users’ (patients’) queries to OpenAI’s GPT-3.5 API.
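
A minimal sketch of that MVP, assuming the current OpenAI Python client and an illustrative system prompt (DoctorGPT and the `ask_doctorgpt` helper are our running fictional example), might look like this:

```python
# Minimal sketch of the DoctorGPT MVP: forward a patient query to GPT-3.5.
# The system prompt and helper name are illustrative, not a real product's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_doctorgpt(patient_query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are DoctorGPT, a careful primary-care assistant."},
            {"role": "user", "content": patient_query},
        ],
    )
    return response.choices[0].message.content

print(ask_doctorgpt("I've had a mild headache for two days. What should I do?"))
```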

General DoctorGPT Workflow

Step 2: Test

Now that your application runs, we’ll want to see whether it works. The first way you might do this is simply by playing around with the application — run a few prompts, see if you get a good response. We call this process a “vibe check” or “eyeball eval”.

Eyeball Eval Workflow

If you’ve been deploying an LLM application, you may already be doing Eyeball Eval. Eyeball Eval is the process of making a change to your LLM application and then manually testing out a few of your favorite prompts or queries.

If you’ve been working on LLMs for a while, you might have already moved on to a more advanced version of Eyeball Eval: Spreadsheet Eval. To start standardizing the set of queries sent to the application, teams often create a spreadsheet of their favorite queries to run against the LLM application, as sketched below.
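
In code, Spreadsheet Eval is often nothing more than a loop over that spreadsheet. A minimal sketch, assuming a queries.csv with a `query` column and the `ask_doctorgpt` helper from the MVP sketch above:

```python
# Sketch of Spreadsheet Eval: run every saved query through the app and
# write the responses back out for manual review. File and column names are assumptions.
import csv

with open("queries.csv", newline="") as f:
    queries = [row["query"] for row in csv.DictReader(f)]

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["query", "response"])
    for query in queries:
        writer.writerow([query, ask_doctorgpt(query)])
```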

Spreadsheet Eval Workflow

Step 3: Iterate

Now we’re ready to get serious. We’ll get our cowboy hat and boots on and put in the investment to build automated, reliable evaluation. Instead of doing manual reviews and spreadsheets, we’ll use the Scorecard platform to run our tests at the click of a button. This saves the team precious developer hours and helps you ship faster. You’ll also avoid major regressions and sleep better at night (guaranteed!).

Stages in the Evaluation Journey

Scorecard

With Scorecard, we can start to operationalize this process, so that you don’t have to remember your favorite queries or copy-and-paste responses into a spreadsheet.

Building a Sample Testset

Rather than boiling the ocean and creating a full testset from the get-go, let’s start with the simplest approach and work our way up.

To build a Sample Testset, take 5-10 of your favorite prompts or queries that you generally use for testing. A step-by-step guide can be found here.
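
For DoctorGPT, a Sample Testset could be as small as a handful of representative patient queries. The specific queries below are purely illustrative:

```python
# A tiny Sample Testset: 5 representative queries we always want to spot-check.
sample_testset = [
    "I've had a mild headache for two days. What should I do?",
    "Is it safe to take ibuprofen and acetaminophen together?",
    "My child has a fever of 101°F. When should I see a doctor?",
    "What are common symptoms of seasonal allergies?",
    "I feel dizzy when I stand up quickly. Should I be worried?",
]
```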

Building a Golden Testset

Though Eyeball Eval works for getting a general sense of your product’s quality, it doesn’t provide enough signal to be confident in a change. To work your way up from Eyeball Eval to full-fledged automated evaluation, we’ll start with a Golden Testset.

To build a Golden Testset, we need to identify a set of ~100 high-quality prompts or queries that are representative of your LLM application’s use case. We can then use these prompts to build a small set of test data for which we know the correct responses.

From this set, we can validate custom metrics, and eventually expand into a Large-Scale Testset. Using a Large-Scale Testset and Metrics, Scorecard can automatically evaluate the quality of your LLM application’s responses and provide you with actionable feedback.

Importing a Golden Testset

To import a Golden Testset, go to the Testset Builder and click on the “Import” button. From there, you can upload a CSV file containing your test data. A step-by-step guide can be found here.
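
As a rough illustration of the data format (the column names “query” and “ideal_response” are assumptions; match whatever schema the Testset Builder expects), you might prepare the CSV like this:

```python
# Sketch: write a Golden Testset to CSV for import. Column names are assumptions --
# use whatever schema the Testset Builder expects.
import csv

golden_testset = [
    {
        "query": "Is it safe to take ibuprofen and acetaminophen together?",
        "ideal_response": "Generally yes at recommended doses, but confirm with a pharmacist or doctor.",
    },
    {
        "query": "My child has a fever of 101°F. When should I see a doctor?",
        "ideal_response": "Monitor at home; seek care if the fever persists for several days or climbs above 104°F.",
    },
]

with open("golden_testset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "ideal_response"])
    writer.writeheader()
    writer.writerows(golden_testset)
```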

Alternatively, you can use the Scorecard SDK.
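
The snippet below is only a rough illustration of that flow; the package, client, and method names are assumptions rather than the SDK’s documented API, so check the SDK reference for the real calls:

```python
# Hypothetical sketch of importing a Golden Testset via the SDK.
# The import path, client, and method names below are assumed, not documented API.
from scorecard_ai import Scorecard  # assumed package and client name

client = Scorecard()  # assumed to pick up an API key from the environment
testset = client.testsets.create(name="DoctorGPT Golden Testset")  # assumed method
for case in golden_testset:  # the list from the CSV sketch above
    client.testcases.create(  # assumed method
        testset_id=testset.id,
        user_query=case["query"],
        ideal=case["ideal_response"],
    )
```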

Executing your Testset

Once you have your Golden Testset, you can start executing it in the playground. This is a fast way to iterate on prompts or try out different models on Testsets or arbitrary inputs. You can find more details in the playground guide.

Another way of kicking off execution is from your own LLM application in production, which lets you collect data on the quality of your application’s responses. You can do this via the Scorecard SDK.
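
One common pattern, sketched below, is to wrap your production handler so every live query/response pair is captured for later scoring. Here `record_to_scorecard` is a hypothetical placeholder for whatever run-logging call the SDK actually provides:

```python
# Sketch: capture production traffic for later evaluation.
# `record_to_scorecard` is a hypothetical placeholder -- swap in the Scorecard
# SDK's actual logging/run call here.
def record_to_scorecard(user_query: str, response: str) -> None:
    print(f"[scorecard] query={user_query!r} response={response!r}")

def handle_patient_query(patient_query: str) -> str:
    response = ask_doctorgpt(patient_query)       # from the MVP sketch above
    record_to_scorecard(patient_query, response)  # capture for evaluation
    return response
```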

Scoring your Testset with Metrics

After executing your Testset, you can score it using Metrics. You can use the default Metrics that Scorecard vets, or you can create Custom Metrics: performance measures that you define based on your application’s specific use case. By scoring your Testset with Metrics, you can get a better understanding of the quality of your application’s responses without needing ground truth labels.
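
For example, a Custom Metric for DoctorGPT might check whether a response defers to a real clinician. The heuristic below is just an illustration of the idea, not how Scorecard implements Metrics:

```python
# Illustrative custom metric: does the response recommend consulting a real clinician?
# A production metric would likely be more nuanced (e.g., an LLM-as-judge prompt).
def recommends_clinician(response: str) -> bool:
    keywords = ("see a doctor", "consult a physician", "seek medical attention",
                "talk to your doctor", "healthcare provider")
    return any(k in response.lower() for k in keywords)

score = recommends_clinician(ask_doctorgpt("I have chest pain when I exercise."))
print(f"recommends_clinician: {score}")
```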

Meta-Evaluation

Finally, you can use meta-evaluation to analyze the performance of your Testset and Metrics. Meta-evaluation allows you to identify areas where your Testset or Metrics may be biased or incomplete and to improve the quality of your evaluation process.

Scorecard offers two ways to do meta-evaluation: self-labeled and specialist-labeled. If you would like to have your results labeled by a Scorecard Specialist, please email us.
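
At its simplest, meta-evaluation means comparing your Metric’s verdicts against human labels on the same responses. A minimal sketch with made-up labels (in practice they would come from you or a Scorecard Specialist):

```python
# Minimal meta-evaluation: agreement between a metric and human labels.
# The labels here are invented purely for illustration.
metric_verdicts = [True, True, False, True, False]   # what the metric said
human_labels    = [True, False, False, True, True]   # what a human reviewer said

agreement = sum(m == h for m, h in zip(metric_verdicts, human_labels)) / len(human_labels)
print(f"Metric/human agreement: {agreement:.0%}")  # 60% -> the metric needs refinement
```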