Evals, or evaluations, are a set of tests used to measure a model's performance on a specific task.
Why?
Any product using LLMs (Large Language Models) in production environments should have evals. And not only that, evaluations should be built even before writing the code. For those coming from a software development background, it would be like TDD (Test-Driven Development), but with evaluations instead of tests guiding product development.
I understand that for a developer, starting by creating evaluations instead of coding directly might feel a bit discouraging, but the pros far outweigh the cons.
Once we have a set of evaluations that we can trust and run relatively quickly, development will be much more enjoyable and less stressful since every change can be compared to previous versions. This reduces uncertainty, a common characteristic in the world of AI, especially with LLMs, where one day we might be using one model and the next day switching to another.
Look at the data, a lot!
Creating evals should be our priority, especially when starting a project, but at the same time, we must closely monitor the data generated from real-world use in production. Only then can we iteratively improve the cases we evaluate and, with them, the system. Evaluations are not static: user behavior changes, and often so do the models we use. Sticking with the same set of test cases defined early in the project and still relying on them much later can become a problem.
What should we aim for when creating evals?
- Coverage: Cover a large percentage of possible cases.
- Confidence: If the evals score above a predefined threshold (that we’ve defined as “good enough”), we should feel confident releasing the solution into production.
- Speed: We need to run the evaluations quickly. Slow evaluations hinder development and discourage frequent changes.
- Ease of expansion: Anyone wanting to add cases shouldn’t face too many obstacles.
- Ease of interpretation: A user interface that allows for easy interpretation of evaluation results and comparison with previous runs.
- Ease of execution: We should be able to run the evals ad hoc, whenever we need to (a minimal sketch of such a runner follows this list).
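To make the last two points concrete, here is a minimal sketch of an eval runner that can be executed ad hoc and compared against an agreed threshold. The pipeline stub, the cases, and the 0.9 threshold are illustrative assumptions, not a prescribed setup.

```python
# A minimal sketch of an ad-hoc eval runner. run_pipeline(), the cases, and
# the 0.9 threshold are illustrative assumptions.

CASES = [
    {"input": "Where is my order?", "expected": "order_status"},
    {"input": "I want my money back", "expected": "refund_request"},
]

THRESHOLD = 0.9  # the "good enough" bar agreed on before releasing


def run_pipeline(text: str) -> str:
    """Stand-in for the real pipeline (in practice this would call the LLM)."""
    return "refund_request" if "money back" in text.lower() else "order_status"


def run_evals() -> float:
    correct = sum(run_pipeline(c["input"]) == c["expected"] for c in CASES)
    score = correct / len(CASES)
    print(f"accuracy: {score:.2%} ({correct}/{len(CASES)})")
    return score


if __name__ == "__main__":
    assert run_evals() >= THRESHOLD, "below the agreed threshold, do not ship"
```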
Types of Evals by Task
The evaluations we create depend heavily on the type of task the LLM will perform.
End-to-End
For classification tasks, where the LLM must predict which category (from a fixed set) a message belongs to, we can use the classic metrics: precision, recall, and F1-score.
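For example, once we have collected the expected and predicted categories for each eval case, computing these metrics is straightforward. A minimal sketch with scikit-learn, using made-up labels for illustration:

```python
# A sketch of classic classification metrics using scikit-learn. The label
# lists are made up; in practice they come from running the pipeline over
# the eval set and collecting expected vs. predicted categories.
from sklearn.metrics import classification_report

expected = ["billing", "shipping", "billing", "other", "shipping"]
predicted = ["billing", "shipping", "other", "other", "shipping"]

# Per-class precision, recall and F1, plus macro and weighted averages.
print(classification_report(expected, predicted, zero_division=0))
```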
On the other hand, if what we want is to generate responses to user questions based on documents, the evaluations will focus on how similar the LLM-generated answers are to a “canonical” answer.
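One common way to quantify that similarity (not the only one; using an LLM as a judge is another) is cosine similarity between embeddings. A minimal sketch, assuming sentence-transformers and an illustrative model name:

```python
# A sketch of scoring a generated answer against a canonical answer with
# embedding cosine similarity. The model name and example texts are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

canonical = "You can return a product within 30 days of delivery."
generated = "Returns are accepted up to 30 days after the item is delivered."

embeddings = model.encode([canonical, generated], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {similarity:.2f}")  # closer to 1.0 is better
```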
The evaluations I described above are more “end-to-end,” where we run the entire pipeline and evaluate the final output against the correct output we define.
Somewhere in the Pipeline
It’s also advisable to evaluate components that are part of the pipeline. In the case of LLMs, it’s very likely that we’ll use vector search to retrieve elements with the highest semantic similarity to the input.
For classification, if there are, for example, 50 possible categories, we could use vector search to narrow it down to the 10 categories that are most semantically similar to the input. The LLM would then make its decision based on these 10 categories. How can we know if these 10 categories are the most relevant? The answer won’t surprise you—with evaluations, of course!
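A minimal sketch of such a retrieval eval, assuming each case records the correct category and the candidates returned by vector search (field names and data are illustrative):

```python
# A sketch of evaluating the retrieval step in isolation, assuming each eval
# case stores the correct category and the candidate categories returned by
# vector search.

cases = [
    {"expected": "refunds", "retrieved": ["refunds", "billing", "shipping"]},
    {"expected": "warranty", "retrieved": ["billing", "returns", "shipping"]},
]

# Hit rate: how often the correct category is among the candidates
# handed to the LLM.
hits = sum(case["expected"] in case["retrieved"] for case in cases)
print(f"hit rate: {hits / len(cases):.2%}")
```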