On evaluating agents

“Models constantly change and improve but evals persist”

Look at the data

No amount of evals will replace the need to look at the data. Once you have good eval coverage you can spend less time on it, but it will always be a must to look at the agent traces to identify possible issues or things to improve.

Starting out: end-to-end evals

You must create evals for your agents; stop relying solely on manual testing.

Not sure where to start?

Add e2e evals: define a success criterion (did the agent meet the user’s goal?) and make the evals output a simple yes/no value.

This is much better than no evals.
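Something like this is enough to get going. A minimal sketch in Python, assuming an OpenAI-style chat API for the judge; `run_agent`, the model name and the test cases are placeholders for your own setup:

```python
# Minimal yes/no end-to-end eval harness (sketch).
# `run_agent`, the cases and the judge model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

CASES = [
    {"input": "Where can I pick up my order?", "goal": "tells the user where to pick up the order"},
    {"input": "Cancel my subscription", "goal": "confirms the subscription was cancelled"},
]

def run_agent(user_message: str) -> str:
    """Run your agent end to end and return its final reply."""
    raise NotImplementedError  # wire this to your agent

def judge(reply: str, goal: str) -> bool:
    """LLM-as-judge: did the agent meet the user's goal? Answer yes or no."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model works
        messages=[
            {"role": "system", "content": "Answer only 'yes' or 'no'."},
            {"role": "user", "content": f"Goal: {goal}\nAgent reply: {reply}\nDid the reply meet the goal?"},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def run_evals() -> None:
    passed = 0
    for case in CASES:
        ok = judge(run_agent(case["input"]), case["goal"])
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['input']}")
    print(f"{passed}/{len(CASES)} passed")

if __name__ == "__main__":
    run_evals()
```

A pass rate over even a handful of cases is already enough to catch regressions when you change prompts or models.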

By performing simple end-to-end agent evaluations you can quickly:
– identify problematic edge cases
– update, trim and refine the agent prompts
– make sure you are not breaking the cases that already work
– compare the performance of the current LLM against cheaper ones

N – 1 evals

Once you have created the e2e evals you can move on to “N – 1” evals, that is, evals that “simulate” the previous interactions between the system and the user.

Suppose that, either by looking at the data or by running a set of e2e evals, you find there is a problem when the user asks for the brand’s open stores in their area. It’d be better to create an eval that targets this directly: if you keep testing it through the e2e evals you won’t always be able to reproduce the error, and your evals will take too much time and cost too much money.

It’d be much better to “simulate” the previous interactions and then get to the point.

There’s one issue with this: you’ll have to be careful to keep the “N – 1” interactions updated whenever you make changes, because you will be “simulating” something that may never happen again in your agent.
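In practice this means hard-coding the previous turns as fixtures and letting the agent generate only the last one. A rough sketch, where `run_agent_turn` is a hypothetical hook into your agent and the store/city case is just illustrative:

```python
# "N – 1" eval sketch: earlier turns are fixed fixtures, the agent only
# produces the final reply. All names and messages here are assumptions.

N_MINUS_1_HISTORY = [
    {"role": "user", "content": "Hi, do you have any stores near me?"},
    {"role": "assistant", "content": "Sure! Which city are you in?"},
]

FINAL_USER_TURN = {"role": "user", "content": "Milan. Which ones are open right now?"}

def run_agent_turn(messages: list[dict]) -> str:
    """Run a single agent turn given the conversation so far."""
    raise NotImplementedError  # wire this to your agent

def eval_open_stores_case() -> bool:
    reply = run_agent_turn(N_MINUS_1_HISTORY + [FINAL_USER_TURN])
    # Cheap deterministic check; swap in an LLM judge if you need more nuance.
    return "open" in reply.lower()
```

The fixture messages are exactly what you have to keep in sync with the real agent whenever its prompts or behavior change.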

Checkpoints

It’s really difficult and time-intensive to evaluate agent outputs when you are trying to validate complex conversation patterns that you want the LLMs to strictly follow.

I usually put “checkpoints” inside the prompts: words that I ask the LLM to output verbatim.

This allows me to write simple evals that check for exact strings. If at some point in the conversation the string is not present, I can pretty much know that the system is not working as expected.
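The check itself can then be a few lines of plain string matching. A sketch, with made-up marker names:

```python
# Checkpoint check sketch: the prompt instructs the agent to emit markers like
# "[ASKED_FOR_CITY]" verbatim at specific steps. Marker names are assumptions.

CHECKPOINTS = ["[ASKED_FOR_CITY]", "[CONFIRMED_ORDER]"]

def missing_checkpoints(transcript: str) -> list[str]:
    """Return the checkpoint strings that never appear in the conversation."""
    return [marker for marker in CHECKPOINTS if marker not in transcript]

def assert_checkpoints(transcript: str) -> None:
    missing = missing_checkpoints(transcript)
    if missing:
        raise AssertionError(f"Conversation is missing checkpoints: {missing}")
```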

External tools

Tools can help you by simplifying the setup/infra and maybe giving you a nice interface, but you still have to look at the data and build evaluations specific to your use case.

Don’t rely solely on standard evals, build your own.

