Verifying Coding AIs for LLM-Powered Software

Context

The coding ability of LLMs keeps improving, and less and less code is written directly by human SWEs (Software Engineers). LLMs can generate large amounts of code in a short time, and we cannot review all of it exhaustively; we simply do not have the time.

It can also happen that an SWE has no experience with a given technology but still needs to work with it. LLMs are very helpful in these situations as well, but the problem is similar: without the proper knowledge, a human cannot verify AI-written code with any certainty.

Even though LLMs write better and better code, they still do things we consider wrong, or at least not ideal. Sometimes the problem is that we did not give them the necessary context; other times the error is entirely theirs.

In my experience, we cannot simply ask the AI to verify itself: we won't trust the result, and we'll end up getting involved anyway. Without trust in the process, automation loses its value.

There is no free lunch. We have to invest real time in designing ways to verify, test, and evaluate the code (and its results) that LLMs write, so that we can take advantage of their abilities and make it easier to spot problems in what they generate.

If you are building traditional software, you verify the work in the usual way, but if you are building products with a non-deterministic component, it is wise to add another type of test.


Evals

Evals are not classic tests; they are tests we run on non-deterministic responses, in this case responses generated by an LLM. An LLM can get close to deterministic behavior, for example with a prompt that says “answer yes or no to the user’s question, nothing else”, but those are special cases, and even with temperature set to 0 there is always a chance it does not generate exactly what we asked for.

When using LLMs, evals usually have higher latency than tests and also a higher cost because we are likely consuming a paid API. These are important differences we must consider when writing and running each eval.

Evals cannot always be scored in a binary “correct or incorrect” way; we must therefore think about how we are going to compare two LLM outputs and decide which one is better.
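As a minimal illustration, here is a sketch of an eval for the “answer yes or no” case mentioned above. `call_llm` is a hypothetical placeholder for whatever client you use; the point is that even a near-deterministic prompt gets scored over several runs and reported as a pass rate rather than a single assertion.

```python
# Minimal eval sketch for the "answer yes or no, nothing else" prompt.
# call_llm() is a hypothetical placeholder for your LLM client of choice.

SYSTEM_PROMPT = "Answer yes or no to the user's question, nothing else."

def call_llm(system: str, user: str, temperature: float = 0.0) -> str:
    raise NotImplementedError("plug in your LLM client here")

def eval_yes_no(question: str, runs: int = 5) -> float:
    """Run the same input several times and return the pass rate,
    because even at temperature 0 the output is not guaranteed."""
    passed = 0
    for _ in range(runs):
        answer = call_llm(SYSTEM_PROMPT, question).strip().lower()
        if answer in {"yes", "no"}:
            passed += 1
    return passed / runs

# Usage: rate = eval_yes_no("Is Python a programming language?")
# Decide the acceptable threshold per eval (e.g. rate >= 0.8).
```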

General Considerations

In summary, some considerations to keep in mind for each type of eval:

  • Cost: How many dollars does it cost to run this eval?
  • Time: How much time does it take to run this eval?
  • Timing: At what moments am I going to run this eval?
  • Purpose: What is the purpose of running this eval?
  • Comparison: How can I easily compare the result with previous ones?
  • Progress: How can I know if this new run is better than the previous one? (Every run must generate an output file that can be compared with other iterations; a sketch of such a record follows this list.)
  • Criticality: Which evals must pass no matter what (critical), which ones are we satisfied with if the majority pass (desirable), and which would be nice to pass but don’t worry us yet (bonus)? I define this case by case, not by eval type.
  • Environment: In which environments should I run this eval?
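Here is a minimal sketch of the kind of run record mentioned under Progress, assuming one JSON file per run; the field names are illustrative, not a fixed schema.

```python
# Sketch of a comparable eval-run record, written as one JSON file per run.
# All field names here are illustrative assumptions.
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class EvalRunRecord:
    eval_name: str
    git_commit: str          # tie the results to the code that produced them
    model: str
    passed: int
    total: int
    latency_seconds: float
    cost_usd: float

def save_run(record: EvalRunRecord, results_dir: str = "eval_results") -> Path:
    """Persist the record so this run can be compared with later iterations."""
    Path(results_dir).mkdir(exist_ok=True)
    path = Path(results_dir) / f"{record.eval_name}_{int(time.time())}.json"
    path.write_text(json.dumps(asdict(record), indent=2))
    return path
```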

Types of Evals

End to End (E2E)

Evals that test the general behavior of the backend and the NLU side of the LLMs, such as intent detection and interpretation of user messages. Part of that behavior is how the system handles input that does not follow the happy path (flexibility) and how well it detects and performs transitions between the nodes of a graph.

They are not the best option for evaluating LLM generation, since the generated text may not be exactly what was expected, so the expected answer and the actual answer may not line up.

They also serve to know the latency of a flow and to be able to compare it when changes are made that could affect it.

  • At what moments am I going to run this eval? Before (branch) and after (dev) merging to dev. Whenever a non-trivial change is made in the backend that affects the system in general.
  • How can I know if I am better in this new run than in the previous one? By looking at errors in transitions, errors in LLM inputs, and cases where the expected final result was not reached.

Checkpoints (within E2E)

Included in the E2E evals, checkpoints allow quickly validating whether an input generated by the LLM is correct. They only work for cases where we know with enough certainty what the LLM is going to answer.

  • Example: if the bot has to ask for the user’s name at a certain step, we can build a regex or a simple exact match that looks for “name” in the LLM input (see the sketch below).
  • Note: They shouldn’t check the entire string, because it will hardly ever be exactly the same; check only the part we know must be mentioned.

End to End per Node

If at some point we work iteratively on individual nodes, we should be able to evaluate the whole flow of a node without having to run the complete end-to-end evals.

The most efficient way is to “mark” the nodes of the end-to-end evals so that these node-level runs reuse the same files, executing only parts of the end-to-end evals (a sketch with markers follows the list below).

  • Example: If we made changes in node 3 that depend on what happens in node 1 and node 2, then this test should “load” the context until reaching node 3 and only evaluate what happens in the interactions relative to that node, regardless of what happens next.
  • At what moments am I going to run this eval? Before (branch) merging to dev. Whenever a non-trivial change is made in the backend that affects the node.
  • How can I know if I am better in this new run than in the previous one? By looking at errors in transitions, errors in LLM inputs within the node, and cases where the expected final result of the node was not reached.
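One way to do this, assuming the E2E evals live in pytest, is to tag each step with a node marker. The marker name and the `run_until_node` helper below are assumptions, not part of any specific framework.

```python
# Sketch: node markers on top of the existing E2E evals (pytest assumed).
# run_until_node() is a hypothetical helper that replays, or loads from a
# cache, the conversation context up to the node under test.
import pytest

def run_until_node(node: str):
    """Placeholder: build the context from the nodes that come before `node`."""
    raise NotImplementedError("replay or load the context up to this node")

@pytest.mark.node3
def test_node3_interactions():
    state = run_until_node("node3")            # context from nodes 1 and 2
    reply = state.send_user_message("My name is Ana")
    # Only assert on what happens within node 3, not on what comes next.
    assert "name" not in reply.lower()
```

With the marker registered in `pytest.ini`, running only that node’s portion of the E2E evals becomes `pytest -m node3`, without creating new eval files.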

N-1

Surgical evals aimed at the generative part of the LLMs: they target the generation of a single LLM response, with the previous human-bot interactions (the conversation history) simulated as if they had already happened (a sketch follows the list below).

They are faster to run than end-to-end evals because they only require one interaction.

  • At what moments am I going to run this eval? When a prompt or an LLM model changes.
  • How can I know if I am better in this new run than in the previous one? With binary evals (TODO: resolve doubt).
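A minimal sketch of an N-1 eval, assuming a chat-style API where the history is passed in as messages; `call_llm_chat` is a hypothetical placeholder and the scenario is illustrative.

```python
# N-1 eval sketch: fake the conversation history and evaluate only the
# next response. call_llm_chat() is a hypothetical placeholder.
def call_llm_chat(messages: list[dict]) -> str:
    raise NotImplementedError("plug in your LLM client here")

def eval_n_minus_1() -> bool:
    history = [
        {"role": "system", "content": "You are the ordering assistant."},
        {"role": "assistant", "content": "Hi! What would you like to order?"},
        {"role": "user", "content": "Two pizzas, please."},
    ]
    reply = call_llm_chat(history)      # only this single generation is evaluated
    return "pizza" in reply.lower()     # or hand `reply` to an LLM judge instead
```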

E2E (General) <——> E2E per Node <——> N-1 (Specific)

LLM as a Judge

Use an LLM to “judge” the output generated by the LLM we want to evaluate. In this case, we delegate the validation of an output to an LLM, so we must trust that the judge does a good job.

How do we ensure we trust the judge? It has to “align” with a human, even better if the human is an expert in the domain. The “alignment” process involves “judging the judge,” which basically means that you not only have to evaluate the agent, but you also have to evaluate the judge. That alignment process must be done before using the judge to evaluate, and the output of that process will be a judge prompt refined based on tests and human input.

  • At what moments am I going to run this eval?
    • In production: It is key because a human cannot be constantly reviewing conversations in real-time.
    • In development: It should be used for specific cases where the LLM output is highly variable, as it is costly, slow to run, requires maintenance, and we will never be 100% sure that it is working well.
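A minimal sketch of a judge call, assuming the judge prompt has already gone through the alignment process described above; the rubric and the `call_llm` placeholder are illustrative.

```python
# LLM-as-a-judge sketch. call_llm() is a hypothetical placeholder, and the
# rubric below stands in for a judge prompt already aligned with a human.
JUDGE_PROMPT = """You are evaluating a customer-support reply.
Answer with exactly one word: PASS if the reply is polite, on-topic,
and answers the user's question; FAIL otherwise.

User message: {user}
Assistant reply: {reply}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def judge(user: str, reply: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(user=user, reply=reply))
    return verdict.strip().upper().startswith("PASS")
```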

Performance

Performance can be measured in any eval, but it is most useful in the End to End ones, where we can see in detail how much time is spent in each node or interaction.

  • How to know if I am better? Simple numerical comparison: less is better.
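A minimal sketch of per-node timing using a context manager; where the timings end up (here a plain dict) is an assumption.

```python
# Per-node latency sketch: time each node and keep the numbers so that
# runs can be compared (lower is better).
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(node: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[node] = time.perf_counter() - start

# Usage inside an E2E eval:
with timed("node1"):
    time.sleep(0.1)  # stand-in for the node 1 interaction
print(timings)       # e.g. {'node1': 0.10...}
```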

Stress Tests

They run the End to End flow in different scenarios, with a cheaper LLM model if possible; the only thing that matters is how much load the system supports with the changes we introduced.

  • Note: In some cases a cheaper model won’t work, because advancing to a certain stage of the system may require the LLM to return a specific output; if that doesn’t happen, the flow cannot advance.
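A minimal sketch of such a stress run, assuming an async `run_e2e_flow` entry point exists; the function, the scenario name, and the concurrency number are all illustrative.

```python
# Stress-test sketch: fire many concurrent E2E flows and count failures.
# run_e2e_flow() is a hypothetical async entry point into the system,
# here pointed at a cheaper model.
import asyncio

async def run_e2e_flow(scenario: str, model: str) -> bool:
    raise NotImplementedError("call the real E2E flow here")

async def stress(concurrency: int = 50) -> None:
    tasks = [run_e2e_flow("happy_path", model="cheap-model")
             for _ in range(concurrency)]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    failures = sum(1 for r in results if r is not True)
    print(f"{failures}/{concurrency} flows failed")

# asyncio.run(stress())
```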

Web Browser User E2E Test (WIP)

Costly in time and money. I think it can be good to verify that the most critical cases continue to work.

  • At what moments would I run this eval? Before (dev) and after (main) merging to main.

Adversarial LLMs (WIP)

Have an LLM act as a user and interact directly with the system, generating different conversations that can then be evaluated to determine how well the system works.
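A minimal sketch of the idea, with hypothetical `call_llm` and `system_reply` placeholders; the persona prompt is illustrative.

```python
# Adversarial-user sketch: an LLM plays the user for a few turns and the
# resulting transcript is stored for later evaluation (e.g. by a judge).
# call_llm() and system_reply() are hypothetical placeholders.
USER_PERSONA = ("You are an impatient user who changes their mind mid-flow. "
                "Reply with the next user message only.")

def call_llm(system: str, conversation_so_far: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def system_reply(user_message: str) -> str:
    raise NotImplementedError("call the system under test here")

def adversarial_conversation(turns: int = 5) -> list[tuple[str, str]]:
    transcript: list[tuple[str, str]] = [("bot", "Hi! How can I help you?")]
    for _ in range(turns):
        history = "\n".join(f"{role}: {text}" for role, text in transcript)
        user = call_llm(USER_PERSONA, history)
        transcript.append(("user", user))
        transcript.append(("bot", system_reply(user)))
    return transcript
```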

(Future topics: golden datasets, synthetic data).


Tests

This will be covered in an additional document, but they are the usual tests: unit, integration, conventional end-to-end.


Current Workflow

I still don’t have a workflow I feel 100% happy with. Some time ago I wrote about Evaluation Driven Design, but I’m not sure that approach is the best for this new development paradigm.

What I am currently doing is:

Branch

  1. Create branch.
  2. Ask the Coding Assistant to add a functionality.
  3. Based on that, build evals or adjust existing ones.
  4. Run evals and tests.
  5. Loop:
    • If there are problems or I see improvements needed -> Go back to step 2.
    • Otherwise, run stress tests (only if there are changes that could have an impact) -> Commit.
  6. Merge to dev.

Dev

  1. Run evals and tests (eval results are saved along with the last commit).
  2. Run stress tests (same condition as in the branch steps: only if there are changes that could have an impact).
  3. Compare the results after the merge with the results before the merge (a comparison sketch follows this list).
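A minimal sketch of that comparison, assuming the run records from the earlier `EvalRunRecord` sketch; the field names remain illustrative.

```python
# Sketch: compare two eval-run records saved as JSON (see the run-record
# sketch above). Field names are the same illustrative ones.
import json
from pathlib import Path

def compare_runs(before_path: str, after_path: str) -> None:
    before = json.loads(Path(before_path).read_text())
    after = json.loads(Path(after_path).read_text())
    for field in ("passed", "latency_seconds", "cost_usd"):
        print(f"{field}: {before[field]} -> {after[field]}")

# compare_runs("eval_results/e2e_before.json", "eval_results/e2e_after.json")
```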

Main

  1. Merge to main.

For now, I do not directly integrate running evals into the AI coding assistant workflow because I feel I lose control and the process doesn’t give me confidence (and incidentally I save some tokens).

Ideal Workflow Goal

I aim for a workflow where my work is mainly:

  • Clearly explain what needs to be done, providing necessary context.
  • Give effective feedback.
  • Design and iterate ways to verify the work of Coding AIs.
