The time of evaluation-driven development

A New Software Development Paradigm

The increasing adoption of large language models (LLMs) in software development has brought significant changes, especially for teams integrating third-party LLMs via APIs. Many teams that once built traditional, deterministic software products are now shifting toward probabilistic systems, a change that requires a new approach to ensure product quality and reliability.

Incorporating stochastic components introduces uncertainty, which can lead to subpar products if not properly managed.

How are teams currently evaluating their systems?

The most common approach to evaluating AI-powered systems is informal “vibe testing,” where team members manually perform end-to-end (E2E) tests to see if the output “looks good enough.” This approach lacks standardization and rigorous measurement, leading to inconsistent results.

Why is evaluation often overlooked?

Evaluating AI systems is challenging. Building an evaluation pipeline is complex, requiring ongoing maintenance, careful monitoring, and subjective decision-making when assessing outputs.

Moreover, AI products can be deployed to production without a robust evaluation process. Many teams perceive evaluation as a distraction from more immediate development tasks, postponing it indefinitely.

The consequences of a bad evaluation approach

The result is a product that performs well in ideal scenarios but fails in real-world use cases, often during crucial demonstrations with potential clients. Such failures erode the team’s confidence, damage trust with stakeholders, and can even stall the release of new AI-powered features.

Why adopt evaluation-driven development?

For teams working with probabilistic systems, confidence is critical. Without it, AI-powered products risk being abandoned or ignored. Robust evaluation ensures the product remains reliable, even when faced with unpredictable inputs. This confidence allows broader adoption within a company and improves credibility with customers.

Additionally, evaluation-driven development (EDD) fosters a positive feedback loop: frequent evaluations drive incremental improvements, guiding the product toward better performance. However, for EDD to be effective, at least some evaluations must be fast and executable on every iteration.

A strong evaluation framework enables teams to release new versions with confidence and provides actionable insights to refine weak points in the system.


How to Implement Evaluation-Driven Development

To build AI-powered software effectively, quick iterations are key. A good evaluation strategy includes at least three types of tests, categorized by:

  • The time required to build them
  • The time required to execute them
  • The time required to analyze their results

1. Low-Complexity Evaluations

These are similar to traditional unit tests but adapted for LLM-generated responses. Some examples include:

  • Ensuring that specific strings never appear in the LLM output, such as sensitive information (IDs) or profanity.
  • Checking for required substrings in the response. For example, if the expected answer to “What are the store opening hours?” is “The store is open from 9 AM to 8 PM every day”, a test could verify that the response contains “9” or “nine”.
  • Verifying the number of results returned for a given query. If a test dataset should return exactly three results from a vector search, an assertion can confirm the output array has a length of three.

These are the kinds of evaluations that should run quickly and frequently, ideally on every code change.
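As a concrete illustration, here is a minimal pytest-style sketch of the checks listed above. The `generate_answer` and `vector_search` functions (and the `my_app` module) are hypothetical stand-ins for whatever your system exposes; the assertions themselves are the point.

```python
# low_complexity_checks.py -- a sketch of unit-test-style checks for LLM output.
# generate_answer() and vector_search() are hypothetical wrappers around your
# own system; adapt the names to whatever your code base actually exposes.
import re

from my_app import generate_answer, vector_search  # hypothetical module

BANNED_PATTERNS = [
    r"\b\d{9}\b",         # e.g. internal customer IDs that must never leak
    r"(?i)\bpassword\b",  # or any other string that must never be echoed
]

def test_no_sensitive_strings():
    answer = generate_answer("Tell me about my last order")
    for pattern in BANNED_PATTERNS:
        assert re.search(pattern, answer) is None, f"banned pattern {pattern!r} found"

def test_opening_hours_mentions_nine():
    answer = generate_answer("What are the store opening hours?").lower()
    # Exact wording may vary, but the key fact ("9" or "nine") must appear.
    assert "9" in answer or "nine" in answer

def test_vector_search_returns_three_results():
    results = vector_search("return policy", top_k=3)
    assert len(results) == 3
```

Because these tests are cheap to run, they fit naturally into an existing CI pipeline alongside conventional unit tests.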

2. Medium-to-High Complexity Evaluations

These evaluations take longer to develop and analyze because they usually involve a human in the loop on each iteration. Examples include:

  • Assessing open-ended text generation: Checking whether vector search retrieves relevant context, measuring semantic similarity with a ground truth answer, or detecting hallucinations in responses.
  • Evaluating structured text generation: For example, in a classification task where an LLM assigns categories to inputs, evaluation can rely on traditional supervised learning metrics (F1-score, precision, recall). LLM APIs additionally expose logprobs, token-level prediction probabilities that can serve as a rough confidence proxy.
  • Testing interaction flows: In chatbot applications, evaluation can track whether the conversation progresses logically and whether the system maintains contextual awareness across multiple interactions.
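To make the open-ended case above more concrete, one possible sketch is a semantic-similarity check against a ground-truth answer using sentence embeddings. The embedding model and the 0.8 threshold are illustrative assumptions, not recommendations, and `generate_answer` is again a hypothetical wrapper.

```python
# semantic_similarity_check.py -- comparing a generated answer against a
# ground-truth reference using embedding cosine similarity. Model name and
# threshold are illustrative assumptions, not recommendations.
from sentence_transformers import SentenceTransformer, util

from my_app import generate_answer  # hypothetical wrapper around the LLM call

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity(generated: str, reference: str) -> float:
    # Encode both texts and return their cosine similarity in [-1, 1].
    embeddings = model.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

def test_opening_hours_answer_matches_reference():
    reference = "The store is open from 9 AM to 8 PM every day."
    generated = generate_answer("What are the store opening hours?")
    assert similarity(generated, reference) > 0.8
```

A threshold like this is a starting point; in practice it is tuned by inspecting a sample of borderline cases by hand.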
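For the classification case above, standard supervised-learning metrics apply directly, and the logprobs returned by most LLM APIs give a rough per-prediction confidence. A sketch using scikit-learn, with illustrative data standing in for the labels and logprobs you would collect from a real labeled test set:

```python
# classification_eval.py -- evaluating an LLM classifier with standard
# supervised-learning metrics; logprobs act as a rough confidence proxy.
import math

from sklearn.metrics import classification_report

# Illustrative data: in practice these come from running the LLM over a
# labeled test set and recording the chosen label and its token logprob.
y_true = ["billing", "shipping", "billing", "other"]
y_pred = ["billing", "shipping", "other", "other"]
logprobs = [-0.05, -0.10, -1.90, -0.30]  # logprob of each predicted label token

# Precision, recall and F1-score per class, plus averages.
print(classification_report(y_true, y_pred, digits=3))

# Flag low-confidence predictions for human review.
for label, lp in zip(y_pred, logprobs):
    confidence = math.exp(lp)
    if confidence < 0.7:
        print(f"Low confidence ({confidence:.2f}) on predicted label: {label}")
```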

3. End-to-End (E2E) Evaluations

E2E evaluations assess the entire system’s behavior, rather than isolated components. For example, while a medium-complexity evaluation might assess a single chatbot interaction, an E2E evaluation considers an entire conversation flow including the platform on which the chatbot is deployed (e.g., WhatsApp).
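E2E evaluations are necessarily tied to the deployment setup, but the shape is usually the same: drive a full conversation through the deployed interface and assert on the flow as a whole. A rough sketch, assuming a hypothetical staging endpoint that mirrors the production (e.g., WhatsApp) integration; the URL and payload shape are assumptions:

```python
# e2e_conversation_check.py -- drives an entire conversation through a
# hypothetical staging endpoint that mirrors the production integration.
# The URL and JSON payload shape are assumptions, not a real API.
import requests

STAGING_URL = "https://staging.example.com/chat"  # hypothetical endpoint

def send(session_id: str, text: str) -> str:
    # One user turn: post the message and return the bot's reply.
    resp = requests.post(
        STAGING_URL,
        json={"session_id": session_id, "text": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["reply"]

def test_order_status_flow_completes():
    session = "e2e-test-001"
    send(session, "Hi, I want to check my order")
    reply = send(session, "The order number is 12345")
    # The system should retain the order number across turns and close the flow.
    assert "12345" in reply
    assert "status" in reply.lower()
```

Because these tests go through the real messaging layer, they are slower and flakier than the other two categories, which is why they run less often, for example nightly or before a release.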


Some questions evaluations should answer

Retrieval-Augmented Generation (RAG)
  • Is the vector search returning the most relevant items?
  • Does the LLM-generated response accurately address the user’s request?
  • Is the LLM properly handling previous interactions to provide contextually appropriate answers?

Classification Tasks
  • What is the model’s F1-score?
  • What are the most common misclassifications?
  • Why is the model making errors in certain cases?

Chatbot Performance
  • How many conversation flows are completed successfully?
  • Is the chatbot retaining context from previous messages?
  • Is the chatbot adhering to predefined behavioral guidelines?
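The first RAG question above is often answered with a small labeled set of queries and their expected documents, scored with a metric such as recall@k. A minimal sketch, where `vector_search` is again a hypothetical wrapper assumed to return document IDs:

```python
# retrieval_recall_check.py -- recall@k over a small labeled set of queries.
# vector_search() is a hypothetical wrapper assumed to return document IDs.
from my_app import vector_search  # hypothetical module

LABELED_QUERIES = {
    "What is the return policy?": {"doc_returns_01"},
    "Do you ship internationally?": {"doc_shipping_02", "doc_shipping_03"},
}

def recall_at_k(k: int = 3) -> float:
    # Fraction of relevant documents that appear in the top-k results.
    hits, total = 0, 0
    for query, relevant_ids in LABELED_QUERIES.items():
        retrieved = set(vector_search(query, top_k=k))
        hits += len(retrieved & relevant_ids)
        total += len(relevant_ids)
    return hits / total

if __name__ == "__main__":
    print(f"recall@3 = {recall_at_k(3):.2f}")
```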

Conclusion

As AI-generated outputs become more integral to software products, the need for robust evaluation pipelines is growing. Teams that invest in evaluation-driven development will build more reliable, trustworthy AI applications—gaining a competitive advantage over those that neglect this crucial step.

The time to prioritize evaluation is now. Otherwise, teams risk falling behind as better-tested and more robust AI solutions take the lead.

