{"id":89,"date":"2025-02-13T23:50:09","date_gmt":"2025-02-13T23:50:09","guid":{"rendered":"https:\/\/aunhumano.com\/?p=89"},"modified":"2025-02-13T23:50:09","modified_gmt":"2025-02-13T23:50:09","slug":"the-time-of-evaluation-driven-development","status":"publish","type":"post","link":"https:\/\/aunhumano.com\/index.php\/2025\/02\/13\/the-time-of-evaluation-driven-development\/","title":{"rendered":"The time of evaluation driven development"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\"><em>A New Software Development Paradigm<\/em><\/h3>\n\n\n\n<p>The increasing adoption of large language models (LLMs) in software development has brought significant changes, especially for teams integrating third-party LLMs via APIs. Many teams that once built traditional, deterministic software products are now shifting toward probabilistic systems, a change that requires a new approach to ensure product quality and reliability.<\/p>\n\n\n\n<p>Incorporating stochastic components introduces uncertainty, which can lead to subpar products if not properly managed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How are teams currently evaluating their systems?<\/strong><\/h3>\n\n\n\n<p>The most common approach to evaluating AI-powered systems is informal &#8220;vibe testing,&#8221; where team members manually perform end-to-end (E2E) tests to see if the output &#8220;looks good enough.&#8221; This approach lacks standardization and rigorous measurement, leading to inconsistent results.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Why is evaluation often overlooked?<\/strong><\/h3>\n\n\n\n<p>Evaluating AI systems is challenging. Building an evaluation pipeline is complex, requiring ongoing maintenance, careful monitoring, and subjective decision-making when assessing outputs.<\/p>\n\n\n\n<p>Moreover, AI products can be deployed to production without a robust evaluation process. 
Many teams perceive evaluation as a distraction from more immediate development tasks, postponing it indefinitely.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The consequences of a bad evaluation approach<\/strong><\/h3>\n\n\n\n<p>The result is a product that performs well in ideal scenarios but fails in real-world use cases, often during crucial demonstrations with potential clients. This can erode the team&#8217;s confidence, damage trust with stakeholders, and even stall the release of new AI-powered features.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Why adopt evaluation-driven development?<\/strong><\/h3>\n\n\n\n<p>For teams working with probabilistic systems, confidence is critical. Without it, AI-powered products risk being abandoned or ignored. Robust evaluation ensures the product remains reliable, even when faced with unpredictable inputs. This confidence allows broader adoption within a company and improves credibility with customers.<\/p>\n\n\n\n<p>Additionally, evaluation-driven development (EDD) fosters a positive feedback loop: frequent evaluations drive incremental improvements, guiding the product toward better performance. However, for EDD to be effective, at least some evaluations must be fast and executable on every iteration.<\/p>\n\n\n\n<p>A strong evaluation framework enables teams to release new versions with confidence and provides actionable insights to refine weak points in the system.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How to Implement Evaluation-Driven Development<\/strong><\/h3>\n\n\n\n<p>To build AI-powered software effectively, quick iterations are key.
A good evaluation strategy includes at least three types of tests, categorized by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The time required to build them<\/li>\n\n\n\n<li>The time required to execute them<\/li>\n\n\n\n<li>The time required to analyze their results<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>1. Low-Complexity Evaluations<\/strong><\/h4>\n\n\n\n<p>These are similar to traditional unit tests but adapted for LLM-generated responses. Some examples include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Ensuring specific strings are never included<\/strong> in LLM output, such as sensitive information (IDs) or profanity.<\/li>\n\n\n\n<li><strong>Checking for required substrings<\/strong> in the response. For example, if the expected answer to <em>&#8220;What are the store opening hours?&#8221;<\/em> is <em>&#8220;The store is open from 9 AM to 8 PM every day&#8221;<\/em>, a test could verify that the response contains <em>&#8220;9&#8221;<\/em> or <em>&#8220;nine&#8221;<\/em>.<\/li>\n\n\n\n<li><strong>Verifying the number of results returned<\/strong> for a given query. If a test dataset should return exactly three results from a vector search, an assertion can confirm the output array has a length of three.<\/li>\n<\/ul>\n\n\n\n<p>These are the kinds of evaluations that should run quickly and frequently, ideally on every code change.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>2. Medium-to-High Complexity Evaluations<\/strong><\/h4>\n\n\n\n<p>These evaluations require more time to develop and analyze because there is usually a human component on every iteration.
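The low-complexity checks described above (forbidden strings, required substrings, expected result counts) can be written as plain assertions. A minimal sketch in Python, where `response` stands in for a real LLM output and all names are hypothetical:

```python
# Minimal sketch of low-complexity evaluations, assuming `response` is the
# raw text returned by an LLM call. All names here are hypothetical.

FORBIDDEN = ["internal_id", "ssn"]  # strings that must never appear

def contains_forbidden(response: str) -> bool:
    """True if the output leaks any forbidden string."""
    lowered = response.lower()
    return any(s in lowered for s in FORBIDDEN)

def has_required_substring(response: str, alternatives: list[str]) -> bool:
    """True if at least one acceptable phrasing appears in the output."""
    lowered = response.lower()
    return any(alt.lower() in lowered for alt in alternatives)

def has_expected_count(results: list, expected: int) -> bool:
    """True if a known query returned exactly `expected` items."""
    return len(results) == expected

# Example: the opening-hours question from the text.
response = "The store is open from 9 AM to 8 PM every day."
assert not contains_forbidden(response)
assert has_required_substring(response, ["9", "nine"])
assert has_expected_count(["doc1", "doc2", "doc3"], expected=3)
```

Because these checks are deterministic and cheap, they can run in CI on every commit alongside conventional unit tests.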
Examples include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Assessing open-ended text generation<\/strong>: Checking whether vector search retrieves relevant context, measuring semantic similarity with a ground truth answer, or detecting hallucinations in responses.<\/li>\n\n\n\n<li><strong>Evaluating structured text generation<\/strong>: For example, in a classification task where an LLM assigns categories to inputs, evaluation can be done using traditional supervised learning metrics (F1-score, precision, recall). Although an LLM is not a conventional classifier, it still exposes a confidence proxy through <em>logprobs<\/em>, the token-level log-probabilities of its predictions.<\/li>\n\n\n\n<li><strong>Testing interaction flows<\/strong>: In chatbot applications, evaluation can track whether the conversation progresses logically and whether the system maintains contextual awareness across multiple interactions.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>3. End-to-End (E2E) Evaluations<\/strong><\/h4>\n\n\n\n<p>E2E evaluations assess the entire system&#8217;s behavior, rather than isolated components.
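For the structured text generation case above, the supervised metrics can be computed directly from a labeled test set, and a token logprob can be converted into a confidence estimate. A minimal stdlib-only sketch with hypothetical labels and category names:

```python
import math
from collections import Counter

def f1_for_label(y_true: list[str], y_pred: list[str], label: str) -> float:
    """One-vs-rest F1 score for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def logprob_to_confidence(logprob: float) -> float:
    """Token logprobs are natural logs; exponentiate to recover a probability."""
    return math.exp(logprob)

def most_common_confusions(y_true, y_pred, n=3):
    """(true, predicted) pairs the model gets wrong most often."""
    return Counter(
        (t, p) for t, p in zip(y_true, y_pred) if t != p
    ).most_common(n)

# Hypothetical ground truth vs. LLM-assigned categories:
y_true = ["billing", "billing", "support", "sales"]
y_pred = ["billing", "support", "support", "sales"]
print(round(f1_for_label(y_true, y_pred, "billing"), 2))  # 0.67
print(most_common_confusions(y_true, y_pred))             # [(('billing', 'support'), 1)]
print(round(logprob_to_confidence(-0.05), 2))             # 0.95
```

The confusion counts answer "what are the most common misclassifications?" directly, which is usually where the human analysis step starts.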
For example, while a medium-complexity evaluation might assess a single chatbot interaction, an E2E evaluation considers an entire conversation flow including the platform on which the chatbot is deployed (e.g., WhatsApp).<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Some questions evaluations should answer<\/strong><\/h3>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>Retrieval-Augmented Generation (RAG)<\/strong><\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Is the vector search returning the most relevant items?<\/li>\n\n\n\n<li>Does the LLM-generated response accurately address the user\u2019s request?<\/li>\n\n\n\n<li>Is the LLM properly handling previous interactions to provide contextually appropriate answers?<\/li>\n<\/ul>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>Classification Tasks<\/strong><\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What is the model\u2019s F1-score?<\/li>\n\n\n\n<li>What are the most common misclassifications?<\/li>\n\n\n\n<li>Why is the model making errors in certain cases?<\/li>\n<\/ul>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>Chatbot Performance<\/strong><\/h5>\n\n\n\n<ul class=\"wp-block-list\">\n<li>How many conversation flows are completed successfully?<\/li>\n\n\n\n<li>Is the chatbot retaining context from previous messages?<\/li>\n\n\n\n<li>Is the chatbot adhering to predefined behavioral guidelines?<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h3>\n\n\n\n<p>As AI-generated outputs become more integral to software products, the need for robust evaluation pipelines is growing. Teams that invest in evaluation-driven development will build more reliable, trustworthy AI applications\u2014gaining a competitive advantage over those that neglect this crucial step.<\/p>\n\n\n\n<p>The time to prioritize evaluation is now. 
Otherwise, teams risk falling behind as better-tested and more robust AI solutions take the lead.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A New Software Development Paradigm The increasing adoption of large language models (LLMs) in software development has brought significant changes, especially for teams integrating third-party LLMs via APIs. Many teams that once built traditional, deterministic software products are now shifting toward probabilistic systems, a change that requires a new approach to ensure product quality and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-89","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/posts\/89","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/comments?post=89"}],"version-history":[{"count":1,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/posts\/89\/revisions"}],"predecessor-version":[{"id":90,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/posts\/89\/revisions\/90"}],"wp:attachment":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/media?parent=89"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/categories?post=89"},{"taxonomy":"post_tag","embeddable":true,"href":"https:
\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/tags?post=89"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}