{"id":130,"date":"2025-09-03T12:35:47","date_gmt":"2025-09-03T12:35:47","guid":{"rendered":"https:\/\/aunhumano.com\/?p=130"},"modified":"2025-09-03T12:37:54","modified_gmt":"2025-09-03T12:37:54","slug":"on-evaluating-agents","status":"publish","type":"post","link":"https:\/\/aunhumano.com\/index.php\/2025\/09\/03\/on-evaluating-agents\/","title":{"rendered":"On evaluating agents"},"content":{"rendered":"\n<h5 class=\"wp-block-heading\">&#8220;Models constantly change and improve but evals persist&#8221;<\/h5>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Look at the data<\/h4>\n\n\n\n<p>No amount of evals will replace the need to look at the data. Once you have good eval coverage you&#8217;ll be able to spend less time on it, but looking at the agent traces to identify possible issues or things to improve will always be a must.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Starting: end-to-end evals<\/h4>\n\n\n\n<p>You must create evals for your agents; stop relying solely on manual testing.<br><br>Not sure where to start?<br><br>Add e2e evals, <strong>define a success criterion<\/strong> (did the agent meet the user&#8217;s goal?) and make the evals output a simple yes\/no value.<br><br>This is much better than no evals.<\/p>\n\n\n\n<p>By running simple end-to-end agent evaluations you can quickly:<br>&#8211; identify problematic edge cases<br>&#8211; update, trim and refine the agent prompts<br>&#8211; make sure you are not breaking the cases that already work<br>&#8211; compare the performance of the current LLM vs. 
cheaper ones<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">N &#8211; 1 evals<\/h4>\n\n\n\n<p>Once you&#8217;ve created the e2e evals you can move on to &#8220;N &#8211; 1&#8221; evals, that is, evals that &#8220;simulate&#8221; the previous interactions between the system and the user.<\/p>\n\n\n\n<p>Suppose that, either by looking at the data or by running a set of e2e evals, you find there is a problem when the user asks for the brand&#8217;s open stores in their area. It&#8217;d be better to create an eval that targets this directly: if you keep covering it with e2e evals you won&#8217;t always be able to reproduce the error, and your evals will take too much time and cost too much money.<\/p>\n\n\n\n<p>It&#8217;s much better to &#8220;simulate&#8221; the previous interactions and then get straight to the point.<\/p>\n\n\n\n<p>There&#8217;s one caveat: you&#8217;ll have to be careful to keep the &#8220;N &#8211; 1&#8221; interactions updated whenever you make changes, because you&#8217;ll be &#8220;simulating&#8221; conversations that the updated agent would never actually produce.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Checkpoints<\/h4>\n\n\n\n<p>It&#8217;s really difficult and time-intensive to evaluate agent outputs when you are trying to validate complex conversation patterns that you want the LLMs to strictly follow.<br><br>I usually put &#8220;checkpoints&#8221; inside the prompts: words that I ask the LLM to output verbatim.<br><br>This lets me write evals that simply check for exact strings. 
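<\/p>\n\n\n\n<p>As a minimal sketch (assuming the conversation is available as a plain transcript string, and that &#8220;ORDER_CONFIRMED&#8221; is a hypothetical checkpoint word placed in the prompt), such a check is only a few lines of Python:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def missing_checkpoints(transcript, checkpoints):\n    # hypothetical helper: return every checkpoint string absent from the transcript\n    return [cp for cp in checkpoints if cp not in transcript]\n\ntranscript = 'Sure, your order is on its way. ORDER_CONFIRMED'\nassert missing_checkpoints(transcript, ['ORDER_CONFIRMED']) == []<\/code><\/pre>\n\n\n\n<p>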
If at some point in the conversation the string is not present, I know the system is not working as expected.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">External tools<\/h4>\n\n\n\n<p>Tools can help you by simplifying the setup\/infra and maybe giving you a nice interface, but you still have to look at the data and build the specific evaluations for your use case.<br><br>Don&#8217;t rely solely on standard evals; build your own.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&#8220;Models constantly change and improve but evals persist&#8221; Look at the data No amount of evals will replace the need to look at the data. Once you have good eval coverage you&#8217;ll be able to spend less time on it, but looking at the agent traces to identify possible [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-130","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/posts\/130","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/comments?post=130"}],"version-history":[{"count":5,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/posts\/130\/revisions"}],"predecessor-version":[{"id":135,"href":"https:\/\/aunhumano.com\/i
ndex.php\/wp-json\/wp\/v2\/posts\/130\/revisions\/135"}],"wp:attachment":[{"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/media?parent=130"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/categories?post=130"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aunhumano.com\/index.php\/wp-json\/wp\/v2\/tags?post=130"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}