
Why?
The biggest hurdle when someone wants to evaluate a machine learning classification task is the burdensome process of manual labeling: time consumption, boredom, and human error are some of the problems that come with it.
These issues are even more critical if we want to know periodically how the model is performing in production: if the model’s usage is moderate, you would practically need dedicated people; if usage is high, you are in trouble.
With the introduction of LLMs, companies started to experiment with the possibility of “automating” this valuable but difficult task. The idea is that you can give an LLM:
- the context in which the model operates
- the input (what the model has to predict a category for)
- the target (the category the classification model predicted)
And the LLM’s goal is to “judge” whether the predicted target is correct or not. This could be called an AI judge for classification tasks. An AI judge can also be used for other cases, for example checking whether another AI has answered the user’s question correctly, or evaluating how good an LLM is at a given negotiation. But in this post we’re going to focus on the classification task.
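To make this concrete, here is a minimal sketch of such a judge, assuming the OpenAI Python client; the model name, prompt wording, and helper function are illustrative choices, not a prescription:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt: the operating context, the input, and the predicted target.
JUDGE_PROMPT = """You are evaluating an intent classifier for a customer support bot.
User message: {user_message}
Predicted intent: {predicted_intent}

Is the predicted intent correct for this message? Answer only CORRECT or INCORRECT."""

def judge_prediction(user_message: str, predicted_intent: str) -> bool:
    """Ask the LLM judge whether a single prediction is correct."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable model; pick for cost, token limits, latency
        temperature=0,  # deterministic verdicts make evaluation easier
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                user_message=user_message, predicted_intent=predicted_intent
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")
```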
Building an AI judge is pretty simple; what can be difficult is building an AI judge that works. The most important trait we’re looking for is confidence: when you don’t have confidence in the AI judge’s output, the whole process lacks utility and quickly falls into an abyss of oblivion.
My approach to building an AI judge for classification
1 – Judging the Judge
The most important aspect of using an AI judge to evaluate production bots is trust. We need to have confidence and assurance that the judge will do a good job with production data.
To achieve this, we need to “judge the judge,” meaning evaluate how well the judge performs its evaluations. The way to do this is by creating a dataset and labeling it manually, ideally with real usage data. We need to be quite sure that this dataset is correctly labeled, involving a domain expert in cases of ambiguity. It doesn’t need to include thousands of cases, but it should be sufficiently diverse to provide a representative sample — at least several cases per type of intent will be necessary.
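For illustration, such a dataset can be as simple as a list of records like the one below; the field names are hypothetical, not a required schema:

```python
# Each record: what the bot saw, what the classifier predicted,
# and the carefully reviewed human verdict on that prediction.
golden_set = [
    {"user_message": "I want to cancel my subscription",
     "predicted_intent": "cancel_subscription",
     "human_verdict": "correct"},
    {"user_message": "Why was I charged twice?",
     "predicted_intent": "cancel_subscription",
     "human_verdict": "incorrect"},  # should have been a billing intent
    # ... several diverse cases per intent
]
```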
Once we have that dataset ready, we build the judge and ask it to indicate whether each predicted intent is correct or not. Since the human labeling was done carefully, what we’re looking for is a high match rate: the judge agrees with most of the human-labeled cases. A low number (under 80% accuracy) is a sign that the judge isn’t working well or that the intents are too ambiguous.
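Computing that match rate is then a matter of comparing the judge’s verdict against each human label; a sketch, reusing the hypothetical `golden_set` and `judge_prediction` from above:

```python
def judge_vs_human_agreement(golden_set) -> float:
    """Fraction of human-labeled cases where the judge reaches the same verdict."""
    matches = 0
    for case in golden_set:
        judge_says_correct = judge_prediction(
            case["user_message"], case["predicted_intent"]
        )
        human_says_correct = case["human_verdict"] == "correct"
        if judge_says_correct == human_says_correct:
            matches += 1
    return matches / len(golden_set)

agreement = judge_vs_human_agreement(golden_set)
print(f"Judge agrees with humans on {agreement:.1%} of cases")  # under 80% is a red flag
```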
After the first iteration, it’s best to review the cases where the judge disagreed with the human label to look for errors or ambiguities. If we find human labeling errors, we can correct them. If we find ambiguities, we can do one of two things: remove those cases, or keep them with the correct label and write out the reasoning behind the decision. I recommend the second option, as it will help us catch the same issue in the production dataset. If we choose it, we should include that explanation in the judge’s prompt for the production dataset (but not the pre-production one, as that would be giving away the answer). This way, we help prevent specific types of errors.
Other improvements to explore before the next iteration (the sketch after this list illustrates the first three):
- Add more examples for each intent in the prompt
- General improvements to the prompt
- Improvements to the intent descriptions
- Switch the LLM model used by the judge (taking into account cost, token limits, and latency)
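As an illustration of the first three points, the judge’s prompt can grow to carry richer intent descriptions, a few examples per intent, and the disambiguation notes collected during error review; all the wording below is made up for the example:

```python
JUDGE_PROMPT_V2 = """You are evaluating an intent classifier for a customer support bot.

Intents:
- cancel_subscription: the user wants to stop paying for the service.
  Examples: "cancel my plan", "I don't want to be billed anymore"
- billing_issue: the user asks about or disputes a charge.
  Examples: "why was I charged twice?", "there's a charge I don't recognize"

Disambiguation notes (from reviewed errors):
- A question about a charge is billing_issue, even if the user threatens to cancel.

User message: {user_message}
Predicted intent: {predicted_intent}

Is the predicted intent correct for this message? Answer only CORRECT or INCORRECT."""
```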
After several iterations, we should be able to reach an accuracy of ≥ 90% on a dataset with ambiguous intents, and even higher on better-formulated ones.
Once that’s done, we’ll have a functional AI judge, which should still be improved over time.
2 – Judge for a Sample of Production Data
At this stage, we should already be confident that the judge will do a good job, so it’s time to use it on production data. Still, it doesn’t hurt to do a preliminary phase with a small sample before the final labeling, so we can “evaluate the judge” on data we’ve never seen before.
At this point, we’d do something similar to the first phase, but in this case, we’d have the intent predicted by our LLM and the judge’s verdict on whether that intent is correct or not. We would then manually label whether we believe the verdict is correct, and from this, we’d calculate the judge’s accuracy on a production dataset. We’d also get the performance of the original LLM.
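In code terms, this phase might look like the sketch below, where each record in `sample` holds the judge’s verdict on a production case plus our manual verdict on the judge; all field names are illustrative:

```python
def preproduction_metrics(sample) -> tuple[float, float]:
    """sample: records with 'judge_verdict' ("correct"/"incorrect") and
    'human_says_judge_is_right' (bool) for each production case."""
    # How often the judge's verdict survives human review.
    judge_accuracy = sum(r["human_says_judge_is_right"] for r in sample) / len(sample)

    # The classifier's accuracy, taking human-corrected verdicts as ground truth:
    # a prediction is right when the judge said "correct" and the human agreed,
    # or when the judge said "incorrect" and the human overruled the judge.
    truly_correct = sum(
        (r["judge_verdict"] == "correct") == r["human_says_judge_is_right"]
        for r in sample
    )
    return judge_accuracy, truly_correct / len(sample)
```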
Again, we should have enough information to make improvements to both the judge’s prompt and the LLM prompt used to predict intents. We can iterate based on this feedback to fine-tune both prompts.
3 – Judge for Production Data
Finally, we can use the judge to label production data without human supervision. This will allow us to discover strengths and weaknesses in our LLM that predicts intents. There may be more problematic categories, phrases that are always misinterpreted due to poor intent definitions, bugs in the bot, etc.
Some useful analyses we can perform on this data (sketched in code after the list):
- Overall accuracy, or equivalently the failure rate (failures over total predictions).
- Prediction volume and failure rate per category.
- Distribution of failures by confidence level of the predictive LLM.
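Assuming the judged production data sits in a table with one row per prediction and hypothetical columns `category`, `judge_verdict`, and `confidence`, each analysis is a couple of lines of pandas:

```python
import pandas as pd

df = pd.read_csv("judged_production_data.csv")  # illustrative file and column names
df["failed"] = df["judge_verdict"] == "incorrect"

# Overall accuracy and failure rate.
print("accuracy:", 1 - df["failed"].mean())

# Prediction volume and failure rate per category.
print(df.groupby("category")["failed"].agg(total="count", failure_rate="mean"))

# Failure rate by the predictive LLM's confidence level.
df["confidence_bin"] = pd.cut(df["confidence"], bins=[0, 0.5, 0.8, 0.95, 1.0])
print(df.groupby("confidence_bin", observed=True)["failed"].mean())
```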
This stage should be at least semi-automated: judgments on production data should run periodically so we get a sense of how the system is performing in real life.
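A minimal way to semi-automate this is a scheduled script (cron, Airflow, or similar) along these lines; `fetch_predictions` and `store_metrics` are placeholders for whatever your stack provides:

```python
import random
from datetime import date, timedelta

def nightly_judge_run(fetch_predictions, store_metrics, sample_size=500):
    """Judge a random sample of yesterday's traffic and persist the metrics."""
    day = date.today() - timedelta(days=1)
    predictions = fetch_predictions(day)  # placeholder data-access function
    sample = random.sample(predictions, min(sample_size, len(predictions)))
    verdicts = [
        judge_prediction(p["user_message"], p["predicted_intent"])  # from the sketch above
        for p in sample
    ]
    store_metrics({"day": day.isoformat(),
                   "judged": len(verdicts),
                   "accuracy": sum(verdicts) / len(verdicts)})
```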
What’s next?
This is an ongoing process. All feedback should be used to continuously improve both the judge and the predictive LLM using the methodology described above.
I’ve talked about asking the judge to only output whether the given classification is OK or not, for the sake of simplicity, but one nice thing to try would be to ask the judge for its suggested category (along with the reasoning) when it considers we’ve made an incorrect prediction. That would give us more insight and help us make better improvements to the whole system.
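One simple way to implement that is to ask the judge for a small JSON object instead of a bare verdict; the schema below is just one possible shape, reusing the `client` from the earlier sketch:

```python
import json

JUDGE_PROMPT_V3 = """(same context and intent descriptions as before)

User message: {user_message}
Predicted intent: {predicted_intent}

Reply with JSON only:
{{"verdict": "correct" or "incorrect",
  "suggested_intent": "<your pick, when the verdict is incorrect>",
  "reasoning": "<one or two sentences>"}}"""

def judge_with_suggestion(user_message: str, predicted_intent: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},  # ask the API for strict JSON output
        messages=[{"role": "user", "content": JUDGE_PROMPT_V3.format(
            user_message=user_message, predicted_intent=predicted_intent)}],
    )
    return json.loads(response.choices[0].message.content)
```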
There are many additional things we can work on, but the most important thing is to build something that works and helps move the product forward; once you achieve that, you can iterate and improve.