How to Build an AI Evaluation System That Actually Works
Hard-won lessons from building AI products in the real world
Over the last few months, I've been building Zinc, a company focused on automating people management. In that time, I've made a spectacular number of mistakes and learned a ton about how to build AI products that people actually want to use.
I'm not an expert on this topic—far from it. But I've had to learn quickly, often through embarrassing failures that taught me more than any blog post ever could.
The good news? You don't need massive datasets or deep AI expertise to get started. My co-founder Kemble and I built something that works with just 65 labeled examples from 20 meeting transcripts. We're not AI researchers; we're learning as we go. But seeing specific examples of how AI tools are actually built is what made this feel accessible and achievable, and I hope sharing our process will inspire others to see how they could implement these techniques in their own products.
This is the story of how we learned to build an AI evaluation system that actually worked—complete with the messy details, false starts, and hard-won lessons.
The New Reality: When "Good Enough" Isn't Good Enough
The first harsh lesson: AI products break in ways traditional software doesn't.
Picture this scenario that became our obsession: Sarah, an engineering manager, sits down to write annual reviews for her five direct reports. She needs to synthesize 12 months of feedback from weekly 1-on-1s. The problem? She can barely remember what she discussed yesterday, let alone the nuanced feedback she gave Jake about his presentation skills back in March.
Sarah does what most managers do—focuses on whatever's most recent. Jake's review becomes all about that bug he shipped last week, not his steady improvement in technical communication over the entire year. It’s a case of recency bias, and it's everywhere in performance management.
Our solution seemed elegantly simple: listen to 1-on-1 meetings, automatically identify and extract feedback, then synthesize themes across the full year. Managers could give feedback naturally in conversation, knowing it would be captured and organized for them.
The core AI task appeared straightforward: take a meeting transcript and identify all instances of feedback, then tag each piece with key attributes like giver, receiver, whether it's positive or constructive, and how significant it is.
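For concreteness, here's a minimal sketch of the kind of structured record we were aiming for per feedback instance. The field names and categories below are illustrative assumptions, not our exact schema.

```python
# A minimal sketch of one extracted feedback instance.
# Field names and categories are illustrative assumptions, not Zinc's actual schema.
from dataclasses import dataclass
from typing import Literal

@dataclass
class FeedbackInstance:
    quote: str                                      # transcript excerpt containing the feedback
    giver: str                                      # who gave the feedback
    receiver: str                                   # who the feedback is about
    sentiment: Literal["positive", "constructive"]  # positive vs. constructive
    significance: Literal["low", "medium", "high"]  # worth surfacing in an annual review?

example = FeedbackInstance(
    quote="Great job on the presentation yesterday, your slides were really clear.",
    giver="Sarah",
    receiver="Jake",
    sentiment="positive",
    significance="medium",
)
```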
Then we tried to build it.
The Moment We Realized We Were Fooling Ourselves
We wrote our first prompt, fed it some meeting transcripts, and got results that were... sort of correct? Sometimes?
The model would nail obvious feedback like "Great job on the presentation yesterday, your slides were really clear." But then it would flag collaborative discussions like "Maybe we should reorganize this section of the document" as performance feedback, even though that's just two people working together.
So we'd make the prompt more specific. Problem solved, right? Wrong. Now it was missing subtle but important feedback like a manager's gentle suggestion that an employee "might want to consider speaking up more in team meetings."
Each iteration followed the same maddening pattern:
Run the model on a few transcripts
Manually review outputs to spot problems
Tweak the prompt to fix obvious issues
Discover we'd fixed one problem but created two new ones
Repeat until insanity sets in
We were playing whack-a-mole with model outputs, except the moles were multiplying faster than we could whack them. The humiliation came during a call with a user who mentioned spending half the meeting giving feedback on a document, yet the only thing our system picked up was a comment about her team's vacation schedule. I was mortified. She was polite, but I could tell she was thinking, "What exactly is this supposed to do for me?"
That moment crystallized the problem: we had no systematic way to know if our "improvements" were actually making things better overall. We might fix precision (fewer false positives) but tank recall (missing real feedback). We needed a way to measure progress systematically, not just eyeball results and hope for the best.
We needed an evaluation system. We just had no idea how to build one.
Building Ground Truth: The Unsexy Foundation That Changes Everything
Here's what nobody tells you about AI evaluations: the hardest part isn't the fancy metrics or automated pipelines. It's creating ground truth data that doesn't suck.
Ground truth is your north star—a dataset where you know the correct answers so you can measure how well your system performs. In our case, that meant taking real meeting transcripts and manually identifying every instance of feedback.
Don't let perfect be the enemy of good. Our 20 meetings felt like a small dataset, but it gave us the signal we needed to make real improvements. Even 20 well-chosen examples will reveal your biggest failure modes.
The Manual Slog That You Can't Skip
We developed what we generously called a "process" for creating our annotated dataset. It became the foundation of everything that followed:
Listen through transcripts using text-to-speech at 3X speed
Use ChatGPT, Gemini, etc. for a rough first pass to identify potential feedback
Manually review and clean up using our best judgment
Work independently then compare results to resolve disagreements
The independence part was crucial. When Kemble and I annotated the same data separately, disagreements revealed where our definitions needed tightening. There are formal metrics for measuring inter-annotator agreement, but given our scale we didn't need that level of precision.
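If you do want a quick number, percent agreement and Cohen's kappa are the standard measures, and they take only a few lines to compute. A minimal sketch using scikit-learn, assuming both annotators labeled the same candidate spans as feedback or not:

```python
# Quick inter-annotator agreement check, assuming two annotators labeled the
# same candidate spans as feedback (1) or not feedback (0). Labels are made up.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)  # agreement corrected for chance

print(f"Raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```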
We set aside a separate test set of meetings to avoid evaluation set contamination—basically ensuring we weren't accidentally teaching our system the answers to the test.
Soon enough, we had 20 meticulously annotated meetings containing 65 labeled instances of feedback. It was tedious work, but it became the bedrock that let us actually improve our system.
There are SaaS tools like Datasaur, LightTag, and Labelbox Starter that can streamline this annotation process, but we hadn't yet reached the scale where investing in them made sense. Our scrappy manual approach worked fine for getting started.
Teaching Machines to Grade Themselves
Now came the part that felt like science fiction: how do you automatically evaluate whether your AI correctly identified feedback?
Since AI outputs are non-deterministic (slightly different each time), you can't just do string matching. "Great presentation skills" and "excellent presentation abilities" mean the same thing, but traditional programming would see them as completely different.
Enter "LLM as a judge"—use a language model to evaluate your model's output. It sounds recursive and weird, but it works remarkably well for subjective tasks.
We built an evaluator that classified each piece of model output into one of three categories:
True Positive (TP): The model correctly identified feedback that actually exists in the ground truth.
False Positive (FP): The model thought something was feedback, but it wasn't in our ground truth dataset.
False Negative (FN): The model missed feedback that we know exists.
Our evaluator worked by giving a language model three pieces of information: the original meeting transcript, our ground truth feedback examples, and what our system had identified as feedback. We then asked it to think step-by-step and classify each piece of identified feedback into one of the three categories above.
The key was being extremely specific in our instructions to the evaluator. We told it exactly what constituted real feedback versus casual conversation, and asked it to explain its reasoning before making each classification.
This evaluator became our automated referee, consistently applying the same standards across all test cases. For the technical implementation, we used LangChain to handle the plumbing—prompt templating, model switching, and result aggregation. It saved us weeks of building infrastructure from scratch, though it's not the most exciting part of the story.
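To make that concrete, here's a rough sketch of what such a judge chain can look like with LangChain. The prompt wording, model name, and output format are illustrative assumptions rather than our exact production setup.

```python
# Rough sketch of an LLM-as-judge evaluator wired up with LangChain.
# Prompt wording, model name, and output format are illustrative, not Zinc's exact setup.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

judge_prompt = ChatPromptTemplate.from_template(
    """You are grading a feedback-extraction system.

Original transcript:
{transcript}

Ground truth feedback (human-labeled):
{ground_truth}

System output (candidate feedback items):
{system_output}

Think step by step. Classify each candidate item as TP (it matches a ground
truth item) or FP (it is not real feedback), and list every ground truth item
the system missed as FN. Explain your reasoning for each classification, then
end with a JSON object of the form {{"tp": [...], "fp": [...], "fn": [...]}}."""
)

# The judge should be at least as capable as the model being evaluated.
judge = judge_prompt | ChatOpenAI(model="gpt-4.1", temperature=0) | StrOutputParser()

verdict = judge.invoke({
    "transcript": "Sarah: Great job on the presentation yesterday, your slides were really clear. ...",
    "ground_truth": '[{"quote": "Great job on the presentation yesterday...", "giver": "Sarah", "receiver": "Jake"}]',
    "system_output": '[{"quote": "Great job on the presentation yesterday...", "giver": "Sarah", "receiver": "Jake"}]',
})
print(verdict)
```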
The Moment We Could Finally Measure Success
With our classification system running, we could finally calculate the metrics that would guide our optimization. This was the turning point where systematic improvement became possible.
We focused on three key measurements:
Precision: Of all the things our model flagged as feedback, how many actually were feedback? High precision means fewer false alarms—your model isn't crying wolf every time someone mentions improvement.
Recall: Of all the actual feedback in the meeting, how much did our model find? High recall means you're not leaving important feedback on the table.
F1 Score: The harmonic mean of precision and recall. You can't game it by having stellar precision but terrible recall, or vice versa. Both need to be decent for a good F1 score.
Think of your AI like a metal detector on a beach. Precision is: of all the times it beeped, how many were actually treasure (not bottle caps)? Recall is: of all the treasure buried on the beach, how much did you find? F1 score is: are you finding lots of treasure without wasting time digging up junk?
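Once the judge has produced TP, FP, and FN counts for your test set, the metrics themselves are a few lines of arithmetic. A minimal sketch, with made-up counts:

```python
# Turning the judge's TP/FP/FN counts into precision, recall, and F1.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # flagged items that were real feedback
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # real feedback the model actually found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 40 correct hits, 10 false alarms, 15 misses
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=15)
print(f"Precision: {p:.2f}  Recall: {r:.2f}  F1: {f1:.2f}")  # Precision: 0.80  Recall: 0.73  F1: 0.76
```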
We set practical benchmarks for our specific use case rather than running formal statistical tests for significance; we didn't need that level of rigor yet. But as we scale, that's something we'll add.
The Optimization Journey: What Actually Moved the Needle
With our evaluation system finally in place, we could start the real work: systematic improvement. This is where we learned which changes actually mattered versus which ones just felt important.
Through weeks of methodical experimentation, we discovered which improvements had outsized impact:
The big wins:
Upgrading to GPT-4.1: Single biggest improvement across all metrics. Better models really do matter more than clever prompting.
Few-shot prompting: Including 2-3 perfect examples in our prompt dramatically improved consistency (see the sketch after these lists). The model finally understood what we wanted.
Cleaning up transcripts: Removing filler words and fixing obvious transcription errors. Garbage in, garbage out is still the iron law.
Better voice-to-text quality: Investing in transcript accuracy lifted all our metrics simultaneously.
The smaller wins that added up:
Specific prompt instructions: Being extremely clear about what counts as "significant enough for a performance review"
Temperature tuning: Lower temperature (0.2-0.3) for more consistent outputs
Smarter LLM-as-judge: Using a more sophisticated model for evaluation improved reliability
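To make the few-shot and temperature points concrete, here's a rough sketch of an extraction prompt that embeds a couple of labeled examples and runs at a low temperature. The wording and examples are illustrative, not our production prompt.

```python
# Sketch: few-shot examples in the extraction prompt plus a low temperature.
# Wording and examples are illustrative, not Zinc's production prompt.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

extraction_prompt = ChatPromptTemplate.from_template(
    """Identify performance feedback in the transcript below. Only include remarks
about a person's work or behavior that are significant enough to matter in a
performance review; ignore collaborative suggestions about shared documents.

Example 1 (feedback):
"Great job on the presentation yesterday, your slides were really clear."
-> positive feedback about presentation skills.

Example 2 (not feedback):
"Maybe we should reorganize this section of the document."
-> two people working together on a document, not performance feedback.

Transcript:
{transcript}

Return a JSON list of feedback items with giver, receiver, sentiment, and significance."""
)

extractor = extraction_prompt | ChatOpenAI(model="gpt-4.1", temperature=0.2)
result = extractor.invoke({"transcript": "Sarah: Jake, great job on the presentation yesterday..."})
print(result.content)
```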
When Good Enough is Good Enough
One of the hardest lessons was learning when to stop optimizing and start shipping. Perfect is the enemy of shipped, especially in AI products.
We established clear stopping criteria early on:
Minimum viable F1: 0.70 for our MVP launch
User feedback threshold: If beta users stopped complaining about obvious errors
The key insight: your stopping criteria should be driven by user needs, not just improving metrics for their own sake.
The Transformation: From Alchemy to Engineering
The transformation was remarkable. Before our evaluation system, improving our AI felt like alchemy—mysterious, unreliable, and frustrating. After building systematic evaluation, it became engineering—predictable, measurable, and actually enjoyable.
We went from "this seems to work pretty well" to "this correctly identifies 87% of important feedback with 92% precision." That confidence transformed how we built features, talked to users, and made product decisions.
The evaluation system paid dividends in ways we hadn't anticipated:
Faster debugging: When things went wrong, we could quickly isolate whether it was a model issue, prompt issue, or data quality problem.
Better user conversations: Instead of defensive discussions about "why didn't it catch this," we could have productive conversations about trade-offs and priorities.
Team alignment: Having clear metrics gave us a shared understanding of what "good" looked like.
Confidence to iterate: The safety net of systematic evaluation made us bolder about trying new approaches.
Hard-Won Lessons for Your AI Journey
After months of building, breaking, and rebuilding our evaluation system, here are the lessons I wish someone had told me at the start:
Embrace the manual work: Yes, creating ground truth is tedious. Yes, you'll want to shortcut it with synthetic data or existing datasets. Don't. The manual work forces you to understand your problem deeply and creates the foundation for everything else. It's not just annotation—it's product discovery.
Start stupidly simple: Your first evaluation system should be embarrassingly basic. We started with manual spot-checks before building automated pipelines. Build the simplest thing that gives you systematic feedback, then evolve from there.
Connect metrics to user happiness: F1 scores are great, but they're meaningless if they don't connect to actual user outcomes. Always be able to explain why improving your metric makes your product better for real people.
LLM-as-judge works better than expected: Using AI to evaluate AI feels weird at first, but it's remarkably effective for subjective tasks. Just make sure your judge is at least as capable as the model you're evaluating.
Build for debugging, not just measurement: Your evaluation system isn't just about knowing if things work—it's about understanding why they don't. Design for debugging from day one.
If I Were Doing This Again
If I were starting over, I'd begin with a focused 48-hour sprint that looked like this:
Day 1: Pick 20 or so examples of your AI's current output. Manually label what "good" looks like for each one. Work with someone else if possible—disagreements reveal where your definitions need clarity. Document your decisions in a simple spreadsheet.
Day 2: Write a simple LLM-as-judge prompt using the structure described above (transcript, ground truth labels, system output, step-by-step classification). Run it on your 20 examples and calculate basic precision and recall. Identify the top 2 failure patterns. Make one targeted improvement to your system.
After 48 hours, you'll have a baseline F1 score (probably between 0.4 and 0.7 for most first attempts), clear visibility into your biggest problems, and the confidence to iterate systematically instead of guessing.
Once you have signal from 20 examples, scale to 50+ examples for shipping confidence, then build automated pipelines for continuous evaluation.
Why this works: Our first evaluation with 20 meetings taught us more about our system than weeks of manual spot-checking. You don't need academic rigor to make product decisions—you need systematic feedback loops.
The anti-pattern is waiting until you have "enough" data. Start measuring with what you have today. The pattern that works is: small dataset → clear signal → targeted improvement → repeat.
The Bigger Picture
We're still in the early innings of the AI product revolution, but one pattern is already clear: teams that figure out systematic evaluation will build better products faster than those that don't.
This isn't just about AI companies or technical teams. Every business will soon have AI components, and someone needs to ensure they work reliably. Understanding evaluation principles—even at a high level—will become a core professional skill.
Whether you're trying to ship reliable AI features, building an AI-first company, or researching how these systems affect human behavior, the fundamentals remain the same: you need systematic ways to measure and improve AI performance.
Don't wait for perfect conditions or massive datasets. Start small, get signal fast, and build from there. The most important step is the first one: admitting that "it seems to work" isn't good enough for AI products you want people to trust and use.
Our company isn’t changing the world yet—but it’s working. We’re starting to help managers write better reviews, and if we can make headway while learning on the fly, others can too.