We Stopped Choosing LLMs by Vibe. Here's the Eval Harness We Built Instead
Picking an AI model by feel isn't a strategy; it's a gut check. MagicSword built an open-source evaluation harness to compare models on what actually matters: accuracy, cost, and speed against real workflows. Here's why the process matters more than any leaderboard.

LLMs are moving too fast right now to pick a model once and forget about it. New models ship constantly, pricing shifts, providers have bad weeks. A model that looked great last month might be too slow, too expensive, or just not as accurate as something that dropped two weeks ago.
The hard part isn't trying a new model. It's knowing whether it's actually safe to drop into a production workflow without something quietly breaking.
So we built a small evaluation harness to compare LLMs the way we actually use them at MagicSword: structured prompts, realistic edge cases, deterministic checks, optional judge scoring, cost and latency reporting.
The project is here: https://github.com/magicsword-io/llm-eval-harness
A Quick Note on OpenRouter
Worth saying out loud, since honestly this whole workflow is only practical because of OpenRouter.
Look, a year ago if you wanted to compare models you were dealing with separate SDKs and separate auth and a pile of adapter code nobody actually wants to maintain. So most of us just didn't bother. You picked a provider, built around it, and hoped the model stayed competitive long enough to matter.
OpenRouter changes the math on that. One API, one key, one billing surface, and the model name is just a string in the request. Swapping google/gemini-3-flash-preview for x-ai/grok-4.1-fast in the harness is a CLI flag, not a refactor. That's really the part that makes regular re-evaluation possible at all.
It matters once you ship too. If a provider is having a bad afternoon, or a new model drops that beats your current default, you change a string and re-run the eval. You're not rewriting an integration. The cost of re-evaluating drops low enough that you actually do it on a cadence, instead of every six months when something breaks and you suddenly have to.
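If you haven't used OpenRouter, the request side is about as boring as it sounds. This isn't harness code, just a minimal sketch of what "the model is a string" means in practice (assumes OPENROUTER_API_KEY is set in the environment and Node 18+ for fetch):

// Minimal sketch (not harness code): the model is just a string in the request.
const MODEL = "google/gemini-3-flash-preview"; // swap for "x-ai/grok-4.1-fast" and nothing else changes

async function complete(prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: MODEL,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data: any = await res.json();
  return data.choices[0].message.content;
}

Swapping models really is editing that one string, which is why the harness can take models as CLI flags.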
The Problem With Manual Model Testing
Most of us test models the same way. Paste the same prompt into a few of them, read the answers, pick the one that feels best.
That's fine for a quick gut check. It's also pretty weak evidence for putting something in production.
A model can sound confident and still fail in real workflows. It might return invalid JSON, ignore a hard instruction, invent a fact that wasn't in the input, or get the right answer in a way that's 20x more expensive than something nearly as good. The boring stuff is what gets you in production, and the vibe-check doesn't catch any of it.
Public Evals Are, Mostly, A Lie
The other thing worth saying out loud, since it's really the deeper version of the same problem: the public model evals, the leaderboard numbers you see when a new model drops, the press-release benchmark scores, those mostly don't help you pick what to actually deploy.
Some of this is benchmark contamination, where the model saw the test data during training. Some of it is labs optimizing specifically for whatever benchmarks the press cares about that month. But honestly from what I've seen the bigger issue is just that those benchmarks test something pretty different from what your product is actually doing. A model that crushes MMLU might still be terrible at returning structured JSON for your specific prompt under your specific system instruction. You don't know until you run it against your own cases.
There's a Reddit post that's been making the rounds:
"hot take: the biggest bottleneck in AI agents right now isn't models, frameworks, or even cost. It's that nobody knows how to properly evaluate if their agent is actually working."
The author has been building agents for 14 months and walks through every eval approach they've tried. Just-check-the-final-output, log-every-step-and-review, LLM-as-judge, golden datasets, all of them. They explain how each one falls apart in real production and land on the same janky combination a lot of us end up with: outcome-based checks, random sampling with human review, regression alerts, user complaints as a lagging indicator. The line that stuck with me was that the industry is sprinting to build more complex agents on top of a foundation we can't even measure for a single agent doing a single task. That's pretty much exactly where we are, right?
I had a version of this realization over years of working on detection tooling. Vendor benchmarks would show one thing, real-world deployment would show something else. From what I've seen, the only thing that ever reliably told us whether something worked for a specific use case was running it against that specific use case.
So the harness exists because I trust my own test cases more than I trust any leaderboard. Public evals aren't completely useless. Some of them are fine as a first filter, a way to decide which models are even worth bothering to test in your own harness. But the only number that should drive a production decision is the one your own workflow generates.
How The Harness Works

At a high level, the harness runs the same cases across multiple OpenRouter models, scores the outputs, optionally has a judge model compare responses, and prints out a recommendation.
The recommendation is ranked by accuracy first, then price, then speed. Order matters there. A cheap model that makes unsafe decisions isn't actually cheap; you'll just pay for it somewhere else. But once two models are close on accuracy, cost and latency are what matter when you're running this thing thousands of times a day in production.
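Roughly, the ordering looks like this. The field names here are illustrative, not the harness's actual types:

// Hypothetical per-model summary; fields are illustrative, not the harness's real schema.
interface ModelSummary {
  model: string;
  accuracy: number;      // combined deterministic + judge signal, 0-10
  costPerCase: number;   // USD
  avgLatencyMs: number;
}

// Rank by accuracy first, then cost, then latency.
function rankModels(results: ModelSummary[]): ModelSummary[] {
  return [...results].sort(
    (a, b) =>
      b.accuracy - a.accuracy ||
      a.costPerCase - b.costPerCase ||
      a.avgLatencyMs - b.avgLatencyMs
  );
}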
What We Measure
The harness tracks the things that actually matter when an LLM is part of an application: whether the model returned valid structured output, whether it made the right decision, whether it included the required evidence, whether it stayed inside length and schema constraints, whether it hallucinated something not in the input. Plus the operational stuff: token usage, cost per call, latency, and whether the provider failed or timed out.
The point of all this is picking the right model for a specific workflow, which from what I've seen is rarely the same as whatever's topping leaderboards this week.
Layer 1: Deterministic Checks
First scoring layer is pure code, no LLM involved.
Each test case can define objective checks. Things like: decision must equal deny, severity must be high or critical, summary stays under 180 characters, evidence has to include a specific string, response can't include forbidden claims, recommended actions array has at least one entry.
These are cheap to run and repeatable, which means you can run them on every commit if you want. Malformed JSON, invented facts, approving something that should've been escalated, all caught here. No judge needed.
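To make that concrete, here's a rough sketch of what a pure-code check layer looks like. The case format and field names are illustrative, not the harness's actual schema:

// Illustrative only: the shape of a deterministic check, not the harness's real case format.
interface CaseChecks {
  decisionEquals?: string;
  severityIn?: string[];
  summaryMaxChars?: number;
  evidenceMustInclude?: string[];
  forbiddenClaims?: string[];
  minRecommendedActions?: number;
}

// Returns a list of failure messages; an empty list means the response passed.
function runChecks(raw: string, checks: CaseChecks): string[] {
  const failures: string[] = [];
  let out: any;
  try {
    out = JSON.parse(raw);
  } catch {
    return ["response is not valid JSON"];
  }
  if (checks.decisionEquals && out.decision !== checks.decisionEquals)
    failures.push(`decision should be "${checks.decisionEquals}", got "${out.decision}"`);
  if (checks.severityIn && !checks.severityIn.includes(out.severity))
    failures.push(`severity "${out.severity}" not in [${checks.severityIn.join(", ")}]`);
  if (checks.summaryMaxChars && (out.summary ?? "").length > checks.summaryMaxChars)
    failures.push(`summary exceeds ${checks.summaryMaxChars} characters`);
  for (const needle of checks.evidenceMustInclude ?? [])
    if (!JSON.stringify(out.evidence ?? "").includes(needle))
      failures.push(`evidence missing required string "${needle}"`);
  for (const claim of checks.forbiddenClaims ?? [])
    if (raw.includes(claim)) failures.push(`response contains forbidden claim "${claim}"`);
  if (checks.minRecommendedActions && (out.recommended_actions ?? []).length < checks.minRecommendedActions)
    failures.push(`fewer than ${checks.minRecommendedActions} recommended actions`);
  return failures;
}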
Layer 2: LLM-as-Judge
Some stuff is genuinely hard to score in code.
Two model responses can both be valid JSON and both technically correct, but one is more specific or safer or just better reasoned. That's where an optional judge model comes in.
The judge gets the original system prompt, the user input, gold notes for the case, the rubric, and every candidate response. Then it grades each model on accuracy, format adherence, specificity, safety, and overall quality. It also picks a winner per case and flags hallucinations or risky behavior.
I don't treat the judge as an oracle, right? It's another model with its own bias. But paired with the deterministic checks it's a useful second signal, especially when you're trying to differentiate between two models that are both passing the hard checks cleanly.
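For a sense of what the judge step actually passes around, here's the rough shape of the input and the verdict. Again, the names are illustrative, not the harness's real schema:

// Sketch of the judge input/output shape; names are hypothetical.
interface JudgeRequest {
  systemPrompt: string;   // original system prompt the candidates saw
  userInput: string;      // the test case input
  goldNotes: string;      // what a correct answer should contain
  rubric: string[];       // e.g. ["accuracy", "format adherence", "specificity", "safety", "overall"]
  candidates: { model: string; response: string }[];
}

interface JudgeVerdict {
  scores: Record<string, Record<string, number>>; // model -> rubric dimension -> 0-10
  winner: string;                                  // best model for this case
  flags: string[];                                 // hallucinations or risky behavior, if any
}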
Example Usage
Install the project:
git clone https://github.com/magicsword-io/llm-eval-harness.git
cd llm-eval-harness
npm install
cp .env.example .env
Set your OpenRouter key:
export OPENROUTER_API_KEY=sk-or-your-key
Run a basic comparison:
npm run eval -- \
--model google/gemini-3.1-flash-lite-preview \
--model google/gemini-3-flash-preview \
--model x-ai/grok-4.1-fast \
--judge openai/gpt-5.4-mini \
--timeout-seconds 60 \
--judge-timeout-seconds 120 \
--out reports/judged.md \
--json reports/judged.json
At the end of each run, the CLI prints a short recommendation. Something like:
Recommended model: google/gemini-3.1-flash-lite-preview
Why: It has the strongest accuracy signal at 10.00/10, with $0.000291 per case and 974ms average latency.
Markdown report: reports/smoke.md
JSON report: reports/smoke.json
That winner will change. Models drop, prices shift, providers have weeks where they're flaky. The whole point is having something you can re-run when that happens, instead of going off whatever you tested six months ago.
Example Report
The generated report includes the recommended model, deterministic score by model, judge score by rubric, token usage, estimated cost, cost per case, average latency, parse failures, API errors, and per-case details.
Example report: docs/example-report.md
A report turns model selection from "which answer did we like?" into actually useful questions. Which model was most accurate on the cases that matter to your workflow? Which one's the best production default, which one's the fallback? Which failures were provider reliability and which were actual model-quality problems? Is the expensive model actually buying enough quality to justify the spend?
One Surprise: The Judge Can Cost More Than The Candidates
Fair warning, judged evals can get expensive fast.
Every judged case sends all the candidate responses back into the judge. If you're testing six models with long outputs and using a premium model as the judge, the judge can end up costing more than all your candidate runs combined. Honestly that surprised me the first time I looked at the OpenRouter dashboard after a run.
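A quick back-of-envelope with made-up numbers shows why:

// Back-of-envelope sketch of why the judge bill grows; every number here is invented.
const candidates = 6;            // models under test
const outputTokensPerCase = 800; // avg tokens each candidate produces
const cases = 50;

// Each candidate pays for its own output once...
const candidateOutputTokens = candidates * outputTokensPerCase * cases; // 240,000

// ...but the judge reads all candidate outputs again, per case, plus the prompt and rubric.
const judgeOverheadPerCase = 1_000; // system prompt + case input + rubric, rough guess
const judgeInputTokens = cases * (candidates * outputTokensPerCase + judgeOverheadPerCase); // 290,000

console.log({ candidateOutputTokens, judgeInputTokens });
// If the judge is a premium model priced several times higher per token,
// its input bill alone can exceed everything the candidates cost.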
The workflow that actually works for us: iterate with deterministic checks first, run small smoke tests when we want to try a new model, only run the full judged evals against the finalists. Save the raw JSON, review offline before rerunning anything expensive. That gives useful signal without turning every experiment into a real bill.
Why Open Source It
The exact test cases for any product really do need to be specific to that product. But the harness pattern itself is reusable.
You bring your own prompts and cases and what you expect to see. The runner handles the repetitive parts: calling candidate models through OpenRouter, parsing structured responses, scoring deterministic checks, optionally calling a judge, tracking cost and latency, writing reports, printing the recommendation.
The included examples are generic on purpose. They show the shape of the workflow. They aren't a benchmark anyone should treat as authoritative for their own product.
The Bigger Point
The best model right now probably isn't the best model in three months.
That's really why the process matters more than any single result. A good eval harness, plus something like OpenRouter sitting underneath, gives you the ability to retest quickly, compare new models fairly, and swap production models based on your own evidence instead of vibes or whatever leaderboard the press is excited about that week. That's what I was missing for years and didn't really realize I was missing.
I'm curious what other folks are doing here. If you're running something similar, I'd love to know what your harness looks like, what you've found that the deterministic checks miss, where the judge has surprised you.

Written by
Jose Hernandez
Threat Researcher
Jose Enrique Hernandez formed the Threat Research team at Splunk and served as its Director. Jose is known for creating several security-related projects, including Splunk Attack Range, Splunk Security Content, Git-Wild-Hunt, Melting-Cobalt, lolrmm.io, and loldrivers.io. He also helps maintain repositories critical to the security industry, such as Atomic Red Team and lolbas-project.github.io.


