How Minsa drove a 22% gain on APEX-Agents

We built Minsa on a single bet: that a company's intelligence should compound. To test it directly, we ran our system on APEX-Agents, one of the hardest public benchmarks of real knowledge work. Giving our agents a shared knowledge graph, a supervisor that catches mistakes before they happen, and a continual-learning loop improved task performance by 22% — using the exact same underlying model throughout. Here is what we did, and what each piece contributed.

The benchmark

APEX-Agents (Mercor) asks an agent to complete long-horizon, cross-application professional-services tasks — investment banking, law, and management consulting — inside realistic "worlds" of files, spreadsheets, filings, email, and tools. Work is graded against expert-written rubrics. It is genuinely hard: even frontier models pass only ~18–24% of tasks. We chose it precisely because it mirrors the messy, multi-step work our customers actually care about, rather than a toy environment.

Success rate vs. accumulated experience within an APEX-Agents world: with continual learning the curve climbs and holds, rather than resetting each task.

What we added

We held the agent's base model fixed and layered on three pieces of groundwork — the same three that make up the Minsa platform.

1. A knowledge graph of the world

Before the agent starts, we mine each world's files — financial models, filings, memos, prior deliverables — into a graph that captures the key entities, where the right data lives, the procedures that produce a correct answer, and the pitfalls that tend to trip agents up. The agent consults this graph at inference, so it starts informed instead of rediscovering the workspace from scratch on every task.

2. A monitor that catches mistakes

A second agent watches the working agent and intervenes the moment it is about to repeat a known error — opening an archived model instead of the final version, pulling a figure from the wrong period, or reporting a number before a dependent input has been updated. Instead of failing silently and being graded as wrong, the agent gets a targeted correction in the moment it matters.

3. Continual learning

After every task, what worked and what failed is written back into the graph as weighted procedures and pitfalls. The next agent — and the next model — inherits those lessons. The system stops repeating mistakes, and performance climbs with experience rather than resetting to zero each run.

Results

Each component helped, and the effects stacked. On mean rubric score, the full system reached 38.1% versus a 31.2% baseline — a +22% relative improvement, with no change to the underlying model.

Mean rubric score on APEX-Agents. Each layer of groundwork adds to the last; the full system is +22% over baseline.

The continual-learning loop matters most over time. As the graph accumulates lessons across tasks in a world, success rate rises and then holds — the agent keeps the gains instead of re-learning the same workspace on every run.

Why it matters

The gain didn't come from a smarter model — the model never changed. It came from the groundwork around it: a shared memory of how the work is done, a supervisor that stops mistakes before they happen, and a loop that turns every task into a permanent lesson. That is the whole thesis of Minsa. Run that loop long enough, across every team in a company, and the curve only points one way.

Evaluated on APEX-Agents (Vidgen et al., 2026). Same base model across all conditions; differences are attributable to the Minsa layer alone.

Book a demo