Meta-Harness: End-to-End Optimization of Model Harnesses

Link: https://arxiv.org/abs/2603.28052

Authors: Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn (Stanford/MIT) — March 30, 2026

The central claim of this paper is deceptively simple: the performance of an LLM system depends not just on the model weights but also on the harness, meaning the code that decides what to retrieve, what context to inject, what tools to expose, and how outputs are validated. The authors show that the same model, with the same architecture and the same compute, can show a 6× performance gap depending purely on how its harness is designed. This reframes where the leverage actually is: instead of waiting for a better model, you can often get more out of the one you already have.
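To make the term "harness" concrete, here is a minimal Python sketch, assuming a harness can be reduced to three pluggable steps (retrieve, build the prompt, validate the output). None of this comes from the paper: the `Harness` class, `toy_model`, and the `KNOWLEDGE` table are hypothetical names invented for illustration, and tool exposure is left out for brevity. The point is only that a fixed model can score very differently under two harnesses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical fact store that a harness may or may not consult.
KNOWLEDGE = {"capital of France": "Paris", "capital of Japan": "Tokyo"}


def toy_model(prompt: str) -> str:
    """Fixed stand-in 'model': it answers only if the fact is in its prompt."""
    lines = prompt.splitlines()
    question = lines[-1]
    for line in lines[:-1]:
        key, _, value = line.partition(" = ")
        if key and key in question:
            return value
    return "unknown"


@dataclass
class Harness:
    """Illustrative harness: everything around the model call."""
    retrieve: Callable[[str], List[str]]            # what to retrieve
    build_prompt: Callable[[str, List[str]], str]   # what context to inject
    validate: Callable[[str], bool]                 # how outputs are validated

    def run(self, model: Callable[[str], str], query: str) -> Optional[str]:
        docs = self.retrieve(query)
        output = model(self.build_prompt(query, docs))
        return output if self.validate(output) else None


def accuracy(harness: Harness, dataset: List[tuple]) -> float:
    """Fraction of (query, answer) pairs the harness gets right."""
    hits = sum(harness.run(toy_model, q) == a for q, a in dataset)
    return hits / len(dataset)


def is_answer(output: str) -> bool:
    return output != "unknown"


# Harness A: no retrieval, the query is passed through unchanged.
bare = Harness(
    retrieve=lambda q: [],
    build_prompt=lambda q, docs: q,
    validate=is_answer,
)

# Harness B: looks up matching facts and injects them above the query.
with_retrieval = Harness(
    retrieve=lambda q: [f"{k} = {v}" for k, v in KNOWLEDGE.items() if k in q],
    build_prompt=lambda q, docs: "\n".join(docs + [q]),
    validate=is_answer,
)

DATASET = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]

bare_acc = accuracy(bare, DATASET)
retrieval_acc = accuracy(with_retrieval, DATASET)
```

The model function is identical in both runs; only the surrounding code changes. The gap in this toy is an artifact of how it was constructed and says nothing about the size of the effect the paper measures.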

The system works through an "agentic proposer": a coding agent with access to the full source code, execution traces, and scores of all prior harness candidates. Unlike conventional optimization loops with hand-designed mutation operators, Meta-Harness lets the agent decide what to look at, which failure modes to address, and whether to make a targeted patch or a deeper rewrite. The reported results are strong: +7.7 accuracy points on online text classification (with 4× fewer context tokens), +4.7 accuracy points on 200 IMO-level math problems when the discovered harness is transferred across five held-out models, and performance that surpasses hand-engineered baselines on TerminalBench-2 for agentic coding.
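The outer loop described above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: `Candidate`, `meta_harness_search`, `toy_evaluate`, and `toy_propose` are hypothetical names, the "harness" is just a string of source code, and the proposer is a deterministic stub standing in for a coding agent. The one property it tries to preserve is that the proposer receives the whole history (source, trace, and score of every prior candidate) and decides for itself what to change.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    source: str   # full harness source code
    trace: str    # execution trace recorded during evaluation
    score: float  # benchmark score for this harness


def meta_harness_search(
    propose: Callable[[List[Candidate]], str],
    evaluate: Callable[[str], Tuple[str, float]],
    seed_source: str,
    iterations: int,
) -> Candidate:
    """Run the propose/evaluate loop and return the best candidate seen."""
    trace, score = evaluate(seed_source)
    history = [Candidate(seed_source, trace, score)]
    for _ in range(iterations):
        new_source = propose(history)  # proposer sees all prior candidates
        trace, score = evaluate(new_source)
        history.append(Candidate(new_source, trace, score))
    return max(history, key=lambda c: c.score)


# --- Toy stand-ins (assumptions for illustration only) ---------------------

REQUIRED = ["retrieve", "validate", "retry"]  # made-up harness features


def toy_evaluate(source: str) -> Tuple[str, float]:
    """Score a harness by which made-up features its source mentions."""
    missing = [f for f in REQUIRED if f not in source]
    trace = "missing: " + ",".join(missing)
    score = (len(REQUIRED) - len(missing)) / len(REQUIRED)
    return trace, score


def toy_propose(history: List[Candidate]) -> str:
    """Stub proposer: read the best candidate's trace, patch one failure."""
    best = max(history, key=lambda c: c.score)
    missing = best.trace.split(": ", 1)[1]
    if not missing:
        return best.source  # nothing left to fix
    return best.source + "\n" + missing.split(",")[0] + "()"


best = meta_harness_search(toy_propose, toy_evaluate, "call_model()", iterations=3)
```

In a real setting, `propose` would be an agent that chooses which traces to read and whether to patch or rewrite, and `evaluate` would run the harness on a benchmark. The stub here always makes a one-line targeted patch, which is the simplest of the behaviors the paragraph describes.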

If you are building any AI system that wraps a model, which is every production AI system, this paper argues that you are likely leaving the majority of your performance on the table by optimizing the model instead of the harness. It is one of the clearest articulations yet of why "prompt engineering" is the wrong unit of analysis, and why the full execution environment deserves the same rigor as model selection.