Agent Harness Engineering in practice
Why coding agent performance depends more on the harness than on the model alone, and how to apply that in day-to-day engineering.
If you use coding agents, one idea is worth internalizing:
the agent is not the model.
In practice, what we call an agent is:
agent = model + harness
The model reasons and generates text/code. The harness defines the environment, rules, tools, boundaries, and verification loops.
And this second part is usually what most impacts quality, predictability, and speed in real work.
This article is a practical read based on write-ups from Addy Osmani, Viv Trivedy, and the Cursor team on harness engineering.
What a harness is, objectively
A harness is everything around the model that turns inference into delivery:
- system prompt and rule files (AGENTS.md, skills, conventions)
- tools (shell, git, browser, linters, tests, MCPs)
- context and memory (how knowledge is loaded, summarized, and persisted)
- safety and quality hooks (blocks, checks, approval gates)
- orchestration (subagents, handoff, planner x executor x reviewer)
- observability (logs, traces, cost, latency, error rates)
An excellent model with a weak harness tends to fail in repetitive ways. A good model with a strong harness often performs better in production.
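To make the split concrete, here is a minimal Python sketch of what a harness bundles around the model. The structure and field names are illustrative, not taken from any specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative only: a harness is the stable system wrapped around model calls.
@dataclass
class Harness:
    system_prompt: str                       # rules, conventions, AGENTS.md content
    tools: dict[str, Callable]               # shell, git, tests, linters, MCP servers
    pre_tool_hooks: list[Callable] = field(default_factory=list)   # safety gates
    post_edit_hooks: list[Callable] = field(default_factory=list)  # verification loops
    memory_files: list[str] = field(default_factory=list)          # persisted project knowledge
    log: Callable = print                    # observability: traces, cost, errors

# agent = model + harness: the model generates, the harness constrains and verifies.
```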
Mindset shift: from "blame the model" to "improve the system"
When an agent fails, the default reaction is:
"the model is bad, let's wait for the next version."
Harness engineering suggests another approach:
"what system change prevents this error from happening again?"
Practical examples:
- the agent comments out tests instead of fixing them -> explicit rule + hook blocking .skip(, xit(, and similar patterns
- the agent tries destructive commands -> pre-tool block (rm -rf, git push --force, DROP TABLE)
- the agent gets lost in long tasks -> split roles (planner, executor, evaluator)
- the agent marks work as done without validation -> force typecheck/lint/test as loop back-pressure
This is the strongest part of the concept: every mistake becomes a specification.
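As an illustration of the hook approach, here is a minimal pre-tool hook in Python. The blocked patterns and the (allowed, reason) return convention are assumptions; adapt them to whatever hook API your agent runtime exposes.

```python
import re

# Assumed conventions: patterns and return shape are illustrative, not a standard.
BLOCKED_COMMANDS = [r"rm\s+-rf\s+/", r"git\s+push\s+--force", r"\bDROP\s+TABLE\b"]
BLOCKED_EDIT_PATTERNS = [r"\.skip\(", r"\bxit\("]

def pre_tool_hook(tool_name: str, payload: str) -> tuple[bool, str]:
    """Return (allowed, reason). Called before the agent's tool call executes."""
    if tool_name == "shell":
        for pattern in BLOCKED_COMMANDS:
            if re.search(pattern, payload, re.IGNORECASE):
                return False, f"Blocked destructive command matching {pattern!r}"
    if tool_name == "edit_file":
        for pattern in BLOCKED_EDIT_PATTERNS:
            if re.search(pattern, payload):
                return False, "Blocked edit that disables tests instead of fixing them"
    return True, "ok"
```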
Ratchet effect: each failure becomes permanent protection
Think of the harness as a maturity ratchet.
Whenever a real failure happens:
- record the failure pattern
- decide whether prevention belongs in prompt, hook, tooling, or process
- automate the prevention
- measure whether the failure disappears
This moves you from team memory to system memory.
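One way to make the ratchet tangible is a small, versioned failure registry (for example a JSON or YAML file in the repo). The record format below is a sketch with made-up field names, not a standard.

```python
from dataclasses import dataclass

# Illustrative record format for a versioned failure registry.
@dataclass
class FailurePattern:
    description: str       # what the agent did wrong
    prevention: str        # where the fix lives: "prompt" | "hook" | "tooling" | "process"
    automated_check: str   # the rule or hook that now blocks it
    recurrences: int = 0   # measured after the prevention shipped

# Example entry mirroring the first failure listed above:
registry = [
    FailurePattern(
        description="commented out failing tests instead of fixing them",
        prevention="hook",
        automated_check="pre-tool hook rejects edits containing .skip( or xit(",
    ),
]
```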
Minimum components of a good software harness
If you want a practical starting point, begin here:
- Short, testable rules: keep AGENTS.md concise, objective, and actionable. Use a pilot checklist, not a giant style guide.
- Tools people actually use: fewer tools with clear descriptions beat a huge overlapping menu.
- Automatic feedback loops: code was edited? Run validations. Failed? Feed the errors back to the agent. Passed? Stay silent.
- Safe execution: use sandboxing/isolation when possible. Explicitly block destructive operations.
- Project memory: persist conventions, architecture decisions, and known pitfalls in versioned files.
- Observability: without failure logs and decision traces, you cannot improve harness quality consistently.
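The automatic feedback loop is usually the highest-leverage item to automate first. A minimal post-edit hook might look like the sketch below; the validation commands are placeholders for whatever your project already runs.

```python
import subprocess

# Placeholder commands: swap in your project's real typecheck/lint/test invocations.
VALIDATIONS = [
    ["npx", "tsc", "--noEmit"],   # typecheck
    ["npx", "eslint", "."],       # lint
    ["npx", "vitest", "run"],     # tests
]

def post_edit_hook() -> str | None:
    """Run validations after the agent edits code.

    Returns error output to feed back into the agent's context, or None to stay silent.
    """
    errors = []
    for cmd in VALIDATIONS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            errors.append(f"$ {' '.join(cmd)}\n{result.stdout}\n{result.stderr}")
    return "\n\n".join(errors) if errors else None  # silence means success
```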
Long context hurts agents (if unmanaged)
Another key point: agent performance degrades as context grows, unless you manage it.
As the window grows, agents tend to:
- lose focus
- reason worse
- stop too early
- repeat actions without progress
Three techniques help a lot:
- compaction: summarize and offload older history
- large-output offloading: move long logs to files; keep only the signal in context
- progressive disclosure: load rules/tools on demand, not all at startup
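Here is a sketch of large-output offloading, assuming the harness can intercept tool output before it reaches the context. The size threshold and the keep-the-tail heuristic are arbitrary choices, not recommendations.

```python
from pathlib import Path

# Assumption: the harness lets you transform tool output before it enters context.
MAX_CONTEXT_CHARS = 2_000

def offload_if_large(output: str, name: str, log_dir: Path = Path(".agent/logs")) -> str:
    """Keep short outputs in context; write long ones to a file and return a pointer plus the tail."""
    if len(output) <= MAX_CONTEXT_CHARS:
        return output
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{name}.log"
    path.write_text(output)
    tail = output[-MAX_CONTEXT_CHARS:]  # errors usually show up near the end
    return f"[full output saved to {path}; last {MAX_CONTEXT_CHARS} chars below]\n{tail}"
```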
What Cursor adds in a very practical way
Cursor's post adds an important angle: the harness is a product, not static configuration.
They describe a continuous cycle:
- define the ideal agent experience
- build harness-improvement hypotheses
- test with offline evals + online A/B tests
- keep only changes that improve real quality
High-value ideas to copy:
- Keep Rate for generated code: measure how much agent-proposed code remains in the repo after a fixed period
- semantic user satisfaction signal: infer whether the user was satisfied from follow-up interaction patterns
- tool-call error taxonomy: separate unknown errors (harness bugs) from expected ones (for example invalid args, timeouts, provider outages)
- anomaly alerts by tool and model: detect degradation relative to baselines
- incremental optimization mindset: quality improvements are often many small wins, not one silver bullet
In short: sustainable agent improvement requires product engineering, observability, and disciplined experimentation.
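Keep Rate in particular is cheap to approximate. The sketch below assumes you can already attribute lines to the agent (for example via commit metadata or git blame), which is the hard part in practice.

```python
def keep_rate(agent_lines_added: int, agent_lines_surviving: int) -> float:
    """Share of agent-proposed lines still in the repo after a fixed window (e.g. 30 days)."""
    if agent_lines_added == 0:
        return 0.0
    return agent_lines_surviving / agent_lines_added

# Example: 240 lines written by the agent, 180 still present a month later -> 0.75
print(keep_rate(240, 180))
```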
There is no universal harness
Another key Cursor insight: harnesses must be customized per model.
Practical example:
- if a model is more trained for patch editing, make patch the primary path
- if a model performs better with string replacement, align tools and prompts accordingly
This reduces errors, reasoning overhead, and rework.
General rule:
model-agnostic abstractions outside, model-specific customization inside.
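In code, that rule can be as simple as a per-model dispatch behind a stable interface. The model names and tool labels below are placeholders, not recommendations.

```python
# Placeholder mapping: the stable interface stays model-agnostic,
# while the choice of edit strategy is customized per model.
EDIT_STRATEGY = {
    "model-a": "apply_patch",      # stronger at unified-diff style edits
    "model-b": "string_replace",   # more reliable with exact search/replace edits
}

def pick_edit_tool(model_name: str, default: str = "string_replace") -> str:
    """Model-agnostic interface outside, model-specific customization inside."""
    return EDIT_STRATEGY.get(model_name, default)
```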
A simple playbook for this week
If your team already uses coding agents, you can do this in a few steps:
- List the top 5 most expensive failures from recent weeks.
- For each one, choose a prevention mechanism (rule, hook, validation, flow).
- Update the harness with small, reversible changes.
- Run for a week and compare rework, regressions, and review time.
- Keep only what produced measurable improvement.
This is not about over-constraining the agent. It is about increasing hit rate without relying on luck.
Closing
Models will keep improving, but that does not remove the need for harnesses.
In practice, as models become more capable, harness value often increases:
- to preserve safety
- to guarantee quality
- to scale long-running work
- to coordinate multiple agents toward one goal
If you want to move from casual AI usage to production-grade agent-assisted engineering, start with the harness.
It is the multiplier that turns model potential into real outcomes.