SHARE

AI Harnesses and CLIs, Explained: The Real Reason Everyone's Talking About Infrastructure

The model is the engine. The harness is the car. Here's what that means, why it matters for your workflow, and whether you should build your own.

Written By

Grant Harvey

Mar 12, 2026

8 minute read

Every few months, AI Twitter discovers a new word and won't shut up about it. Right now, that word is "harness."

OpenAI published a whole blog post about "harness engineering." Salesforce wrote a 3,000-word explainer. Phil Schmid, Hugging Face's former technical lead, called harnesses "the most important concept in 2026." And meanwhile, there are now 15+ AI coding CLIs fighting for space in your terminal, all doing some version of the same thing.

If you've been nodding along to these conversations while privately wondering what any of it actually means, this one's for you.

First up, the TL;DR
The Big Shift: From Models to Systems
What a Harness Actually Contains
The CLI Landscape: Your Harness in a Terminal
Should You Build Your Own?
What This Means for Non-Developers

First up, the TL;DR

The reason everyone's talking about "harnesses" is deceptively simple: the AI model stopped being the bottleneck.

Last year, companies raced to get the smartest model. This year, the companies winning are the ones that built the best systems around the model. That's a harness.

Think of it like this: the model is an engine. A harness is the car built around it. Steering, brakes, GPS, seatbelts. It decides what the model can access, when it needs human approval, how it manages memory, and what happens when something goes wrong.

OpenAI proved this with Codex, which built a production app with over 1M lines of code where zero lines were written by humans. LangChain's coding agent jumped from 52.8% to 66.5% on a benchmark by changing nothing about the model, only the harness. Vercel got better results by removing 80% of the tools available to their agent.

So what makes up a harness? Five core components:

Human-in-the-loop controls that pause the agent at critical decisions.
Context management that keeps working memory focused (compaction, sub-agents).
Tool orchestration that defines which external tools the agent can reach.
Lifecycle hooks that trigger actions at specific moments (auto-format, run tests).
System-level instructions via project files like CLAUDE.md that teach the agent your rules.

Now, CLIs (command-line interfaces) are the most visible harnesses you interact with today. Claude Code, OpenAI Codex CLI, and Google Gemini CLI are all harnesses packaged as terminal tools. They wrap a model in file access, permission systems, MCP connections, and skill systems.

Should you build your own? For most people, no. The existing CLIs and platforms are already excellent harnesses you customize with skills, plugins, and connectors. Start with a CLAUDE.md or AGENTS.md file (that's already a mini-harness). Add skills. Then add hooks. Only graduate to a custom harness when the built-in options limit you.

The punchline: you're probably already building harnesses without calling them that. Every time you write a system prompt, create a skill, or configure a connector, you're wrapping a model in infrastructure. The vocabulary is new. The practice isn't.

The Big Shift: From Models to Systems

For most of 2024 and 2025, the AI conversation was about intelligence. Which model scored highest on benchmarks? Who released the biggest parameter count? Could GPT-5 reason better than Claude 3.5?

That conversation has quietly moved on. The defining insight of 2026 is that model quality is necessary but not sufficient. You need a good model, yes. But the difference between "impressive demo" and "production system that works reliably" comes from everything built around the model.

OpenAI's harness engineering blog makes this case using their own experience building a million-line production application entirely with Codex. The engineers spent their time designing constraints, documentation, feedback loops, linters, and lifecycle management. They spent zero time writing code. Their key finding: the repository itself had to become the single source of truth, because from the agent's perspective, anything it can't access in-context doesn't exist. That Slack discussion where your team aligned on an architectural pattern? If it's not written down in the repo, the agent has never heard of it.

Manus, the agent startup, rewrote their harness five times in six months. Same models each time. Five different architectures. Each rewrite improved reliability and task completion. The model didn't get smarter; the system around it got better.

Phil Schmid frames it with a useful analogy: the model is the CPU, the context window is the RAM, and the harness is the operating system. The OS curates context, handles the boot sequence (system prompts, hooks), and provides standard drivers (tool handling). You wouldn't run software on raw silicon. You shouldn't run agents on raw models.

What a Harness Actually Contains

Strip away the jargon and a harness has five layers. Every production AI system uses some version of these, whether it calls them "a harness" or not.

1. Permission and Approval Systems

The most basic layer: what is the agent allowed to do, and when does it need to ask?

Claude Code defaults to a sandboxed mode where it can edit files in your project folder but needs permission for anything else (network access, running commands). You can loosen this with rules that automatically approve specific commands. OpenAI Codex offers similar tiers: read-only, workspace-write, and full access.

This is the layer that prevents your agent from emailing your entire customer list or deleting your production database. It sounds obvious, but it's the first thing most people skip when building custom systems, and the first thing that causes catastrophic failures.

2. Context Engineering

Models have finite context windows (the amount of text they can "see" at once). A harness manages what goes in that window and what stays out.

This includes compaction (summarizing older parts of the conversation to free up space), sub-agents (spinning up separate workers that handle a research task in isolation and return just a summary), and progressive disclosure (loading skill instructions only when they're relevant, not all at once).

OpenAI's team found that using AGENTS.md as a "table of contents" with pointers to deeper documentation worked better than a single massive instruction file. Claude Code uses a similar pattern with CLAUDE.md, which acts as a persistent context document that the agent reads every session.

3. Tool Orchestration

Which external tools can the agent access? How does it call them? What happens when a tool call fails?

This is where MCP (Model Context Protocol) comes in. MCP is the open standard that Anthropic introduced in 2024, and it's now the shared infrastructure layer across Claude Code, Codex, Gemini CLI, and many other tools. It standardizes how agents discover, authenticate with, and call external services.

A key finding from multiple teams: less is more. Vercel stripped 80% of available tools from their agent and got dramatically better results. With fewer options, the agent made fewer redundant calls and took fewer unnecessary steps. The takeaway: don't give your agent access to everything. Give it access to the minimum set of tools it needs for the task.

4. Lifecycle Hooks

Hooks trigger actions at specific moments in the agent's workflow. Claude Code calls them "hooks." Codex calls them "hooks." (At least they agreed on something.)

Examples: run a linter after every file edit. Check permissions before executing a command. Auto-format code on save. Run tests before committing. These automate the quality-control steps that humans would otherwise need to do manually after every AI interaction.

5. System-Level Instructions

Every harness includes a mechanism for persistent, project-specific instructions. CLAUDE.md, AGENTS.md, GEMINI.md, .cursorrules... the name varies, the function is identical: a document that tells the agent how your specific project works.

This is the cheapest, highest-leverage piece of harness engineering. A well-written project file can dramatically improve output quality with zero code changes. It's also the piece that most people underinvest in.

The CLI Landscape: Your Harness in a Terminal

AI coding CLIs are the most popular type of harness today. They wrap a model in file system access, a permission layer, MCP connections, and a skill system, then expose it all through your terminal.

Here's how the three major players compare:

Claude Code (Anthropic) is widely considered the strongest for complex, multi-file tasks. It features sub-agents, agent teams (multiple agents working in parallel), and deep codebase understanding via agentic search. It requires a paid Claude subscription. Install with npm install -g @anthropic-ai/claude-code.

Codex CLI (OpenAI) emphasizes parallel task delegation and built-in worktree support (so multiple agents can work on the same repo without conflicts). It's included with any ChatGPT paid plan. The Codex desktop app adds automations, skill management, and a review queue. Install with npm install -g @openai/codex.

Gemini CLI (Google) offers the most generous free tier (60 requests/minute, 1,000 requests/day) and a massive 1M-token context window. Good for large codebases and documentation-heavy projects. It also includes built-in Google Search and web fetching.

Beyond the big three, there are now 15+ CLI tools in this space, including Aider, GitHub Copilot CLI, Warp, Cline, and others. The market went from "Copilot vs. Cursor" to a full ecosystem in about a year.

All of them support MCP. All of them use some form of project-level config file. And all of them are, underneath the branding, harnesses.

Should You Build Your Own?

For most people: no. The existing CLIs and platforms are already very good harnesses, and they're getting better fast. The customization tools (skills, plugins, connectors, hooks) let you shape them to your workflow without starting from scratch.

Here's a simple decision framework:

Start with project config files. Write a good CLAUDE.md or AGENTS.md. This alone can transform output quality. It's the minimum viable harness, and it costs nothing.

Then add skills. If you repeat the same process regularly (code review, deployment, report generation), package it as a skill. Skills are instruction files the agent loads on demand. They require no coding.

Then add hooks. If you want automated quality checks (linting, formatting, testing) after every agent action, add hooks to your project config. This is a light touch of infrastructure that saves significant review time.

Only build a custom harness if you've maxed out these options and still have unmet needs. The most common reasons to build custom:

You need to orchestrate multiple models (e.g. Gemini for planning, Claude for execution)
You need domain-specific safety rails that existing permission systems don't cover
You're building a product where AI is the core functionality, not a development tool
You have a workflow that requires integrations or data flows that MCP and existing connectors don't support

If you do build custom, start with the Claude Agent SDK or OpenAI's Agents SDK. Both provide the agentic loop, tool calling, and handoff mechanisms so you don't have to build those from scratch.

What This Means for Non-Developers

Here's the part that gets lost in the technical conversation: you don't need to be a developer to benefit from harness thinking.

Every time you write a custom instruction in ChatGPT, you're building a tiny harness. Every time you create a project in Claude with uploaded reference documents, you're engineering context. Every time you install a skill or connect a tool via the Customize tab, you're assembling tool orchestration.

The vocabulary is new. The practice is one you've probably been doing for months.

The people getting the most out of AI in 2026 are the ones who invest 15 minutes configuring their setup: writing clear instructions, connecting the right tools, and teaching their AI how their specific work gets done. The model's intelligence is table stakes. The system you build around it is the differentiator.