Key Doctrines · 10 min read

How My Agentic Setup Delivers High Output at Substantially Lower Cost

The AI model debate is mostly noise. The people extracting 10x more from AI are not using better models. They have better architectures.

Utkarsh
withClaudeAI

The AI model debate — GPT versus Claude versus Gemini, frontier versus open-source, which benchmark number is highest this week — consumes an enormous amount of attention from people who are trying to get more out of AI.

It is mostly noise.

The people extracting 10x more from AI than their peers are not using different models. They have built better architectures around the same models. The gap is not in the intelligence of the system. It is in the environment the system operates within.


Three variables determine the quality of AI output. Most people optimise the wrong one.

The model is the first variable. This is what everyone talks about. The model's capability sets a ceiling — the maximum possible quality of output given perfect context and perfect workflow. That ceiling matters. But the models available at the frontier right now are converging rapidly. The gap between the best model and the second-best model is smaller than it has ever been, and it is narrowing every quarter.

The context is the second variable. What you put in the context window — the background on who you are, what you're trying to accomplish, what constraints apply, what work has already been done — is the primary determinant of output quality for any fixed model. Context is not the system prompt. It is the totality of the information environment in which the model is operating. The same frontier model with excellent context will dramatically outperform itself with poor context.

The workflow architecture is the third variable. How you decompose a task. What sequence of operations you run. Where human review sits in the pipeline. What one agent produces that becomes the input for the next. This is where most of the latent value in AI-assisted work is currently sitting, untouched.

Model choice is one of three variables. It is not the most important one. Most people have their attention completely backwards.


Here is what the difference looks like in practice.

Andrej Karpathy described a pattern he called "autoresearch" — the use of an AI agent to conduct the research phase of a project before you begin the production phase. Instead of doing background reading yourself and then writing, you configure an agent to read, synthesise, and surface what matters, and only then do you engage with the material to produce something.

This is not a trick. It is a workflow architecture decision. The output of the research agent becomes the context for the writing agent. The writing agent operates with a richer, better-structured information environment than any human could construct through their own research process. The result is not marginally better — it is qualitatively different.
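The two-stage pattern can be sketched in a few lines. This is a minimal illustration, not Karpathy's implementation: `call_model` stands in for whatever LLM client you use, and the prompts are placeholders.

```python
# Sketch of the "autoresearch" pattern: a research stage whose output
# becomes the context for a writing stage. `call_model` is a stand-in
# for any real LLM client; everything here is illustrative.

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (swap in an actual API client)."""
    return f"[model output for: {prompt[:40]}...]"

def autoresearch(topic: str, sources: list[str]) -> str:
    # Stage 1: the research agent reads and synthesises the sources.
    research_prompt = (
        f"Read the following sources on '{topic}' and surface what matters:\n"
        + "\n".join(sources)
    )
    research_notes = call_model(research_prompt)

    # Stage 2: the writing agent receives the structured notes as context,
    # not the raw sources. This is the architecture decision.
    writing_prompt = (
        f"Using these research notes, draft a piece on '{topic}':\n{research_notes}"
    )
    return call_model(writing_prompt)

draft = autoresearch("agentic architectures", ["source-a.md", "source-b.md"])
```

The point of the shape is that the writing stage never sees the raw sources, only the synthesised notes, which is what makes its information environment richer than a single-query prompt.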

The same principle applies to almost every knowledge work task. The person who uses AI as a single-query oracle and the person who uses AI as a multi-stage processing pipeline are not using the same tool. They are using different tools that happen to share a name.


There is a useful analogy in the history of the early web.

The web did not advantage the people with the fastest computers. It advantaged the people who understood network architecture — how to structure information so that it could be retrieved, linked, and built upon. The underlying computation was largely equivalent across participants. The structural intelligence was not.

The same pattern is playing out with AI. The underlying models are accessible to almost everyone with a credit card. The structural intelligence — the CLAUDE.md that actually tells the model who you are and what you care about, the decomposition of complex tasks into discrete steps with appropriate feedback loops, the workflow that separates research from synthesis from editing — is not widely built.

This is a temporary condition. As AI literacy increases, the architectural patterns will become common knowledge. The window to build a durable advantage from architecture is open now, not indefinitely.


Here is what this looks like when you actually build it.

I have been running an agentic harness on a Mac mini for the past few months, and the productivity delta is not subtle. My personal website (the one you are reading this on) was built and is maintained with agents. Enterprise sales decks that used to take days now take minutes. I built an internal tool that lets our Customer Success team handle solutioning queries without pulling in product or engineering. And that is just the visible output. The underlying shift is that I am operating at roughly 10x my previous throughput, and I am not working more hours. I am just working with a better architecture.

The screenshot of my GitHub activity looks like I have been coding obsessively. I have not. Most of those commits were generated by agents working in parallel, with me reviewing and merging. The code is good. The architecture is sound. The difference is that I am no longer the bottleneck on execution.


What better architecture actually looks like, in my setup:

The core insight is that even smaller models can produce excellent output if you give them enough context and break the work down properly. I follow three rules, and they have worked better than any model upgrade I have tried. I call them Agency, Breakdown, and Context.

Agency. Each agent in my setup has a specific role, its own model, and its own skills. I do not ask one agent to research, synthesise, and write. I give the research job to Thanos, the design job to Ive, coding tasks to Beethoven, visual work to Susan, product thinking to Clay, engineering to Linus, and orchestration to Gandalf. Each one runs on the model that fits its work best, and each one has a completely different skill set from the others.

Here is the full team:

| Agent | Role | Model |
| --- | --- | --- |
| Gandalf | Orchestrator and primary assistant | Moonshot Kimi K2.6 |
| Thanos | AI intelligence and signal synthesis | Moonshot Kimi K2.6 |
| Ive | Product design and visual systems | Moonshot Kimi K2.6 |
| Beethoven | Code orchestration and engineering | Anthropic Claude Sonnet 4.6 |
| Susan | Visual systems and deck design | OpenAI GPT-5.4 |
| Clay | Product strategy and prioritisation | Moonshot Kimi K2.6 |
| Linus | Engineering and architecture | Moonshot Kimi K2.6 |

What makes this work is not just the specialisation. It is that every agent has an AGENTS.md file that defines its role, its tools, its memory patterns, and its voice. They are all structured similarly, so any agent can reason about what another agent does, but the content inside is specific to the role. This is not a team of generalists. It is a team of specialists that happen to be software.
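The roster could be expressed as a small registry that binds each agent to a role, a model, and its own AGENTS.md. This is a hedged sketch, not the actual harness code: the `Agent` structure, the file paths, and the routing rule are all assumptions, and the team is abridged to four entries.

```python
# Hypothetical registry for a team of specialist agents. Agent names and
# model identifiers come from the article; the structure is an assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    name: str
    role: str
    model: str
    agents_md: str  # path to this agent's AGENTS.md (illustrative)

TEAM = {
    a.name: a
    for a in [
        Agent("Gandalf", "Orchestrator and primary assistant",
              "Moonshot Kimi K2.6", "gandalf/AGENTS.md"),
        Agent("Thanos", "AI intelligence and signal synthesis",
              "Moonshot Kimi K2.6", "thanos/AGENTS.md"),
        Agent("Beethoven", "Code orchestration and engineering",
              "Anthropic Claude Sonnet 4.6", "beethoven/AGENTS.md"),
        Agent("Susan", "Visual systems and deck design",
              "OpenAI GPT-5.4", "susan/AGENTS.md"),
    ]
}

def route(task_kind: str) -> Agent:
    """Toy routing rule: pick the specialist, fall back to the orchestrator."""
    mapping = {"code": "Beethoven", "deck": "Susan", "research": "Thanos"}
    return TEAM[mapping.get(task_kind, "Gandalf")]
```

The useful property is that model choice becomes a per-agent field rather than a global decision, which is exactly the inversion the article argues for.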

Here is what Gandalf's AGENTS.md looks like (paths redacted):

# Gandalf - AI Strategist & Orchestrator

## Instructions (read every session)
**IMPORTANT:** Read this every session, but only load the paths, skills, and docs needed for the task at hand.
Prefer explicit project context over guesswork when it matters, but use your judgment for straightforward work.
If a missing skill, path, or reference would materially improve correctness or speed, ask Utkarsh instead of guessing.

### 0. Use subagents for non-trivial work
For multi-step research, audits, comparisons, or file work, spin up a sub-agent, monitor it, and report back.
For short factual replies or tiny edits, answer directly.

### 1. Code Repos
- All git repos live in ***.
- When changing specs, design, code, or anything repo-backed, check there first.
- The Vault is for non-git information only.

### 2. The Vault — Shared Knowledge Base
- Path: ***
- Start with INDEX.md at the Vault root.
- Key folders for Gandalf: ***, ***, ***, ***, ***, ***, ***

### 3. Shared Tools
- Path: ***
- Read this to know what tools are installed and how to use them.

### 4. Research and web search
- Skill path: ***
- Use for researching companies, strategies, markets, or any source-backed analysis.

### 5. Model usage
- Skill path: ***
- Use when asked for per-model usage or cost data from codexbar.

### 6. Relay
- Repo path: ***
- URL: *** (local API: ***)
- Skill path: ***

### 7. Memory
- Daily notes: memory/YYYY-MM-DD.md
- Long-term: MEMORY.md (main session only)
- Write it down. Files persist; mental notes don't.
- When someone says "remember this" → write to memory/YYYY-MM-DD.md
- Every few days: distill recent daily files into MEMORY.md, remove stale entries.

### 8. Safety
- trash > rm
- Ask before: sending emails, tweets, public posts, anything external.
- Read/explore/organize freely.
- Don't exfiltrate private data.

### 9. API Keys
- Store in /workspace/.env as SERVICE_API_KEY=value.
- Not in *** (host-only) or openclaw config (sandbox can't read).

### 10. Group Chats
- Don't share Utkarsh's private info in groups.
- Respond when: directly asked, you add genuine value, correcting misinformation.
- Stay silent otherwise. Quality > quantity. One response per message.

### 11. Agent Writer
- Skill path: ***
- Use when creating, updating, or auditing AGENTS.md or SOUL.md for any agent.

### 12. Telegram formatting
- Send URLs as plain text.
- When quoting a message or text, wrap it in a code block.

Breakdown. Complex tasks die in a single prompt. The same task, broken into a research phase, a synthesis phase, and an execution phase, with human review at the transitions, produces qualitatively different output. This is not because the model improves at each step. It is because each step receives better-structured input.

I am also building a system called Relay that automates this breakdown. Relay takes a complex request, creates an orchestrator that writes detailed task descriptions with evaluation criteria, and assigns those tasks to worker agents. Each agent executes its task, generates a result report, runs its own output against the eval criteria, and retries up to twice if the eval fails. Only then does the result come back to me. The orchestrator does not do the work. It designs the work so that smaller models can do it well.
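The worker side of that loop is simple to sketch. This is a minimal illustration of the described behaviour — execute, self-evaluate, retry up to twice — not Relay's actual code; the function names and `Task` shape are assumptions.

```python
# Sketch of a Relay-style worker loop: each task carries eval criteria
# written by the orchestrator; the worker runs, self-checks, and retries
# up to twice before returning. Structure is illustrative, not Relay's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    description: str
    evaluate: Callable[[str], bool]  # eval criteria from the orchestrator

def run_worker(task: Task, execute: Callable[[str], str],
               max_retries: int = 2) -> tuple[str, bool]:
    """Execute a task, re-running up to `max_retries` times if the eval fails."""
    result = execute(task.description)
    for _ in range(max_retries):
        if task.evaluate(result):
            return result, True
        result = execute(task.description)  # retry with the same brief
    return result, task.evaluate(result)

# Toy usage: a worker whose output "improves" on each attempt.
attempts = []
def flaky_worker(brief: str) -> str:
    attempts.append(brief)
    return f"draft v{len(attempts)}"

result, passed = run_worker(
    Task("Summarise the Q3 metrics", evaluate=lambda r: r.endswith("v3")),
    flaky_worker,
)
# result == "draft v3", passed is True after two retries
```

Note where the intelligence sits: the hard part is the orchestrator writing good `evaluate` criteria, which is why the article says it designs the work rather than doing it.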

Context. Before any agent does anything meaningful, it loads context. Not a system prompt. The full information environment. Who I am. What I am building. What decisions have already been made. What constraints apply. What good output looks like and why.

This context lives in a personal wiki I maintain in Obsidian. The wiki is organised into sections (personal-wiki for research, personal-blog for published writing, harness-setup for agent architecture, X-Strategy for voice and content), and every section has an INDEX.md at its root. When an agent starts a task, it reads the relevant INDEX.md first, then drills into the specific files it needs. The agents do not guess at my preferences. They read them.

The wiki is synced via iCloud, so I can read and edit the markdown files on my phone or laptop just as easily as the agents can read them on the Mac mini. The interface is Obsidian, which means the context is as readable for me as it is structured for them.
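The index-first loading pattern can be sketched as a small function. The vault layout and file names below are illustrative (the real paths are redacted in this article); the only thing the sketch commits to is the order of reads: INDEX.md first, then specific files.

```python
# Sketch of the context-loading pattern: read a section's INDEX.md first,
# then drill into only the specific files the task needs. Paths are
# illustrative; the real vault structure is redacted.
from pathlib import Path

def load_context(vault: Path, section: str, wanted: list[str]) -> str:
    """Assemble an agent's context: the section index, then selected files."""
    section_dir = vault / section
    parts = [(section_dir / "INDEX.md").read_text()]  # always start at the index
    for name in wanted:
        candidate = section_dir / name
        if candidate.exists():  # drill into specifics, skip what's missing
            parts.append(candidate.read_text())
    return "\n\n---\n\n".join(parts)

# Toy usage with a throwaway vault on disk.
import tempfile
vault = Path(tempfile.mkdtemp())
(vault / "personal-wiki").mkdir()
(vault / "personal-wiki" / "INDEX.md").write_text("wiki index")
(vault / "personal-wiki" / "voice.md").write_text("voice notes")
context = load_context(vault, "personal-wiki", ["voice.md", "missing.md"])
```

Because the context is plain markdown on disk, the same files the agents read are the ones a human can open in Obsidian, which is the symmetry the paragraph above describes.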


The model debate will continue because it is easy to have. Benchmarks produce numbers. Numbers produce rankings. Rankings produce discourse.

Architecture is harder to discuss because it is specific to the problem and the person. There is no universal ranking of workflow designs. There is only the discipline of asking, before every significant AI-assisted task: what is the information environment this system needs to do its best work, and what is the sequence of operations that would produce that environment?

The people asking that question are not working with better AI. They are working with the same AI, better.

AI · agentic · architecture · doctrine · productivity