Pay Once for the Reasoning

The culmination

Lou opened with a confession: for weeks he’d been working separate problems — context management, token cost, model routing, sharing skills across projects without duplication. This week they fused. Each one of them helped figure the other thing out.

The originating pain: many projects, all needing the same skills. He didn’t want skills duplicated into every project, didn’t want all fifty to sixty skills’ front matter loaded into global context — he measured roughly 45 to 50 thousand tokens of overhead paid on every query, even a simple hello. And he didn’t want to hand-maintain local copies. The destination is an architecture where intelligence lives in the environment and a project inherits only what it declares.

He presented it as a narrated 25-minute deck — from chat to system — built for three audiences: members new to the language, builders who care about where context lives, and operators asking: where in my business am I still manually carrying context the environment could carry for me?

Fork versus spawn

The session’s sharpest distinction. Both fork and spawn run work in an isolated context and return only the result — neither pollutes the main conversation. The difference is the starting state.

A fork inherits the parent’s context. Useful when the child should know what’s already been decided — the parent picked the article’s angle; the drafter should honor it. A spawn starts cold. Useful when inherited context would contaminate the work — an adversarial reviewer shouldn’t inherit the parent’s confidence in the current direction; you want a colder read.

The whole decision reduces to one question: would the child do better seeing what the parent decided? Yes, fork. No, spawn.

Model altitude

Lou’s correction to his own starting question. Which model should I use? is asked at the wrong altitude. The final artifact is one thing, but the process that builds it is many kinds of work — research, angle selection, drafting, copy-editing, fact-checking — each with different needs.

The router emits a small record per step: component, step-class, model, effort, and a rationale. The rationale is what makes a bad output debuggable. Was the classification wrong, the model wrong, the effort wrong, or did the rationale miss the real risk? The routing logic lives once in a shared file so every workflow reuses it. This is modularity applied to judgment. The standing rule: assign the least excessive inference that still clears the bar. Never default everything to the strongest model at high effort.

Pay once for the reasoning

An idea Lou had read that morning and unpacked three times because the inversion is easy to miss. Instead of using a smart model for hard work and a cheap one for easy work, you commit to the cheap model — say Haiku at high effort — and hire the smart model for one job: write the prompt that lets the cheap model perform like the expensive one.

Opus knows Haiku’s capabilities and limits intimately, so it bakes the reasoning, strategy, and think here cues into the instructions. Pay once for the intelligence; reuse it on every cheap inference after. Reported 20 to 75 percent gains, and it pairs with DSPy-style auto-optimization for a further lift.

Kasimir connected this to Nate Herk’s dynamic-downgrade approach — routing puts intelligence in the model choice; this puts it in the prompt, with the tier fixed at the bottom on purpose. Scott and Lou agreed the open question is where Sonnet finally beats Haiku-with-a-great-prompt — and that effort level is a third dial on top of model choice.

Plugins beat manifests and symlinks

Lou narrated the dead ends so members don’t repeat them. A manifest works but doesn’t surface under native slash invocation. Symlinks fail because Claude doesn’t reliably follow them. What works is plugins from a marketplace: a version-controlled Git repo with a marketplace manifest, publishing either one plugin per skill or bundles. A project declares which plugins to install; a tightly-scoped plugin loads only front matter, keeping context cheap.

Two bonuses: version-pinning and clean client distribution. One worth noting: the skill-creator tool tends to fill the 1024-character description limit. Twenty skills at roughly 1000 characters is roughly 20 thousand tokens of wasted context. When generating a skill, tell it to keep the description tight — a couple of trigger keywords and a one-to-two-line description.

The deck was the demo

The 25-minute presentation was itself an artifact of the architecture. Lou fed Claude three days of chat exports, asked it to pull out the main ideas and make a cohesive presentation, and got a second-draft script. ElevenLabs voiced it; Claude built the slides.

The forked writing team that produced a sample article showed the payoff concretely: an orchestrator with scan, architect, draft, review, and polish stages — each forked, each returning only a summary plus an artifact path. Drafts moved by file path, never pasted into the parent, so 68K, 63K, 61K, and 56K of intermediate work stayed out of the main conversation. Only the final draft entered context.

The heavy close

Joanna surfaced the fear under the optimism — citing reports that insiders say the opposite in private of what they say in public about AI safety. Lou gave his most unguarded read of the year. He credited Anthropic’s transparency in admitting they hold back the most powerful models until precautions exist. But he can’t square the utopian everyone-will-have-money-and-free-time story with how power actually concentrates. What boils his blood most: the technology was developed on the back of humanity’s collective knowledge and didn’t pay for it.

He landed on the only answer he trusts: community and leverage. When we have enough people wanting to do something, we do it together, we’ll figure out a way. Donald closed it cleanly: The more powerful a tool is, the more it cuts both ways — the more potential for good and harm.

What to try before next session

Take one task you currently run on Opus or your premium model and try to move it down a tier. Hand the premium model this instead:

I’m going to run this task on Haiku. Knowing that model’s specific capabilities and limits, write a prompt that gets it to perform this task as well as you would — include the strategy, when to think step-by-step, and anything it’s likely to get wrong without being told.

Run Haiku with that prompt. Compare to your old output. Where it holds up, you’ve just made that task roughly ten times cheaper — and you’ve felt the pay-once-for-reasoning pattern in your own hands.