The cheapest token is the one you never send. That sounds obvious until you measure an agent: the same system prompt, the same tool definitions, the same pasted files, re-sent on every single turn of the loop. Brevity stops being a writing tip and becomes a budget. One Show HN project reported cutting more than 60% of tokens from agentic tasks just by killing repeated context across turns, and a recent guide on applying brevity and language efficiency to prompt engineering argues you can do this without losing output quality. The two ideas are the same idea, viewed from different ends.
The pattern
The shift is from prompts as prose to prompts as payload. A one-shot chat prompt is sent once, so a few extra polite sentences cost almost nothing. An agent is different. It runs a loop: observe, act, re-evaluate, repeat. Each iteration ships the accumulated context back to the model, so a 500-token preamble on a 20-turn task is not 500 tokens, it is 10,000. Brevity, in this setting, is not about reading nicely. It is about not paying for the same words ten or twenty times.
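The multiplication is easy to check. Below is a minimal sketch of the arithmetic; the function names and the 200-token per-turn growth figure are illustrative assumptions, not numbers from either source.

```python
def resent_tokens(preamble_tokens: int, turns: int) -> int:
    """Tokens spent on a fixed preamble that is re-sent on every turn."""
    return preamble_tokens * turns


def cumulative_input_tokens(preamble_tokens: int, growth_per_turn: int, turns: int) -> int:
    """Total input tokens when each turn re-sends the preamble plus all history so far.

    Assumes (hypothetically) that every turn adds the same number of tokens
    to the history, and that the whole history is carried forward unpruned.
    """
    return sum(preamble_tokens + growth_per_turn * t for t in range(turns))


print(resent_tokens(500, 20))                 # 10000
print(cumulative_input_tokens(500, 200, 20))  # 48000
```

The second figure shows why the preamble is only part of the bill: under these assumptions, unpruned history grows the total far faster than the preamble alone.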
The brevity guide reframes prompt writing as compression with a fixed structure: context, task, constraint, output format. Everything that does not fall into one of those four buckets is a candidate for deletion. Its sharpest rule is uncomfortable for anyone who likes clean writing: sacrifice grammar before sacrificing precision. Drop articles and conjunctions, keep the technical signal. Models price information density, not syntactic elegance.
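One way to enforce the four buckets is to make them the only fields a prompt can have. This is a sketch, not the guide's own format: the `CompactPrompt` class, the uppercase labels, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CompactPrompt:
    """A prompt restricted to the four buckets; anything else has nowhere to go."""
    context: str
    task: str
    constraint: str
    output_format: str

    def render(self) -> str:
        # The label names are an illustrative choice, not a standard.
        return "\n".join([
            f"CONTEXT: {self.context}",
            f"TASK: {self.task}",
            f"CONSTRAINT: {self.constraint}",
            f"OUTPUT: {self.output_format}",
        ])


prompt = CompactPrompt(
    context="Python 3.12, FastAPI service, Postgres",
    task="add pagination to GET /items",
    constraint="no new dependencies; keep response schema",
    output_format="code only, unified diff",
)
print(prompt.render())
```

The example values drop articles and conjunctions but keep every technical term, which is the grammar-before-precision trade in practice.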
Why now
Two things converged. First, practitioners moved from single mega-prompts to agentic loops, because one-shot prompting hits a context ceiling: a 200-file refactor does not fit in the window, so the model guesses at what it cannot see. Loops solve the correctness problem by breaking work into small, checked steps. But they create a cost problem, because looping means re-sending context, and re-sending context means every wasted token is paid for again on each turn.
Second, pricing pressure made the multiplication visible. As the economics of long-running agents came under scrutiny (see Anthropic pausing token-based Agent SDK billing), teams started auditing where their tokens actually go. The answer is rarely the clever instruction. It is the boilerplate that rides along on every turn.
How it works in practice
Concrete techniques, ordered roughly by payoff:
- Hunt repeated context first. The single biggest win is removing what gets resent each turn: stale tool output, full file contents the agent already acted on, prior reasoning it no longer needs. This is where the 60%+ cuts come from. Summarize completed steps into a short state note instead of carrying their raw transcript forward.
- State the stack once. Declare environment, framework, and versions at session start. Do not restate them in every message. Carry forward only what changed.
- Use the four-bucket structure. Context, task, constraint, output format. If a sentence is none of those, delete it. Pleasantries ("I hope this helps") are pure overhead.
- Ask for the artifact, not the essay. If you want code, say "code only, no explanation." Output tokens are billed too, and an unrequested walkthrough is the same waste as a verbose prompt, just on the other side.
- Paste snippets, not files. Send the relevant function and a placeholder for boilerplate. Whole-file pastes are the most common avoidable bloat.
- Split overloaded tasks. A request that bundles five concerns forces the model to hold all five in context. Sequential, atomic prompts keep each turn lean and, conveniently, map cleanly onto loop iterations.
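The first technique above, replacing the raw transcript of finished steps with a short state note, can be sketched as follows. The `Step` structure and the `STATE:` label are assumptions made for illustration; a real agent framework will have its own message types.

```python
from dataclasses import dataclass


@dataclass
class Step:
    summary: str     # one-line outcome, written when the step finishes
    raw_output: str  # full tool output or transcript for the step
    done: bool


def build_context(steps: list[Step]) -> str:
    """Carry forward summaries of completed steps and raw output only for open ones."""
    notes = [f"- {s.summary}" for s in steps if s.done]
    open_outputs = [s.raw_output for s in steps if not s.done]
    parts = []
    if notes:
        parts.append("STATE:\n" + "\n".join(notes))
    parts.extend(open_outputs)
    return "\n\n".join(parts)


steps = [
    Step("renamed get_user to fetch_user in api.py", "<400 lines of diff>", done=True),
    Step("tests pass after rename", "<full pytest log>", done=True),
    Step("update docs", "README.md lines 10-40 ...", done=False),
]
print(build_context(steps))
```

Only the open step keeps its raw material. If a finished step's detail turns out to matter later, the summary was too thin, which is the failure mode discussed in the next section.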
The trade-off
Brevity has a failure mode: cutting signal and calling it efficiency. Telegram-style prompts that strip grammar can also strip the disambiguating word that kept the model on track. Over-aggressive context pruning is worse: the agent silently loses information it needed and starts guessing, which is exactly the failure that pushed people toward loops in the first place.
The test is simple: if shortening the prompt changes the output, you removed a constraint, not filler, and you should put it back.
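That test can be run as a small regression check. The sketch below assumes a hypothetical `model` callable that maps a prompt to an output. The stub stands in for a real API call, and because real models are not deterministic, the comparison function is left pluggable instead of hard-coding string equality.

```python
from typing import Callable


def shortening_is_safe(
    model: Callable[[str], str],
    long_prompt: str,
    short_prompt: str,
    same: Callable[[str, str], bool] = lambda a, b: a == b,
) -> bool:
    """Return True if the shortened prompt yields an equivalent output."""
    return same(model(long_prompt), model(short_prompt))


def stub_model(prompt: str) -> str:
    # Stand-in for a real model: it honours a JSON constraint only if it sees one.
    return '{"files": []}' if "JSON" in prompt else "Here are the files: none"


long_prompt = "Hello! I hope you are well. Please list the files. Output format: JSON."
filler_cut = "List files. Output: JSON."
constraint_cut = "List files."

print(shortening_is_safe(stub_model, long_prompt, filler_cut))      # True
print(shortening_is_safe(stub_model, long_prompt, constraint_cut))  # False
```

With a real model, `same` might compare parsed JSON, test results, or a rubric score over several samples instead of raw strings.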
There is also a real tension with prompt caching. Some providers cache stable prefixes cheaply, which can make a long, unchanging system prompt less costly than its raw token count suggests. Measure your actual billed tokens before optimizing on instinct. Brevity is the default, not a dogma.
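A rough model of that tension, under loudly stated assumptions: the cached rate of 0.1 (cached prefix tokens billed at 10% of the normal input price) is a made-up illustrative figure, and the sketch ignores any cache-write surcharge or cache expiry a provider may apply. Check your provider's pricing page before relying on it.

```python
def effective_input_tokens(
    prefix_tokens: int,
    fresh_tokens: int,
    turns: int,
    cached_rate: float,
) -> float:
    """Input tokens billed over a loop, expressed in full-price token equivalents.

    cached_rate is the fraction of the normal price charged for a cached
    prefix token (hypothetical; 1.0 means no caching discount at all).
    The first turn pays full price for everything.
    """
    first_turn = prefix_tokens + fresh_tokens
    later_turn = prefix_tokens * cached_rate + fresh_tokens
    return first_turn + (turns - 1) * later_turn


no_cache = effective_input_tokens(2000, 300, turns=20, cached_rate=1.0)
with_cache = effective_input_tokens(2000, 300, turns=20, cached_rate=0.1)
print(round(no_cache))    # 46000
print(round(with_cache))  # 11800
```

Under these assumed numbers, a stable 2,000-token prefix costs far less than its raw count suggests, so trimming it may matter less than trimming the fresh, uncached tokens added each turn.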
Where it goes next
Expect brevity to get automated. The same way linters flag dead code, agent frameworks will start flagging dead context: tokens shipped every turn that never influence a decision. Loop design and token budgeting will merge into one discipline, where "what does each iteration actually need to see" is answered explicitly rather than by dumping the whole history forward. The teams that win on cost will not be the ones with the cleverest prompts. They will be the ones who send the least.
Write prompts as if you pay per word. In an agent, you usually do, once per turn.