When you’re testing an AI agent, token costs are irrelevant. When that agent is making 10,000 decisions per day, they’re a line item the CFO notices.
Here are the strategies we use to keep agent costs manageable without sacrificing decision quality.
Know Your Cost Baseline First
Before you optimize anything, instrument everything. You need per-agent, per-task token cost data. Most teams don’t have this, which means they’re optimizing blindly.
Track:
- Input tokens per agent call
- Output tokens per agent call
- Which tools are being called and how often
- Which tasks are generating outlier costs
The distribution is almost always heavy-tailed: 20% of tasks generate 80% of token spend.
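A minimal ledger for this kind of instrumentation can be sketched as follows. The record fields, prices, and class names here are illustrative assumptions, not any particular provider's billing API; swap in your actual per-token rates and usage metadata.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical per-call usage record; field names are assumptions,
# not any specific provider's response schema.
@dataclass
class CallRecord:
    agent: str
    task_id: str
    input_tokens: int
    output_tokens: int
    tools_called: list = field(default_factory=list)

class CostLedger:
    """Accumulates token spend per agent and per task."""

    def __init__(self, input_price_per_1k=0.003, output_price_per_1k=0.015):
        # Placeholder prices; use your model's actual rate card.
        self.records = []
        self.input_price = input_price_per_1k
        self.output_price = output_price_per_1k

    def log(self, record: CallRecord):
        self.records.append(record)

    def cost(self, r: CallRecord) -> float:
        return (r.input_tokens / 1000) * self.input_price + \
               (r.output_tokens / 1000) * self.output_price

    def cost_by_task(self) -> dict:
        totals = defaultdict(float)
        for r in self.records:
            totals[r.task_id] += self.cost(r)
        return dict(totals)

    def outliers(self, top_fraction=0.2):
        """Top fraction of tasks by spend, plus their share of total spend.

        On heavy-tailed workloads this is where the 20%/80% split shows up.
        """
        by_task = sorted(self.cost_by_task().items(), key=lambda kv: -kv[1])
        k = max(1, int(len(by_task) * top_fraction))
        top = by_task[:k]
        total = sum(c for _, c in by_task)
        share = sum(c for _, c in top) / total if total else 0.0
        return top, share
```

Run `outliers()` weekly; if the top 20% of tasks hold steady above ~80% of spend, that short list is where every optimization below should start.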
Optimization 1: Prompt Compression
System prompts get bloated. Every rule someone adds brings a few more tokens with it. Six months in, your system prompt is 4,000 tokens of partially redundant instructions.
Audit your system prompts quarterly. Remove redundancies. Convert verbose instructions to compact formats. We typically find 30-40% compression is achievable without behavior change.
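A quarterly audit is easier to enforce when the before/after is measurable. This sketch uses a rough characters-per-token heuristic (not a real tokenizer) and an invented verbose/compact pair to show the kind of compression being measured:

```python
# Rough token estimate (~4 chars/token). A heuristic for audits only;
# use your provider's tokenizer for exact counts.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

# Invented example of accumulated, partially redundant instructions.
VERBOSE = """
You must always make sure that whenever you respond to a user you respond
in valid JSON format. It is very important that the JSON is valid.
Please do not ever include any markdown formatting in your response.
Remember: the response must be JSON, and must never contain markdown.
"""

# The same rules in compact form.
COMPACT = "Respond in valid JSON only. No markdown."

def compression_ratio(before: str, after: str) -> float:
    """Fraction of tokens removed by the rewrite."""
    return 1 - approx_tokens(after) / approx_tokens(before)
```

The point of the ratio is accountability: record it per audit, and rerun your eval suite on the compact prompt so "no behavior change" is verified rather than assumed.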
Optimization 2: Context Window Discipline
The biggest cost driver is almost always unnecessary context. Agents that receive everything “just in case” spend most of their token budget on context that’s irrelevant to the current task.
The fix: treat context assembly as a first-class concern. Before each agent call, explicitly select only the context relevant to the current task. That requires knowing what each task actually needs, which is exactly what the baseline audit gives you.
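One simple way to make context assembly explicit is a per-task allowlist. The task names and context keys below are illustrative assumptions (borrowing the article's carrier-rates example), not a prescribed schema:

```python
# Map each task to the context sections it actually needs.
# Task names and keys are hypothetical examples.
CONTEXT_NEEDS = {
    "classify_exception": ["order_summary", "exception_details"],
    "propose_resolution": ["order_summary", "exception_details",
                           "carrier_rates", "business_rules"],
}

def assemble_context(task: str, available: dict) -> str:
    """Build the prompt context from only the sections this task needs.

    Unknown tasks fall back to everything, which is safe but expensive;
    that fallback rate is itself worth tracking.
    """
    needed = CONTEXT_NEEDS.get(task, list(available))
    parts = [f"## {key}\n{available[key]}" for key in needed if key in available]
    return "\n\n".join(parts)
```

The allowlist doubles as documentation: when someone asks why an agent can't see the carrier rates, the answer is one dictionary entry away.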
Optimization 3: Model Routing
Not every decision needs GPT-4 or Claude Opus. A triage decision — “is this exception routine or non-routine?” — can often be made by a smaller, cheaper model.
We use a routing layer that classifies task complexity and routes to the appropriate model. Simple, well-defined tasks go to fast/cheap models. Complex, ambiguous tasks go to the frontier models.
This alone typically reduces per-decision costs by 40-50%.
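A routing layer can start far simpler than it sounds. This sketch uses a keyword heuristic for the routine/non-routine triage split; the patterns and model names are placeholders, and production routers often replace the heuristic with a cheap classifier model:

```python
# Placeholder model identifiers; substitute your actual cheap and
# frontier model names.
CHEAP_MODEL = "small-fast-model"
FRONTIER_MODEL = "frontier-model"

# Hypothetical markers of routine exceptions for this workload.
ROUTINE_PATTERNS = ("address correction", "delivery delay", "duplicate order")

def classify_complexity(task_description: str) -> str:
    """Crude triage: known-routine patterns are 'simple', all else 'complex'."""
    text = task_description.lower()
    if any(p in text for p in ROUTINE_PATTERNS):
        return "simple"
    return "complex"

def route(task_description: str) -> str:
    """Send simple tasks to the cheap model, everything else to the frontier model."""
    if classify_complexity(task_description) == "simple":
        return CHEAP_MODEL
    return FRONTIER_MODEL
```

The design choice that matters is the default: ambiguous tasks fall through to the frontier model, so routing errors cost money rather than decision quality.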
Optimization 4: Caching
Agents often re-retrieve the same context repeatedly. Carrier rates, business rules, reference data — this content doesn’t change between calls, but gets included in the context window every time.
Semantic caching at the retrieval layer can eliminate a significant fraction of retrieval calls. We also cache system prompts and static context at the inference layer where the provider supports it (both Anthropic and OpenAI offer prompt caching).
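At the retrieval layer, even an exact-match cache with a TTL captures much of the win for static reference data. This sketch keys on a normalized-query hash; a true semantic cache would key on embedding similarity instead, and the class and TTL here are assumptions for illustration:

```python
import hashlib
import time

class RetrievalCache:
    """TTL cache for retrieval results over slow-changing reference data
    (carrier rates, business rules). Exact-match on normalized query text;
    a semantic cache would match on embedding similarity instead."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def _key(self, query: str) -> str:
        # Normalize whitespace and case so trivially different queries collide.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_fetch(self, query: str, fetch):
        """Return a cached result, or call fetch(query) and cache it."""
        key = self._key(query)
        entry = self.store.get(key)
        now = time.monotonic()
        if entry and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = fetch(query)
        self.store[key] = (now + self.ttl, value)
        return value
```

Track the hit rate: if it stays low on data you believe is static, either the queries aren't normalizing to the same key or the data isn't as static as assumed.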
Results in Practice
On a recent logistics client engagement, we applied all four strategies to an order exception agent that was handling 8,000 exceptions per day:
- Prompt compression: -28% on system prompt tokens
- Context discipline: -35% on input tokens overall
- Model routing: -45% on model costs (routing ~60% of calls to Haiku/mini)
- Caching: -20% on retrieval costs
Total: 62% cost reduction. Decision quality (measured by exception resolution accuracy) was unchanged.
The optimization work took about two weeks. At 8,000 decisions per day, the payback period was under a month.