Cost Controls & Rate Limiting
A runaway loop at 3AM. No spending cap. Monday morning surprise.
The Failure Scenario
A data-processing agent is deployed to summarize support tickets nightly. It works fine for weeks. Then one night, a malformed ticket triggers a retry loop: the agent calls the LLM, gets an output it can't parse, retries with a longer prompt that includes the failed attempt, gets another unparseable output, and retries again. Each iteration appends more context, making the prompt longer and the cost per call higher. The loop runs for six hours.
The team wakes up to a $14,000 charge on their OpenAI dashboard. The agent made 2,300 API calls, each progressively more expensive because the growing prompt consumed more input tokens. There was no per-task budget, no per-hour spending cap, no alert threshold, and no circuit breaker on retry count. The agent did exactly what it was programmed to do: retry until it succeeds. Nobody programmed it to stop.
The fix took five minutes: add a max-retry count and a per-task token budget. The $14,000 took considerably longer to explain to finance.
Why This Matters
LLM API costs scale with usage in ways that traditional compute does not. A stuck loop in a regular application burns CPU cycles, which are capped by your instance size. A stuck loop in an agent burns tokens, which are metered by the API provider with no upper bound. The failure mode isn't a slow response or a crashed process. It's an unbounded financial liability that grows for as long as the loop runs, and faster than linearly when each retry lengthens the prompt.
Cost explosions are also hard to detect in real time because most teams monitor infrastructure metrics (CPU, memory, request latency), not LLM-specific metrics (tokens per task, cost per conversation, calls per hour). By the time an infrastructure alert fires, if one fires at all, the damage is already done. You need agent-level cost observability that tracks spend as a first-class metric.
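As a sketch of what "spend as a first-class metric" can look like, the snippet below emits one structured JSON record per LLM call so cost dashboards and alerts can aggregate by task, tenant, and hour. The `log_llm_call` helper and its field names are illustrative, not a standard API:

```python
import json
import time

def log_llm_call(task_id: str, tenant_id: str, input_tokens: int,
                 output_tokens: int, cost_usd: float) -> str:
    """Emit one structured log line per LLM call. Aggregating these by
    task_id, tenant_id, or time window gives agent-level cost visibility
    that infrastructure metrics (CPU, memory, latency) cannot."""
    record = {
        "ts": time.time(),
        "task_id": task_id,
        "tenant_id": tenant_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost_usd, 6),
    }
    return json.dumps(record)
```

In practice you would ship these records to whatever log pipeline you already run; the point is that token counts and cost appear on every call, not just in the provider's monthly invoice.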
Beyond runaway loops, cost control matters for multi-tenant systems where different customers have different usage tiers. Without per-tenant rate limiting, a single power user can consume your entire API budget. Without per-task budgets, a complex query can blow through a month's allocation in a single conversation. These aren't operational conveniences. They're business requirements.
How to Implement
Implement cost controls at three layers: per-call, per-task, and per-period. Per-call limits cap the maximum tokens (input + output) for any single LLM invocation. Per-task limits cap the total spend for a complete agent task (which may involve multiple LLM calls, tool uses, and retrieval steps). Per-period limits cap hourly, daily, and monthly spend with hard cutoffs that stop execution, not just send alerts.
For loop detection, track the call pattern within a task. If the agent has made more than N calls to the same tool with similar arguments, or if the prompt length is growing monotonically across retries, trigger a circuit breaker. This catches the most common cost-explosion pattern: recursive retries with accumulating context. The circuit breaker should kill the task and emit a structured alert, not silently retry with a backoff.
Rate limiting at the API gateway level prevents individual users or tenants from monopolizing throughput. Use a token-bucket algorithm with per-tenant buckets. Set the bucket size to the tenant's tier allocation and the refill rate to their per-minute allowance. When the bucket is empty, return a 429 with a clear retry-after header rather than queuing indefinitely.
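A minimal per-tenant token bucket might look like the sketch below. The class name and parameters are illustrative, not a specific gateway's API; a production deployment would typically back the buckets with shared state such as Redis rather than in-process memory:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    capacity: float        # bucket size = tenant's tier allocation
    refill_per_sec: float  # per-minute allowance divided by 60
    tokens: float = field(init=False)
    last: float = field(init=False)

    def __post_init__(self):
        self.tokens = self.capacity   # start full
        self.last = time.monotonic()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Refill based on elapsed time, then take n tokens if available.
        On False, the gateway should return 429 with a Retry-After header."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def retry_after_seconds(self, n: float = 1.0) -> float:
        """Seconds until n tokens will be available, for the Retry-After header."""
        deficit = max(0.0, n - self.tokens)
        return deficit / self.refill_per_sec if self.refill_per_sec > 0 else float("inf")
```

Keeping one bucket per tenant (e.g. in a `dict[str, TokenBucket]` keyed by tenant ID) gives each tier its own throughput ceiling without letting one tenant starve the others.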
from dataclasses import dataclass, field
import time

@dataclass
class CostPolicy:
    max_tokens_per_call: int = 8_000
    max_calls_per_task: int = 15
    max_cost_per_task_usd: float = 2.50
    max_cost_per_hour_usd: float = 50.00
    max_cost_per_day_usd: float = 500.00
    loop_detection_threshold: int = 3  # same tool, similar args

@dataclass
class TaskBudget:
    policy: CostPolicy
    calls_made: int = 0
    total_tokens: int = 0
    total_cost_usd: float = 0.0
    tool_call_history: list = field(default_factory=list)

    def check_budget(self, estimated_tokens: int, estimated_cost: float) -> str | None:
        if self.calls_made >= self.policy.max_calls_per_task:
            return f"CIRCUIT BREAKER: max calls ({self.policy.max_calls_per_task}) exceeded"
        if self.total_cost_usd + estimated_cost > self.policy.max_cost_per_task_usd:
            return f"BUDGET EXCEEDED: task would cost ${self.total_cost_usd + estimated_cost:.2f}"
        if estimated_tokens > self.policy.max_tokens_per_call:
            return f"TOKEN LIMIT: {estimated_tokens} exceeds per-call max {self.policy.max_tokens_per_call}"
        if self._detect_loop():
            return "LOOP DETECTED: repeated tool calls with similar arguments"
        return None  # all clear

    def record_call(self, tokens: int, cost: float, tool: str, args_hash: str):
        self.calls_made += 1
        self.total_tokens += tokens
        self.total_cost_usd += cost
        self.tool_call_history.append({"tool": tool, "args_hash": args_hash, "ts": time.time()})

    def _detect_loop(self) -> bool:
        if len(self.tool_call_history) < self.policy.loop_detection_threshold:
            return False
        recent = self.tool_call_history[-self.policy.loop_detection_threshold:]
        return len(set(e["args_hash"] for e in recent)) == 1

Production Checklist
- Set per-call token limits that match your model's effective context window. Reject prompts that exceed the limit before sending
- Implement per-task budgets with hard stops, not just warnings. A task that exceeds its budget must terminate, not log and continue
- Configure hourly and daily spend caps at the API gateway level with automated circuit breakers
- Add loop detection that identifies repeated tool calls with identical or near-identical arguments
- Monitor prompt-length growth across retries. Monotonically increasing prompt size is a loop indicator
- Set up real-time cost alerting at 50%, 80%, and 100% of daily budget thresholds via PagerDuty or Opsgenie
- For multi-tenant systems, enforce per-tenant rate limits with token-bucket algorithms at the gateway
- Log every LLM call with token count, estimated cost, task ID, and tenant ID for post-hoc analysis
- Run a monthly cost-per-task analysis to identify tasks that consistently approach budget limits
- Test your circuit breakers in staging with intentionally looping agents. Verify they actually stop execution
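The prompt-growth check from the list above can be sketched in a few lines; the function name and default window are illustrative choices, not a fixed standard:

```python
def prompt_growth_suspicious(prompt_lengths: list[int], window: int = 3) -> bool:
    """Flag a task whose last `window` prompts grew strictly on every retry:
    the signature of a retry loop that appends each failed attempt to context."""
    if len(prompt_lengths) < window:
        return False
    recent = prompt_lengths[-window:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Record each call's prompt length on the task budget and run this check before every retry; a positive result should trip the same circuit breaker as the repeated-arguments check.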
Common Pitfalls
The most common failure is setting alerts without circuit breakers. An alert that pages an engineer at 3AM when the budget hits $500 does not stop the agent from spending another $5,000 before the engineer wakes up, reads the alert, opens their laptop, VPNs in, and kills the process. Alerts complement circuit breakers, but they do not replace them. The system must be able to stop itself.
Another pitfall is calculating costs only on output tokens. Input tokens are often the larger expense, especially for agents that use retrieval-augmented generation or include long conversation histories. A cost-estimation function that ignores input tokens will consistently underestimate actual spend by 2-5x, and your budget checks will trigger too late.
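A cost estimator that accounts for both sides might look like the sketch below. The per-million-token rates are placeholders, not any provider's actual pricing; substitute your model's current rates:

```python
def estimate_call_cost(input_tokens: int, output_tokens: int,
                       input_rate_per_m: float = 2.50,
                       output_rate_per_m: float = 10.00) -> float:
    """Estimate the USD cost of one LLM call from both input and output
    tokens. Rates are per million tokens and are placeholder values."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000
```

Note how the input side dominates for retrieval-heavy calls: 100,000 input tokens cost far more here than 1,000 output tokens, which is exactly the spend an output-only estimator misses.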
Teams also forget to account for tool-calling costs. When an agent makes a function call, the tool definitions are included in the prompt on every turn. If you have 20 tools with detailed descriptions, that's thousands of tokens of overhead on every LLM call. Factor tool-definition tokens into your per-call cost estimates, and consider dynamically loading only the tools relevant to the current task to reduce baseline token consumption.
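Both ideas can be sketched with a hypothetical tag-based tool registry and a rough characters-per-token heuristic (a real tokenizer gives more accurate counts):

```python
import json

def tool_overhead_tokens(tools: list[dict], chars_per_token: int = 4) -> int:
    """Rough token estimate for the tool definitions included in every prompt.
    chars_per_token=4 is a crude English-text heuristic; use your model's
    tokenizer for accurate numbers."""
    return len(json.dumps(tools)) // chars_per_token

def select_tools(task_tags: set[str], registry: dict[str, dict]) -> list[dict]:
    """Load only the tool definitions whose tags overlap the current task,
    instead of sending the full registry on every call."""
    return [spec for spec in registry.values()
            if task_tags & set(spec.get("tags", []))]
```

With a 20-tool registry and a task tagged `{"retrieval"}`, the selected subset might be two or three definitions, cutting the fixed token overhead on every call in the task.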
Terminal Output
$ clawproof --check 09
CHECK 09 · Cost Controls & Rate Limiting
─────────────────────────────────────────────
✓ Per-call token limit: 8,000 tokens (enforced at gateway)
✓ Per-task budget: $2.50 with hard circuit breaker
✗ FAIL: No hourly spend cap configured (daily cap only: $500)
✓ Loop detection: active (threshold: 3 repeated calls)
✓ Cost alerting: PagerDuty integration at 50%/80%/100% thresholds
✗ FAIL: Input token costs not included in pre-call budget estimation
✓ Per-tenant rate limiting: token-bucket at gateway (tier-based)
✓ Call logging: token count, cost, task_id, tenant_id on all calls
Result: 2 issues found → add hourly cap and fix cost estimation
Severity: HIGH → runaway spend has no time-bounded safety net