Token economics
Every /build-plan run emits a per-phase token cost table at the end. The numbers below are real measurements from total_tokens reported directly by each subagent (PR #110), not line-count estimates.
Two measured features
These are the first two real /build-plan runs in red-profesionales after PRs #100/#102/#104 shipped (the coverage-aware loading + anti-duplication + DoD-checker budget).
Feature 1 — backend (single iter, no fast mode)
| Phase | Model | Tool uses | Total tokens | Duration |
|---|---|---|---|---|
| Bundle generator | sonnet | 38 | 111,362 | 10m 38s |
| Backend Developer | opus | 98 | 234,487 | 15m 32s |
| DoD-checker | haiku | 35 | 55,462 | 1m 16s |
| Backend Reviewer | sonnet | 47 | 84,434 | 5m 19s |
| Tester | sonnet | 42 | 62,746 | 4m 27s |
| update-specs | opus | 24 | 109,922 | 2m 31s |
| Total subagents | — | 284 | 658,413 | ~40 min |
Feature 2 — frontend (2 iters, fast mode on iter 2)
| Phase | Model | Tool uses | Total tokens | Duration |
|---|---|---|---|---|
| Frontend Dev iter 1 | opus | 50 | 115,078 | 8m 45s |
| DoD-checker | haiku | 27 | 52,062 | 1m 25s |
| Frontend Reviewer iter 1 | sonnet | 107 | 94,851 | 10m 31s |
| Frontend Dev iter 2 | opus | 25 | 79,274 | 3m 24s |
| Frontend Reviewer iter 2 (fast) | sonnet | 14 | 47,169 | 1m 36s |
| Tester | sonnet | 42 | 105,517 | 8m 24s |
| Total subagents | — | 265 | 493,951 | ~34 min |
What the data confirms
Reviewer savings predicted by PRs #102/#104 (30-50k Sonnet tokens saved per phase), confirmed at N=2:
| Phase | Before | After (measured) | Predicted | Verdict |
|---|---|---|---|---|
| Reviewer (feature 1) | ~130k | 84k | 80-100k | ✓ bullseye |
| Reviewer iter 1 (feature 2) | ~130k | 95k | 80-100k | ✓ in band |
| Reviewer iter 2 fast | — | 47k | ~50% iter 1 | ✓ ~50% |
The fast re-review mode (PR #96) also performs as designed: 47k on iter 2 is ~50% of 95k on iter 1.
Per-tool-use efficiency
| Phase | Total tokens | Tool uses | Tokens / tool use |
|---|---|---|---|
| Reviewer iter 1 (F2) | 94,851 | 107 | 886 |
| DoD-checker (F2) | 52,062 | 27 | 1,928 |
| Reviewer iter 2 fast (F2) | 47,169 | 14 | 3,369 |
| Tester (F2) | 105,517 | 42 | 2,512 |
| Frontend Dev iter 1 (F2) | 115,078 | 50 | 2,302 |
| Frontend Dev iter 2 (F2) | 79,274 | 25 | 3,171 |
The Reviewer’s 886 tokens/tool-use is the lowest — coverage-aware loading produces many small, targeted reads instead of a single full-checklist load. The DoD-checker’s 1,928/use stays inside its declared budget (max 2 tool calls per row).
The Tester’s 2,512 tokens/tool-use is the cost frontier today. The frontend Tester at 105k vs the backend Tester at 62k is a ~70% gap. Possible causes under investigation: more test cases (27 vs ~8), a heavier tool surface (Vitest + jsdom + sometimes Playwright vs PHPUnit), and full loads of logging.md / gdpr-pii.md / attack-surface-hardening.md in the tester bundle (PR #112 introduced section-loading; empirical confirmation is pending on the next frontend feature).
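For illustration, section-loading in the sense of PRs #102/#104/#112 can be as simple as slicing one heading's span out of a checklist file instead of loading it whole. The helper below is a hypothetical sketch, not the framework's actual code, and the file/heading names are placeholders:

```typescript
// Hypothetical sketch of section-loading (the real PR #112 code may differ).
// Instead of full-loading gdpr-pii.md into the tester bundle, return only
// the span under one "## Heading", so each read costs hundreds of tokens.
import { readFileSync } from "node:fs";

function loadSection(path: string, heading: string): string {
  const lines = readFileSync(path, "utf8").split("\n");
  const start = lines.findIndex((l) => l.trim() === `## ${heading}`);
  if (start === -1) throw new Error(`section "${heading}" not found in ${path}`);
  // The section runs until the next same-level heading (or end of file).
  const body = lines.slice(start + 1);
  const end = body.findIndex((l) => l.startsWith("## "));
  return [lines[start], ...(end === -1 ? body : body.slice(0, end))].join("\n");
}

// e.g. only the logging rules relevant to this feature, not all of logging.md:
// const rules = loadSection("checklists/logging.md", "Structured logging");
```

Many small targeted reads like this are also why the Reviewer's tokens/tool-use lands so low: the cost per read shrinks while the read count grows.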
Approximate cost per /build-plan
Using public Anthropic pricing as an order-of-magnitude estimate:
| Model | Tokens (sum across F1+F2) | Approx cost |
|---|---|---|
| Opus (Backend Dev + Frontend Dev + update-specs) | ~538k | $4-7 |
| Sonnet (Bundle + Reviewer + Tester) | ~505k | $1.5-3 |
| Haiku (DoD-checker × 2) | ~107k | $0.10-0.20 |
| Subtotal subagents | ~1.15M | $6-10 |
| Orchestrator overhead (Opus, NOT in per-subagent total) | ~200-400k each run | $1-3 each run |
| Total per /build-plan | — | ~$8-13 per feature |
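As a sanity check on those ranges, the arithmetic is just tokens × a blended $/Mtok rate. The rates below are assumptions chosen to bracket public list prices (the input/output split per subagent, which drives the real blend, is not broken out in the table):

```typescript
// Back-of-the-envelope check of the subtotal row. The $/Mtok ranges are
// assumptions bracketing public list prices, not measured blended rates.
const RATES_USD_PER_MTOK: Record<string, [number, number]> = {
  opus: [8, 13], // blended input/output rate assumption
  sonnet: [3, 6],
  haiku: [1, 2],
};
const TOKENS: Record<string, number> = {
  opus: 538_000,
  sonnet: 505_000,
  haiku: 107_000,
};

let low = 0;
let high = 0;
for (const model of Object.keys(TOKENS)) {
  const [lo, hi] = RATES_USD_PER_MTOK[model];
  low += (TOKENS[model] / 1e6) * lo;
  high += (TOKENS[model] / 1e6) * hi;
}
console.log(`subagent subtotal ≈ $${low.toFixed(0)}-${high.toFixed(0)}`); // ≈ $6-10
```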
For comparison, the same feature implemented by a developer using Cursor + 4-6h of their time costs roughly the same in tooling but consumes the developer’s day. The framework’s value-add is parallelism (BE ‖ FE), enforced architecture, and a Reviewer-approved + tested deliverable at the end.
What’s instrumented
PR #110 replaced the prior lines × 8 token estimate with the Anthropic SDK’s total_tokens field per subagent. The orchestrator emits the table verbatim at the end of every /build-plan run. Future regressions in any phase become visible per feature rather than having to be inferred from aggregate cost.
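A minimal sketch of what that instrumentation amounts to, assuming the public Anthropic TypeScript SDK (input_tokens, output_tokens, and the two cache counters are real usage fields; the PhaseRow shape and the summing into one total are assumptions about PR #110's plumbing):

```typescript
// Sketch: one per-phase row from a subagent's final API response.
// usage.input_tokens / output_tokens / cache_* are real Messages API
// fields; summing them into totalTokens mirrors what a per-subagent
// total-tokens figure would have to contain.
import Anthropic from "@anthropic-ai/sdk";

interface PhaseRow {
  phase: string;
  model: string;
  totalTokens: number;
}

async function runPhase(phase: string, model: string, prompt: string): Promise<PhaseRow> {
  const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
  const msg = await client.messages.create({
    model,
    max_tokens: 4096,
    messages: [{ role: "user", content: prompt }],
  });
  const u = msg.usage;
  const totalTokens =
    u.input_tokens +
    u.output_tokens +
    (u.cache_creation_input_tokens ?? 0) +
    (u.cache_read_input_tokens ?? 0);
  return { phase, model, totalTokens };
}
```

Each subagent contributes one such row; the orchestrator concatenates them into the end-of-run table.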
What’s not yet instrumented
- Orchestrator overhead. The 200-400k Opus tokens the orchestrator itself consumes (system prompt, build-plan-command.md, lessons-learned reads, accumulated tool results) are not in the per-subagent table. Adding them is non-trivial because the orchestrator is the conversation, not a subagent.
- Per-section checklist read cost. The Reviewer’s per-section reads via offset+limit are not separately instrumented; only the aggregate Reviewer cost is visible.
- Cache hit rate. Anthropic’s prompt cache is leveraged via static-first prefix ordering, but the framework does not currently report cache-hit vs cache-miss tokens. Adding this requires the SDK’s cached-input-tokens telemetry (see the sketch after this list).
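For the cache-hit bullet specifically, the telemetry already exists in the public SDK's usage object (cache_read_input_tokens and cache_creation_input_tokens); only the reporting is missing. A sketch of what that report could look like, with the hit-rate formula as an assumption:

```typescript
// Sketch of cache-hit reporting. The three field names match the public
// Anthropic SDK's usage object; the hit-rate formula is an assumption.
interface Usage {
  input_tokens: number; // uncached input, billed at full rate
  cache_read_input_tokens: number | null; // input served from the prompt cache
  cache_creation_input_tokens: number | null; // input written into the cache
}

function cacheReport(u: Usage): string {
  const read = u.cache_read_input_tokens ?? 0;
  const created = u.cache_creation_input_tokens ?? 0;
  const hitRate = read / Math.max(1, read + created + u.input_tokens);
  return (
    `cache hit rate ${(hitRate * 100).toFixed(1)}% ` +
    `(read ${read}, created ${created}, uncached ${u.input_tokens})`
  );
}
```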