Back to the board
DRAFT — placeholder data. Replace with a real agent-tested trace before publishing.
Agent-hostile

Sample API (red)

We dropped the response after a payment, exactly like a flaky network would. The agent retried. We got charged three times.

The five tasks

Getting started from the docs aloneHalf-nailed it
Fixing its own mistake after an errorChoked
Following a multi-step flowHalf-nailed it
Handling an unclear edge caseChoked
Not double-charging on a retryChoked

Here’s the receipt — what actually happened, not our summary of it.

> POST /v1/charge  { "amount": 4900 }   (response dropped)
> POST /v1/charge  { "amount": 4900 }   (retry — no idempotency key honored)
> POST /v1/charge  { "amount": 4900 }   (retry)
< 3 charges created: ch_01, ch_02, ch_03  — total $147.00
See everything the AI did (3 steps)
09:14:02  GPT      POST /v1/charge {amount:4900}  → — (dropped)
09:14:05  GPT      POST /v1/charge {amount:4900}  → 200 ch_01
09:14:07  GPT      POST /v1/charge {amount:4900}  → 200 ch_02

Tested 2026-06-19 with Claude, GPT, Gemini agents · request a re-test ↗