AI/LLM Security

When Smarter Claude Models Break Your Agent's Tool Schema

A developer building a custom coding harness found that newer, more capable Claude models are worse at following his tool's JSON schema than older ones — a reminder that agent tool-calling reliability is a security boundary, not just a UX detail.

PyramidLedger Research4 July 20264 min read

Developer Armin Ronacher found that Claude Opus 4.8 and Sonnet 5 invent extra, schema-violating fields when calling his coding tool Pi's edit function — older Claude models don't.
The actual edit content (oldText/newText) stayed correct; only invented metadata fields like `type`, `id`, and `matchCase` broke schema validation, and Pi's strict validator correctly rejected the malformed calls.
The likely cause is that newer models are post-trained heavily against Claude Code's own edit-tool schema, biasing them at ambiguous decision points when calling differently-shaped third-party tools.
For teams building agentic tooling, this is a live argument for treating tool-schema validation as a security control, not just a data contract, and for re-testing agent tool paths whenever the underlying model changes.

Developer Armin Ronacher recently documented a strange regression while building Pi, a custom coding agent harness: newer Claude models perform *worse* at calling his edit tool correctly than older ones do. As Simon Willison summarized, the affected models are not small or cheap ones — they are Opus 4.8 and Sonnet 5, Anthropic's current top-tier models. Ronacher's original writeup lays out the mechanics in detail.

What actually broke

Pi's edit tool accepts a nested edits[] array, where each entry should contain only the fields Pi's schema expects. When Opus 4.8 or Sonnet 5 call this tool, they intermittently invent extra, undefined keys inside each edit object — Ronacher lists fabricated fields such as type, id, kind, unique, requireUnique, matchCase, and in_file. Crucially, the actual edit payload — the oldText and newText strings describing the change itself — came through byte-correct every time. The failure was purely structural: the model added fields the schema never defined, so Pi's validator rejected the call and asked the model to retry. Older Claude models did not exhibit this behavior at all.

Why newer models, of all things, get this wrong

Ronacher's theory is a training artifact, not a capability regression. Anthropic's newer models are heavily post-trained around Claude Code's own built-in edit tool, which uses a flatter, more permissive schema with parameter aliases and tolerance for near-misses. He argues that at a specific high-entropy decision point — right after the model closes an escaped string inside the JSON payload — that training biases the model toward *sampling* plausible-looking extra fields rather than correctly terminating the object per Pi's stricter schema. Willison frames this as a broader industry pattern: each model vendor trains tool-use behavior against its own first-party tool conventions (Anthropic's search-and-replace style edit tool versus OpenAI's apply_patch), and third-party harnesses increasingly have to chase whichever shape the current best model was implicitly trained on.

The security angle

This isn't a vulnerability disclosure, but it's directly relevant to anyone building or red-teaming agentic AI tooling. Pi's behavior here is the *correct* outcome: a strict schema validator caught malformed tool-call arguments and refused to execute them, rather than silently coercing unknown fields or ignoring them. That fail-closed validation is exactly the kind of control that matters for agents with write access to a filesystem or codebase — the difference between a rejected call and an unreviewed, unintended edit landing on disk is often just how permissively the tool parses its own schema.

The more uncomfortable takeaway is that this brittleness is *model-version-dependent* and moved in the wrong direction as the model got more capable in other respects. Teams building custom agent tool-calling integrations should not assume that a model upgrade preserves prior tool-use reliability, and should treat their tool schemas as an attack-adjacent surface: validate strictly, reject unknown fields by default, and re-run adversarial and malformed-input testing against agent tool paths whenever the underlying model changes — not just when the tool's own code changes.

FAQ

Frequently Asked Questions

Is this a security vulnerability in Claude?

No — it's a reliability quirk in how newer Claude models format tool-call arguments for one developer's custom coding harness, not a disclosed CVE or exploit. It's notable for AI security teams because it shows how a strict tool schema validator can correctly block malformed agent behavior.

Which Claude models are affected?

Armin Ronacher reports the issue with Opus 4.8 and Sonnet 5 when calling his Pi harness's edit tool; he states that older Claude models did not show the same invented-field behavior.

What should teams building AI coding agents take from this?

Treat tool-call schema validation as a security control: reject unknown or malformed fields by default rather than tolerating them, and re-test agent tool-calling paths whenever you upgrade the underlying model, since reliability here does not necessarily improve with newer models.

Sources

1Better Models: Worse Tools — Simon Willison
2Better models, worse tools — Armin Ronacher