Most security tools grade their own detection from the architecture diagram. The deck says "four detection layers"; the layers should catch the attack; therefore the attack is caught. We didn't trust that chain of reasoning for our own engine, so we wrote six test payloads and ran them against the live scanner: not a fork, not a mock, but the same code that scores every skill in the registry. It got four right. It missed two. Here is the scoreboard, before any fixes, and what we changed afterward.
The test
Six payloads, each a single SKILL.md crafted to probe a specific seam. Four are attacks we claim to flag. One is a known-hard case we expected to be on the edge. One is a clean control, there to catch the failure mode nobody talks about: a scanner that "passes" the battery by flagging everything. We scored each one through the production engine and recorded the raw number it returned.
The scoreboard
This is the result before we touched anything. We are publishing the misses first, on purpose — the fixes only mean something if you can see what they're fixing.
| # | Attack | Result |
|---|---|---|
| 1 | Plain eval(x) | caught — regex + AST |
| 2 | ("ev"+"al")(x) — string-split eval | MISSED |
| 3 | .env → base64 → POST | caught — compound |
| 4 | Read + send across two code blocks | caught — cross-zone compound |
| 5 | Semantic injection, zero trigger tokens | MISSED |
| 6 | Clean control | correctly clean |
Three attacks caught, two missed, control clean: four correct verdicts out of six. The two misses are not random. They sit on the two seams every static scanner has, and they're worth understanding before you trust any tool's detection claim, ours included.
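The tally can be reproduced mechanically from the table. The sketch below is only bookkeeping over the published outcomes. It does not call the scanner, and the `Outcome` type and `tally` helper are hypothetical names introduced for illustration. It also shows why the clean control matters: a scanner that flags everything catches every attack and still fails.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    is_attack: bool  # False marks the clean control
    flagged: bool    # what the scanner did with the payload


# Outcomes transcribed from the scoreboard (pre-fix run).
BATTERY = [
    Outcome("plain eval", True, True),
    Outcome("string-split eval", True, False),
    Outcome(".env -> base64 -> POST", True, True),
    Outcome("read + send across two code blocks", True, True),
    Outcome("semantic injection", True, False),
    Outcome("clean control", False, False),
]


def tally(outcomes):
    """Count caught attacks, missed attacks, false positives, and correct verdicts."""
    caught = sum(o.is_attack and o.flagged for o in outcomes)
    missed = sum(o.is_attack and not o.flagged for o in outcomes)
    false_positives = sum(not o.is_attack and o.flagged for o in outcomes)
    clean_passed = sum(not o.is_attack and not o.flagged for o in outcomes)
    return {
        "caught": caught,
        "missed": missed,
        "false_positives": false_positives,
        "correct": caught + clean_passed,
    }


print(tally(BATTERY))

# A scanner that flags everything: every attack "caught", control failed.
flag_everything = [Outcome(o.name, o.is_attack, True) for o in BATTERY]
print(tally(flag_everything))
```

Counting the control as its own column keeps a detection gain from hiding a false positive.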
Why it missed them
#2: string-split eval. The keyword and AST layers look for eval as a call expression. Write ("ev"+"al")(x) and there is no eval token to find: the string is composed at runtime, after the parser has already moved on. The skill scored 90/100, Trusted. A reviewer eyeballing the code would catch it in a second; the static layers had nothing literal to grab.
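To make the miss concrete, here is a minimal sketch of a literal-token detector. It is not the production engine: it uses Python's `ast` module as a stand-in parser, and `keyword_layer` and `ast_layer` are hypothetical names. Both checks look for a literal `eval` in call position, which is exactly what the split string never provides.

```python
import ast
import re

# Keyword layer: a literal "eval(" somewhere in the source text.
EVAL_CALL = re.compile(r"\beval\s*\(")


def keyword_layer(source):
    return bool(EVAL_CALL.search(source))


def ast_layer(source):
    """Flag a call whose callee is the bare name `eval`."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"
        ):
            return True
    return False


for payload in ("eval(x)", '("ev"+"al")(x)'):
    print(payload, keyword_layer(payload), ast_layer(payload))
```

For the second payload the callee is a string concatenation, not a name, so neither layer has anything to match.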
#5: semantic injection. A prose instruction that tells the agent to do something harmful, carrying zero of the trigger tokens our keyword and AST passes key on. No curl, no process.env, no suspicious call, just natural language aimed at the model rather than the interpreter. It scored 93/100, Trusted. This is the seam that doesn't live in the code at all; it lives in the instructions.
The fixes (verified live)
Both are now deployed and re-measured against the same payloads:
- #2 now scores 40/100, Risky — was passing at 90/Trusted.
- #5 now scores 75/100, Caution — was passing at 93/Trusted.
- The clean control still scores 90/100, Trusted. The fixes added no false positive — which is the only reason they're worth shipping. A detection gain you pay for by flagging clean skills isn't a gain; it's a relocation of the error.
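The shipping rule in the last bullet can be written as a regression gate. This is a sketch under stated assumptions: the scores and labels are the ones reported above, `fix_is_shippable` is a hypothetical helper, and nothing here calls the real scanner.

```python
# Scores and labels as reported in this post.
BEFORE = {
    "string-split eval": (90, "Trusted"),
    "semantic injection": (93, "Trusted"),
}
AFTER = {
    "string-split eval": (40, "Risky"),
    "semantic injection": (75, "Caution"),
    "clean control": (90, "Trusted"),
}
CONTROLS = {"clean control"}


def fix_is_shippable(before, after, controls):
    """A fix ships only if every miss leaves Trusted and no control is flagged."""
    # Every clean control must still be Trusted after the fix.
    for name in controls:
        if after[name][1] != "Trusted":
            return False
    # Every previous miss must score lower and no longer be Trusted.
    for name, (old_score, _old_label) in before.items():
        new_score, new_label = after[name]
        if new_label == "Trusted" or new_score >= old_score:
            return False
    return True


print(fix_is_shippable(BEFORE, AFTER, CONTROLS))
```

Checking the control in the same function as the misses makes the "relocation of the error" case fail loudly instead of passing as a gain.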
How far the fix actually reaches
Here is the part most write-ups would leave out. The obfuscated-eval fix does not mean "we now catch obfuscated execution." It means something narrower and exactly stateable: the scanner now follows named rungs plus a single-step variable hop. It will resolve a string assembled and called one indirection away. It will not follow a longer indirection chain — build the string across several variables, route it through an array, defer it a few more steps, and it gets through again.
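One way to picture that boundary is the sketch below. It is an illustration, not the deployed fix: it again uses Python's `ast` as a stand-in parser, models only the single-hop part of the description, and `fold_string` and `flags_eval` are hypothetical names. It folds a concatenation of string literals and follows exactly one assignment from such a string to the name being called.

```python
import ast


def fold_string(node):
    """Return the string for a literal or a '+' of literals, else None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left, right = fold_string(node.left), fold_string(node.right)
        if left is not None and right is not None:
            return left + right
    return None


def flags_eval(source):
    tree = ast.parse(source)
    # Single hop: names assigned directly from a foldable string.
    bindings = {}
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        ):
            value = fold_string(node.value)
            if value is not None:
                bindings[node.targets[0].id] = value
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        if isinstance(callee, ast.Name):
            # Literal eval, or a name one assignment away from "eval".
            if callee.id == "eval" or bindings.get(callee.id) == "eval":
                return True
        elif fold_string(callee) == "eval":
            # Callee assembled inline, e.g. ("ev"+"al")(x).
            return True
    return False


print(flags_eval('("ev"+"al")(x)'))
print(flags_eval('f = "ev" + "al"\nf(x)'))
print(flags_eval('a = "ev"\nb = a + "al"\nb(x)'))
```

The third case builds the string across two variables, so the one-step lookup never sees a foldable value for `b` and the call passes, which is the same kind of gap described above.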
We're stating the boundary precisely because the boundary is the credibility. "We close obfuscated execution" would be the comfortable sentence and the false one. The true sentence is: we closed the single-hop case, the longer chains remain open, and a determined attacker who reads this knows exactly where the line is. That's the trade we'll make every time — an honest edge over a generous claim.
The semantic-injection fix carries the same caveat in a different shape. Prose attacks are a spectrum, not a switch; #5 moved from Trusted to Caution, not to Dangerous, because what we added is suspicion, not proof. The scanner can flag the shape of an instruction aimed at the model. It cannot tell you the author meant harm. It surfaces; it doesn't conclude.
Measure, don't assert
The reason to publish a 4-of-6 instead of a 6-of-6 is not modesty. It's that a detection claim you can't reproduce is just a marketing number, and the entire premise of this tool is that you shouldn't take a security verdict on faith — not ours, not the registry's, not anyone's. We grade our own engine against real payloads, we write down the misses, and we tell you precisely how far each fix reaches. That's the standard we'd want from a scanner before we trusted it with a skill we were about to install.
So: found a miss we didn't? That's the most useful thing you can send us. Run a skill through the live scanner, and if the verdict is wrong — too soft or too harsh — use the Dispute control on the verdict page. We'd rather be corrected than trusted blindly.
curl https://api.clauwdit.4worlds.dev/audit/owner/skill
Or open a verdict page directly, e.g. /skill/browser-use, and read the verdict, the confidence chip, and the dispute affordance for yourself.
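For scripted checks, the curl call can be wrapped in a few lines of Python. This is a minimal sketch: `audit_url` and `fetch_audit` are hypothetical helpers, the response is returned as raw text because its schema is not described here, and UTF-8 decoding is an assumption.

```python
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "https://api.clauwdit.4worlds.dev"


def audit_url(owner, skill):
    """Build the audit URL for an owner/skill pair, escaping each path segment."""
    return f"{API_BASE}/audit/{quote(owner, safe='')}/{quote(skill, safe='')}"


def fetch_audit(owner, skill, timeout=10.0):
    """GET the audit endpoint and return the body as text (UTF-8 assumed)."""
    with urlopen(audit_url(owner, skill), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder segments, as in the curl example; substitute a real owner and skill.
print(audit_url("owner", "skill"))
```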