
Safety Gates in Claude Code Skills: From Auditing 35 Skills to a Three-Layer Protection Model


I had a Skill that handled K8s deployments. The flow ran for months without incident. Then one day I went back to look at how the “confirm” step was actually written: a single line, “Confirm with user before proceeding”, with no mechanism guaranteeing the model would stop.

That didn’t sit well with me.


The starting point: one line of natural language

The deployment Skill’s flow was: refresh token → dry-run → confirm → deploy.

The “confirm” step looked like this:

Step 4: Confirm with user before proceeding

Most of the time Claude would stop and ask if I wanted to continue. But “most of the time” and “guaranteed” are two different things. That line is natural language. The model will “try” to follow it during token generation, but there’s no runtime mechanism ensuring it actually stops.

I decided to go through every Skill I had and see how many had the same problem.


Auditing 35 Skills

I had two sets of Skills — 14 shared repo skills and 21 personal skills. After reading through all of them, I first sorted by whether they had destructive operations:

Type                       | Count | Examples
Read-only / Advisory       | 21    | Log analysis, code review, status checks
Has destructive operations | 14    | Deployments, git push, config changes, device commands

Then I looked at how those 14 destructive Skills handled “confirm before executing”:

Approach                                                | Count
Nothing at all                                          | 8
Natural language (CHECKPOINT, STOP, Confirm with user)  | 5
Specifies calling AskUserQuestion tool                  | 1

14 Skills with destructive operations. 8 with no checkpoint whatsoever. 5 relying on a line of natural language.
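
A tally like this can be approximated mechanically before reading each Skill by hand. The following is a minimal sketch, not the process used for the audit above: it assumes each Skill lives in its own directory as a SKILL.md file, and the function names classify_skill and audit_skills, the list of destructive commands, and the keyword list are all illustrative assumptions. A keyword grep is crude (it cannot tell whether a checkpoint actually guards the destructive step), so treat the output as a starting list.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper: classify one SKILL.md by how it gates destructive steps.
# Prints one of: read-only | tool-call | natural-language | none
classify_skill() {
  # Assumed markers of a destructive operation; adjust to your own commands.
  if ! grep -qE 'git push|kubectl (apply|delete)|rm -rf' "$1"; then
    echo "read-only"
  elif grep -q 'AskUserQuestion' "$1"; then
    echo "tool-call"
  elif grep -qE 'CHECKPOINT|STOP|WAIT|Confirm with user' "$1"; then
    echo "natural-language"
  else
    echo "none"
  fi
}

# Print "<category><TAB><path>" for every SKILL.md under a directory
# (defaults to .claude/skills, which is an assumption about your layout).
audit_skills() {
  find "${1:-.claude/skills}" -name SKILL.md -type f | sort | while read -r skill_file; do
    printf '%s\t%s\n' "$(classify_skill "$skill_file")" "$skill_file"
  done
}
```

To get counts per category, pipe the result through standard tools: audit_skills ~/.claude/skills | cut -f1 | sort | uniq -c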

This isn’t just my setup. Search GitHub for public Claude Code Skills and you’ll find the same pattern everywhere — natural language signposts:

  • The claude-code-starter-kit incident-response skill puts it right in the behavioral rules: **STOP at checkpoints** — wait for user confirmation before proceeding, with each phase ending in **CHECKPOINT**: Present triage summary. Wait for user to confirm before investigation.
  • The claude-code-ultimate-guide talk-pipeline skill uses a CHECKPOINT step that says “Do not invoke Stage 5 without explicit user confirmation”, and its anti-patterns section warns against “Skipping the CHECKPOINT — it’s the pipeline’s most important control point”.
  • awesome-claude-skills curates 50+ verified skills — I went through them, and not a single one uses a runtime mechanism for checkpoints

Whether you call it CHECKPOINT, STOP, WAIT, or Confirm with user, it’s the same thing: a line of natural language, hoping the model reads it and stops.

But these signposts are not 100% reliable. GitHub Issue #18454 documents a case where a user wrote “⛔ MANDATORY SESSION START (DO NOT SKIP)” and “Wait for confirmation before proceeding” in their CLAUDE.md (bold, emoji, all-caps). The model acknowledged reading it, then completely ignored it, modifying 23 files in one go.

What about the one that used AskUserQuestion? It was a sprint planning skill that called AskUserQuestion after listing stories for the user to confirm. Written like this:

Use AskUserQuestion to confirm the stories:
- Question: "Are these stories correct?"
- Options:
  - "Correct, proceed" → Continue to next step
  - "Need changes" → Ask what to modify

My first reaction: “That’s the right approach. AskUserQuestion is a tool call. Once invoked, the runtime forcibly pauses generation and waits for user response. This is a hard constraint.”

Two weeks of testing later, I found that conclusion was only half right.


AskUserQuestion isn’t as hard as you’d think

Worth pausing to consider: the model deciding whether to invoke AskUserQuestion and deciding whether to obey CHECKPOINT/WAIT/STOP use the same mechanism — token generation.

CHECKPOINT/WAIT:   Probabilistic compliance → outputs text and waits
AskUserQuestion:   Probabilistic invocation → (if invoked) runtime forces a block

The second step is genuinely deterministic: once the tool call fires, the runtime pauses generation, presents the selection UI, and waits for the user’s choice. The official documentation backs this up:

Execution remains paused until your callback returns, and the SDK only cancels the wait when the query itself is cancelled.

But the first step? The model “deciding whether to issue the tool call” is probabilistic in exactly the same way as “deciding whether to obey CHECKPOINT.”

This isn’t speculation. GitHub Issue #19308 has a title that says it directly:

Claude systematically ignores Skill tool despite explicit BLOCKING REQUIREMENT instructions

All-caps bold “you MUST call this tool” in the Skill, and the model skips it anyway.

So is AskUserQuestion better than plain natural language? Yes — it adds a runtime protection layer. But is it 100%? No. The difference between the two is single-layer (pure probability) vs two-layer (probability + deterministic), not “soft vs hard.”


So what’s actually 100%?

After going through the official docs, I found three mechanisms that don’t depend on the model “deciding to comply” — they operate at the runtime layer, and the model can’t bypass them.

PreToolUse Hook

This is the strongest one. Hooks intercept tool calls before execution. You inspect the command and decide to allow or block:

// .claude/settings.json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "bash .claude/hooks/block-destructive.sh" }
        ]
      }
    ]
  }
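}

#!/usr/bin/env bash
# .claude/hooks/block-destructive.sh
# Read the hook payload (JSON) from stdin and pull out the Bash command.
INPUT=$(cat)
COMMAND=$(printf '%s' "$INPUT" | jq -r '.tool_input.command // empty')

# Block anything that matches a known destructive command.
if printf '%s' "$COMMAND" | grep -qE "git push|kubectl apply|kubectl delete|rm -rf"; then
  # Build the JSON with jq so quotes inside $COMMAND cannot break it.
  jq -n --arg cmd "$COMMAND" \
    '{decision: "block", reason: ("Blocked: " + $cmd + ". Run manually if intended.")}'
  exit 0
fi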

According to the official docs, PreToolUse Hooks block even bypassPermissions mode:

A hook that returns permissionDecision: "deny" blocks the tool even in bypassPermissions mode or with --dangerously-skip-permissions.

Model tries git push? Hook intercepts before the shell executes. The model can’t get around it because the whole thing happens outside the model’s control.
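
Since the hook is only as good as its pattern, it helps to check the matching logic in isolation before wiring it into settings.json. Below is a minimal sketch under stated assumptions: is_destructive is a hypothetical helper name, and its pattern is a slightly stricter variant of the grep above (it tolerates repeated spaces, catches rm -fr as well as rm -rf, and requires a word boundary so that "digit push" does not match). It is a heuristic; it still misses forms such as rm --recursive --force.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when a command line looks destructive.
is_destructive() {
  # The leading group requires start-of-line or a non-word character, so
  # "digit push" or "perform -rf" do not match by accident.
  pattern='(^|[^[:alnum:]_])(git[[:space:]]+push|kubectl[[:space:]]+(apply|delete)|rm[[:space:]]+-[[:alpha:]]*r[[:alpha:]]*f|rm[[:space:]]+-[[:alpha:]]*f[[:alpha:]]*r)'
  printf '%s\n' "$1" | grep -qE "$pattern"
}

# Quick manual check over a few sample commands.
for sample in "git status" "git push origin main" "kubectl  apply -f app.yaml" "rm -fr build"; do
  if is_destructive "$sample"; then
    echo "BLOCK  $sample"
  else
    echo "allow  $sample"
  fi
done
```

The hook script itself can be exercised the same way, by piping in a payload that carries the one field it reads: echo '{"tool_input":{"command":"git push origin main"}}' | bash .claude/hooks/block-destructive.sh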

Skill splitting

Split one long-flow Skill into two independent Skills:

Before: /deploy → dry-run → confirm → deploy → verify
After:  /deploy-prepare → dry-run → output results
        /deploy-execute → user manually triggers → deploy → verify

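The user has to type /deploy-execute themselves. For that to be a runtime guarantee rather than another hope, the model must not be able to invoke the Skill on its own. In Claude Code that is what the disable-model-invocation: true frontmatter setting is for, so set it on the execute half and confirm the behavior against the docs for your version.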
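
The split can be sketched as a small scaffold script. Everything here is illustrative: scaffold_split_skills is a hypothetical helper, the step text is a placeholder for the real deployment instructions, and the layout (one directory per Skill containing SKILL.md) is an assumption. It also assumes the disable-model-invocation frontmatter field prevents the model from triggering the execute half, which is worth verifying in your own setup.

```shell
#!/usr/bin/env bash
# Hypothetical scaffold: write the two halves of a split deployment Skill.
scaffold_split_skills() {
  skills_root="${1:-.claude/skills}"
  mkdir -p "$skills_root/deploy-prepare" "$skills_root/deploy-execute"

  # Half 1: safe to run automatically, because it never deploys.
  cat > "$skills_root/deploy-prepare/SKILL.md" <<'EOF'
---
name: deploy-prepare
description: Refresh the token, run a dry-run, and print the results. Never deploys.
---
1. Refresh the token.
2. Run the dry-run and show the output.
3. Stop. Tell the user to run /deploy-execute if the dry-run looks right.
EOF

  # Half 2: only the user should start this one.
  cat > "$skills_root/deploy-execute/SKILL.md" <<'EOF'
---
name: deploy-execute
description: Deploy and verify. Run only when the user types /deploy-execute.
# Assumption: this field keeps the model from invoking the Skill itself.
disable-model-invocation: true
---
1. Deploy.
2. Verify the rollout.
EOF
}
```

Calling scaffold_split_skills with no argument writes both files under .claude/skills in the current project; pass a directory to write them elsewhere.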

disallowed-tools

---
name: log-analyzer
disallowed-tools:
  - Edit
  - Write
  - Bash
---

disallowed-tools removes specified tools from the model’s available tool pool for the duration of the Skill. The model can’t see these tools, so it won’t call them. The caveat: the restriction clears after the user’s next message. Good enough for analysis Skills, not enough for deployment Skills.

One thing that’s easy to confuse here: allowed-tools is not a restriction. The official docs are explicit that it only grants permission (pre-approves tools), and does not prevent the model from calling tools outside the list. I got this backwards initially and only corrected it after checking the docs.
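
Because the two fields are so easy to mix up, a quick lint over existing Skills can flag the ones that only pre-approve tools. This is a rough sketch: frontmatter and tool_policy are hypothetical helper names, the labels they print are made up for this example, and the parsing assumes the frontmatter is the block between the first two --- lines with the field names at the start of a line.

```shell
#!/usr/bin/env bash
# Print the frontmatter: the lines between the first two '---' markers.
frontmatter() {
  awk '/^---[[:space:]]*$/ { n++; next } n == 1 { print } n >= 2 { exit }' "$1"
}

# Hypothetical lint: report how a Skill constrains its tools.
#   restricted         -> declares disallowed-tools (tools are removed)
#   pre-approved-only  -> declares only allowed-tools (a grant, not a restriction)
#   unrestricted       -> declares neither
tool_policy() {
  if frontmatter "$1" | grep -q '^disallowed-tools:'; then
    echo "restricted"
  elif frontmatter "$1" | grep -q '^allowed-tools:'; then
    echo "pre-approved-only"
  else
    echo "unrestricted"
  fi
}
```

A read-only Skill that reports pre-approved-only is the case worth a second look, since the author may have meant it as a restriction.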


The three-layer protection model

After sorting through everything, all mechanisms for “making the model stop before destructive operations” fall into three layers:

Layer                 | Approach                                 | Depends on model compliance                           | Reliability
Natural language      | CHECKPOINT, WAIT, STOP, Confirm          | 100% dependent                                        | Probabilistic
Tool call instruction | Use AskUserQuestion                      | Invocation decision dependent, execution independent  | Probability + deterministic
Runtime mechanism     | Hooks, Skill splitting, disallowed-tools | 0% dependent                                          | 100%

Back to that original K8s deployment Skill. Here’s the corrected protection:

  1. Hook intercepts kubectl apply — model can try all it wants, it won’t execute
  2. AskUserQuestion presents options before deploy — the normal-flow UX
  3. Natural language IMPORTANT: Never deploy without approval — the last soft line of defense

The primary defense is the Hook. Even if AskUserQuestion gets skipped (Issue #19308 confirms this happens), kubectl apply is still blocked by the Hook. AskUserQuestion’s value isn’t security — it’s providing a better user experience (the selection UI).


Decision framework

Does the Skill have irreversible operations?

├── No → No checkpoint needed
│   Optional: disallowed-tools to remove write tools

└── Yes → Runtime layer as primary defense
    ├── Hook to intercept dangerous commands (most flexible)
    ├── Skill splitting: prepare + execute (simplest)
    └── Optional: layer AskUserQuestion on top for UX

Looking back

After going through 35 Skills, reading the official docs, and combing through GitHub Issues, the biggest takeaway wasn’t “discovering the three-layer model.” It was realizing that my intuition about LLM control mechanisms was wrong.

I assumed “telling the model to call a tool” was more reliable than “telling the model to stop and wait.” Sounds reasonable — tool calls have runtime protection, natural language doesn’t. But “the model deciding whether to call the tool” is itself probabilistic, using the same mechanism as “the model deciding whether to obey CHECKPOINT.”

One sentence: if a behavior’s safety depends on the model “deciding to comply” with your instruction, it’s not 100%. 100% only exists in mechanisms outside the model’s control.




This is part of the “Claude Code in Practice” series. Previous: Git as an External Brain for Claude Code: Beyond MEMORY.md.