Claude Code Skill Safety: From 'Please Stop' to 'You Can't Move'

38 Skills, three layers of defense, one hard lesson: natural language instructions are not a safety mechanism.

TL;DR

I have 38 Claude Code Skills (24 personal + 14 repo). An audit found that 12 had no checkpoint at all (5 of them destructive), and 10 destructive Skills relied solely on natural language like “Confirm with user before proceeding.” This post documents how I fixed the problem systematically using a three-layer protection model.

Core insight: Natural language instructions are a “sign,” not a “physical barrier.” You don’t put up a “Please Keep Back” sign at the edge of a cliff and call it safety.

The Problem: Are Your Skills Actually Safe?

Claude Code Skills are reusable workflow templates. A well-written Skill can compress 10 minutes of work into 30 seconds. But here’s the thing — many Skills contain irreversible operations:

git push to remote
aws s3 rm deleting S3 objects
kubectl delete removing K8s resources
adb push overwriting system APKs
--execute --yes writing to production DB

I did a full audit of my 38 Skills:

Category	Count	Examples
Destructive, no checkpoint	5	`bads-skynet-e2e`, `device-test`, `commit-phase`, `test-folio`, `ventura-memory`
Destructive, natural language only	10	`bads-update`, `self-evolution`, `phase-impl`, `writing`, `code-review`, etc.
Destructive, has AskUserQuestion	3	`prepare-feature-for-ux-review`, `sprint-plan`, `release-notes`
Read-only	17	`aosp-analysis`, `director`, `trade-analysis`, etc.

The scariest finding: My bads-update Skill runs 8 steps end-to-end — from token refresh to production DB write to S3 cleanup. The only thing between dry-run and execution was a single “Confirm with user before proceeding.” In long-context conversations, the model may skip right past it.
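The audit categories above can be approximated mechanically. The following is a minimal sketch, not the method actually used for this audit: the regular expressions and the `classify_skill` name are assumptions, and a real audit would still need a human to read each Skill.

```python
import re

# Hypothetical patterns based on the irreversible operations listed earlier;
# adjust them to the commands your own Skills run.
DESTRUCTIVE = re.compile(
    r"git\s+push|aws\s+s3\s+rm|kubectl\s+delete|adb\s+push|--execute\s+--yes"
)
# Layer 1 wording such as "Confirm with user" or "**→ WAIT** for user response".
NATURAL_LANGUAGE = re.compile(
    r"confirm with user|wait\W+for\s+user", re.IGNORECASE
)


def classify_skill(text: str) -> str:
    """Return one of the four audit categories for a Skill's markdown body."""
    if not DESTRUCTIVE.search(text):
        return "read-only"
    if "AskUserQuestion" in text:
        return "destructive, has AskUserQuestion"
    if NATURAL_LANGUAGE.search(text):
        return "destructive, natural language only"
    return "destructive, no checkpoint"
```

Running this over every SKILL.md gives a first-pass table; anything it labels read-only still deserves a quick manual look, since a pattern list can miss commands it has never seen.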

The Three-Layer Protection Model

I categorize safety mechanisms into three layers, ordered by increasing reliability:

Layer 1 (Sign)           → Natural language: "Wait for user confirmation"
Layer 2 (Traffic light)  → AskUserQuestion tool call
Layer 3 (Physical wall)  → PreToolUse Hook / Skill split / disallowed-tools

Layer 1: Natural Language — “Please Stop”

**→ WAIT** for user response before proceeding.
Confirm with user before proceeding.

This is a sign. The model sees it, weighs it against all other instructions, and decides whether to comply. Compliance drops under these conditions:

Long context dilution: The longer the conversation, the lower the weight of earlier instructions
Task completion bias: The model prefers completing tasks over stopping
Vague wording: “Confirm with user” vs “Use AskUserQuestion tool”

Conclusion: Layer 1 should only be a last resort, never the primary defense.

Layer 2: AskUserQuestion — “Traffic Light That Might Break”

Use AskUserQuestion to confirm:
> "Push to origin refs/for/main? (Change-Id: I1234)"
Only proceed after user confirms.

Here the Skill instructs the model to call the AskUserQuestion tool. If the model does call it, the runtime forces a pause and waits for the user's response. But the first step, the model deciding to call it, is still probabilistic.

GitHub Issue #19308 reports exactly this: the model ignoring explicit tool call instructions in Skills.

Probabilistic invocation → (if invoked) → Runtime blocks deterministically
         ↑                                        ↑
    May fail                                 100% reliable

Better than Layer 1 thanks to the runtime blocking, but the entry point is still probabilistic.
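To see why a probabilistic entry point matters even when it works most of the time, here is a small worked calculation. The numbers are purely illustrative assumptions, not measurements of any model, and the calculation assumes each run is independent.

```python
def chance_of_at_least_one_skip(p_invoke: float, runs: int) -> float:
    """Probability that the checkpoint is skipped at least once across
    `runs` independent runs, given per-run invocation probability p_invoke."""
    if not 0.0 <= p_invoke <= 1.0:
        raise ValueError("p_invoke must be a probability")
    return 1.0 - p_invoke ** runs


# Hypothetical: the model calls AskUserQuestion 99% of the time, over 100 runs.
print(round(chance_of_at_least_one_skip(0.99, 100), 2))  # 0.63
```

Under those assumed numbers, a checkpoint that looks reliable on any single run is more likely than not to be skipped at least once over a hundred runs. For an irreversible operation, one skip is enough.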

Layer 3: Runtime Mechanisms — “You Can’t Move”

3a. PreToolUse Hook

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "if": "Bash(git push*)",
            "command": "/path/to/safety-hook.sh",
            "timeout": 5
          }
        ]
      }
    ]
  }
}

The hook intercepts before the tool call executes. The model cannot bypass it. The if field ensures the hook process only spawns when matching a dangerous pattern, so normal commands aren’t affected.

The hook script reads JSON from stdin, inspects the command, and returns a permissionDecision:

#!/usr/bin/env bash
# Read the PreToolUse payload from stdin and extract tool_input.command.
INPUT=$(cat)
COMMAND=$(printf '%s' "$INPUT" | python3 -c "
import json,sys
print(json.load(sys.stdin).get('tool_input',{}).get('command',''))
" 2>/dev/null)

# DENY: hard block (checked first so a force push never falls through to ASK)
if printf '%s' "$COMMAND" | grep -qE 'git\s+push\s+(.*\s)?(--force|-f)\b'; then
  echo '{"hookSpecificOutput":{
    "hookEventName":"PreToolUse",
    "permissionDecision":"deny",
    "permissionDecisionReason":"Force push rewrites remote history."
  }}'
  exit 0
fi

# ASK: prompt user (also matches a bare "git push" with no arguments)
if printf '%s' "$COMMAND" | grep -qE 'git\s+push(\s|$)'; then
  echo '{"hookSpecificOutput":{
    "hookEventName":"PreToolUse",
    "permissionDecision":"ask",
    "permissionDecisionReason":"Git push detected. Confirm target branch."
  }}'
  exit 0
fi

# Passthrough: no output, normal permission flow applies
exit 0

Three decisions:

Decision	Effect	Use Case
`deny`	Hard block, user must re-authorize	Force push, S3 deletion, production DB writes
`ask`	Prompt for confirmation, can proceed	Regular push, device reboot
(empty)	Pass through	Normal commands
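Since the bash script already shells out to python3 to parse JSON, the same stdin/stdout protocol can be handled in Python directly. This is a minimal sketch of the I/O layer only: the function names are hypothetical, the substring check is deliberately simplistic, and the JSON shape is copied from the script above rather than from any additional specification.

```python
import json


def extract_command(raw: str) -> str:
    """Pull tool_input.command out of the hook's stdin payload.
    Returns an empty string for malformed or unexpected input."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ""
    if not isinstance(payload, dict):
        return ""
    tool_input = payload.get("tool_input")
    if not isinstance(tool_input, dict):
        return ""
    command = tool_input.get("command", "")
    return command if isinstance(command, str) else ""


def decision_json(decision: str, reason: str) -> str:
    """Serialize a decision in the same shape the bash script prints."""
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": decision,
            "permissionDecisionReason": reason,
        }
    })


def respond(raw: str) -> str:
    """Return the text the hook would print; "" means passthrough."""
    command = extract_command(raw)
    # Simplistic illustration only; real matching needs proper patterns.
    if "git push" in command and "--force" in command:
        return decision_json("deny", "Force push rewrites remote history.")
    return ""

# As an actual hook script: sys.stdout.write(respond(sys.stdin.read()))
```

Building the output with `json.dumps` avoids hand-quoting JSON inside shell strings, which matters once a reason message contains quotes or a command fragment.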

3b. Skill Splitting

/bads-update          → Steps 1-4 (dry-run only)
/bads-update-execute  → Steps 5-8 (execute + verify + cleanup)

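The user must manually type /bads-update-execute. For this to be a runtime guarantee rather than a convention, the execute Skill must not be model-invocable: if the model can trigger Skills on its own, splitting only moves the checkpoint. Claude Code's Skill frontmatter has a disable-model-invocation setting for this purpose; check the current Skills documentation for its exact behavior.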

3c. `disallowed-tools`

---
name: aosp-analysis
disallowed-tools:
  - Edit
  - Write
---

Removes write tools from the model’s available tool pool. A read-only Skill doesn’t need to modify files.

Implementation

Phase 1: PreToolUse Hook

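I created a shared hook script behind 6 if filters. It handles seven patterns, since the single git push filter covers both force pushes and regular pushes: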

Pattern	Decision	Reason
`git push --force`	deny	Rewrites remote history
`aws s3 rm`	deny	S3 deletion is irreversible
`kubectl delete` (except tunnel cleanup)	deny	K8s resource deletion
`--execute --yes`	deny	Production DB write
`git push`	ask	Confirm target branch
`adb reboot`	ask	Device restart
`docker compose down`	ask	Service shutdown
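The table above maps naturally onto an ordered rule list. The sketch below is one way to express it, not the actual shared hook script: the regular expressions are my own approximations, and the tunnel-cleanup exception uses a made-up pod name (`db-tunnel`) because the real cleanup command is not shown here.

```python
import re
from typing import Optional, Tuple

# (pattern, decision, reason), checked top to bottom. Deny rules come first so
# that "git push --force" is never downgraded to the "ask" rule for "git push".
RULES = [
    (r"\bgit\s+push\b.*\s(--force|-f)\b", "deny", "Rewrites remote history"),
    (r"\baws\s+s3\s+rm\b", "deny", "S3 deletion is irreversible"),
    # Hypothetical exception: "db-tunnel" stands in for the real tunnel cleanup.
    (r"\bkubectl\s+delete\b(?!\s+pod\s+db-tunnel\b)", "deny", "K8s resource deletion"),
    (r"--execute\s+--yes\b", "deny", "Production DB write"),
    (r"\bgit\s+push\b", "ask", "Confirm target branch"),
    (r"\badb\s+reboot\b", "ask", "Device restart"),
    (r"\bdocker\s+compose\s+down\b", "ask", "Service shutdown"),
]


def classify(command: str) -> Optional[Tuple[str, str]]:
    """Return (decision, reason) for the first matching rule, or None to pass through."""
    for pattern, decision, reason in RULES:
        if re.search(pattern, command):
            return decision, reason
    return None
```

Putting the exception inside the kubectl pattern, rather than returning early, means a chained command such as a tunnel cleanup followed by `aws s3 rm` is still evaluated against every other rule.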

The if field is critical: it filters using permission rule syntax, so the hook process only spawns on a match. It doesn’t affect ls, git status, ./gradlew, or any other normal command.

Phase 2: Split the Most Dangerous Skill

bads-update was my most dangerous Skill:

Before:

Steps 1-8: token → DB tunnel → dry-run → execute → verify → S3 cleanup
Only a single "Confirm with user before proceeding" in between

After:

/bads-update (Steps 1-4):
  token → DB tunnel → dry-run → STOP
  "Dry-run complete. Run /bads-update-execute when ready."

/bads-update-execute (Steps 5-8):
  execute → verify → S3 cleanup
  Hook intercepts --execute --yes and aws s3 rm

Double protection: Skill split (user must manually invoke) + Hook interception (blocks even within the execute skill).

Phase 3: Add AskUserQuestion to Destructive Skills

Skill	Checkpoint Location	Question
`commit-phase`	Before push	”Push to {remote} refs/for/{branch}?”
`device-test`	Before APK push	”Push {apk} to {device}:/product/priv-app/?”
`self-evolution`	Before modifications	”Apply these changes to CLAUDE.md / skills?”
`writing`	Before file write	”Write this content to {filepath}?”

Phase 4: `disallowed-tools` for 17 Read-Only Skills

One line of frontmatter. Analysis Skills can’t modify files.

Verification

I tested the hook script with Python’s subprocess module; all 10 test cases passed:

PASS | safe cmd             | expected=passthrough  | actual=passthrough
PASS | adb reboot           | expected=ask          | actual=ask
PASS | docker down          | expected=ask          | actual=ask
PASS | s3 rm                | expected=deny         | actual=deny
PASS | kubectl del          | expected=deny         | actual=deny
PASS | tunnel cleanup       | expected=passthrough  | actual=passthrough
PASS | execute yes          | expected=deny         | actual=deny
PASS | regular push         | expected=ask          | actual=ask
PASS | force push           | expected=deny         | actual=deny
PASS | git status           | expected=passthrough  | actual=passthrough

The most entertaining validation: the hook intercepted its own test commands during testing. Because the Bash command string contained --execute --yes, the PreToolUse hook treated it as a production DB write and denied it. This turned out to be the strongest end-to-end proof that the hook works in the real runtime.
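A harness along the following lines can produce a PASS/FAIL listing like the one above. It is a sketch under assumptions, not the original test script: the function names are hypothetical, and the payload contains only the fields the hook script reads.

```python
import json
import subprocess
from typing import List, Sequence, Tuple


def hook_decision(hook_argv: List[str], command: str) -> str:
    """Feed one Bash command to the hook; return 'deny', 'ask', or 'passthrough'."""
    payload = json.dumps({"tool_name": "Bash", "tool_input": {"command": command}})
    result = subprocess.run(
        hook_argv, input=payload, capture_output=True, text=True, timeout=5
    )
    stdout = result.stdout.strip()
    if not stdout:
        # The hook prints nothing when it lets a command through.
        return "passthrough"
    return json.loads(stdout)["hookSpecificOutput"]["permissionDecision"]


def run_cases(hook_argv: List[str], cases: Sequence[Tuple[str, str, str]]) -> int:
    """Print one line per (label, command, expected) case; return the failure count."""
    failures = 0
    for label, command, expected in cases:
        actual = hook_decision(hook_argv, command)
        status = "PASS" if actual == expected else "FAIL"
        if status == "FAIL":
            failures += 1
        print(f"{status} | {label:<20} | expected={expected:<12} | actual={actual}")
    return failures
```

A call might look like `run_cases(["/path/to/safety-hook.sh"], [("regular push", "git push origin main", "ask")])`. Because the hook runs as a separate process and receives its input on stdin, this exercises the same interface the runtime uses rather than the script's internals.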

Before and After

Before

38 skills:
  Layer 3: 0 skills   ← Not a single one used Runtime protection
  Layer 2: 1 skill
  Layer 1: 10 skills
  No checkpoint: 12 skills (5 destructive)
  Read-only: 17 skills (unprotected)

After

38 skills:
  PreToolUse Hook:     6 if-filters covering all dangerous Bash patterns
  Skill split:         bads-update → bads-update + bads-update-execute
  AskUserQuestion:     4 destructive skills
  disallowed-tools:    17 read-only skills

  Every destructive operation has at least one Runtime defense ✓

Design Principles

1. Defense in Depth, But Primary Defense Must Be Layer 3

Layer 3 (must-have)    → PreToolUse Hook / Skill split
Layer 2 (nice-to-have) → AskUserQuestion for UX
Layer 1 (last resort)  → Natural language hints

Layers 2 and 1 provide defense-in-depth. They are not the primary defense.

2. `if` Filters Are a Performance Requirement

Without if filters, every Bash command spawns a hook process. ls, git status, ./gradlew build — each adds 100-200ms of latency.

// Good: only spawn hook on git push
{"if": "Bash(git push*)", "command": "safety-hook.sh"}

// Bad: every Bash command spawns hook
{"command": "safety-hook.sh"}
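The per-command overhead depends on the machine and on what the hook does at startup, so it is worth measuring rather than assuming. Here is a minimal sketch for doing that; the function name and the default payload are assumptions, and it measures only process spawn plus hook runtime, not anything inside Claude Code itself.

```python
import subprocess
import time
from typing import List


def mean_spawn_seconds(hook_argv: List[str], payload: str = "{}", runs: int = 20) -> float:
    """Average wall-clock time to start the hook, feed it stdin, and wait for exit."""
    if runs < 1:
        raise ValueError("runs must be >= 1")
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(hook_argv, input=payload, capture_output=True, text=True, check=False)
    return (time.perf_counter() - start) / runs
```

Calling `mean_spawn_seconds(["/path/to/safety-hook.sh"])` gives a rough per-command cost for an unfiltered hook, which can then be weighed against how many Bash commands a typical session runs.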

3. Skill Splitting Is More Reliable Than Checkpoints

Adding a checkpoint inside a Skill (whether Layer 1 or Layer 2) means the model can probabilistically skip it. Skill splitting requires the user to type a second slash command, so as long as the model cannot invoke that Skill itself, it has no way to skip the step.

4. `allowed-tools` ≠ Restriction

Claude Code’s allowed-tools is pre-approval, not restriction. Using allowed-tools: [Bash, Read] doesn’t prevent the model from using Edit — it just triggers a permission prompt when Edit is called.

For actual restriction, use disallowed-tools.

Conclusion

The core question of AI Agent safety isn’t “will the model listen?” — it’s “what happens if it doesn’t?”

For read-only operations — no big deal. Worst case is some wasted context.

For irreversible operations — git push --force, aws s3 rm, production DB writes — you don’t need “please ask me first.” You need “you literally can’t.”

PreToolUse Hook is that “you literally can’t.”

References

Claude Code Hooks Guide — Official PreToolUse Hook documentation
Claude Code Skills — Skill frontmatter (allowed-tools, disallowed-tools)
Handle approvals and user input — AskUserQuestion blocking behavior
GitHub Issue #19308 — Model ignoring explicit tool call instructions in Skills
GitHub Issue #18454 — Model ignoring MANDATORY natural language checkpoints in CLAUDE.md

This is part of the “Claude Code in Practice” series. Previous post: Claude Code Skill Safety Gates: A Three-Layer Model from Auditing 35 Skills.
