Claude Code Skill Safety: From 'Please Stop' to 'You Can't Move'

38 Skills, three layers of defense, one hard lesson: natural language instructions are not a safety mechanism.


TL;DR

I have 38 Claude Code Skills (24 personal + 14 repo). An audit showed that 12 of them had no checkpoint at all (5 of those destructive), and 10 relied solely on natural language like “Confirm with user before proceeding.” This post documents how I systematically fixed the problem using a three-layer protection model.

Core insight: Natural language instructions are a “sign,” not a “physical barrier.” You don’t put up a “Please Keep Back” sign at the edge of a cliff and call it safety.


The Problem: Are Your Skills Actually Safe?

Claude Code Skills are reusable workflow templates. A well-written Skill can compress 10 minutes of work into 30 seconds. But here’s the thing — many Skills contain irreversible operations:

  • git push to remote
  • aws s3 rm deleting S3 objects
  • kubectl delete removing K8s resources
  • adb push overwriting system APKs
  • --execute --yes writing to production DB

I did a full audit of my 38 Skills:

Category                            | Count | Examples
Destructive, no checkpoint          | 5     | bads-skynet-e2e, device-test, commit-phase, test-folio, ventura-memory
Destructive, natural language only  | 10    | bads-update, self-evolution, phase-impl, writing, code-review, etc.
Destructive, has AskUserQuestion    | 3     | prepare-feature-for-ux-review, sprint-plan, release-notes
Read-only                           | 17    | aosp-analysis, director, trade-analysis, etc.

The scariest finding: My bads-update Skill runs 8 steps end-to-end — from token refresh to production DB write to S3 cleanup. The only thing between dry-run and execution was a single “Confirm with user before proceeding.” In long-context conversations, the model may skip right past it.
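The audit categories can be approximated mechanically before a human read-through. The sketch below is an illustration of that idea, not the script I used: the pattern lists and the `classify_skill` helper are hypothetical, and the phrases it looks for are assumptions about how checkpoints tend to be worded.

```python
import re

# Hypothetical pattern lists; extend them to match your own Skills.
DESTRUCTIVE = [
    r"git\s+push",
    r"aws\s+s3\s+rm",
    r"kubectl\s+delete",
    r"adb\s+push",
    r"--execute\s+--yes",
]
SOFT_CHECKPOINT = [
    r"confirm with (the )?user",
    r"wait for user",
]


def classify_skill(text: str) -> str:
    """Return the audit category for the body of one Skill file."""
    lowered = text.lower()
    if not any(re.search(p, lowered) for p in DESTRUCTIVE):
        return "read-only"
    if "askuserquestion" in lowered:
        return "destructive, has AskUserQuestion"
    if any(re.search(p, lowered) for p in SOFT_CHECKPOINT):
        return "destructive, natural language only"
    return "destructive, no checkpoint"
```

A scan like this only flags candidates. A Skill that shells out to a wrapper script can be destructive without containing any of these strings.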


The Three-Layer Protection Model

I categorize safety mechanisms into three layers, ordered by increasing reliability:

Layer 1 (Sign)           → Natural language: "Wait for user confirmation"
Layer 2 (Traffic light)  → AskUserQuestion tool call
Layer 3 (Physical wall)  → PreToolUse Hook / Skill split / disallowed-tools

Layer 1: Natural Language — “Please Stop”

**→ WAIT** for user response before proceeding.
Confirm with user before proceeding.

This is a sign. The model sees it, weighs it against all other instructions, and decides whether to comply. Compliance drops under these conditions:

  • Long context dilution: The longer the conversation, the lower the weight of earlier instructions
  • Task completion bias: The model prefers completing tasks over stopping
  • Vague wording: “Confirm with user” vs “Use AskUserQuestion tool”

Conclusion: Layer 1 should only be a last resort, never the primary defense.

Layer 2: AskUserQuestion — “Traffic Light That Might Break”

Use AskUserQuestion to confirm:
> "Push to origin refs/for/main? (Change-Id: I1234)"
Only proceed after user confirms.

AskUserQuestion is a tool call instruction. If the model decides to call it, the runtime forces a pause and waits for user response. But the first step — “the model decides to call it” — is still probabilistic.

GitHub Issue #19308 reports this behavior: the model can ignore explicit tool call instructions in Skills.

Probabilistic invocation → (if invoked) → Runtime blocks deterministically
         ↑                                        ↑
    May fail                                 100% reliable

Better than Layer 1 thanks to the runtime blocking, but the entry point is still probabilistic.

Layer 3: Runtime Mechanisms — “You Can’t Move”

3a. PreToolUse Hook

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "if": "Bash(git push*)",
            "command": "/path/to/safety-hook.sh",
            "timeout": 5
          }
        ]
      }
    ]
  }
}

The hook intercepts before the tool call executes. The model cannot bypass it. The if field ensures the hook process only spawns when matching a dangerous pattern, so normal commands aren’t affected.

The hook script reads JSON from stdin, inspects the command, and returns a permissionDecision:

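#!/usr/bin/env bash
# PreToolUse safety hook: reads the tool call as JSON on stdin and prints a
# permissionDecision on stdout. Printing nothing lets the command through.
INPUT=$(cat)

# Extract tool_input.command; on malformed input COMMAND stays empty.
COMMAND=$(printf '%s' "$INPUT" | python3 -c "
import json,sys
print(json.load(sys.stdin).get('tool_input',{}).get('command',''))
" 2>/dev/null)

# DENY: hard block on force push
if printf '%s' "$COMMAND" | grep -qE 'git\s+push\s+.*(--force|-f)\b'; then
  echo '{"hookSpecificOutput":{
    "hookEventName":"PreToolUse",
    "permissionDecision":"deny",
    "permissionDecisionReason":"Force push rewrites remote history."
  }}'
  exit 0
fi

# ASK: prompt user on any other push, including a bare "git push"
if printf '%s' "$COMMAND" | grep -qE 'git\s+push(\s|$)'; then
  echo '{"hookSpecificOutput":{
    "hookEventName":"PreToolUse",
    "permissionDecision":"ask",
    "permissionDecisionReason":"Git push detected. Confirm target branch."
  }}'
  exit 0
fi

# Passthrough: no output, normal permission flow applies
exit 0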

Three decisions:

Decision | Effect                               | Use Case
deny     | Hard block, user must re-authorize   | Force push, S3 deletion, production DB writes
ask      | Prompt for confirmation, can proceed | Regular push, device reboot
(empty)  | Pass through                         | Normal commands
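The same three-way decision can be written in Python, which avoids piping JSON through a nested python3 call. This is a minimal sketch, assuming the stdin payload and `hookSpecificOutput` shape used in the Bash script above; the `decide` and `handle` names are illustrative and not part of any Claude Code API.

```python
import json
import re
from typing import Optional, Tuple

FORCE_PUSH = re.compile(r"git\s+push\s+.*(--force|-f)\b")
ANY_PUSH = re.compile(r"git\s+push(\s|$)")


def decide(command: str) -> Optional[Tuple[str, str]]:
    """Return (decision, reason), or None to pass the command through."""
    if FORCE_PUSH.search(command):
        return ("deny", "Force push rewrites remote history.")
    if ANY_PUSH.search(command):
        return ("ask", "Git push detected. Confirm target branch.")
    return None


def handle(payload_text: str) -> str:
    """Map the hook's stdin text to its stdout text ('' means passthrough)."""
    try:
        payload = json.loads(payload_text)
    except json.JSONDecodeError:
        return ""  # unreadable input passes through, like the Bash version
    command = payload.get("tool_input", {}).get("command", "")
    result = decide(command)
    if result is None:
        return ""
    decision, reason = result
    return json.dumps({"hookSpecificOutput": {
        "hookEventName": "PreToolUse",
        "permissionDecision": decision,
        "permissionDecisionReason": reason,
    }})
```

To use it as a hook, a small entry point would write `handle(sys.stdin.read())` to stdout and exit with status 0. Keeping `decide` free of I/O makes the rules easy to unit test.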

3b. Skill Splitting

/bads-update          → Steps 1-4 (dry-run only)
/bads-update-execute  → Steps 5-8 (execute + verify + cleanup)

The user must manually type /bads-update-execute. This is a runtime guarantee — the model cannot autonomously invoke a Skill that requires manual user input.

3c. disallowed-tools

---
name: aosp-analysis
disallowed-tools:
  - Edit
  - Write
---

Removes write tools from the model’s available tool pool. A read-only Skill doesn’t need to modify files.


Implementation

Phase 1: PreToolUse Hook

Created a shared hook script with 6 if filters:

Pattern                                | Decision | Reason
git push --force                       | deny     | Rewrites remote history
aws s3 rm                              | deny     | S3 deletion is irreversible
kubectl delete (except tunnel cleanup) | deny     | K8s resource deletion
--execute --yes                        | deny     | Production DB write
git push                               | ask      | Confirm target branch
adb reboot                             | ask      | Device restart
docker compose down                    | ask      | Service shutdown

The if field is critical — it uses permission rule syntax filtering to ensure the hook process only spawns on matches. It doesn’t affect ls, git status, ./gradlew, or any other normal command.
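Inside the script, the pattern table maps naturally onto an ordered rule list where the first match wins, so deny rules must come before the broader ask rules. The sketch below is one way to express that, not the actual shared script. In particular, the tunnel-cleanup exception is modelled as "any `kubectl delete` command that mentions `tunnel`", which is an assumption made for illustration.

```python
import re
from typing import Optional, Tuple

# (pattern, exception pattern or None, decision, reason); first match wins.
RULES = [
    (r"git\s+push\s+.*(--force|-f)\b", None, "deny", "Rewrites remote history"),
    (r"aws\s+s3\s+rm\b", None, "deny", "S3 deletion is irreversible"),
    # Assumption: tunnel cleanup is recognisable by the word "tunnel".
    (r"kubectl\s+delete\b", r"kubectl\s+delete\b.*tunnel", "deny",
     "K8s resource deletion"),
    (r"--execute\s+--yes\b", None, "deny", "Production DB write"),
    (r"git\s+push(\s|$)", None, "ask", "Confirm target branch"),
    (r"adb\s+reboot\b", None, "ask", "Device restart"),
    (r"docker\s+compose\s+down\b", None, "ask", "Service shutdown"),
]


def match_rule(command: str) -> Optional[Tuple[str, str]]:
    """Return (decision, reason) for the first matching rule, else None."""
    for pattern, exception, decision, reason in RULES:
        if not re.search(pattern, command):
            continue
        if exception and re.search(exception, command):
            continue  # exempt from this rule only; later rules still apply
        return (decision, reason)
    return None
```

Scoping the exception to a single rule matters: a chained command such as a tunnel cleanup followed by `aws s3 rm` is still denied by the S3 rule.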

Phase 2: Split the Most Dangerous Skill

bads-update was my most dangerous Skill:

Before:

Steps 1-8: token → DB tunnel → dry-run → execute → verify → S3 cleanup
Only a single "Confirm with user before proceeding" in between

After:

/bads-update (Steps 1-4):
  token → DB tunnel → dry-run → STOP
  "Dry-run complete. Run /bads-update-execute when ready."

/bads-update-execute (Steps 5-8):
  execute → verify → S3 cleanup
  Hook intercepts --execute --yes and aws s3 rm

Double protection: Skill split (user must manually invoke) + Hook interception (blocks even within the execute skill).

Phase 3: Add AskUserQuestion to Destructive Skills

Skill          | Checkpoint Location  | Question
commit-phase   | Before push          | “Push to {remote} refs/for/{branch}?”
device-test    | Before APK push      | “Push {apk} to {device}:/product/priv-app/?”
self-evolution | Before modifications | “Apply these changes to CLAUDE.md / skills?”
writing        | Before file write    | “Write this content to {filepath}?”

Phase 4: disallowed-tools for 17 Read-Only Skills

One frontmatter field per Skill. Analysis Skills can no longer modify files.
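With 17 Skills to update, it is easy to miss one, so a quick check helps. The following sketch assumes the frontmatter layout shown in section 3c (a `disallowed-tools:` key followed by `- Tool` items) and uses a deliberately naive line parser instead of a YAML library. `disallowed_tools` and `is_write_locked` are hypothetical helper names.

```python
from typing import List


def disallowed_tools(skill_text: str) -> List[str]:
    """Extract the disallowed-tools list from a Skill's frontmatter."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at the top of the file
    tools: List[str] = []
    in_list = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # end of frontmatter
            break
        if stripped.startswith("disallowed-tools:"):
            in_list = True
        elif in_list and stripped.startswith("- "):
            tools.append(stripped[2:].strip())
        elif stripped:
            in_list = False  # any other key ends the list
    return tools


def is_write_locked(skill_text: str) -> bool:
    """True when both Edit and Write are removed from the tool pool."""
    return {"Edit", "Write"} <= set(disallowed_tools(skill_text))
```

Running `is_write_locked` over the text of each read-only Skill file and listing the ones that return `False` gives a short to-do list.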


Verification

Tested the hook script with Python subprocess — all 10 test cases passed:

PASS | safe cmd             | expected=passthrough  | actual=passthrough
PASS | adb reboot           | expected=ask          | actual=ask
PASS | docker down          | expected=ask          | actual=ask
PASS | s3 rm                | expected=deny         | actual=deny
PASS | kubectl del          | expected=deny         | actual=deny
PASS | tunnel cleanup       | expected=passthrough  | actual=passthrough
PASS | execute yes          | expected=deny         | actual=deny
PASS | regular push         | expected=ask          | actual=ask
PASS | force push           | expected=deny         | actual=deny
PASS | git status           | expected=passthrough  | actual=passthrough

The most entertaining validation: the hook intercepted its own test commands during testing. Because the Bash command string contained --execute --yes, the PreToolUse hook treated it as a production DB write and denied it. This turned out to be the strongest end-to-end proof: the hook works in the real runtime, not just in isolation.
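A subprocess-based harness for this kind of test can be quite small. The version below is a sketch rather than my exact test file: it sends a minimal payload containing only `tool_name` and `tool_input` (the real runtime sends more fields), and `run_hook` and `run_cases` are illustrative names.

```python
import json
import subprocess
from typing import List, Sequence, Tuple


def run_hook(hook_argv: List[str], command: str) -> str:
    """Feed one Bash tool call to a hook and return its decision."""
    payload = json.dumps({"tool_name": "Bash",
                          "tool_input": {"command": command}})
    proc = subprocess.run(hook_argv, input=payload, capture_output=True,
                          text=True, timeout=5)
    if not proc.stdout.strip():
        return "passthrough"  # no output means the hook did not intervene
    output = json.loads(proc.stdout)
    return output["hookSpecificOutput"]["permissionDecision"]


def run_cases(hook_argv: List[str],
              cases: Sequence[Tuple[str, str, str]]) -> int:
    """Print one PASS/FAIL line per case and return the failure count."""
    failures = 0
    for name, command, expected in cases:
        actual = run_hook(hook_argv, command)
        status = "PASS" if actual == expected else "FAIL"
        if status == "FAIL":
            failures += 1
        print(f"{status} | {name:<20} | expected={expected:<12} | actual={actual}")
    return failures
```

A call such as `run_cases(["/path/to/safety-hook.sh"], [("regular push", "git push origin main", "ask")])` prints one line per case in the format above. Running the harness from outside a Claude Code session avoids the hook intercepting the test commands themselves.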


Before and After

Before

38 skills:
  Layer 3: 0 skills   ← Not a single one used Runtime protection
  Layer 2: 1 skill
  Layer 1: 10 skills
  No checkpoint: 12 skills (5 destructive)
  Read-only: 17 skills (unprotected)

After

38 skills:
  PreToolUse Hook:     6 if-filters covering all dangerous Bash patterns
  Skill split:         bads-update → bads-update + bads-update-execute
  AskUserQuestion:     4 destructive skills
  disallowed-tools:    17 read-only skills

  Every destructive operation has at least one Runtime defense ✓

Design Principles

1. Defense in Depth, But Primary Defense Must Be Layer 3

Layer 3 (must-have)    → PreToolUse Hook / Skill split
Layer 2 (nice-to-have) → AskUserQuestion for UX
Layer 1 (last resort)  → Natural language hints

Layers 2 and 1 provide defense-in-depth. They are not the primary defense.

2. if Filters Are a Performance Requirement

Without if filters, every Bash command spawns a hook process. ls, git status, ./gradlew build — each adds 100-200ms of latency.

// Good: only spawn hook on git push
{"if": "Bash(git push*)", "command": "safety-hook.sh"}

// Bad: every Bash command spawns hook
{"command": "safety-hook.sh"}
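To see why this adds up, here is the arithmetic for an unfiltered hook. The per-call range comes from the 100-200ms figure above; the session size of 300 Bash calls is a made-up number for illustration.

```python
from typing import Tuple


def hook_overhead_seconds(bash_calls: int,
                          per_call_ms: Tuple[int, int] = (100, 200)
                          ) -> Tuple[float, float]:
    """Return the (low, high) added latency in seconds for unfiltered hooks."""
    low_ms, high_ms = per_call_ms
    return (bash_calls * low_ms / 1000, bash_calls * high_ms / 1000)


# Hypothetical session with 300 Bash calls.
print(hook_overhead_seconds(300))  # (30.0, 60.0)
```

With an if filter, only the handful of calls that match a dangerous pattern pay that cost.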

3. Skill Splitting Is More Reliable Than Checkpoints

Adding a checkpoint inside a Skill (whether Layer 1 or Layer 2) means the model can probabilistically skip it. Skill splitting requires the user to manually type another slash command, so there is no checkpoint inside the first Skill for the model to skip.

4. allowed-tools ≠ Restriction

Claude Code’s allowed-tools is pre-approval, not restriction. Using allowed-tools: [Bash, Read] doesn’t prevent the model from using Edit — it just triggers a permission prompt when Edit is called.

For actual restriction, use disallowed-tools.


Conclusion

The core question of AI Agent safety isn’t “will the model listen?” — it’s “what happens if it doesn’t?”

For read-only operations — no big deal. Worst case is some wasted context.

For irreversible operations — git push --force, aws s3 rm, production DB writes — you don’t need “please ask me first.” You need “you literally can’t.”

PreToolUse Hook is that “you literally can’t.”



This is part of the “Claude Code in Practice” series. Previous post: Claude Code Skill Safety Gates: A Three-Layer Model from Auditing 35 Skills.