getsentry · dcramer · Apr 24, 2026
diff --git a/skills/prompt-optimizer/SKILL.md b/skills/prompt-optimizer/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: prompt-optimizer
-description: Create, optimize, and iteratively refine agent prompts and system prompts. Use when asked to "improve a prompt", "optimize a system prompt", "rewrite an agent prompt", "tune prompt wording", "make this prompt more reliable", or "adapt a prompt for OpenAI, Claude, or Gemini". Handles model-specific prompt guidance, prompt markers/tags, eval design, and meta optimization loops for new and existing prompts.
+description: Create, optimize, and iteratively refine agent prompts and system prompts. Use when asked to "improve a prompt", "optimize a system prompt", "rewrite an agent prompt", "tune prompt wording", "make this prompt more reliable", "adapt a prompt for OpenAI, Claude, or Gemini", "design tool policy for an agent prompt", "how should I expose tools in a prompt", or "how should I disclose skills in an agent". Handles model-specific prompt guidance, prompt markers/tags, tool disclosure and tool-call narration, skill disclosure and routing, layered platform/deployer prompts, eval design, and meta optimization loops.
 ---
 
 # Prompt Optimizer
@@ -55,75 +55,22 @@ Read `references/model-family-notes.md`.
 
 ## Step 3: Shape the prompt deliberately
 
-Read `references/core-patterns.md`. When the prompt surface includes tools or a skill layer, also read `references/tools.md` or `references/skills.md` respectively.
-
-1. Separate durable behavior from task-local context:
-- stable policy and behavioral defaults belong in `system` or `developer`
-- variable inputs, retrieved context, and task instances belong in templated user-facing sections
-- when the system prompt is assembled at runtime from a platform layer and a deployer-authored persona layer (e.g., `SOUL.md`, `CLAUDE.md`, `AGENTS.md`), see "Layered prompts with multiple owners" in `references/core-patterns.md` — platform behavior rules must not depend on what the deployer layer contains
-
-2. Keep one authoritative instruction per behavior:
-- if a rule appears in more than one layer, choose one owner for it
-- stable cross-task rules belong in `system` or `developer`
-- examples should teach format, edge-case handling, or tool behavior, not restate the whole policy
-- user payloads should carry task-local facts, not durable policy
-
-3. Use markers only when they reduce ambiguity:
-- use markdown headings or XML-style tags to separate instructions, context, examples, tool rules, and output contracts
-- keep tag names descriptive and consistent
-- do not wrap every sentence in markup
-
-4. Make the prompt easy to execute:
-- put one high-value behavior per bullet or line when the task is fragile
-- prefer positive instructions over "do not do X" lists
-- place tool-use rules, escalation boundaries, and stop conditions in explicit sections
-- keep persona light unless it changes behavior in a useful way
-- use the shortest wording that preserves the intended behavioral constraint
-- cut motivational filler, repeated reminders, and examples that do not improve evals
-- for long-context prompts, place evidence before the final query and keep the actual ask in a clear terminal section
-- keep instructions, evidence, and schemas in distinct blocks so the model does not have to infer what is policy versus data
-
-5. Treat examples as first-class prompt assets:
-- start simple before adding examples
-- add examples only when they improve format control, edge-case handling, or tool behavior
-- keep examples structurally consistent
-- prefer positive demonstrations over anti-pattern-only demonstrations
+Read `references/core-patterns.md`. When the prompt surface includes tools or a skill layer, also read `references/tools.md` or `references/skills.md`. Reach for `references/transformed-examples.md` when the task is under-specified or the first draft is weak.
+
+Apply, in order:
+
+1. Layer the prompt — stable behavior in `system`/`developer`, task-local context in templated user sections, examples as a third layer.
+2. Place directives in canonical rules sections (`<behavior>`, `<tool_policy>`, `<constraints>`, `<workflow>`), not buried inside descriptive markers.
+3. Keep one authoritative owner per rule. Collapse duplicates.
+4. Cross-check the symptom-to-fix table in `core-patterns.md` before adding new instructions.
 
 ## Step 4: Run the meta optimization loop
 
 Read `references/meta-optimization-loop.md`.
 
-1. Start with the current prompt or a simple first draft.
-2. Score it on a representative slice:
-- at least one happy-path case
-- at least one failure replay
-- at least one ambiguous case
-- at least one edge case
-- at least one "should refuse", "should ask", or "should defer" case when relevant
-
-3. Turn failures into explicit criticisms:
-- identify what the prompt under-specified, over-specified, or contradicted
-- write critiques as actionable edits, not vague complaints
-
-4. Generate a small beam of candidate prompts:
-- one minimal-diff repair
-- one structure-first rewrite
-- one example- or tool-rule-centered variant when that is the likely bottleneck
-- one provider-specific adapter when cross-model behavior is the issue
-
-5. Compare candidates on the same eval slice.
-6. Keep the best candidate and log what changed and why.
-7. Preserve the evidence for each round:
-- prompt version
-- eval case
-- model output
-- failure reason
-- relevant scores
-
-8. Test the winner on a holdout slice before finalizing.
-9. Stop when scores plateau, edits oscillate, cost rises without quality gain, or the remaining issue is outside prompt control.
-
-Keep edits minimal and causal. Record what you removed as well as what you added. If you change everything at once, you learn nothing about what actually helped.
+Baseline on a representative slice → cluster failures → write critiques as concrete edits → generate a small candidate beam (minimal-diff repair, structure-first rewrite, example-or-tool-rule variant) → compare on the same slice → keep the best → validate on a holdout → stop when scores plateau, edits oscillate, or cost rises without gain.
+
+Record what you remove as well as what you add.
 
 ## Step 5: Produce a reusable deliverable
 
@@ -139,24 +86,6 @@ Return:
 
 If the user supplied an existing prompt, include a concise diff-style explanation of the biggest behavioral changes.
 
-## Step 6: Guard against common failure modes
-
-Read `references/transformed-examples.md` when the task is ambiguous or the first draft is weak.
-
-Do not:
-
-- optimize wording before defining the eval target
-- mix instructions, examples, and raw context without boundaries
-- keep the same rule in multiple layers unless there is a proven reason
-- let stable rules drift into the user payload just because the current prompt template makes it convenient
-- ask reasoning models to reveal chain-of-thought just because the task is hard
-- keep contradictory legacy instructions in the same prompt
-- overfit to one or two examples
-- keep examples that do not improve measured behavior
-- solve tool-use failures only in the system prompt when the real problem is the tool description or schema
-- add markers everywhere and mistake structure for clarity
-- use a bloated persona as a substitute for concrete behavior rules
-
 ## Output standard
 
 The final prompt package should be reusable by another engineer without rediscovering:

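Review note on Step 4 of `SKILL.md`: the one-line loop (baseline, candidate beam, compare on the same slice, keep the best, validate on a holdout, stop on plateau or oscillation) can be sketched roughly as below. This is a minimal illustration, not the skill's implementation. `propose` and `score` are hypothetical callables the caller supplies, `min_gain` and `max_rounds` are assumed thresholds, and falling back to the baseline when the winner regresses on the holdout is an assumption the diff does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Round:
    """Evidence kept for one round: the prompt tried and its slice score."""
    prompt: str
    score: float


def optimize(
    baseline: str,
    propose: Callable[[str], Sequence[str]],        # hypothetical: returns a candidate beam
    score: Callable[[str, Sequence[str]], float],   # hypothetical: score of a prompt on a slice
    eval_slice: Sequence[str],
    holdout: Sequence[str],
    max_rounds: int = 5,     # assumed cost cap
    min_gain: float = 0.01,  # assumed plateau threshold
) -> tuple[str, list[Round]]:
    best, best_score = baseline, score(baseline, eval_slice)
    history = [Round(best, best_score)]
    seen = {best}

    for _ in range(max_rounds):
        # Only consider candidates not tried before; an empty beam means
        # the edits are oscillating between prompts already evaluated.
        candidates = [c for c in propose(best) if c not in seen]
        if not candidates:
            break
        seen.update(candidates)

        # Compare every candidate on the same eval slice.
        scored = [(score(c, eval_slice), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        history.append(Round(top, top_score))

        # Stop when the best candidate no longer improves meaningfully.
        if top_score - best_score < min_gain:
            break
        best, best_score = top, top_score

    # Validate the winner on the holdout; keep the baseline if it regresses there.
    if best != baseline and score(best, holdout) < score(baseline, holdout):
        return baseline, history
    return best, history
```

The returned `history` is the place to record what each round tried, which is what makes "record what you remove as well as what you add" checkable afterwards.
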
diff --git a/skills/prompt-optimizer/SOURCES.md b/skills/prompt-optimizer/SOURCES.md
@@ -121,6 +121,12 @@ Why: this skill is a repeatable prompt-optimization workflow with explicit preco
 - "Port this prompt from GPT to Gemini."
 - "Make this tool-using prompt more reliable."
 - "Tune this prompt wording with a proper eval loop."
+- "How should I expose tools in my agent's system prompt?"
+- "Design tool policy for our harness prompt."
+- "Stop the model from narrating 'let me check' before tool calls."
+- "How should I disclose skills in the system prompt — eager or lazy?"
+- "Route between two adjacent skills that keep mis-matching."
+- "Split platform rules out of our customer-authored persona file."
 
 ### Should not trigger
 
@@ -130,6 +136,8 @@ Why: this skill is a repeatable prompt-optimization workflow with explicit preco
 - "Summarize this document."
 - "Design a new model architecture."
 - "Tune only the temperature and top-p settings."
+- "Implement a new MCP server." (this is a tool/server authoring task, not a prompt task)
+- "Write the SKILL.md body for a new skill." (this is a skill-authoring task — use `skill-writer`)
 
 ## Open gaps
 

diff --git a/skills/prompt-optimizer/references/core-patterns.md b/skills/prompt-optimizer/references/core-patterns.md
@@ -2,6 +2,17 @@
 
 Use this file when creating a new prompt or restructuring a weak one.
 
+## Contents
+
+- When markers help
+- Where rules live
+- Layer the prompt correctly
+- Layered prompts with multiple owners
+- Portable agent prompt skeleton
+- High-value prompt moves
+- Examples
+- Symptom to fix mapping
+
 ## When markers help
 
 Use markers when the prompt mixes different content types:

diff --git a/skills/prompt-optimizer/references/meta-optimization-loop.md b/skills/prompt-optimizer/references/meta-optimization-loop.md
@@ -2,6 +2,13 @@
 
 Use this file when refining an existing prompt or when a first draft needs disciplined iteration.
 
+## Contents
+
+- Inputs
+- Optimization loop (baseline, failure clustering, textual gradients, candidate beam, compare, reflective memory, holdout validation, stop conditions)
+- Practical defaults
+- What this loop is borrowing from
+
 ## Inputs
 
 Collect these before iterating:

diff --git a/skills/prompt-optimizer/references/transformed-examples.md b/skills/prompt-optimizer/references/transformed-examples.md
@@ -2,6 +2,13 @@
 
 Use these examples when the task is under-specified or when you need a stronger default shape.
 
+## Contents
+
+- Example 1: Happy-path new agent prompt
+- Example 2: Robust variant for a weak existing prompt
+- Example 3: Anti-pattern and corrected version
+- Example 4: Directive placement — state marker vs. rules section
+
 ## Example 1: Happy-path new agent prompt
 
 ### Input brief
@@ -60,10 +67,10 @@ Default to implementation when the user's intent is execution rather than discus
 Use tools to discover missing facts instead of guessing.
 </default_behavior>
 
-<tool_rules>
+<tool_policy>
 Use repository tools whenever correctness depends on current files, logs, or config.
 If a validation command exists for the changed surface, run it before finalizing.
-</tool_rules>
+</tool_policy>
 
 <progress_updates>
 Send short progress updates during long tasks.
@@ -107,9 +114,9 @@ You are a reliable implementation agent.
 Complete the user's task accurately and efficiently.
 </goal>
 
-<tool_use>
+<tool_policy>
 Use tools when current repository facts, logs, or external state are needed.
-</tool_use>
+</tool_policy>
 
 <clarification>
 Ask only when required information is missing or the action is risky.
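
Review note on the tag renames in `transformed-examples.md`: both `<tool_rules>` and `<tool_use>` become `<tool_policy>`, so existing prompts may still carry the old names. A small check along the lines below could flag them. It is a sketch only: the `LEGACY_TAGS` mapping is taken from the two renames in this diff, and the function name and return shape are hypothetical.

```python
import re

# Legacy tag -> canonical tag, based only on the renames shown in this diff.
LEGACY_TAGS = {"tool_rules": "tool_policy", "tool_use": "tool_policy"}

# Matches simple opening or closing XML-style tags without attributes.
TAG_RE = re.compile(r"</?([A-Za-z_][\w-]*)>")


def find_legacy_tags(prompt: str) -> list[tuple[int, str, str]]:
    """Return (line_number, legacy_tag, canonical_tag) for each legacy tag found."""
    hits = []
    for lineno, line in enumerate(prompt.splitlines(), start=1):
        for match in TAG_RE.finditer(line):
            name = match.group(1)
            if name in LEGACY_TAGS:
                hits.append((lineno, name, LEGACY_TAGS[name]))
    return hits
```

Opening and closing tags are reported separately, so a paired block produces two hits and both ends get renamed together.
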