# Skill Reviewer
Audit agent skills (SKILL.md files) for quality, correctness, and completeness. Provides a structured review framework with scoring rubric, defect checklists, and improvement recommendations.
## When to Use
- Reviewing a skill before publishing to the registry
- Evaluating a skill you downloaded from the registry
- Auditing your own skills for quality improvements
- Comparing skills in the same category
- Deciding whether a skill is worth installing
## Review Process
### Step 1: Structural Check
Verify the skill has the required structure. Read the file and check each item:
STRUCTURAL CHECKLIST:
[ ] Valid YAML frontmatter (opens and closes with ---)
[ ] `name` field present and a valid slug (lowercase, hyphenated)
[ ] `description` field present and non-empty
[ ] `metadata` field present with valid JSON
[ ] `metadata.clawdbot.emoji` is a single emoji
[ ] `metadata.clawdbot.requires.anyBins` lists real CLI tools
[ ] Title heading (# Title) immediately after frontmatter
[ ] Summary paragraph after title
[ ] "When to Use" section present
[ ] At least 3 main content sections
[ ] "Tips" section present at the end
### Step 2: Frontmatter Quality
#### Description field audit
The description is the most impactful field. Evaluate it against these criteria:
DESCRIPTION SCORING:
[2] Starts with what the skill does (active verb)
GOOD: "Write Makefiles for any project type."
BAD: "This skill covers Makefiles."
BAD: "A comprehensive guide to Make."
[2] Includes trigger phrases ("Use when...")
GOOD: "Use when setting up build automation, defining multi-target builds"
BAD: No trigger phrases at all
[2] Specific scope (mentions concrete tools, languages, or operations)
GOOD: "SQLite/PostgreSQL/MySQL โ schema design, queries, CTEs, window functions"
BAD: "Database stuff"
[1] Reasonable length (50-200 characters)
TOO SHORT: "Make things" (no search surface)
TOO LONG: 300+ characters (gets truncated)
[1] Contains searchable keywords naturally
GOOD: "cron jobs, systemd timers, scheduling"
BAD: Keywords stuffed unnaturally
Score: __/8
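Length and trigger phrases are easy to verify mechanically. A sketch, assuming the description fits on a single frontmatter line (hypothetical path as in the structural check):

```bash
DESC=$(grep -m1 '^description:' skills/my-skill/SKILL.md | sed 's/^description:[[:space:]]*//')

echo "Length: ${#DESC} characters"   # target: 50-200
echo "$DESC" | grep -qi 'use when' \
  && echo "Trigger phrase: present" \
  || echo "Trigger phrase: missing"
```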
#### Metadata audit
METADATA SCORING:
[1] emoji is relevant to the skill topic
[1] requires.anyBins lists tools the skill actually uses (not generic tools like "bash")
[1] os array is accurate (don't claim win32 if commands are Linux-only)
[1] JSON is valid (test with a JSON parser)
Score: __/4
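One way to confirm the metadata block parses cleanly, reusing the frontmatter file extracted in the Step 1 sketch (again assuming `python3` with PyYAML):

```bash
python3 - <<'EOF'
import json, yaml
fm = yaml.safe_load(open("/tmp/frontmatter.yml")) or {}
# Round-tripping through JSON fails loudly if the structure is malformed
print(json.dumps(fm.get("metadata"), indent=2, ensure_ascii=False))
EOF
```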
### Step 3: Content Quality
#### Example density
Count code blocks and total lines:
EXAMPLE DENSITY:
Lines: ___
Code blocks: ___
Ratio: 1 code block per ___ lines
TARGET: 1 code block per 8-15 lines
< 8 lines per block: possibly over-fragmented
> 20 lines per block: needs more examples
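A rough way to compute the ratio; this sketch assumes the skill has at least one code block (note that `grep -c` counts both opening and closing fences, so halve it):

```bash
FILE=skills/my-skill/SKILL.md
LINES=$(wc -l < "$FILE")
BLOCKS=$(( $(grep -c '^```' "$FILE") / 2 ))   # fences come in pairs
echo "$LINES lines, $BLOCKS code blocks, ~$(( LINES / BLOCKS )) lines per block"
```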
#### Example quality
For each code block, check:
EXAMPLE QUALITY CHECKLIST:
[ ] Language tag specified (```bash, ```python, etc.)
[ ] Command is syntactically correct
[ ] Output shown in comments where helpful
[ ] Uses realistic values (not foo/bar/baz)
[ ] No placeholder values left (TODO, FIXME, xxx)
[ ] Self-contained (doesn't depend on undefined variables)
OR setup is shown/referenced
[ ] Covers the common case (not just edge cases)
Score each example 0-3:
- 0: Broken or misleading
- 1: Works but minimal (no output, no context)
- 2: Good (correct, has output or explanation)
- 3: Excellent (copy-pasteable, realistic, covers edge case)
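The language-tag item is mechanical to audit: opening fences are every other fence line, so list them and look for any without a language. A sketch, assuming fences are balanced:

```bash
# Print each opening fence with its line number; a bare fence (no language) is a defect
awk '/^```/ { n++; if (n % 2 == 1) print NR": "$0 }' skills/my-skill/SKILL.md
```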
#### Section organization
ORGANIZATION SCORING:
[2] Organized by task/scenario (not by abstract concept)
GOOD: "## Encode and Decode" โ "## Inspect Characters" โ "## Convert Formats"
BAD: "## Theory" โ "## Types" โ "## Advanced"
[2] Most common operations come first
GOOD: Basic usage → Variations → Advanced → Edge cases
BAD: Configuration → Theory → Finally the basic usage
[1] Sections are self-contained (can be used independently)
[1] Consistent depth (not mixing h2 with h4 randomly)
Score: __/6
#### Cross-platform accuracy
PLATFORM CHECKLIST:
[ ] macOS differences noted where relevant
(sed -i '' vs sed -i, brew vs apt, BSD vs GNU flags)
[ ] Linux distro variations noted (apt vs yum vs pacman)
[ ] Windows compatibility addressed if os includes "win32"
[ ] Tool version assumptions stated (Docker v2 syntax, Python 3.x)
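A rough sweep for commands that commonly differ between BSD/macOS and GNU/Linux; every match deserves a platform note or an alternative (heuristic only):

```bash
grep -nE 'sed -i|readlink -f|date -d|stat -c|apt(-get)? |brew |yum |pacman ' \
  skills/my-skill/SKILL.md
```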
### Step 4: Actionability Assessment
The core question: can an agent follow these instructions to produce correct results?
ACTIONABILITY SCORING:
[3] Instructions are imperative ("Run X", "Create Y")
NOT: "You might consider..." or "It's recommended to..."
[3] Steps are ordered logically (prerequisites before actions)
[2] Error cases addressed (what to do when something fails)
[2] Output/result described (how to verify it worked)
Score: __/10
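Hedged, non-imperative phrasing can be flagged with a simple pattern match. A heuristic sketch; matches still need human judgment:

```bash
grep -niE "might consider|you could|it's recommended|you may want" skills/my-skill/SKILL.md
```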
### Step 5: Tips Section Quality
TIPS SCORING:
[2] 5-10 tips present
[2] Tips are non-obvious (not "read the documentation")
GOOD: "The number one Makefile bug: spaces instead of tabs"
BAD: "Make sure to test your code"
[2] Tips are specific and actionable
GOOD: "Use flock to prevent overlapping cron runs"
BAD: "Be careful with concurrent execution"
[1] No tips contradict the main content
[1] Tips cover gotchas/footguns specific to this topic
Score: __/8
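Counting the tips is scriptable, assuming the reviewed skill uses a `## Tips` heading with `-` bullets and Tips is the last section in the file:

```bash
awk '/^## Tips/,0' skills/my-skill/SKILL.md | grep -c '^- '   # target: 5-10
```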
## Scoring Summary
SKILL REVIEW SCORECARD
───────────────────────────────────────
Skill: [name]
Reviewer: [agent/human]
Date: [date]
Category          Score   Max
─────────────────────────────────────
Structure           __     11
Description         __      8
Metadata            __      4
Example density     __      3*
Example quality     __      3*
Organization        __      6
Actionability       __     10
Tips                __      8
─────────────────────────────────────
TOTAL               __     53
* Example density and quality are scored per example,
  not summed. Use the average across all examples.
RATING:
45+    Excellent → publish-ready
35-44  Good → minor improvements needed
25-34  Fair → significant gaps to address
< 25   Poor → needs major rework
VERDICT: [PUBLISH / REVISE / REWORK]
## Common Defects
### Critical (blocks publishing)
DEFECT: Invalid frontmatter
DETECT: YAML parse error, missing required fields
FIX: Validate YAML, ensure name/description/metadata all present
DEFECT: Broken code examples
DETECT: Syntax errors, undefined variables, wrong flags
FIX: Test every command in a clean environment
DEFECT: Wrong tool requirements
DETECT: metadata.requires lists tools not used in content, or omits tools that are used
FIX: Grep content for command names, update requires to match (a cross-check is sketched after this list)
DEFECT: Misleading description
DETECT: Description promises coverage the content doesn't deliver
FIX: Align description with actual content, or add missing content
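For the wrong-tool-requirements defect, a quick cross-check is to grep the content for each declared binary. A sketch with hypothetical tool names; substitute the skill's actual `anyBins` list:

```bash
for tool in docker curl jq; do
  printf '%-10s' "$tool"
  grep -qw "$tool" skills/my-skill/SKILL.md \
    && echo "used in content" \
    || echo "NOT found: drop it or add coverage"
done
```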
### Major (should fix before publishing)
DEFECT: No "When to Use" section
IMPACT: Agent doesn't know when to activate the skill
FIX: Add 4-8 bullet points describing trigger scenarios
DEFECT: Text walls without examples
DETECT: Any section > 10 lines with no code block
FIX: Add concrete examples for every concept described
DEFECT: Examples missing language tags
DETECT: ``` without language identifier
FIX: Add bash, python, javascript, yaml, etc. to every code fence
DEFECT: No Tips section
IMPACT: Missing the distilled expertise that makes a skill valuable
FIX: Add 5-10 non-obvious, actionable tips
DEFECT: Abstract organization
DETECT: Sections named "Theory", "Overview", "Background", "Introduction"
FIX: Reorganize by task/operation: what the user is trying to DO
### Minor (nice to fix)
DEFECT: Placeholder values
DETECT: foo, bar, baz, example.com, 1.2.3.4, TODO, FIXME
FIX: Replace with realistic values (myapp, api.example.com, 192.168.1.100)
DEFECT: Inconsistent formatting
DETECT: Mixed heading levels, inconsistent code block style
FIX: Standardize heading hierarchy and formatting
DEFECT: Missing cross-references
DETECT: Mentions tools/concepts covered by other skills without referencing them
FIX: Add "See the X skill for more on Y" notes
DEFECT: Outdated commands
DETECT: docker-compose (v1), python (not python3), npm -g without npx alternative
FIX: Update to current tool versions and syntax
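The outdated-command patterns are greppable too. A rough sweep; expect some false positives (e.g. legitimate mentions of `python` in prose):

```bash
grep -nE 'docker-compose |npm (install )?-g|python [^3]' skills/my-skill/SKILL.md
```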
## Comparative Review
When comparing skills in the same category:
COMPARATIVE CRITERIA:
1. Coverage breadth
Which skill covers more use cases?
2. Example quality
Which has more runnable, realistic examples?
3. Depth on common operations
Which handles the 80% case better?
4. Edge case coverage
Which addresses more gotchas and failure modes?
5. Cross-platform support
Which works across more environments?
6. Freshness
Which uses current tool versions and syntax?
WINNER: [skill A / skill B / tie]
REASON: [1-2 sentence justification]
## Quick Review Template
For a fast review when you don't need full scoring:
## Quick Review: [skill-name]
**Structure**: [OK / Issues: ...]
**Description**: [Strong / Weak: reason]
**Examples**: [X code blocks across Y lines → density OK/low/high]
**Actionability**: [Agent can/cannot follow these instructions because...]
**Top defect**: [The single most impactful thing to fix]
**Verdict**: [PUBLISH / REVISE / REWORK]
## Review Workflow
### Reviewing your own skill before publishing
# 1. Validate frontmatter
head -20 skills/my-skill/SKILL.md
# Visually confirm YAML is valid
# 2. Count code blocks
grep -c '^```' skills/my-skill/SKILL.md
# Counts both opening and closing fences: halve it, then divide total lines by the result
# 3. Check for placeholders
grep -n -i 'todo\|fixme\|xxx\|foo\|bar\|baz' skills/my-skill/SKILL.md
# 4. Check for missing language tags
grep -n '^```$' skills/my-skill/SKILL.md
# Every code fence should have a language tag; a bare ``` is a defect
# 5. Verify tool requirements match content
# Extract requires from frontmatter, then grep for each tool in content
# 6. Test commands (sample 3-5 from the skill)
# Run them in a clean shell to verify they work
# 7. Run the scorecard mentally or in a file
# Target: 35+ for good, 45+ for excellent
### Reviewing a registry skill after installing
# Install the skill
npx molthub@latest install skill-name
# Read it
cat skills/skill-name/SKILL.md
# Run the quick review template
# If score < 25, consider uninstalling and finding an alternative
## Tips
- The description field accounts for more real-world impact than all other fields combined. A perfect skill with a bad description will never be found via search.
- Count code blocks as your first quality signal. Skills with fewer than 8 code blocks are almost always too abstract to be useful.
- Test 3-5 commands from the skill in a clean environment. If more than one fails, the skill wasn't tested before publishing.
- "Organized by task" vs. "organized by concept" is the single biggest structural quality differentiator. Good skills answer "how do I do X?" โ bad skills explain "what is X?"
- A skill with great tips but weak examples is better than one with thorough examples but no tips. Tips encode expertise that examples alone don't convey.
- Check `requires.anyBins` against what the skill actually uses. A common defect is listing `bash` (which everything has) instead of the actual tools like `docker`, `curl`, or `jq`.
- Short skills (< 150 lines) usually aren't worth publishing: they don't provide enough value over a quick web search. If your skill is short, it might be better as a section in a larger skill.
- The best skills are ones you'd bookmark yourself. If you wouldn't use it, don't publish it.