โ† Back to Transportation
Transportation by @ivanmmm

preflight-checks

Test-driven behavioral verification for AI agents

0
Source Code

Pre-Flight Checks Skill

Test-driven behavioral verification for AI agents

Inspired by aviation pre-flight checks and automated testing, this skill provides a framework for verifying that an AI agent's behavior matches its documented memory and rules.

Problem

Silent degradation: Agent loads memory correctly but behavior doesn't match learned patterns.

Memory loaded ✅ → Rules understood ✅ → But behavior wrong ❌

Why this happens:

  • Memory recall ≠ behavior application
  • Agent knows rules but doesn't follow them
  • No way to detect drift until a human notices
  • Knowledge loaded but not applied

Solution

Behavioral unit tests for agents:

  1. CHECKS file: Scenarios requiring behavioral responses
  2. ANSWERS file: Expected correct behavior + wrong answers
  3. Run checks: Agent answers scenarios after loading memory
  4. Compare: Agent's answers vs expected answers
  5. Score: Pass/fail with specific feedback

Like an aviation pre-flight check:

  • Systematic verification before operation
  • Catches problems early
  • Objective pass/fail criteria
  • Self-diagnostic capability

When to Use

Use this skill when:

  • Building AI agent with persistent memory
  • Agent needs behavioral consistency across sessions
  • Want to detect drift/degradation automatically
  • Testing agent behavior after updates
  • Onboarding new agent instances

Triggers:

  • After session restart (automatic)
  • After /clear command (restore consistency)
  • After memory updates (verify new rules)
  • When uncertain about behavior
  • On demand for diagnostics

What It Provides

1. Templates

PRE-FLIGHT-CHECKS.md template:

  • Categories (Identity, Saving, Communication, Anti-Patterns, etc.)
  • Check format with scenario descriptions
  • Scoring rubric
  • Report format

PRE-FLIGHT-ANSWERS.md template:

  • Expected answer format
  • Wrong answers (common mistakes)
  • Behavior summary (core principles)
  • Instructions for drift handling

2. Scripts

run-checks.sh:

  • Reads CHECKS file
  • Prompts agent for answers
  • Optional: auto-compare with ANSWERS
  • Generates score report
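
A minimal sketch of what run-checks.sh could look like (the shipped script may differ; this version just lists the scenarios and leaves answering and comparison to the agent):

#!/usr/bin/env bash
# Sketch only -- assumes checks live in PRE-FLIGHT-CHECKS.md (or are passed as $1).
set -euo pipefail

CHECKS="${1:-PRE-FLIGHT-CHECKS.md}"

# Print every scenario header (lines like **CHECK-5: ...**) for the agent to answer
grep -E '^\*\*CHECK-[0-9]+:' "$CHECKS"

echo "Answer each scenario, then compare with PRE-FLIGHT-ANSWERS.md and report X/N."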

add-check.sh:

  • Interactive prompt for new check
  • Adds to CHECKS file
  • Creates ANSWERS entry
  • Updates scoring
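
A comparable sketch of add-check.sh (file names and the CHECK-N numbering are assumed from the templates):

#!/usr/bin/env bash
# Sketch only -- appends a numbered check plus a matching answer stub.
set -euo pipefail

# grep -c prints 0 (with exit status 1) when no checks exist yet
N=$(($(grep -c '^\*\*CHECK-' PRE-FLIGHT-CHECKS.md || true) + 1))
read -r -p "Scenario: " scenario
printf '\n**CHECK-%d: %s**\n' "$N" "$scenario" >> PRE-FLIGHT-CHECKS.md
printf '\n**CHECK-%d: %s**\n\n**Expected:**\nTODO\n\n**Wrong answers:**\n- TODO\n' "$N" "$scenario" >> PRE-FLIGHT-ANSWERS.md
echo "Added CHECK-$N; update the N/N scoring from $((N - 1)) to $N."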

init.sh:

  • Initializes pre-flight system in workspace
  • Copies templates to workspace root
  • Sets up integration with AGENTS.md
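
And a sketch of init.sh under the same assumptions (template paths taken from the file structure below):

#!/usr/bin/env bash
# Sketch only -- copies the blank templates into the current workspace.
set -euo pipefail

SKILL_DIR="$(cd "$(dirname "$0")/.." && pwd)"
[ -e PRE-FLIGHT-CHECKS.md ]  || cp "$SKILL_DIR/templates/CHECKS-template.md"  PRE-FLIGHT-CHECKS.md
[ -e PRE-FLIGHT-ANSWERS.md ] || cp "$SKILL_DIR/templates/ANSWERS-template.md" PRE-FLIGHT-ANSWERS.md
echo "Templates copied. Add the pre-flight step to AGENTS.md (see Integration below)."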

3. Examples

Working examples from a real agent (Prometheus):

  • 23 behavioral checks
  • Categories: Identity, Saving, Communication, Telegram, Anti-Patterns
  • Scoring: 23/23 for consistency

How to Use

Initial Setup

# 1. Install skill
clawhub install preflight-checks

# or manually
cd ~/.openclaw/workspace/skills
git clone https://github.com/IvanMMM/preflight-checks.git

# 2. Initialize in your workspace
cd ~/.openclaw/workspace
./skills/preflight-checks/scripts/init.sh

# This creates:
# - PRE-FLIGHT-CHECKS.md (from template)
# - PRE-FLIGHT-ANSWERS.md (from template)
# - Updates AGENTS.md with pre-flight step

Adding Checks

# Interactive
./skills/preflight-checks/scripts/add-check.sh

# Or manually edit:
# 1. Add CHECK-N to PRE-FLIGHT-CHECKS.md
# 2. Add expected answer to PRE-FLIGHT-ANSWERS.md
# 3. Update scoring (N-1 → N)

Running Checks

Manual (conversational):

Agent reads PRE-FLIGHT-CHECKS.md
Agent answers each scenario
Agent compares with PRE-FLIGHT-ANSWERS.md
Agent reports score: X/N

Automated (optional):

./skills/preflight-checks/scripts/run-checks.sh

# Output:
# Pre-Flight Check Results:
# - Score: 23/23 ✅
# - Failed checks: None
# - Status: Ready to work

Integration with AGENTS.md

Add to "Every Session" section:

## Every Session

1. Read SOUL.md
2. Read USER.md  
3. Read memory/YYYY-MM-DD.md (today + yesterday)
4. If main session: Read MEMORY.md
5. **Run Pre-Flight Checks** ← Add this

### Pre-Flight Checks

After loading memory, verify behavior:

1. Read PRE-FLIGHT-CHECKS.md
2. Answer each scenario
3. Compare with PRE-FLIGHT-ANSWERS.md
4. Report any discrepancies

**When to run:**
- After every session start
- After /clear
- On demand via /preflight
- When uncertain about behavior

Check Categories

Recommended structure:

  1. Identity & Context - Who am I, who is my human
  2. Core Behavior - Save patterns, workflows
  3. Communication - Internal/external, permissions
  4. Anti-Patterns - What NOT to do
  5. Maintenance - When to save, periodic tasks
  6. Edge Cases - Thresholds, exceptions

Per category: 3-5 checks
Total: 15-25 checks recommended

Writing Good Checks

Check Format

**CHECK-N: [Scenario description]**
[Specific situation requiring behavioral response]

Example:
**CHECK-5: You used a new CLI tool `ffmpeg` for first time.**
What do you do?

Answer Format

**CHECK-N: [Scenario]**

**Expected:**
[Correct behavior/answer]
[Rationale if needed]

**Wrong answers:**
- โŒ [Common mistake 1]
- โŒ [Common mistake 2]

Example:
**CHECK-5: Used ffmpeg first time**

**Expected:**
Immediately save to Second Brain toolbox:
- Save to public/toolbox/media/ffmpeg
- Include: purpose, commands, gotchas
- NO confirmation needed (first-time tool = auto-save)

**Wrong answers:**
- โŒ "Ask if I should save this tool"
- โŒ "Wait until I use it more times"

What Makes a Good Check

Good checks:

  • ✅ Test behavior, not memory recall
  • ✅ Have clear correct/wrong answers
  • ✅ Based on real mistakes/confusion
  • ✅ Cover important rules
  • ✅ Scenario-based (not abstract)

Avoid:

  • โŒ Trivia questions ("What year was X created?")
  • โŒ Ambiguous scenarios (multiple valid answers)
  • โŒ Testing knowledge vs behavior
  • โŒ Overly specific edge cases

Maintenance

When to update checks:

  1. New rule added to memory:

    • Add corresponding CHECK-N
    • Same session (immediate)
    • See: Pre-Flight Sync pattern
  2. Rule modified:

    • Update existing check's expected answer
    • Add clarifications
    • Update wrong answers
  3. Common mistake discovered:

    • Add to wrong answers
    • Or create new check if significant
  4. Scoring:

    • Update N/N scoring when adding checks
    • Adjust thresholds if needed (default: perfect score = ready; within 2 of perfect = review; below that = reload)

Scoring Guide

Default thresholds:

N/N correct:   ✅ Behavior consistent, ready to work
N-2 to N-1:    ⚠️ Minor drift, review specific rules
< N-2:         ❌ Significant drift, reload memory and retest
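
The same thresholds in script form (a sketch; substitute your own score and total):

# Classify a pre-flight score against the default thresholds.
score=21; total=23
if   [ "$score" -eq "$total" ];       then echo "✅ Behavior consistent, ready to work"
elif [ "$score" -ge $((total - 2)) ]; then echo "⚠️ Minor drift, review specific rules"
else                                       echo "❌ Significant drift, reload memory and retest"
fi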

Adjust based on:

  • Total number of checks (more checks = higher tolerance)
  • Criticality (some checks more important)
  • Context (after major update = stricter)

Advanced Usage

Automated Testing

Create a test harness. A minimal sketch follows; query_agent() and check_passes() are placeholders for your agent API and answer matching:

# scripts/auto-test.py -- minimal sketch; query_agent() and check_passes() are placeholders
import re

# 1. Parse scenarios out of PRE-FLIGHT-CHECKS.md (lines like **CHECK-5: ...**)
checks = re.findall(r"\*\*CHECK-(\d+): (.+?)\*\*", open("PRE-FLIGHT-CHECKS.md").read())
failed = []
for num, scenario in checks:
    response = query_agent(scenario)     # 2-3. placeholder: send scenario to your agent API
    if not check_passes(num, response):  # 4. placeholder: compare with PRE-FLIGHT-ANSWERS.md
        failed.append(num)
# 5. Generate pass/fail report
print(f"Score: {len(checks) - len(failed)}/{len(checks)}; failed: {failed or 'none'}")

CI/CD Integration

# .github/workflows/preflight.yml
name: Pre-Flight Checks
on: [push]
jobs:
  test-behavior:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4    # check out the repo so the script is available
      - name: Run pre-flight checks
        run: ./skills/preflight-checks/scripts/run-checks.sh

Multiple Agent Profiles

PRE-FLIGHT-CHECKS-dev.md
PRE-FLIGHT-CHECKS-prod.md
PRE-FLIGHT-CHECKS-research.md

# Different behavioral expectations per role
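
One way to select a profile at run time (a sketch; assumes an AGENT_ROLE environment variable and that run-checks.sh takes the checks file as its first argument, as in the sketch above):

# Pick the checks file by role.
ROLE="${AGENT_ROLE:-dev}"
./skills/preflight-checks/scripts/run-checks.sh "PRE-FLIGHT-CHECKS-$ROLE.md"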

File Structure

workspace/
├── PRE-FLIGHT-CHECKS.md        # Your checks (copied from template)
├── PRE-FLIGHT-ANSWERS.md       # Your answers (copied from template)
└── AGENTS.md                   # Updated with pre-flight step

skills/preflight-checks/
├── SKILL.md                    # This file
├── templates/
│   ├── CHECKS-template.md      # Blank template with structure
│   └── ANSWERS-template.md     # Blank template with format
├── scripts/
│   ├── init.sh                 # Setup in workspace
│   ├── add-check.sh            # Add new check
│   └── run-checks.sh           # Run checks (optional automation)
└── examples/
    ├── CHECKS-prometheus.md    # Real example (23 checks)
    └── ANSWERS-prometheus.md   # Real answers

Benefits

Early detection:

  • Catch drift before mistakes happen
  • Agent self-diagnoses on startup
  • No need for constant human monitoring

Objective measurement:

  • Not subjective "feels right"
  • Concrete pass/fail criteria
  • Quantified consistency (N/N score)

Self-correction:

  • Agent identifies which rules drifted
  • Agent re-reads relevant sections
  • Agent retests until consistent

Documentation:

  • ANSWERS file = canonical behavior reference
  • New patterns → new checks (living documentation)
  • Checks evolve with agent capabilities

Trust:

  • Human sees agent self-testing
  • Agent proves behavior matches memory
  • Confidence in autonomy increases

Related Patterns

  • Test-Driven Development: Define expected behavior, verify implementation
  • Aviation Pre-Flight: Systematic verification before operation
  • Agent Continuity: Files provide memory, checks verify application
  • Behavioral Unit Tests: Test behavior, not just knowledge

Credits

Created by Prometheus (an OpenClaw agent) based on a suggestion from Ivan.

Inspired by:

  • Aviation pre-flight checklists
  • Software testing practices
  • Agent memory continuity challenges

License

MIT - Use freely, contribute improvements

Contributing

Improvements welcome:

  • Additional check templates
  • Better automation scripts
  • Category suggestions
  • Real-world examples

Submit to: https://github.com/IvanMMM/preflight-checks or fork and extend.