Emergency Rescue Kit
Step-by-step recovery procedures for the worst moments in a developer's day. Every section follows the same pattern: diagnose → fix → verify. Commands are non-destructive by default. Destructive steps are flagged.
When something has gone wrong, find your situation below and follow the steps in order.
When to Use
- Someone force-pushed to main and overwrote history
- Credentials were committed to a public repository
- A rebase or reset destroyed commits you need
- Disk is full and nothing works
- A process is consuming all memory or won't die
- A database migration failed halfway through
- A deploy needs to be rolled back immediately
- SSH access is locked out
- SSL certificates expired in production
- You don't know what went wrong, but it's broken
Git Disasters
Force-pushed to main (or any shared branch)
Someone ran git push --force and overwrote remote history.
# DIAGNOSE: Check the reflog on any machine that had the old state
git reflog show origin/main
# Look for the last known-good commit hash
# FIX (if you have the old state locally):
git push origin <good-commit-hash>:main --force-with-lease
# --force-with-lease is safer than --force: it fails if remote changed again
# FIX (if you DON'T have the old state locally):
# GitHub/GitLab retain force-pushed refs temporarily
# GitHub: check the "push" event in the audit log or use the API
gh api repos/{owner}/{repo}/events --jq '.[] | select(.type=="PushEvent") | .payload.before'
# GitLab: check the reflog on the server (admin access needed)
# Or restore from any CI runner or team member's local clone
# VERIFY:
git log --oneline -10 origin/main
# Confirm the history looks correct
Lost commits after rebase or reset --hard
You ran git rebase or git reset --hard and commits disappeared.
# DIAGNOSE: Your commits are NOT gone. Git keeps everything for 30+ days.
git reflog
# Find the commit hash from BEFORE the rebase/reset
# Look for entries like "rebase (start)" or "reset: moving to"
# FIX: Reset back to the pre-disaster state
git reset --hard <commit-hash-before-disaster>
# FIX (alternative): Cherry-pick specific lost commits
git cherry-pick <lost-commit-hash>
# FIX (if reflog is empty: rare, usually means you're in a different repo):
git fsck --lost-found
# Look in .git/lost-found/commit/ for dangling commits
ls .git/lost-found/commit/
git show <hash> # Inspect each one
# VERIFY:
git log --oneline -10
# Your commits should be back
Committed to the wrong branch
You made commits on main that should be on a feature branch.
# DIAGNOSE: Check where you are and what you committed
git log --oneline -5
git branch
# FIX: Create the feature branch at current position, then reset main
git branch feature-branch # Create branch pointing at current commit
git reset --hard HEAD~<N> # Move main back N commits (⚠️ destructive)
git checkout feature-branch # Switch to the new branch
# FIX (safer alternative: keep the commits on the new branch, reset main to the remote):
git checkout -b feature-branch # Create and switch to new branch
git checkout main
git reset --hard origin/main # Reset main to remote state
# Your commits are safely on feature-branch
# VERIFY:
git log --oneline main -5
git log --oneline feature-branch -5
Merge gone wrong (conflicts everywhere, wrong result)
A merge produced a bad result and you want to start over.
# FIX (merge not yet committed โ still in conflict state):
git merge --abort
# FIX (merge was committed but not pushed):
git reset --hard HEAD~1
# FIX (merge was already pushed): Create a revert commit
git revert -m 1 <merge-commit-hash>
# -m 1 means "keep the first parent" (your branch before merge)
git push
# VERIFY:
git log --oneline --graph -10
git diff HEAD~1 # Review what changed
Corrupted git repository
Git commands fail with "bad object", "corrupt", or "broken link" errors.
# DIAGNOSE: Check repository integrity
git fsck --full
# FIX (if remote is intact โ most common):
# Save any uncommitted work first
cp -r . ../repo-backup
# Re-clone and restore local work
cd ..
git clone <remote-url> repo-fresh
cp -r repo-backup/path/to/uncommitted/files repo-fresh/
# FIX (repair without re-cloning):
# Remove corrupt objects and fetch them again
git fsck --full 2>&1 | grep "corrupt\|missing" | awk '{print $NF}'
# For each corrupt object:
rm .git/objects/<first-2-chars>/<remaining-hash>
git fetch origin # Re-download from remote
# VERIFY:
git fsck --full # Should report no errors
git log --oneline -5
Credential Leaks
Secret committed to git (API key, password, token)
A credential is in the git history. Every second counts: automated scrapers monitor public GitHub repos for leaked keys.
# STEP 1: REVOKE THE CREDENTIAL IMMEDIATELY
# Do this FIRST, before cleaning git history.
# The credential is already compromised the moment it was pushed publicly.
# AWS keys:
aws iam delete-access-key --access-key-id AKIAXXXXXXXXXXXXXXXX --user-name <user>
# Then create a new key pair
# GitHub tokens:
# Go to github.com → Settings → Developer settings → Tokens → Revoke
# Database passwords:
# Change the password in the database immediately
# ALTER USER myuser WITH PASSWORD 'new-secure-password';
# Generic API tokens:
# Revoke in the provider's dashboard, generate new ones
# STEP 2: Remove from current branch
git rm --cached <file-with-secret> # If the whole file is secret
# OR edit the file to remove the secret, then:
git add <file>
# STEP 3: Add to .gitignore
echo ".env" >> .gitignore
echo "credentials.json" >> .gitignore
git add .gitignore
# STEP 4: Remove from git history (⚠️ rewrites history)
# Option A: git-filter-repo (recommended, install with pip install git-filter-repo)
git filter-repo --path <file-with-secret> --invert-paths
# Option B: BFG Repo Cleaner (faster for large repos)
# Download from https://rtyley.github.io/bfg-repo-cleaner/
# Run BFG against a fresh mirror clone (git clone --mirror <remote-url>):
java -jar bfg.jar --delete-files <filename> <repo>.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
# STEP 5: Force push the cleaned history
git push origin --force --all
git push origin --force --tags
# STEP 6: Notify all collaborators to re-clone
# Their local copies still have the secret in reflog
# VERIFY:
git log --all -p -S '<the-secret-string>' --diff-filter=A
# Should return nothing
.env file pushed to public repo
# STEP 1: Revoke ALL credentials in that .env file. All of them. Now.
# STEP 2: Remove and ignore
git rm --cached .env
echo ".env" >> .gitignore
git add .gitignore
git commit -m "Remove .env from tracking"
# STEP 3: Remove from history (see credential removal above)
git filter-repo --path .env --invert-paths
# STEP 4: Check what was exposed
# List every variable that was in the .env:
git show HEAD~1:.env 2>/dev/null || git log --all -p -- .env | head -50
# Rotate every single value.
# PREVENTION: Add a pre-commit hook
cat > .git/hooks/pre-commit << 'HOOK'
#!/bin/bash
if git diff --cached --name-only | grep -qE '\.env$|\.env\.local$|credentials'; then
  echo "ERROR: Attempting to commit potential secrets file"
  echo "Files: $(git diff --cached --name-only | grep -E '\.env|credentials')"
  exit 1
fi
HOOK
chmod +x .git/hooks/pre-commit
Secret visible in CI/CD logs
# STEP 1: Revoke the credential immediately
# STEP 2: Delete the CI run/logs if possible
# GitHub Actions:
gh run delete <run-id>
# Or: Settings → Actions → delete specific run
# STEP 3: Fix the pipeline
# Never echo secrets. Mask them:
# GitHub Actions: echo "::add-mask::$MY_SECRET"
# GitLab CI: variables are masked if marked as "Masked" in settings
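# A minimal sketch of masking in a GitHub Actions shell step (MY_SECRET and deploy.sh are
# placeholders for your own secret and script):
echo "::add-mask::$MY_SECRET"   # from here on, the runner replaces the value with *** in logs
./deploy.sh                     # safe even if this hypothetical script echoes the secret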
# STEP 4: Audit what was exposed
# Check the log output for patterns like:
# AKIAXXXXXXXXX (AWS)
# ghp_XXXXXXXXX (GitHub)
# sk-XXXXXXXXXXX (OpenAI/Stripe)
# Any connection strings with passwords
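# A rough grep sketch over a saved copy of the log (ci-log.txt is a placeholder;
# the patterns are approximations, not an exhaustive scan):
grep -nE 'AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}|sk-[A-Za-z0-9]{20,}|[a-z]+://[^ /]+:[^ @]+@' ci-log.txt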
Disk Full Emergencies
System or container disk is full
Nothing works: builds fail, logs can't write, services crash.
# DIAGNOSE: What's using space?
df -h # Which filesystem is full?
du -sh /* 2>/dev/null | sort -rh | head -20 # Biggest top-level dirs
du -sh /var/log/* | sort -rh | head -10 # Log bloat?
# QUICK WINS (safe to run immediately):
# 1. Docker cleanup (often the #1 cause)
docker system df # See Docker disk usage
docker system prune -a -f # Remove all unused images, containers, networks
docker volume prune -f # Remove unused volumes
docker builder prune -a -f # Remove build cache
# ⚠️ This removes ALL unused Docker data. Safe if you can re-pull/rebuild.
# 2. Package manager caches
# npm
npm cache clean --force
rm -rf ~/.npm/_cacache
# pip
pip cache purge
# apt
sudo apt-get clean
sudo apt-get autoremove -y
# brew
brew cleanup --prune=all
# 3. Log rotation (immediate)
# Truncate (not delete) large log files to free space instantly
sudo truncate -s 0 /var/log/syslog
sudo journalctl --vacuum-size=100M # shrink systemd journals safely (don't truncate .journal files directly)
find /var/log -name "*.log" -size +100M -exec truncate -s 0 {} \;
# Truncate preserves the file handle so services don't break
# 4. Old build artifacts
find . -name "node_modules" -type d -prune -exec rm -rf {} + 2>/dev/null
find . -name ".next" -type d -exec rm -rf {} + 2>/dev/null
find . -name "dist" -type d -exec rm -rf {} + 2>/dev/null
find /tmp -type f -mtime +7 -delete 2>/dev/null
# 5. Find the actual culprit
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head -20
# Shows files over 100MB, sorted by size
# VERIFY:
df -h # Check free space increased
Docker-specific disk full
# DIAGNOSE:
docker system df -v
# Common culprits:
# 1. Dangling images from builds
docker image prune -f
# 2. Stopped containers accumulating
docker container prune -f
# 3. Build cache (often the biggest)
docker builder prune -a -f
# 4. Volumes from old containers
docker volume ls -qf dangling=true
docker volume prune -f
# NUCLEAR OPTION (⚠️ removes EVERYTHING):
docker system prune -a --volumes -f
# You will need to re-pull all images and recreate all volumes
# VERIFY:
docker system df
df -h
Process Emergencies
Port already in use
# DIAGNOSE: What's using the port?
# Linux:
lsof -i :8080
ss -tlnp | grep 8080
# macOS:
lsof -i :8080
# Windows:
netstat -ano | findstr :8080
# FIX: Kill the process
kill $(lsof -t -i :8080) # Graceful
kill -9 $(lsof -t -i :8080) # Force (if graceful didn't work)
# FIX (Windows):
# Find PID from netstat output, then:
taskkill /PID <pid> /F
# FIX (if it's a leftover Docker container):
docker ps | grep 8080
docker stop <container-id>
# VERIFY:
lsof -i :8080 # Should return nothing
Process won't die
# DIAGNOSE:
ps aux | grep <process-name>
# Note the PID
# ESCALATION LADDER:
kill <pid> # SIGTERM (graceful shutdown)
sleep 5
kill -9 <pid> # SIGKILL (cannot be caught, immediate death)
# If SIGKILL doesn't work, it's a zombie or kernel-stuck process:
# Check if zombie:
ps aux | grep <pid>
# State "Z" = zombie. The parent must reap it:
kill -SIGCHLD $(ps -o ppid= -p <pid>)
# Or kill the parent process
# If truly stuck in kernel (state "D"):
# Only a reboot will fix it. The process is stuck in an I/O syscall.
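# Quick check of the state and what the process is waiting on (<pid> from above):
ps -o pid,stat,wchan:32,cmd -p <pid>   # STAT "D" = uninterruptible sleep; WCHAN names the kernel wait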
# MASS CLEANUP: Kill all processes matching a name
pkill -f <pattern> # Graceful
pkill -9 -f <pattern> # Force
Out of memory (OOM killed)
# DIAGNOSE: Was your process OOM-killed?
dmesg | grep -i "oom\|killed process" | tail -20
journalctl -k | grep -i "oom\|killed" | tail -20
# Check what's using memory right now:
ps aux --sort=-%mem | head -20 # Top memory consumers
free -h # System memory overview
# FIX: Free memory immediately
# 1. Kill the biggest consumer (if safe to do so)
kill $(ps aux --sort=-%mem | awk 'NR==2{print $2}')
# 2. Drop filesystem caches (safe, no data loss)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
# 3. Disable swap thrashing (if swap is full)
sudo swapoff -a && sudo swapon -a
# PREVENT: Set memory limits
# Docker:
docker run --memory=512m --memory-swap=1g myapp
# Systemd service:
# Add to [Service] section:
# MemoryMax=512M
# MemoryHigh=400M
# Node.js:
node --max-old-space-size=512 app.js
# VERIFY:
free -h
ps aux --sort=-%mem | head -5
Database Emergencies
Failed migration (partially applied)
# DIAGNOSE: What state is the database in?
# Check which migrations have run:
# Rails:
rails db:migrate:status
# Django:
python manage.py showmigrations
# Knex/Node:
npx knex migrate:status
# Prisma:
npx prisma migrate status
# Raw SQL โ check migration table:
# PostgreSQL/MySQL:
SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 10;
# Or: SELECT * FROM _migrations ORDER BY id DESC LIMIT 10;
# FIX: Roll back the failed migration
# Most frameworks track migration state. Roll back to last good state:
# Rails:
rails db:rollback STEP=1
# Django:
python manage.py migrate <app_name> <previous_migration_number>
# Knex:
npx knex migrate:rollback
# FIX (manual): If the framework is confused about state:
# 1. Check what the migration actually did
# 2. Manually undo partial changes
# 3. Delete the migration record from the migrations table
# 4. Fix the migration code
# 5. Re-run
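# For step 3, a hedged psql sketch assuming a Rails-style schema_migrations table
# (table and column names vary by framework; <failed-version> is a placeholder):
psql -d mydb -c "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 5;"
psql -d mydb -c "DELETE FROM schema_migrations WHERE version = '<failed-version>';"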
# VERIFY:
# Run the migration again and confirm it applies cleanly
# Check the affected tables/columns exist correctly
Accidentally dropped a table or database
# PostgreSQL:
# If you have a recent dump, restore just the dropped table:
pg_restore -d mydb -t dropped_table /backups/latest.dump
# With WAL archiving / point-in-time recovery configured, you can instead recover
# the whole cluster to a timestamp just before the DROP
# If no backup exists, check if the transaction is still open:
# (Only works if you haven't committed yet)
# Just run ROLLBACK; in your SQL session.
# MySQL:
# If binary logging is enabled:
mysqlbinlog /var/log/mysql/mysql-bin.000001 \
--start-datetime="2026-02-03 10:00:00" \
--stop-datetime="2026-02-03 10:30:00" > recovery.sql
# Review recovery.sql, then apply
# SQLite:
# DROP TABLE removes the data inside the database file; there is no server-side undo.
# Restore the file from a backup:
cp /backups/db.sqlite3 ./db.sqlite3
# PREVENTION: Always run destructive SQL in a transaction
BEGIN;
DROP TABLE users; -- oops
ROLLBACK; -- saved
Database locked / deadlocked
# PostgreSQL:
-- Find blocking queries
SELECT pid, usename, state, query, wait_event_type, query_start
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Find locks
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;
-- Kill blocking query
SELECT pg_terminate_backend(<blocking_pid>);
# MySQL:
SHOW PROCESSLIST;
SHOW ENGINE INNODB STATUS\G -- Look for "LATEST DETECTED DEADLOCK"
KILL <process_id>;
# SQLite:
# SQLite uses file-level locking. Common fix:
# 1. Find and close all connections (see the lookup commands below)
# 2. Check for .db-journal or .db-wal files (an unfinished transaction)
# 3. If still stuck, with all writers stopped: copy the file and swap it in
cp database.db database-fixed.db && mv database-fixed.db database.db
# The fresh copy drops the stale lock left behind by a crashed process
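# To see who still has the database file open (assumes the file is named database.db as above;
# needs lsof and/or psmisc installed):
fuser -v database.db database.db-wal database.db-journal 2>/dev/null
lsof database.db 2>/dev/null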
# VERIFY:
# Run a simple query to confirm database is responsive
SELECT 1;
Connection pool exhausted
# DIAGNOSE:
# Error messages like: "too many connections", "connection pool exhausted",
# "FATAL: remaining connection slots are reserved for superuser"
# PostgreSQL โ check connection count:
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
SELECT max_conn, used, max_conn - used AS available
FROM (SELECT count(*) AS used FROM pg_stat_activity) t,
(SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') m;
# FIX: Kill idle connections
-- Terminate idle connections older than 5 minutes
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '5 minutes';
# FIX: Increase max connections (requires restart)
# postgresql.conf:
# max_connections = 200 (default is 100)
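# A sketch of making that change without hand-editing postgresql.conf (assumes superuser
# psql access and a systemd-managed PostgreSQL; the service name varies by distro):
sudo -u postgres psql -c "ALTER SYSTEM SET max_connections = 200;"
sudo systemctl restart postgresql   # the new value only takes effect after a restart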
# BETTER FIX: Use a connection pooler
# PgBouncer or pgcat in front of PostgreSQL
# Application-level: set pool size to match your needs
# Node.js (pg): { max: 20 }
# Python (SQLAlchemy): pool_size=20, max_overflow=10
# Go (database/sql): db.SetMaxOpenConns(20)
# VERIFY:
SELECT count(*) FROM pg_stat_activity;
# Should be well below max_connections
Deploy Emergencies
Quick rollback
# Git-based deploys:
git log --oneline -5 origin/main
git revert HEAD # Create a revert commit
git push origin main # Deploy the revert
# Revert is safer than reset because it preserves history
# Docker/container deploys:
# Roll back to previous image tag
docker pull myapp:previous-tag
docker stop myapp-current
docker run -d --name myapp myapp:previous-tag
# Kubernetes:
kubectl rollout undo deployment/myapp
kubectl rollout status deployment/myapp # Watch rollback progress
# Heroku:
heroku releases
heroku rollback v<previous-version>
# AWS ECS:
aws ecs update-service --cluster mycluster --service myservice \
--task-definition myapp:<previous-revision>
# VERIFY:
# Hit the health check endpoint
curl -s -o /dev/null -w "%{http_code}" https://myapp.example.com/health
# Should return 200
Container won't start
# DIAGNOSE: Why did it fail?
docker logs <container-id> --tail 100
docker inspect <container-id> | grep -A5 "State"
# Common causes and fixes:
# 1. "exec format error" โ wrong platform (built for arm64, running on amd64)
docker build --platform linux/amd64 -t myapp .
# 2. "permission denied" โ file not executable or wrong user
# In Dockerfile:
RUN chmod +x /app/entrypoint.sh
# Or: USER root before the command, then drop back
# 3. "port already allocated" โ another container or process on that port
docker ps -a | grep <port>
docker stop <conflicting-container>
# 4. "no such file or directory" โ entrypoint or CMD path is wrong
docker run -it --entrypoint sh myapp # Get a shell to debug
ls -la /app/ # Check what's actually there
# 5. Healthcheck failing → container keeps restarting
docker inspect <container-id> --format='{{json .State.Health}}'
# Temporarily disable healthcheck to get logs:
docker run --no-healthcheck myapp
# 6. Out of memory → container OOM killed
docker inspect <container-id> --format='{{.State.OOMKilled}}'
# If true: docker run --memory=1g myapp
# VERIFY:
docker ps # Container should show "Up" status
docker logs <container-id> --tail 5 # No errors
SSL certificate expired
# DIAGNOSE: Check certificate expiry
echo | openssl s_client -connect mysite.com:443 -servername mysite.com 2>/dev/null | \
openssl x509 -noout -dates
# notAfter shows expiry date
# FIX (Let's Encrypt โ most common):
sudo certbot renew --force-renewal
sudo systemctl reload nginx # or: sudo systemctl reload apache2
# FIX (manual certificate):
# 1. Get new certificate from your CA
# 2. Replace files:
sudo cp new-cert.pem /etc/ssl/certs/mysite.pem
sudo cp new-key.pem /etc/ssl/private/mysite.key
# 3. Reload web server
sudo nginx -t && sudo systemctl reload nginx
# FIX (AWS ACM):
# ACM auto-renews if DNS validation is configured.
# If email validation: check the admin email for renewal link
# If stuck: request a new certificate in ACM and update the load balancer
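# A quick check of what ACM reports, assuming the AWS CLI is configured (<arn> is a placeholder):
aws acm list-certificates --query 'CertificateSummaryList[].[DomainName,CertificateArn]' --output table
aws acm describe-certificate --certificate-arn <arn> --query 'Certificate.[Status,NotAfter]'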
# PREVENTION: Auto-renewal with monitoring
# Weekly cron job to check expiry and alert (⚠️ piping to crontab - replaces the existing crontab; use crontab -e to append instead):
echo '0 9 * * 1 echo | openssl s_client -connect mysite.com:443 2>/dev/null | openssl x509 -checkend 604800 -noout || echo "CERT EXPIRES WITHIN 7 DAYS" | mail -s "SSL ALERT" admin@example.com' | crontab -
# VERIFY:
curl -sI https://mysite.com | head -5
# Should return HTTP/2 200, not certificate errors
Access Emergencies
SSH locked out
# DIAGNOSE: Why can't you connect?
ssh -vvv user@host # Verbose output shows where it fails
# Common causes:
# 1. Key not accepted → wrong key, permissions, or authorized_keys issue
ssh -i ~/.ssh/specific_key user@host # Try explicit key
chmod 600 ~/.ssh/id_rsa # Fix key permissions
chmod 700 ~/.ssh # Fix .ssh dir permissions
# 2. "Connection refused" โ sshd not running or firewall blocking
# If you have console access (cloud provider's web console):
sudo systemctl start sshd
sudo systemctl status sshd
# 3. Firewall blocking port 22
# Cloud console:
sudo ufw allow 22/tcp # Ubuntu
sudo firewall-cmd --add-service=ssh --permanent && sudo firewall-cmd --reload # CentOS
# 4. Changed SSH port and forgot
# Try common alternate ports:
ssh -p 2222 user@host
ssh -p 22222 user@host
# Or check from console: grep -i port /etc/ssh/sshd_config
# 5. IP changed / DNS stale
ping hostname # Verify IP resolution
ssh user@<direct-ip> # Try IP instead of hostname
# 6. Locked out after too many attempts (fail2ban)
# From console:
sudo fail2ban-client set sshd unbanip <your-ip>
# Or wait for the ban to expire (usually 10 min)
# CLOUD PROVIDER ESCAPE HATCHES:
# AWS: EC2 → Instance → Connect → Session Manager (no SSH needed)
# GCP: Compute → VM instances → SSH (browser-based)
# Azure: VM → Serial console
# DigitalOcean: Droplet → Access → Console
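# Example for AWS, assuming the instance runs the SSM agent with a Session Manager IAM role
# and the session-manager-plugin is installed locally (<instance-id> is a placeholder):
aws ssm start-session --target <instance-id>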
# VERIFY:
ssh user@host echo "connection works"
Lost sudo access
# If you have physical/console access:
# 1. Boot into single-user/recovery mode
# - Reboot, hold Shift (GRUB), select "recovery mode"
# - Or add init=/bin/bash to kernel command line
# 2. Remount filesystem read-write
mount -o remount,rw /
# 3. Fix sudo access
usermod -aG sudo <username> # Debian/Ubuntu
usermod -aG wheel <username> # CentOS/RHEL
# Or edit directly:
visudo
# Add: username ALL=(ALL:ALL) ALL
# 4. Reboot normally
reboot
# If you have another sudo/root user:
su - other-admin
sudo usermod -aG sudo <locked-user>
# CLOUD: Use the provider's console or reset the instance
# AWS: Create an AMI, launch new instance, mount old root volume, fix
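# A hedged sketch of that AWS flow (all IDs and the device name are placeholders; adjust to your setup):
aws ec2 stop-instances --instance-ids <locked-out-instance-id>
aws ec2 detach-volume --volume-id <root-volume-id>
aws ec2 attach-volume --volume-id <root-volume-id> --instance-id <rescue-instance-id> --device /dev/sdf
# Then, on the rescue instance:
lsblk                      # find the attached device (often /dev/xvdf1 or /dev/nvme1n1p1)
sudo mount /dev/xvdf1 /mnt
sudo vi /mnt/etc/group     # add your user back to the sudo/wheel group line
sudo umount /mnt
# Detach, re-attach as the original instance's root device, and start it again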
Network Emergencies
Nothing connects (total network failure)
# DIAGNOSE: Isolate the layer
# 1. Is the network interface up?
ip addr show # or: ifconfig
ping 127.0.0.1 # Loopback works?
# 2. Can you reach the gateway?
ip route | grep default
ping <gateway-ip>
# 3. Can you reach the internet by IP?
ping 8.8.8.8 # Google DNS
ping 1.1.1.1 # Cloudflare DNS
# 4. Is DNS working?
nslookup google.com
dig google.com
# DECISION TREE:
# ping 127.0.0.1 fails → network stack broken, restart networking
# ping gateway fails → local network issue (cable, wifi, DHCP)
# ping 8.8.8.8 fails → routing/firewall issue
# ping 8.8.8.8 works but nslookup fails → DNS issue
# FIX: DNS broken
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Or: sudo systemd-resolve --flush-caches
# FIX: Interface down
sudo ip link set eth0 up
sudo dhclient eth0 # Request new DHCP lease
# FIX: Restart networking entirely
sudo systemctl restart NetworkManager # Desktop Linux
sudo systemctl restart networking # Server
sudo systemctl restart systemd-networkd # Systemd-based
# Docker: Container can't reach the internet
docker run --rm alpine ping 8.8.8.8 # Test from container
# If fails:
sudo systemctl restart docker # Often fixes Docker networking
# Or: docker network prune
DNS not propagating after change
# DIAGNOSE: Check what different DNS servers see
dig @8.8.8.8 mysite.com # Google
dig @1.1.1.1 mysite.com # Cloudflare
dig @ns1.yourdns.com mysite.com # Authoritative nameserver
# Check the TTL (time remaining before caches expire; second field of the answer line):
dig +noall +answer mysite.com
# REALITY CHECK:
# DNS propagation takes time. TTL controls this.
# TTL 300 = 5 minutes. TTL 86400 = 24 hours.
# You cannot speed this up. You can only wait.
# FIX: If authoritative nameserver has wrong records
# Update the record at your DNS provider (Cloudflare, Route53, etc.)
# Then flush your local cache:
# macOS:
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder
# Linux:
sudo systemd-resolve --flush-caches
# Windows:
ipconfig /flushdns
# WORKAROUND: While waiting for propagation
# Add to /etc/hosts for immediate local effect:
echo "93.184.216.34 mysite.com" | sudo tee -a /etc/hosts
# Remove this after propagation completes!
# VERIFY:
dig +short mysite.com # Should show new IP/record
File Emergencies
Accidentally deleted files (not in git)
# DIAGNOSE: Are the files recoverable?
# If the process still has the file open:
lsof | grep deleted
# Then recover from /proc:
cp /proc/<pid>/fd/<fd-number> /path/to/restored-file
# If recently deleted on ext4 (Linux):
# Install extundelete or testdisk
sudo extundelete /dev/sda1 --restore-file path/to/file
# Or use testdisk interactively for a better UI
# macOS:
# Check Trash first: ~/.Trash/
# Time Machine: tmutil restore /path/to/file
# PREVENTION:
# Use trash-cli instead of rm:
# npm install -g trash-cli
# trash file.txt (moves to trash instead of permanent delete)
# Or alias: alias rm='echo "Use trash instead"; false'
Wrong permissions applied recursively
# "I ran chmod -R 777 /" or "chmod -R 000 /important/dir"
# FIX: Common default permissions
# For a web project:
find /path -type d -exec chmod 755 {} \; # Directories: rwxr-xr-x
find /path -type f -exec chmod 644 {} \; # Files: rw-r--r--
find /path -name "*.sh" -exec chmod 755 {} \; # Scripts: executable
# For SSH:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa
chmod 644 ~/.ssh/id_rsa.pub
chmod 600 ~/.ssh/authorized_keys
chmod 644 ~/.ssh/config
# For a system directory (⚠️ serious: may need a rescue boot):
# If /etc permissions are broken:
# Boot from live USB, mount the drive, fix permissions
# Reference: dpkg --verify (Debian) or rpm -Va (RHEL) to compare against package defaults
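# On RHEL-family systems, rpm can also reset packaged files to their defaults
# (dpkg only reports differences; it has no equivalent reset):
sudo rpm -Va | head -20     # list mode/owner mismatches against package metadata
sudo rpm --setperms -a      # restore default permissions for all packaged files
sudo rpm --setugids -a      # restore default owners and groups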
# VERIFY:
ls -la /path/to/fixed/directory
The Universal Diagnostic
When you don't know what's wrong, run this sequence:
#!/bin/bash
# emergency-diagnostic.sh - Quick system health check
echo "=== DISK ==="
df -h | grep -E '^/|Filesystem'
echo -e "\n=== MEMORY ==="
free -h
echo -e "\n=== CPU / LOAD ==="
uptime
echo -e "\n=== TOP PROCESSES (by CPU) ==="
ps aux --sort=-%cpu | head -6
echo -e "\n=== TOP PROCESSES (by MEM) ==="
ps aux --sort=-%mem | head -6
echo -e "\n=== NETWORK ==="
ping -c 1 -W 2 8.8.8.8 > /dev/null 2>&1 && echo "Internet: OK" || echo "Internet: UNREACHABLE"
ping -c 1 -W 2 $(ip route | awk '/default/{print $3}') > /dev/null 2>&1 && echo "Gateway: OK" || echo "Gateway: UNREACHABLE"
echo -e "\n=== RECENT ERRORS ==="
(journalctl -p err --since "1 hour ago" --no-pager 2>/dev/null || dmesg) | tail -20
echo -e "\n=== DOCKER (if running) ==="
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}" 2>/dev/null || echo "Docker not running"
docker system df 2>/dev/null || true
echo -e "\n=== LISTENING PORTS ==="
ss -tlnp 2>/dev/null | head -15 || netstat -tlnp 2>/dev/null | head -15
echo -e "\n=== FAILED SERVICES ==="
systemctl --failed 2>/dev/null || true
Run it, read the output, then jump to the relevant section above.
Tips
- Revoke credentials before cleaning git history. The moment a secret is pushed publicly, automated scrapers have it within minutes. Cleaning the history is important but secondary to revocation.
- git reflog is your undo button. It records every HEAD movement for 30+ days. Lost commits, bad rebases, accidental resets: the reflog has the recovery hash. Learn to read it before you need it.
- Truncate log files, don't delete them. truncate -s 0 file.log frees disk space instantly while keeping the file handle open. Deleting a log file that a process has open won't free space until the process restarts.
- Use --force-with-lease instead of --force. Always. It fails if someone else has pushed, preventing you from overwriting their work on top of your recovery.
- Every recovery operation should end with verification. Run the diagnostic command, check the output, confirm the fix worked. Don't assume; confirm.
- Docker is the #1 disk space thief on developer machines. docker system prune -a is almost always safe on development machines and can recover tens of gigabytes.
- Database emergencies: wrap destructive operations in transactions. BEGIN; DROP TABLE users; ROLLBACK; costs nothing and saves everything. Make it muscle memory.
- When SSH is locked out, every cloud provider has a console escape hatch. AWS Session Manager, GCP browser SSH, Azure Serial Console. Know where yours is before you need it.
- The order matters: diagnose → fix → verify. Skipping diagnosis leads to wrong fixes. Skipping verification leads to false confidence. Follow the sequence every time.
- Keep this skill installed. You won't need it most days. The day you do need it, you'll need it immediately.