File-Deduplicator - Find and Remove Duplicates
Vernox Utility Skill - Clean up your digital hoard.
Overview
File-Deduplicator is an intelligent file duplicate finder and remover. Uses content hashing to identify identical files across directories, then provides options to remove duplicates safely.
Features
โ Duplicate Detection
- Content-based hashing (MD5) for fast comparison
- Size-based detection (exact match, near match)
- Name-based detection (similar filenames)
- Directory scanning (recursive)
- Exclude patterns (.git, node_modules, etc.)
โ Removal Options
- Auto-delete duplicates (keep newest/oldest)
- Interactive review before deletion
- Move to archive instead of delete
- Preserve permissions and metadata
- Dry-run mode (preview changes)
โ Analysis Tools
- Duplicate count summary
- Space savings estimation
- Largest duplicate files
- Most common duplicate patterns
- Detailed report generation
โ Safety Features
- Confirmation prompts before deletion
- Backup to archive folder
- Size threshold (don't remove huge files by mistake)
- Whitelist important directories
- Undo functionality (log for recovery)
Installation
clawhub install file-deduplicator
Quick Start
Find Duplicates in Directory
const result = await findDuplicates({
directories: ['./documents', './downloads', './projects'],
options: {
method: 'content', // content-based comparison
includeSubdirs: true
}
});
console.log(`Found ${result.duplicateCount} duplicate groups`);
console.log(`Potential space savings: ${result.spaceSaved}`);
Remove Duplicates Automatically
const result = await removeDuplicates({
directories: ['./documents', './downloads'],
options: {
method: 'content',
keep: 'newest', // keep newest, delete oldest
action: 'delete', // or 'move' to archive
autoConfirm: false // show confirmation for each
}
});
console.log(`Removed ${result.filesRemoved} duplicates`);
console.log(`Space saved: ${result.spaceSaved}`);
Dry-Run Preview
const result = await removeDuplicates({
directories: ['./documents', './downloads'],
options: {
method: 'content',
keep: 'newest',
action: 'delete',
dryRun: true // Preview without actual deletion
}
});
console.log('Would remove:');
result.duplicates.forEach((dup, i) => {
console.log(`${i+1}. ${dup.file}`);
});
Tool Functions
findDuplicates
Find duplicate files across directories.
Parameters:
directories(array|string, required): Directory paths to scanoptions(object, optional):method(string): 'content' | 'size' | 'name' - comparison methodincludeSubdirs(boolean): Scan recursively (default: true)minSize(number): Minimum size in bytes (default: 0)maxSize(number): Maximum size in bytes (default: 0)excludePatterns(array): Glob patterns to exclude (default: ['.git', 'node_modules'])whitelist(array): Directories to never scan (default: [])
Returns:
duplicates(array): Array of duplicate groupsduplicateCount(number): Number of duplicate groups foundtotalFiles(number): Total files scannedscanDuration(number): Time taken to scan (ms)spaceWasted(number): Total bytes wasted by duplicatesspaceSaved(number): Potential savings if duplicates removed
removeDuplicates
Remove duplicate files based on findings.
Parameters:
directories(array|string, required): Same as findDuplicatesoptions(object, optional):keep(string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keepaction(string): 'delete' | 'move' | 'archive'archivePath(string): Where to move files when action='move'dryRun(boolean): Preview without actual actionautoConfirm(boolean): Auto-confirm deletionssizeThreshold(number): Don't remove files larger than this
Returns:
filesRemoved(number): Number of files removed/movedspaceSaved(number): Bytes savedgroupsProcessed(number): Number of duplicate groups handledlogPath(string): Path to action logerrors(array): Any errors encountered
analyzeDirectory
Analyze a single directory for duplicates.
Parameters:
directory(string, required): Path to directoryoptions(object, optional): Same as findDuplicates options
Returns:
fileCount(number): Total files in directorytotalSize(number): Total bytes in directoryduplicateSize(number): Bytes in duplicate filesduplicateRatio(number): Percentage of files that are duplicates
Use Cases
Digital Hoarder Cleanup
- Find duplicate photos/videos
- Identify wasted storage space
- Remove old duplicates, keep newest
- Clean up download folders
Document Management
- Find duplicate PDFs, docs, reports
- Keep latest version, archive old versions
- Prevent version confusion
- Reduce backup bloat
Project Cleanup
- Find duplicate source files
- Remove duplicate build artifacts
- Clean up node_modules duplicates
- Save storage on SSD/HDD
Backup Optimization
- Find duplicate backup files
- Remove redundant backups
- Identify what's actually duplicated
- Save space on backup drives
Configuration
Edit config.json:
{
"detection": {
"defaultMethod": "content",
"sizeTolerancePercent": 0, // exact match only
"nameSimilarity": 0.7, // 0-1, lower = more similar
"includeSubdirs": true
},
"removal": {
"defaultAction": "delete",
"defaultKeep": "newest",
"archivePath": "./archive",
"sizeThreshold": 10485760, // 10MB threshold
"autoConfirm": false,
"dryRunDefault": false
},
"exclude": {
"patterns": [".git", "node_modules", ".vscode", ".idea"],
"whitelist": ["important", "work", "projects"]
}
}
Methods
Content-Based (Recommended)
- Fast MD5 hashing
- Detects exact duplicates regardless of filename
- Works across renamed files
- Perfect for documents, code, archives
Size-Based
- Compares file sizes
- Faster than content hashing
- Good for media files where content hashing is slow
- Finds near-duplicates (similar but not exact)
Name-Based
- Compares filenames
- Detects similar named files
- Good for finding version duplicates (file_v1, file_v2)
Examples
Find Duplicates in Documents
const result = await findDuplicates({
directories: '~/Documents',
options: {
method: 'content',
includeSubdirs: true
}
});
console.log(`Found ${result.duplicateCount} duplicate sets`);
result.duplicates.slice(0, 5).forEach((set, i) => {
console.log(`Set ${i+1}: ${set.files.length} files`);
console.log(` Total size: ${set.totalSize} bytes`);
});
Remove Duplicates, Keep Newest
const result = await removeDuplicates({
directories: '~/Documents',
options: {
keep: 'newest',
action: 'delete'
}
});
console.log(`Removed ${result.filesRemoved} files`);
console.log(`Saved ${result.spaceSaved} bytes`);
Move to Archive Instead of Delete
const result = await removeDuplicates({
directories: '~/Downloads',
options: {
keep: 'newest',
action: 'move',
archivePath: '~/Documents/Archive'
}
});
console.log(`Archived ${result.filesRemoved} files`);
console.log(`Safe in: ~/Documents/Archive`);
Dry-Run Preview Changes
const result = await removeDuplicates({
directories: '~/Documents',
options: {
dryRun: true // Just show what would happen
}
});
console.log('=== Dry Run Preview ===');
result.duplicates.forEach((set, i) => {
console.log(`Would delete: ${set.toDelete.join(', ')}`);
});
Performance
Scanning Speed
- Small directories (<1000 files): <1s
- Medium directories (1000-10000 files): 1-5s
- Large directories (10000+ files): 5-20s
Detection Accuracy
- Content-based: 100% (exact duplicates)
- Size-based: Fast but may miss renamed files
- Name-based: Detects naming patterns only
Memory Usage
- Hash cache: ~1MB per 100,000 files
- Batch processing: Processes 1000 files at a time
- Peak memory: ~200MB for 1M files
Safety Features
Size Thresholding
Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.
Archive Mode
Move files to archive directory instead of deleting. No data loss, full recoverability.
Action Logging
All deletions/moves are logged to file for recovery and audit.
Undo Functionality
Log file can be used to restore accidentally deleted files (limited undo window).
Error Handling
Permission Errors
- Clear error message
- Suggest running with sudo
- Skip files that can't be accessed
File Lock Errors
- Detect locked files
- Skip and report
- Suggest closing applications using files
Space Errors
- Check available disk space before deletion
- Warn if space is critically low
- Prevent disk-full scenarios
Troubleshooting
Not Finding Expected Duplicates
- Check detection method (content vs size vs name)
- Verify exclude patterns aren't too broad
- Check if files are in whitelisted directories
- Try with includeSubdirs: false
Deletion Not Working
- Check write permissions on directories
- Verify action isn't 'delete' with autoConfirm: true
- Check size threshold isn't blocking all deletions
- Check file locks (is another program using files?)
Slow Scanning
- Reduce includeSubdirs scope
- Use size-based detection (faster)
- Exclude large directories (node_modules, .git)
- Process directories individually instead of batch
Tips
Best Results
- Use content-based detection for documents (100% accurate)
- Run dry-run first to preview changes
- Archive instead of delete for important files
- Check logs if anything unexpected deleted
Performance Optimization
- Process frequently used directories first
- Use size threshold to skip large media files
- Exclude hidden directories from scan
- Process directories in parallel when possible
Space Management
- Regular duplicate cleanup prevents storage bloat
- Delete temp directories regularly
- Clear download folders of installers
- Empty trash before large scans
Roadmap
- Duplicate detection by image similarity
- Near-duplicate detection (similar but not exact)
- Duplicate detection across network drives
- Cloud storage integration (S3, Google Drive)
- Automatic scheduling of scans
- Heuristic duplicate detection (ML-based)
- Recover deleted files from backup
- Duplicate detection by file content similarity (not just hash)
License
MIT
Find duplicates. Save space. Keep your system clean. ๐ฎ