A powerful utility for finding and safely removing duplicate images and media files with smart history tracking.
- Exact duplicate detection using MD5 hash comparison
- Similar file detection using intelligent filename pattern matching
- Smart file selection algorithm to choose which duplicates to keep
- Interactive preview before any deletion occurs
- Safety backup of all deleted files with timestamps
- Persistent history tracking to avoid reprocessing files
- Command line interface with flexible options
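The exact-duplicate step can be illustrated as grouping files by content hash. This is a minimal sketch of the approach, assuming hypothetical helper names (`md5_of_file`, `find_exact_duplicates`), not the tool's actual code:

```python
import hashlib
from collections import defaultdict

def md5_of_file(path, chunk_size=65536):
    """Hash file contents in chunks so large media files never load fully into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(paths):
    """Group file paths by MD5; any group with 2+ members is a set of exact duplicates."""
    groups = defaultdict(list)
    for p in paths:
        groups[md5_of_file(p)].append(p)
    return {digest: ps for digest, ps in groups.items() if len(ps) > 1}
```

Hashing by content (rather than comparing names or sizes) is what makes this step safe: two files land in the same group only if they are byte-for-byte identical.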
# Analyze a directory (will prompt if no path given)
python duplicate_analyzer.py
# Analyze specific directory
python duplicate_analyzer.py "C:\Path\To\Photos"
# Force fresh analysis (ignore previous history)
python duplicate_analyzer.py "C:\Path\To\Photos" --fresh

# Basic usage - analyze directory
python duplicate_analyzer.py [directory_path] [options]
# Options:
--fresh Ignore previous history, reprocess all files
--status Show summary of last analysis
--preview-details Show detailed preview of deletion plan
--execute Execute deletion plan with backup
--directory path Specify directory to analyze (alternative syntax)
-d path Short form of --directory
# Examples:
python duplicate_analyzer.py "C:\Photos" --fresh
python duplicate_analyzer.py --directory "C:\Photos"
python duplicate_analyzer.py -d "C:\Photos" --fresh
python duplicate_analyzer.py --status
python duplicate_analyzer.py --execute

python duplicate_analyzer.py "C:\Your\Photos\Directory"
- Scans all image/video files in the directory
- Uses existing history to skip previously processed files (unless --fresh is used)
- Identifies exact duplicates by MD5 hash
- Finds similar files based on filename patterns
- Generates a smart deletion plan
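The filename-pattern step can be pictured as normalizing common copy markers such as "(1)" or " - Copy" and grouping files that share a base name. This is a hedged sketch, not the tool's actual matching rules; the regex and helper names are assumptions:

```python
import re
from collections import defaultdict

# Assumed copy markers: a trailing "(1)"-style counter or " - Copy" suffix
COPY_MARKERS = re.compile(r"(\s*-\s*copy|\(\d+\))+$", re.IGNORECASE)

def base_name(stem):
    """Reduce 'IMG_001(1)' and 'IMG_001 - Copy' to the shared base 'img_001'."""
    return COPY_MARKERS.sub("", stem).strip().lower()

def group_by_pattern(stems):
    """Group filename stems by normalized base; 2+ members suggests duplicates."""
    groups = defaultdict(list)
    for s in stems:
        groups[base_name(s)].append(s)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

Unlike the MD5 step, pattern matching can only suggest candidates, which is why the tool previews the plan before deleting anything.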
python duplicate_analyzer.py --preview-details
- Review the generated deletion_preview.json file
- Check which files will be kept vs deleted
- Verify the space savings estimate
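Reviewing the plan programmatically might look like the snippet below. The JSON layout shown (a `groups` list whose entries carry `keep` and `delete` records with `path` and `size` fields) is an assumed shape for illustration, not a documented schema:

```python
import json

def summarize_plan(preview_path):
    """Tally files marked for deletion and the estimated space savings.

    Assumes each group looks like:
    {"keep": {...}, "delete": [{"path": ..., "size": ...}, ...]}
    """
    with open(preview_path) as f:
        plan = json.load(f)
    to_delete = [d for group in plan["groups"] for d in group["delete"]]
    return len(to_delete), sum(d["size"] for d in to_delete)
```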
python duplicate_analyzer.py --execute
- Creates timestamped backup directory
- Safely deletes duplicate files
- Updates history tracking
- Provides detailed execution report
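The execute step's safety model (copy each file into a timestamped backup folder, then delete the original) can be sketched as follows; the directory layout and helper name are illustrative assumptions:

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_and_delete(files, backup_root="backups"):
    """Copy each file into a timestamped folder, then delete the original."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = Path(backup_root) / stamp
    backup_dir.mkdir(parents=True, exist_ok=True)
    for f in map(Path, files):
        shutil.copy2(f, backup_dir / f.name)  # copy2 preserves timestamps/metadata
        f.unlink()                            # delete only after the copy succeeded
    return backup_dir
```

Copying before unlinking means a crash mid-run can duplicate a file in the backup but never lose one.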
For each group of duplicates, the tool scores files based on:
- File size (larger files preferred - likely higher quality)
- Filename patterns (originals preferred over copies like "image(1).jpg")
- Edit indicators (non-edited preferred unless edited version is significantly larger)
- Directory location (organized folders preferred over generic "Mobile Uploads")
The highest-scoring file in each group is kept; the others are marked for deletion.
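The heuristic above can be expressed as a simple additive score; the weights and helper names here (`keep_score`, `pick_keeper`) are illustrative assumptions, not the tool's actual values:

```python
import re

def keep_score(name, size_bytes, in_generic_folder=False):
    """Additive score for one duplicate; the file with the top score is kept.
    Weights are illustrative, not the tool's real values."""
    score = size_bytes / 1_000_000            # larger file: likely higher quality
    if re.search(r"\(\d+\)\.\w+$", name):     # "image(1).jpg" looks like a copy
        score -= 5
    if re.search(r"edited", name, re.IGNORECASE):
        score -= 2                            # prefer originals over edits,
                                              # unless the edit is much larger
    if in_generic_folder:                     # e.g. a "Mobile Uploads" dump
        score -= 1
    return score

def pick_keeper(candidates):
    """candidates: list of (name, size_bytes, in_generic_folder) tuples."""
    return max(candidates, key=lambda c: keep_score(*c))
```

Because the edit penalty is finite, an edited version that is several megabytes larger can still outscore the original, matching the "unless significantly larger" rule above.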
The tool maintains analysis_history.json to track:
- Previously processed files - skipped in future runs for efficiency
- Deleted files - won't be reprocessed if they somehow reappear
- Analysis timestamps - track when operations were performed
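The history mechanism can be sketched as a JSON file of already-seen paths that the scanner consults before hashing. The field names below are assumptions about the file's shape, not its documented format:

```python
import json
import os

HISTORY_FILE = "analysis_history.json"  # same filename the tool uses

def load_history(path=HISTORY_FILE):
    """Load prior state, or start empty on a first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"processed": [], "deleted": [], "runs": []}

def files_to_scan(all_files, history, fresh=False):
    """With --fresh, everything is rescanned; otherwise known paths are skipped."""
    if fresh:
        return list(all_files)
    seen = set(history["processed"]) | set(history["deleted"])
    return [f for f in all_files if f not in seen]
```

Tracking deleted paths separately is what prevents a restored-from-backup file from silently slipping back into the library unprocessed.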
# Check what's in your history
python duplicate_analyzer.py --status
# Force fresh analysis (ignore history)
python duplicate_analyzer.py "C:\Photos" --fresh
# Normal analysis (uses history)
python duplicate_analyzer.py "C:\Photos"

- Backup before deletion - All deleted files are copied to timestamped backup folders
- Preview before action - Always generates a preview file first
- History persistence - Tracks all operations to prevent data loss
- Error handling - Graceful handling of locked or missing files
- Verbose output - Clear progress indication and status messages
- Images: .jpg, .jpeg, .png, .gif, .bmp, .tiff
- Videos: .mp4, .mov, .avi
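Restricting the scan to these extensions is a one-line filter; a minimal sketch:

```python
from pathlib import Path

# The extensions listed above; lowercased so the check is case-insensitive
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff",
                    ".mp4", ".mov", ".avi"}

def is_media_file(path):
    """True if the path's extension matches a supported image or video type."""
    return Path(path).suffix.lower() in MEDIA_EXTENSIONS
```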
- The script now provides verbose output at each step
- If analyzing a large directory, be patient - it will show progress every 500 files
- Run as administrator if needed
- Check that files aren't open in other applications
- Use the --fresh flag to ignore all previous history
- Delete analysis_history.json and deletion_preview.json if you want to completely reset
============================================================
DUPLICATE ANALYZER - STARTING ANALYSIS
============================================================
Target directory: C:\Users\user\Photos
Fresh analysis mode: NO
Analysis started at: 2024-01-15 14:30:25
============================================================
✓ Found existing history with 1,234 previously processed files
These files will be skipped to improve performance.
Use --fresh to ignore history and reprocess all files.
STEP 1: Finding exact duplicates by MD5 hash...
--------------------------------------------------
Processed 500 files...
Processed 1000 files...
✓ Found 15 groups with exact duplicate files
STEP 2: Finding pattern-based duplicates...
--------------------------------------------------
Processed 500 files...
✓ Found 8 groups with similar filename patterns
STEP 3: Analyzing duplicates and generating deletion plan...
--------------------------------------------------
✓ Detailed deletion plan saved to: deletion_preview.json
Chad Kovac
- v1.2: Fixed command line argument parsing, removed debug output, production ready
- v1.1: Added command line argument support, verbose output, and fresh analysis mode
- v1.0: Initial release with basic duplicate detection and history tracking
- ✅ Fixed command line argument parsing for directory paths
- ✅ Improved path detection for Windows-style paths
- ✅ Removed debug output for cleaner production use
- ✅ Enhanced error handling and user feedback
- ✅ Verified compatibility with Google Photos Takeout exports
# Google Photos Takeout analysis
python duplicate_analyzer.py "C:\Users\username\Downloads\takeout-20241230T134448Z-001\Takeout\Google Photos"
# Regular photo directory
python duplicate_analyzer.py "C:\Users\username\Pictures"
# Force fresh analysis (ignore history)
python duplicate_analyzer.py "C:\Photos" --fresh
# Check results from previous run
python duplicate_analyzer.py --status
# Execute deletion plan
python duplicate_analyzer.py --execute

function Invoke-DuplicateAnalysis {
    param([string]$Path = ".", [switch]$Fresh)
    $scriptPath = "e:\DuplicateAnalyzer\duplicate_analyzer.py"
    if ($Fresh) {
        python $scriptPath --fresh --directory $Path
    } else {
        python $scriptPath --directory $Path
    }
}
Set-Alias -Name "dupcheck" -Value Invoke-DuplicateAnalysis
# Set-Alias cannot bind to a script block, so define functions for the fixed commands
function dup-status { python "e:\DuplicateAnalyzer\duplicate_analyzer.py" --status }
function dup-execute { python "e:\DuplicateAnalyzer\duplicate_analyzer.py" --execute }
.\Invoke-DuplicateAnalyzer.ps1 "C:\Photos"
.\Invoke-DuplicateAnalyzer.ps1 -Fresh -Directory "C:\Photos"
.\Invoke-DuplicateAnalyzer.ps1 -Status
.\Invoke-DuplicateAnalyzer.ps1 -Execute
.\analyze.bat "C:\Photos"
.\analyze-fresh.bat "C:\Photos"
.\execute-deletion.bat
dupcheck "C:\Photos"
dup-status
dup-execute
python duplicate_analyzer.py "C:\Photos"
python duplicate_analyzer.py --fresh --directory "C:\Photos"
python duplicate_analyzer.py --status
python duplicate_analyzer.py --execute