
Duplicate Analyzer

A powerful utility for finding and safely removing duplicate images and media files with smart history tracking.

Features

  • Exact duplicate detection using MD5 hash comparison
  • Similar file detection using intelligent filename pattern matching
  • Smart file selection algorithm to choose which duplicates to keep
  • Interactive preview before any deletion occurs
  • Safety backup of all deleted files with timestamps
  • Persistent history tracking to avoid reprocessing files
  • Command line interface with flexible options

Quick Start

# Analyze a directory (will prompt if no path given)
python duplicate_analyzer.py

# Analyze specific directory
python duplicate_analyzer.py "C:\Path\To\Photos"

# Force fresh analysis (ignore previous history)
python duplicate_analyzer.py "C:\Path\To\Photos" --fresh

Command Line Options

# Basic usage - analyze directory
python duplicate_analyzer.py [directory_path] [options]

# Options:
--fresh              Ignore previous history, reprocess all files
--status             Show summary of last analysis
--preview-details    Show detailed preview of deletion plan
--execute            Execute deletion plan with backup
--directory path     Specify directory to analyze (alternative syntax)
-d path              Short form of --directory

# Examples:
python duplicate_analyzer.py "C:\Photos" --fresh
python duplicate_analyzer.py --directory "C:\Photos"
python duplicate_analyzer.py -d "C:\Photos" --fresh
python duplicate_analyzer.py --status
python duplicate_analyzer.py --execute

Workflow

1. Analysis Phase

python duplicate_analyzer.py "C:\Your\Photos\Directory"
  • Scans all image/video files in the directory
  • Uses existing history to skip previously processed files (unless --fresh is used)
  • Identifies exact duplicates by MD5 hash
  • Finds similar files based on filename patterns
  • Generates a smart deletion plan
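The exact-duplicate step can be sketched as follows: hash each file in chunks (so large videos don't load fully into memory) and group files by digest. This is a minimal illustration of the technique, not the tool's actual code.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def md5_of_file(path, chunk_size=65536):
    """Hash a file in fixed-size chunks to keep memory use low."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(directory):
    """Group files by MD5; any group with more than one file is a set of exact duplicates."""
    groups = defaultdict(list)
    for path in Path(directory).rglob("*"):
        if path.is_file():
            groups[md5_of_file(path)].append(str(path))
    return {digest: files for digest, files in groups.items() if len(files) > 1}
```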

2. Review Phase

python duplicate_analyzer.py --preview-details
  • Review the generated deletion_preview.json file
  • Check which files will be kept vs deleted
  • Verify the space savings estimate
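Verifying the plan programmatically is straightforward since the preview is JSON. The sketch below assumes a hypothetical schema (a `groups` list whose entries carry `delete` records with `size_bytes`); check your actual `deletion_preview.json` for the real key names.

```python
import json

def summarize_preview(preview_path="deletion_preview.json"):
    """Count planned deletions and sum the estimated space savings.

    NOTE: the schema (groups -> delete -> size_bytes) is an assumed
    example, not the tool's documented format.
    """
    with open(preview_path) as f:
        plan = json.load(f)
    to_delete = [entry for group in plan["groups"] for entry in group["delete"]]
    bytes_saved = sum(entry["size_bytes"] for entry in to_delete)
    return len(to_delete), bytes_saved
```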

3. Execution Phase

python duplicate_analyzer.py --execute
  • Creates timestamped backup directory
  • Safely deletes duplicate files
  • Updates history tracking
  • Provides detailed execution report
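The backup-before-delete pattern described above can be sketched like this: copy each file into a timestamped folder first, and only delete the original once the copy succeeds. Locked or missing files are skipped rather than aborting the run.

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_and_delete(files, backup_root="backups"):
    """Copy each file into a timestamped backup folder, then delete the original."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = Path(backup_root) / stamp
    backup_dir.mkdir(parents=True, exist_ok=True)
    deleted = []
    for f in map(Path, files):
        try:
            shutil.copy2(f, backup_dir / f.name)  # copy first, with metadata
            f.unlink()                            # delete only after the copy succeeded
            deleted.append(str(f))
        except OSError as e:
            print(f"  ! Skipped {f}: {e}")        # locked or missing file
    return backup_dir, deleted
```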

How the Smart Selection Works

For each group of duplicates, the tool scores files based on:

  • File size (larger files preferred - likely higher quality)
  • Filename patterns (originals preferred over copies like "image(1).jpg")
  • Edit indicators (non-edited preferred unless edited version is significantly larger)
  • Directory location (organized folders preferred over generic "Mobile Uploads")

The highest-scoring file in each group is kept; the others are marked for deletion.
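The scoring criteria above can be sketched as a simple heuristic. The weights and regex here are illustrative only, not the tool's actual values:

```python
import re
from pathlib import Path

# Matches copy-style suffixes like "image(1)" or "photo copy" (illustrative pattern).
COPY_PATTERN = re.compile(r"\(\d+\)$|copy$", re.IGNORECASE)

def score_file(path, size_bytes):
    """Score a candidate; higher is better. Weights are made up for illustration."""
    score = size_bytes                       # larger files preferred (likely higher quality)
    stem = Path(path).stem
    if COPY_PATTERN.search(stem):
        score -= 1_000_000                   # penalize "image(1).jpg"-style copies
    if "edited" in stem.lower():
        score -= 500_000                     # prefer non-edited unless much larger
    if "mobile uploads" in str(path).lower():
        score -= 250_000                     # prefer organized folders
    return score

def pick_keeper(group):
    """group: list of (path, size_bytes) tuples. Return the file to keep."""
    return max(group, key=lambda item: score_file(*item))
```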

History Tracking

The tool maintains analysis_history.json to track:

  • Previously processed files - skipped in future runs for efficiency
  • Deleted files - won't be reprocessed if they somehow reappear
  • Analysis timestamps - track when operations were performed
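The skip logic can be sketched as follows. The JSON key names (`processed`, `deleted`, `runs`) are assumed for illustration; inspect your own `analysis_history.json` for the real structure.

```python
import json
from pathlib import Path

def load_history(history_path="analysis_history.json"):
    """Load history if present; key names are assumed, not the tool's documented schema."""
    p = Path(history_path)
    if not p.exists():
        return {"processed": [], "deleted": [], "runs": []}
    return json.loads(p.read_text())

def files_to_process(candidates, history, fresh=False):
    """Skip files seen in a previous run unless a fresh analysis was requested."""
    if fresh:
        return list(candidates)
    seen = set(history["processed"]) | set(history["deleted"])
    return [f for f in candidates if f not in seen]
```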

History Commands

# Check what's in your history
python duplicate_analyzer.py --status

# Force fresh analysis (ignore history)
python duplicate_analyzer.py "C:\Photos" --fresh

# Normal analysis (uses history)
python duplicate_analyzer.py "C:\Photos"

Safety Features

  • Backup before deletion - All deleted files are copied to timestamped backup folders
  • Preview before action - Always generates a preview file first
  • History persistence - Tracks all operations to prevent data loss
  • Error handling - Graceful handling of locked or missing files
  • Verbose output - Clear progress indication and status messages

Supported File Types

  • Images: .jpg, .jpeg, .png, .gif, .bmp, .tiff
  • Videos: .mp4, .mov, .avi
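Filtering to these types amounts to a case-insensitive extension check, which could look like this:

```python
from pathlib import Path

# Extensions from the supported-types list above.
MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff",
                    ".mp4", ".mov", ".avi"}

def is_media_file(path):
    """True if the file has a supported image or video extension (case-insensitive)."""
    return Path(path).suffix.lower() in MEDIA_EXTENSIONS
```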

Troubleshooting

"No output" or script seems stuck

  • The script now provides verbose output at each step
  • If analyzing a large directory, be patient - it will show progress every 500 files

"Permission denied" errors

  • Run as administrator if needed
  • Check that files aren't open in other applications

Want to start over completely?

  • Use the --fresh flag to ignore all previous history
  • Delete analysis_history.json and deletion_preview.json if you want to completely reset

Example Output

============================================================
DUPLICATE ANALYZER - STARTING ANALYSIS
============================================================
Target directory: C:\Users\user\Photos
Fresh analysis mode: NO
Analysis started at: 2024-01-15 14:30:25
============================================================
✓ Found existing history with 1,234 previously processed files
  These files will be skipped to improve performance.
  Use --fresh to ignore history and reprocess all files.

STEP 1: Finding exact duplicates by MD5 hash...
--------------------------------------------------
  Processed 500 files...
  Processed 1000 files...
✓ Found 15 groups with exact duplicate files

STEP 2: Finding pattern-based duplicates...
--------------------------------------------------
  Processed 500 files...
✓ Found 8 groups with similar filename patterns

STEP 3: Analyzing duplicates and generating deletion plan...
--------------------------------------------------
✓ Detailed deletion plan saved to: deletion_preview.json

Author

Chad Kovac

Version History

  • v1.2: Fixed command line argument parsing, removed debug output, production ready
  • v1.1: Added command line argument support, verbose output, and fresh analysis mode
  • v1.0: Initial release with basic duplicate detection and history tracking

Recent Updates (v1.2)

  • ✅ Fixed command line argument parsing for directory paths
  • ✅ Improved path detection for Windows-style paths
  • ✅ Removed debug output for cleaner production use
  • ✅ Enhanced error handling and user feedback
  • ✅ Verified compatibility with Google Photos Takeout exports

Usage Examples

# Google Photos Takeout analysis
python duplicate_analyzer.py "C:\Users\username\Downloads\takeout-20241230T134448Z-001\Takeout\Google Photos"

# Regular photo directory
python duplicate_analyzer.py "C:\Users\username\Pictures"

# Force fresh analysis (ignore history)
python duplicate_analyzer.py "C:\Photos" --fresh

# Check results from previous run
python duplicate_analyzer.py --status

# Execute deletion plan
python duplicate_analyzer.py --execute

Add to your PowerShell profile ($PROFILE)

Function for duplicate analysis

function Invoke-DuplicateAnalysis {
    param([string]$Path = ".", [switch]$Fresh)

    $scriptPath = "e:\DuplicateAnalyzer\duplicate_analyzer.py"

    if ($Fresh) {
        python $scriptPath --fresh --directory $Path
    } else {
        python $scriptPath --directory $Path
    }
}

Aliases

Set-Alias -Name dupcheck -Value Invoke-DuplicateAnalysis

# PowerShell aliases can only point to commands, not script blocks,
# so the status and execute shortcuts are defined as functions.
function dup-status { python "e:\DuplicateAnalyzer\duplicate_analyzer.py" --status }
function dup-execute { python "e:\DuplicateAnalyzer\duplicate_analyzer.py" --execute }

Using the PowerShell script

.\Invoke-DuplicateAnalyzer.ps1 "C:\Photos"
.\Invoke-DuplicateAnalyzer.ps1 -Fresh -Directory "C:\Photos"
.\Invoke-DuplicateAnalyzer.ps1 -Status
.\Invoke-DuplicateAnalyzer.ps1 -Execute

Using batch files

.\analyze.bat "C:\Photos"
.\analyze-fresh.bat "C:\Photos"
.\execute-deletion.bat

Using aliases (if added to profile)

dupcheck "C:\Photos"
dup-status
dup-execute

Direct Python execution

python duplicate_analyzer.py "C:\Photos"
python duplicate_analyzer.py --fresh --directory "C:\Photos"
python duplicate_analyzer.py --status
python duplicate_analyzer.py --execute
