I used this project to learn Rust, and I am now archiving it because I no longer have the time to keep working on it. The real-world performance of the latest version of DuFF on our lab's storage [hundreds of TBs] was sub-par, and its internal data structures required a significant amount of RAM.
fclones seems like a much more mature duplicate file finder, so it is likely a good alternative.
DuFF [Duplicate File Finder] is a small program written in Rust that finds duplicate files in specified directories on a file system, in parallel.
DuFF features:
- Size filtering [min, max, or both!]
- Extension filtering
- Parallel processing
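As a rough illustration of how the size and extension filters could combine, here is a minimal sketch. The `Filter` struct, its field names, and the `accepts` method are hypothetical and are not taken from DuFF's source. It also assumes that sizes are given in bytes, that both limits are exclusive, and that extensions are compared case-insensitively.

```rust
use std::path::Path;

/// Hypothetical filter settings (names are illustrative, not DuFF's own).
struct Filter {
    /// Assumed exclusive lower limit in bytes; `None` means no lower limit.
    min_size: Option<u64>,
    /// Assumed exclusive upper limit in bytes; `None` means no upper limit.
    max_size: Option<u64>,
    /// Lowercase extensions without the leading dot; empty means accept all.
    extensions: Vec<String>,
}

impl Filter {
    /// Returns true if a file with this path and size should be examined.
    fn accepts(&self, path: &Path, size: u64) -> bool {
        // Reject files at or below the lower limit.
        if let Some(min) = self.min_size {
            if size <= min {
                return false;
            }
        }
        // Reject files at or above the upper limit.
        if let Some(max) = self.max_size {
            if size >= max {
                return false;
            }
        }
        // With no extension list, every file that passed the size checks is kept.
        if self.extensions.is_empty() {
            return true;
        }
        // Otherwise the file needs an extension that appears in the list.
        match path.extension().and_then(|e| e.to_str()) {
            Some(ext) => {
                let ext = ext.to_lowercase();
                self.extensions.iter().any(|want| want == &ext)
            }
            None => false,
        }
    }
}
```

Checking size before extension means files outside the size range are rejected without any string comparison.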
I had originally been implementing all of this in bash, but since I wanted to learn Rust, this seemed like a good program to learn it with!
cargo install duff
./duff -d /home/mike/Desktop -o /home/mike/duff_output -j 4
The only required argument is -d, which tells DuFF where to search for duplicate files.
- -d [--dir]: The directories you want to search for duplicate files as a comma separated list
Ex: -d /home/mike/Desktop,/home/rufus
- -a [--archive]: Tells DuFF to save a copy of all calculated hashes to use in a future DuFF run.
- -g [--log]: Saves the DuFF log file which can be used to resume a DuFF run.
- -p [--prog]: Hide progress information
- -s [--silent]: Hide all console output
- -l [--lowlim]: Only examine files larger than specified value.
- -u [--uplim]: Only examine files smaller than specified value.
- -j [--jobs]: Tell DuFF the number of threads to use (defaults to 1)
- -e [--ext]: Only examine files with the specified extensions, input as comma separated list.
- -o [--out]: The directory where DuFF should store the output files (defaults to current working directory)
- -r [--resume]: Tell DuFF to skip the directory traversal and instead resume prior run using input log file.
- -x [--hash]: Point DuFF to a set of previously calculated hashes for files. As long as the mtime is the same, DuFF will not re-calculate hashes for the listed files.
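The mtime check described for -x can be pictured with the sketch below. It is not DuFF's actual implementation: `CachedHash`, `hash_with_cache`, and the in-memory `HashMap` are hypothetical names chosen for illustration, and the hash function is passed in as a closure so that the example stays independent of any particular hashing crate.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Hypothetical cache entry: the mtime seen when the hash was computed, plus the hash.
#[derive(Clone, Debug, PartialEq)]
struct CachedHash {
    mtime: SystemTime,
    hash: String,
}

/// Returns the cached hash when the stored mtime equals `current_mtime`.
/// Otherwise calls `compute`, stores the fresh result, and returns it.
fn hash_with_cache<F>(
    cache: &mut HashMap<PathBuf, CachedHash>,
    path: &Path,
    current_mtime: SystemTime,
    compute: F,
) -> String
where
    F: FnOnce(&Path) -> String,
{
    // Reuse the old hash only if the file has not been modified since.
    if let Some(entry) = cache.get(path) {
        if entry.mtime == current_mtime {
            return entry.hash.clone();
        }
    }
    // Cache miss or stale entry: recompute and overwrite.
    let hash = compute(path);
    cache.insert(
        path.to_path_buf(),
        CachedHash { mtime: current_mtime, hash: hash.clone() },
    );
    hash
}
```

In a real run, `current_mtime` would typically come from `std::fs::metadata(path)?.modified()?`, and the cache would be loaded from the file passed to -x.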
Ability to read in previously computed hashes [--hash arg]
Resume functionality (skip file size examination using user provided previous working file)
Resume should allow multiple re-entry points depending on if user wants to search dirs again or just skip to hashing... needs more thought.
Need to deal with issues when we traverse into same directory twice.
Could definitely filter based on path in file_res: if a path isn't unique, delete all but one FileResult instance for this path.
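One possible shape for that path-based filter is sketched below. The `FileResult` struct here is a stand-in with assumed fields [`path` and `size`], not DuFF's real type, and `dedup_by_path` is a hypothetical helper name.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Stand-in for DuFF's FileResult; the fields are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
struct FileResult {
    path: PathBuf,
    size: u64,
}

/// Keeps only the first FileResult for each distinct path, preserving order.
fn dedup_by_path(file_res: &mut Vec<FileResult>) {
    let mut seen: HashSet<PathBuf> = HashSet::new();
    // `insert` returns false when the path was already seen, so repeats are dropped.
    file_res.retain(|r| seen.insert(r.path.clone()));
}
```

This compares paths exactly as they are stored. If the same directory can be reached through different spellings or symlinks, the paths would likely need to be normalized first, for example with `std::fs::canonicalize`.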
Verify DuFF passes all tests mentioned in this rmlint blog post:
- Add tests for extension filtering
- Add tests for size filtering
- Test Windows compatibility
- Verify we handle all I/O errors appropriately