I used this project to learn Rust, and I am now archiving it because I no longer have time to work on it. The real-world performance of the latest version of DuFF on our lab's storage [hundreds of TBs] was sub-par, with the internal data structures requiring a significant amount of RAM.
fclones seems like a much more mature duplicate file finder, so it is likely a good alternative.
DuFF [Duplicate File Finder] is a small program written in Rust that searches the specified directories on a file system for duplicate files, in parallel.
DuFF features:
- Size filtering [min, max, or both!]
- Extension filtering
- Parallel processing
I had originally been implementing all of this in bash, but since I wanted to learn Rust, this seemed like a good project to do it with!
```
cargo install duff
./duff -d /home/mike/Desktop -o /home/mike/duff_output -j 4
```
The only required argument is -d, the directories to search for duplicate files. A combined example follows the option list below.
- -d [--dir]: The directories to search for duplicate files, as a comma-separated list
Ex: -d /home/mike/Desktop,/home/rufus
- -a [--archive]: Tells DuFF to save a copy of all calculated hashes to use in a future DuFF run.
- -g [--log]: Saves the DuFF log file, which can be used to resume a DuFF run.
- -p [--prog]: Hides progress information
- -s [--silent]: Hides all console output
- -l [--lowlim]: Only examines files larger than the specified value.
- -u [--uplim]: Only examines files smaller than the specified value.
- -j [--jobs]: Tells DuFF the number of threads to use (defaults to 1)
- -e [--ext]: Only examines files with the specified extensions, input as a comma-separated list.
- -o [--out]: The directory where DuFF should store the output files (defaults to current working directory)
- -r [--resume]: Tells DuFF to skip the directory traversal and instead resume a prior run using the input log file.
- -x [--hash]: Points DuFF to a set of previously calculated hashes. As long as a file's mtime is unchanged, DuFF will not re-calculate its hash.
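For example, the following command searches two directories for duplicate .jpg and .png files using 8 threads, writes the output files to /home/mike/duff_output, and archives the calculated hashes for a later run (paths and values are illustrative):

Ex: ./duff -d /home/mike/Desktop,/home/rufus -e jpg,png -j 8 -o /home/mike/duff_output -a

The -a [--archive] and -x [--hash] options together describe an mtime-keyed hash cache: a previously calculated hash is reused only if the file's modification time has not changed since it was computed. The snippet below is not DuFF's actual implementation, just a minimal Rust sketch of that idea; the cache type, the placeholder hash function, and all names are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

// Hypothetical cache entry: the file's mtime when it was hashed, plus the hash.
// DuFF's real on-disk format and hash algorithm are not shown in this README.
type HashCache = HashMap<PathBuf, (SystemTime, u64)>;

// Return a content hash for `path`, reusing the cached value when the stored
// mtime still matches the file's current mtime (the behaviour --hash describes).
fn hash_with_cache(path: &Path, cache: &mut HashCache) -> io::Result<u64> {
    let mtime = fs::metadata(path)?.modified()?;

    if let Some((cached_mtime, cached_hash)) = cache.get(path) {
        if *cached_mtime == mtime {
            // mtime unchanged: skip re-hashing.
            return Ok(*cached_hash);
        }
    }

    // Placeholder content hash using std's DefaultHasher; a real tool would
    // use a proper content hash here.
    let bytes = fs::read(path)?;
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut hasher);
    let digest = hasher.finish();

    cache.insert(path.to_path_buf(), (mtime, digest));
    Ok(digest)
}

fn main() -> io::Result<()> {
    let mut cache = HashCache::new();
    // The first call hashes the file; the second reuses the cached value
    // because the mtime has not changed in between.
    let path = PathBuf::from("Cargo.toml");
    println!("{}", hash_with_cache(&path, &mut cache)?);
    println!("{}", hash_with_cache(&path, &mut cache)?);
    Ok(())
}
```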
- Ability to read in previously computed hashes [--hash arg]
- Resume functionality (skip file size examination using a user-provided previous working file)
  - Resume should allow multiple re-entry points depending on whether the user wants to search the directories again or just skip to hashing... needs more thought.
- Need to deal with issues when we traverse into the same directory twice.
  - Could definitely filter based on path in file_res: if a path isn't unique, delete all but one FileResult instance for that path (see the sketch after this list).
- Verify DuFF passes all tests mentioned in this rmlint blog post: https://rmlint.readthedocs.io/en/latest/cautions.html
- Add tests for extension filtering
- Add tests for size filtering
- Test Windows compatibility
- Verify we handle all I/O errors appropriately
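Regarding the duplicate-traversal item above: a minimal sketch of path-based de-duplication, assuming file_res is a Vec of FileResult values and that FileResult stores the file's path (the real struct's fields are not shown in this README):

```rust
use std::collections::HashSet;
use std::path::PathBuf;

// Stand-in for DuFF's internal FileResult; the actual fields are assumptions here.
#[derive(Debug)]
struct FileResult {
    path: PathBuf,
    size: u64,
}

// Keep only the first FileResult seen for each path, so traversing the same
// directory twice (e.g. via overlapping -d arguments) cannot make a file
// look like a duplicate of itself.
fn dedup_by_path(file_res: &mut Vec<FileResult>) {
    let mut seen: HashSet<PathBuf> = HashSet::new();
    file_res.retain(|fr| seen.insert(fr.path.clone()));
}

fn main() {
    let mut file_res = vec![
        FileResult { path: PathBuf::from("/home/mike/Desktop/a.txt"), size: 42 },
        FileResult { path: PathBuf::from("/home/mike/Desktop/a.txt"), size: 42 },
    ];
    dedup_by_path(&mut file_res);
    assert_eq!(file_res.len(), 1);
}
```

Canonicalizing paths first (std::fs::canonicalize) would also catch the same file reached via different relative paths or symlinks.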