I used this project to learn Rust, and I am now archiving it because I no longer have time to work on it. The real-world performance of the latest version of DuFF on our lab's storage [hundreds of TBs] was sub-par, with the internal data structures requiring a significant amount of RAM.
fclones seems like a much more mature duplicate file finder, so it is likely a good alternative.
DuFF [Duplicate File Finder] is a small program written in Rust that searches the specified directories on a file system for duplicate files, in parallel.
DuFF features:
- Size filtering [min, max, or both!]
- Extension filtering
- Parallel processing
I had originally been implementing all of this in bash, but since I wanted to learn Rust, this seemed like a good project to do it with!
```
cargo install duff
./duff -d /home/mike/Desktop -o /home/mike/duff_output -j 4
```
The only required argument is -d, the directories to search for duplicate files. A combined example follows the option list below.
- -d [--dir]: The directories to search for duplicate files, as a comma-separated list
Ex: -d /home/mike/Desktop,/home/rufus
- -a [--archive]: Tells DuFF to save a copy of all calculated hashes to use in a future DuFF run.
- -g [--log]: Saves the DuFF log file, which can be used to resume a DuFF run.
- -p [--prog]: Hides progress information
- -s [--silent]: Hides all console output
- -l [--lowlim]: Only examines files larger than the specified value.
- -u [--uplim]: Only examines files smaller than the specified value.
- -j [--jobs]: Tells DuFF the number of threads to use (defaults to 1)
- -e [--ext]: Only examines files with the specified extensions, input as a comma-separated list.
- -o [--out]: The directory where DuFF should store the output files (defaults to current working directory)
- -r [--resume]: Tells DuFF to skip the directory traversal and instead resume a prior run using the input log file.
- -x [--hash]: Points DuFF to a set of previously calculated hashes. As long as a file's mtime is unchanged, DuFF will not re-calculate its hash.
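For example, the following command searches two directories for duplicate .jpg and .png files using 8 threads, writes the output files to /home/mike/duff_output, and archives the calculated hashes for a later run (paths and values are illustrative):

Ex: ./duff -d /home/mike/Desktop,/home/rufus -e jpg,png -j 8 -o /home/mike/duff_output -a

The -a [--archive] and -x [--hash] options together describe an mtime-keyed hash cache: a previously calculated hash is reused only if the file's modification time has not changed since it was computed. The snippet below is not DuFF's actual implementation, just a minimal Rust sketch of that idea; the cache type, the placeholder hash function, and all names are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

// Hypothetical cache entry: the file's mtime when it was hashed, plus the hash.
// DuFF's real on-disk format and hash algorithm are not shown in this README.
type HashCache = HashMap<PathBuf, (SystemTime, u64)>;

// Return a content hash for `path`, reusing the cached value when the stored
// mtime still matches the file's current mtime (the behaviour --hash describes).
fn hash_with_cache(path: &Path, cache: &mut HashCache) -> io::Result<u64> {
    let mtime = fs::metadata(path)?.modified()?;

    if let Some((cached_mtime, cached_hash)) = cache.get(path) {
        if *cached_mtime == mtime {
            // mtime unchanged: skip re-hashing.
            return Ok(*cached_hash);
        }
    }

    // Placeholder content hash using std's DefaultHasher; a real tool would
    // use a proper content hash here.
    let bytes = fs::read(path)?;
    let mut hasher = std::collections::hash_map::DefaultHasher::new();
    bytes.hash(&mut hasher);
    let digest = hasher.finish();

    cache.insert(path.to_path_buf(), (mtime, digest));
    Ok(digest)
}

fn main() -> io::Result<()> {
    let mut cache = HashCache::new();
    // The first call hashes the file; the second reuses the cached value
    // because the mtime has not changed in between.
    let path = PathBuf::from("Cargo.toml");
    println!("{}", hash_with_cache(&path, &mut cache)?);
    println!("{}", hash_with_cache(&path, &mut cache)?);
    Ok(())
}
```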
- Ability to read in previously computed hashes [--hash arg]
- Resume functionality (skip file size examination using a user-provided previous working file)
  - Resume should allow multiple re-entry points depending on whether the user wants to search the directories again or just skip to hashing... needs more thought.
- Need to deal with issues when we traverse into the same directory twice.
  - Could definitely filter based on path in file_res: if a path isn't unique, delete all but one FileResult instance for that path (see the sketch after this list).
- Verify DuFF passes all tests mentioned in this rmlint blog post: https://rmlint.readthedocs.io/en/latest/cautions.html
- Add tests for extension filtering
- Add tests for size filtering
- Test Windows compatibility
- Verify we handle all I/O errors appropriately
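Regarding the duplicate-traversal item above: a minimal sketch of path-based de-duplication, assuming file_res is a Vec of FileResult values and that FileResult stores the file's path (the real struct's fields are not shown in this README):

```rust
use std::collections::HashSet;
use std::path::PathBuf;

// Stand-in for DuFF's internal FileResult; the actual fields are assumptions here.
#[derive(Debug)]
struct FileResult {
    path: PathBuf,
    size: u64,
}

// Keep only the first FileResult seen for each path, so traversing the same
// directory twice (e.g. via overlapping -d arguments) cannot make a file
// look like a duplicate of itself.
fn dedup_by_path(file_res: &mut Vec<FileResult>) {
    let mut seen: HashSet<PathBuf> = HashSet::new();
    file_res.retain(|fr| seen.insert(fr.path.clone()));
}

fn main() {
    let mut file_res = vec![
        FileResult { path: PathBuf::from("/home/mike/Desktop/a.txt"), size: 42 },
        FileResult { path: PathBuf::from("/home/mike/Desktop/a.txt"), size: 42 },
    ];
    dedup_by_path(&mut file_res);
    assert_eq!(file_res.len(), 1);
}
```

Canonicalizing paths first (std::fs::canonicalize) would also catch the same file reached via different relative paths or symlinks.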