**This software deletes files. If it is used incorrectly, or if it has bugs, it may delete the wrong ones, causing data loss and/or rendering your system unusable. Make sure you have backed up all your data before proceeding.**
This Julia module helps with finding and deleting duplicate files. There are already numerous pieces of software, both free and commercial, that do this. Most of them have nice-looking graphical user interfaces. This module does not have a graphical user interface, or even a command-line interface. It can only be used from within Julia.
Instead of answering questions about which files to delete (which can be a very time-consuming process when there are many files to de-duplicate), the user writes a function that tells the software when one of two identical files should be deleted.
Let's assume that we want to find all duplicate files under the directory `foo`, and for each set of duplicates, keep one file. (We don't care which one.)
We're not allowed to use a decision function that simply returns `true` for any pair of files. It is required to be consistent in its choice of files to delete, so we use alphabetical ordering of the file paths as a tie breaker.

```julia
list = deduplicate_files("foo", (a,b) -> a.realpath > b.realpath, verbose=true, dry_run=true)
```
We then examine the returned list. If all looks good, we re-run the command without the `dry_run` argument.
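For example, inspecting the list and then doing the real run might look like this. (A sketch under one assumption: that the returned list holds the `DeduplicationFile` descriptors documented below, so each entry has a `realpath` field.)

```julia
# Dry run result: print what would be deleted, without deleting anything.
for f in list
    println(f.realpath)
end

# If the list looks right, run again without dry_run to actually delete.
list = deduplicate_files("foo", (a,b) -> a.realpath > b.realpath, verbose=true)
```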
The typical way to use the software is to first call

```julia
list = deduplicate_files(startdirs, dfun, dry_run=true)
```

where `startdirs` is an array of directory paths, and `dfun` is a "decision function" that tells the software which file(s) to delete from a set of duplicates. The directories will be searched recursively, and the duplicate files that would be deleted will be stored in a list. If the list looks correct, the function can be called again without `dry_run=true`, causing the files to actually be deleted. Otherwise, the decision function must be rewritten, and a new dry run performed.
The decision function `dfun(A,B)` must satisfy the following criteria (a sketch of a conforming function follows the list):

- It takes as input arguments two `DeduplicationFile` structs.
- It should return `true` if the file that `A` points to should be deleted after confirming that it is identical to the one that `B` points to. (And `false` otherwise.)
- It must behave like an `isless` function, establishing a consistent order. For example, `dfun(x,y)` and `dfun(y,x)` may not both be `true`. (But they may both be `false` if neither `x` nor `y` should be deleted.)
- It must not assume that the given files are identical. The software is allowed to call the decision function on suspected duplicates, deferring the byte-for-byte comparison until just before a duplicate is deleted.
- It should NOT delete the duplicate file. (This can have disastrous consequences if the user only intended to do a dry run.)
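To illustrate, here is a minimal decision function that satisfies all of the above, under a hypothetical policy of my own choosing: delete the copy with the longer path, using alphabetical order as a tie breaker. It orders files consistently, never deletes anything itself, and makes no assumption that the two files are identical.

```julia
# Hypothetical policy: prefer to keep the copy with the shorter path.
# Behaves like `isless`: dfun(x,y) and dfun(y,x) are never both true.
function dfun(a, b)
    la, lb = length(a.realpath), length(b.realpath)
    la != lb && return la > lb      # delete `a` if its path is longer
    return a.realpath > b.realpath  # tie breaker: alphabetical order
end
```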
The `DeduplicationFile` descriptors provided in the arguments to the decision function have the following fields (a brief inspection example follows the list):

- `start` - The starting point given.
- `realstart` - The absolute path of `start`, after expanding symbolic links.
- `relpath` - The relative path from `start` to the file in question.
- `realpath` - The absolute path to the file, after expanding symbolic links.
- `dirname` - The directory part of `realpath`.
- `dirinode` - The inode number of the directory in which the file resides.
- `basename` - The filename.
- `stat` - The `StatStruct` for the file. (See the documentation for `stat`.)
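For a quick look at what these descriptors contain, one might print a few fields from a dry-run result (a sketch; it assumes `list` holds such descriptors):

```julia
# Show where each candidate was found, relative to its starting point.
for f in list
    println(f.start, " : ", f.relpath, " (", f.stat.size, " bytes)")
end
```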
Let's assume that `foo` and `bar` are directories. We want to delete files under `foo` that have identical copies somewhere under `bar`, but only if they are larger than one kilobyte, the filenames are identical, and they are not hard-linked to the same inode. (Deleting hard-linked copies doesn't free up much disk space, so we keep them.) We write a decision function for this:
```julia
const p1 = "/path/to/foo"
const p2 = "/path/to/bar"

dfun(a,b) = a.start == p1 && b.start == p2 &&
            a.stat.size > 1024 && a.basename == b.basename &&
            a.stat.inode != b.stat.inode
```
Then we test our decision function by doing a dry run:

```julia
list = deduplicate_files([p1, p2], dfun, dry_run=true)
```
We might store the returned list to disk in order to keep a record of the files that were deleted and where their identical copies were found.
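One way to keep such a record (a sketch; it assumes the list holds the `DeduplicationFile` descriptors of the files slated for deletion, and the file name and format are up to you):

```julia
# Save the dry-run result before deleting anything.
open("dedup-record.txt", "w") do io
    for f in list
        println(io, f.realpath)
    end
end
```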
**This software is provided as-is, without any warranty of any kind.**
I wrote it in my spare time, because I needed it myself, and I open-sourced it in case somebody else might find it useful.
I haven't done any extensive testing, other than using it for the specific task that I wrote it for. In particular, I've never tested this on a file-system with hard-linked directories. (Why would you use hard-linked directories?)
If using this software breaks your system, I will not be able to help you in any way, except to tell you to restore from backups. You did make backups, didn't you?