Implement Duplicate Finder #1107

Closed · dix0nym wants to merge 1 commit

Conversation

@dix0nym dix0nym commented Nov 1, 2024

Implement a feature to detect and delete duplicate archives in the library based on the hamming distance between thumbnail hashes (#338).

The minion job is parallelized and way faster than the initial plugin; for my library (27k archives, 730 GB), a full run took about an hour.
Additionally, I improved on the script: instead of only returning duplicate pairs (e.g. with 3 duplicate archives, the pairs id1 -> id2 and id1 -> id3), it now returns a single group [id1, id2, id3].
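As an illustration of that pair-to-group consolidation, here is a minimal JavaScript sketch (not the actual Perl job; groupDuplicates and its inputs are hypothetical): duplicate pairs are treated as graph edges and each connected component becomes one group.

// Sketch: collapse duplicate pairs into groups by treating pairs as edges
// and returning the connected components of the resulting graph.
function groupDuplicates(pairs) {
    const adjacency = new Map();
    const addEdge = (a, b) => {
        if (!adjacency.has(a)) adjacency.set(a, new Set());
        adjacency.get(a).add(b);
    };
    for (const [a, b] of pairs) {
        addEdge(a, b);
        addEdge(b, a);
    }

    const visited = new Set();
    const groups = [];
    for (const id of adjacency.keys()) {
        if (visited.has(id)) continue;
        // Breadth-first walk over everything reachable from this ID.
        const group = [];
        const queue = [id];
        visited.add(id);
        while (queue.length > 0) {
            const current = queue.shift();
            group.push(current);
            for (const neighbor of adjacency.get(current)) {
                if (!visited.has(neighbor)) {
                    visited.add(neighbor);
                    queue.push(neighbor);
                }
            }
        }
        groups.push(group);
    }
    return groups;
}

// groupDuplicates([["id1", "id2"], ["id1", "id3"]]) returns [["id1", "id2", "id3"]]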

The results can be viewed at /duplicates; the page is currently not linked from anywhere else. It should be considered whether to link it from the navbar or the settings page. Furthermore, the CSS is defined inline in the HTML; maybe it should be moved into a separate file?

[Screenshot: view-duplicates]

I could not figure out how to do the tooltip for the tags like on the index page. It would be great to see the actual tags when hovering over the tag count. I would need some help there.

I implemented different ways to delete the duplicates:

  • manual - one by one
  • selecting based on attribute
    • tag count
    • file size
    • page count
    • date

[Screenshot: select]
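Conceptually, the attribute-based selection boils down to something like the following JavaScript sketch (illustrative only; selectForDeletion and the attribute accessor are hypothetical names, not the actual duplicates.js code):

// Sketch: for each duplicate group, keep the archive with the highest value
// for the chosen attribute (tag count, file size, page count, date)
// and mark every other archive in the group for deletion.
function selectForDeletion(groups, getAttribute) {
    const toDelete = [];
    for (const group of groups) {
        const keep = group.reduce((best, archive) =>
            getAttribute(archive) > getAttribute(best) ? archive : best);
        for (const archive of group) {
            if (archive !== keep) toDelete.push(archive);
        }
    }
    return toDelete;
}

// Example: keep the largest file in each group.
// const doomed = selectForDeletion(duplicateGroups, (a) => a.size);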

A batch delete endpoint in the API would be great; the toasts are a bit annoying when deleting several archives in a batch.

Tested on my library, detected and deleted 776 duplicate groups.

Would appreciate any feedback.

Owner

@Difegue Difegue left a comment

Thanks a lot for this! Looks delightful.

I'll give this a more in-depth review later, but I don't think the API should have any "batch" kind of endpoints if possible - single/atomic operations are easier to code on the server side, and clients can just loop a bunch of calls.

If the toasts are annoying I believe it'd be useful to add an optional parameter to the JS function for API calls to disable showing them on a successful operation.
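Something along these lines could work, as a sketch only (the helper shape, the showSuccessToast flag, and the endpoints shown are assumptions, not the actual server.js API):

// Sketch of an API helper with an opt-out for success toasts.
// Names and signature are assumed, not LANraragi's actual server.js code.
function callAPI(endpoint, method, successMessage, { showSuccessToast = true } = {}) {
    return fetch(endpoint, { method })
        .then((response) => {
            if (!response.ok) throw new Error(`API call failed: ${response.status}`);
            return response.json();
        })
        .then((data) => {
            // Skip the toast when the caller opts out, e.g. while batch-deleting.
            if (showSuccessToast) {
                console.log(successMessage); // stand-in for the real LRR.toast call
            }
            return data;
        });
}

// A batch delete could then loop single-archive calls with toasts disabled:
// ids.forEach((id) => callAPI(`/api/archives/${id}`, "DELETE", "Deleted!", { showSuccessToast: false }));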

@chu-shen
Contributor

Please add an ignore option to archives. For example, you might want to preserve duplicate archives with different translation styles.

Owner

@Difegue Difegue left a comment

Sorry for having taken so long to get to reviewing this - it's not a lot of lines, but since it's a brand new feature I wanted to give it a detailed review.

I think the overall logic and minion job part is solid, most of my comments are on cleaning up the UI.

@@ -107,6 +107,26 @@ sub regen_thumbnails {
);
}

# Queue the find_duplicates Minion job.
sub find_duplicates {
Owner

You don't need this API endpoint -- I believe you could just use the existing /api/minion/find_duplicates/queue by passing the threshold in the args array parameter.
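For instance, something like this from the frontend (a sketch only; the exact args encoding expected by the Minion queue endpoint is an assumption here):

// Sketch: queue the find_duplicates Minion job through the generic endpoint,
// passing the hamming-distance threshold in the args array.
// The way "args" is encoded below is assumed, not checked against the API docs.
function queueFindDuplicates(threshold) {
    const formData = new FormData();
    formData.append("args", JSON.stringify([threshold]));
    return fetch("/api/minion/find_duplicates/queue", {
        method: "POST",
        body: formData,
    }).then((response) => response.json()); // should contain the queued job id
}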

# Go through the archives in the content directory and build the template at the end.
sub index {
my $self = shift;
my $redis = $self->LRR_CONF->get_redis;
Owner

This should use get_redis_config so it doesn't pollute the ID list, imo.

true,
(d) => {
$(".find-duplicates").prop("disabled", false);
LRR.toast({
Owner

Instead of a toast, I'd just refresh the window so that the newly found duplicates appear.
(While making sure that the refreshed URL doesn't contain delete=1 so it doesn't insta-delete the dupes we just found!)
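For example, a small sketch (the delete parameter name is taken from the comment above):

// Sketch: reload the page once the job finishes, stripping delete=1 from the
// URL so the refresh doesn't immediately delete the duplicates we just found.
const url = new URL(window.location.href);
url.searchParams.delete("delete");
window.location.href = url.toString();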

* Sends a POST request to queue a find_duplicates job,
* detecting archive duplicates based on their thumbnail hashes.
*/
Server.findDuplicates = function () {
Owner

This function probably belongs in duplicates.js?

<h2 class="ih" style="text-align:center">Duplicates</h2>
<p>Found [% duplicates.size %] duplicate groups</p>

[% IF userlogged %]
Owner

You already require is_logged_in in Routing, so checking for userlogged is unnecessary here

<body>
<div class='ido' style='text-align:center; overflow-x:auto;'>
<h2 class="ih" style="text-align:center">Duplicates</h2>
<p>Found [% duplicates.size %] duplicate groups</p>
Owner

There should be some explanatory text here that details how the feature works.

("Found X duplicate groups" might be good to hide if there's no duplicate results? Not necessary though imo)

<button type="button" class="stdbtn find-duplicates">Find Duplicates</button>
<button type="button" class="stdbtn clear-duplicates">Clear Duplicates</button>
</div>
<div class="select-btn-group">
Owner

I'd hide this div if there are no duplicate groups found/available.

Comment on lines +205 to +212
<div class="thumbnail-wrapper">
<a href="/reader?id=[% archive.arcid %]" title="[% archive.title %]">
<img class="thumbnail" src="/api/archives/[% archive.arcid %]/thumbnail" width="100"/>
</a>
<div class="thumbnail-popover">
<img src="/api/archives/[% archive.arcid %]/thumbnail" />
</div>
</div>
Owner

I'd use the same tooltip mechanism as the main index here instead:

Suggested change
<div class="thumbnail-wrapper">
<a href="/reader?id=[% archive.arcid %]" title="[% archive.title %]">
<img class="thumbnail" src="/api/archives/[% archive.arcid %]/thumbnail" width="100"/>
</a>
<div class="thumbnail-popover">
<img src="/api/archives/[% archive.arcid %]/thumbnail" />
</div>
</div>
<div class="thumbnail-wrapper">
<a onmouseover="IndexTable.buildImageTooltip(this)" href="${new LRR.apiURL('/reader?id=[% archive.arcid %]')}" title="[% archive.title %]">
<img class="thumbnail" src="${new LRR.apiURL(`/api/archives/[% archive.arcid %]/thumbnail`)}" width="100"/>
</a>
<div class="caption" style="display: none;">
<img style="height:300px" src="${new LRR.apiURL('/api/archives/${data.arcid}/thumbnail')}"
onerror="this.src='${new LRR.apiURL('/img/noThumb.png')}'">
</div>
</div>


foreach my $id (@$_) {
# Skip if this ID has already been processed in another thread
next if $visited{$id};
Owner

Since you've used split_workload_by_cpu, normally each process/thread should have its own unique set of IDs.

I'm not sure you need visited as a result?

<script src="[% c.url_for("/js/common.js?$version") %]" type="text/JAVASCRIPT"></script>
<script src="[% c.url_for("/js/server.js?$version") %]" type="text/JAVASCRIPT"></script>
<script src="[% c.url_for("/js/duplicates.js?$version") %]" type="text/JAVASCRIPT"></script>
<style>
Owner

Yeah, I'm not a fan of having special CSS here; this will mess up custom themes.
Ideally I think everything here is doable with the base lrr.css and re-using Index bits for the tags and thumbnail popups.

It's fine to add a few extra classes to lrr.css if you'd need specific sizing.

@psilabs-dev
Contributor

Thanks for working on this problem!

I notice there's one new POST API in Routing. Is it fire-and-forget/browserless? I'd like to call the job via curl or an API client, let it run on a handful of CPUs, then collect the results after a couple of days.

In my case I'm hesitant to delete based on thumbnail similarity alone, but this feature would be a great starting point for further duplicate analysis based on custom user logic and code: e.g., merging tags among dupes, adding dupe sources to kept archives, prioritizing archives with the most likes/bookmarks (if supported) or based on an existing sort namespace, keeping different translations, setting do-not-downloads, etc.

This is also a comment long past implementation, so I understand if it's out of scope now.

@Difegue
Owner

Difegue commented Feb 19, 2025

The dupe scan job is just a minion task, so you can just fire and forget it, yeah.
I'll wrap up the PR myself eventually if the original writer doesn't come back - that's what I get for having taken 3 months to review it...

@dix0nym
Author

dix0nym commented Feb 20, 2025

The dupe scan job is just a minion task, so you can just fire and forget it, yeah. I'll wrap up the PR myself eventually if the original writer doesn't come back

I'm just busy at the moment. Around April I will have a bit more time - if that isn't too late. Feel free to wrap it up yourself if that should be the case.

that's what I get for having taken 3 months to review it...

No worries, I really like your comprehensive review of my PR!

In my case I'm hesitant to delete based on thumbnail similarity alone, but this feature would be a great starting point for further duplicate analysis based on custom user logic and code.

I’d love to explore more advanced methods for identifying duplicates in the future. The current approach is the simplest and most lightweight besides simple hash comparison, but I’m curious - how would you envision integrating custom user logic and code? Would this be through additional filters and sorting options, or are you thinking of a deeper integration, some kind of plugin support?

@Difegue
Owner

Difegue commented Feb 20, 2025

Oh, thanks! Feel free to take your time, I'm in no rush.

@psilabs-dev
Contributor

The current approach is the simplest and most lightweight besides simple hash comparison, but I’m curious - how would you envision integrating custom user logic and code? Would this be through additional filters and sorting options, or are you thinking of a deeper integration, some kind of plugin support?

It depends on the data source/plugin. For a site like Pixiv, the idea of "duplicate" doesn't really exist. Nhentai, on the other hand, has tons of true duplicates. Same manga, different translations; same translation, different translators; same series continuation, different to-be-continued chapters... Perhaps metadata plugins may be extended, e.g. nhentai-deduplication, e-hentai-deduplication, etc.

It would be difficult to incorporate all custom user logic, because different users have different definitions of what it means for two archives to be similar. It's better to keep two archives that may be duplicates than it is to delete a unique archive. So hands-off archive deletion of potentially tens of thousands of archives is not an easy ask.

The straightforward option is to offer an API interface and let users write their own duplicate detection in their language of choice: one API to trigger duplicate archive scanning, another to collect the results.

For example, trigger a scan on some endpoint:

curl -X POST http://localhost:3000/api/duplicates?cpus=4
# {"job_id": 123, "status": "success", "message": "Thumbnail duplicate scanning minion job queued!"}

then collect scan results:

curl -X GET http://localhost:3000/api/duplicates/123

return payload as list of similar archives by ID:

{
    "status": "success/in-progress/failed",
    "data/duplicate archive IDs": [
        ["abcd", "efgh"],
        ["ijkl", "mnop", "qrst"]
    ]
}

I also wonder if we can extract the "image similarity" method so we can choose methods besides levenshtein distance in the future...

To give a practical example: I have a number of tagged archives downloaded via nhentai-archivist and cleaned with an internal service. LRR has RO access to contents downloaded by archivist so it cannot delete archives. This is what I'd be eventually doing (on a privileged service).

  1. collect results of the duplicate scans
  2. categorize archives within each cluster by "language:*" tag
  3. within a specific bucket, perform page-by-page image similarity checks: if an archive is a subset of another archive to a large extent (e.g. 80%-100% contained): the smaller archive can be deleted without much loss of information. Otherwise, they are not sufficiently similar. Then break these buckets down to page content-specific buckets
  4. rank archives in these buckets by number of favorites, whether they have the "uncensored" tag or another indication of quality tag, whether they are in-progress, or whether they are in a static category. Basically find the "best quality" archive. Keep the archive with the highest ranking, queue deletion of the rest, and add some kind of "duplicate_source" tag for this kept archive pointing to the sources of deleted archives.
  5. add the deleted archives to the nhentai-archivist do-not-delete catalog, then validate/execute deletion of archives

Then add some basic logging and notifications, and run this once a month or so with a cron job. Of course, what I'm doing is specific to my setup.

@psilabs-dev
Contributor

psilabs-dev commented Mar 17, 2025

Just an update on this; I decided to go the route of deduplication by treating archives as sequences of images, and solving the problem by finding all sequences which are subsequences of another sequence. In this case, equality between images is replaced with sufficient similarity between resnet embeddings. Equal-length duplicate sequences are handled by an nhentai-specific metadata comparison algorithm to determine which archive to keep.
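As a rough illustration of that subsequence idea (not psilabs-dev's actual code; isSimilar stands in for a similarity check between two page embeddings, e.g. cosine similarity above a threshold):

// Sketch: fraction of archive A's pages that appear, in order, inside archive B.
// A value near 1.0 means A is (approximately) a subsequence of B and likely redundant.
function subsequenceCoverage(pagesA, pagesB, isSimilar) {
    if (pagesA.length === 0) return 0;
    let matched = 0;
    for (const pageB of pagesB) {
        if (matched < pagesA.length && isSimilar(pagesA[matched], pageB)) {
            matched += 1; // matched the next page of A, advance
        }
    }
    return matched / pagesA.length;
}

// e.g. treat coverage >= 0.8 as "A is contained in B", per the 80%-100% containment idea mentioned earlier.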

The job is taking a super long time (my attempt to multiprocess caused race conditions I was too lazy/dumb to fix :p), but around 16-17% of archives are projected to be duplicates (which, after checking, they indeed are). That's good enough for me, since it would shave a considerable number of GBs off my disk, so I'm just letting it run while I go on with my day.

@dix0nym dix0nym closed this by deleting the head repository Mar 30, 2025
Successfully merging this pull request may close these issues:

Create a script-type plugin to detect duplicate archives using thumbnails