Implement Duplicate Finder #1107

Closed · dix0nym wants to merge 1 commit

Conversation

@dix0nym dix0nym commented Nov 1, 2024

Implement a feature to detect and delete duplicate archives in the library based on the hamming distance between thumbnail hashes (#338).

The minion job is parallelized and way faster than the initial plugin; for my library (27k archives, 730 GB), a full run took about an hour.
Additionally, I improved on the script: instead of only returning duplicate pairs (e.g. with 3 duplicate archives, the pairs id1 -> id2 and id1 -> id3), it now returns a single group [id1, id2, id3].
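As an illustration of that pair-to-group consolidation, here is a minimal JavaScript sketch (not the actual Perl job; groupDuplicates and its inputs are hypothetical): duplicate pairs are treated as graph edges and each connected component becomes one group.

// Sketch: collapse duplicate pairs into groups by treating pairs as edges
// and returning the connected components of the resulting graph.
function groupDuplicates(pairs) {
    const adjacency = new Map();
    const addEdge = (a, b) => {
        if (!adjacency.has(a)) adjacency.set(a, new Set());
        adjacency.get(a).add(b);
    };
    for (const [a, b] of pairs) {
        addEdge(a, b);
        addEdge(b, a);
    }

    const visited = new Set();
    const groups = [];
    for (const id of adjacency.keys()) {
        if (visited.has(id)) continue;
        // Breadth-first walk over everything reachable from this ID.
        const group = [];
        const queue = [id];
        visited.add(id);
        while (queue.length > 0) {
            const current = queue.shift();
            group.push(current);
            for (const neighbor of adjacency.get(current)) {
                if (!visited.has(neighbor)) {
                    visited.add(neighbor);
                    queue.push(neighbor);
                }
            }
        }
        groups.push(group);
    }
    return groups;
}

// groupDuplicates([["id1", "id2"], ["id1", "id3"]]) returns [["id1", "id2", "id3"]]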

The results can be viewed at /duplicates; the page is currently not linked from anywhere else. It should be considered whether to link it from the navbar or the settings page. Furthermore, the CSS is defined inline in the HTML; maybe it should be moved into a separate file?

[Screenshot: view-duplicates]

I could not figure out how to do the tooltip for the tags like on the index page. It would be great to see the actual tags when hovering over the tag count. I would need some help there.

I implemented different ways to delete the duplicates:

  • manual - one by one
  • selecting based on attribute
    • tag count
    • file size
    • page count
    • date

[Screenshot: select]
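Conceptually, the attribute-based selection boils down to something like the following JavaScript sketch (illustrative only; selectForDeletion and the attribute accessor are hypothetical names, not the actual duplicates.js code):

// Sketch: for each duplicate group, keep the archive with the highest value
// for the chosen attribute (tag count, file size, page count, date)
// and mark every other archive in the group for deletion.
function selectForDeletion(groups, getAttribute) {
    const toDelete = [];
    for (const group of groups) {
        const keep = group.reduce((best, archive) =>
            getAttribute(archive) > getAttribute(best) ? archive : best);
        for (const archive of group) {
            if (archive !== keep) toDelete.push(archive);
        }
    }
    return toDelete;
}

// Example: keep the largest file in each group.
// const doomed = selectForDeletion(duplicateGroups, (a) => a.size);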

A batch delete endpoint in the API would be great; the toasts are a bit annoying when deleting several archives in a batch.

Tested on my library, detected and deleted 776 duplicate groups.

Would appreciate any feedback.

Owner

@Difegue Difegue left a comment

Thanks a lot for this! Looks delightful.

I'll give this a more in-depth review later, but I don't think the API should have any "batch" kind of endpoints if possible - single/atomic operations are easier to code on the server side, and clients can just loop a bunch of calls.

If the toasts are annoying I believe it'd be useful to add an optional parameter to the JS function for API calls to disable showing them on a successful operation.
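Something along these lines could work, as a sketch only (the helper shape, the showSuccessToast flag, and the endpoints shown are assumptions, not the actual server.js API):

// Sketch of an API helper with an opt-out for success toasts.
// Names and signature are assumed, not LANraragi's actual server.js code.
function callAPI(endpoint, method, successMessage, { showSuccessToast = true } = {}) {
    return fetch(endpoint, { method })
        .then((response) => {
            if (!response.ok) throw new Error(`API call failed: ${response.status}`);
            return response.json();
        })
        .then((data) => {
            // Skip the toast when the caller opts out, e.g. while batch-deleting.
            if (showSuccessToast) {
                console.log(successMessage); // stand-in for the real LRR.toast call
            }
            return data;
        });
}

// A batch delete could then loop single-archive calls with toasts disabled:
// ids.forEach((id) => callAPI(`/api/archives/${id}`, "DELETE", "Deleted!", { showSuccessToast: false }));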

@chu-shen
Contributor

Please add an ignore option to archives. For example, you might want to preserve duplicate archives with different translation styles.

Owner

@Difegue Difegue left a comment

Sorry for having taken so long to get to reviewing this - it's not a lot of lines, but since it's a brand new feature I wanted to give it a detailed review.

I think the overall logic and minion job part is solid, most of my comments are on cleaning up the UI.

@@ -107,6 +107,26 @@ sub regen_thumbnails {
);
}

# Queue the find_duplicates Minion job.
sub find_duplicates {
Owner

You don't need this API endpoint -- I believe you could just use the existing /api/minion/find_duplicates/queue by passing the threshold in the args array parameter.
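For instance, something like this from the frontend (a sketch only; the exact args encoding expected by the Minion queue endpoint is an assumption here):

// Sketch: queue the find_duplicates Minion job through the generic endpoint,
// passing the hamming-distance threshold in the args array.
// The way "args" is encoded below is assumed, not checked against the API docs.
function queueFindDuplicates(threshold) {
    const formData = new FormData();
    formData.append("args", JSON.stringify([threshold]));
    return fetch("/api/minion/find_duplicates/queue", {
        method: "POST",
        body: formData,
    }).then((response) => response.json()); // should contain the queued job id
}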

# Go through the archives in the content directory and build the template at the end.
sub index {
my $self = shift;
my $redis = $self->LRR_CONF->get_redis;
Owner

This should use get_redis_config so it doesn't pollute the ID list, imo.

true,
(d) => {
$(".find-duplicates").prop("disabled", false);
LRR.toast({
Owner

Instead of a toast, I'd just refresh the window so that the newly found duplicates appear.
(While making sure that the refreshed URL doesn't contain delete=1 so it doesn't insta-delete the dupes we just found!)
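For example, a small sketch (the delete parameter name is taken from the comment above):

// Sketch: reload the page once the job finishes, stripping delete=1 from the
// URL so the refresh doesn't immediately delete the duplicates we just found.
const url = new URL(window.location.href);
url.searchParams.delete("delete");
window.location.href = url.toString();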

* Sends a POST request to queue a find_duplicates job,
* detecting archive duplicates based on their thumbnail hashes.
*/
Server.findDuplicates = function () {
Owner

This function probably belongs in duplicates.js?

<h2 class="ih" style="text-align:center">Duplicates</h2>
<p>Found [% duplicates.size %] duplicate groups</p>

[% IF userlogged %]
Owner

You already require is_logged_in in Routing, so checking for userlogged is unnecessary here

<body>
<div class='ido' style='text-align:center; overflow-x:auto;'>
<h2 class="ih" style="text-align:center">Duplicates</h2>
<p>Found [% duplicates.size %] duplicate groups</p>
Owner

There should be some explanatory text here that details how the feature works.

("Found X duplicate groups" might be good to hide if there's no duplicate results? Not necessary though imo)

<button type="button" class="stdbtn find-duplicates">Find Duplicates</button>
<button type="button" class="stdbtn clear-duplicates">Clear Duplicates</button>
</div>
<div class="select-btn-group">
Owner

I'd hide this div if there are no duplicate groups found/available.

Comment on lines +205 to +212
<div class="thumbnail-wrapper">
<a href="/reader?id=[% archive.arcid %]" title="[% archive.title %]">
<img class="thumbnail" src="/api/archives/[% archive.arcid %]/thumbnail" width="100"/>
</a>
<div class="thumbnail-popover">
<img src="/api/archives/[% archive.arcid %]/thumbnail" />
</div>
</div>
Owner

I'd use the same tooltip mechanism as the main index here instead:

Suggested change
<div class="thumbnail-wrapper">
<a href="/reader?id=[% archive.arcid %]" title="[% archive.title %]">
<img class="thumbnail" src="/api/archives/[% archive.arcid %]/thumbnail" width="100"/>
</a>
<div class="thumbnail-popover">
<img src="/api/archives/[% archive.arcid %]/thumbnail" />
</div>
</div>
<div class="thumbnail-wrapper">
<a onmouseover="IndexTable.buildImageTooltip(this)" href="${new LRR.apiURL('/reader?id=[% archive.arcid %]')}" title="[% archive.title %]">
<img class="thumbnail" src="${new LRR.apiURL(`/api/archives/[% archive.arcid %]/thumbnail`)}" width="100"/>
</a>
<div class="caption" style="display: none;">
<img style="height:300px" src="${new LRR.apiURL('/api/archives/${data.arcid}/thumbnail')}"
onerror="this.src='${new LRR.apiURL('/img/noThumb.png')}'">
</div>
</div>


foreach my $id (@$_) {
# Skip if this ID has already been processed in another thread
next if $visited{$id};
Owner

Since you've used split_workload_by_cpu, normally each process/thread should have its own unique set of IDs.

I'm not sure you need visited as a result?

<script src="[% c.url_for("/js/common.js?$version") %]" type="text/JAVASCRIPT"></script>
<script src="[% c.url_for("/js/server.js?$version") %]" type="text/JAVASCRIPT"></script>
<script src="[% c.url_for("/js/duplicates.js?$version") %]" type="text/JAVASCRIPT"></script>
<style>
Owner

Yeah, I'm not a fan of having special CSS here; this will mess up custom themes.
Ideally I think everything here is doable with the base lrr.css and re-using Index bits for the tags and thumbnail popups.

It's fine to add a few extra classes to lrr.css if you'd need specific sizing.

@psilabs-dev
Contributor

Thanks for working on this problem!

I notice there's one new POST API in Routing. Is it fire-and-forget/browserless? I'd like to call the job via curl or an API client, let it run on a handful of CPUs, then collect the results after a couple of days.

In my case I'm hesitant to delete based on thumbnail similarity alone, but this feature would be a great starting point for further duplicate analysis based on custom user logic and code: e.g., merging tags among dupes, adding dupe sources to kept archives, prioritizing archives with the most likes/bookmarks (if supported) or based on an existing sort namespace, keeping different translations, setting do-not-downloads, etc.

This is also a comment long past implementation, so I understand if it's out of scope now.

@Difegue
Owner

Difegue commented Feb 19, 2025

The dupe scan job is just a minion task, so you can just fire and forget it, yeah.
I'll wrap up the PR myself eventually if the original writer doesn't come back - that's what I get for having taken 3 months to review it...

@dix0nym
Author

dix0nym commented Feb 20, 2025

The dupe scan job is just a minion task, so you can just fire and forget it, yeah. I'll wrap up the PR myself eventually if the original writer doesn't come back

I'm just busy at the moment. Around April I will have a bit more time - if that isn't too late. Feel free to wrap it up yourself if that should be the case.

that's what I get for having taken 3 months to review it...

No worries, I really like your comprehensive review of my PR!

In my case I'm hesitant to delete based on thumbnail similarity alone, but this feature would be a great starting point for further duplicate analysis based on custom user logic and code.

I’d love to explore more advanced methods for identifying duplicates in the future. The current approach is the simplest and most lightweight besides simple hash comparison, but I’m curious - how would you envision integrating custom user logic and code? Would this be through additional filters and sorting options, or are you thinking of a deeper integration, some kind of plugin support?

@Difegue
Owner

Difegue commented Feb 20, 2025

Oh, thanks! Feel free to take your time, I'm in no rush.

@psilabs-dev
Contributor

The current approach is the simplest and most lightweight besides simple hash comparison, but I’m curious - how would you envision integrating custom user logic and code? Would this be through additional filters and sorting options, or are you thinking of a deeper integration, some kind of plugin support?

It depends on the data source/plugin. For a site like Pixiv, the idea of "duplicate" doesn't really exist. Nhentai, on the other hand, has tons of true duplicates. Same manga, different translations; same translation, different translators; same series continuation, different to-be-continued chapters... Perhaps metadata plugins may be extended, e.g. nhentai-deduplication, e-hentai-deduplication, etc.

It would be difficult to incorporate all custom user logic, because different users have different definitions of what it means for two archives to be similar. It's better to keep two archives that may be duplicates than it is to delete a unique archive. So hands-off archive deletion of potentially tens of thousands of archives is not an easy ask.

The straightforward option is to offer an API interface and let users write their own duplicate detection in their language of choice: one API to trigger duplicate archive scanning, another to collect the results.

For example, trigger a scan on some endpoint:

curl -X POST http://localhost:3000/api/duplicates?cpus=4
# {"job_id": 123, "status": "success", "message": "Thumbnail duplicate scanning minion job queued!"}

then collect scan results:

curl -X GET http://localhost:3000/api/duplicates/123

return payload as list of similar archives by ID:

{
    "status": "success/in-progress/failed",
    "data/duplicate archive IDs": [
        ["abcd", "efgh"],
        ["ijkl", "mnop", "qrst"]
    ]
}

I also wonder if we can extract the "image similarity" method so we can choose methods besides levenshtein distance in the future...

To give a practical example: I have a number of tagged archives downloaded via nhentai-archivist and cleaned with an internal service. LRR has RO access to contents downloaded by archivist so it cannot delete archives. This is what I'd be eventually doing (on a privileged service).

  1. collect results of the duplicate scans
  2. categorize archives within each cluster by "language:*" tag
  3. within a specific bucket, perform page-by-page image similarity checks: if an archive is a subset of another archive to a large extent (e.g. 80%-100% contained): the smaller archive can be deleted without much loss of information. Otherwise, they are not sufficiently similar. Then break these buckets down to page content-specific buckets
  4. rank archives in these buckets by number of favorites, whether they have the "uncensored" tag or another indication of quality tag, whether they are in-progress, or whether they are in a static category. Basically find the "best quality" archive. Keep the archive with the highest ranking, queue deletion of the rest, and add some kind of "duplicate_source" tag for this kept archive pointing to the sources of deleted archives.
  5. add the deleted archives to the nhentai-archivist do-not-delete catalog, then validate/execute deletion of archives

Then add some basic logging and notifications, and run this once a month or so with a cron job. Of course, what I'm doing is specific to my setup.

@psilabs-dev
Contributor

psilabs-dev commented Mar 17, 2025

Just an update on this; I decided to go the route of deduplication by treating archives as sequences of images, and solving the problem by finding all sequences which are subsequences of another sequence. In this case, equality between images is replaced with sufficient similarity between resnet embeddings. Equal-length duplicate sequences are handled by an nhentai-specific metadata comparison algorithm to determine which archive to keep.
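As a rough illustration of that subsequence idea (not psilabs-dev's actual code; isSimilar stands in for a similarity check between two page embeddings, e.g. cosine similarity above a threshold):

// Sketch: fraction of archive A's pages that appear, in order, inside archive B.
// A value near 1.0 means A is (approximately) a subsequence of B and likely redundant.
function subsequenceCoverage(pagesA, pagesB, isSimilar) {
    if (pagesA.length === 0) return 0;
    let matched = 0;
    for (const pageB of pagesB) {
        if (matched < pagesA.length && isSimilar(pagesA[matched], pageB)) {
            matched += 1; // matched the next page of A, advance
        }
    }
    return matched / pagesA.length;
}

// e.g. treat coverage >= 0.8 as "A is contained in B", per the 80%-100% containment idea mentioned earlier.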

The job is taking a super long time (my attempt to multiprocess caused race conditions I was too lazy/dumb to fix :p), but around 16-17% of archives are projected to be duplicates (which, after checking, they indeed are). That's good enough for me, since it would shave a considerable number of GBs off my disk, so I'm just letting it run while I go on with my day.

@dix0nym dix0nym closed this by deleting the head repository Mar 30, 2025
Successfully merging this pull request may close these issues:

Create a script-type plugin to detect duplicate archives using thumbnails