Factors affecting performance of Jplag being run on more than 2,000 submissions #1900

FaizAlam · 2024-08-01T15:32:56Z

Hi, I am currently working on a project to generate plagiarism reports for coding contest submissions. We typically get 1,000 to 2,000 submissions per language, but sometimes more than 2,000. I have observed that for a large number of submissions the number of pair combinations is on the order of 10^6, and the time taken to generate the plagiarism report is much higher. I would like to know which factors affect report generation and what the average speed of a comparison is (let's assume an average of 30 lines of code).
Also, I have disabled clustering and am using -m 0.5, so the time taken is already lower. But can we optimize this any further?

I have also observed that --shown-comparisons defaults to 500, i.e. 500 comparison files are generated. Is there any way to make sure that at least the top comparisons for every submission are generated, without generating the files for all pair combinations?

Thanks,
Faiz

Answered by tsaglam

tsaglam · 2024-08-02T09:26:11Z

Maintainer

Disabling clustering helps with performance. You can also adjust --shown-comparisons if the report generation takes too long. Avoid additional features like --normalize or --match-merging, as they can increase the runtime significantly. For two submissions with 30 LOC, the comparison should take a few milliseconds. Finally, you can also increase the minimum token match with -t, but this also changes the matching sensitivity. I would only do that if the submissions are larger and the results are still good afterward.

At its core, two factors affect the performance of JPlag: the number of submissions (a quadratic factor due to pairwise comparison) and the size of the submissions (which especially affects parsing).
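
To make the effect of the submission count concrete, here is a minimal sketch that counts the unordered pairs among n submissions, n * (n - 1) / 2. The function name `pair_count` is only illustrative and is not part of JPlag.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n submissions: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Doubling the number of submissions roughly quadruples the number of pairs.
for n in (500, 1000, 2000, 4000):
    print(f"{n:>5} submissions -> {pair_count(n):>10,} pairs")
```

For 2,000 submissions this gives 1,999,000 pairs, which matches the 10^6 order of magnitude mentioned in the question.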

> Is there any way to make sure that we are able to generate at least the top comparisons for every submission without generating the entire combination files?

Currently, that is not possible. However, you could implement that for your own version of JPlag by adapting the corresponding methods in JPlagResult.
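
The selection logic for such an adaptation could look roughly like the sketch below. It is written in plain Python and does not use JPlag's actual API: the tuple layout `(first, second, similarity)` and the function name `top_comparisons_per_submission` are assumptions made for illustration, and it assumes the similarity values of all pairs are already available in memory.

```python
import heapq
from collections import defaultdict


def top_comparisons_per_submission(comparisons, k=1):
    """Keep every comparison that is among the k most similar ones
    for at least one of its two submissions.

    comparisons: iterable of (first, second, similarity) tuples (hypothetical layout).
    """
    by_submission = defaultdict(list)
    for comparison in comparisons:
        first, second, _ = comparison
        # Each comparison is a candidate for both submissions involved.
        by_submission[first].append(comparison)
        by_submission[second].append(comparison)

    selected = set()
    for candidates in by_submission.values():
        selected.update(heapq.nlargest(k, candidates, key=lambda c: c[2]))

    # Most similar pairs first, as in a report.
    return sorted(selected, key=lambda c: c[2], reverse=True)


example = [("a", "b", 0.9), ("a", "c", 0.2), ("b", "c", 0.4), ("c", "d", 0.1)]
print(top_comparisons_per_submission(example, k=1))
```

With this selection, at most k comparisons per submission are kept, so the number of report files grows linearly with the number of submissions rather than with the number of pairs.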

1 reply

tsaglam Aug 2, 2024
Maintainer

Side note: running a Java dataset of 2,504 submissions (~220 LOC each) takes ~47 seconds on my M1 MacBook Pro:

33 seconds parsing
13 seconds comparison
2 seconds writing 500 results

For other languages, parsing might be slower, because Java code is parsed with JavaC rather than ANTLR. Also, when the individual submissions are larger, the comparison takes more time.
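
As a back-of-the-envelope reading of the timings above, the following sketch computes each phase's share of the runtime and the wall-clock comparison throughput. It uses only the rounded figures from this reply (which sum to 48 rather than ~47 seconds), so the results are rough estimates for this one dataset and machine, not a general benchmark.

```python
submissions = 2504
# Rounded phase timings in seconds, as reported above.
phase_seconds = {"parsing": 33, "comparison": 13, "writing 500 results": 2}

total = sum(phase_seconds.values())
pairs = submissions * (submissions - 1) // 2  # unordered pairs

for phase, seconds in phase_seconds.items():
    print(f"{phase:<20} {seconds:>3} s  {seconds / total:6.1%}")

# Wall-clock throughput of the comparison phase for this run only.
pairs_per_second = pairs / phase_seconds["comparison"]
print(f"{pairs:,} pairs at about {pairs_per_second:,.0f} pairs per second")
```

In this run, parsing accounts for roughly two thirds of the total time, so for submissions of this size the comparison phase is not the dominant cost.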
