GH-600 parallel parsing of NQUADS and N-Triples #601

hmottestad · 2025-03-20T14:03:44Z

Issue resolved (if any): #600

Description of this pull request:

Please check all the lines before posting the pull request:

I've created tests for all my changes
My pull request isn't fixing or changing multiple unlinked elements (please create one pull request for each element)
I've applied the code formatter (mvn formatter:format on the backend, npm run format on the frontend) before posting my pull request, mvn formatter:validate to validate the formatting on the backend, npm run validate on the frontend
All my commits have relevant names
I've squashed my commits (if necessary)

…d approach to parsing NQUADS and N-Triples files. Also implement more concurrent intermediary structures instead of refactoring the code to support multiple ElemStringBuffer buffers when parsing.
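Because N-Triples and N-Quads carry one statement per line, the input can be cut at line boundaries and each piece parsed independently. The following is only a minimal sketch of that idea, not the code in this pull request: `ChunkedLineParserSketch`, `parseChunk` and `parseInParallel` are hypothetical names, and `parseChunk` merely counts statement lines as a stand-in for a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedLineParserSketch {
    /** Hypothetical stand-in for a real parser: counts non-empty, non-comment lines. */
    public static int parseChunk(List<String> lines) {
        int statements = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                statements++;
            }
        }
        return statements;
    }

    /** Splits the input at line boundaries and parses each chunk on a worker thread. */
    public static int parseInParallel(List<String> lines, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunkSize) {
                int end = Math.min(start + chunkSize, lines.size());
                // Copy the sub-list so that each task owns its chunk.
                List<String> chunk = new ArrayList<>(lines.subList(start, end));
                Callable<Integer> task = () -> parseChunk(chunk);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // a worker failure surfaces here
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel parsing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of("<a> <b> <c> .", "# comment", "", "<d> <e> <f> .");
        System.out.println(parseInParallel(lines, 2, 2));
    }
}
```

Note that chunks finish in whatever order the workers complete, so a consumer of the parsed statements cannot rely on input order.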

hmottestad · 2025-03-20T14:04:58Z

Timed conversion of latest-lexemes.nt.gz from https://dumps.wikimedia.org/wikidatawiki/entities/ . Tested on an M3 Max with 16 cores. Originally 11 minutes, now 7 minutes.

Before

After

hmottestad · 2025-03-22T09:18:45Z

A few of the tests assumed that the RDF parser would return statements in a fixed and predictable order.

I fixed up a couple of them, but then found out that it's probably best to have a way to enable/disable parallel parsing.

Now all the tests are passing, but I'll need to double-check the performance to see that it's still as good as expected.

Can you start testing it, @ate47?
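For tests that assumed a fixed statement order, one possible fix is to compare results independently of order. This is a small sketch under that assumption, not the change made to the actual tests; `OrderInsensitiveCheckSketch` and `sameStatements` are hypothetical names, and statements are represented as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OrderInsensitiveCheckSketch {
    /** Returns true when both lists hold the same statements, ignoring order but not duplicates. */
    public static boolean sameStatements(List<String> expected, List<String> actual) {
        List<String> left = new ArrayList<>(expected);
        List<String> right = new ArrayList<>(actual);
        // Sorting copies keeps the caller's lists untouched.
        Collections.sort(left);
        Collections.sort(right);
        return left.equals(right);
    }

    public static void main(String[] args) {
        List<String> expected = List.of("<a> <b> <c> .", "<d> <e> <f> .");
        List<String> actual = List.of("<d> <e> <f> .", "<a> <b> <c> .");
        System.out.println(sameStatements(expected, actual));
    }
}
```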

ate47 · 2025-05-06T12:52:02Z

qendpoint-core/src/main/java/com/the_qa_company/qendpoint/core/rdf/parsers/RDFParserRIOT.java

+
+			Thread e1 = new Thread(() -> {
+				RDFParser.source(bnodes).base(baseUri).lang(lang).labelToNode(LabelToNode.createUseLabelAsGiven())
+						.parse(buffer);


What is the point of having a custom stream for the bnodes, and of disabling parallel parsing when it does not keep them?

ate47 · 2025-05-06T12:59:26Z

...-core/src/main/java/com/the_qa_company/qendpoint/core/rdf/parsers/ConcurrentInputStream.java

+import java.io.PipedOutputStream;
+import java.nio.charset.StandardCharsets;
+
+public class ConcurrentInputStream {


The error handling could be improved. If the parsing fails, it seems to throw an exception and close the streams. OK, it will fall back on the other threads, but with a dead-stream IOException, and on the user side nothing seems to be reported?
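One way to make a worker failure visible to the caller is to record the first exception and rethrow it after all threads have been joined. The sketch below illustrates that pattern only; it is not how `ConcurrentInputStream` behaves, and `WorkerFailureSketch` and `runAll` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

public class WorkerFailureSketch {
    /** Runs every task on its own thread and rethrows the first failure to the caller. */
    public static void runAll(Runnable... tasks) {
        AtomicReference<Throwable> firstFailure = new AtomicReference<>();
        Thread[] workers = new Thread[tasks.length];
        for (int i = 0; i < tasks.length; i++) {
            Runnable task = tasks[i];
            workers[i] = new Thread(() -> {
                try {
                    task.run();
                } catch (Throwable t) {
                    // Keep only the first failure; later ones are dropped in this sketch.
                    firstFailure.compareAndSet(null, t);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        Throwable failure = firstFailure.get();
        if (failure != null) {
            throw new IllegalStateException("parsing worker failed", failure);
        }
    }

    public static void main(String[] args) {
        try {
            runAll(() -> {}, () -> { throw new IllegalArgumentException("bad line"); });
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage() + ": " + e.getCause().getMessage());
        }
    }
}
```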

ate47 · 2025-05-06T13:22:18Z

qendpoint-core/src/test/java/com/the_qa_company/qendpoint/core/util/UnicodeEscapeTest.java

@@ -26,20 +27,20 @@ public void encodeTest() throws ParserException {
 		RDFParserCallback factory2 = RDFParserFactory.getParserCallback(RDFNotation.NTRIPLES,
 				HDTOptions.of(Map.of(HDTOptionsKeys.NT_SIMPLE_PARSER_KEY, "false")));

-		Set<TripleString> ts1 = new TreeSet<>(Comparator.comparing(t -> {
+		Set<TripleString> ts1 = Collections.synchronizedSet(new TreeSet<>(Comparator.comparing(t -> {


Given that parallel parsing is only used for streamed files, is this useful?
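For context, `TreeSet` is not thread-safe, so the wrapper only matters if the callback can be invoked from several threads at once. A minimal sketch of what `Collections.synchronizedSet` provides in that case follows; `SynchronizedTreeSetSketch` and `fillConcurrently` are hypothetical names used for illustration.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

public class SynchronizedTreeSetSketch {
    /** Adds distinct integers from several threads and returns the final set size. */
    public static int fillConcurrently(int threads, int perThread) {
        Set<Integer> set = Collections.synchronizedSet(new TreeSet<>());
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int offset = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    set.add(offset + i); // each add is serialized by the wrapper
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return set.size();
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(4, 1000));
    }
}
```

If the parser only ever calls back from one thread, the plain `TreeSet` is sufficient and the wrapper adds nothing but locking overhead.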

ate47 · 2025-05-06T13:24:47Z

...nt-core/src/main/java/com/the_qa_company/qendpoint/core/hdt/impl/TempHDTImporterOnePass.java

@@ -53,7 +53,7 @@ public TripleAppender(TempDictionary dict, TempTriples triples, ProgressListener
 		}

 		@Override
-		public void processTriple(TripleString triple, long pos) {
+		synchronized public void processTriple(TripleString triple, long pos) {


I think it would be better to offer a synchronized and an unsynchronized version (or a wrapper) to avoid synchronization during single-threaded use.
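A wrapper along these lines could look like the sketch below. It is an illustration only: `TripleCallback` is a hypothetical simplified interface (the real `processTriple` takes a `TripleString`), and `synchronizedCallback` and `CountingCallback` are assumed names. Single-threaded callers would use the plain callback and pay no locking cost.

```java
public class SynchronizedCallbackSketch {
    /** Hypothetical callback, loosely modelled on processTriple(TripleString, long). */
    public interface TripleCallback {
        void processTriple(String triple, long pos);
    }

    /** Wraps a callback so that concurrent calls are serialized on a private lock. */
    public static TripleCallback synchronizedCallback(TripleCallback delegate) {
        Object lock = new Object();
        return (triple, pos) -> {
            synchronized (lock) {
                delegate.processTriple(triple, pos);
            }
        };
    }

    /** Deliberately not thread-safe on its own. */
    public static class CountingCallback implements TripleCallback {
        long count;

        @Override
        public void processTriple(String triple, long pos) {
            count++;
        }
    }

    /** Drives the wrapped callback from several threads and returns the final count. */
    public static long countWithThreads(int threads, int callsPerThread) {
        CountingCallback counter = new CountingCallback();
        TripleCallback safe = synchronizedCallback(counter);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    safe.processTriple("<s> <p> <o> .", i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counter.count;
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 10000));
    }
}
```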

ate47 · 2025-05-15T12:03:03Z

I think you can also take a look at the ExceptionThread class. I made it to bind threads together while keeping track of their exceptions.

GH-600 implement a splitting mechanism and multithreade…

1c92fd9

…d approach to parsing NQUADS and N-Triples files. Also implement more concurrent intermediary structures instead of refactoring the code to support multiple ElemStringBuffer buffers when parsing.

hmottestad added 4 commits March 21, 2025 12:20

add some synchronization

3649b4e

fix test

7c38f0e

fix test

a15e5ce

wip

7c63bfe

ate47 self-requested a review May 6, 2025 12:49

ate47 requested changes May 6, 2025

add some tests

62c7416
