hdt::QueryProcessor.searchJoin() gives incorrect results #265

donpellegrino · 2022-08-29T18:18:15Z

Filing this issue to track the hdt-cpp work on RDFLib/rdflib-hdt#14. See that issue for the test case and additional details.

mielvds · 2022-08-30T07:35:57Z

Strange indeed. Are you following up on this?

donpellegrino · 2022-08-30T13:21:18Z

I have read through some of the code, but I am still investigating the cause. My next step is to trace through the execution of the test case and see if I can find where the logic breaks. I have a sandbox set up with the Python and C++ repositories working together. So far, I have just run clang-format on the relevant C++ classes and moved BasicVarBindingString to its own .hpp/.cpp files. Based on the comments in the code, it looks like hdt::QueryProcessor.searchJoin() was a work in progress and was never fully implemented.

@mielvds - if you or anyone else knows the history of QueryProcessor.searchJoin(), please let me know.

mielvds · 2022-08-30T13:27:06Z

Only @MarioAriasGa and, if you're lucky, @LaurensRietveld might know more.

donpellegrino · 2022-08-30T15:53:55Z

Around https://github.com/rdfhdt/hdt-cpp/blob/develop/libhdt/src/sparql/QueryProcessor.cpp#L90, I suspect the triplePatID variable is assigned "0" in cases that should be distinct. A "0" is used when a subject, predicate, or object is a variable and will therefore match anything. However, a "0" is also used when the string does not match anything in the dictionary. Thus, strings that match nothing are effectively treated as variables that match anything.
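
To make the suspected ambiguity concrete, here is a minimal sketch. It is not the hdt-cpp code: ToyDictionary, encodeAmbiguous(), encodeChecked(), and the Kind enum are hypothetical names invented for illustration, and the only assumptions taken from the comment above are that 0 encodes a variable and that 0 is also what a failed dictionary lookup yields.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for an HDT dictionary (not the hdt-cpp API).
// Known terms get IDs starting at 1; 0 is reserved.
class ToyDictionary {
public:
    void add(const std::string& term) {
        assert(!term.empty());
        // emplace is a no-op if the term is already present.
        ids_.emplace(term, ids_.size() + 1);
    }
    // Assumed convention from the discussion: 0 means "not found".
    std::size_t stringToId(const std::string& term) const {
        std::map<std::string, std::size_t>::const_iterator it = ids_.find(term);
        return it == ids_.end() ? 0 : it->second;
    }
private:
    std::map<std::string, std::size_t> ids_;
};

inline bool isVariable(const std::string& term) {
    return !term.empty() && term[0] == '?';
}

// Ambiguous encoding: a variable and an unknown term both become 0,
// so an unknown term later behaves like a wildcard.
inline std::size_t encodeAmbiguous(const ToyDictionary& dict,
                                   const std::string& term) {
    if (isVariable(term)) return 0;
    return dict.stringToId(term);
}

// One possible way to keep the cases apart (an assumption, not a patch):
// carry an explicit kind next to the ID.
enum class Kind { Variable, Bound, Unknown };

struct EncodedTerm {
    Kind kind;
    std::size_t id;  // meaningful only when kind == Kind::Bound
};

inline EncodedTerm encodeChecked(const ToyDictionary& dict,
                                 const std::string& term) {
    if (isVariable(term)) return EncodedTerm{Kind::Variable, 0};
    const std::size_t id = dict.stringToId(term);
    if (id == 0) return EncodedTerm{Kind::Unknown, 0};
    return EncodedTerm{Kind::Bound, id};
}
```

With an encoding like encodeChecked(), a caller could return an empty result as soon as any term in a pattern is Kind::Unknown, instead of letting it match everything. Whether that is the right fix for QueryProcessor.cpp is still an open question here.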

mielvds · 2022-08-31T07:32:37Z

TBH, I didn't even know that there was a (partial) SPARQL implementation in HDT-CPP. My guess is that it is used nowhere and was probably never finished. In the Java version, the query processing is offloaded to Jena; maybe something similar is possible with oxigraph or even rdflib.

donpellegrino · 2022-08-31T13:15:20Z

It looks like hdt::QueryProcessor is limited to processing Basic Graph Patterns (BGP), so it is still short of a SPARQL implementation. I understand that one branch of code queries triples via a single BGP at a time. The RDFLib/rdflib-hdt library uses that approach by default. The hdt::QueryProcessor appears to extend that capability to add efficiencies for the case of multiple BGPs at once. This is critical for performance and for leveraging the Dictionary (index). The rdflib-hdt library has an optimize_sparql() function that causes it to use the QueryProcessor for multiple BGPs instead of querying one BGP at a time and then aggregating the results in the rdflib SPARQL engine.

I suspect that any pluggable SPARQL engine sitting on top of HDT (e.g., Apache Jena ARQ, Python rdflib, etc.) can interface with the HDT function for querying a single BGP at a time. But whenever that approach is taken, performance will be left on the table, as the Dictionary may remain underutilized for specific optimizations. An HDT function (hdt::QueryProcessor.searchJoin) that can accept a set of BGPs and give an efficient response does seem like it would be an essential interface underneath any SPARQL engine.
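
As a rough illustration of what joining inside the store could look like, the sketch below evaluates two patterns that share a subject variable entirely on integer IDs, so strings would only need to be looked up for the final bindings. Everything in it is hypothetical (TripleID here is a toy struct, and search() and joinOnSubject() are invented names, not the hdt-cpp or rdflib-hdt API); it only assumes the "0 matches anything" convention mentioned earlier in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy ID-level triple; a real HDT stores these in compressed form.
struct TripleID {
    std::size_t s, p, o;
};

// Assumed convention: 0 in a pattern position matches any value.
inline bool matches(std::size_t pattern, std::size_t value) {
    return pattern == 0 || pattern == value;
}

// Single-pattern lookup by linear scan (for illustration only).
std::vector<TripleID> search(const std::vector<TripleID>& store,
                             const TripleID& pat) {
    std::vector<TripleID> out;
    for (const TripleID& t : store) {
        if (matches(pat.s, t.s) && matches(pat.p, t.p) && matches(pat.o, t.o)) {
            out.push_back(t);
        }
    }
    return out;
}

// Evaluates { ?x p1 o1 . ?x p2 ?y } and returns (x, y) pairs as IDs.
// Each binding of ?x from the first pattern is substituted into the
// second pattern, so the join never leaves ID space.
std::vector<std::pair<std::size_t, std::size_t> > joinOnSubject(
        const std::vector<TripleID>& store,
        std::size_t p1, std::size_t o1, std::size_t p2) {
    assert(p1 != 0 && o1 != 0 && p2 != 0);  // constants must be real IDs
    std::vector<std::pair<std::size_t, std::size_t> > result;
    for (const TripleID& first : search(store, TripleID{0, p1, o1})) {
        for (const TripleID& second : search(store, TripleID{first.s, p2, 0})) {
            result.emplace_back(first.s, second.o);
        }
    }
    return result;
}
```

An engine that only has the single-pattern call would have to fetch both result sets, convert them to strings, and join them itself, which is the cost described above.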

It would be interesting to compare how the HDT Java version handles this. If anyone familiar with that codebase could confirm my assumptions of how things work and point me to the relevant Java code for comparison, that would be very helpful.

donpellegrino added a commit to DeciSym/hdt-cpp that referenced this issue Aug 29, 2022

WIP rdfhdt#265 Applied clang-format to QueryProcessor.cpp/.hpp. Moved BasicVarBindingString class to its own pair of implementation files to improve readability.

7178df5
