Clarify need to define baseIRI #46

Open
pmcb55 opened this issue Feb 6, 2021 · 5 comments

@pmcb55

pmcb55 commented Feb 6, 2021

Hi,

This error might be related to issue #20, but it's super-easy to reproduce when attempting to parse the W3C Bookmark vocabulary (http://www.w3.org/2002/01/bookmark#).

To reproduce the error, simply save the trivial JavaScript below as index.js and then execute:

npm init
npm i rdfxml-streaming-parser
curl http://www.w3.org/2002/01/bookmark# -o bookmark.rdf
node index

index.js:

const RdfXmlParser = require("rdfxml-streaming-parser").RdfXmlParser;
const fs = require('fs');

const myParser = new RdfXmlParser();

fs.createReadStream('./bookmark.rdf')
  .pipe(myParser)
  .on('data', console.log)
  .on('error', console.error)
  .on('end', () => console.log('All triples were parsed!'));

I see the following error:

[rdfxml]$ node index
Error: Found invalid baseIRI '' for value ''
    at Object.resolve (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/relative-to-absolute-iri/lib/Resolve.js:22:19)
    at RdfXmlParser.valueToUri (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/rdfxml-streaming-parser/lib/RdfXmlParser.js:153:63)
    at RdfXmlParser.onTagResource (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/rdfxml-streaming-parser/lib/RdfXmlParser.js:335:73)
    at RdfXmlParser.onTag (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/rdfxml-streaming-parser/lib/RdfXmlParser.js:231:18)
    at SAXStream.emit (events.js:314:20)
    at SAXParser.me._parser.<computed> [as onopentag] (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/sax/lib/sax.js:258:17)
    at emit (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/sax/lib/sax.js:624:35)
    at emitNode (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/sax/lib/sax.js:629:5)
    at openTag (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/sax/lib/sax.js:825:5)
    at SAXParser.write (/home/pmcb55/Work/Projects/tmp/rdfxml/node_modules/sax/lib/sax.js:1391:13)
[rdfxml]$ 

I'm using RdfXmlParser in a generic vocab processing library, which parses a user-configurable list of vocabularies, so configuring RdfXmlParser just to specifically handle this particular vocabulary wouldn't be pretty (i.e., the out-of-box RdfXmlParser has worked fine for any RDF/XML vocabs I've encountered so far, so getting it to work with http://www.w3.org/2002/01/bookmark# without changing anything would be really good!).

BTW, I assume http://www.w3.org/2002/01/bookmark# is a valid RDF/XML vocab (I refuse to even try to grok RDF/XML!), as EasyRDF converts that URL to Turtle without any complaint. (So my workaround right now is to copy that converted Turtle to a local file, and configure my tool to parse that local Turtle instead of the official RDF/XML served up from http://www.w3.org/2002/01/bookmark# :( !)

@rubensworks
Member

This document requires a baseIRI to be set via the parser's constructor.

I'll leave this issue open as a note to myself to document this better, as this is not clear enough.

@rubensworks changed the title from "Error parsing bookmark.rdf" to "Clarify need to define baseIRI" on Feb 7, 2021

@pmcb55
Author

pmcb55 commented Feb 8, 2021

Hi Ruben - yeah, that was the problem alright. Setting the baseIRI manually in the constructor resolves this.

Unfortunately, now the issue I have is that I was instantiating my RDF/XML parser just once and registering it in an RDF/JS SinkMap of parsers for processing multiple resources. I can't do that anymore, because I now need to instantiate just the RDF/XML parser on every resource I parse, just in case that resource is RDF/XML, and might need a baseIRI set.

My workaround code below works now, but just the RDF/XML parser is making it kinda ugly :( ! Would it be valid to have the parser default the baseIRI value to be the parsed resource IRI if the resource explicitly sets its baseIRI to be the empty string (as the bookmark RDF does)...? Or would it be possible to expose a setBaseIRI(iri) method on the parser (so that at least I'd just be calling a setter instead of instantiating a whole new parser instance for each resource I wish to parse)...?

const rdf = require("rdf-ext");
const rdfFetch = require("@rdfjs/fetch-lite");
const rdfFormats = require("@rdfjs/formats-common");

const ParserN3 = require("@rdfjs/parser-n3");
const ParserJsonld = require("@rdfjs/parser-jsonld");
const ParserRdfXml = require("rdfxml-streaming-parser").RdfXmlParser;
const SinkMap = require("@rdfjs/sink-map");

const formats = {
  parsers: new SinkMap([
    ["text/turtle", new ParserN3()],
    ["application/ld+json", new ParserJsonld()],

    // NO POINT DOING THIS ANYMORE - SINCE I'LL HAVE TO CONSTRUCT
    // AN INSTANCE PER RESOURCE TO ALLOW ME SET THE `baseIRI`!
    // ["application/rdf+xml", new ParserRdfXml()]
  ]),
};

function parseResource(resource) {
  // I NEED TO CONSTRUCT JUST THE RDF/XML PARSER PER RESOURCE,
  // JUST TO SET THE `baseIRI`, EVEN THOUGH THE RESOURCE COULD
  // BE Turtle OR JSON-LD - D'OH!
  formats.parsers.set(
    "application/rdf+xml",
    new ParserRdfXml({ baseIRI: resource })
  );

  rdfFetch(resource, { factory: rdf, formats })
    .then((resource) => {
      return resource.dataset();
    })
    .then((dataset) => {
      console.log(`Parsed [${dataset.size}] triples from resource [${resource}]...`);
    });
}

const resources = [ "http://www.w3.org/2002/01/bookmark#", "http://www.w3.org/1999/02/22-rdf-syntax-ns#" ];
// Use forEach rather than map, since the calls are made only for their side effects.
resources.forEach((resource) => parseResource(resource));

@rubensworks
Member

@pmcb55 I'm not very familiar with the SinkMap API, so not really sure how to solve that.

I typically re-create parsers upon every request (they are very lightweight anyway). This is abstracted in rdf-parse, which ships with parsers for the most commonly used RDF formats.

rdf-parse doesn't allow you to just use a select set of parsers though, so if that's a requirement for you, I would suggest investigating SinkMap further by posting a question on their issue tracker.
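
The "re-create parsers upon every request" approach could be sketched as a small factory that builds a fresh set of parsers for each resource, so that no parser instance (or its baseIRI) is shared between resources. This is only a sketch under assumptions: `createParsersFor` is a hypothetical helper, it returns a plain `Map` rather than a SinkMap, and it assumes every parser class accepts a `{ baseIRI }` options object in the way `RdfXmlParser` does earlier in this thread. `StubParser` stands in for the real parser classes so that the snippet runs on its own.

```javascript
// Hypothetical helper: build a fresh set of parsers for a single resource,
// passing the resource IRI to each parser as its baseIRI.
function createParsersFor(resourceIRI, parserClasses) {
  const parsers = new Map();
  for (const [mediaType, ParserClass] of Object.entries(parserClasses)) {
    parsers.set(mediaType, new ParserClass({ baseIRI: resourceIRI }));
  }
  return parsers;
}

// Stand-in for a real parser class (e.g. RdfXmlParser); it only records its options.
class StubParser {
  constructor(options = {}) {
    this.baseIRI = options.baseIRI;
  }
}

// Assumption: a map from media type to parser class, chosen by the caller.
const parserClasses = { 'application/rdf+xml': StubParser };

const first = createParsersFor('http://www.w3.org/2002/01/bookmark#', parserClasses);
const second = createParsersFor('http://www.w3.org/1999/02/22-rdf-syntax-ns#', parserClasses);

// Each resource gets its own parser instance with its own baseIRI.
console.log(first.get('application/rdf+xml').baseIRI);
console.log(second.get('application/rdf+xml').baseIRI);
```

If SinkMap is still wanted, the entries of the returned map could presumably be passed to a new SinkMap per call, in the same array-of-pairs form used in the workaround above, rather than mutating one shared instance while several requests are in flight.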

@pmcb55
Author

pmcb55 commented Feb 12, 2021

Well, I don't think you (or I) need to know how SinkMap works at all - my suggestion above is simply that your parser defaults to setting its baseIRI value to be the "parsed resource URL" if the RDF contents of the resource itself explicitly set the baseIRI to the empty string (as the Bookmark vocab happens to do).

This seems like acceptable generic behaviour for a parser (although I would say that any vocabulary setting the 'baseIRI' to 'nothing' (i.e. the empty string) is kinda weird, but that's the weird and wonderful world of the InterWeb I suppose).
Does that make sense (i.e., that this is nothing to do with SinkMap at all, but just default behaviour for your RDF/XML parser)...?

This difference in parser behaviour is why (I assume) EasyRDF has no problem parsing this Bookmark vocab, whereas your parser blows up on it (see screenshot).

[Screenshot: EasyRDf-DefaultsBaseIRI]

@rubensworks
Member

rubensworks commented Feb 13, 2021

parser defaults to setting it's baseIRI value to be the "parsed resource URL"

You're right, but this is what the parser is already doing (as specified by the RDF/XML spec, which this parser complies with).
However, in order to be able to do this, one must set the resource's IRI as baseIRI via the constructor,
otherwise the parser is not able to know the resource IRI.
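
To illustrate why a base is needed, here is a minimal sketch that uses Node's built-in WHATWG `URL` class as a stand-in for IRI resolution. This is an assumption for illustration only: the parser uses its own resolver (`relative-to-absolute-iri`, per the stack trace in the original report), `resolveReference` is a hypothetical helper name, and `#Bookmark` is just an example fragment. It shows that an empty reference (the value `''` in the reported error) resolves to the document's IRI only when a base is supplied.

```javascript
// Hypothetical helper: resolve a (possibly relative) reference against a base IRI.
// Uses Node's WHATWG URL class as a stand-in; this is not the parser's actual resolver.
function resolveReference(reference, baseIRI) {
  // Without a usable base, an empty or relative reference cannot be resolved.
  if (!baseIRI) {
    throw new Error(`Cannot resolve '${reference}' without a base IRI`);
  }
  return new URL(reference, baseIRI).href;
}

const documentIRI = 'http://www.w3.org/2002/01/bookmark';

// An empty reference resolves to the document itself.
console.log(resolveReference('', documentIRI));
// A fragment-only reference resolves relative to the document.
console.log(resolveReference('#Bookmark', documentIRI));

// With no base, resolution fails, much like the error reported in this issue.
try {
  resolveReference('', '');
} catch (error) {
  console.error(error.message);
}
```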

This is because the parser only receives a string stream as input, while libraries such as sink-map, rdf-dereference and rdf-parse are responsible for fetching HTTP(S) documents from URLs, and forwarding the response stream to the parser.
It is therefore the responsibility of sink-map and rdf-parse to make sure that their parser's baseIRI is properly defined.

For example, the following playground uses rdf-dereference and this parser in the background with proper delegation of the baseIRI, and is able to parse your document correctly: https://rdf-play.rubensworks.net/#url=http%3A%2F%2Fwww.w3.org%2F2002%2F01%2Fbookmark&proxy=https%3A%2F%2Fproxy.linkeddatafragments.org%2F
