Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when parsing a valid XML file #44

Open
whiver opened this issue Dec 11, 2017 · 9 comments
Open

Error when parsing a valid XML file #44

whiver opened this issue Dec 11, 2017 · 9 comments

Comments

@whiver
Copy link

whiver commented Dec 11, 2017

Hi,
I am trying to parse a sample document into a Protobuf Message, using the AddressBook schema from Google examples:

Here is the document:

<AddressBook>
    <people>
        <name>John Doe</name>
        <id>42</id>
        <email>[email protected]</email>
    </people>
    <people>
        <name>Jane Doe</name>
        <id>41</id>
    </people>
</AddressBook>

Here is the code:

// All this initialization stuff is tested
InputStream inputData = XMLMapperTest.class.getResourceAsStream("/data/AddressBook_several.xml");
DynamicSchema schema = SchemaParser.parseSchema(XMLMapperTest.class.getResource("/schemas/AddressBook.desc").getPath(), false);
Descriptors.Descriptor descriptor = schema.getMessageDescriptor("AddressBook");

DynamicMessage.Builder builder = DynamicMessage.newBuilder(descriptor);

XmlFormat xmlFormat = new XmlFormat();
// Here is the instruction that raises the exception
xmlFormat.merge(inputData, StandardCharsets.UTF_8, builder);

Though, I get the following error:

com.googlecode.protobuf.format.ProtobufFormatter$ParseException: 5:21: Expected ">".

	at com.googlecode.protobuf.format.XmlFormat$Tokenizer.parseException(XmlFormat.java:619)
	at com.googlecode.protobuf.format.XmlFormat$Tokenizer.consume(XmlFormat.java:418)
	at com.googlecode.protobuf.format.XmlFormat.consumeClosingElement(XmlFormat.java:680)
	at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:764)
	at com.googlecode.protobuf.format.XmlFormat.handleObject(XmlFormat.java:882)
	at com.googlecode.protobuf.format.XmlFormat.handleValue(XmlFormat.java:775)
	at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:755)
	at com.googlecode.protobuf.format.XmlFormat.merge(XmlFormat.java:663)
	at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:75)
	at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:53)
	at com.googlecode.protobuf.format.ProtobufFormatter.merge(ProtobufFormatter.java:141)
[...]

I tried with UTF-8 and ISO-8859-1 encoding but I still get the error. Then I tried to remove the dots in the email address in my XML doc and I now parse successfully.

This is the working XML:

<AddressBook>
    <people>
        <name>John Doe</name>
        <id>42</id>
        <email>johndoe@examplecom</email>
    </people>
    <people>
        <name>Jane Doe</name>
        <id>41</id>
    </people>
</AddressBook>

If you want, I can also join the Protobuf schema if you want to try by yourself.

whiver added a commit to whiver/nifi-protobuf-processor that referenced this issue Dec 14, 2017
It seems that the XML parser cannot parse dots in XML files. Won't merge
into develop while a fix has not been found.
@whiver
Copy link
Author

whiver commented Dec 14, 2017

In fact it seems that even that header tag : <?xml version="1.0" encoding="UTF-8"?> causes a parsing error. The same error occurs with UTF-8 characters, such as é, è or à.
I am trying to figure it out but it's quite a pain to read the parser code :p

I also tried to use a String instead of an InputStream (I cannot find a signature corresponding to the example given in the Readme!):

String xml = IOUtils.toString(inputData, StandardCharsets.UTF_8);
xmlFormat.merge(xml, ExtensionRegistry.newInstance(), builder);

But I still get the same error. I printed my xml string and I shows the original one, so the problem comes right from the parsing process.

@whiver
Copy link
Author

whiver commented Dec 14, 2017

I identified the error: in fact the nextToken() method from Tokenizer, and more precisely the TOKEN constant which matches the next token is far too restrictive as it allows only a very few values.
For example, dots are not allowed, neither any non-ascii character.

The current Regex is:

extension|[a-zA-Z_\s;@][0-9a-zA-Z_\s;@+-]*+|[.]?[0-9+-][0-9a-zA-Z_.+-]*+|<\/|[\\0-9]++|"([^"
\\]|\\.)*+("|\\?$)|'([^'
\\]|\\.)*+('|\\?$)

@scr
Copy link
Collaborator

scr commented Dec 14, 2017

Would you mind putting up a fix with a test & update RELEASE-NOTES.md?

@whiver
Copy link
Author

whiver commented Dec 14, 2017

Yep I'll try, but I need to find the right regex first, it might not be simple. When I find a solution for sure I'll create a pull request.

whiver added a commit to whiver/protobuf-java-format that referenced this issue Dec 19, 2017
whiver added a commit to whiver/nifi-protobuf-processor that referenced this issue Dec 19, 2017
whiver added a commit to whiver/nifi-protobuf-processor that referenced this issue Dec 19, 2017
@bouviervj
Copy link

Is this change integrated in the master branch ? - I had the same issue and I wonder if the modifications are working - the code simplifies a lot the tokenization regexps.

@whiver
Copy link
Author

whiver commented Jan 12, 2018

No it's not merged yet, since simplifying the regex has side effects. In fact I think that the whole parser should be refactored, so I left it as is for the moment, if you find a way to make it work, feel free to fork my repo :)

@bouviervj
Copy link

One thing I don't understand is why they had to code their own XML parser instead of using standard ones ?

@whiver
Copy link
Author

whiver commented Jan 13, 2018

I don't know, that's why I gave up trying to debug it, since the whole parser is quite restrictive and does not handle every cases. I think the only clean solution would be to reimplement everything using an existing parser but I did not have time to do it yet.

@oboleka
Copy link

oboleka commented May 14, 2020

Hi , I have the same problem. Is this issue still open?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants