Error when parsing a valid XML file #44

whiver · 2017-12-11T14:13:08Z

Hi,
I am trying to parse a sample XML document into a Protobuf message, using the AddressBook schema from the Google examples.

Here is the document:

<AddressBook>
    <people>
        <name>John Doe</name>
        <id>42</id>
        <email>johndoe@example.com</email>
    </people>
    <people>
        <name>Jane Doe</name>
        <id>41</id>
    </people>
</AddressBook>

Here is the code:

// All this initialization stuff is tested
InputStream inputData = XMLMapperTest.class.getResourceAsStream("/data/AddressBook_several.xml");
DynamicSchema schema = SchemaParser.parseSchema(XMLMapperTest.class.getResource("/schemas/AddressBook.desc").getPath(), false);
Descriptors.Descriptor descriptor = schema.getMessageDescriptor("AddressBook");

DynamicMessage.Builder builder = DynamicMessage.newBuilder(descriptor);

XmlFormat xmlFormat = new XmlFormat();
// Here is the instruction that raises the exception
xmlFormat.merge(inputData, StandardCharsets.UTF_8, builder);

However, I get the following error:

com.googlecode.protobuf.format.ProtobufFormatter$ParseException: 5:21: Expected ">".

	at com.googlecode.protobuf.format.XmlFormat$Tokenizer.parseException(XmlFormat.java:619)
	at com.googlecode.protobuf.format.XmlFormat$Tokenizer.consume(XmlFormat.java:418)
	at com.googlecode.protobuf.format.XmlFormat.consumeClosingElement(XmlFormat.java:680)
	at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:764)
	at com.googlecode.protobuf.format.XmlFormat.handleObject(XmlFormat.java:882)
	at com.googlecode.protobuf.format.XmlFormat.handleValue(XmlFormat.java:775)
	at com.googlecode.protobuf.format.XmlFormat.mergeField(XmlFormat.java:755)
	at com.googlecode.protobuf.format.XmlFormat.merge(XmlFormat.java:663)
	at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:75)
	at com.googlecode.protobuf.format.AbstractCharBasedFormatter.merge(AbstractCharBasedFormatter.java:53)
	at com.googlecode.protobuf.format.ProtobufFormatter.merge(ProtobufFormatter.java:141)
[...]

I tried both UTF-8 and ISO-8859-1 encodings, but I still get the error. I then removed the dots from the email address in my XML document, and it now parses successfully.

This is the working XML:

<AddressBook>
    <people>
        <name>John Doe</name>
        <id>42</id>
        <email>johndoe@examplecom</email>
    </people>
    <people>
        <name>Jane Doe</name>
        <id>41</id>
    </people>
</AddressBook>

I can also attach the Protobuf schema if you want to try it yourself.

whiver · 2017-12-14T15:27:54Z

In fact, it seems that even the header tag <?xml version="1.0" encoding="UTF-8"?> causes a parsing error. The same error occurs with non-ASCII characters such as é, è or à.
I am trying to figure it out, but the parser code is quite a pain to read :p

I also tried passing a String instead of an InputStream (I cannot find a signature matching the example given in the README!):

String xml = IOUtils.toString(inputData, StandardCharsets.UTF_8);
xmlFormat.merge(xml, ExtensionRegistry.newInstance(), builder);

But I still get the same error. I printed my XML string and it shows the original document, so the problem comes from the parsing process itself.

whiver · 2017-12-14T19:22:44Z

I identified the error: the TOKEN constant that the nextToken() method of Tokenizer uses to match the next token is far too restrictive, as it accepts only a small set of characters.
For example, dots are not accepted, nor is any non-ASCII character.

The current regex is:

extension|[a-zA-Z_\s;@][0-9a-zA-Z_\s;@+-]*+|[.]?[0-9+-][0-9a-zA-Z_.+-]*+|<\/|[\\0-9]++|"([^"\n\\]|\\.)*+("|\\?$)|'([^'\n\\]|\\.)*+('|\\?$)
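
To make the restriction concrete, here is a minimal standalone sketch that compiles the regex quoted above with java.util.regex and asks which token it matches at the start of a string. It does not use the library itself: the class name TokenRegexDemo and the helper firstToken are hypothetical, and modelling the tokenizer with Matcher.lookingAt() is an assumption about how nextToken() applies the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of protobuf-java-format.
final class TokenRegexDemo {

    // The regex quoted in this thread, written as a Java string literal.
    static final Pattern TOKEN = Pattern.compile(
        "extension|"
        + "[a-zA-Z_\\s;@][0-9a-zA-Z_\\s;@+-]*+|"   // identifier-like text
        + "[.]?[0-9+-][0-9a-zA-Z_.+-]*+|"          // number-like text
        + "<\\/|"                                  // closing-element marker
        + "[\\\\0-9]++|"                           // backslashes and digits
        + "\"([^\"\n\\\\]|\\\\.)*+(\"|\\\\?$)|"    // double-quoted string
        + "'([^'\n\\\\]|\\\\.)*+('|\\\\?$)");      // single-quoted string

    // Returns the token matched at the start of text, or null if nothing matches.
    // Assumption: the real tokenizer anchors the pattern at the current position.
    static String firstToken(String text) {
        Matcher m = TOKEN.matcher(text);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The match stops right before the first dot.
        System.out.println(firstToken("johndoe@example.com"));
        // Nothing in the pattern can start at a dot followed by a letter.
        System.out.println(firstToken(".com"));
        // Nothing in the pattern can start at a non-ASCII letter.
        System.out.println(firstToken("\u00e9"));
    }
}
```

With the dots removed, firstToken("johndoe@examplecom") returns the whole string, which is consistent with the working XML shown earlier in this issue.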

scr · 2017-12-14T19:33:28Z

Would you mind putting up a fix with a test and updating RELEASE-NOTES.md?

whiver · 2017-12-14T20:33:46Z

Yep, I'll try, but I need to find the right regex first, and it might not be simple. Once I have a solution, I'll create a pull request.

bouviervj · 2018-01-12T19:36:58Z

Is this change integrated into the master branch? I had the same issue and I wonder whether the modifications work. The code simplifies the tokenization regexes a lot.

whiver · 2018-01-12T20:58:46Z

No, it's not merged yet, since simplifying the regex has side effects. In fact, I think the whole parser should be refactored, so I left it as is for the moment. If you find a way to make it work, feel free to fork my repo :)

bouviervj · 2018-01-12T23:42:04Z

One thing I don't understand: why did they have to code their own XML parser instead of using a standard one?

whiver · 2018-01-13T14:16:51Z

I don't know. That's why I gave up trying to debug it: the whole parser is quite restrictive and does not handle every case. I think the only clean solution would be to reimplement everything on top of an existing parser, but I have not had time to do it yet.
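
For illustration, here is a minimal sketch of what reading the same kind of document with a standard parser could look like, using only the JDK's DOM API (javax.xml.parsers). It is not the library's code and does not build a Protobuf message: the class StandardParserSketch and the method readPeople are hypothetical, the sample input is modelled on the document from this issue, and each people element is simply collected into a map of field name to text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch; not part of protobuf-java-format.
final class StandardParserSketch {

    // Sample input modelled on this issue, including the XML header,
    // a dotted email address and a non-ASCII character.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<AddressBook>\n"
        + "  <people><name>Jos\u00e9 Doe</name><id>42</id>"
        + "<email>johndoe@example.com</email></people>\n"
        + "  <people><name>Jane Doe</name><id>41</id></people>\n"
        + "</AddressBook>\n";

    // Parses the XML and returns one field map per <people> element.
    static List<Map<String, String>> readPeople(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<Map<String, String>> people = new ArrayList<>();
        NodeList nodes = doc.getDocumentElement().getElementsByTagName("people");
        for (int i = 0; i < nodes.getLength(); i++) {
            Map<String, String> fields = new LinkedHashMap<>();
            NodeList children = nodes.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Skip whitespace text nodes; keep only child elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getNodeName(), child.getTextContent());
                }
            }
            people.add(fields);
        }
        return people;
    }

    public static void main(String[] args) throws Exception {
        for (Map<String, String> person : readPeople(SAMPLE)) {
            System.out.println(person);
        }
    }
}
```

A real replacement would still have to map each element onto the message descriptor (field types, repeated fields, nested messages), which this sketch leaves out.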

oboleka · 2020-05-14T10:51:47Z

Hi, I have the same problem. Is this issue still open?

whiver added a commit to whiver/nifi-protobuf-processor that referenced this issue Dec 14, 2017

Add tests for XML support. Known bug: bivas/protobuf-java-format#44

1bf6b96

It seems that the XML parser cannot parse dots in XML files. Won't merge into develop while a fix has not been found.

whiver mentioned this issue Dec 18, 2017

Add some documentation to contribute to the project #45

Closed

whiver added a commit to whiver/protobuf-java-format that referenced this issue Dec 19, 2017

Simplify the parser tokenizer regex to fix bivas#44

f6f910e

whiver added a commit to whiver/nifi-protobuf-processor that referenced this issue Dec 19, 2017

Remove the XML header as it is not parsed by the library. Should work…

3081438

… as soon as bivas/protobuf-java-format#44 is fixed.

whiver added a commit to whiver/nifi-protobuf-processor that referenced this issue Dec 19, 2017

Remove the XML header as it is not parsed by the library. Should work…

4217f77

… as soon as bivas/protobuf-java-format#44 is fixed.

bouviervj mentioned this issue Jan 12, 2018

Exception on Special Characteres "//" #38

Open
