Description
It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ASCII characters. Reading through the source, the following does not look correct:
rust-protobuf/protobuf-support/src/lexer/lexer_impl.rs
Lines 443 to 479 in 16c9dc5
```rust
pub fn next_byte_value(&mut self) -> LexerResult<u8> {
    match self.next_char()? {
        '\\' => {
            match self.next_char()? {
                '\'' => Ok(b'\''),
                '"' => Ok(b'"'),
                '\\' => Ok(b'\\'),
                'a' => Ok(b'\x07'),
                'b' => Ok(b'\x08'),
                'f' => Ok(b'\x0c'),
                'n' => Ok(b'\n'),
                'r' => Ok(b'\r'),
                't' => Ok(b'\t'),
                'v' => Ok(b'\x0b'),
                'x' => {
                    let d1 = self.next_hex_digit()? as u8;
                    let d2 = self.next_hex_digit()? as u8;
                    Ok(((d1 << 4) | d2) as u8)
                }
                d if d >= '0' && d <= '7' => {
                    let mut r = d as u8 - b'0';
                    for _ in 0..2 {
                        match self.next_octal_digit() {
                            Err(_) => break,
                            Ok(d) => r = (r << 3) + d as u8,
                        }
                    }
                    Ok(r)
                }
                // https://github.com/google/protobuf/issues/4562
                // TODO: overflow
                c => Ok(c as u8),
            }
        }
        '\n' | '\0' => Err(LexerError::IncorrectInput),
        // TODO: check overflow
        c => Ok(c as u8),
    }
}
```
Basically, this consumes chars but only ever returns bytes, and it converts a char (which represents a Unicode scalar value) directly into a u8 that will later be interpreted as UTF-8. But outside of ASCII, the integer value of a char does not correspond to its UTF-8 encoding: for instance, the char À has a Unicode scalar value of 192 (0xC0), but is encoded as the two bytes 0xC3, 0x80 in UTF-8. So the `c as u8` fallthrough arms silently truncate any non-ASCII char to a single wrong byte.
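A standalone snippet (independent of the lexer code above) demonstrating the mismatch between `char as u8` and the char's actual UTF-8 encoding:

```rust
fn main() {
    let c = 'À';
    // The Unicode scalar value of 'À' is U+00C0, i.e. 192.
    assert_eq!(c as u32, 192);
    // `as u8` truncates to the low byte of the scalar value: 0xC0.
    // This is what the lexer's `c => Ok(c as u8)` arms produce.
    assert_eq!(c as u8, 0xC0);
    // But the correct UTF-8 encoding is two bytes: 0xC3 0x80.
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf);
    assert_eq!(encoded.as_bytes(), &[0xC3, 0x80]);
    println!("scalar = {}, utf8 = {:x?}", c as u32, encoded.as_bytes());
}
```

The lone 0xC0 byte is not even valid UTF-8 on its own, so any later re-interpretation of the output as a string will fail or mangle the text.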
I don't think this is hard to fix: just stay in chars the whole time and avoid converting to bytes. Given that the text-format input is always valid UTF-8 (it is parsed from a &str, which is guaranteed to be valid UTF-8), it should not be possible for a string literal to ever not be valid UTF-8.
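A minimal sketch of that idea (not the actual rust-protobuf patch; `decode_text_string` is a hypothetical helper, and a real fix must still handle `\x` and octal escapes, which denote raw bytes): decode while staying in `char` space and push into a `String`, letting the standard library do the UTF-8 encoding.

```rust
// Hypothetical helper: decode a few escape sequences, staying in chars.
fn decode_text_string(input: &str) -> String {
    let mut out = String::new();
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('t') => out.push('\t'),
                Some('\\') => out.push('\\'),
                // Unrecognized escape: keep the char as-is (simplification).
                Some(other) => out.push(other),
                None => break,
            }
        } else {
            // Non-escaped chars (including non-ASCII like 'À') are pushed
            // directly; `String::push` performs the UTF-8 encoding.
            out.push(c);
        }
    }
    out
}

fn main() {
    let s = decode_text_string("À\\n");
    // 'À' round-trips as the correct two-byte sequence, not a lone 0xC0.
    assert_eq!(s.as_bytes(), &[0xC3, 0x80, b'\n']);
    println!("{:x?}", s.as_bytes());
}
```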
I'm going to go ahead and write a patch for this and open a PR preemptively, since I think it should be relatively trivial; I'll figure out a test case as well.