Description
It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ASCII characters. Reading through the source, the following does not look correct:
rust-protobuf/protobuf-support/src/lexer/lexer_impl.rs
Lines 443 to 479 in 16c9dc5
```rust
pub fn next_byte_value(&mut self) -> LexerResult<u8> {
    match self.next_char()? {
        '\\' => {
            match self.next_char()? {
                '\'' => Ok(b'\''),
                '"' => Ok(b'"'),
                '\\' => Ok(b'\\'),
                'a' => Ok(b'\x07'),
                'b' => Ok(b'\x08'),
                'f' => Ok(b'\x0c'),
                'n' => Ok(b'\n'),
                'r' => Ok(b'\r'),
                't' => Ok(b'\t'),
                'v' => Ok(b'\x0b'),
                'x' => {
                    let d1 = self.next_hex_digit()? as u8;
                    let d2 = self.next_hex_digit()? as u8;
                    Ok(((d1 << 4) | d2) as u8)
                }
                d if d >= '0' && d <= '7' => {
                    let mut r = d as u8 - b'0';
                    for _ in 0..2 {
                        match self.next_octal_digit() {
                            Err(_) => break,
                            Ok(d) => r = (r << 3) + d as u8,
                        }
                    }
                    Ok(r)
                }
                // https://github.com/google/protobuf/issues/4562
                // TODO: overflow
                c => Ok(c as u8),
            }
        }
        '\n' | '\0' => Err(LexerError::IncorrectInput),
        // TODO: check overflow
        c => Ok(c as u8),
    }
}
```
Basically, this consumes chars but only ever returns bytes, and it converts a char (which represents a Unicode scalar value) directly into a u8 that will later be interpreted as UTF-8. But outside of ASCII, the integer value of a char does not correspond to its UTF-8 encoding: for instance, the char À has a Unicode scalar value of 192 (0xC0), but is encoded as the two bytes 0xC3, 0x80 in UTF-8. So the `c as u8` fallthrough arms silently truncate any non-ASCII char to a single wrong byte.
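A standalone snippet (independent of the lexer code above) demonstrating the mismatch between `char as u8` and the char's actual UTF-8 encoding:

```rust
fn main() {
    let c = 'À';
    // The Unicode scalar value of 'À' is U+00C0, i.e. 192.
    assert_eq!(c as u32, 192);
    // `as u8` truncates to the low byte of the scalar value: 0xC0.
    // This is what the lexer's `c => Ok(c as u8)` arms produce.
    assert_eq!(c as u8, 0xC0);
    // But the correct UTF-8 encoding is two bytes: 0xC3 0x80.
    let mut buf = [0u8; 4];
    let encoded = c.encode_utf8(&mut buf);
    assert_eq!(encoded.as_bytes(), &[0xC3, 0x80]);
    println!("scalar = {}, utf8 = {:x?}", c as u32, encoded.as_bytes());
}
```

The lone 0xC0 byte is not even valid UTF-8 on its own, so any later re-interpretation of the output as a string will fail or mangle the text.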
I don't think this is hard to fix: just stay in chars the whole time and avoid converting to bytes. Given that the text-format input is always valid UTF-8 (it is parsed from a &str, which is guaranteed to be valid UTF-8), it should not be possible for a string literal to ever not be valid UTF-8.
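A minimal sketch of that idea (not the actual rust-protobuf patch; `decode_text_string` is a hypothetical helper, and a real fix must still handle `\x` and octal escapes, which denote raw bytes): decode while staying in `char` space and push into a `String`, letting the standard library do the UTF-8 encoding.

```rust
// Hypothetical helper: decode a few escape sequences, staying in chars.
fn decode_text_string(input: &str) -> String {
    let mut out = String::new();
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('t') => out.push('\t'),
                Some('\\') => out.push('\\'),
                // Unrecognized escape: keep the char as-is (simplification).
                Some(other) => out.push(other),
                None => break,
            }
        } else {
            // Non-escaped chars (including non-ASCII like 'À') are pushed
            // directly; `String::push` performs the UTF-8 encoding.
            out.push(c);
        }
    }
    out
}

fn main() {
    let s = decode_text_string("À\\n");
    // 'À' round-trips as the correct two-byte sequence, not a lone 0xC0.
    assert_eq!(s.as_bytes(), &[0xC3, 0x80, b'\n']);
    println!("{:x?}", s.as_bytes());
}
```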
I'm going to go ahead and write a patch for this and open a PR preemptively, since I think it should be relatively trivial; I'll figure out a test case as well.