Problem
ಠ is text that, when rendered, looks like a single character. Rendered characters are called graphemes, so to us it has a width of 1. It is also a single Unicode code point. However, encoded as UTF-8 it takes 3 bytes, and Rust strings are indexed by byte, so from the lexer's perspective it has a width of 3. In Rust, each char is a single code point, not a grapheme. So when we call something like
fn peek(&self) -> Option<char> {
    self.source[self.current as usize..].chars().next()
}
we're actually getting a code point, not a grapheme. This is fine, because inside Lexer::next we advance the current position by the character's UTF-8 byte length:
self.current += c.len_utf8() as u32;
In the case of ಠ, len_utf8 returns 3. If we only did self.current += 1; instead, the next slice would start in the middle of the character, and we'd get a panic when trying to read the next character.
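For reference, the relevant standard-library behaviour can be checked in isolation (this is just a sketch, not enderpy code):

fn main() {
    let s = "ಠ";
    assert_eq!(s.len(), 3);           // UTF-8 byte length
    assert_eq!(s.chars().count(), 1); // number of code points (what Python's len() counts)
    // Slicing at a byte offset that is not a char boundary panics:
    // let _bad = &s[1..]; // panics: byte index 1 is not a char boundary
}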
The problem I'm having is that Python's own lexer tracks offsets in characters (code points), not bytes:
>>> len("ಠ")
1
This means that there's a fundamental difference in offsets between enderpy and the official Python tokenize library. Ruff has the same kind of mismatch, since it also counts positions in the units that are convenient in Rust rather than the ones Python reports (probably simply because that's easier). In fact, there have been issues raised in the Ruff repo because it "miscounts" the number of Japanese/Chinese/Korean characters in a line and warns about a line being too long too early.
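To make the mismatch concrete: the offset Python reports is just the number of chars that precede the byte offset our lexer reports. A minimal sketch of the conversion (byte_to_char_offset is a hypothetical helper, not something enderpy or Ruff provides):

/// Convert a byte offset into the character (code point) offset that
/// Python's tokenize module would report for the same position.
fn byte_to_char_offset(source: &str, byte_offset: usize) -> usize {
    source[..byte_offset].chars().count()
}

fn main() {
    let source = "ಠ = 1";
    // our lexer says the first token ends at byte 3; Python's tokenize says column 1
    assert_eq!(byte_to_char_offset(source, 3), 1);
}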
Solution
The solution is to track a grapheme offset in addition to the byte offset. We only need this for testing, so we'd want to put it behind a feature flag.
parser/Cargo.toml
[dependencies]
unicode-width = "0.1" # new
parser/src/lexer/mod.rs
use unicode_id_start::{is_id_continue, is_id_start};
use unicode_width::UnicodeWidthChar; // new

pub struct Lexer<'a> {
    // ...
    current_grapheme: u32, // new
}

impl<'a> Lexer<'a> {
    // in the constructor, initialize the new field:
    //     current_grapheme: 0, // new

    pub fn next_token(&mut self) -> Token {
        if self.next_token_is_dedent > 0 {
            self.next_token_is_dedent -= 1;
            return Token {
                kind: Kind::Dedent,
                value: TokenValue::None,
                start: self.current,
                end: self.current,
                grapheme_start: self.current_grapheme, // new
                grapheme_end: self.current_grapheme,   // new
            };
        }
        let start = self.current;
        let grapheme_start = self.current_grapheme; // new
        // ...
        let value = self.parse_token_value(kind, start);
        let end = self.current;
        let grapheme_end = self.current_grapheme;
        Token {
            kind,
            value,
            start,
            end,
            grapheme_start,
            grapheme_end,
        }
    }

    fn next(&mut self) -> Option<char> {
        let c = self.peek();
        if let Some(c) = c {
            // byte offset: advance by the UTF-8 length of the character
            self.current += c.len_utf8() as u32;
            // grapheme offset: advance by the rendered width (2 for wide CJK characters) // new
            self.current_grapheme += UnicodeWidthChar::width_cjk(c).unwrap() as u32; // new
        }
        c
    }
}
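Once the lexer exposes both offsets, a test could compare them directly. A rough sketch, assuming a constructor along the lines of Lexer::new(source) (the exact signature isn't shown in this issue):

#[test]
fn grapheme_offsets_track_characters_not_bytes() {
    let mut lexer = Lexer::new("ಠ = 1");
    let token = lexer.next_token();
    // byte offsets: ಠ occupies 3 bytes of UTF-8
    assert_eq!((token.start, token.end), (0, 3));
    // grapheme offsets: ಠ renders as a single character, matching Python's count
    assert_eq!((token.grapheme_start, token.grapheme_end), (0, 1));
}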
parser/src/token.rs
pub struct Token {
    pub kind: Kind,
    // Value might be deleted in the future
    pub value: TokenValue,
    pub start: u32,
    pub end: u32,
    pub grapheme_start: u32, // new
    pub grapheme_end: u32,   // new
}
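Since the extra offsets are only wanted for testing, the new fields (and the matching bookkeeping in the lexer) could be gated behind a Cargo feature, declared with a [features] entry in parser/Cargo.toml. A sketch using a made-up feature name:

pub struct Token {
    pub kind: Kind,
    pub value: TokenValue,
    pub start: u32,
    pub end: u32,
    #[cfg(feature = "grapheme-offsets")] // hypothetical feature name
    pub grapheme_start: u32,
    #[cfg(feature = "grapheme-offsets")]
    pub grapheme_end: u32,
}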