**Character Counting Issues:**
- The code uses `.len()` on strings, which counts UTF-8 bytes rather than Unicode characters, for example in `get_node_text()` and throughout the `DensityNode` calculations.
- This could lead to incorrect density calculations for non-ASCII text such as Chinese characters, emoji, or combining diacritical marks, as the sketch after this list illustrates.
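A minimal standalone illustration of the byte-vs-character gap (plain Rust, no project code assumed):

```rust
fn main() {
    let ascii = "hello";
    let chinese = "你好世界"; // four CJK characters
    let emoji = "👍🏽";        // thumbs-up plus skin-tone modifier

    // `.len()` is the UTF-8 byte length, not the number of characters.
    assert_eq!(ascii.len(), 5);
    assert_eq!(ascii.chars().count(), 5);   // ASCII: bytes and chars agree

    assert_eq!(chinese.len(), 12);          // 4 characters * 3 bytes each
    assert_eq!(chinese.chars().count(), 4);

    assert_eq!(emoji.len(), 8);             // 2 scalar values * 4 bytes each
    assert_eq!(emoji.chars().count(), 2);   // renders as one glyph, counts as 2 scalars
}
```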
**Text Normalization:**
- The code trims whitespace but does not normalize Unicode text.
- Different Unicode representations of the same text (e.g. precomposed `é` vs. `e` followed by a combining acute accent) could lead to inconsistent character counts; see the example after this list.
- Zero-width characters, bidirectional text markers, and similar invisible code points are not handled.
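A sketch of how NFC normalization makes the two representations count the same, assuming the `unicode-normalization` crate would be an acceptable dependency:

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let precomposed = "é";        // U+00E9
    let decomposed = "e\u{0301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    // Identical rendering, different counts before normalization.
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);

    // NFC normalization composes the pair back into a single scalar value.
    let normalized: String = decomposed.nfc().collect();
    assert_eq!(normalized, precomposed);
    assert_eq!(normalized.chars().count(), 1);
}
```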
**Specific Risk Areas:**
- `char_count` in `DensityNode` uses the raw byte length (see the sketch after this list).
- Text density calculations could be skewed for content in different scripts.
- Link text extraction might break for URLs containing Unicode characters.
- HTML entities that represent Unicode characters are not handled.
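For illustration only, a sketch of what character-based counting could look like; the `DensityNode` fields and methods shown here are hypothetical and do not reflect the crate's actual definition:

```rust
// Hypothetical struct for illustration; field names are assumptions, not the real API.
struct DensityNode {
    char_count: usize,
    link_char_count: usize,
}

impl DensityNode {
    fn from_text(text: &str, link_text: &str) -> Self {
        // Count Unicode scalar values instead of UTF-8 bytes so that density
        // ratios stay comparable across scripts.
        DensityNode {
            char_count: text.trim().chars().count(),
            link_char_count: link_text.trim().chars().count(),
        }
    }

    fn link_density(&self) -> f64 {
        if self.char_count == 0 {
            return 0.0;
        }
        // With byte counts, CJK body text (3 bytes per character) next to
        // ASCII link URLs would be weighted inconsistently.
        self.link_char_count as f64 / self.char_count as f64
    }
}
```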
**Potential Real-World Issues:**
- News articles in non-Latin scripts might get incorrect density scores.
- Mixed-script content could have unbalanced density calculations.
- Content with heavy emoji usage could have inflated character counts (see the grapheme example after this list).
- Internationalized domain names in links might be counted incorrectly.
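If perceived length matters more than scalar-value counts (emoji sequences, flags, combining marks), grapheme clusters are the closer measure. A minimal sketch assuming the `unicode-segmentation` crate:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "🇺🇦 новини"; // flag emoji (two regional indicators) plus a Cyrillic word

    println!("bytes:     {}", text.len());                   // 21: 8 + 1 + 12
    println!("scalars:   {}", text.chars().count());         // 9:  2 + 1 + 6
    println!("graphemes: {}", text.graphemes(true).count()); // 8:  1 + 1 + 6
}
```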