Unicode handling #25

oiwn · 2025-01-30T09:47:57Z

Character Counting Issues:

The code calls .len() on strings, which returns the length in bytes, not the number of Unicode characters
This happens in get_node_text(), for example, and throughout the DensityNode calculations
This could lead to incorrect density calculations for non-ASCII text like Chinese characters, emojis, or combining diacritical marks
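To make the difference concrete, here is a minimal sketch comparing the two ways of counting. The helper names `byte_count` and `char_count` are made up for illustration and are not taken from the crate. Note that `chars().count()` counts Unicode scalar values, which is still not the same as user-perceived characters (grapheme clusters); counting those would need an external crate.

```rust
/// Length in UTF-8 bytes, which is what `str::len` returns.
fn byte_count(text: &str) -> usize {
    text.len()
}

/// Number of Unicode scalar values (`char`s) in the text.
/// A base letter plus a combining mark still counts as 2 here.
fn char_count(text: &str) -> usize {
    text.chars().count()
}
```

For ASCII text both functions agree. For "中文" the byte count is 6 while the char count is 2, so a byte-based density would be three times higher for the same amount of visible text.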

Text Normalization:

The code trims whitespace but doesn't normalize Unicode text
Different Unicode representations of the same text (e.g. precomposed é, U+00E9, vs e followed by a combining acute accent, U+0065 U+0301) could lead to inconsistent character counts
No handling of zero-width characters, bidirectional text markers, etc.
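One possible way to deal with the invisible characters is to skip them when counting. The sketch below uses only the standard library; the function names are hypothetical, and the list of code points is a reasonable starting set rather than an exhaustive one. It does not perform normalization (NFC/NFD), which the standard library does not provide; that would likely need a crate such as `unicode-normalization`.

```rust
/// Returns true for zero-width and bidirectional control characters
/// that take up no visible space in rendered text.
fn is_invisible_format_char(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200F}'   // ZWSP, ZWNJ, ZWJ, LRM, RLM
        | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
        | '\u{2066}'..='\u{2069}' // bidi isolates
        | '\u{FEFF}'              // zero width no-break space / BOM
    )
}

/// Counts chars after trimming whitespace, ignoring invisible
/// format characters. Intended for counting only.
fn visible_char_count(text: &str) -> usize {
    text.trim()
        .chars()
        .filter(|c| !is_invisible_format_char(*c))
        .count()
}
```

The filter is applied only while counting and the text itself is left alone, because ZWJ and ZWNJ affect rendering in some scripts and in emoji sequences, so removing them from the extracted content could change what the reader sees.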

Specific Risk Areas:

char_count in DensityNode uses raw byte length
Text density calculations could be skewed for content in different scripts
Link text extraction might break with URLs containing Unicode characters
No handling of HTML entities that represent Unicode characters
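For the HTML entity point, a rough sketch of decoding numeric character references (`&#233;` and `&#xE9;`) with the standard library might look like this. `decode_numeric_entity` is a hypothetical helper, not an existing function in the crate. Named entities such as `&eacute;` are deliberately not handled, since that needs a lookup table, and the HTML parser in use may already decode entities before the text reaches this code.

```rust
/// Decodes a single numeric character reference such as "&#233;"
/// or "&#xE9;". Returns None for anything else, including named
/// entities and code points that are not valid Unicode scalar values.
fn decode_numeric_entity(entity: &str) -> Option<char> {
    let body = entity.strip_prefix("&#")?.strip_suffix(';')?;

    // A leading 'x' or 'X' marks a hexadecimal reference.
    let hex = body.strip_prefix('x').or_else(|| body.strip_prefix('X'));
    let (digits, radix) = match hex {
        Some(h) => (h, 16),
        None => (body, 10),
    };

    // Reject empty bodies and stray characters such as a '+' sign.
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }

    let code = u32::from_str_radix(digits, radix).ok()?;
    char::from_u32(code)
}
```

If entities are left undecoded, `&#233;` contributes six characters to the count instead of one, which skews density in the same direction as byte counting.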

Potential Real-World Issues:

News articles in non-Latin scripts might get incorrect density scores
Mixed-script content could have unbalanced density calculations
Content with heavy emoji usage could have inflated character counts, since each emoji takes several bytes in UTF-8
International domain names in links might be counted incorrectly
