Unicode handling #25

Open
oiwn opened this issue Jan 30, 2025 · 0 comments

oiwn commented Jan 30, 2025

Character Counting Issues:

The code calls .len() on strings, which returns the length in UTF-8 bytes, not the number of Unicode characters
This happens in get_node_text() and throughout the DensityNode calculations
Density calculations can therefore be wrong for non-ASCII text such as Chinese characters, emoji, or combining diacritical marks, all of which take more than one byte per character
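The gap between the two counts can be seen with the standard library alone. The helper names below are hypothetical and are not taken from the crate; this is only a sketch of the difference between byte length and `char` count.

```rust
/// Byte length of the UTF-8 encoding; this is what `str::len()` reports.
fn byte_len(text: &str) -> usize {
    text.len()
}

/// Number of Unicode scalar values (Rust `char`s) in the text.
fn scalar_count(text: &str) -> usize {
    text.chars().count()
}
```

For "中文", `byte_len` gives 6 while `scalar_count` gives 2. Note that `chars().count()` still counts a combining mark as its own character, so "e" followed by U+0301 is 2. Counting user-perceived characters (grapheme clusters) is not in the standard library and would need an external crate such as `unicode-segmentation`.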

Text Normalization:

The code trims whitespace but does not normalize Unicode text
Different Unicode representations of the same text (e.g. precomposed é, U+00E9, vs. e followed by the combining acute accent U+0301) can produce inconsistent character counts
Zero-width characters, bidirectional text markers, and similar invisible characters are not handled
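One possible way to keep invisible characters out of the count, sketched with the standard library only. Both function names are hypothetical, and the set of code points is an assumption about what should be ignored, not an exhaustive list.

```rust
/// Returns true for a few invisible formatting characters: zero-width
/// space/non-joiner/joiner, directional marks, bidirectional controls,
/// and the byte order mark.
fn is_invisible_format_char(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200F}'       // ZWSP, ZWNJ, ZWJ, LRM, RLM
            | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
            | '\u{2066}'..='\u{2069}' // bidi isolates
            | '\u{FEFF}'              // zero-width no-break space / BOM
    )
}

/// Counts `char`s after trimming whitespace and skipping the characters above.
fn visible_scalar_count(text: &str) -> usize {
    text.trim()
        .chars()
        .filter(|c| !is_invisible_format_char(*c))
        .count()
}
```

This only changes what is counted and leaves the text itself alone, since the zero-width joiner and non-joiner carry meaning in some scripts and in emoji sequences. It does not address normalization: `visible_scalar_count("\u{00E9}")` is 1 but `visible_scalar_count("e\u{0301}")` is 2. Making those agree would need NFC normalization, which the standard library does not provide; the `unicode-normalization` crate is one option.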

Specific Risk Areas:

char_count in DensityNode uses raw byte length
Text density calculations could be skewed for content in different scripts
Link text extraction might break with URLs containing Unicode characters
No handling of HTML entities that represent Unicode characters
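For the HTML entity point, a minimal sketch of decoding numeric character references before counting. `decode_numeric_entities` is a hypothetical helper, not part of the crate, and it assumes the text reaches the density code with entities still encoded. That assumption should be checked first, since the HTML parser in use may already decode them.

```rust
/// Decodes numeric character references such as `&#233;` or `&#x4E2D;`.
/// Named entities (e.g. `&eacute;`) and malformed references are left as-is.
fn decode_numeric_entities(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(start) = rest.find("&#") {
        // Copy everything before the reference unchanged.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        // Try to parse "<digits>;" or "x<hex digits>;" into a char.
        let decoded = after.find(';').and_then(|end| {
            let body = &after[..end];
            let code = match body.strip_prefix('x').or_else(|| body.strip_prefix('X')) {
                Some(hex) => u32::from_str_radix(hex, 16).ok()?,
                None => body.parse::<u32>().ok()?,
            };
            char::from_u32(code).map(|c| (c, end))
        });
        match decoded {
            Some((c, end)) => {
                out.push(c);
                rest = &after[end + 1..];
            }
            None => {
                // Not a valid reference: keep the "&#" and keep scanning.
                out.push_str("&#");
                rest = after;
            }
        }
    }
    out.push_str(rest);
    out
}
```

With this, "caf&#233;" is 9 bytes as written but decodes to "café", which is 4 characters. Named entities would need a lookup table and are out of scope for this sketch.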

Potential Real-World Issues:

News articles in non-Latin scripts might get incorrect density scores
Mixed-script content could have unbalanced density calculations
Content with heavy emoji usage could have inflated character counts
International domain names in links might be counted incorrectly
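To illustrate how the score can drift between scripts, here is a sketch that computes a density both ways. `NodeSketch` is a hypothetical stand-in for the crate's node type, and the formula (characters divided by tags) is an assumption made for the example, not necessarily the one the crate uses.

```rust
/// Hypothetical stand-in for the crate's node type; only the fields needed here.
struct NodeSketch<'a> {
    text: &'a str,
    tag_count: usize,
}

impl NodeSketch<'_> {
    /// Density using the UTF-8 byte length, as `.len()` would give.
    fn density_by_bytes(&self) -> f64 {
        self.text.len() as f64 / self.tag_count.max(1) as f64
    }

    /// Density using the number of Unicode scalar values instead.
    fn density_by_chars(&self) -> f64 {
        self.text.chars().count() as f64 / self.tag_count.max(1) as f64
    }
}
```

With `tag_count` set to 2, "Hello world" scores 5.5 either way, while "Привет мир" (10 characters, 19 bytes) scores 9.5 by bytes and 5.0 by characters. Two paragraphs of similar visible length would therefore get quite different scores under byte counting.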
