Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow \u202f in White Space Normalization #692

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

addie9800
Copy link
Collaborator

@addie9800 addie9800 commented Jan 28, 2025

The Publisher Coverage caught an error for this article: https://freebeacon.com/trump-administration/trump-slaps-tariffs-on-colombia-after-its-israel-bashing-marxist-president-rejects-us-deportation-flights/. It turns out that the reason is that the unicode symbol \u202f is embedded in the links of the second image, which for some implementations of regex (also apparently of the re package) is covered under \s, where as for others it is not leading to unpredictable behavior. I'm not really sure, whether to fix it or let this article slide. Then again, not sure how common the issue is. You can observe the issue here: https://regex101.com/r/rv9Zxp/1 You see this matcher does not find any matches (at least in Chrome), if you then go to regexr.com and use the pattern \s on this string:    You will find that it will match 2 characters.

@addie9800 addie9800 requested a review from MaxDall January 28, 2025 14:23
@MaxDall
Copy link
Collaborator

MaxDall commented Feb 3, 2025

@addie9800 Honestly I don't know why we're normalizing whitespaces here in the first place.

@addie9800
Copy link
Collaborator Author

Ok 🤣, in this case, I would suggest removing it. It does not seem to have any effect on existing test cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants