Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent URL in Link header from including invalid characters #1802

Conversation

AhmarZaidi
Copy link
Contributor

@AhmarZaidi AhmarZaidi commented Jan 14, 2025

Summary

This PR addresses an issue where non-ASCII characters in URL filenames caused HTTP headers to break reverse proxies and violate standards. The solution encodes only the filename part of URLs, ensuring compliance with ISO-8859-1 character requirements. This change maintains URL integrity while preventing potential issues with reverse proxies.

Example URL : https://testsite.com/wp-content/uploads/2025/01/חנות-scaled.avif
Corrected URL: https://testsite.com/wp-content/uploads/2025/01/%D7%97%D7%A0%D7%95%D7%AA-scaled.avif

Fixes #1775

Relevant technical choices

  • Implemented a solution to address the issue where non-ASCII characters in URLs, such as Hebrew characters in filenames, were causing HTTP headers to break reverse proxies and violate HTTP standards.
  • Updated the URL encoding logic to use a regular expression that matches any character not allowed by RFC 3986, ensuring that all non-ASCII and disallowed ASCII characters are percent-encoded.
  • Implemented preg_replace_callback() with a static closure to dynamically encode characters using rawurlencode(). This ensures that only characters outside the allowed set are encoded.
  • Add test cases to include international domain names, non-ASCII paths, and URLs with special characters.

@AhmarZaidi AhmarZaidi changed the title Fix: Optimize URL encoding logic in get_response_header Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards Jan 14, 2025
Copy link

codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.97%. Comparing base (e793289) to head (c3c2512).
Report is 12 commits behind head on trunk.

Additional details and impacted files
@@            Coverage Diff             @@
##            trunk    #1802      +/-   ##
==========================================
+ Coverage   65.92%   65.97%   +0.04%     
==========================================
  Files          88       88              
  Lines        6885     6895      +10     
==========================================
+ Hits         4539     4549      +10     
  Misses       2346     2346              
Flag Coverage Δ
multisite 65.97% <100.00%> (+0.04%) ⬆️
single 38.17% <0.00%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I left a couple alternative suggestions.

Please add some test coverage for the lines not covered be tests.

@westonruter
Copy link
Member

The solution encodes only the filename part of URLs

What about when an internationalized domain name is used? Couldn't this also cause problems with encoding?

@westonruter westonruter added [Type] Bug An existing feature is broken [Plugin] Optimization Detective Issues for the Optimization Detective plugin labels Jan 14, 2025
@westonruter
Copy link
Member

Additionally, on multisite subdirectory installs, in theory the path before wp-content could also include non-ASCII chars:

https://testsite.com/חנות/wp-content/uploads/2025/01/example-scaled.avif

@westonruter
Copy link
Member

@AhmarZaidi Hey, are you intending to pick this up again?

@AhmarZaidi
Copy link
Contributor Author

@westonruter Yes, I'll be working on the changes right away.

@AhmarZaidi
Copy link
Contributor Author

AhmarZaidi commented Jan 30, 2025

Instead of parsing the URL and then re-constructing it, what if you just check if the href has any non-ASCII chars and then encode the entire URL

@westonruter I've implemented the changes: If the path contains any non-ascii characters then we encode the whole URL.

However we can use preg_replace_callback to encode only the non-ascii characters in the URL. Let me know if that approach will be better.

@westonruter
Copy link
Member

What you did looks good. Could you add some test cases for international domain names and non-ASCII paths to ensure the expected output?

@AhmarZaidi
Copy link
Contributor Author

@westonruter The old solution was encoding the :// part also, so I've update the code slightly.
Current solution: Encode complete url after :// part.

Potential Issue: This solution converts the slashes (/) to %2F so the file path would be incorrect.
For example https://xn--fsq.com/תמונה.jpg will be encoded to https://xn--fsq.com%2F%D7%AA%D7%9E%D7%95%D7%A0%D7%94.jpg
Note: I've written the tests according to the above output.


If we use something like:

$decoded_url = urldecode( $link['href'] );
$encoded_url = preg_replace_callback(
        '/[^\x00-\x7F]/',
        fn( $matches ) => rawurlencode( $matches[0] ),
        $decoded_url
);

Then https://xn--fsq.com/תמונה.jpg will be encoded to https://xn--fsq.com/%D7%AA%D7%9E%D7%95%D7%A0%D7%94.jpg

Please feel free to let me know if we should do any further changes to the approach.

@westonruter
Copy link
Member

If we use something like:

That looks good!

@AhmarZaidi
Copy link
Contributor Author

Implemented the changes.

@westonruter
Copy link
Member

So this is no longer a draft and is ready for review, correct?

@AhmarZaidi AhmarZaidi marked this pull request as ready for review January 31, 2025 20:03
Copy link

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Unlinked Accounts

The following contributors have not linked their GitHub and WordPress.org accounts: @amitay-elementor.

Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Unlinked contributors: amitay-elementor.

Co-authored-by: AhmarZaidi <[email protected]>
Co-authored-by: westonruter <[email protected]>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@westonruter
Copy link
Member

Relevant technical choices

Could you update the description based on the latest revisions?

@westonruter westonruter changed the title Fix: Optimization detective can return non-ascii characters in the Link header, breaking some reverse proxies and HTTP standards Prevent URL in Link header from including invalid characters Feb 4, 2025
Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@felixarntz Anything we missed here?

Copy link
Member

@felixarntz felixarntz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @AhmarZaidi, LGTM!

@westonruter westonruter merged commit 40e8c31 into WordPress:trunk Feb 5, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Plugin] Optimization Detective Issues for the Optimization Detective plugin [Type] Bug An existing feature is broken
Projects
None yet
3 participants