url: handle "unsafe" characters properly in pathToFileURL #54545

Open
wants to merge 14 commits into
base: main

Conversation

aduh95
Contributor

@aduh95 aduh95 commented Aug 24, 2024

Fixes: #54515

Given the number of characters to cover, I'm leaning towards using one regex to deal with them all rather than adding more special cases. Let's check what the benchmark says.

FWIW, according to RFC 1738, they should be encoded (see the bold):

Characters can be unsafe for a number of reasons. The space
character is unsafe because significant spaces may disappear and
insignificant spaces may be introduced when URLs are transcribed or
typeset or subjected to the treatment of word-processing programs.
The characters "<" and ">" are unsafe because they are used as the
delimiters around URLs in free text; the quote mark (""") is used to
delimit URLs in some systems. The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it. The character "%" is unsafe because it is used for
encodings of other characters. Other characters are unsafe because
gateways and other transport agents are known to sometimes modify
such characters. These characters are "{", "}", "|", "\", "^", "~",
"[", "]", and "`".

All unsafe characters must always be encoded within a URL. For
example, the character "#" must be encoded within URLs even in
systems that do not normally deal with fragment or anchor
identifiers, so that if the URL is copied into another system that
does use them, it will not be necessary to change the URL encoding.

Originally posted by @RedYetiDev in #54515 (comment)

I took the tests from #54516, so I added @EarlyRiser42 as co-author.
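
To make the RFC 1738 list concrete, here is a minimal sketch of the single-regex idea: one character class covering the quoted "unsafe" characters, with each match replaced by its percent-encoded byte. The names `UNSAFE_CHARS` and `encodeUnsafe` are hypothetical, and this is not the code from this PR. It also ignores Windows, where a backslash is a path separator rather than a character to encode.

```javascript
// Hypothetical illustration, not the implementation in lib/internal/url.js.
// One character class for the RFC 1738 "unsafe" ASCII characters:
// space " # % < > \ ^ ` { | } ~ [ ]
const UNSAFE_CHARS = /[ "#%<>\\^`{|}~\[\]]/g;

function encodeUnsafe(path) {
  // A single pass with a callback, so an inserted '%' is never re-encoded.
  return path.replace(UNSAFE_CHARS, (ch) =>
    '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0'));
}

console.log(encodeUnsafe('/tmp/a b#c|d~e')); // prints "/tmp/a%20b%23c%7Cd%7Ee"
```

Whether one regex like this is fast enough is exactly what the benchmark runs in this thread are meant to answer.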

@aduh95 added the needs-benchmark-ci label Aug 24, 2024
@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/loaders
  • @nodejs/url

@nodejs-github-bot added the needs-ci, process, and whatwg-url labels Aug 24, 2024

codecov bot commented Aug 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.03%. Comparing base (be4babb) to head (dcb75fa).
Report is 28 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #54545   +/-   ##
=======================================
  Coverage   88.03%   88.03%           
=======================================
  Files         652      652           
  Lines      183761   183817   +56     
  Branches    35863    35873   +10     
=======================================
+ Hits       161765   161816   +51     
- Misses      15229    15241   +12     
+ Partials     6767     6760    -7     
Files with missing lines Coverage Δ
lib/internal/process/execution.js 98.79% <100.00%> (ø)
lib/internal/url.js 97.92% <100.00%> (+0.02%) ⬆️

... and 24 files with indirect coverage changes

@aduh95 added and removed the author ready and request-ci labels Aug 24, 2024
@aduh95

This comment was marked as outdated.

@targos
Member

targos commented Aug 25, 2024

How does it have an impact on fileURLToPath ?

@EarlyRiser42
Contributor

EarlyRiser42 commented Aug 25, 2024

The current benchmark had a typo that caused only pathToFileURL to run: main used fileUrlOrPath instead of fileUrlToPath, so the other function never ran even though both were passed as inputs in createbenchmark. I submitted a PR to fix this (#54190).

@aduh95

This comment was marked as outdated.

@aduh95
Contributor Author

aduh95 commented Aug 25, 2024

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1618/

                                                                                                                 confidence improvement accuracy (*)   (**)  (***)
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool' method='pathToFileURL'                    ***    -15.54 %       ±0.23% ±0.31% ±0.40%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool#hash' method='pathToFileURL'               ***    -14.31 %       ±0.31% ±0.42% ±0.55%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null' method='pathToFileURL'                                   ***     -7.84 %       ±0.37% ±0.49% ±0.63%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='fileURLToPath'                     -0.17 %       ±0.44% ±0.58% ±0.76%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='pathToFileURL'             ***    -11.61 %       ±0.61% ±0.81% ±1.06%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='fileURLToPath'                -0.35 %       ±0.40% ±0.54% ±0.70%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='pathToFileURL'        ***    -14.09 %       ±0.49% ±0.66% ±0.86%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='fileURLToPath'                                     0.02 %       ±0.40% ±0.54% ±0.70%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='pathToFileURL'                            ***     -7.37 %       ±0.45% ±0.60% ±0.78%

@aduh95
Contributor Author

aduh95 commented Aug 25, 2024

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1619/

                                                                                                                 confidence improvement accuracy (*)   (**)  (***)
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool' method='pathToFileURL'                    ***    -11.39 %       ±0.32% ±0.43% ±0.56%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool#hash' method='pathToFileURL'               ***     -7.01 %       ±0.40% ±0.53% ±0.69%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null' method='pathToFileURL'                                   ***     -7.76 %       ±0.31% ±0.42% ±0.55%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='fileURLToPath'                     -0.09 %       ±0.27% ±0.36% ±0.47%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='pathToFileURL'             ***     -8.94 %       ±0.73% ±0.98% ±1.30%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='fileURLToPath'                -0.18 %       ±0.35% ±0.47% ±0.61%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='pathToFileURL'        ***     -9.00 %       ±0.43% ±0.57% ±0.75%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='fileURLToPath'                                    -0.33 %       ±1.10% ±1.47% ±1.94%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='pathToFileURL'                            ***     -6.85 %       ±0.42% ±0.56% ±0.74%

Member

@anonrig anonrig left a comment

I think we shouldn't land this PR with this much impact on almost all Node.js operations.

@aduh95
Contributor Author

aduh95 commented Aug 25, 2024

V8 performance lesson of the day: multiple regexes are better than a single one apparently 🤷‍♂️
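
For readers who want to try the comparison locally, here is a rough sketch of the two strategies side by side. Everything in it is an assumption made for illustration: the function names, the reduced character subset, and the timing loop are hypothetical, and the numbers it prints are not a substitute for the benchmark CI results above.

```javascript
// Hypothetical comparison harness; names and character subset are assumptions.
const ENCODED = {
  '%': '%25', ' ': '%20', '"': '%22', '#': '%23', '|': '%7C', '~': '%7E',
};

// Strategy 1: one regex, one pass, the callback looks up the replacement.
const SINGLE = /[% "#|~]/g;
function encodeSingleRegex(s) {
  return s.replace(SINGLE, (ch) => ENCODED[ch]);
}

// Strategy 2: one regex per character, applied only when the character occurs.
// '%' goes first so the escapes added by later steps are not re-encoded.
const STEPS = [
  ['%', /%/g], [' ', / /g], ['"', /"/g], ['#', /#/g], ['|', /\|/g], ['~', /~/g],
];
function encodeMultipleRegexes(s) {
  for (const [ch, re] of STEPS) {
    if (s.includes(ch)) s = s.replace(re, ENCODED[ch]);
  }
  return s;
}

// Crude wall-clock timing; a real measurement should use the benchmark suite.
function timeIt(fn, input, iterations = 100000) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(input);
  return performance.now() - start;
}

for (const fn of [encodeSingleRegex, encodeMultipleRegexes]) {
  console.log(fn.name, timeIt(fn, '/dev/null').toFixed(1), 'ms');
}
```

Both functions return the same string for the same input, so any difference in the timings comes from the strategy, not from the output.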

@EarlyRiser42

This comment was marked as duplicate.

@aduh95
Contributor Author

aduh95 commented Aug 26, 2024

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1620/

                                                                                                                 confidence improvement accuracy (*)   (**)  (***)
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool' method='pathToFileURL'                    ***     -8.38 %       ±0.43% ±0.58% ±0.75%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool#hash' method='pathToFileURL'               ***     -9.95 %       ±0.50% ±0.66% ±0.87%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null' method='pathToFileURL'                                   ***    -15.48 %       ±0.52% ±0.70% ±0.91%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='fileURLToPath'                      0.35 %       ±0.37% ±0.50% ±0.65%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='pathToFileURL'             ***     -7.29 %       ±0.49% ±0.66% ±0.86%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='fileURLToPath'                 0.50 %       ±0.81% ±1.09% ±1.43%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='pathToFileURL'        ***    -10.97 %       ±0.36% ±0.48% ±0.62%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='fileURLToPath'                                     0.11 %       ±0.69% ±0.92% ±1.19%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='pathToFileURL'                            ***     -9.62 %       ±0.42% ±0.55% ±0.72%

@aduh95
Contributor Author

aduh95 commented Aug 26, 2024

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1621/

                                                                                                                 confidence improvement accuracy (*)   (**)  (***)
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool' method='pathToFileURL'                    ***     -9.09 %       ±0.39% ±0.53% ±0.69%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool#hash' method='pathToFileURL'               ***    -10.66 %       ±0.45% ±0.60% ±0.78%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null' method='pathToFileURL'                                   ***    -15.32 %       ±0.29% ±0.39% ±0.51%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='fileURLToPath'                     -0.06 %       ±0.43% ±0.57% ±0.74%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='pathToFileURL'             ***     -8.11 %       ±0.67% ±0.91% ±1.20%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='fileURLToPath'                -0.07 %       ±0.46% ±0.61% ±0.79%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='pathToFileURL'        ***    -10.64 %       ±0.52% ±0.69% ±0.91%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='fileURLToPath'                                     0.69 %       ±0.82% ±1.10% ±1.44%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='pathToFileURL'                            ***     -7.96 %       ±0.31% ±0.41% ±0.54%

@RedYetiDev

This comment was marked as duplicate.

@aduh95
Contributor Author

aduh95 commented Aug 27, 2024

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1622/

                                                                                                                 confidence improvement accuracy (*)   (**)  (***)
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool' method='pathToFileURL'                    ***    -22.93 %       ±0.46% ±0.62% ±0.82%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool#hash' method='pathToFileURL'               ***    -24.39 %       ±0.29% ±0.39% ±0.51%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null' method='pathToFileURL'                                   ***    -33.14 %       ±0.40% ±0.54% ±0.70%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='fileURLToPath'                     -0.51 %       ±0.86% ±1.16% ±1.54%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='pathToFileURL'             ***    -18.79 %       ±0.51% ±0.68% ±0.89%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='fileURLToPath'                 0.19 %       ±0.30% ±0.40% ±0.52%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='pathToFileURL'        ***    -22.45 %       ±0.39% ±0.52% ±0.68%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='fileURLToPath'                                     0.11 %       ±0.38% ±0.50% ±0.65%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='pathToFileURL'                            ***    -22.06 %       ±0.46% ±0.62% ±0.80%

@aduh95
Contributor Author

aduh95 commented Aug 27, 2024

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1623/

                                                                                                                 confidence improvement accuracy (*)   (**)  (***)
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool' method='pathToFileURL'                    ***    -35.89 %       ±0.47% ±0.63% ±0.84%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool#hash' method='pathToFileURL'               ***    -32.21 %       ±0.33% ±0.44% ±0.58%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null' method='pathToFileURL'                                   ***    -51.54 %       ±0.53% ±0.71% ±0.94%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='fileURLToPath'                     -0.23 %       ±0.33% ±0.44% ±0.58%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='pathToFileURL'             ***    -29.42 %       ±0.60% ±0.80% ±1.04%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='fileURLToPath'                -0.20 %       ±0.31% ±0.41% ±0.53%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='pathToFileURL'        ***    -28.38 %       ±0.55% ±0.74% ±0.98%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='fileURLToPath'                                    -0.59 %       ±1.29% ±1.73% ±2.27%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='pathToFileURL'                            ***    -38.36 %       ±0.29% ±0.39% ±0.51%

@aduh95
Contributor Author

aduh95 commented Aug 27, 2024

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1624/

                                                                                                                 confidence improvement accuracy (*)   (**)  (***)
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool' method='pathToFileURL'                    ***    -12.29 %       ±0.58% ±0.78% ±1.01%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null?key=param&bool#hash' method='pathToFileURL'               ***    -15.91 %       ±0.46% ±0.61% ±0.79%
url/whatwg-url-to-and-from-path.js n=5000000 input='/dev/null' method='pathToFileURL'                                   ***    -15.43 %       ±0.26% ±0.35% ±0.45%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='fileURLToPath'               *     -0.38 %       ±0.37% ±0.50% ±0.66%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool' method='pathToFileURL'             ***     -9.85 %       ±0.66% ±0.89% ±1.17%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='fileURLToPath'                 0.06 %       ±0.17% ±0.22% ±0.29%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null?key=param&bool#hash' method='pathToFileURL'        ***    -14.37 %       ±0.69% ±0.93% ±1.21%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='fileURLToPath'                                     0.14 %       ±0.49% ±0.65% ±0.85%
url/whatwg-url-to-and-from-path.js n=5000000 input='file:///dev/null' method='pathToFileURL'                            ***     -9.18 %       ±0.34% ±0.45% ±0.59%

@EarlyRiser42
Contributor

As @RedYetiDev originally mentioned in #54515, both | and ~ need to be encoded. However, I noticed that on my local machines (Windows and Linux) they are not encoded when used in a file URL, which has caused some confusion. Additionally, since { and } are already encoded by the URL constructor (bindingurl.parse), perhaps we don't need to include them in the regex. Feel free to ignore.
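
The observation about the URL constructor can be checked directly with the global WHATWG `URL` class. The snippet below is only a sketch with a made-up path. It shows that the parser percent-encodes `{` and `}` in the path but leaves `|` and `~` alone, so any encoding of those two would have to happen before the string reaches the parser.

```javascript
// The path below is hypothetical; it mixes characters that the WHATWG path
// percent-encode set covers ({ and }) with ones it does not (| and ~).
const parsed = new URL('file:///tmp/a|b~c{d}');

console.log(parsed.pathname); // prints "/tmp/a|b~c%7Bd%7D"
```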

@jasnell
Member

jasnell commented Sep 8, 2024

CI is failing but it's unclear if they are all flaky failures or if some subset of the failures are caused by the changes here.

@jasnell removed the author ready label Sep 8, 2024
@aduh95
Contributor Author

aduh95 commented Sep 11, 2024

CI is failing but it's unclear if they are all flaky failures or if some subset of the failures are caused by the changes here.

I've restarted a run, and AFAICT the failures are all infrastructure-related (no space left on device, connection failed); there's no "new" CI failure in this PR.

@aduh95 added the author ready label Sep 19, 2024
@aduh95 added the request-ci label Sep 19, 2024
@github-actions bot removed the request-ci label Sep 19, 2024
@joyeecheung
Member

joyeecheung commented Sep 19, 2024

I don't think pathToFileURL is used in any hot path (I had a quick look at the 53 occurrences of it in the lib/ folder), so I'd be surprised if it had any significant impact on the other parts of the codebase. It's rather rare that we need to convert a path to a file URL, but when we do, I think we want to generate a URL with those "unsafe" chars correctly encoded.

Heads up that this would affect nodejs/loaders#198 because the hooks are exposing URLs, even though for CJS, they are always paths, so there needs to be one extra pathToFileURL per module. I think this would be almost inevitable as we try to improve CommonJS/ESM interop, when CommonJS always uses paths (which are require.cache keys, sigh), and ESM uses URLs, so the conversion will always end up on a hot path one day, no matter if it's for universal hooks or other purposes.
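
As a rough illustration of the conversion being discussed, the sketch below takes a CommonJS-style filename, turns it into a file URL, and converts it back. The filename is hypothetical and the snippet says nothing about how the hooks in nodejs/loaders#198 are actually wired; it only shows the extra path-to-URL step and why the encoding has to round-trip.

```javascript
const path = require('node:path');
const { pathToFileURL, fileURLToPath } = require('node:url');

// Hypothetical CommonJS-style filename, as it might appear as a require.cache key.
const filename = path.resolve('/tmp/my project/mod#1.js');

// Path -> URL: the step a hook exposing URLs needs for every CommonJS module.
const fileUrl = pathToFileURL(filename);

// URL -> path: the reverse step when handing the module back to CommonJS.
const roundTripped = fileURLToPath(fileUrl);

console.log(fileUrl.href);
console.log(roundTripped === filename); // prints "true"
```

If "#" were left unencoded, everything after it would be parsed as a URL fragment and the round trip would return a different filename.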

@aduh95 removed the author ready label Sep 19, 2024
@joyeecheung
Member

joyeecheung commented Sep 19, 2024

Actually it seems pathToFileURL is already invoked per-ESM in

resolved =
  pathToFileURL(real + (StringPrototypeEndsWith(path, sep) ? '/' : ''));

and per-package.json in

const packageJSONUrl = pathToFileURL(packageConfig.pjsonPath);

It's invoked 16 times when loading npm CLI, even though the npm CLI is mostly CommonJS (but it does load several package.json and uses dynamic import to load chalk)

@aduh95
Contributor Author

aduh95 commented Sep 19, 2024

I think, as Yagiz mentioned, we should probably move the implementation to C++ (we need to call the URL constructor anyway, so since we have to cross into native land, it would only make sense to use C++ rather than JS regexes to do the string manipulation). I don't know if I'm the right person to work on this, but I can certainly try. I first need to fix the current implementation, as it looks like it's failing on Windows.

Labels
needs-benchmark-ci, needs-ci, process, whatwg-url
Development

Successfully merging this pull request may close these issues.

pathToFileURL function in url fails to handle special characters properly