feat: adds urlpattern to ada #381

Draft
wants to merge 26 commits into
base: main

Contributor

@miguelteixeiraa miguelteixeiraa commented May 9, 2023

TL;DR

WIP!


So I started to implement https://wicg.github.io/urlpattern/
It seems that this is a work-in-progress spec, so we may have to improvise in places during the implementation.
I'm using https://github.com/kenchris/urlpattern-polyfill and https://github.com/denoland/rust-urlpattern as references.

I don't know if there is a "right way" to do this, but I got the Unicode identifier start/part ranges using this script:

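"use strict";

const fs = require("fs");

// Matches a single code point that may start an identifier.
const regexIdentifierStart = /[$_\p{ID_Start}]/u;

function isValidIdStart(num) {
  // fromCodePoint (not fromCharCode) so code points above 0xFFFF are not
  // truncated to 16 bits.
  const c = String.fromCodePoint(num);
  return regexIdentifierStart.test(c);
}

const validRanges = [];
let start = null;

// Scan from 256 up to and including U+10FFFF, the last code point.
for (let c = 256; c <= 0x10ffff; c++) {
  const isValid = isValidIdStart(c);
  if (isValid && start === null) {
    start = c;
  } else if (!isValid && start !== null) {
    validRanges.push([start, c - 1]);
    start = null;
  }
}

// Close a range that is still open when the scan ends.
if (start !== null) {
  validRanges.push([start, 0x10ffff]);
}

const writeRangeInFile = (filename, ranges) => {
  let file = "";
  for (const [base, upper] of ranges) {
    file += `{${base}, ${upper}},\n`;
  }
  // Surface write errors instead of silently ignoring them.
  fs.writeFile(filename, file, (err) => {
    if (err) throw err;
  });
};

console.log(validRanges.length);

writeRangeInFile("id_start_ranges.txt", validRanges);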

Plus something similar for the identifier part, using the regex /[$_\u200C\u200D\p{ID_Continue}]/u
I took those regexes from https://github.com/kenchris/urlpattern-polyfill/blob/main/src/path-to-regex-modified.ts#LL70C1-L70C50

Then I used the ranges to create bitwise masks in unicode.cpp.
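
For readers following along, here is a minimal sketch of how inclusive ranges like the ones the script emits could be turned into a bitmask lookup. It is not the code in unicode.cpp: the two ranges are only an illustrative sample, and `code_point_range`, `table_limit`, `build_mask` and `is_sample_id_start` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative sample only: two inclusive [first, last] code point ranges in
// the shape the script writes out, not the full generated table.
struct code_point_range {
  std::uint32_t first;
  std::uint32_t last;
};
constexpr code_point_range sample_id_start_ranges[] = {{0x100, 0x2C1},
                                                       {0x2C6, 0x2D1}};

// Hypothetical table size: one bit per code point below 0x400.
constexpr std::uint32_t table_limit = 0x400;

// Builds the mask at compile time: bit (cp % 64) of word (cp / 64) is set
// when cp falls inside one of the sample ranges.
constexpr std::array<std::uint64_t, table_limit / 64> build_mask() {
  std::array<std::uint64_t, table_limit / 64> mask{};
  for (const auto& r : sample_id_start_ranges) {
    for (std::uint32_t cp = r.first; cp <= r.last && cp < table_limit; cp++) {
      mask[cp / 64] |= std::uint64_t(1) << (cp % 64);
    }
  }
  return mask;
}

constexpr auto id_start_mask = build_mask();

// Constant-time membership test. Code points outside the table (and anything
// not covered by the sample ranges) report false in this sketch.
constexpr bool is_sample_id_start(std::uint32_t cp) {
  return cp < table_limit &&
         ((id_start_mask[cp / 64] >> (cp % 64)) & 1) != 0;
}
```

The trade-off is memory for speed: a lookup is one shift and one mask, but a table covering every code point up to U+10FFFF would be large, so a real implementation might combine a small mask with a search over the remaining ranges.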

Nothing is tested yet.

It will take some time to finish!

ref nodejs/node#40844

@miguelteixeiraa miguelteixeiraa marked this pull request as draft May 9, 2023 20:41
bool ignore_case = false;
};

struct component_result {
Member

ada::component_result seems like a really vague name, since it might be mistaken for ada::url_components

namespace ada::urlpattern {
struct urlpattern_component_result {
std::string_view input;
std::unordered_map<std::string_view, std::optional<std::string_view>> groups;
Member

@lemire How does the performance of using std::optional here compare with std::variant<std::nullopt_t, std::string_view>?

Member

I don't expect it matters much.

namespace ada::urlpattern {

struct urlpattern_options {
std::string_view delimiter = "";
Member

Suggested change
- std::string_view delimiter = "";
+ std::string_view delimiter{};


std::u32string_view input;
std::vector<token> token_list;
size_t component_start = 0;
Member

Suggested change
- size_t component_start = 0;
+ size_t component_start{0};


namespace ada::urlpattern {
// https://wicg.github.io/urlpattern/#component
struct urlpattern_component {
Member

Since this is already under the ada::urlpattern namespace, why not call this struct component?

std::string final_utf8_url(utf8_size, '\0');
ada::idna::utf32_to_utf8(url.data(), url.size(), final_utf8_url.data());

if (ada::can_parse(final_utf8_url)) {
Member

Can you add a TODO here to optimize the can_parse function?

ada_really_inline bool constructor_string_parser::is_group_open() {
// If parser’s token list[parser’s token index]'s type is "open", then
// return true. Else return false.
return token_list[token_index].type == TOKEN_TYPE::OPEN;
Member

This is not safe. We should add development asserts to make sure token_index is smaller than the length of token_list.
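
A minimal sketch of the kind of check meant here, using the standard `assert` macro as a stand-in for whatever development-assert macro the project prefers. The `constructor_string_parser_sketch` type and the reduced `TOKEN_TYPE` enum are hypothetical and only mirror the names in the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduced stand-ins for the PR's types, for illustration only.
enum class TOKEN_TYPE { OPEN, CLOSE, END };
struct token {
  TOKEN_TYPE type;
};

struct constructor_string_parser_sketch {
  std::vector<token> token_list;
  std::size_t token_index = 0;

  bool is_group_open() const {
    // Development-time bounds check; compiled out when NDEBUG is defined,
    // so release builds pay nothing for it.
    assert(token_index < token_list.size());
    return token_list[token_index].type == TOKEN_TYPE::OPEN;
  }
};
```

With the assert in place, an out-of-range token_index fails loudly in debug builds instead of reading past the end of the vector.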

while (index < input.size()) {
size_t pos = input.find_first_of(U".+*?^${}()[]|/\\)");
if (pos == std::string_view::npos) {
result = result += input.substr(index, input.size());
Member

This line is weird: result = result += ... assigns result back to itself, so result += ... alone would do.
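
For comparison, a sketch of how the escaping loop could look with a single append per step. It assumes the intent is to put a backslash in front of each regexp special character, searching from the current index each time; `escape_regexp_string_sketch` and `regexp_special_chars` are hypothetical names, not the PR's code.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Characters that get a backslash in front of them in this sketch.
constexpr std::u32string_view regexp_special_chars = U".+*?^${}()[]|/\\";

std::u32string escape_regexp_string_sketch(std::u32string_view input) {
  std::u32string result;
  result.reserve(input.size());
  std::size_t index = 0;
  while (index < input.size()) {
    // Search from the current position, not from the start of the input.
    const std::size_t pos = input.find_first_of(regexp_special_chars, index);
    if (pos == std::u32string_view::npos) {
      // No special character left: copy the tail once and stop.
      result.append(input.substr(index));
      break;
    }
    // Copy the plain run, then the escaped special character.
    result.append(input.substr(index, pos - index));
    result.push_back(U'\\');
    result.push_back(input[pos]);
    index = pos + 1;
  }
  return result;
}
```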

// 1. Set type to "full-wildcard".
type = PART_TYPE::FULL_WILDCARD;
// 2. Set regexp value to the empty string.
regexp_value.clear();
Member

We are assigning and later clearing this value, which causes a performance degradation. We should eventually optimize it to reduce unnecessary allocations.

}

// 3. Set token to the result of running try to consume a token given parser
// and "asterisk".
Member

There is a function for this, like t.value_or("default val").
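
As a small illustration of `std::optional::value_or`, here it is applied to an optional group value like the ones in the groups map earlier in this review. `group_or` is a hypothetical helper, not part of the PR.

```cpp
#include <optional>
#include <string_view>

// Hypothetical helper: returns the captured group, or the fallback when the
// optional is empty.
inline std::string_view group_or(const std::optional<std::string_view>& group,
                                 std::string_view fallback) {
  return group.value_or(fallback);
}
```

One caveat: the argument to value_or is always evaluated, so it is a poor fit when the fallback has side effects, such as consuming another token from the parser.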


// If name token is null and token is null, then set token to the result of
// running try to consume a token given parser and "asterisk".
if (!name_token.has_value() && !regexp_or_wildcard.has_value()) {
Member

There is a fast path here. If name_token does not have a value, you don't need to try to consume a token, right?
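
To make the step under discussion concrete, here is a sketch of the "try to consume a regexp or wildcard token" logic as the quoted spec comment describes it: the "asterisk" fallback is attempted only when both the name token and the regexp token are missing. `parser_sketch` and its reduced `token` type are hypothetical, and the bounds check in `try_consume_token` is an assumption of this sketch.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Reduced stand-ins for the PR's token types, for illustration only.
enum class token_type { name, regexp, asterisk, end };
struct token {
  token_type type;
};

struct parser_sketch {
  std::vector<token> token_list;
  std::size_t index = 0;

  // Returns the next token and advances only if it has the requested type.
  std::optional<token> try_consume_token(token_type type) {
    if (index >= token_list.size() || token_list[index].type != type) {
      return std::nullopt;
    }
    return token_list[index++];
  }

  std::optional<token> try_consume_regexp_or_wildcard_token(
      const std::optional<token>& name_token) {
    std::optional<token> result = try_consume_token(token_type::regexp);
    // Fall back to "asterisk" only when there is neither a name token nor a
    // regexp token.
    if (!name_token.has_value() && !result.has_value()) {
      result = try_consume_token(token_type::asterisk);
    }
    return result;
  }
};
```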

bricss commented Dec 5, 2023

Is there any way to crank 🔧 this up for nodejs/node#51060 needs? 🤔

bricss commented Jan 15, 2024

Houston, do you read me? 📡

Member

lemire commented Jan 15, 2024

@bricss Are you available to help push this forward?

bricss commented Jan 15, 2024

Yes, with one caveat: I don't have much experience with C++ 🤷‍♂️ atm 🙄

Member

lemire commented Jan 15, 2024

@bricss Lack of knowledge of C++ could be a problem.
