feat: adds urlpattern to ada #381

Draft. Wants to merge 26 commits into base: main
Commits (26):
e466165
urlpattern: adds id_start and id_pard lookup table
miguelteixeiraa May 8, 2023
4aa5f55
urlpattern: adds id_start & id_part + tokenizer sketch
miguelteixeiraa May 9, 2023
842f5b1
urlpattern: adds missing eof
miguelteixeiraa May 9, 2023
dadb665
merge branch 'main' into urlpattern
miguelteixeiraa May 9, 2023
ee41c6f
urlpattern: adds bitset lib
miguelteixeiraa May 9, 2023
df58be4
urlpattern: adds comments to unicode.h
miguelteixeiraa May 10, 2023
582a649
urlpattern: component_result -> urlpattern_component_result
miguelteixeiraa May 10, 2023
0516de4
urlpattern: wip constructors
miguelteixeiraa May 11, 2023
179a065
urlpattern: WIP contructor_string_parser
miguelteixeiraa May 13, 2023
2e80ab3
urlpattern: introducing canonicalize_protocol
miguelteixeiraa May 14, 2023
17ee73d
urlpatter: adds pragma regions
miguelteixeiraa May 14, 2023
9ae30c4
urlpattern: breakdown in multiple files
miguelteixeiraa May 15, 2023
95e5c70
urlpattern: minor fixes
miguelteixeiraa May 15, 2023
b9041fb
urlpattern: fixup to make it compile
miguelteixeiraa May 15, 2023
50d6d09
Merge branch 'main' into urlpattern
miguelteixeiraa May 15, 2023
5651a1f
urlpattern: adds missing cassert lib + fixes
miguelteixeiraa May 15, 2023
ca49555
urlpattern: fix assert for token type
miguelteixeiraa May 15, 2023
58342b3
urlpattern: introducing tests for tokenizer
miguelteixeiraa May 20, 2023
3f5fac8
urlpattern: update id_start and id_part tables
miguelteixeiraa Jun 7, 2023
824725d
urlpattern: WIP fix tokenizer
miguelteixeiraa Jun 7, 2023
88453a5
urlpattern: update tokenizer
miguelteixeiraa Jun 7, 2023
6e14684
urlpattern: updates tokenizer's test
miguelteixeiraa Jun 7, 2023
81fa16d
urlpattern: make tokenizer pass the tests
miguelteixeiraa Jun 8, 2023
5016786
urlpattern: WIP compile a component
miguelteixeiraa Jun 9, 2023
8c8942a
urlpattern: WIP pattern parser
miguelteixeiraa Jun 10, 2023
aa8522f
Merge branch 'main' into urlpattern
miguelteixeiraa Jun 10, 2023
7 changes: 7 additions & 0 deletions include/ada.h
Original file line number Diff line number Diff line change
@@ -23,6 +23,13 @@
#include "ada/url_components.h"
#include "ada/url_aggregator.h"
#include "ada/url_aggregator-inl.h"
#include "ada/urlpattern_base.h"
#include "ada/urlpattern_canonicalization.h"
#include "ada/urlpattern_internals.h"
#include "ada/urlpattern_tokenizer.h"
#include "ada/urlpattern_constructor_string_parser.h"
#include "ada/urlpattern_pattern_parser.h"
#include "ada/urlpattern.h"

// Public API
#include "ada/ada_version.h"
2 changes: 1 addition & 1 deletion include/ada/ada_idna.h
@@ -1,4 +1,4 @@
- /* auto-generated on 2023-05-07 19:12:14 -0400. Do not edit! */
+ /* auto-generated on 2023-04-26 14:14:42 -0400. Do not edit! */
/* begin file include/idna.h */
#ifndef ADA_IDNA_H
#define ADA_IDNA_H
17 changes: 17 additions & 0 deletions include/ada/unicode.h
@@ -202,6 +202,23 @@ ada_really_inline size_t percent_encode_index(const std::string_view input,
* Return true if the content was ASCII.
*/
constexpr bool to_lower_ascii(char* input, size_t length) noexcept;

/**
* Returns true if the Unicode code point is a valid JavaScript identifier part
* (IdentifierPart). A JavaScript IdentifierPart is either an UnicodeIDContinue,
* $, <ZWNJ>, <ZWJ> or an UnicodeEscapeSequence.
* @see https://tc39.es/ecma262/#prod-IdentifierPart
*/
bool is_valid_identifier_part(const char32_t& c) noexcept;
anonrig marked this conversation as resolved.

/**
* Returns true if the Unicode code point is a valid JavaScript identifier start
* (IdentifierStart). A JavaScript IdentifierStart is either an UnicodeIDStart,
* $, _ or an UnicodeEscapeSequence.
* @see https://tc39.es/ecma262/#prod-IdentifierStart
*/
bool is_valid_identifier_start(const char32_t& c) noexcept;

} // namespace ada::unicode

#endif // ADA_UNICODE_H
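
Note (illustrative sketch, not part of the PR): the two declarations added to unicode.h describe the IdentifierStart and IdentifierPart checks. The snippet below approximates them for ASCII input only. The names `sketch_is_identifier_start` and `sketch_is_identifier_part` are hypothetical, and a complete version would presumably consult the ID_Start / ID_Continue lookup tables mentioned in the commit list for code points above U+007F.

```cpp
// ASCII-only approximation; every name in this block is hypothetical.
constexpr bool sketch_is_ascii_letter(char32_t c) noexcept {
  return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

// IdentifierStart, restricted to ASCII: a letter, '$' or '_'.
constexpr bool sketch_is_identifier_start(char32_t c) noexcept {
  return sketch_is_ascii_letter(c) || c == U'$' || c == U'_';
}

// IdentifierPart, restricted to ASCII plus the two joiner code points:
// anything accepted as a start, a decimal digit, ZWNJ (U+200C) or ZWJ (U+200D).
constexpr bool sketch_is_identifier_part(char32_t c) noexcept {
  return sketch_is_identifier_start(c) || (c >= U'0' && c <= U'9') ||
         c == char32_t{0x200C} || c == char32_t{0x200D};
}
```

The sketch takes `char32_t` by value, which is the usual choice for a small scalar; the declarations in the diff take it by const reference.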
1 change: 1 addition & 0 deletions include/ada/url.h
@@ -13,6 +13,7 @@
#include "ada/unicode.h"
#include "ada/url_base.h"
#include "ada/url_components.h"
#include "ada/helpers.h"

#include <algorithm>
#include <charconv>
51 changes: 51 additions & 0 deletions include/ada/urlpattern.h
@@ -0,0 +1,51 @@
#ifndef ADA_URL_PATTERN_H
#define ADA_URL_PATTERN_H

#include "ada/common_defs.h"

#include "ada/urlpattern_base.h"

#include <unordered_map>
#include <string_view>
#include <optional>

namespace ada::urlpattern {
struct urlpattern_component_result {
std::string_view input;
std::unordered_map<std::string_view, std::optional<std::string_view>> groups;
Review comment (Member):
@lemire How's the performance comparison of using std::optional in here versus std::variant<std::nullopt, std::string_view>?

Reply (Member):
I don't expect it matters much.

};

template <typename urlpattern_input>
struct urlpattern_result {
urlpattern_component_result protocol;
urlpattern_component_result username;
urlpattern_component_result password;
urlpattern_component_result hostname;
urlpattern_component_result port;
urlpattern_component_result pathname;
urlpattern_component_result search;
urlpattern_component_result hash;
urlpattern_input input[];
};

struct urlpattern {
// TODO: improve DX.. maybe create one constructor for each case and
// make them call the right/following constructors 'under the hood'
urlpattern(std::string_view input, std::optional<std::string_view> base_url,
std::optional<urlpattern_options> &options);
urlpattern(urlpattern_init &input,
std::optional<urlpattern_options> &options);

const std::string_view protocol;
const std::string_view username;
const std::string_view password;
const std::string_view hostname;
const std::string_view port;
const std::string_view pathname;
const std::string_view search;
const std::string_view hash;
};

} // namespace ada::urlpattern

#endif
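
Note (illustrative sketch, not part of the PR): the review thread on `urlpattern_component_result::groups` asks about `std::optional` versus a `std::variant`. The snippet below shows what the `std::optional` value buys in this map: a group that did not take part in the match can be told apart from one that matched the empty string. `component_result_sketch` and `describe_group` are hypothetical names. A variant-based alternative would need `std::monostate` as its empty alternative, because `std::nullopt` is a value rather than a type.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Mirrors the shape of urlpattern_component_result from the diff.
struct component_result_sketch {
  std::string_view input;
  std::unordered_map<std::string_view, std::optional<std::string_view>> groups;
};

// Hypothetical helper: describe a group, distinguishing "not in the map",
// "present but unmatched" (std::nullopt) and "matched" (possibly empty).
inline std::string describe_group(const component_result_sketch& result,
                                  std::string_view name) {
  auto it = result.groups.find(name);
  if (it == result.groups.end()) return "<no such group>";
  if (!it->second.has_value()) return "<unmatched>";
  return std::string(*it->second);
}
```

Because both keys and values are `std::string_view`, the strings they point to have to outlive the result object.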
36 changes: 36 additions & 0 deletions include/ada/urlpattern_base.h
@@ -0,0 +1,36 @@
#ifndef ADA_URLPATTERN_BASE_H
#define ADA_URLPATTERN_BASE_H

#include "ada/helpers.h"

#include <string_view>

namespace ada::urlpattern {

struct urlpattern_options {
std::string_view delimiter = "";
Review comment (Member):
Suggested change
- std::string_view delimiter = "";
+ std::string_view delimiter{};

std::string_view prefix = "";
bool ignore_case = false;
};

struct u32urlpattern_options {
std::u32string_view delimiter = U"";
std::u32string_view prefix = U"";
bool ignore_case = false;
};

struct urlpattern_init {
std::string_view protocol;
std::string_view username;
std::string_view password;
std::string_view hostname;
std::string_view port;
std::string_view pathname;
std::string_view search;
std::string_view hash;
std::string_view base_url;
};

} // namespace ada::urlpattern

#endif
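
Note (illustrative sketch, not part of the PR): the suggested change in urlpattern_base.h swaps `= ""` for `{}` on a `std::string_view` member. The snippet below puts both forms side by side; `options_sketch` and its member names are hypothetical. Both views are empty and compare equal. The one observable difference is `data()`: a value-initialized view holds a null pointer, while a view built from a literal points at the literal's storage.

```cpp
#include <string_view>

// Hypothetical struct contrasting the two initializer forms from the review.
struct options_sketch {
  std::string_view with_literal = "";  // size() == 0, data() points at the literal
  std::string_view with_braces{};      // size() == 0, data() == nullptr
};

// True when both members are empty and compare equal as views.
inline bool forms_are_equivalent(const options_sketch& o) {
  return o.with_literal.empty() && o.with_braces.empty() &&
         o.with_literal == o.with_braces;
}
```

Code that only checks `empty()` or `size()` should not notice the change; code that passes `data()` to an API expecting a non-null pointer could.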
14 changes: 14 additions & 0 deletions include/ada/urlpattern_canonicalization.h
@@ -0,0 +1,14 @@
#ifndef ADA_URLPATTERN_CANONICALIZATION_H
#define ADA_URLPATTERN_CANONICALIZATION_H

#include "ada/helpers.h"
#include <string_view>

namespace ada::urlpattern {

// Encoding Callbacks
std::u32string_view canonicalize_protocol(std::u32string_view protocol);

} // namespace ada::urlpattern

#endif
105 changes: 105 additions & 0 deletions include/ada/urlpattern_constructor_string_parser.h
@@ -0,0 +1,105 @@
#ifndef ADA_URLPATTERN_CONSTRUCTOR_STRING_PARSER_H
#define ADA_URLPATTERN_CONSTRUCTOR_STRING_PARSER_H

#include <string_view>
#include "ada/helpers.h"

#include "ada/urlpattern_base.h"
#include "ada/urlpattern_tokenizer.h"

namespace ada::urlpattern {

enum class PARSER_STATE : uint8_t {
INIT,
PROTOCOL,
AUTHORITY,
USERNAME,
PASSWORD,
HOSTNAME,
PORT,
PATHNAME,
SEARCH,
HASH,
DONE
};

// https://wicg.github.io/urlpattern/#constructor-string-parser
struct constructor_string_parser {
ada_really_inline constructor_string_parser(std::u32string_view input);

// https://wicg.github.io/urlpattern/#change-state
ada_really_inline void change_state(PARSER_STATE new_state, size_t skip);

// https://wicg.github.io/urlpattern/#rewind
ada_really_inline void rewind();

// https://wicg.github.io/urlpattern/#rewind-and-set-state
ada_really_inline void rewind_and_set_state(PARSER_STATE new_state);

// https://wicg.github.io/urlpattern/#is-a-hash-prefix
ada_really_inline bool is_hash_prefix();

// https://wicg.github.io/urlpattern/#is-a-search-prefix
ada_really_inline bool is_search_prefix();

// https://wicg.github.io/urlpattern/#is-a-non-special-pattern-char
ada_really_inline bool is_nonspecial_pattern_char(size_t index,
const char32_t *value);

// https://wicg.github.io/urlpattern/#is-an-identity-terminator
ada_really_inline bool is_identity_terminator();

// https://wicg.github.io/urlpattern/#is-a-pathname-start
ada_really_inline bool is_pathname_start();

// https://wicg.github.io/urlpattern/#is-a-password-prefix
ada_really_inline bool is_password_prefix();

// https://wicg.github.io/urlpattern/#get-a-safe-token
ada_really_inline token *get_safe_token(size_t &index);

// https://wicg.github.io/urlpattern/#is-a-group-open
ada_really_inline bool is_group_open();

// https://wicg.github.io/urlpattern/#is-an-ipv6-open
ada_really_inline bool is_ipv6_open();

// https://wicg.github.io/urlpattern/#is-an-ipv6-close
ada_really_inline bool is_ipv6_close();

// https://wicg.github.io/urlpattern/#is-a-port-prefix
ada_really_inline bool is_port_prefix();

// https://wicg.github.io/urlpattern/#is-a-group-close
ada_really_inline bool is_group_close();

// https://wicg.github.io/urlpattern/#is-a-protocol-suffix
ada_really_inline bool is_protocol_suffix();

// https://wicg.github.io/urlpattern/#compute-protocol-matches-a-special-scheme-flag
ada_really_inline void compute_protocol_matches_special_scheme_flag();

// https://wicg.github.io/urlpattern/#make-a-component-string
ada_really_inline std::u32string_view make_component_string();

/// https://wicg.github.io/urlpattern/#next-is-authority-slashes
ada_really_inline bool next_is_authority_slashes();

std::u32string_view input;
std::vector<token> token_list;
size_t component_start = 0;
Review comment (Member):
Suggested change
- size_t component_start = 0;
+ size_t component_start{0};

size_t token_index = 0;
size_t token_increment = 0;
size_t group_depth = 0;
size_t hostname_ipv6_bracket_depth = 0;
bool protocol_matches_special_scheme_flag = false;
urlpattern_init result = urlpattern_init();
PARSER_STATE state = PARSER_STATE::INIT;
};

// https://wicg.github.io/urlpattern/#parse-a-constructor-string
urlpattern_init parse_contructor_string(std::u32string_view input);

} // namespace ada::urlpattern

#endif
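
Note (illustrative sketch, not part of the PR): `change_state`, `rewind` and `rewind_and_set_state` only touch the parser's bookkeeping fields, so they can be sketched in isolation. The version below follows my reading of the linked spec steps and leaves out the part of "change state" that stores the finished component string into the result. `state_sketch` and `parser_sketch` are hypothetical names, and the field defaults mirror the diff.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, trimmed-down copy of PARSER_STATE.
enum class state_sketch : std::uint8_t { init, protocol, pathname, search, done };

struct parser_sketch {
  std::size_t component_start = 0;
  std::size_t token_index = 0;
  std::size_t token_increment = 0;
  state_sketch state = state_sketch::init;

  // Simplified "change state": recording the component string is omitted.
  void change_state(state_sketch new_state, std::size_t skip) {
    state = new_state;
    token_index += skip;            // skip the tokens that acted as the delimiter
    component_start = token_index;  // the next component begins here
    token_increment = 0;            // do not advance again on this iteration
  }

  // "rewind": go back to the start of the current component.
  void rewind() {
    token_index = component_start;
    token_increment = 0;
  }

  // "rewind and set state": rewind, then switch state without skipping.
  void rewind_and_set_state(state_sketch new_state) {
    rewind();
    state = new_state;
  }
};
```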
25 changes: 25 additions & 0 deletions include/ada/urlpattern_internals.h
@@ -0,0 +1,25 @@
#ifndef ADA_URLPATTERN_INTERNALS_H
#define ADA_URLPATTERN_INTERNALS_H

#include "ada/urlpattern_base.h"
#include <string_view>
#include "regex"
#include "vector"

namespace ada::urlpattern {
// https://wicg.github.io/urlpattern/#component
struct urlpattern_component {
Review comment (Member):
Since this is already under ada::urlpattern namespace, why not call this struct component?

std::u32string_view pattern_string;
// TODO: use a more performant lib, eg RE2
std::regex regular_expression;
std::vector<std::string_view> group_name_list;
};

// https://wicg.github.io/urlpattern/#compile-a-component
std::string_view compile_component(std::u32string_view input,
std::function<std::u32string_view> &callback,
u32urlpattern_options &options);

} // namespace ada::urlpattern

#endif
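
Note (illustrative sketch, not part of the PR): `compile_component` takes an encoding callback as a `std::function`. `std::function` expects a function type, meaning a return type plus a parameter list, as its template argument. The snippet below shows one way such a callback could be spelled. The alias `encoding_callback_sketch`, the ASCII-only `ascii_lowercase` stand-in and `apply_callback` are all hypothetical; the real canonicalization callbacks are expected to do considerably more.

```cpp
#include <functional>
#include <string>
#include <string_view>

// Hypothetical alias: a callback from a UTF-32 view to an owned UTF-32 string.
using encoding_callback_sketch =
    std::function<std::u32string(std::u32string_view)>;

// Hypothetical stand-in for a canonicalization step: lower-cases ASCII letters
// and leaves every other code point untouched.
inline std::u32string ascii_lowercase(std::u32string_view input) {
  std::u32string out(input);
  for (char32_t& c : out) {
    if (c >= U'A' && c <= U'Z') c += U'a' - U'A';
  }
  return out;
}

// Applies the callback if one is set; otherwise returns a copy of the input.
inline std::u32string apply_callback(const encoding_callback_sketch& callback,
                                     std::u32string_view input) {
  return callback ? callback(input) : std::u32string(input);
}
```

The sketch returns an owning `std::u32string` rather than a view, since a view into a string built inside the callback would dangle once the callback returns.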
83 changes: 83 additions & 0 deletions include/ada/urlpattern_pattern_parser.h
@@ -0,0 +1,83 @@
#ifndef ADA_URLPATTERN_PATTERN_PARSER_H
#define ADA_URLPATTERN_PATTERN_PARSER_H

#include "ada/helpers.h"

#include "ada/urlpattern_base.h"
#include "ada/urlpattern_tokenizer.h"

#include <string>
#include <vector>

namespace ada::urlpattern {

enum class PART_TYPE : uint8_t {
FIXED_TEXT,
REGEXP,
SEGMENT_WILDCARD,
FULL_WILDCARD
};
enum class PART_MODIFIER : uint8_t {
NONE,
OPTIONAL,
ZERO_OR_MORE,
ONE_OR_MORE
};

// https://wicg.github.io/urlpattern/#part
struct part {
PART_TYPE type;
PART_MODIFIER modifier;
std::u32string_view value;
std::u32string_view name{};
std::u32string_view prefix{};
std::u32string_view suffix{};
};

// https://wicg.github.io/urlpattern/#pattern-parser
struct pattern_parser {
ada_really_inline pattern_parser(
std::function<std::u32string_view(std::u32string_view)> &encoding,
std::u32string_view wildcard_regexp);

// https://wicg.github.io/urlpattern/#try-to-consume-a-token
ada_really_inline std::optional<token *> try_to_consume_token(
TOKEN_TYPE type);

// https://wicg.github.io/urlpattern/#try-to-consume-a-modifier-token
ada_really_inline std::optional<token *> try_to_consume_modifier_token();

// https://wicg.github.io/urlpattern/#try-to-consume-a-regexp-or-wildcard-token
ada_really_inline std::optional<token *>
try_to_consume_regexp_or_wildcard_token(std::optional<token *> &name_token);

// https://wicg.github.io/urlpattern/#maybe-add-a-part-from-the-pending-fixed-value
ada_really_inline void maybe_add_part_from_pendind_fixed_value();

// https://wicg.github.io/urlpattern/#add-a-part
ada_really_inline void add_part(
std::u32string_view prefix, std::optional<token *> &name_token,
std::optional<token *> &regexp_or_wildcard_token,
std::u32string_view suffix, std::optional<token *> &modifier_token);

// https://wicg.github.io/urlpattern/#is-a-duplicate-name
ada_really_inline bool is_duplicate_name(std::u32string_view name);

std::vector<token> token_list;
std::function<std::u32string_view(std::u32string_view)> encoding_callback;
std::u32string segment_wildcard_regexp;
std::u32string pending_fixed_value{};
std::u32string full_wildcard_regexp_value = U".*";
size_t index = 0;
size_t next_numeric_name = 0;
std::vector<part> part_list{};
};

// https://wicg.github.io/urlpattern/#parse-a-pattern-string
std::vector<part> parse_pattern_string(
std::u32string_view input, u32urlpattern_options &options,
std::function<std::string_view(std::u32string_view)> &encoding);

} // namespace ada::urlpattern

#endif
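
Note (illustrative sketch, not part of the PR): `try_to_consume_token` is the primitive the other `try_to_consume_*` helpers build on. It looks at the token under the cursor, and only if its type matches does it advance and hand the token back. The sketch below uses a hypothetical minimal token model (`token_type_sketch`, `token_sketch`), because urlpattern_tokenizer.h is not shown here. It also returns `nullptr` when the cursor is out of range, a defensive choice of mine where the spec asserts instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for TOKEN_TYPE and token.
enum class token_type_sketch : std::uint8_t { CHAR, NAME, REGEXP, ASTERISK, END };

struct token_sketch {
  token_type_sketch type;
  std::size_t index;
};

struct pattern_parser_sketch {
  std::vector<token_sketch> token_list;
  std::size_t index = 0;

  // Returns the token under the cursor and advances if its type matches;
  // otherwise returns nullptr and leaves the cursor where it was.
  const token_sketch* try_to_consume_token(token_type_sketch type) {
    if (index >= token_list.size()) return nullptr;
    const token_sketch& next = token_list[index];
    if (next.type != type) return nullptr;
    ++index;
    return &next;
  }
};
```

A plain pointer already has an "absent" state, so the sketch returns `const token_sketch*` directly; the `std::optional<token *>` used in the diff has two distinct empty representations (`std::nullopt` and a null pointer).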