Allow all entities from HTML5 spec. #76
Hey /u/nandhp, thanks for the PR! The gperf changes look good, but I'm going to have to think about this one a bit. Up until now we've tried to restrict snudown's output to valid XHTML1 for a few reasons. One is that we use snudown's output verbatim in RSS feeds, and even though XHTML1's named entities aren't valid in either RSS or Atom, they're well-supported by most feed readers. The other is that we currently support browsers and parsers that don't understand that |
Here's a thought. Reddit must perform transformations (Snudown -> min(X{,HT}ML)) on the data users submit in order to present it for display. What if " |
Yep, doing that in gperf would make sense. Right now we're just using gperf to check for membership in a set, but I believe it could be changed to do an |
Converting non-XHTML entities to numerics was actually pretty easy to implement, since the JSON from W3 includes the codepoints for each entity. I'm not quite sure why there's two output paths -- When/if you are ready to accept this pull request, let me know and I'll squash these commits together so that the |
assert(is_valid_numeric_entity(entity_val));
/* Render codepoint to an entity. */
entitystr_len = snprintf(entitystr, entitystr_size, "&#x%X;", entity_val);
assert(entitystr_len < entitystr_size);
👍 for checked snprintf
Actually, that check is mostly for form, since I can't figure out how to get the code compiled with asserts enabled. setup.py build --debug still passes -DNDEBUG to gcc.
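To make the consequence of -DNDEBUG concrete, here is a minimal sketch (not code from this PR) showing that an assert and any side effect inside it disappear when NDEBUG is defined. The two probe functions are hypothetical names invented for illustration. The sketch relies on the standard C rule that `assert` is redefined according to the current state of NDEBUG each time `<assert.h>` is included.

```c
/* Release-style section: NDEBUG is defined before <assert.h> is included,
 * which is the situation -DNDEBUG on the gcc command line creates. */
#define NDEBUG
#include <assert.h>

/* Hypothetical probe: reports whether the expression inside assert() ran. */
static int assert_runs_with_ndebug(void)
{
    int ran = 0;
    assert((ran = 1)); /* expands to ((void)0); the assignment never happens */
    return ran;
}

/* Debug-style section: re-including <assert.h> re-evaluates NDEBUG. */
#undef NDEBUG
#include <assert.h>

/* Hypothetical probe: same body, but assert() is live here. */
static int assert_runs_without_ndebug(void)
{
    int ran = 0;
    assert((ran = 1)); /* evaluated; the condition is true, so no abort */
    return ran;
}
```

The practical upshot is that a bounds check expressed only as an assert offers no protection in a build that passes -DNDEBUG.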
Hmm, in other places I'd do something like
if (entitystr_len >= entitystr_size) {
assert(0);
return;
}
but that could silently leave the output buffer with just one codepoint appended... I'm not sure what standard practice is for handling fatal errors in production C these days. Printing to stderr and intentionally dereferencing NULL? @spladug Any preference here?
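One possible answer, sketched under assumptions rather than taken from the PR: keep an assert for debug builds, but also return an error code that survives -DNDEBUG, and leave the scratch buffer empty on failure so a caller cannot append a half-written entity. The helper name `render_hex_entity` and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes "&#xHEX;" for `codepoint` into `out`.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if formatting failed or the result did not fit. */
static int render_hex_entity(char *out, size_t out_size, uint32_t codepoint)
{
    int len;

    /* Debug-only sanity check; compiled out under -DNDEBUG. */
    assert(out != NULL && out_size > 0);

    len = snprintf(out, out_size, "&#x%lX;", (unsigned long)codepoint);

    /* snprintf returns the length it *wanted* to write, so a value
     * >= out_size means the output was truncated. */
    if (len < 0 || (size_t)len >= out_size) {
        out[0] = '\0'; /* never expose a partial entity */
        return -1;
    }
    return len;
}
```

A caller would check for -1 and skip the write to the output buffer entirely, which avoids the "one codepoint appended" state described above without needing to crash the process.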
Ahhhh, I see what I was missing now. XHTML 1 entities will be output verbatim as before, but HTML5 entities will be output hex-encoded.
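The behavior described in that comment can be sketched as follows. This is an illustration under assumptions, not the PR's code: the table, the struct layout, and `render_named_entity` are hypothetical, a linear search stands in for the gperf-generated lookup, and each entity is assumed to map to a single codepoint (some HTML5 entities map to two).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical table entry; the real lookup is generated by gperf. */
struct entity_info {
    const char *name;   /* entity as written, e.g. "&amp;" */
    int xhtml1;         /* nonzero if XHTML 1 defines this name */
    uint32_t codepoint; /* single-codepoint simplification */
};

static const struct entity_info ENTITY_TABLE[] = {
    { "&amp;",   1, 0x26   },
    { "&nbsp;",  1, 0xA0   },
    { "&check;", 0, 0x2713 }, /* HTML5-only name */
};

/* Writes the rendered form of `entity` into `out`.
 * Returns the length written, or -1 if unknown or too long. */
static int render_named_entity(char *out, size_t out_size, const char *entity)
{
    size_t i;

    assert(out != NULL && entity != NULL);

    for (i = 0; i < sizeof ENTITY_TABLE / sizeof ENTITY_TABLE[0]; i++) {
        const struct entity_info *e = &ENTITY_TABLE[i];
        int len;

        if (strcmp(e->name, entity) != 0)
            continue;

        if (e->xhtml1) /* XHTML 1 name: pass through verbatim */
            len = snprintf(out, out_size, "%s", e->name);
        else           /* HTML5-only name: emit a hex numeric entity */
            len = snprintf(out, out_size, "&#x%lX;",
                           (unsigned long)e->codepoint);

        return (len >= 0 && (size_t)len < out_size) ? len : -1;
    }
    return -1; /* not an allowed entity */
}
```

With this table, `&amp;` comes out unchanged while `&check;` comes out as `&#x2713;`, so consumers that only understand XHTML 1 names never see an HTML5-only name.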
Overall, this looks pretty good! My only other thought is that the function's flow is a bit hard to read. Right now it's something like:
where, in between the validation and output steps for numeric entities and named entities that may be output verbatim, we have the verification and output for named entities that must be output as numeric entities. IMO it'd be better to have a clean split between the verification and output steps, like
👓 @JordanMilne @spladug (Feel free to duck out! Just thought this could use some extra C-understanding eyes.)
Hi, Thank you for taking such a detailed look at the code! Yes, the logic is a little convoluted. I suppose I did it that way to avoid excessive changes to the existing codepaths, as well as to minimize changes to the entities themselves. Since numeric entities have to be partially rewritten anyway (to normalize
Does that seem good? In this pseudocode, I use an entity value of |
IMO not changing the existing entities was the right approach. Some consumers of Snudown's rendered HTML do iffy things, like use hand-rolled HTML parsers and manually decode HTML entities like
Ehhh, that still feels a little confusing to me. The use of a magic value is a good indicator that the raw entity stuff doesn't really fit in the loop. Ultimately we have 3 types of entities we want to render: numeric entities, named entities which may be output verbatim, and named entities which must have their codepoints output as numeric entities. Making numeric entities share a code path with the named entity output stuff is a little confusing as well. I forgot to mention this, but the
So I'm thinking something like:
    if (rndr->cb.entity) {
        work.data = data;
        work.size = end;
        rndr->cb.entity(ob, &work, rndr->opaque);
    } else {
        // maybe instead of looking at entities[0] we could add an
        // `.output_encoded` member to the struct? The gperf table's
        // pretty small, so I doubt it'd cause any performance issues.
        if (resolved_entity && resolved_entity->entities[0]) {
            // ... loop over / output entities
        }
        /* Necessary so we can normalize `&#X3E;` to `&#x3E;` */
        bufputc(ob, '&');
        if (numeric)
            bufputc(ob, '#');
        if (hex)
            bufputc(ob, 'x');
        bufput(ob, data + content_start, end - content_start);
    }
IMO that'd be a bit clearer, and we keep the benefit of only having to do a single string write for numeric entities and entities which may be output verbatim.
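The prefix normalization at the end of that sketch can be shown in isolation. The following is a standalone approximation, assuming a plain character buffer instead of snudown's `struct buf`; `normalize_numeric_entity` is a hypothetical name, and the function only checks syntax, not whether the codepoint itself is allowed (the PR uses `is_valid_numeric_entity` for that).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies a numeric entity such as "&#X3E;" or
 * "&#62;" from data[0..size) into `out`, rewriting the prefix so a hex
 * marker is always lowercase 'x'. Returns the length written, or -1 if
 * the input is not a well-formed numeric entity or does not fit. */
static int normalize_numeric_entity(char *out, size_t out_size,
                                    const char *data, size_t size)
{
    size_t content_start = 2; /* index of the first digit */
    size_t i, n = 0, tail;
    int hex = 0;

    assert(out != NULL && data != NULL);

    if (size < 4 || data[0] != '&' || data[1] != '#' || data[size - 1] != ';')
        return -1;
    if (data[2] == 'x' || data[2] == 'X') {
        hex = 1;
        content_start = 3;
    }
    if (content_start >= size - 1)
        return -1; /* no digits between the prefix and ';' */

    for (i = content_start; i < size - 1; i++) {
        unsigned char c = (unsigned char)data[i];
        if (hex ? !isxdigit(c) : !isdigit(c))
            return -1;
    }

    tail = size - content_start; /* digits plus the trailing ';' */
    if (2 + (size_t)hex + tail + 1 > out_size)
        return -1;

    /* Rebuild the prefix by hand, then copy the rest in one write. */
    out[n++] = '&';
    out[n++] = '#';
    if (hex)
        out[n++] = 'x';
    memcpy(out + n, data + content_start, tail);
    n += tail;
    out[n] = '\0';
    return (int)n;
}
```

As in the reviewer's version, everything after the prefix is copied verbatim in a single write, so only the `X` versus `x` marker changes.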
By passing the input directly to
Yep, that seems sane to me. Only thing is I'd change |
I'm going to fuzz this and have someone else take a look to make sure that I didn't miss anything, but other than a couple nits it looks good to me! Thanks!
I renamed the lookup function. Let me know if you have any other feedback. I would also be happy to squish the commits down to just one or two, if for no other reason than the |
Hello,
This post in /r/RESissues reported that the list of allowed HTML entities was incomplete. To resolve this, I implemented code in setup.py to build the list of allowed entities dynamically from the JSON file released as part of the HTML5 specification.

Notes:
- [...] setup.py download the JSON file automatically would be a nice touch, but would violate the principle of least astonishment.
- [...] gperf and is not stored in a file.
- [...] setup.py script to respond to changes to the html_entities.h by rebuilding the extension, as expected. In addition, the header file will only be rebuilt if any of the files used to generate it change.
- [...] html_entities.gperf. However, it should have the advantage of less work for future updates (e.g. HTML5.1): simply update html_entities.json with the copy from the newer HTML specification.

Thank you for your consideration of this pull request.