You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository was archived by the owner on Aug 5, 2024. It is now read-only.
When the function diff_match_patch.prototype.patch_addContext_ adds context to a patch, it increments/decreases the index by a constant, Patch_Margin = 4. However, since JavaScript's substring function operates with UTF-16 code unit indexing, there's a chance that Patch_Margin may split a Unicode surrogate pair.
The output is "\uddee **" (🧮 corresponds to "\ud83e\uddee").
If you attempt to use diff_match_patch.patch_obj.prototype.toString on this patch, it leads to a crash. encodeURI will throw a URIError if URI contains a lone surrogate.
importdiff_match_patchfrom"diff-match-patch";constdiff=newdiff_match_patch();console.log(JSON.stringify(diff.patch_toText(diff.patch_make("🧮 **a","🧮 **"))// URIError: URI malformed));
A straightforward solution might involve adding a verification step after applying Patch_Margin to ensure the indices remain valid. I can start a PR, but I've noticed that Patch_Margin is used in many places, and I'm unsure about the best way to make changes.
this is a good catch, @kkshinkai - I think I've seen this reported elsewhere too but I can't find it now. there should be a relatively easy operation here, which is to backup and include a previous code unit if we're on a split surrogate.
this oversimplifies it because we could start a document on a low surrogate, and then we'd need to cut it out, but I think all that has to happen is include up to one additional character. it's not changing the contract of making a margin either because that additional code unit is still part of a character we've already expanded into.
happy to review any patch you want. I haven't looked at Patch_Margin yet, but maybe we could handle this better by creating a helper function to return a margin length.
in other words, a get_margin_before( offset, desired_margin ) and get_margin_after( offset, desired_margin ) function so that everything which currently gets a static margin can get one that won't break URIEncode.
Does it mean that the library is not yet safe to use with Emoji?
Yes and no. It works for many inputs, but we need to merge in the fixes to eliminate splitting surrogate pairs when generating these outputs.
diff-match-patch doesn't particularly care about Unicode, which is where some of the issues creep in. It's the URL-encoding that cares and crashes or produces corruption, such as the Unicode replacement character � when it should be an Emoji that was split in half.
It's not only Emoji that are affected, of course, but Emoji are the most notable code points. Everything outside the Basic Multilingual Plane will experience this problem.
There's another sneaky problem for deleted spans of text when using versions of diff-match-patch from different programming languages (and in some cases, builds of a programming language, like Python2 compiled with and without wide character support). They may report different lengths because each library represents the basic meaning of character length as returned from functions like String.length, and those lengths differ.
In #80 I've fixed the delta format we use in Simplenote, but that's not merged into the main code. I also keep saying I'm going to fork this repo, but without any official announcement in the README that this library is unmaintained it seems silly to maintain a second copy that nobody will find 🤷♂️. You are free to use that PR's branch and if you propose a new PR to fix the add_context functions I'll be happy to review the work and provide feedback.
Thanks for your work dmsnell, you went deep on this! I'm not sure I've wrapped my head around all that's being discussed, but I used your version here and it seems to fix my problem. So thank you. I'll keep an eye out for further updates.
Activity
dmsnell commentedon Oct 28, 2023
this is a good catch, @kkshinkai - I think I've seen this reported elsewhere too but I can't find it now. there should be a relatively easy operation here, which is to backup and include a previous code unit if we're on a split surrogate.
this oversimplifies it because we could start a document on a low surrogate, and then we'd need to cut it out, but I think all that has to happen is include up to one additional character. it's not changing the contract of making a margin either because that additional code unit is still part of a character we've already expanded into.
happy to review any patch you want. I haven't looked at
Patch_Margin
yet, but maybe we could handle this better by creating a helper function to return a margin length.in other words, a
get_margin_before( offset, desired_margin )
andget_margin_after( offset, desired_margin )
function so that everything which currently gets a static margin can get one that won't break URIEncode.jcubic commentedon Dec 20, 2023
Does it mean that the library is not yet safe to use with Emoji?
keizo commentedon Jan 4, 2024
I think I've just run into this issue. Not sure yet, but seems to make sense. Will report back if I learn anything on closer inspection.
dmsnell commentedon Jan 4, 2024
Yes and no. It works for many inputs, but we need to merge in the fixes to eliminate splitting surrogate pairs when generating these outputs.
diff-match-patch
doesn't particularly care about Unicode, which is where some of the issues creep in. It's the URL-encoding that cares and crashes or produces corruption, such as the Unicode replacement character�
when it should be an Emoji that was split in half.It's not only Emoji that are affected, of course, but Emoji are the most notable code points. Everything outside the Basic Multilingual Plane will experience this problem.
There's another sneaky problem for deleted spans of text when using versions of
diff-match-patch
from different programming languages (and in some cases, builds of a programming language, like Python2 compiled with and without wide character support). They may report different lengths because each library represents the basic meaning of character length as returned from functions likeString.length
, and those lengths differ.In #80 I've fixed the delta format we use in Simplenote, but that's not merged into the main code. I also keep saying I'm going to fork this repo, but without any official announcement in the README that this library is unmaintained it seems silly to maintain a second copy that nobody will find 🤷♂️. You are free to use that PR's branch and if you propose a new PR to fix the
add_context
functions I'll be happy to review the work and provide feedback.keizo commentedon Jan 4, 2024
Thanks for your work dmsnell, you went deep on this! I'm not sure I've wrapped my head around all that's being discussed, but I used your version here and it seems to fix my problem. So thank you. I'll keep an eye out for further updates.