Return error on invalid unicode sequences #14666

lukaszsamson · 2025-07-23T09:04:09Z

I revisited #14589 and attempted to make the tokenizer and parser API more consistent. Code.string_to_quoted should not raise for valid string input; if the string has parsing errors, it should always return an error tuple.
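As a rough illustration of that contract, here is a minimal sketch that maps the two documented return shapes of Code.string_to_quoted/1 onto tags. The module name ParseContract and the function describe/1 are hypothetical and exist only for this example; they are not part of Elixir or of this PR.

```elixir
defmodule ParseContract do
  # Hypothetical helper for illustration only.
  # Code.string_to_quoted/1 is documented to return either
  # {:ok, ast} or {:error, {meta, message_info, token}}.
  def describe(source) when is_binary(source) do
    case Code.string_to_quoted(source) do
      {:ok, ast} -> {:parsed, ast}
      {:error, {meta, _message_info, _token}} -> {:rejected, meta}
    end
  end
end

# Valid source goes down the :parsed branch.
IO.inspect(ParseContract.describe("1 + 2"))

# An incomplete expression is reported as data, not as an exception.
IO.inspect(ParseContract.describe("1 +"))
```

A caller written this way has no clause for exceptions, so any input that makes the function raise instead of returning a tuple would crash it.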

I collected all the cases where an invalid UTF-8 byte sequence may be used:

cases = [
  # binstring
  "\"\\xFF\"",
  "\"\\xFF\#{some()}\"",
  # charlist
  "'\\xFF'",
  "'\\xFF\#{some()}'",
  # bin heredoc
  "\"\"\"\n\\xFF\n\"\"\"",
  "\"\"\"\n\\xFF\#{some()}\n\"\"\"",
  # list heredoc
  "'''\n\\xFF\n'''",
  "'''\n\\xFF\#{some()}\n'''",
  # quoted atom
  ":\"\\xFF\"",
  ":'\\xFF'",
  ":\"\\xFF\#{some()}\"",
  ":'\\xFF\#{some()}'",
  # quoted keyword identifier
  "[\"\\xFF\": 1]",
  "['\\xFF': 1]",
  "[\"\\xFF\#{some()}\": 1]",
  "['\\xFF\#{some()}': 1]",
  # quoted dot call
  "Foo.\"\\xFF\"",
  "Foo.'\\xFF'",
  "Foo.\"\\xFF\#{some()}\"",
  "Foo.'\\xFF\#{some()}'",
  # sigil
  "~s\"\\xFF\"",
  "~s'\\xFF'",
  "~s\"\\xFF\#{some()}\"",
  "~s'\\xFF\#{some()}'",
  "~S\"\\xFF\"",
  "~S'\\xFF'",
  "~S\"\\xFF\#{some()}\"",
  "~S'\\xFF\#{some()}'",
]

Before this PR, the tokenizer raised on quoted atoms, quoted keyword identifiers, and quoted dot calls. The parser raised on charlists and charlist heredocs.
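One way to check this behaviour on a given Elixir version is a small harness that records whether each input produced a tuple or an exception. This is only a sketch: InvalidBytesCheck, outcome/1, and report/1 are hypothetical names, the three samples are copied from the list above, and the results will differ depending on whether the version under test includes this change.

```elixir
defmodule InvalidBytesCheck do
  # Hypothetical harness, not part of the PR.
  # Returns :ok or :error when a tuple comes back,
  # or {:raised, ExceptionModule} when the call raises.
  def outcome(source) when is_binary(source) do
    try do
      case Code.string_to_quoted(source) do
        {:ok, _ast} -> :ok
        {:error, _reason} -> :error
      end
    rescue
      exception -> {:raised, exception.__struct__}
    end
  end

  # Pairs every source string with its outcome.
  def report(sources), do: Enum.map(sources, &{&1, outcome(&1)})
end

# A few entries from the list above: charlist, quoted atom, quoted dot call.
samples = [
  "'\\xFF'",
  ":\"\\xFF\"",
  "Foo.\"\\xFF\""
]

samples
|> InvalidBytesCheck.report()
|> Enum.each(&IO.inspect/1)
```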

Note that invalid byte sequences are still allowed in:

bin string (AST node and token have the raw binary)
bin heredoc (AST node and token have the raw binary)
sigil (AST node and token have the raw binary, may crash at runtime)
anything with interpolation (AST has function calls with a raw binary argument, will crash at runtime)
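To make "raw binary" concrete, the sketch below first shows that the escape \xFF yields a single byte that is not valid UTF-8, then parses the bin string case from the list above and inspects what comes back. The variable names are illustrative, and the shape of the parse result is taken from the description above rather than guaranteed for every Elixir version, which is why the fallback branch is there.

```elixir
# "\xFF" denotes the single byte 255, which is not a valid UTF-8 sequence.
raw = "\xFF"
IO.inspect(byte_size(raw), label: "byte size")
IO.inspect(String.valid?(raw), label: "valid UTF-8?")

# The same escape as source text: a quoted string containing \xFF.
source = "\"\\xFF\""

case Code.string_to_quoted(source) do
  {:ok, ast} when is_binary(ast) ->
    # Assumption from the description above: the AST node is the raw binary.
    IO.inspect(ast, binaries: :as_binaries, label: "AST")
    IO.inspect(String.valid?(ast), label: "AST is valid UTF-8?")

  other ->
    IO.inspect(other, label: "other result")
end
```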

josevalim · 2025-07-27T17:04:20Z

💚 💙 💜 💛 ❤️

michallepicki · 2025-07-28T06:17:00Z

lib/elixir/src/elixir_tokenizer.erl

+        error:#{'__struct__' := 'Elixir.UnicodeConversionError', message := Message} ->
+          Sigil = [$~, S, H],
+          Message = " (for sigil ~ts starting at line ~B)",
+          interpolation_error(Message, [$~] ++ SigilName ++ Original, Scope, Tokens, Message, [Sigil, Line], Line, Column, [H], [sigil_terminator(H)])


Shouldn't there be a test added for this change? Dialyzer says this function call will fail: the first argument of interpolation_error should be a 5-element tuple.

josevalim · 2025-07-28

Thank you, I will revert it for now.

josevalim · 2025-07-28

Especially because this wraps a large chunk of code instead of a small one, as the others do.

lukaszsamson · 2025-07-28

The try/catch is not needed here. It calls tokens_to_binary, which may throw on a list string, but it only receives binary arguments here.

Return error on invalid unicode sequences

29973cb

josevalim merged commit 23776d9 into elixir-lang:main Jul 27, 2025
13 checks passed

josevalim pushed a commit that referenced this pull request Jul 27, 2025

Return error on invalid unicode sequences (#14666)

71e1ddc

michallepicki reviewed Jul 28, 2025
