Skip to content

Return error on invalid unicode sequences #14666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 27, 2025

Conversation

lukaszsamson
Copy link
Contributor

I revisited #14589 and attempted to make the tokenizer and parser API more consistent. Code.string_to_quoted should not raise for valid string input. If the string has parsing errors it should always return an error tuple.

I collected all the cases where invalid UTF byte sequence may be used

cases = [
  # binstring
  "\"\\xFF\"",
  "\"\\xFF\#{some()}\"",
  # charlist
  "'\\xFF'",
  "'\\xFF\#{some()}'",
  # bin heredoc
  "\"\"\"\n\\xFF\n\"\"\"",
  "\"\"\"\n\\xFF\#{some()}\n\"\"\"",
  # list heredoc
  "'''\n\\xFF\n'''",
  "'''\n\\xFF\#{some()}\n'''",
  # quoted atom
  ":\"\\xFF\"",
  ":'\\xFF'",
  ":\"\\xFF\#{some()}\"",
  ":'\\xFF\#{some()}'",
  # quoted keyword identifier
  "[\"\\xFF\": 1]",
  "['\\xFF': 1]",
  "[\"\\xFF\#{some()}\": 1]",
  "['\\xFF\#{some()}': 1]",
  # quoted dot call
  "Foo.\"\\xFF\"",
  "Foo.'\\xFF'",
  "Foo.\"\\xFF\#{some()}\"",
  "Foo.'\\xFF\#{some()}'",
  # sigil
  "~s\"\\xFF\"",
  "~s'\\xFF'",
  "~s\"\\xFF\#{some()}\"",
  "~s'\\xFF\#{some()}'",
  "~S\"\\xFF\"",
  "~S'\\xFF'",
  "~S\"\\xFF\#{some()}\"",
  "~S'\\xFF\#{some()}'",
]

Before this PR the tokenizer used to raise on quoted atom/kw/dot call. The parser used to raise on list string and list heredoc.

Note that invalid byte sequences are still allowed in:

  • bin string (AST node and token has raw binary)
  • bin heredoc (AST node and token has raw binary)
  • sigil (AST node and token has raw binary, may crash at runtime)
  • anything with interpolation (AST has function calls with raw binary arg, will crash at runtime)

@josevalim josevalim merged commit 23776d9 into elixir-lang:main Jul 27, 2025
13 checks passed
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

error:#{'__struct__' := 'Elixir.UnicodeConversionError', message := Message} ->
Sigil = [$~, S, H],
Message = " (for sigil ~ts starting at line ~B)",
interpolation_error(Message, [$~] ++ SigilName ++ Original, Scope, Tokens, Message, [Sigil, Line], Line, Column, [H], [sigil_terminator(H)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there be a test added for this change? Dialyzer says this function call will fail, the first argument of interpolation_error should be a 5 element tuple

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I will revert it for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially because this is wrapping a large chunk of code instead of a small one, similar to others.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try catch is not needed here. It calls tokens_to_binary which may throw on liststring but it will only get binary args here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants