feat: encoding/utf16 #2743

peter-jerry-ye · 2025-09-18T10:48:20Z

This PR adds UTF16 encoding/decoding, and deprecate any other String to Bytes conversions as they are misleading.

coveralls · 2025-09-18T10:50:47Z

Pull Request Test Coverage Report for Build 1636

Details

65 of 83 (78.31%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.09%) to 89.775%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
encoding/utf16/decode.mbt	46	64	71.88%

Totals
Change from base Build 1635:	-0.09%
Covered Lines:	9842
Relevant Lines:	10963

💛 - Coveralls

bobzhang · 2025-10-12T02:41:34Z

encoding/utf16le/pkg.generated.mbti

+fn decode_lossy(BytesView, ignore_bom? : Bool) -> String
+
+fn encode(StringView, bom? : Bool) -> Bytes
+


fn encode(StringView, bom? : Bool, endianness : Endian = Little) -> Bytes

This will likely double the code size, introduce unnecessary branching.

bobzhang · 2025-10-27T02:37:52Z

string/methods.mbt

+#deprecated("Use `@encoding/utf16le.encode` instead")
+#coverage.skip
 pub fn StringView::to_bytes(self : StringView) -> Bytes {
  let array = FixedArray::make(self.length() * 2, Byte::default())


in most cases, this is not what we need, encode is mostly what we need

peter-jerry-ye-code-review · 2025-10-28T02:55:42Z

Potential buffer overflow in big-endian decode when converting surrogate pairs

Category
Correctness
Code Snippet
string_bytes[i] = (higher & 0xFF).to_byte()
string_bytes[i + 1] = (higher >> 8).to_byte()
string_bytes[i + 2] = (lower & 0xFF).to_byte()
string_bytes[i + 3] = (lower >> 8).to_byte()
i += 4
Recommendation
Add bounds checking before writing to string_bytes array, or pre-calculate the required buffer size considering surrogate pairs
Reasoning
The code assumes each surrogate pair needs 4 bytes but the initial buffer is allocated as bytes.length() which may not account for the expansion from UTF-16 surrogate pairs to the internal string representation

Inefficient byte-by-byte copying in big-endian encoding

Category
Performance
Code Snippet
for i in 0..<str.length() {
let code_unit = str[i]
arr[i * 2] = (code_unit >> 8).to_byte()
arr[i * 2 + 1] = (code_unit & 0xFF).to_byte()
}
Recommendation
Use bulk memory operations or SIMD instructions when available, similar to how little-endian uses blit_from_string
Reasoning
The big-endian path manually loops through each character and performs bit operations, while little-endian uses optimized blit_from_string. This creates a performance disparity between endianness modes

Duplicated BOM handling and validation logic between decode functions

Category
Maintainability
Code Snippet
let bytes = if ignore_bom {
if endianness is Little && bytes is [.. "\xff\xfe", .. rest] {
rest
} else if endianness is Big && bytes is [.. "\xfe\xff", .. rest] {
rest
} else {
bytes
}
} else {
bytes
}
Recommendation
Extract BOM handling into a separate helper function: fn strip_bom(bytes: BytesView, endianness: Endian) -> BytesView
Reasoning
The exact same BOM stripping logic is duplicated in both decode and decode_lossy functions, violating DRY principle and making maintenance harder

peter-jerry-ye requested review from Yu-zh and bobzhang September 18, 2025 10:48

peter-jerry-ye linked an issue Sep 19, 2025 that may be closed by this pull request

change String::to_bytes() to String::encode() #2616

Closed

bobzhang reviewed Oct 12, 2025

View reviewed changes

peter-jerry-ye force-pushed the zihang/encodings branch 2 times, most recently from 142c3d4 to 5a46252 Compare October 17, 2025 02:40

bobzhang force-pushed the zihang/encodings branch from 37a44a6 to 0bd00a1 Compare October 20, 2025 08:48

bobzhang reviewed Oct 27, 2025

View reviewed changes

peter-jerry-ye added 6 commits October 28, 2025 10:52

feat: encoding/utf16

eff759f

chore: deprecate string->bytes

fd0f59a

chore: fmt

ebda798

refactor: move from utf16le to utf16

bfd44f3

feat: add utf16be support

fcff525

chore: update deprecation message

ef81bec

peter-jerry-ye force-pushed the zihang/encodings branch from 0bd00a1 to ef81bec Compare October 28, 2025 02:55

peter-jerry-ye enabled auto-merge (rebase) October 28, 2025 02:56

peter-jerry-ye merged commit ddbf0fa into main Oct 28, 2025
10 checks passed

peter-jerry-ye deleted the zihang/encodings branch October 28, 2025 03:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat: encoding/utf16 #2743

feat: encoding/utf16 #2743

peter-jerry-ye commented Sep 18, 2025

Uh oh!

coveralls commented Sep 18, 2025 •

edited

Loading

Uh oh!

bobzhang Oct 12, 2025

Uh oh!

peter-jerry-ye Oct 17, 2025

Uh oh!

bobzhang Oct 27, 2025

Uh oh!

peter-jerry-ye-code-review bot commented Oct 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		fn decode_lossy(BytesView, ignore_bom? : Bool) -> String

		fn encode(StringView, bom? : Bool) -> Bytes

Uh oh!

feat: encoding/utf16 #2743

feat: encoding/utf16 #2743

Conversation

peter-jerry-ye commented Sep 18, 2025

Uh oh!

coveralls commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 1636

Details

💛 - Coveralls

Uh oh!

bobzhang Oct 12, 2025

Choose a reason for hiding this comment

Uh oh!

peter-jerry-ye Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

bobzhang Oct 27, 2025

Choose a reason for hiding this comment

Uh oh!

peter-jerry-ye-code-review bot commented Oct 28, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coveralls commented Sep 18, 2025 •

edited

Loading