Skip to content

Conversation

@peter-jerry-ye
Copy link
Collaborator

This PR adds UTF16 encoding/decoding, and deprecate any other String to Bytes conversions as they are misleading.

@coveralls
Copy link
Collaborator

coveralls commented Sep 18, 2025

Pull Request Test Coverage Report for Build 1636

Details

  • 65 of 83 (78.31%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.09%) to 89.775%

Changes Missing Coverage Covered Lines Changed/Added Lines %
encoding/utf16/decode.mbt 46 64 71.88%
Totals Coverage Status
Change from base Build 1635: -0.09%
Covered Lines: 9842
Relevant Lines: 10963

💛 - Coveralls

@peter-jerry-ye peter-jerry-ye linked an issue Sep 19, 2025 that may be closed by this pull request
fn decode_lossy(BytesView, ignore_bom? : Bool) -> String

fn encode(StringView, bom? : Bool) -> Bytes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fn encode(StringView, bom? : Bool, endianness : Endian = Little) -> Bytes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will likely double the code size, introduce unnecessary branching.

@peter-jerry-ye peter-jerry-ye force-pushed the zihang/encodings branch 2 times, most recently from 142c3d4 to 5a46252 Compare October 17, 2025 02:40
#deprecated("Use `@encoding/utf16le.encode` instead")
#coverage.skip
pub fn StringView::to_bytes(self : StringView) -> Bytes {
let array = FixedArray::make(self.length() * 2, Byte::default())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in most cases, this is not what we need, encode is mostly what we need

@peter-jerry-ye-code-review
Copy link

Potential buffer overflow in big-endian decode when converting surrogate pairs

Category
Correctness
Code Snippet
string_bytes[i] = (higher & 0xFF).to_byte()
string_bytes[i + 1] = (higher >> 8).to_byte()
string_bytes[i + 2] = (lower & 0xFF).to_byte()
string_bytes[i + 3] = (lower >> 8).to_byte()
i += 4
Recommendation
Add bounds checking before writing to string_bytes array, or pre-calculate the required buffer size considering surrogate pairs
Reasoning
The code assumes each surrogate pair needs 4 bytes but the initial buffer is allocated as bytes.length() which may not account for the expansion from UTF-16 surrogate pairs to the internal string representation

Inefficient byte-by-byte copying in big-endian encoding

Category
Performance
Code Snippet
for i in 0..<str.length() {
let code_unit = str[i]
arr[i * 2] = (code_unit >> 8).to_byte()
arr[i * 2 + 1] = (code_unit & 0xFF).to_byte()
}
Recommendation
Use bulk memory operations or SIMD instructions when available, similar to how little-endian uses blit_from_string
Reasoning
The big-endian path manually loops through each character and performs bit operations, while little-endian uses optimized blit_from_string. This creates a performance disparity between endianness modes

Duplicated BOM handling and validation logic between decode functions

Category
Maintainability
Code Snippet
let bytes = if ignore_bom {
if endianness is Little && bytes is [.. "\xff\xfe", .. rest] {
rest
} else if endianness is Big && bytes is [.. "\xfe\xff", .. rest] {
rest
} else {
bytes
}
} else {
bytes
}
Recommendation
Extract BOM handling into a separate helper function: fn strip_bom(bytes: BytesView, endianness: Endian) -> BytesView
Reasoning
The exact same BOM stripping logic is duplicated in both decode and decode_lossy functions, violating DRY principle and making maintenance harder

@peter-jerry-ye peter-jerry-ye enabled auto-merge (rebase) October 28, 2025 02:56
@peter-jerry-ye peter-jerry-ye merged commit ddbf0fa into main Oct 28, 2025
10 checks passed
@peter-jerry-ye peter-jerry-ye deleted the zihang/encodings branch October 28, 2025 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

change String::to_bytes() to String::encode()

3 participants