juntae6942 commented Sep 13, 2025

Closes: #35426

Currently, HtmlUtils.htmlUnescape() does not correctly handle numeric character references for Unicode supplementary characters (e.g., emojis).

For example, an entity like &#128512; (😀) is incorrectly converted to the garbled character U+F600, because the code point U+1F600 is truncated to 16 bits.

Steps to Reproduce

import org.springframework.web.util.HtmlUtils;

// The class name is arbitrary; it was added so the snippet compiles on its own.
public class HtmlUnescapeRepro {

    public static void main(String[] args) {
        // Test character: 'Grinning Face' emoji (😀)
        // Unicode code point: U+1F600
        // Hexadecimal: 1F600
        // Decimal: 128512

        // 1. Input value as a decimal numeric character reference
        String inputDecimal = "&#128512;";

        // 2. Input value as a hexadecimal numeric character reference
        String inputHex = "&#x1F600;";

        // 3. The expected result after correct conversion
        String expectedOutput = "😀";

        System.out.println("--- Decimal HTML Entity Test ---");
        System.out.println("Input: " + inputDecimal);

        // Unescape the decimal reference
        String actualOutputDecimal = HtmlUtils.htmlUnescape(inputDecimal);

        System.out.println("Actual Output: " + actualOutputDecimal);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputDecimal));

        System.out.println("\n--- Hexadecimal HTML Entity Test ---");
        System.out.println("Input: " + inputHex);

        // Unescape the hexadecimal reference
        String actualOutputHex = HtmlUtils.htmlUnescape(inputHex);

        System.out.println("Actual Output: " + actualOutputHex);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputHex));
    }
}
[Screenshot 2025-09-13 11:29:37 PM]

Cause

The root cause was a narrowing cast to a 16-bit char in HtmlCharacterEntityDecoder. The cast keeps only the low 16 bits, so any Unicode code point greater than U+FFFF loses its most significant bits.
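
The truncation can be reproduced in isolation with plain Java. The sketch below is not the code from HtmlCharacterEntityDecoder; the class name and the `truncate` helper are hypothetical and exist only to show what a narrowing cast to char does to U+1F600.

```java
public class CharCastTruncation {

    // Hypothetical helper: mimics a decoder that narrows a parsed code point to char.
    static char truncate(int codePoint) {
        return (char) codePoint; // keeps only the low 16 bits
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // decimal 128512

        char truncated = truncate(grinningFace);

        // The leading 0x1 is dropped, leaving U+F600.
        System.out.println(Integer.toHexString(truncated).toUpperCase()); // F600

        // U+1F600 lies above U+FFFF, so it cannot fit in a single char.
        System.out.println(Character.isSupplementaryCodePoint(grinningFace)); // true
    }
}
```

Code points up to U+FFFF pass through such a cast unchanged, which would explain why the problem only shows up for supplementary characters.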

Solution

This PR resolves the issue by replacing the direct (char) cast with a call to StringBuilder.appendCodePoint().

The appendCodePoint() method is designed to handle the full range of Unicode code points. It correctly converts supplementary characters into a two-character surrogate pair, ensuring that all characters are unescaped without data loss. A corresponding unit test has been added to verify this fix.
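
The behavior of StringBuilder.appendCodePoint() can be checked with a minimal sketch. The class name and the `decode` helper are hypothetical stand-ins for the step where a decoder appends the parsed code point; this is not the PR's actual diff.

```java
public class AppendCodePointDemo {

    // Hypothetical stand-in for "append the decoded numeric reference".
    static String decode(int codePoint) {
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(codePoint); // emits a surrogate pair when needed
        return sb.toString();
    }

    public static void main(String[] args) {
        String emoji = decode(0x1F600);

        // A supplementary character occupies two chars (a surrogate pair).
        System.out.println(emoji.length()); // 2
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1))); // true

        // The original code point is recovered intact.
        System.out.println(emoji.codePointAt(0) == 0x1F600); // true

        // Code points in the Basic Multilingual Plane still yield one char.
        System.out.println(decode(0x41)); // A
    }
}
```

Note that appendCodePoint() throws IllegalArgumentException when the argument is not a valid Unicode code point (for example, a value above 0x10FFFF), so a decoder built this way has to decide how to treat out-of-range references. How the PR handles that case is not stated in this description.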

juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from bc095df to 369ffe4, September 14, 2025 04:41
juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from 369ffe4 to a6efa2a, September 14, 2025 04:48
rstoyanchev added the "in: web" label (Issues in web modules: web, webmvc, webflux, websocket), Nov 4, 2025
Labels

in: web - Issues in web modules (web, webmvc, webflux, websocket)
status: waiting-for-triage - An issue we've not yet triaged or decided on

Development

Successfully merging this pull request may close these issues.

HtmlUtils.htmlUnescape() incorrect for numeric character references >= 𐀀 / 𐀀

3 participants