roomote-agent
Collaborator

@roomote-agent roomote-agent commented Jun 20, 2025

This PR fixes issue #4950 where Claude Sonnet 4 was converting Unicode apostrophes and quotes to ASCII equivalents when editing files using the apply_diff tool.

Root Cause

The diff strategies were normalizing Unicode characters to ASCII for comparison but not preserving the original Unicode characters in the replacement content.
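To make the failure mode concrete, here is a minimal sketch of how normalizing only for comparison can drop typographic quotes. The names `normalizeQuotes` and `naiveApply` are hypothetical and are not taken from the diff strategies; the sketch assumes a one-to-one character normalization, so string offsets stay aligned.

```typescript
// Hypothetical illustration of the root cause; not the actual strategy code.
const SMART_TO_ASCII: Record<string, string> = {
	"\u201C": '"', // left double quotation mark
	"\u201D": '"', // right double quotation mark
	"\u2018": "'", // left single quotation mark
	"\u2019": "'", // right single quotation mark
}

// Map each typographic quote to its ASCII counterpart (length is preserved).
function normalizeQuotes(text: string): string {
	return text.replace(/[\u201C\u201D\u2018\u2019]/g, (ch) => SMART_TO_ASCII[ch] ?? ch)
}

// The search matches on normalized text, but the replacement is written
// verbatim, so an ASCII quote in the replacement overwrites a Unicode one.
function naiveApply(original: string, search: string, replace: string): string {
	const idx = normalizeQuotes(original).indexOf(normalizeQuotes(search))
	if (idx === -1) return original
	return original.slice(0, idx) + replace + original.slice(idx + search.length)
}

const edited = naiveApply("It\u2019s fine", "It's fine", "It's great")
```

Here `edited` contains an ASCII apostrophe even though the file originally used U+2019, which is the behavior reported in #4950.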

Solution

Added a preserveUnicodeCharacters function to both diff strategies that maps Unicode characters from the original content and preserves them in replacements.

Changes

  • Added preserveUnicodeCharacters function to multi-search-replace.ts and multi-file-search-replace.ts
  • Integrated Unicode preservation into replacement logic
  • Added comprehensive tests for Unicode character preservation
  • Handles Unicode quotes (U+201C, U+201D) and apostrophes (U+2018, U+2019)
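As a rough sketch of what such a mapping-based helper could look like (an assumption based on this description, not the code in the PR), the function below records which Unicode quotes occur in the original text and substitutes them for the matching ASCII quotes in the replacement. The name `preserveUnicodeSketch` is hypothetical.

```typescript
// Hypothetical sketch of a mapping-based approach; the PR's real
// preserveUnicodeCharacters implementation may differ.
function preserveUnicodeSketch(original: string, replacement: string): string {
	const pairs: Array<[string, string]> = [
		["\u201C", '"'],
		["\u201D", '"'],
		["\u2018", "'"],
		["\u2019", "'"],
	]
	const asciiToUnicode = new Map<string, string>()
	for (const [unicode, ascii] of pairs) {
		// Later entries overwrite earlier ones for the same ASCII character.
		if (original.includes(unicode)) asciiToUnicode.set(ascii, unicode)
	}
	// Leave ASCII quotes untouched when the original had no Unicode variant.
	return replacement.replace(/["']/g, (ch) => asciiToUnicode.get(ch) ?? ch)
}

const preserved = preserveUnicodeSketch("It\u2019s fine", "It's great")
const asciiOnly = preserveUnicodeSketch("It's fine", "It's great")
```

With these inputs, `preserved` keeps the U+2019 apostrophe, while `asciiOnly` is returned unchanged, matching the backward-compatibility goal for ASCII-only content.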

Testing

  • Added unicode-preservation.test.ts with multiple test scenarios
  • Verified Unicode characters are preserved during diff operations
  • Confirmed backward compatibility with ASCII-only content

Closes #4950


Important

Fixes Unicode character conversion issue in apply_diff by preserving original Unicode characters during replacements.

  • Behavior:
    • Fixes Unicode character conversion issue in apply_diff tool by preserving original Unicode characters during replacements.
    • Handles Unicode quotes (U+201C, U+201D) and apostrophes (U+2018, U+2019).
  • Functions:
    • Adds preserveUnicodeCharacters function to multi-search-replace.ts and multi-file-search-replace.ts.
    • Integrates Unicode preservation into replacement logic in both files.
  • Testing:
    • Adds unicode-preservation.test.ts with tests for Unicode character preservation.
    • Verifies preservation of Unicode characters and backward compatibility with ASCII content.

This description was created by Ellipsis for a1821c7.

- Added preserveUnicodeCharacters function to both multi-search-replace and multi-file-search-replace diff strategies
- The function maps Unicode characters from original content to replacement content to prevent conversion to ASCII
- Fixes issue where Unicode apostrophes (’) and quotes (“ ”) were being converted to ASCII equivalents (' and ") during diff operations
- Added comprehensive tests to verify Unicode character preservation
@roomote-agent roomote-agent requested review from mrubens, cte and jr as code owners June 20, 2025 18:20
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jun 20, 2025
const unicodeChars = ["\u201C", "\u201D", "\u2018", "\u2019"] // “ ” ‘ ’
const asciiChars = ['"', '"', "'", "'"]

for (let i = 0; i < unicodeChars.length; i++) {

The preserveUnicodeCharacters function here uses a simple mapping that overwrites the mapping for ASCII quotes (" and ') when both opening and closing Unicode variants are present. This means all occurrences of a quote in the replacement will use the last mapped Unicode character, potentially losing the distinction between opening and closing quotes. Consider using a sequential approach or separate mappings to preserve paired quotes correctly. Also, the function is duplicated in another file; consider extracting it into a shared utility module.

This comment was generated because it violated a code review rule: irule_tTqpIuNs8DV0QFGj.
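One way to read the "sequential approach" suggested above is to reuse the original's quote characters in the order they appear, instead of keeping a single mapping per ASCII character. The sketch below is only an illustration under that assumption; `preserveQuotesSequentially` is a hypothetical name, and the heuristic presumes the replacement's quotes occur in the same order as the original's.

```typescript
// Hypothetical sketch: consume the original's quote characters in order so
// that opening and closing variants stay distinct.
function preserveQuotesSequentially(original: string, replacement: string): string {
	// Quote-like characters from the original, in document order.
	const doubles: string[] = original.match(/["\u201C\u201D]/g) ?? []
	const singles: string[] = original.match(/['\u2018\u2019]/g) ?? []
	let d = 0
	let s = 0
	return replacement.replace(/["']/g, (ch) => {
		// Once the original runs out of quotes of this kind, keep the ASCII one.
		if (ch === '"') return doubles[d++] ?? ch
		return singles[s++] ?? ch
	})
}

const paired = preserveQuotesSequentially("\u201Chello\u201D", '"goodbye"')
```

In this example `paired` starts with U+201C and ends with U+201D, whereas a single last-wins mapping would use U+201D in both positions. Quotes beyond those found in the original fall back to ASCII.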

@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jun 20, 2025
@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Jun 21, 2025
@hannesrudolph hannesrudolph added PR - Needs Preliminary Review and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jun 21, 2025
@daniel-lxs
Collaborator

This PR seems to address the symptoms rather than the root cause. While the model can correctly write Unicode to files, the only issue I’ve seen so far is with quotes being replaced.

The proposed solution feels a bit overengineered, especially since it adds ongoing checks for Unicode quotes that don’t appear to be common.

That’s not to say the issue isn’t valid, but it might be better to investigate the underlying cause and fix it directly instead.

@daniel-lxs daniel-lxs closed this Jun 22, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Jun 22, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jun 22, 2025
@cte cte deleted the fix-4950 branch July 31, 2025 20:04

Successfully merging this pull request may close these issues.

roo code claude sonnet 4 is changing unicode ’ to ascii ' when it edits a file