fix: preserve multiline string indentation in cell loader#9877
fix: preserve multiline string indentation in cell loader#9877ChidiebereNjoku wants to merge 13 commits into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
|
@mscolnick I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
1 issue found across 4 files
Architecture diagram
sequenceDiagram
participant Compiler as compiler.py
participant Parser as parse.py (fixed_dedent)
participant Dedent as dedent.py (smart_dedent)
participant Tokenizer as Python tokenize
Note over Compiler,Dedent: Cell Loading & Dedentation Flow
Compiler->>Dedent: CHANGED: cell_factory(), toplevel_cell_factory(), context_cell_factory()
Note over Compiler: Was calling textwrap.dedent()
Compiler->>Dedent: smart_dedent(code) - cell code being compiled
Parser->>Dedent: CHANGED: fixed_dedent() - AI/edited code path
Note over Parser: Now calls smart_dedent() for final strip
rect rgb(240, 240, 240)
Note over Dedent: NEW: Token-Aware Dedent Logic
Dedent->>Dedent: Split code into lines
Dedent->>Tokenize: CHANGED: tokenize.generate_tokens()
Tokenize-->>Dedent: Token stream
Dedent->>Dedent: _get_protected_lines() - identify STRING tokens with newlines
Note over Dedent: Marks lines inside """ ... """ as protected
Dedent->>Dedent: Calculate min_indent from unprotected lines only
alt No indent needed (min_indent == 0 or infinity)
Dedent-->>Compiler: Return original code unchanged
else Indent found
Dedent->>Dedent: Strip min_indent from unprotected lines
Note over Dedent: Protected (string-content) lines untouched
Dedent-->>Compiler: Return dedented code with strings preserved
end
end
rect rgb(240, 235, 255)
Note over Parser: fixed_dedent() Additional Logic
Parser->>Dedent: Call _get_protected_lines()
Parser->>Parser: refill() skips protected lines
Note over Parser: AI-generated inconsistent indentation only fixed for code lines
Parser->>Dedent: Finally call smart_dedent() on result
end
Note over Compiler,Parser: Result: Multiline string literal content is never stripped of leading whitespace
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| except TokenError: | ||
| return protected | ||
| for tok_type, tok_string, tok_start, tok_end, _ in tokens: | ||
| if tok_type == tokenize.STRING and "\n" in tok_string: |
There was a problem hiding this comment.
this might not worth for f-strings
|
can we add some tests for this specific function? i dont think this works on f-strings currently (hard to tell from just reading the existing tests) |
dmadisetti
left a comment
There was a problem hiding this comment.
Thanks!
My largest concern is that this isn't exactly what the user sees in front end, but I understand how it's maybe useful for script mode comparison. We do rewrite the AST, but I wonder if one way conversions would be better and only if the line falls under the standard indent.
Happy to be wrong, but it does seem like our current doc strings will get formatted weird and round tripping is not supported. What happens to:
"""Cell to compute sum
e.g. a=1; b=2 \imples c=3
foo
"""
a + bUnder this scheme it should become
def _():
"""Cell to compute sum
e.g. a=1; b=2 \imples c=3
foo
"""
a + bThat seems a little counter intuitive and not currently in code. Conversely
def _():
"""Cell to compute sum
e.g. a=1; b=2 \imples c=3
foo
"""
a + bbecomes
"""Cell to compute sum
e.g. a=1; b=2 \imples c=3
foo
"""
a + bSo maybe we only trigger in the following case:
def _():
"""Cell to compute sum
e.g. a=1; b=2 \imples c=3 <--- under indented!
foo
"""
a + b"""Cell to compute sum
e.g. a=1; b=2 \imples c=3 <--- under indented!
foo
"""
a + bCan we also throw in a test for top level functions? e.g.
@app.function
def foo():
"""I wonder what
Happens to this ?
"""
return ...@app.function
def bar():
"""Also wonder what
Happens to that ?
"""
return ...| @@ -56,22 +56,31 @@ def ast_parse( | |||
|
|
|||
| def fixed_dedent(text: str) -> str: | |||
There was a problem hiding this comment.
Can you move this to dedent while we're here?
|
@mscolnick Thanks for flagging this. f-strings work correctly because Python's tokenize module classifies f"""...""" as a STRING token just like regular triple-quoted strings, the token type is the same regardless of the f, r, or b prefix. I'll add explicit tests for f-strings and top-level function docstrings to make this clear. |
|
@dmadisetti Good catches man. I'll move fixed_dedent into dedent.py to consolidate all dedent logic in one place. On the docstring concern, you're right that a docstring whose content lines start at column 0 (less indented than the function body) is a valid edge case. smart_dedent handles it correctly: those lines are marked protected and emitted byte-for-byte, so they won't be touched. I'll add a test covering this case. |
|
@mscolnick You were right. Python 3.12+ tokenizes f-strings as FSTRING_START/FSTRING_MIDDLE/FSTRING_END instead of a single STRING token, so they weren't being protected by the original code. Fixed _get_protected_lines to handle both cases. Added explicit tests for f-strings (test_fstring_interior_preserved, test_fstring_preserved) and top-level function docstrings as requested. All 14 new tests pass. |
|
@dmadisetti Done, fixed_dedent is now in dedent.py alongside smart_dedent and _get_protected_lines. Also added test_docstring_content_at_column_zero which confirms that docstring content lines less indented than the function body are correctly preserved byte-for-byte. |
textwrap.dedent blindly strips leading whitespace from all lines, including content inside triple-quoted string literals, mutating user data. Introduce a token-aware smart_dedent in marimo/_ast/dedent.py that marks lines inside multiline strings as protected before stripping base indentation. Changes: - Add marimo/_ast/dedent.py with smart_dedent and _get_protected_lines - Replace textwrap.dedent with smart_dedent in compiler.py cell_factory - Replace textwrap.dedent with smart_dedent in parse.py fixed_dedent, keeping the refill logic for AI-generated inconsistent indentation but skipping refill on protected (string-content) lines - Remove stray bare 'import tokenize' that shadowed the imported tokenize function with the module object, causing TypeError - Update test_fixed_dedent snapshot to reflect corrected behaviour
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
- Fix _get_protected_lines to handle Python 3.12+ f-string tokens (FSTRING_START/FSTRING_MIDDLE/FSTRING_END) in addition to regular STRING tokens, so f-string content lines are correctly protected - Eliminate duplicate tokenize loop by passing tokens into _get_protected_lines instead of re-tokenizing - Move fixed_dedent from parse.py into dedent.py to consolidate all dedent logic in one module (per dmadisetti's request) - Add tests/_ast/test_dedent.py with 14 tests covering: basic dedent, regular/f-string/r-string multiline preservation, docstrings with content at column zero, top-level functions, inconsistent indentation
for more information, see https://pre-commit.ci
9f465b7 to
44c442e
Compare
No worries. All conflicts resolved and CI is green again. Ready for re-review whenever you get a chance. @dmadisetti |
There was a problem hiding this comment.
Pull request overview
This PR fixes a correctness issue in marimo’s AST-based cell loading: dedenting extracted cell code should not mutate the contents of multiline (triple-quoted) string literals, preserving Python’s runtime string semantics.
Changes:
- Add token-aware
smart_dedent/fixed_dedentutilities to dedent code while preserving multiline string literal interior whitespace. - Switch cell/code extraction to use
smart_dedent/fixed_dedentinstead oftextwrap.dedent. - Update and extend tests to cover multiline string preservation and related edge cases.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
marimo/_ast/dedent.py |
New token-aware dedent implementation (smart_dedent, fixed_dedent, helpers). |
marimo/_ast/compiler.py |
Use smart_dedent when extracting cell code from source. |
marimo/_ast/parse.py |
Replace prior inline fixed_dedent with import from marimo._ast.dedent. |
tests/_ast/test_parse.py |
Update snapshots/expectations for corrected dedent behavior. |
tests/_ast/test_dedent.py |
New unit tests for split_source_lines, smart_dedent, and fixed_dedent. |
| min_indent: float = float("inf") | ||
| for i, line in enumerate(lines): | ||
| if protected[i]: | ||
| continue | ||
| stripped = line.lstrip() | ||
| if not stripped: | ||
| continue | ||
| min_indent = min(min_indent, len(line) - len(stripped)) | ||
|
|
| for tok_type, tok_string, tok_start, tok_end, _ in tokens: | ||
| # Regular triple-quoted strings: single STRING token spanning lines | ||
| if tok_type == tokenize.STRING and "\n" in tok_string: | ||
| start_line = tok_start[0] - 1 | ||
| end_line = tok_end[0] - 1 |
| Unlike `str.splitlines()`, this only treats `\n`, `\r`, and `\r\n` as | ||
| line breaks. `str.splitlines()` additionally splits on `\f`, `\v`, the | ||
| `\x1c`-`\x1e` separators, and Unicode line separators. |


Description:
Closes #9851
Problem
When a cell defines a multiline (triple-quoted) string with indented content,
marimo's cell loader strips leading whitespace from lines inside the string
literal. This mutates user data — the runtime value of the string differs from
what Python itself produces when importing the same file.
Repro:
Before (broken):
After (fixed):
Root Cause
textwrap.dedentand thefixed_dedenthelper inparse.pyare indentation-blind — they strip leading whitespace from every line uniformly, including lines that are content inside multiline string literals. Python's semantics require that indentation inside string literals is part of the string's value and must never be touched.Additionally, a stray
import tokenize(module) on line 43 ofcompiler.pyshadowed thefrom tokenize import TokenError, tokenize(function) on line 15, causing aTypeError: 'module' object is not callableinends_with_semicolon.Solution
Introduce
marimo/_ast/dedent.pywith a token-awaresmart_dedentfunction. It uses Python'stokenizemodule to identify lines that fall inside multiline string literals and marks them as protected. Only unprotected (code) lines participate in the base indentation calculation and stripping.Changes
marimo/_ast/dedent.py(new):smart_dedentand_get_protected_lines— token-aware dedent that preserves string literal whitespacemarimo/_ast/compiler.py: replacetextwrap.dedentwithsmart_dedentincell_factory; remove strayimport tokenizethat causedTypeErrormarimo/_ast/parse.py: replacetextwrap.dedentinfixed_dedentwithsmart_dedent; preserve existingrefilllogic for AI-generated inconsistent indentation but skiprefillon protected (string-content) linestests/_ast/test_parse.py: updatetest_fixed_dedentsnapshot to reflect corrected behaviourPre-Review Checklist