Skip to content

fix: preserve multiline string indentation in cell loader#9877

Open
ChidiebereNjoku wants to merge 13 commits into
marimo-team:mainfrom
ChidiebereNjoku:fix/smart-dedent-multiline-strings
Open

fix: preserve multiline string indentation in cell loader#9877
ChidiebereNjoku wants to merge 13 commits into
marimo-team:mainfrom
ChidiebereNjoku:fix/smart-dedent-multiline-strings

Conversation

@ChidiebereNjoku

@ChidiebereNjoku ChidiebereNjoku commented Jun 13, 2026

Copy link
Copy Markdown

Description:

Closes #9851

Problem

When a cell defines a multiline (triple-quoted) string with indented content,
marimo's cell loader strips leading whitespace from lines inside the string
literal. This mutates user data — the runtime value of the string differs from
what Python itself produces when importing the same file.

Repro:

@app.cell
def _():
    TEXT = """line0
  two_spaces
    four_spaces
      six_spaces"""
    print(TEXT)
    return (TEXT,)

Before (broken):

line0
  two_spaces
four_spaces      ← wrong, 2 spaces stripped from string content
  six_spaces

After (fixed):

line0
  two_spaces
    four_spaces
      six_spaces

Root Cause

textwrap.dedent and the fixed_dedent helper in parse.py are indentation-blind — they strip leading whitespace from every line uniformly, including lines that are content inside multiline string literals. Python's semantics require that indentation inside string literals is part of the string's value and must never be touched.

Additionally, a stray import tokenize (module) on line 43 of compiler.py shadowed the from tokenize import TokenError, tokenize (function) on line 15, causing a TypeError: 'module' object is not callable in ends_with_semicolon.

Solution

Introduce marimo/_ast/dedent.py with a token-aware smart_dedent function. It uses Python's tokenize module to identify lines that fall inside multiline string literals and marks them as protected. Only unprotected (code) lines participate in the base indentation calculation and stripping.

Changes

  • marimo/_ast/dedent.py (new): smart_dedent and _get_protected_lines — token-aware dedent that preserves string literal whitespace
  • marimo/_ast/compiler.py: replace textwrap.dedent with smart_dedent in cell_factory; remove stray import tokenize that caused TypeError
  • marimo/_ast/parse.py: replace textwrap.dedent in fixed_dedent with smart_dedent; preserve existing refill logic for AI-generated inconsistent indentation but skip refill on protected (string-content) lines
  • tests/_ast/test_parse.py: update test_fixed_dedent snapshot to reflect corrected behaviour

Pre-Review Checklist

  • Any AI generated code has been reviewed line-by-line by the human PR author, who stands by it.
  • Tests have been added for the changes made.
  • Documentation has been updated where applicable.

<!-- This is an auto-generated description by cubic. -->
<a href="https://cubic.dev/pr/marimo-team/marimo/pull/9877?utm_source=github" target="_blank" rel="noopener noreferrer" data-no-image-dialog="true"><picture><source media="(prefers-color-scheme: dark)" srcset="https://www.cubic.dev/buttons/review-in-cubic-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://www.cubic.dev/buttons/review-in-cubic-light.svg"><img alt="Review in cubic" src="https://www.cubic.dev/buttons/review-in-cubic-dark.svg"></picture></a>
<!-- End of auto-generated description by cubic. -->

@vercel

vercel Bot commented Jun 13, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment Jun 25, 2026 10:07am

Request Review

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@ChidiebereNjoku

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

@mscolnick

Copy link
Copy Markdown
Contributor

@cubic-ai

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

@cubic-ai

@mscolnick I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files

Architecture diagram
sequenceDiagram
    participant Compiler as compiler.py
    participant Parser as parse.py (fixed_dedent)
    participant Dedent as dedent.py (smart_dedent)
    participant Tokenizer as Python tokenize

    Note over Compiler,Dedent: Cell Loading & Dedentation Flow

    Compiler->>Dedent: CHANGED: cell_factory(), toplevel_cell_factory(), context_cell_factory()
    Note over Compiler: Was calling textwrap.dedent()
    Compiler->>Dedent: smart_dedent(code) - cell code being compiled

    Parser->>Dedent: CHANGED: fixed_dedent() - AI/edited code path
    Note over Parser: Now calls smart_dedent() for final strip

    rect rgb(240, 240, 240)
        Note over Dedent: NEW: Token-Aware Dedent Logic

        Dedent->>Dedent: Split code into lines
        Dedent->>Tokenize: CHANGED: tokenize.generate_tokens()
        Tokenize-->>Dedent: Token stream

        Dedent->>Dedent: _get_protected_lines() - identify STRING tokens with newlines
        Note over Dedent: Marks lines inside """ ... """ as protected

        Dedent->>Dedent: Calculate min_indent from unprotected lines only
        alt No indent needed (min_indent == 0 or infinity)
            Dedent-->>Compiler: Return original code unchanged
        else Indent found
            Dedent->>Dedent: Strip min_indent from unprotected lines
            Note over Dedent: Protected (string-content) lines untouched
            Dedent-->>Compiler: Return dedented code with strings preserved
        end
    end

    rect rgb(240, 235, 255)
        Note over Parser: fixed_dedent() Additional Logic

        Parser->>Dedent: Call _get_protected_lines()
        Parser->>Parser: refill() skips protected lines
        Note over Parser: AI-generated inconsistent indentation only fixed for code lines
        Parser->>Dedent: Finally call smart_dedent() on result
    end

    Note over Compiler,Parser: Result: Multiline string literal content is never stripped of leading whitespace
Loading

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread marimo/_ast/dedent.py Outdated
Comment thread marimo/_ast/dedent.py Outdated
except TokenError:
return protected
for tok_type, tok_string, tok_start, tok_end, _ in tokens:
if tok_type == tokenize.STRING and "\n" in tok_string:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might not worth for f-strings

@mscolnick

Copy link
Copy Markdown
Contributor

can we add some tests for this specific function? i dont think this works on f-strings currently (hard to tell from just reading the existing tests)

@dmadisetti dmadisetti left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

My largest concern is that this isn't exactly what the user sees in front end, but I understand how it's maybe useful for script mode comparison. We do rewrite the AST, but I wonder if one way conversions would be better and only if the line falls under the standard indent.

Happy to be wrong, but it does seem like our current doc strings will get formatted weird and round tripping is not supported. What happens to:

"""Cell to compute sum
e.g. a=1; b=2 \imples c=3
foo
"""
a + b

Under this scheme it should become

def _():
    """Cell to compute sum
e.g. a=1; b=2 \imples c=3
foo
"""
    a + b

That seems a little counter intuitive and not currently in code. Conversely

def _():
   """Cell to compute sum
   e.g. a=1; b=2 \imples c=3
   foo
   """
   a + b

becomes

"""Cell to compute sum
   e.g. a=1; b=2 \imples c=3
   foo
   """
a + b

So maybe we only trigger in the following case:

def _():
    """Cell to compute sum
  e.g. a=1; b=2 \imples c=3 <--- under indented!
    foo
    """
    a + b
"""Cell to compute sum
  e.g. a=1; b=2 \imples c=3 <--- under indented!
    foo
    """
a + b

Can we also throw in a test for top level functions? e.g.

@app.function
def foo():
    """I wonder what
    Happens to this ?
    """
    return ...
@app.function
def bar():
    """Also wonder what
 Happens to that ?
    """
    return ...

Comment thread marimo/_ast/parse.py Outdated
@@ -56,22 +56,31 @@ def ast_parse(

def fixed_dedent(text: str) -> str:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to dedent while we're here?

@ChidiebereNjoku

Copy link
Copy Markdown
Author

@mscolnick Thanks for flagging this. f-strings work correctly because Python's tokenize module classifies f"""...""" as a STRING token just like regular triple-quoted strings, the token type is the same regardless of the f, r, or b prefix. I'll add explicit tests for f-strings and top-level function docstrings to make this clear.

@ChidiebereNjoku

Copy link
Copy Markdown
Author

@dmadisetti Good catches man. I'll move fixed_dedent into dedent.py to consolidate all dedent logic in one place. On the docstring concern, you're right that a docstring whose content lines start at column 0 (less indented than the function body) is a valid edge case. smart_dedent handles it correctly: those lines are marked protected and emitted byte-for-byte, so they won't be touched. I'll add a test covering this case.

@ChidiebereNjoku

Copy link
Copy Markdown
Author

@mscolnick You were right. Python 3.12+ tokenizes f-strings as FSTRING_START/FSTRING_MIDDLE/FSTRING_END instead of a single STRING token, so they weren't being protected by the original code. Fixed _get_protected_lines to handle both cases. Added explicit tests for f-strings (test_fstring_interior_preserved, test_fstring_preserved) and top-level function docstrings as requested. All 14 new tests pass.

@ChidiebereNjoku

Copy link
Copy Markdown
Author

@dmadisetti Done, fixed_dedent is now in dedent.py alongside smart_dedent and _get_protected_lines. Also added test_docstring_content_at_column_zero which confirms that docstring content lines less indented than the function body are correctly preserved byte-for-byte.

@dmadisetti

Copy link
Copy Markdown
Member

Woops. I just broke this PR trying to merge res. Sorry this fell off my plate, for reference hit review to to let people know you're ready

image

ChidiebereNjoku and others added 4 commits June 25, 2026 09:53
textwrap.dedent blindly strips leading whitespace from all lines,
including content inside triple-quoted string literals, mutating
user data. Introduce a token-aware smart_dedent in marimo/_ast/dedent.py
that marks lines inside multiline strings as protected before
stripping base indentation.

Changes:
- Add marimo/_ast/dedent.py with smart_dedent and _get_protected_lines
- Replace textwrap.dedent with smart_dedent in compiler.py cell_factory
- Replace textwrap.dedent with smart_dedent in parse.py fixed_dedent,
  keeping the refill logic for AI-generated inconsistent indentation
  but skipping refill on protected (string-content) lines
- Remove stray bare 'import tokenize' that shadowed the imported
  tokenize function with the module object, causing TypeError
- Update test_fixed_dedent snapshot to reflect corrected behaviour
pre-commit-ci Bot and others added 6 commits June 25, 2026 09:53
- Fix _get_protected_lines to handle Python 3.12+ f-string tokens
  (FSTRING_START/FSTRING_MIDDLE/FSTRING_END) in addition to regular
  STRING tokens, so f-string content lines are correctly protected
- Eliminate duplicate tokenize loop by passing tokens into
  _get_protected_lines instead of re-tokenizing
- Move fixed_dedent from parse.py into dedent.py to consolidate
  all dedent logic in one module (per dmadisetti's request)
- Add tests/_ast/test_dedent.py with 14 tests covering: basic dedent,
  regular/f-string/r-string multiline preservation, docstrings with
  content at column zero, top-level functions, inconsistent indentation
@ChidiebereNjoku

Copy link
Copy Markdown
Author

Woops. I just broke this PR trying to merge res. Sorry this fell off my plate, for reference hit review to to let people know you're ready

image

No worries. All conflicts resolved and CI is green again. Ready for re-review whenever you get a chance. @dmadisetti

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a correctness issue in marimo’s AST-based cell loading: dedenting extracted cell code should not mutate the contents of multiline (triple-quoted) string literals, preserving Python’s runtime string semantics.

Changes:

  • Add token-aware smart_dedent / fixed_dedent utilities to dedent code while preserving multiline string literal interior whitespace.
  • Switch cell/code extraction to use smart_dedent / fixed_dedent instead of textwrap.dedent.
  • Update and extend tests to cover multiline string preservation and related edge cases.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
marimo/_ast/dedent.py New token-aware dedent implementation (smart_dedent, fixed_dedent, helpers).
marimo/_ast/compiler.py Use smart_dedent when extracting cell code from source.
marimo/_ast/parse.py Replace prior inline fixed_dedent with import from marimo._ast.dedent.
tests/_ast/test_parse.py Update snapshots/expectations for corrected dedent behavior.
tests/_ast/test_dedent.py New unit tests for split_source_lines, smart_dedent, and fixed_dedent.

Comment thread marimo/_ast/dedent.py
Comment on lines +93 to +101
min_indent: float = float("inf")
for i, line in enumerate(lines):
if protected[i]:
continue
stripped = line.lstrip()
if not stripped:
continue
min_indent = min(min_indent, len(line) - len(stripped))

Comment thread marimo/_ast/dedent.py
Comment on lines +53 to +57
for tok_type, tok_string, tok_start, tok_end, _ in tokens:
# Regular triple-quoted strings: single STRING token spanning lines
if tok_type == tokenize.STRING and "\n" in tok_string:
start_line = tok_start[0] - 1
end_line = tok_end[0] - 1
Comment thread marimo/_ast/parse.py
Comment on lines +60 to +62
Unlike `str.splitlines()`, this only treats `\n`, `\r`, and `\r\n` as
line breaks. `str.splitlines()` additionally splits on `\f`, `\v`, the
`\x1c`-`\x1e` separators, and Unicode line separators.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cell loader dedents lines inside multiline string literals — runtime value differs from Python semantics

4 participants