
fix: make Frame.lookup() iterative to prevent StackOverflowError #107

Closed

fdelbrayelle wants to merge 1 commit into dashjoin:main from fdelbrayelle:fix/frame-lookup-iterative

Conversation

@fdelbrayelle

Problem

Jsonata$Frame.lookup() walks the parent scope chain via tail recursion:

Object lookup(String name) {
    if (bindings.containsKey(name)) return bindings.get(name);
    else if (parent != null) return parent.lookup(name);
    return null;
}

Java does not optimize tail calls. On JVMs with smaller default thread stack sizes (e.g. Windows workers, containers with -Xss256k), deeply nested JSONata expressions overflow the stack. The StackOverflowError is a JVM Error, not an Exception — it cannot be safely caught and recovered from, and it crashes the entire worker thread.
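A minimal, self-contained sketch of the failure mode. `Frame` here is a simplified stand-in for `Jsonata$Frame`, not the real class, and catching `StackOverflowError` is done only to observe the failure in a controlled demo — it is not safe in production code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for Jsonata$Frame — illustration only.
public class RecursiveLookupDemo {
    static class Frame {
        final Map<String, Object> bindings = new HashMap<>();
        final Frame parent;
        Frame(Frame parent) { this.parent = parent; }

        // The recursive lookup pattern quoted above: one JVM frame per scope level.
        Object lookup(String name) {
            if (bindings.containsKey(name)) return bindings.get(name);
            else if (parent != null) return parent.lookup(name);
            return null;
        }
    }

    // Builds a scope chain of the given depth with the binding at the root.
    static Frame deepChain(int depth) {
        Frame f = new Frame(null);
        f.bindings.put("x", 42);
        for (int i = 0; i < depth; i++) f = new Frame(f);
        return f;
    }

    // Catching StackOverflowError is unsafe in general; done here only so the
    // demo can report the outcome instead of crashing the thread.
    static boolean overflows(int depth) {
        try {
            deepChain(depth).lookup("x");
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("depth 10 overflows: " + overflows(10));
        System.out.println("depth 1000000 overflows: " + overflows(1_000_000));
    }
}
```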

Fix

Replace with an iterative loop. Identical semantics, zero recursion:

Object lookup(String name) {
    for (Frame f = this; f != null; f = f.parent) {
        if (f.bindings.containsKey(name)) return f.bindings.get(name);
    }
    return null;
}
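With the same simplified stand-in (again hypothetical, not the actual `Jsonata$Frame`), the iterative version resolves a million-deep chain without touching JVM stack depth:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for Jsonata$Frame with the iterative lookup.
public class IterativeLookupDemo {
    static class Frame {
        final Map<String, Object> bindings = new HashMap<>();
        final Frame parent;
        Frame(Frame parent) { this.parent = parent; }

        // Iterative traversal: constant stack usage regardless of chain depth.
        Object lookup(String name) {
            for (Frame f = this; f != null; f = f.parent) {
                if (f.bindings.containsKey(name)) return f.bindings.get(name);
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Frame f = new Frame(null);
        f.bindings.put("x", 42);
        for (int i = 0; i < 1_000_000; i++) f = new Frame(f);
        // A million-deep scope chain resolves with constant stack usage.
        System.out.println(f.lookup("x")); // prints 42
    }
}
```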

Impact

Replacing the recursive parent.lookup(name) call with an iterative loop over the scope chain leaves semantics and traversal order unchanged, and eliminates all stack risk from scope-chain lookup regardless of nesting depth.

Fixes worker crashes reported in kestra-io/plugin-transform#79.
@aeberhart
Contributor

Thanks for the PR, looks good. Can you provide an example of a JVM setup + expression that reproduces this issue? Thanks!

fdelbrayelle added a commit to kestra-io/plugin-transform that referenced this pull request Apr 30, 2026
Tests now assert isInstanceOf(StackOverflowError.class) — current behavior
without the upstream fix. Rename constant to XSS_256K and add Javadoc linking
it to -Xss256k so the stack constraint is self-documenting.
Comment marks where assertions flip once dashjoin/jsonata-java#107 ships.
@fdelbrayelle
Author

Hi @aeberhart 👋
We don't have the customer's original data from their flow because it's sensitive, but I'm able to reproduce the issue. I opened a PR so you can check.

fdelbrayelle added a commit to kestra-io/plugin-transform that referenced this pull request Apr 30, 2026
Tests document the current bug state: StackOverflowError is expected.
Will flip to assertThatNoException once dashjoin/jsonata-java#107 ships.
@aeberhart
Contributor

Hi @fdelbrayelle, I ran some tests with the expression from your repo:

($f := function($n) { $n > 0 ? $f($n - 1) + 1 : 0 }; $f(...))

Your change does not really make a difference because the number of nested frames is independent of the recursion depth. Eventually, you get a StackOverflowError at a more or less random code position.

In your engine, you might want to do this:

  @Test
  public void testStack() {
    var e = Jsonata.jsonata("($f := function($n) { $n > 0 ? $f($n - 1) + 1 : 0 }; $f(820))");
    Frame f = e.createFrame();
    f.setRuntimeBounds(Long.MAX_VALUE, 100);
    System.out.println(e.evaluate(null, f));
  }

Then, you get a JException from com.dashjoin.jsonata.Timebox.checkRunnaway().
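The Timebox internals aren't shown in this thread. As a rough sketch of the pattern — the class and method names below are assumptions for illustration, not the actual com.dashjoin.jsonata.Timebox code — a depth/time guard might look like this:

```java
// Hypothetical sketch of a recursion/time guard, loosely modeled on the
// setRuntimeBounds(timeoutMs, maxDepth) behavior described above.
// NOT the actual com.dashjoin.jsonata.Timebox implementation.
public class RuntimeBounds {
    private final long deadlineMillis;
    private final int maxDepth;
    private int depth;

    public RuntimeBounds(long timeoutMillis, int maxDepth) {
        this.deadlineMillis = System.currentTimeMillis() + timeoutMillis;
        this.maxDepth = maxDepth;
    }

    // Called on evaluator entry; throws a recoverable exception before the
    // JVM stack itself can overflow.
    public void enter() {
        depth++;
        if (depth > maxDepth)
            throw new IllegalStateException(
                "call depth limit of " + maxDepth + " exceeded");
        if (System.currentTimeMillis() > deadlineMillis)
            throw new IllegalStateException("expression evaluation timeout");
    }

    // Called on evaluator exit to unwind the logical depth counter.
    public void exit() { depth--; }

    public static void main(String[] args) {
        RuntimeBounds b = new RuntimeBounds(Long.MAX_VALUE, 100);
        try {
            for (int i = 0; i < 1000; i++) b.enter(); // simulate deep recursion
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```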

@fdelbrayelle
Author

fdelbrayelle commented Apr 30, 2026

In plugin-transform, we're already following this pattern, with a default maxDepth of 200 (see here):

    protected JsonNode evaluateExpression(RunContext runContext, JsonNode jsonNode) {
        try {
            var timeoutInMilli = runContext.render(getTimeout()).as(Duration.class)
                .map(Duration::toMillis)
                .orElse(Long.MAX_VALUE);
            var rMaxDepth = runContext.render(getMaxDepth()).as(Integer.class).orElseThrow();

            var data = MAPPER.convertValue(jsonNode, Object.class);
            var frame = this.parsedExpression.createFrame();
            frame.setRuntimeBounds(timeoutInMilli, rMaxDepth);

            var result = this.parsedExpression.evaluate(data, frame);
            if (result == null) {
                return NullNode.getInstance();
            }
            return MAPPER.valueToTree(result);
        } catch (JException | IllegalVariableEvaluationException e) {
            throw new RuntimeException("Failed to evaluate expression", e);
        }
    }

Frame.lookup() is called inside the recursive evaluator. With maxDepth=N and a scope chain of depth D, each level of evaluation recursion adds D lookup frames on top of its own frame. Stack usage grows as O(N × D), not O(N). Making lookup() iterative collapses that D multiplier to 1 regardless of scope nesting.

setRuntimeBounds limits evaluation recursion depth, but it fires after the frames are pushed. It can't prevent overflow if the stack is already exhausted by the time checkRunnaway() is reached. The iterative lookup() reduces the per-level footprint, giving the bounds check room to fire cleanly.

As evidence: Kestra already calls frame.setRuntimeBounds(timeout, 200), and that didn't prevent the crash. The overflow happened inside the lookup() traversal while evaluating a moderately nested expression on a container with a reduced stack size. That's exactly the scenario this PR fixes.

The change is zero-behavior-change and zero-risk, and it removes one class of StackOverflowError entirely. It doesn't conflict with `setRuntimeBounds`; it makes it more reliable.

@aeberhart
Contributor

Hi @fdelbrayelle,

running

($f := function($n) { $n > 0 ? $f($n - 1) + 1 : 0 }; $f(1999))

this is the stack trace:

java.lang.StackOverflowError
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata$Frame.lookup(Jsonata.java:94)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata$Frame.lookup(Jsonata.java:97)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata$Frame.lookup(Jsonata.java:97)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata._evaluate(Jsonata.java:155)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.evaluate(Jsonata.java:138)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.evaluateFunction(Jsonata.java:1601)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata._evaluate(Jsonata.java:202)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.evaluate(Jsonata.java:138)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.evaluateBinary(Jsonata.java:565)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata._evaluate(Jsonata.java:166)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.evaluate(Jsonata.java:138)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.evaluateCondition(Jsonata.java:1247)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata._evaluate(Jsonata.java:190)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.evaluate(Jsonata.java:138)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.applyProcedure(Jsonata.java:1920)
	at com.dashjoin.jsonata/com.dashjoin.jsonata.Jsonata.applyInner(Jsonata.java:1722)
...

In the example, D=4 and N is some large number. In total, lookup is called N×D times, but the recursion depth is O(N+D). You can see this in the trace above: the recursive call (evaluate) is repeated N times, with 4 calls to lookup "on top".
Of course, a loop is marginally faster than recursion, but since D is very small, we won't merge the change.

Please let us know if we're missing something.
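The N×D-calls-versus-O(N+D)-depth distinction can be made concrete with a toy model (illustrative only — this is not the jsonata-java evaluator):

```java
// Toy model of the analysis above: an "evaluator" that recurses N times and
// performs a D-deep recursive "lookup" at each level. Counts total lookup
// calls versus the maximum simultaneous stack depth.
public class DepthModel {
    static int depth, maxDepth;
    static long lookupCalls;

    // Simulates walking a D-deep parent chain; frames pop before returning.
    static void lookup(int remaining) {
        depth++; maxDepth = Math.max(maxDepth, depth);
        lookupCalls++;
        if (remaining > 0) lookup(remaining - 1);
        depth--;
    }

    // Simulates evaluator recursion: each level does one D-deep lookup,
    // then recurses. The lookup frames are gone before the next level starts.
    static void evaluate(int n, int d) {
        depth++; maxDepth = Math.max(maxDepth, depth);
        lookup(d - 1);                 // D lookup frames "on top" of this level
        if (n > 0) evaluate(n - 1, d); // the N term dominates the depth
        depth--;
    }

    public static void main(String[] args) {
        evaluate(2000, 4);
        System.out.println("total lookup calls (~ N*D): " + lookupCalls);
        System.out.println("max stack depth (~ N+D):    " + maxDepth);
    }
}
```

The lookup frames unwind before each deeper evaluator level begins, so only the D frames of the innermost lookup ever coexist with the N evaluator frames — which is why flattening lookup shaves only D off the peak depth.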

fdelbrayelle added a commit to kestra-io/plugin-transform that referenced this pull request May 2, 2026
Each JSONata recursion level pushes ~8 JVM frames. On 256 KB worker
stacks (~300 usable frames), the old default maxDepth=200 allowed
200 × 8 = 1600 frames before the bounds check fired — far past overflow.

Two-layer fix:
1. Lower default maxDepth 200 → 50 (50 × 8 = 400 frames, safe on 256 KB).
   setRuntimeBounds fires at depth 50, throwing JException cleanly.
2. Run evaluate() on a dedicated thread with an explicit 4 MB stack.
   Worker thread stack size (e.g. 256 KB on Windows) can no longer
   constrain the evaluator. If a user sets a very high maxDepth and
   triggers StackOverflowError anyway, it is caught as Throwable inside
   the throwaway eval thread; the worker thread gets a clean
   RuntimeException instead of crashing.

Update regression tests:
- Parametrized test: maxDepth=50/200/500/1000 all produce JException,
  never StackOverflowError, even on -Xss512k JVM.
- Isolation test: maxDepth=50000 with $f(49999) overflows the eval
  thread but worker receives RuntimeException(cause=StackOverflowError).

Closes dashjoin/jsonata-java#107 dependency — fix is now self-contained
in plugin-transform without requiring upstream changes.
@fdelbrayelle
Author

You're right, thank you for the analysis and the stack trace.

The recursion depth is O(N+D), and since D is small, making lookup() iterative doesn't prevent the overflow. I was wrong about the multiplier.

We've fixed it on our side in plugin-transform instead: evaluation now runs on a dedicated thread with an explicit 4 MB stack, and the default maxDepth has been lowered to 50. The worker thread is fully isolated from any StackOverflowError in the evaluator.

So in the end, no upstream changes are needed here. Closing this PR.
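The mitigation described here — a throwaway eval thread with an explicit 4 MB stack that traps Throwable — might be sketched roughly as follows. Names and structure are illustrative, not the actual plugin-transform code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of running evaluation on a dedicated thread with an explicit stack
// size so the caller can neither overflow nor crash. Illustration only.
public class IsolatedEval {
    private static final long EVAL_STACK_BYTES = 4L * 1024 * 1024;

    interface Eval<T> { T run() throws Exception; }

    static <T> T evaluateIsolated(Eval<T> eval) {
        AtomicReference<T> result = new AtomicReference<>();
        AtomicReference<Throwable> error = new AtomicReference<>();
        // Thread(group, target, name, stackSize): stackSize is a hint some
        // platforms ignore, but HotSpot honors it on common ones.
        Thread t = new Thread(null, () -> {
            try {
                result.set(eval.run());
            } catch (Throwable e) { // deliberately broad: must trap StackOverflowError
                error.set(e);
            }
        }, "jsonata-eval", EVAL_STACK_BYTES);
        t.setDaemon(true);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while evaluating", e);
        }
        if (error.get() != null)
            throw new RuntimeException("Failed to evaluate expression", error.get());
        return result.get();
    }

    static int blowStack() { return blowStack() + 1; } // forces StackOverflowError

    public static void main(String[] args) {
        System.out.println(evaluateIsolated(() -> 21 + 21)); // prints 42
        try {
            evaluateIsolated(IsolatedEval::blowStack);
        } catch (RuntimeException e) {
            // Caller sees a clean RuntimeException; the SOE died on the eval thread.
            System.out.println("caught: " + e.getCause().getClass().getSimpleName());
        }
    }
}
```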

@fdelbrayelle fdelbrayelle deleted the fix/frame-lookup-iterative branch May 2, 2026 12:51
@aeberhart
Contributor

Thanks for confirming and of course for using the lib!

fdelbrayelle added a commit to kestra-io/plugin-transform that referenced this pull request May 15, 2026
… thread isolation + lower default maxDepth (#80)

* test(jsonata): add StackOverflow regression tests for small JVM stacks

Reproduces dashjoin/jsonata-java#107.
Frame.lookup() recurses once per scope frame — on 256 KB (Windows) or
512 KB (Linux default) thread stacks, a 999/1999-deep recursive JSONata
expression overflows before maxDepth fires. Tests use explicit Thread
stack sizes so the scenario is reproducible on any platform.

Passes once jsonata-java ships the iterative Frame.lookup() fix.

* test(jsonata): refine StackOverflow regression tests — shared 256k stack helper

Both Windows and Linux tests now use SMALL_STACK_BYTES = 256 KB via
Thread(stackSize) so no -Xss JVM flag is needed in build config.
Extract runOnSmallStack() helper to avoid duplication.

* test(jsonata): assert StackOverflowError and name constant XSS_256K

Tests now assert isInstanceOf(StackOverflowError.class) — current behavior
without the upstream fix. Rename constant to XSS_256K and add Javadoc linking
it to -Xss256k so the stack constraint is self-documenting.
Comment marks where assertions flip once dashjoin/jsonata-java#107 ships.

* test(jsonata): use -Xss256k JVM flag instead of Thread(stackSize)

Thread(stackSize) is advisory — JVM ignores it on Linux. Add -Xss256k to
test JVM args in plugin-transform-json/build.gradle via afterEvaluate so
it appends after the root build's jvmArgs= Allure assignment. Simplify
tests to call task.run() directly and assert StackOverflowError with
assertThatThrownBy.

* test(jsonata): fix -Xss256k not applied — root build used jvmArgs= (replace)

afterEvaluate in the subproject was silently wiped because the root Allure
subprojects{} block used jvmArgs=[...] (full assignment). Change root to
jvmArgs append; subproject test{jvmArgs '-Xss256k'} then safely extends it
without ordering ambiguity.

* test(jsonata): drop -Xss256k, increase depth to crash default Linux stack

-Xss256k is silently ignored by OpenJDK on Linux 64-bit (enforces a higher
minimum). Instead use depth=4999/9999 which overflows the default 512 KB-1 MB
thread stack without any JVM flag (~150 bytes/frame × 9999 ≈ 1.5 MB needed).

* test(jsonata): use -Xss512k + depth=49999 to guarantee stack overflow in CI

Linux default thread stack is ~8 MB (OS RLIMIT_STACK), far too large to
reproduce the crash without -Xss. Pin test JVM to -Xss512k (above HotSpot
minimum, safe for Kestra framework). Use depth=49999 so 49999 JVM frames
× minimum 16 bytes/frame = 800 KB > 512k — overflow guaranteed regardless
of JIT optimization level.

* test(jsonata): fix expression — use non-tail-recursive form to prevent TCO

jsonata-java TCO-optimises tail-recursive calls into a loop (O(1) stack),
so $f($n-1) never caused a stack overflow. Adding + 0 after the recursive
call puts it in non-tail position, forcing each frame to stay live and
producing the expected StackOverflowError on -Xss512k.

* test(jsonata): revert to assertThatThrownBy — lib not upgraded yet

Tests document the current bug state: StackOverflowError is expected.
Will flip to assertThatNoException once dashjoin/jsonata-java#107 ships.

* fix(jsonata): isolate eval on 4MB thread + lower default maxDepth to 50

Each JSONata recursion level pushes ~8 JVM frames. On 256 KB worker
stacks (~300 usable frames), the old default maxDepth=200 allowed
200 × 8 = 1600 frames before the bounds check fired — far past overflow.

Two-layer fix:
1. Lower default maxDepth 200 → 50 (50 × 8 = 400 frames, safe on 256 KB).
   setRuntimeBounds fires at depth 50, throwing JException cleanly.
2. Run evaluate() on a dedicated thread with an explicit 4 MB stack.
   Worker thread stack size (e.g. 256 KB on Windows) can no longer
   constrain the evaluator. If a user sets a very high maxDepth and
   triggers StackOverflowError anyway, it is caught as Throwable inside
   the throwaway eval thread; the worker thread gets a clean
   RuntimeException instead of crashing.

Update regression tests:
- Parametrized test: maxDepth=50/200/500/1000 all produce JException,
  never StackOverflowError, even on -Xss512k JVM.
- Isolation test: maxDepth=50000 with $f(49999) overflows the eval
  thread but worker receives RuntimeException(cause=StackOverflowError).

Closes dashjoin/jsonata-java#107 dependency — fix is now self-contained
in plugin-transform without requiring upstream changes.

* refactor(jsonata): replace per-call Thread with per-instance executor; document Throwable catch

Addresses two review comments from Malay on PR #80:

1. Thread-per-record regression (TransformItems): evaluateExpression() previously spawned a
   new 4 MB-stack Thread per call. Inside TransformItems.run(), calls are inside a Flux.map
   so this allocated one thread per record. Fixed by replacing new Thread(...) with a
   lazy-initialized single-threaded ExecutorService (daemon, 4 MB ThreadFactory) stored per
   Transform instance. TransformItems and TransformValue now shut it down in finally at the end
   of run(), so the executor lives for exactly one task run.

2. catch (Throwable): kept intentionally broad with an explanatory comment. The eval thread is
   a throwaway sandbox — narrowing to Exception | StackOverflowError would let other Errors
   (OOM, InternalError) escape silently via the UncaughtExceptionHandler, leaving both
   resultRef and errorRef null and returning null from evaluateExpression instead of failing.

New tests:
- shouldReuseEvalThreadAcrossRecords: asserts zero jsonata-eval threads alive after run()
- shouldContinueWorkingAfterStackOverflowError: asserts a clean second run after a SOE

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(jsonata): restore maxDepth default to 1000 to avoid breaking change

The 4 MB eval thread is the actual fix for the StackOverflowError on
Windows workers. Lowering maxDepth had no
effect on the crash, which was driven by Frame.lookup() scope-chain
depth, not user-defined function recursion. maxDepth=50 was an
unnecessary breaking change.

* test(jsonata): add regression test for large-dataset Frame.lookup() StackOverflowError

Reproduces a production crash: 5,000 LDAP-like
records with a 15-field flat-lookup expression (ternary, $string, $join on a
multi-value array) run on the -Xss512k JVM. The expression mirrors an actual
production flow. Without the 4 MB eval thread this crashes; with it the test passes.

* fix(jsonata): join eval thread in shutdown to eliminate termination race

awaitTermination() returns when all tasks are done but the thread itself
may still be in its exit sequence. Thread.getAllStackTraces() in the test
caught it in that window, causing a flaky assertion.

Track the thread reference in the factory and join() it (1 s timeout)
after awaitTermination so shutdownEvalExecutor() guarantees the thread
is fully dead before returning.

* fix(jsonata): use join() without timeout — thread is already done

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
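The per-instance executor refactor described in the commit messages above might be sketched as follows (illustrative only; names are assumptions, not the actual plugin-transform code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of a per-instance, single-threaded eval executor: one daemon thread
// with a 4 MB stack, reused across records, shut down after the task run.
public class EvalExecutorDemo {
    private static final long EVAL_STACK_BYTES = 4L * 1024 * 1024;

    private final ExecutorService evalExecutor = Executors.newSingleThreadExecutor(r -> {
        // ThreadFactory with an explicit stack size; daemon so it never blocks JVM exit.
        Thread t = new Thread(null, r, "jsonata-eval", EVAL_STACK_BYTES);
        t.setDaemon(true);
        return t;
    });

    public <T> T evaluate(Callable<T> expr) {
        try {
            // FutureTask traps Throwable, so even a StackOverflowError on the
            // eval thread surfaces here as an ExecutionException cause.
            return evalExecutor.submit(expr).get();
        } catch (ExecutionException e) {
            throw new RuntimeException("Failed to evaluate expression", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while evaluating", e);
        }
    }

    // Called in finally at the end of run(), so the executor lives for one task run.
    public void shutdown() {
        evalExecutor.shutdown();
        try {
            evalExecutor.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        EvalExecutorDemo demo = new EvalExecutorDemo();
        try {
            // Many records reuse the same eval thread instead of one thread per record.
            for (int i = 0; i < 3; i++) System.out.println(demo.evaluate(() -> "ok"));
        } finally {
            demo.shutdown();
        }
    }
}
```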