fix: make Frame.lookup() iterative to prevent StackOverflowError#107
fix: make Frame.lookup() iterative to prevent StackOverflowError#107fdelbrayelle wants to merge 1 commit into
Conversation
Java does not optimize tail calls. On JVMs with smaller default thread stack sizes (e.g. Windows workers, containers with -Xss256k), deeply nested JSONata expressions cause Jsonata$Frame.lookup() to overflow the stack via tail recursion. StackOverflowError is a JVM Error — it cannot be safely caught and the entire worker thread crashes. Replace the recursive parent.lookup(name) call with an iterative loop over the scope chain. Semantics are identical; traversal order is unchanged. Eliminates all stack risk from scope chain lookup regardless of nesting depth. Fixes worker crashes reported in kestra-io/plugin-transform#79.
|
thanks for the PR. looks good. can you provide an example of a JVM setup + expression that reproduces this issue. Thanks! |
Tests now assert isInstanceOf(StackOverflowError.class) — current behavior without the upstream fix. Rename constant to XSS_256K and add Javadoc linking it to -Xss256k so the stack constraint is self-documenting. Comment marks where assertions flip once dashjoin/jsonata-java#107 ships.
|
Hi @aeberhart 👋 |
Tests document the current bug state: StackOverflowError is expected. Will flip to assertThatNoException once dashjoin/jsonata-java#107 ships.
|
Hi @fdelbrayelle, I ran some tests with the expression from your repo: Your change does not really make a difference because the number of nested frames is independent of the recursion depth. Eventually, you get a StackOverflow at a more or less random code position. In your engine, you might want to do this: Then, you get a JException from com.dashjoin.jsonata.Timebox.checkRunnaway(). |
|
In protected JsonNode evaluateExpression(RunContext runContext, JsonNode jsonNode) {
try {
var timeoutInMilli = runContext.render(getTimeout()).as(Duration.class)
.map(Duration::toMillis)
.orElse(Long.MAX_VALUE);
var rMaxDepth = runContext.render(getMaxDepth()).as(Integer.class).orElseThrow();
var data = MAPPER.convertValue(jsonNode, Object.class);
var frame = this.parsedExpression.createFrame();
frame.setRuntimeBounds(timeoutInMilli, rMaxDepth);
var result = this.parsedExpression.evaluate(data, frame);
if (result == null) {
return NullNode.getInstance();
}
return MAPPER.valueToTree(result);
} catch (JException | IllegalVariableEvaluationException e) {
throw new RuntimeException("Failed to evaluate expression", e);
}
}
As an evidence: Kestra already calls The change is zero-behavior-change, zero-risk and removes one class of |
|
Hi @fdelbrayelle, running this is the stack trace: in the example D=4 and N is some large number. In total, lookup is called N*D times. The recursion depth is O(N+D). You can see this in the trace above. The recursive call (evaluate) is repeated N times with 4 calls to lookup "on top". Please let us know if we're missing something. |
Each JSONata recursion level pushes ~8 JVM frames. On 256 KB worker stacks (~300 usable frames), the old default maxDepth=200 allowed 200 × 8 = 1600 frames before the bounds check fired — far past overflow. Two-layer fix: 1. Lower default maxDepth 200 → 50 (50 × 8 = 400 frames, safe on 256 KB). setRuntimeBounds fires at depth 50, throwing JException cleanly. 2. Run evaluate() on a dedicated thread with an explicit 4 MB stack. Worker thread stack size (e.g. 256 KB on Windows) can no longer constrain the evaluator. If a user sets a very high maxDepth and triggers StackOverflowError anyway, it is caught as Throwable inside the throwaway eval thread; the worker thread gets a clean RuntimeException instead of crashing. Update regression tests: - Parametrized test: maxDepth=50/200/500/1000 all produce JException, never StackOverflowError, even on -Xss512k JVM. - Isolation test: maxDepth=50000 with $f(49999) overflows the eval thread but worker receives RuntimeException(cause=StackOverflowError). Closes dashjoin/jsonata-java#107 dependency — fix is now self-contained in plugin-transform without requiring upstream changes.
|
You're right, thank you for the analysis and the stack trace. The recursion depth is We've fixed it on our side in plugin-transform instead: evaluation now runs on a dedicated thread with an explicit 4MB stack and the default So in the end, no upstream changes are needed here. Closing this PR. |
|
Thanks for confirming and of course for using the lib! |
… thread isolation + lower default maxDepth (#80) * test(jsonata): add StackOverflow regression tests for small JVM stacks Reproduces dashjoin/jsonata-java#107. Frame.lookup() recurses once per scope frame — on 256 KB (Windows) or 512 KB (Linux default) thread stacks, a 999/1999-deep recursive JSONata expression overflows before maxDepth fires. Tests use explicit Thread stack sizes so the scenario is reproducible on any platform. Passes once jsonata-java ships the iterative Frame.lookup() fix. * test(jsonata): refine StackOverflow regression tests — shared 256k stack helper Both Windows and Linux tests now use SMALL_STACK_BYTES = 256 KB via Thread(stackSize) so no -Xss JVM flag is needed in build config. Extract runOnSmallStack() helper to avoid duplication. * test(jsonata): assert StackOverflowError and name constant XSS_256K Tests now assert isInstanceOf(StackOverflowError.class) — current behavior without the upstream fix. Rename constant to XSS_256K and add Javadoc linking it to -Xss256k so the stack constraint is self-documenting. Comment marks where assertions flip once dashjoin/jsonata-java#107 ships. * test(jsonata): use -Xss256k JVM flag instead of Thread(stackSize) Thread(stackSize) is advisory — JVM ignores it on Linux. Add -Xss256k to test JVM args in plugin-transform-json/build.gradle via afterEvaluate so it appends after the root build's jvmArgs= Allure assignment. Simplify tests to call task.run() directly and assert StackOverflowError with assertThatThrownBy. * test(jsonata): fix -Xss256k not applied — root build used jvmArgs= (replace) afterEvaluate in the subproject was silently wiped because the root Allure subprojects{} block used jvmArgs=[...] (full assignment). Change root to jvmArgs append; subproject test{jvmArgs '-Xss256k'} then safely extends it without ordering ambiguity. * test(jsonata): drop -Xss256k, increase depth to crash default Linux stack -Xss256k is silently ignored by OpenJDK on Linux 64-bit (enforces a higher minimum). Instead use depth=4999/9999 which overflows the default 512 KB-1 MB thread stack without any JVM flag (~150 bytes/frame × 9999 ≈ 1.5 MB needed). * test(jsonata): use -Xss512k + depth=49999 to guarantee stack overflow in CI Linux default thread stack is ~8 MB (OS RLIMIT_STACK), far too large to reproduce the crash without -Xss. Pin test JVM to -Xss512k (above HotSpot minimum, safe for Kestra framework). Use depth=49999 so 49999 JVM frames × minimum 16 bytes/frame = 800 KB > 512k — overflow guaranteed regardless of JIT optimization level. * test(jsonata): fix expression — use non-tail-recursive form to prevent TCO jsonata-java TCO-optimises tail-recursive calls into a loop (O(1) stack), so $f($n-1) never caused a stack overflow. Adding + 0 after the recursive call puts it in non-tail position, forcing each frame to stay live and producing the expected StackOverflowError on -Xss512k. * test(jsonata): revert to assertThatThrownBy — lib not upgraded yet Tests document the current bug state: StackOverflowError is expected. Will flip to assertThatNoException once dashjoin/jsonata-java#107 ships. * fix(jsonata): isolate eval on 4MB thread + lower default maxDepth to 50 Each JSONata recursion level pushes ~8 JVM frames. On 256 KB worker stacks (~300 usable frames), the old default maxDepth=200 allowed 200 × 8 = 1600 frames before the bounds check fired — far past overflow. Two-layer fix: 1. Lower default maxDepth 200 → 50 (50 × 8 = 400 frames, safe on 256 KB). setRuntimeBounds fires at depth 50, throwing JException cleanly. 2. Run evaluate() on a dedicated thread with an explicit 4 MB stack. Worker thread stack size (e.g. 256 KB on Windows) can no longer constrain the evaluator. If a user sets a very high maxDepth and triggers StackOverflowError anyway, it is caught as Throwable inside the throwaway eval thread; the worker thread gets a clean RuntimeException instead of crashing. Update regression tests: - Parametrized test: maxDepth=50/200/500/1000 all produce JException, never StackOverflowError, even on -Xss512k JVM. - Isolation test: maxDepth=50000 with $f(49999) overflows the eval thread but worker receives RuntimeException(cause=StackOverflowError). Closes dashjoin/jsonata-java#107 dependency — fix is now self-contained in plugin-transform without requiring upstream changes. * refactor(jsonata): replace per-call Thread with per-instance executor; document Throwable catch Addresses two review comments from Malay on PR #80: 1. Thread-per-record regression (TransformItems): evaluateExpression() previously spawned a new 4 MB-stack Thread per call. Inside TransformItems.run(), calls are inside a Flux.map so this allocated one thread per record. Fixed by replacing new Thread(...) with a lazy-initialized single-threaded ExecutorService (daemon, 4 MB ThreadFactory) stored per Transform instance. TransformItems and TransformValue now shut it down in finally at the end of run(), so the executor lives for exactly one task run. 2. catch (Throwable): kept intentionally broad with an explanatory comment. The eval thread is a throwaway sandbox — narrowing to Exception | StackOverflowError would let other Errors (OOM, InternalError) escape silently via the UncaughtExceptionHandler, leaving both resultRef and errorRef null and returning null from evaluateExpression instead of failing. New tests: - shouldReuseEvalThreadAcrossRecords: asserts zero jsonata-eval threads alive after run() - shouldContinueWorkingAfterStackOverflowError: asserts a clean second run after a SOE Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(jsonata): restore maxDepth default to 1000 to avoid breaking change The 4 MB eval thread is the actual fix for the StackOverflowError on Windows workers. Lowering maxDepth had no effect on the crash, which was driven by Frame.lookup() scope-chain depth, not user-defined function recursion. maxDepth=50 was an unnecessary breaking change. * test(jsonata): add regression test for large-dataset Frame.lookup() StackOverflowError Reproduces a production crash: 5,000 LDAP-like records with a 15-field flat-lookup expression (ternary, $string, $join on a multi-value array) run on the -Xss512k JVM. The expression mirrors a real-world actual flow. Without the 4 MB eval thread this crashes; with it the test passes. * fix(jsonata): join eval thread in shutdown to eliminate termination race awaitTermination() returns when all tasks are done but the thread itself may still be in its exit sequence. Thread.getAllStackTraces() in the test caught it in that window, causing a flaky assertion. Track the thread reference in the factory and join() it (1 s timeout) after awaitTermination so shutdownEvalExecutor() guarantees the thread is fully dead before returning. * fix(jsonata): use join() without timeout — thread is already done --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Problem
Jsonata$Frame.lookup()walks the parent scope chain via tail recursion:Java does not optimize tail calls. On JVMs with smaller default thread stack sizes (e.g. Windows workers, containers with
-Xss256k), deeply nested JSONata expressions overflow the stack. TheStackOverflowErroris a JVMError, not anException— it cannot be safely caught and recovered from, and it crashes the entire worker thread.Fix
Replace with an iterative loop. Identical semantics, zero recursion:
Impact
plugin-transform-jsonon Windows workers (reported at fix(jsonata): lower default maxDepth to prevent Windows worker crash kestra-io/plugin-transform#79)