
[SPARK-54314][PYTHON][CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers #53076

Closed

susheel-aroskar wants to merge 15 commits into apache:master from susheel-aroskar:sarsokar-SPARK-54314-pyspark-connect-debug

Conversation

@susheel-aroskar

What changes were proposed in this pull request?

This PR optionally transmits client-side code location details (function name, file name, and line number) along with actions.

Why are the changes needed?

Currently, no information is sent to the Spark Connect server that helps pinpoint the location of a call (i.e. a Spark DataFrame action) in the client application code. With this change, client application call stack details are sent to the server as a list of (function name, file name, line number) tuples, where they can be logged in the server logs, included as attributes in the corresponding OpenTelemetry spans, etc. This helps users looking at the server-side UI or console quickly pinpoint the call locations of failing or slow calls in their own (client application) code, without the server needing access to the actual code.

Does this PR introduce any user-facing change?

It adds a new environment variable, SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK, which users can set to true / 1 to opt into transmitting client application code locations to the server. When opted in, the client application's call stack trace details are included in the user_context.extensions field of the Spark Connect protobufs.

How was this patch tested?

By adding a new unit test, test_client_call_stack_trace.py.

Was this patch authored or co-authored using generative AI tooling?

Yes.
Some of the unit tests were Generated-by: Cursor

Contributor

@holdenk holdenk left a comment


This looks neat. If you can rebase it and run the linter that would be awesome. Thanks for working to improve the debugging experience for PySpark connect users :)

Comment thread: python/pyspark/sql/connect/client/core.py (outdated)
List[any_pb2.Any]: A list of Any objects, each representing a stack frame in the call stack trace in the user code.
"""
call_stack_trace = []
if os.getenv("SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK", "false").lower() in ("true", "1"):
Contributor


Why system env variable instead of Spark configuration flag? Also if we're adding a new configuration option we should probably document it somewhere (if it's a spark conf flag we have the doc in-line sort of already)

Author


I feel Spark configuration flags are mainly for things that affect Spark's - i.e. the Spark server's - behavior. This only affects the client's behavior. Also, if I set it in Spark conf, then I'd have to make a network call (spark.conf.get()) every time to decide whether to include the client code call stack or not.
Anyway, that was my thinking but I'm open to changing it to Spark conf if that's the convention / best practice.

Contributor


Interesting that the conf requires a network call. On the Java side we'd often resolve a non-modifiable config like that in a lazy val.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-54314][PySpark] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers [WIP][SPARK-54314][PYTHON[[CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers Nov 18, 2025
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-54314][PYTHON[[CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers [WIP][SPARK-54314][PYTHON][CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers Nov 18, 2025
@HyukjinKwon
Member

cc @zhengruifeng and @ueshin FYI

@susheel-aroskar susheel-aroskar force-pushed the sarsokar-SPARK-54314-pyspark-connect-debug branch from 55b3c46 to 5db628d on November 19, 2025 01:54
@susheel-aroskar
Author

This looks neat. If you can rebase it and run the linter that would be awesome. Thanks for working to improve the debugging experience for PySpark connect users :)

Done. Please LMK if anything else is amiss

@holdenk
Contributor

holdenk commented Nov 20, 2025

@susheel-aroskar looks like still some long line issues in the tests.

@holdenk
Contributor

holdenk commented Nov 27, 2025

Hey @susheel-aroskar can you merge master in and also look at the failing test?

@susheel-aroskar susheel-aroskar force-pushed the sarsokar-SPARK-54314-pyspark-connect-debug branch from 69a5e08 to d06496d on December 3, 2025 07:19
@sfc-gh-saroskar
Contributor

Hey @susheel-aroskar can you merge master in and also look at the failing test?

Done. Can you PTAL?

@susheel-aroskar susheel-aroskar marked this pull request as ready for review December 4, 2025 07:58
@susheel-aroskar susheel-aroskar changed the title [WIP][SPARK-54314][PYTHON][CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers [SPARK-54314][PYTHON][CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers Dec 8, 2025
@huaxingao
Contributor

gentle ping @zhengruifeng @ueshin @HyukjinKwon



if __name__ == "__main__":
from pyspark.sql.tests.connect.client.test_client_call_stack_trace import * # noqa: F401
Contributor


this new test is skipped if it is not listed in modules.py

Author


Thanks for the pointer, added it to the modules.py

@zhengruifeng
Contributor

cc @gaogaotiantian



@unittest.skipIf(not should_test_connect, connect_requirement_message)
class CallStackTraceTestCase(unittest.TestCase):
Contributor


it seems we don't set up a connect session in this test, do we need a E2E test?

Author


I ran it using python -m unittest pyspark.sql.tests.connect.client.test_client_call_stack_trace in ./python directory and it passes all tests. Not sure if it needs E2E setup.

Contributor

@gaogaotiantian gaogaotiantian left a comment


And yes, CI is not running this test; we need to add it to modules.py. Also, we need to reduce the number of tests. To be honest, I don't think we should have a separate test file for this new feature. If it weren't generated by an LLM, it would be like 4-5 test cases living in a connect client test file.

)


def _is_pyspark_source(filename: str) -> bool:
Contributor


I don't think we need this to be a function. It's a single line code.

Author


I made it into a function so that it can be easily unit tested for both positive and negative test cases.

return filtered_stack_frames


def _build_call_stack_trace() -> Optional[any_pb2.Any]:
Contributor


These functions are exclusively used by SparkConnectClient, and they provide information about SparkConnectClient. We should put them in the class instead of having individual functions in the module (I also believe this is a good pattern connect is trying to keep).

Author


Makes sense. Moved it into the SparkConnectClient class.

break
if i + 1 < len(frames):
_, _, func, _ = frames[i + 1]
filtered_stack_frames.append(CallSite(function=func, file=filename, linenum=lineno))
Contributor


I know this is what first_spark_call does, but I think this is wrong here. In the definition of StackTraceElement, the fields are:

        method_name: builtins.str
        """The name of the method containing the execution point."""
        file_name: builtins.str
        """The name of the file containing the execution point."""
        line_number: builtins.int
        """The line number of the source line containing the execution point."""

method_name should be the method/function that contains the execution point, instead of the callee of this execution point. I don't think you should use the func from the next frame.

If you get rid of this logic, this function would be trivial, that you should not need to do a separate function at all. You are iterating through frames to build a list of CallSite and immediately unpacking it in the caller function. I think you should just do everything in a single function.

Author


This is the stack trace filled for test_call_stack_trace_captures_correct_calling_context

method_name: "level1"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 274

method_name: "level2"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 272

method_name: "level3"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 268

So it is recording the name of the function invoked along with the line number where it is invoked. For example, function level1() is invoked on the line# 274 (the line req = level1()).

I believe this format - the name of the function called + its call location in the caller - is more useful from the point of view of a developer trying to debug an error from the server-side logs. For example, in most cases the function of interest being invoked would be some DataFrame action like collect() or count(). There may be multiple of these action calls present in a single caller function, too. So showing which DF action was invoked (the action's name) + the exact location in the code where it was invoked makes things unambiguous IMO. I suspect that's why first_spark_call follows similar logic.

wdyt?
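The shift-by-one attribution being debated here can be shown with a toy frame list. The file name and line numbers below are invented for illustration; the tuple layout matches the (filename, lineno, name, line) shape of traceback frames.

```python
# Toy illustration of the attribution choice: each reported entry pairs the
# *callee's* name with the *caller's* file and line, i.e. "which function was
# invoked, and where". Frames are listed outermost-first, as in
# traceback.extract_stack().
frames = [
    ("app.py", 274, "<module>", "req = level1()"),
    ("app.py", 272, "level1", "return level2()"),
    ("app.py", 268, "level2", "return level3()"),
]

call_sites = []
for i, (filename, lineno, _func, _text) in enumerate(frames):
    if i + 1 < len(frames):
        # Take the function name from the next (inner) frame: the function
        # being *called* at this file/line.
        _, _, callee, _ = frames[i + 1]
        call_sites.append((callee, filename, lineno))

# call_sites == [("level1", "app.py", 274), ("level2", "app.py", 272)]
```

Under this scheme, "level1 at app.py:274" reads as "level1() was invoked at line 274", which is the interpretation the author argues for; the StackTraceElement docstring instead describes the method containing the execution point, which is the reviewer's reading.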

self._update_request_with_user_context_extensions(req)

call_stack_trace = _build_call_stack_trace()
if call_stack_trace:
Contributor


The min version supported is 3.10 now so you can do

if call_stack_trace := _build_call_stack_trace():
    req.user_context.extensions.append(call_stack_trace)

Author


Changed, thanks.

# https://issues.apache.org/jira/browse/SPARK-54314


@unittest.skipIf(not should_test_connect, connect_requirement_message)
Contributor


When I looked at this code, the first thing came to my mind is - is this generated by LLM? Then I went back to the PR description and confirmed it.

I'll try to explain why I don't enjoy this piece.

This is a very large test case, testing a very small function. The test itself is a few times larger than the function it's testing. A lot of the stuff it tests are trivial - when you read the actual test you'll be like - why am I testing this?

People might think - it's harmless to have more tests. Yes tests are good, but only good tests are good. Tests are code, and any extra code increases effort to maintain. I think we should greatly reduce the number of tests to test what really matters. To test against the real potential dark corners instead of artificial ones. For example, I don't think any human would write 3 test methods to test _is_pyspark_source.

Author


Fair enough. I have removed more than half the original tests which I felt were extraneous. PTAL.

@susheel-aroskar susheel-aroskar force-pushed the sarsokar-SPARK-54314-pyspark-connect-debug branch from 592085d to 8edaa5c on January 7, 2026 04:33
@huaxingao
Contributor

@gaogaotiantian @zhengruifeng Thank you for the reviews! This PR looks good to me now. Any further comments? If not, I’d like to move it forward.

Contributor

@huaxingao huaxingao left a comment


LGTM

@huaxingao
Contributor

Merged to master. Thanks @susheel-aroskar for the PR! Thanks @zhengruifeng @gaogaotiantian for the review!

@susheel-aroskar
Author

Merged to master. Thanks @susheel-aroskar for the PR! Thanks @zhengruifeng @gaogaotiantian for the review!

Thank you @zhengruifeng, @gaogaotiantian, @huaxingao and @holdenk for the reviews!


8 participants