[SPARK-54314][PYTHON][CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers #53076
Conversation
holdenk
left a comment
This looks neat. If you can rebase it and run the linter, that would be awesome. Thanks for working to improve the debugging experience for PySpark Connect users :)
List[any_pb2.Any]: A list of Any objects, each representing a stack frame in the call stack trace in the user code.
"""
call_stack_trace = []
if os.getenv("SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK", "false").lower() in ("true", "1"):
Why a system env variable instead of a Spark configuration flag? Also, if we're adding a new configuration option we should probably document it somewhere (if it's a Spark conf flag, we sort of have the doc in-line already).
I feel Spark configuration flags are mainly for things that affect Spark's - i.e. the Spark server's - behavior. This only affects the client's behavior. Also, if I set it in Spark conf, then I'd have to make a network call (spark.conf.get()) every time to decide whether to include the client code call stack or not.
Anyway, that was my thinking but I'm open to changing it to Spark conf if that's the convention / best practice.
Interesting that the conf has a network call. Often on the Java side we'd resolve a non-modifiable config like that in a lazy val.
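As a rough sketch of that idea on the Python client side (a hypothetical helper, not necessarily what this PR does), the flag could be resolved once per process instead of being re-read on every request:

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def _capture_call_stack_enabled() -> bool:
    # Resolved once and cached for the lifetime of the process,
    # analogous to resolving a non-modifiable config into a lazy val.
    return os.getenv("SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK", "false").lower() in ("true", "1")
```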
cc @zhengruifeng and @ueshin FYI
Force-pushed from 55b3c46 to 5db628d
Done. Please LMK if anything else is amiss
@susheel-aroskar looks like there are still some long-line issues in the tests.
Hey @susheel-aroskar, can you merge master in and also look at the failing test?
Force-pushed from 69a5e08 to d06496d
Done. Can you PTAL?
Gentle ping @zhengruifeng @ueshin @HyukjinKwon
if __name__ == "__main__":
    from pyspark.sql.tests.connect.client.test_client_call_stack_trace import *  # noqa: F401
This new test is skipped if it is not listed in modules.py.
Thanks for the pointer, added it to modules.py.
@unittest.skipIf(not should_test_connect, connect_requirement_message)
class CallStackTraceTestCase(unittest.TestCase):
It seems we don't set up a Connect session in this test; do we need an E2E test?
I ran it using python -m unittest pyspark.sql.tests.connect.client.test_client_call_stack_trace in the ./python directory and it passes all tests. Not sure if it needs an E2E setup.
And yes, CI is not running this test; we need to add it to modules.py. Also, we need to reduce the number of tests. To be honest, I don't think we should have a separate test file for this new feature. If it weren't generated by an LLM, it would be 4-5 test cases living in a Connect client test file.
def _is_pyspark_source(filename: str) -> bool:
I don't think we need this to be a function. It's a single line of code.
I made it into a function so that it can be easily unit tested for both positive and negative test cases.
return filtered_stack_frames
def _build_call_stack_trace() -> Optional[any_pb2.Any]:
These functions are used exclusively by SparkConnectClient and they provide information about SparkConnectClient. We should put them in the class instead of having individual functions in the module (I also believe this is a good pattern Connect is trying to keep).
Makes sense. Moved it into the SparkConnectClient class.
break
if i + 1 < len(frames):
    _, _, func, _ = frames[i + 1]
    filtered_stack_frames.append(CallSite(function=func, file=filename, linenum=lineno))
I know this is what first_spark_call does, but I think this is wrong here. In the definition of StackTraceElement, the fields are:
method_name: builtins.str
"""The name of the method containing the execution point."""
file_name: builtins.str
"""The name of the file containing the execution point."""
line_number: builtins.int
"""The line number of the source line containing the execution point."""

method_name should be the method/function that contains the execution point, instead of the callee of this execution point. I don't think you should use the func from the next frame.
If you get rid of this logic, this function becomes so trivial that you should not need a separate function at all. You are iterating through frames to build a list of CallSite and immediately unpacking it in the caller function. I think you should just do everything in a single function.
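For illustration, a minimal sketch of the semantics being suggested here (hypothetical helper name, not the PR's actual code): take the method name, file name and line number all from the same frame, e.g. via traceback.extract_stack, and skip frames that belong to PySpark itself.

```python
import traceback
from typing import List, Tuple


def _user_call_sites() -> List[Tuple[str, str, int]]:
    # Hypothetical helper: each entry describes the frame that *contains* the
    # execution point, so method name, file name and line number all come from
    # the same frame (matching the StackTraceElement field descriptions above).
    sites = []
    for frame in traceback.extract_stack()[:-1]:  # drop this helper's own frame
        if "pyspark" in frame.filename:
            continue  # rough filter for frames inside the PySpark library itself
        sites.append((frame.name, frame.filename, frame.lineno))
    return sites
```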
This is the stack trace captured for test_call_stack_trace_captures_correct_calling_context:
method_name: "level1"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 274
method_name: "level2"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 272
method_name: "level3"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 268

So it is recording the name of the function invoked along with the line number where it is invoked. For example, the function level1() is invoked on line 274 (the line req = level1()).
I believe this format - the name of the function called + its call location in the caller - is more useful from the point of view of a developer who is trying to debug the error from the server-side logs etc. For example, in most cases the function of interest being invoked would be some DataFrame action like collect() or count(). There may also be multiple such action calls present in a single caller function. So showing which DF action was invoked (the action's name) + the exact location in the code where it was invoked will make things unambiguous IMO. I suspect that's why first_spark_call follows similar logic.
wdyt?
self._update_request_with_user_context_extensions(req)

call_stack_trace = _build_call_stack_trace()
if call_stack_trace:
The min version supported is 3.10 now so you can do
if call_stack_trace := _build_call_stack_trace():
    req.user_context.extensions.append(call_stack_trace)

# https://issues.apache.org/jira/browse/SPARK-54314
@unittest.skipIf(not should_test_connect, connect_requirement_message)
When I looked at this code, the first thing that came to my mind was - is this generated by an LLM? Then I went back to the PR description and confirmed it.
I'll try to explain why I don't enjoy this piece.
This is a very large test case testing a very small function. The test itself is a few times larger than the function it's testing. A lot of the stuff it tests is trivial - when you read the actual test you'll be like - why am I testing this?
People might think it's harmless to have more tests. Yes, tests are good, but only good tests are good. Tests are code, and any extra code increases the maintenance effort. I think we should greatly reduce the number of tests to test what really matters - to test against the real potential dark corners instead of artificial ones. For example, I don't think any human would write 3 test methods to test _is_pyspark_source.
Fair enough. I have removed more than half the original tests which I felt were extraneous. PTAL.
…t application's file name and line numbers in PySpark
…ing them to user_context.extensions
Force-pushed from 592085d to 8edaa5c
@gaogaotiantian @zhengruifeng Thank you for the reviews! This PR looks good to me now. Any further comments? If not, I’d like to move it forward.
Merged to master. Thanks @susheel-aroskar for the PR! Thanks @zhengruifeng @gaogaotiantian for the review!
Thank you @zhengruifeng, @gaogaotiantian, @huaxingao and @holdenk for the reviews!
What changes were proposed in this pull request?
Optionally transmitting client-side code location details (function name, file name and line number) along with actions.
Why are the changes needed?
Right now, no information is sent to the Spark Connect server that helps pinpoint the location of the call (i.e. the Spark DataFrame action) in the client application code. With this change, client application call stack details are sent to the server as a list of (function name, file name, line number) tuples, where they can be logged in the server logs, included as attributes in the corresponding OpenTelemetry spans, etc. This will help users looking at the server-side UI or console quickly pinpoint the call locations of erring or slow calls in their own (client application) code, without the server needing access to the actual code.
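For example (a hypothetical server-side sketch, not code from this PR), a consumer of these tuples could render them as a compact log line or span attribute:

```python
from typing import List, Tuple


def format_client_call_stack(call_sites: List[Tuple[str, str, int]]) -> str:
    # Render (function name, file name, line number) tuples received from the
    # client as a single string suitable for a server log line or an
    # OpenTelemetry span attribute.
    return " <- ".join(f"{func} ({file}:{line})" for func, file, line in call_sites)


# Hypothetical example input and output:
print(format_client_call_stack([
    ("collect", "/app/etl/job.py", 42),
    ("run_daily_job", "/app/etl/job.py", 88),
]))
# collect (/app/etl/job.py:42) <- run_daily_job (/app/etl/job.py:88)
```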
Does this PR introduce any user-facing change?
It adds a new environment variable, SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK, which users can set to true / 1 to opt into transmitting client application code locations to the server. When opted in, the client application's call stack trace details are included in the user_context.extensions field of the Spark Connect protobufs.
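A usage sketch, assuming a locally running Spark Connect server (the endpoint URL is a placeholder):

```python
import os

# Opt in before creating the Connect session.
os.environ["SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK"] = "true"

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# With the flag enabled, the client call site of this action can be captured
# and transmitted in user_context.extensions along with the request.
spark.range(10).collect()
```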
How was this patch tested?
By adding a new unit test file, test_client_call_stack_trace.py.
Was this patch authored or co-authored using generative AI tooling?
Yes.
Some of the unit tests were Generated-by: Cursor