[SPARK-54314][PYTHON][CONNECT] Improve Server-Side debuggability in Spark Connect by capturing client application's file name and line numbers #53076
Conversation
holdenk
left a comment
This looks neat. If you can rebase it and run the linter, that would be awesome. Thanks for working to improve the debugging experience for PySpark Connect users :)
List[any_pb2.Any]: A list of Any objects, each representing a stack frame in the call stack trace in the user code.
"""
call_stack_trace = []
if os.getenv("SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK", "false").lower() in ("true", "1"):
Why a system env variable instead of a Spark configuration flag? Also, if we're adding a new configuration option we should probably document it somewhere (if it's a Spark conf flag, we sort of have the doc in-line already).
I feel Spark configuration flags are mainly for things that affect Spark's - i.e. the Spark server's - behavior. This only affects the client's behavior. Also, if I set it in Spark conf, then I'd have to make a network call (spark.conf.get()) every time to decide whether to include the client code call stack or not.
Anyway, that was my thinking but I'm open to changing it to Spark conf if that's the convention / best practice.
Interesting that the conf has a network call. Often on the Java side we'd resolve a non-modifiable config like that in a lazy val.
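As a rough sketch of that idea on the Python client side (a hypothetical helper, not necessarily what this PR does), the flag could be resolved once per process instead of being re-read on every request:

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def _capture_call_stack_enabled() -> bool:
    # Resolved once and cached for the lifetime of the process,
    # analogous to resolving a non-modifiable config into a lazy val.
    return os.getenv("SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK", "false").lower() in ("true", "1")
```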
cc @zhengruifeng and @ueshin FYI
Force-pushed from 55b3c46 to 5db628d
Done. Please LMK if anything else is amiss
@susheel-aroskar looks like there are still some long-line issues in the tests.
Hey @susheel-aroskar, can you merge master in and also look at the failing test?
Force-pushed from 69a5e08 to d06496d
Done. Can you PTAL?
Gentle ping @zhengruifeng @ueshin @HyukjinKwon
if __name__ == "__main__":
    from pyspark.sql.tests.connect.client.test_client_call_stack_trace import *  # noqa: F401
This new test is skipped if it is not listed in modules.py.
Thanks for the pointer, added it to modules.py.
@unittest.skipIf(not should_test_connect, connect_requirement_message)
class CallStackTraceTestCase(unittest.TestCase):
It seems we don't set up a Connect session in this test; do we need an E2E test?
I ran it using python -m unittest pyspark.sql.tests.connect.client.test_client_call_stack_trace in the ./python directory and it passes all tests. Not sure if it needs an E2E setup.
And yes, CI is not running this test; we need to add it to modules.py. Also, we need to reduce the number of tests. To be honest, I don't think we should have a separate test file for this new feature. If it weren't generated by an LLM, it would be 4-5 test cases living in a Connect client test file.
def _is_pyspark_source(filename: str) -> bool:
I don't think we need this to be a function. It's a single line of code.
I made it into a function so that it can be easily unit tested for both positive and negative test cases.
return filtered_stack_frames
def _build_call_stack_trace() -> Optional[any_pb2.Any]:
These functions are used exclusively by SparkConnectClient and they provide information about SparkConnectClient. We should put them in the class instead of having individual functions in the module (I also believe this is a good pattern Connect is trying to keep).
Makes sense. Moved it into the SparkConnectClient class.
break
if i + 1 < len(frames):
    _, _, func, _ = frames[i + 1]
    filtered_stack_frames.append(CallSite(function=func, file=filename, linenum=lineno))
I know this is what first_spark_call does, but I think this is wrong here. In the definition of StackTraceElement, the fields are:
method_name: builtins.str
"""The name of the method containing the execution point."""
file_name: builtins.str
"""The name of the file containing the execution point."""
line_number: builtins.int
"""The line number of the source line containing the execution point."""

method_name should be the method/function that contains the execution point, instead of the callee of this execution point. I don't think you should use the func from the next frame.
If you get rid of this logic, this function becomes so trivial that you should not need a separate function at all. You are iterating through frames to build a list of CallSite and immediately unpacking it in the caller function. I think you should just do everything in a single function.
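For illustration, a minimal sketch of the semantics being suggested here (hypothetical helper name, not the PR's actual code): take the method name, file name and line number all from the same frame, e.g. via traceback.extract_stack, and skip frames that belong to PySpark itself.

```python
import traceback
from typing import List, Tuple


def _user_call_sites() -> List[Tuple[str, str, int]]:
    # Hypothetical helper: each entry describes the frame that *contains* the
    # execution point, so method name, file name and line number all come from
    # the same frame (matching the StackTraceElement field descriptions above).
    sites = []
    for frame in traceback.extract_stack()[:-1]:  # drop this helper's own frame
        if "pyspark" in frame.filename:
            continue  # rough filter for frames inside the PySpark library itself
        sites.append((frame.name, frame.filename, frame.lineno))
    return sites
```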
This is the stack trace captured for test_call_stack_trace_captures_correct_calling_context:
method_name: "level1"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 274
method_name: "level2"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 272
method_name: "level3"
file_name: "/Users/saroskar/Github/debug-improvement/python/pyspark/sql/tests/connect/client/test_client_call_stack_trace.py"
line_number: 268

So it is recording the name of the function invoked along with the line number where it is invoked. For example, the function level1() is invoked on line 274 (the line req = level1()).
I believe this format - the name of the function called + its call location in the caller - is more useful from the point of view of a developer who is trying to debug the error from the server-side logs etc. For example, in most cases the function of interest being invoked would be some DataFrame action like collect() or count(). There may also be multiple such action calls present in a single caller function. So showing which DF action was invoked (the action's name) + the exact location in the code where it was invoked will make things unambiguous IMO. I suspect that's why first_spark_call follows similar logic.
wdyt?
self._update_request_with_user_context_extensions(req)

call_stack_trace = _build_call_stack_trace()
if call_stack_trace:
The min version supported is 3.10 now so you can do
if call_stack_trace := _build_call_stack_trace():
    req.user_context.extensions.append(call_stack_trace)

# https://issues.apache.org/jira/browse/SPARK-54314
@unittest.skipIf(not should_test_connect, connect_requirement_message)
When I looked at this code, the first thing that came to my mind was - is this generated by an LLM? Then I went back to the PR description and confirmed it.
I'll try to explain why I don't enjoy this piece.
This is a very large test case testing a very small function. The test itself is a few times larger than the function it's testing. A lot of the stuff it tests is trivial - when you read the actual test you'll be like - why am I testing this?
People might think it's harmless to have more tests. Yes, tests are good, but only good tests are good. Tests are code, and any extra code increases the maintenance effort. I think we should greatly reduce the number of tests to test what really matters - to test against the real potential dark corners instead of artificial ones. For example, I don't think any human would write 3 test methods to test _is_pyspark_source.
Fair enough. I have removed more than half the original tests which I felt were extraneous. PTAL.
…t application's file name and line numbers in PySpark
…ing them to user_context.extensions
Force-pushed from 592085d to 8edaa5c
@gaogaotiantian @zhengruifeng Thank you for the reviews! This PR looks good to me now. Any further comments? If not, I’d like to move it forward.
Merged to master. Thanks @susheel-aroskar for the PR! Thanks @zhengruifeng @gaogaotiantian for the review!
Thank you @zhengruifeng, @gaogaotiantian, @huaxingao and @holdenk for the reviews!
What changes were proposed in this pull request?
Optionally transmitting client-side code location details (function name, file name and line number) along with actions.
Why are the changes needed?
Right now, no information is sent to the Spark Connect server that helps pinpoint the location of the call (i.e. the Spark DataFrame action) in the client application code. With this change, client application call stack details are sent to the server as a list of (function name, file name, line number) tuples, where they can be logged in the server logs, included as attributes in the corresponding OpenTelemetry spans, etc. This will help users looking at the server-side UI or console quickly pinpoint the call locations of erring or slow calls in their own (client application) code, without the server needing access to the actual code.
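For example (a hypothetical server-side sketch, not code from this PR), a consumer of these tuples could render them as a compact log line or span attribute:

```python
from typing import List, Tuple


def format_client_call_stack(call_sites: List[Tuple[str, str, int]]) -> str:
    # Render (function name, file name, line number) tuples received from the
    # client as a single string suitable for a server log line or an
    # OpenTelemetry span attribute.
    return " <- ".join(f"{func} ({file}:{line})" for func, file, line in call_sites)


# Hypothetical example input and output:
print(format_client_call_stack([
    ("collect", "/app/etl/job.py", 42),
    ("run_daily_job", "/app/etl/job.py", 88),
]))
# collect (/app/etl/job.py:42) <- run_daily_job (/app/etl/job.py:88)
```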
Does this PR introduce any user-facing change?
It adds a new environment variable, SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK, which users can set to true / 1 to opt into transmitting client application code locations to the server. When opted in, the client application's call stack trace details are included in the user_context.extensions field of the Spark Connect protobufs.
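A usage sketch, assuming a locally running Spark Connect server (the endpoint URL is a placeholder):

```python
import os

# Opt in before creating the Connect session.
os.environ["SPARK_CONNECT_DEBUG_CLIENT_CALL_STACK"] = "true"

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# With the flag enabled, the client call site of this action can be captured
# and transmitted in user_context.extensions along with the request.
spark.range(10).collect()
```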
How was this patch tested?
By adding a new unit test file, test_client_call_stack_trace.py.
Was this patch authored or co-authored using generative AI tooling?
Yes.
Some of the unit tests were Generated-by: Cursor