[SPARK-54344][PYTHON] Kill the worker if flush fails in daemon.py #53055
base: master
Conversation
I'd mark this as draft for now as I'm not super confident about this.
python/pyspark/daemon.py
Outdated
faulthandler_log_path = os.path.join(faulthandler_log_path, str(os.getpid()))
with open(faulthandler_log_path, "w") as faulthandler_log_file:
    faulthandler.dump_traceback(file=faulthandler_log_file)
raise
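For reference, faulthandler.dump_traceback writes the stack of every thread to the given file object, which is what later surfaces in the error message or the executor's stderr. A small standalone illustration of that mechanism (plain Python, not Spark code):

import faulthandler
import os
import tempfile

# Standalone illustration of the mechanism reused above: dump all thread
# stacks into a per-PID file and read them back.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, str(os.getpid()))
with open(log_path, "w") as log_file:
    faulthandler.dump_traceback(file=log_file)

with open(log_path) as log_file:
    print(log_file.read())  # e.g. "Current thread 0x... (most recent call first): ..."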
Is the behavior change here intentional? The original code swallows the exception and returns the actual exit code, whereas this will raise.
Yes, it's intentional.
If it returns some exit code (!= 0), the daemon waits for Java to send a kind of "ACK" for graceful shutdown, which is part of the protocol; but Java is still waiting for the reply from Python, so it won't recover anyway.
I don't think it can recover in this case, as something must already be wrong on the connection.
spark/python/pyspark/daemon.py
Lines 229 to 236 in 97eec3f
if not reuse or code:
    # wait for closing
    try:
        while sock.recv(1024):
            pass
    except Exception:
        pass
    break
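For additional context, the flush call being discussed sits in the daemon's per-worker wrapper, which is not quoted in this thread. A from-memory sketch with illustrative names (not the exact code in the PR), showing how the failure was swallowed before this change:

def run_worker(infile, outfile, worker_main):
    # From-memory sketch of the pre-existing handling in daemon.py's worker
    # wrapper: a flush failure is swallowed and only the exit code is returned,
    # so the JVM side can keep waiting for a reply that never left the buffer.
    exit_code = 0
    try:
        exit_code = worker_main(infile, outfile)
    except SystemExit as exc:
        exit_code = exc.code if isinstance(exc.code, int) else 1
    finally:
        try:
            outfile.flush()
        except Exception:
            pass  # swallowed before this PR
    return exit_code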
It may be good enough, for now, to just write out the exception to see whether this actually happens, without this behavior change.
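A logging-only variant of that handling, as suggested here, might look like the following (a sketch with an illustrative function name, not code from the PR):

import sys
import traceback


def flush_or_log(outfile):
    # Sketch: log the flush failure (to confirm whether it happens in practice)
    # instead of re-raising and killing the worker.
    try:
        outfile.flush()
    except Exception:
        print("Failed to flush the worker output:", file=sys.stderr)
        traceback.print_exc(file=sys.stderr)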
val PYTHON_DAEMON_KILL_WORKER_ON_FLUSH_FAILURE =
  ConfigBuilder("spark.python.daemon.killWorkerOnFlushFailure")
    .doc("When enabled, exceptions raised during output flush operations in the Python " +
      "worker managed under Python daemon are not caught, causing the worker to terminate " +
      "with the exception. This allows Spark to detect the failure and retry the task. " +
      "When disabled (default), flush exceptions are caught and logged, " +
      "but the worker continues, " +
      "which could cause the worker to get stuck due to protocol mismatch.")
    .version("4.1.0")
    .booleanConf
    .createWithDefault(false)
Defaulting to false here feels weird, given we're saying we expect issues when it's disabled.
Updated it to true. Thanks!
| "with the exception. This allows Spark to detect the failure and retry the task. " + | ||
| "When disabled, flush exceptions are caught and logged but the worker continues, " + | ||
| "which could cause the worker to get stuck due to protocol mismatch.") | ||
| .version("4.1.0") |
Is it targeting Apache Spark 4.1.0 as a bug fix?
Personally, yes, but it depends on how the reviews go. Also, I'll follow your decision as the release manager. Thanks.
What changes were proposed in this pull request?
Kills the worker if flush fails in daemon.py.

New configurations:
- spark.python.daemon.killWorkerOnFlushFailure (default: true)
- spark.sql.execution.pyspark.udf.daemonKillWorkerOnFlushFailure (falls back to the above)

Right before the worker dies, it reuses the faulthandler feature to record a thread dump, which will appear in the error message if faulthandler is enabled. Even when faulthandler is not enabled, the error will appear in the executor's stderr file.

When this is disabled, the behavior is the same as before, but with a log.
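For illustration, opting out of the new behavior would presumably look like the following sketch. Since the conf affects the executor-side Python daemon, it likely needs to be set when the application is launched (e.g. via spark-submit --conf) rather than changed at runtime; the exact plumbing is not shown in this thread.

# Illustrative sketch only: disabling the kill-on-flush-failure behavior
# using the conf introduced in this PR.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.python.daemon.killWorkerOnFlushFailure", "false")
    .getOrCreate()
)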
Why are the changes needed?
Currently an exception caused by an outfile.flush() failure in daemon.py is ignored, but if the last command from worker_main has not been flushed yet, it can leave a UDF stuck in Java, waiting for a response from the Python worker. The worker should just die and let Spark retry the task.
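To make the failure mode concrete, here is a small self-contained illustration (plain Python, not Spark code) of a reader blocking on a buffered write that is never flushed:

import socket

# A buffered write that is never flushed stays on the writer's side, so the
# reader (standing in for the JVM here) blocks waiting for it.
a, b = socket.socketpair()
writer = a.makefile("wb", buffering=8192)

writer.write(b"\x00\x00\x00\x02")  # stand-in for the worker's final reply
# ... the flush fails (or is skipped) here, so the bytes never leave the buffer

b.settimeout(1.0)
try:
    b.recv(4)
except socket.timeout:
    print("reader timed out: the unflushed reply never arrived")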
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manually.
Tested with a patch to emulate the failure case.

With just pass (the behavior before this change), it gets stuck; with this change, the worker dies and Spark retries the task.

Was this patch authored or co-authored using generative AI tooling?
No.