
Conversation

LuciferYang
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

@LuciferYang
Contributor Author

I ran a sampling analysis over the execution times of all Python tests and found that the following 16 test cases each take longer than 60 seconds (treating 60 seconds as a provisional threshold for now; should we consider a larger one?). Should we temporarily disable them and re-enable them once their execution times are optimized? What are your opinions on this? @zhengruifeng @dongjoon-hyun @HyukjinKwon (A sketch of this kind of filtering appears after the table.)

| Class | Test case | Time (s) |
|-------|-----------|----------|
| pyspark.sql.tests.connect.streaming.test_parity_listener.StreamingListenerParityTests | test_listener_events_spark_command | 97.006 |
| pyspark.sql.tests.connect.pandas.test_parity_pandas_transform_with_state.TransformWithStateInPandasParityTests | test_transform_with_state_with_timers_single_partition | 87.612 |
| pyspark.sql.tests.connect.pandas.test_parity_pandas_transform_with_state.TransformWithStateInPySparkParityTests | test_transform_with_state_with_timers_single_partition | 89.920 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPandasWithCheckpointV2Tests | test_transform_with_state_with_timers_single_partition | 82.795 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPySparkTests | test_transform_with_state_with_timers_single_partition | 80.131 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPySparkWithCheckpointV2Tests | test_transform_with_state_with_timers_single_partition | 87.991 |
| pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests | test_training_and_prediction | 75.927 |
| pyspark.pandas.tests.connect.indexes.test_parity_datetime_property.DatetimeIndexParityTests | test_properties | 71.911 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPandasTests | test_transform_with_state_with_timers_single_partition | 77.818 |
| pyspark.pandas.tests.connect.groupby.test_parity_split_apply.GroupbyParitySplitApplyTests | test_split_apply_combine_on_series | 66.697 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPandasTests | test_schema_evolution_scenarios | 60.880 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPandasWithCheckpointV2Tests | test_schema_evolution_scenarios | 60.634 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPySparkTests | test_schema_evolution_scenarios | 61.579 |
| pyspark.sql.tests.pandas.test_pandas_transform_with_state.TransformWithStateInPySparkWithCheckpointV2Tests | test_schema_evolution_scenarios | 67.869 |
| pyspark.sql.tests.pandas.test_pandas_udf_scalar.ScalarPandasUDFTests | test_mixed_udf | 66.439 |
| pyspark.pandas.tests.connect.groupby.test_parity_split_apply_min_max.GroupbySplitApplyMMParityTests | test_split_apply_combine_on_series | 61.645 |
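A minimal sketch of the kind of filtering behind this table, assuming per-case durations come from JUnit-style XML reports. The report directory layout is an assumption for illustration, not the actual pipeline used for this analysis:

```python
# Hypothetical sketch: scan JUnit-style XML reports and list individual
# test cases whose duration exceeds a threshold. The "target/test-reports"
# path is an assumption, not the real location used for this analysis.
import xml.etree.ElementTree as ET
from pathlib import Path

THRESHOLD_SECONDS = 60.0  # provisional threshold discussed above

def slow_cases(report_dir: str, threshold: float = THRESHOLD_SECONDS):
    for report in Path(report_dir).glob("**/*.xml"):
        for case in ET.parse(report).getroot().iter("testcase"):
            duration = float(case.get("time", 0.0))
            if duration > threshold:
                yield case.get("classname"), case.get("name"), duration

# Print the slow cases, longest first.
for classname, name, duration in sorted(
    slow_cases("target/test-reports"), key=lambda row: -row[2]
):
    print(f"{classname} {name} {duration:.3f}")
```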

@dongjoon-hyun
Member

Thank you for collecting the results and shedding light on this for us, @LuciferYang. BTW, is 97s the maximum test duration observed so far?

@LuciferYang
Contributor Author

> Thank you for collecting the results and shedding light on this for us, @LuciferYang. BTW, is 97s the maximum test duration observed so far?

Yes, the unit measured here is an individual test case, not a test file or a test class.

@dongjoon-hyun
Member

Thank you so much for spending your time on this. I really appreciate your passion, @LuciferYang .

I rechecked the usage today. It seems we used 20 full-time runners over the last week, which is less than I expected.

The root cause seems to be that the Apache Spark repository has had fewer commits recently; in August, we had a lot of commits.

[Screenshot of runner-usage dashboard, 2025-09-24]

For now, let's keep the as-is status, which means increasing the timeout limit without skipping (until a serious action is needed). We can always skip the tests easily later if needed, but recovering lost test coverage rarely happens. WDYT?
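For context, skipping a slow case later would indeed be a one-line change per test; a minimal sketch using the standard unittest decorator (the class and test names are taken from the table above, and the JIRA reference is a placeholder, not a real ticket):

```python
# Hypothetical sketch of what "skipping later" would look like, using the
# standard unittest decorator. SPARK-XXXXX is a placeholder ticket ID.
import unittest

class TransformWithStateInPandasTests(unittest.TestCase):
    @unittest.skip("Runs > 60s; re-enable after optimizing (see SPARK-XXXXX)")
    def test_transform_with_state_with_timers_single_partition(self):
        ...
```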

@LuciferYang
Contributor Author

@dongjoon-hyun OK ~ let me close this PR for now. If the need arises later, the statistics from this PR can serve as a reference.

@zhengruifeng
Contributor

Thanks @LuciferYang and @dongjoon-hyun for the investigation; the data is very useful.
I think we can leave the pandas API tests (pyspark.sql.tests.pandas.*) alone, because it is highly likely we won't add new tests there frequently, and they don't cause problems (unless there is a serious performance regression in Spark Connect or something).
