⚡️ Speed up function `correlation` by 12,290% #73
📄 12,290% (122.90x) speedup for `correlation` in `src/numpy_pandas/dataframe_operations.py`

⏱️ Runtime: 925 milliseconds → 7.47 milliseconds (best of 179 runs)

📝 Explanation and details
The optimized code achieves a 12,290% speedup by replacing row-by-row pandas DataFrame access with vectorized NumPy operations. The key optimizations:

1. Pre-convert DataFrame to NumPy array
- `values = df[numeric_columns].to_numpy(dtype=float)` converts all numeric columns to a single NumPy array upfront
- Eliminates the `df.iloc[k][col_i]` operations that dominated the original runtime (51.8% + 23.7% + 23.7% = 99.2% of total time)

2. Vectorized NaN filtering
- Replaces `pd.isna()` checks in Python loops
- `mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)` creates the boolean mask in one vectorized operation
- Uses `x = vals_i[mask]` instead of appending valid values one by one

3. Vectorized statistical calculations
- Replaces manual calculations (`sum()`, list comprehensions) with NumPy operations (`x.mean()`, `x.std()`, `((x - mean_x) * (y - mean_y)).mean()`)

Performance characteristics by test case:
- [...] (`values[:, i]`) is very fast

The optimization transforms an O(n²m) algorithm with expensive Python operations into O(nm) with fast C-level NumPy operations, where n is rows and m is numeric columns.
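The three steps above can be combined into a minimal sketch. This is not the code from the PR: the function name `correlation_sketch`, the use of `select_dtypes` to pick numeric columns, the dict-of-pairs return shape, and the NaN handling for empty or constant columns are all assumptions made for illustration. Only the `to_numpy`, mask, and mean/std expressions are taken from the description.

```python
import numpy as np
import pandas as pd


def correlation_sketch(df: pd.DataFrame) -> dict:
    """Pairwise Pearson correlation over numeric columns (illustrative only)."""
    # Assumption: numeric columns are selected by dtype.
    numeric_columns = df.select_dtypes(include="number").columns.tolist()

    # Step 1: one upfront conversion instead of per-row df.iloc[k][col] lookups.
    values = df[numeric_columns].to_numpy(dtype=float)

    result = {}
    for i, col_i in enumerate(numeric_columns):
        vals_i = values[:, i]  # column slice of the array, no copy
        for j, col_j in enumerate(numeric_columns):
            vals_j = values[:, j]

            # Step 2: keep only rows where both columns are non-NaN.
            mask = ~np.isnan(vals_i) & ~np.isnan(vals_j)
            x, y = vals_i[mask], vals_j[mask]

            # Assumption: undefined correlations are reported as NaN.
            if x.size == 0:
                result[(col_i, col_j)] = np.nan
                continue

            # Step 3: vectorized mean, population std, and covariance.
            mean_x, mean_y = x.mean(), y.mean()
            std_x, std_y = x.std(), y.std()
            if std_x == 0 or std_y == 0:
                result[(col_i, col_j)] = np.nan
                continue

            cov = ((x - mean_x) * (y - mean_y)).mean()
            result[(col_i, col_j)] = cov / (std_x * std_y)
    return result
```

Because `x.std()` defaults to the population standard deviation and the covariance is also a plain mean, the two normalizations cancel and the ratio is the usual Pearson coefficient. On small inputs the result should therefore agree with `df.corr()`, which also drops rows pairwise.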
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-correlation-mdpfhnu2` and push.