@codeflash-ai codeflash-ai bot commented Jul 30, 2025

📄 1,072% (10.72x) speedup for dataframe_merge in src/numpy_pandas/dataframe_operations.py

⏱️ Runtime : 1.69 seconds → 144 milliseconds (best of 63 runs)

📝 Explanation and details

The optimized code achieves a 1,072% speedup by replacing slow pandas .iloc[] operations with fast NumPy array indexing. Here are the key optimizations:

1. NumPy Array Access Instead of .iloc[]

  • Original: Used right.iloc[i][right_on] and left.iloc[i] for data access, which are extremely slow pandas operations
  • Optimized: Converted DataFrames to NumPy arrays (left.values, right.values) and used direct array indexing like right_values[i, right_on_idx]
  • Impact: The line profiler shows right.iloc[right_idx] took 60.4% of total time in the original (8.32s), while the equivalent NumPy operations are barely visible in the optimized version
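The contrast can be sketched in a few lines (an illustrative snippet, not the actual dataframe_merge source):

```python
import pandas as pd

right = pd.DataFrame({"key": [1, 2, 3], "num": [100, 200, 300]})

# Slow path: .iloc materializes a Series for the row, then does a label lookup.
slow = right.iloc[1]["key"]

# Fast path: convert once up front, then use plain integer indexing on the ndarray.
right_values = right.values
right_on_idx = right.columns.get_loc("key")
fast = right_values[1, right_on_idx]

assert slow == fast == 2
```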

2. Pre-computed Column Index Mappings

  • Original: Accessed columns by name repeatedly: left_row[col] and right_row[col]
  • Optimized: Pre-computed column-to-index mappings (left_col_indices, right_col_indices) and used direct array indexing: left_values[i, left_col_indices[col]]
  • Impact: Eliminates repeated column name lookups and leverages NumPy's optimized indexing
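A minimal sketch of that mapping (the names mirror the description above, not necessarily the exact implementation):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})

# Build the name -> position mapping once, before the row loop.
left_col_indices = {col: i for i, col in enumerate(left.columns)}
left_values = left.values

# Inside the loop, each access is a dict hit plus integer indexing,
# with no per-access pandas label resolution.
assert left_values[1, left_col_indices["val"]] == "b"
```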

3. Direct Column Index Lookup

  • Original: Accessed join columns through pandas Series indexing
  • Optimized: Used columns.get_loc() to get integer indices upfront, enabling direct NumPy array access
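For example, get_loc resolves each join column to an integer position once, and it also surfaces a KeyError early for a missing column, which the regression tests below rely on:

```python
import pandas as pd

right = pd.DataFrame({"key": [1], "num": [100]})

# One label -> position lookup up front, instead of per-row label indexing.
right_on_idx = right.columns.get_loc("key")
assert right_on_idx == 0

# A missing join column fails fast with KeyError.
try:
    right.columns.get_loc("missing")
except KeyError:
    pass
else:
    raise AssertionError("expected KeyError")
```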

Why This Works:

  • NumPy vs Pandas: NumPy arrays provide O(1) direct memory access, while pandas .iloc[] has significant overhead for type checking, alignment, and Series creation
  • Memory Layout: NumPy arrays store data contiguously in memory, enabling faster access patterns
  • Reduced Object Creation: The original created pandas Series objects for each row access; the optimized version works directly with primitive values
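A rough micro-benchmark makes the overhead gap concrete (absolute timings vary by machine; the relative gap is what matters):

```python
import time
import pandas as pd

df = pd.DataFrame({"key": range(2000), "num": range(2000)})
values = df.values
key_idx = df.columns.get_loc("key")

# Per-row access through .iloc: Series creation + label lookup each time.
start = time.perf_counter()
total_iloc = sum(df.iloc[i]["key"] for i in range(len(df)))
iloc_time = time.perf_counter() - start

# Per-row access through the ndarray: plain integer indexing.
start = time.perf_counter()
total_np = sum(values[i, key_idx] for i in range(len(df)))
np_time = time.perf_counter() - start

assert total_iloc == total_np  # identical result
assert np_time < iloc_time     # ndarray access wins comfortably
print(f".iloc: {iloc_time:.4f}s  numpy: {np_time:.4f}s")
```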

Test Case Performance:
The optimizations are most effective for:

  • Large datasets: test_large_scale_many_duplicates shows a 753% speedup; the more data accessed, the greater the NumPy advantage
  • Many matches: Cases with frequent .iloc[] calls benefit most from the NumPy conversion
  • Cartesian products: When duplicate keys create many row combinations, the NumPy indexing advantage compounds

The optimization maintains identical functionality while dramatically reducing the computational overhead of data access operations.
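Putting the pieces together, the optimized approach can be sketched as follows. This is a hypothetical reconstruction from the description above, not the actual source of dataframe_merge; the quadratic join logic is kept intact, and the win comes purely from replacing per-row pandas access with ndarray indexing:

```python
import pandas as pd

def dataframe_merge_sketch(left, right, left_on, right_on):
    # Resolve join columns to integer positions once (raises KeyError if absent).
    left_on_idx = left.columns.get_loc(left_on)
    right_on_idx = right.columns.get_loc(right_on)

    # Convert to ndarrays once; all row access below is plain integer indexing.
    left_values, right_values = left.values, right.values
    left_cols = list(left.columns)
    right_cols = [c for c in right.columns if c != right_on]
    right_pos = [right.columns.get_loc(c) for c in right_cols]

    rows = []
    for i in range(len(left_values)):
        key = left_values[i, left_on_idx]
        for j in range(len(right_values)):
            if right_values[j, right_on_idx] == key:  # NaN != NaN, so nulls never match
                row = {c: left_values[i, k] for k, c in enumerate(left_cols)}
                for c, k in zip(right_cols, right_pos):
                    row[c] = right_values[j, k]  # right column wins on a name collision
                rows.append(row)

    out_cols = left_cols + [c for c in right_cols if c not in left_cols]
    return pd.DataFrame(rows, columns=out_cols)
```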

Correctness verification report:

Test status:
  • ⚙️ Existing Unit Tests: 🔘 None Found
  • 🌀 Generated Regression Tests: 44 Passed
  • ⏪ Replay Tests: 🔘 None Found
  • 🔎 Concolic Coverage Tests: 🔘 None Found
  • 📊 Tests Coverage: 100.0%
🌀 Generated Regression Tests and Runtime
# imports
import pandas as pd
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import dataframe_merge

# unit tests

# ----------------------------- BASIC TEST CASES -----------------------------

def test_basic_single_match():
    # Test a simple merge with one matching key
    left = pd.DataFrame({'id': [1], 'val': ['a']})
    right = pd.DataFrame({'key': [1], 'num': [100]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 102μs -> 77.6μs (31.4% faster)
    expected = pd.DataFrame({'id': [1], 'val': ['a'], 'num': [100]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_basic_no_match():
    # Test when there are no matching keys
    left = pd.DataFrame({'id': [1], 'val': ['a']})
    right = pd.DataFrame({'key': [2], 'num': [100]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 94.9μs -> 80.0μs (18.6% faster)

def test_basic_multiple_matches():
    # Test when multiple rows match on the join key
    left = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']})
    right = pd.DataFrame({'key': [2, 1], 'num': [200, 100]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 123μs -> 76.6μs (60.6% faster)
    expected = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b'], 'num': [100, 200]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_basic_duplicate_keys_in_right():
    # Test when right DataFrame has duplicate join keys (should result in multiple rows per left)
    left = pd.DataFrame({'id': [1], 'val': ['a']})
    right = pd.DataFrame({'key': [1, 1], 'num': [100, 101]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 113μs -> 75.6μs (50.6% faster)
    expected = pd.DataFrame({'id': [1, 1], 'val': ['a', 'a'], 'num': [100, 101]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_basic_duplicate_keys_in_left():
    # Test when left DataFrame has duplicate join keys (should result in multiple rows per right)
    left = pd.DataFrame({'id': [1, 1], 'val': ['a', 'b']})
    right = pd.DataFrame({'key': [1], 'num': [100]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 117μs -> 75.3μs (56.1% faster)
    expected = pd.DataFrame({'id': [1, 1], 'val': ['a', 'b'], 'num': [100, 100]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_basic_different_column_names():
    # Test when join columns have different names in left and right
    left = pd.DataFrame({'foo': [1, 2], 'bar': ['x', 'y']})
    right = pd.DataFrame({'baz': [2, 1], 'qux': [22, 11]})
    codeflash_output = dataframe_merge(left, right, 'foo', 'baz'); result = codeflash_output # 123μs -> 75.1μs (64.2% faster)
    expected = pd.DataFrame({'foo': [1, 2], 'bar': ['x', 'y'], 'qux': [11, 22]})
    assert result.to_dict('list') == expected.to_dict('list')

# ----------------------------- EDGE TEST CASES ------------------------------

def test_edge_empty_left():
    # Merging with empty left DataFrame should return empty DataFrame
    left = pd.DataFrame({'id': [], 'val': []})
    right = pd.DataFrame({'key': [1, 2], 'num': [100, 200]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 85.3μs -> 72.7μs (17.4% faster)

def test_edge_empty_right():
    # Merging with empty right DataFrame should return empty DataFrame
    left = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']})
    right = pd.DataFrame({'key': [], 'num': []})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 92.8μs -> 78.5μs (18.2% faster)

def test_edge_both_empty():
    # Both DataFrames empty
    left = pd.DataFrame({'id': [], 'val': []})
    right = pd.DataFrame({'key': [], 'num': []})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 63.2μs -> 70.2μs (10.0% slower)

def test_edge_no_common_columns():
    # Test when left and right have no columns in common except join keys
    left = pd.DataFrame({'id': [1], 'foo': ['bar']})
    right = pd.DataFrame({'key': [1], 'baz': [123]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 101μs -> 76.0μs (34.1% faster)
    expected = pd.DataFrame({'id': [1], 'foo': ['bar'], 'baz': [123]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_edge_column_name_collision():
    # Test when right has a column with the same name as left (other than join key)
    left = pd.DataFrame({'id': [1], 'val': ['a']})
    right = pd.DataFrame({'key': [1], 'val': ['b']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 86.9μs -> 59.3μs (46.6% faster)
    # The right's 'val' column should overwrite left's 'val' in the output
    expected = pd.DataFrame({'id': [1], 'val': ['b']})
    assert result.to_dict('list') == expected.to_dict('list')

def test_edge_nonexistent_join_columns():
    # Test when join columns do not exist in one or both DataFrames
    left = pd.DataFrame({'a': [1]})
    right = pd.DataFrame({'b': [1]})
    with pytest.raises(KeyError):
        dataframe_merge(left, right, 'id', 'b') # 27.0μs -> 6.67μs (304% faster)
    with pytest.raises(KeyError):
        dataframe_merge(left, right, 'a', 'key') # 8.96μs -> 4.54μs (97.2% faster)
    with pytest.raises(KeyError):
        dataframe_merge(left, right, 'foo', 'bar') # 7.88μs -> 1.88μs (320% faster)

def test_edge_different_types_in_join_key():
    # Test when join columns have different types (should not match)
    left = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']})
    right = pd.DataFrame({'key': ['1', '2'], 'num': [100, 200]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 117μs -> 84.2μs (38.9% faster)

def test_edge_null_values_in_join_key():
    # Test when join columns contain null values (should not match nulls)
    left = pd.DataFrame({'id': [1, None, 3], 'val': ['a', 'b', 'c']})
    right = pd.DataFrame({'key': [1, 2, None], 'num': [100, 200, 300]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 145μs -> 86.6μs (68.2% faster)
    # Only row with id=1 should match key=1
    expected = pd.DataFrame({'id': [1], 'val': ['a'], 'num': [100]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_edge_left_and_right_have_only_join_key():
    # Both DataFrames have only the join key
    left = pd.DataFrame({'id': [1, 2]})
    right = pd.DataFrame({'id': [2, 1]})
    codeflash_output = dataframe_merge(left, right, 'id', 'id'); result = codeflash_output # 79.4μs -> 39.3μs (102% faster)
    expected = pd.DataFrame({'id': [1, 2]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_edge_non_string_column_names():
    # Test with non-string column names
    left = pd.DataFrame({0: [1, 2], 1: ['a', 'b']})
    right = pd.DataFrame({2: [1, 2], 3: [100, 200]})
    codeflash_output = dataframe_merge(left, right, 0, 2); result = codeflash_output # 127μs -> 77.3μs (65.1% faster)
    expected = pd.DataFrame({0: [1, 2], 1: ['a', 'b'], 3: [100, 200]})
    assert result.to_dict('list') == expected.to_dict('list')

# --------------------------- LARGE SCALE TEST CASES -------------------------

def test_large_scale_many_rows():
    # Test with 1000 rows in both DataFrames, all matching
    n = 1000
    left = pd.DataFrame({'id': list(range(n)), 'val': [str(i) for i in range(n)]})
    right = pd.DataFrame({'key': list(range(n)), 'num': [i*10 for i in range(n)]})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 19.5ms -> 921μs (2016% faster)
    expected = pd.DataFrame({'id': list(range(n)), 'val': [str(i) for i in range(n)], 'num': [i*10 for i in range(n)]})
    assert result.to_dict('list') == expected.to_dict('list')

def test_large_scale_sparse_matches():
    # Test with 1000 rows in left, 1000 in right, only 10 matches
    left = pd.DataFrame({'id': list(range(1000)), 'val': ['a']*1000})
    right = pd.DataFrame({'key': list(range(990, 1000)), 'num': list(range(10))})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 7.55ms -> 145μs (5096% faster)
    expected = pd.DataFrame({
        'id': list(range(990, 1000)),
        'val': ['a']*10,
        'num': list(range(10))
    })
    assert result.to_dict('list') == expected.to_dict('list')

def test_large_scale_many_duplicate_keys():
    # Test with duplicate keys in both left and right
    left = pd.DataFrame({'id': [1]*500 + [2]*500, 'val': ['x']*1000})
    right = pd.DataFrame({'key': [1]*200 + [2]*300, 'num': list(range(500))})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 1.61s -> 138ms (1062% faster)

def test_large_scale_column_collision():
    # Test with large DataFrames and overlapping column names (other than join key)
    left = pd.DataFrame({'id': list(range(100)), 'val': ['l']*100})
    right = pd.DataFrame({'key': list(range(100)), 'val': ['r']*100})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 2.49ms -> 137μs (1719% faster)
    # The right's 'val' should overwrite left's 'val'
    expected = pd.DataFrame({'id': list(range(100)), 'val': ['r']*100})
    assert result.to_dict('list') == expected.to_dict('list')

def test_large_scale_no_matches():
    # Test with large DataFrames and no matches
    left = pd.DataFrame({'id': list(range(500)), 'val': ['a']*500})
    right = pd.DataFrame({'key': list(range(500, 1000)), 'num': list(range(500))})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 6.26ms -> 197μs (3077% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

# imports
import pandas as pd
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import dataframe_merge

# unit tests

# ---------------------------
# 1. Basic Test Cases
# ---------------------------

def test_basic_single_match():
    # Simple case: one row in each, matching key
    left = pd.DataFrame({'id': [1], 'val': ['A']})
    right = pd.DataFrame({'key': [1], 'data': ['B']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 94.5μs -> 67.9μs (39.2% faster)

def test_basic_multiple_matches():
    # Multiple rows, some matching keys
    left = pd.DataFrame({'id': [1, 2], 'val': ['A', 'B']})
    right = pd.DataFrame({'key': [2, 1], 'data': ['Y', 'X']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 120μs -> 67.7μs (77.4% faster)
    # Check that all combinations are present
    merged = set((row['id'], row['val'], row['data']) for _, row in result.iterrows())
    assert merged == {(1, 'A', 'X'), (2, 'B', 'Y')}

def test_basic_duplicate_keys():
    # Duplicate keys in right
    left = pd.DataFrame({'id': [1], 'val': ['A']})
    right = pd.DataFrame({'key': [1, 1], 'data': ['B', 'C']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 110μs -> 66.3μs (66.4% faster)
    datas = set(result['data'])
    assert datas == {'B', 'C'}

def test_basic_duplicate_keys_left():
    # Duplicate keys in left
    left = pd.DataFrame({'id': [1, 1], 'val': ['A', 'B']})
    right = pd.DataFrame({'key': [1], 'data': ['C']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 109μs -> 66.8μs (63.6% faster)
    vals = set(result['val'])
    assert vals == {'A', 'B'}

def test_basic_multiple_matches_both_sides():
    # Duplicate keys on both sides (cartesian product)
    left = pd.DataFrame({'id': [1, 1], 'val': ['A', 'B']})
    right = pd.DataFrame({'key': [1, 1], 'data': ['C', 'D']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 136μs -> 68.0μs (101% faster)
    # All combinations should be present
    combos = set((row['val'], row['data']) for _, row in result.iterrows())
    assert combos == {('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D')}

# ---------------------------
# 2. Edge Test Cases
# ---------------------------

def test_edge_no_matches():
    # No matching keys
    left = pd.DataFrame({'id': [1, 2], 'val': ['A', 'B']})
    right = pd.DataFrame({'key': [3, 4], 'data': ['X', 'Y']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 118μs -> 85.4μs (38.3% faster)

def test_edge_empty_left():
    # Empty left DataFrame
    left = pd.DataFrame({'id': [], 'val': []})
    right = pd.DataFrame({'key': [1, 2], 'data': ['X', 'Y']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 93.8μs -> 79.0μs (18.6% faster)

def test_edge_empty_right():
    # Empty right DataFrame
    left = pd.DataFrame({'id': [1, 2], 'val': ['A', 'B']})
    right = pd.DataFrame({'key': [], 'data': []})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 92.6μs -> 78.0μs (18.6% faster)

def test_edge_both_empty():
    # Both DataFrames empty
    left = pd.DataFrame({'id': [], 'val': []})
    right = pd.DataFrame({'key': [], 'data': []})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 63.2μs -> 70.5μs (10.2% slower)

def test_edge_missing_merge_column_left():
    # left_on column missing in left
    left = pd.DataFrame({'foo': [1]})
    right = pd.DataFrame({'key': [1], 'data': ['X']})
    with pytest.raises(KeyError):
        dataframe_merge(left, right, 'id', 'key') # 34.1μs -> 7.92μs (331% faster)

def test_edge_missing_merge_column_right():
    # right_on column missing in right
    left = pd.DataFrame({'id': [1]})
    right = pd.DataFrame({'foo': [1], 'data': ['X']})
    with pytest.raises(KeyError):
        dataframe_merge(left, right, 'id', 'key') # 21.2μs -> 8.92μs (138% faster)

def test_edge_different_column_types():
    # Merge columns with different types (should not match)
    left = pd.DataFrame({'id': [1, 2], 'val': ['A', 'B']})
    right = pd.DataFrame({'key': ['1', '2'], 'data': ['X', 'Y']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 110μs -> 79.7μs (39.2% faster)

def test_edge_nan_merge_keys():
    # Merge columns contain NaN
    left = pd.DataFrame({'id': [1, None, 3], 'val': ['A', 'B', 'C']})
    right = pd.DataFrame({'key': [1, 3, None], 'data': ['X', 'Y', 'Z']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 139μs -> 68.7μs (102% faster)
    merged = set((row['id'], row['data']) for _, row in result.iterrows())
    assert merged == {(1, 'X'), (3, 'Y')}

def test_edge_column_name_overlap():
    # Both DataFrames have a column with the same name (other than merge key)
    left = pd.DataFrame({'id': [1], 'val': ['A'], 'shared': [10]})
    right = pd.DataFrame({'key': [1], 'shared': [20], 'data': ['B']})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 124μs -> 97.5μs (28.1% faster)

def test_edge_non_string_column_names():
    # Non-string column names
    left = pd.DataFrame({0: [1, 2], 1: ['A', 'B']})
    right = pd.DataFrame({2: [1, 2], 3: ['X', 'Y']})
    codeflash_output = dataframe_merge(left, right, 0, 2); result = codeflash_output # 129μs -> 69.9μs (85.3% faster)

def test_edge_merge_on_index():
    # Merge columns are indices
    left = pd.DataFrame({'val': ['A', 'B']}, index=[1, 2])
    right = pd.DataFrame({'data': ['X', 'Y']}, index=[2, 1])
    # Reset index to make index a column
    left = left.reset_index()
    right = right.reset_index()
    codeflash_output = dataframe_merge(left, right, 'index', 'index'); result = codeflash_output # 119μs -> 68.2μs (75.6% faster)
    merged = set((row['val'], row['data']) for _, row in result.iterrows())
    assert merged == {('A', 'Y'), ('B', 'X')}

# ---------------------------
# 3. Large Scale Test Cases
# ---------------------------

def test_large_scale_all_match():
    # Large DataFrames, all keys match
    size = 500
    left = pd.DataFrame({'id': list(range(size)), 'val': ['A'] * size})
    right = pd.DataFrame({'key': list(range(size)), 'data': ['B'] * size})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 12.1ms -> 359μs (3267% faster)
    # Check a few random rows
    for idx in [0, 100, 499]:
        row = result.iloc[idx]
        assert row['id'] == idx and row['val'] == 'A' and row['data'] == 'B'

def test_large_scale_no_match():
    # Large DataFrames, no keys match
    size = 500
    left = pd.DataFrame({'id': list(range(size)), 'val': ['A'] * size})
    right = pd.DataFrame({'key': list(range(size, 2*size)), 'data': ['B'] * size})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 7.52ms -> 161μs (4555% faster)

def test_large_scale_some_matches():
    # Large DataFrames, some keys match
    size = 500
    left = pd.DataFrame({'id': list(range(size)), 'val': ['A'] * size})
    right = pd.DataFrame({'key': list(range(0, size, 2)), 'data': ['B'] * (size//2)})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 7.94ms -> 230μs (3347% faster)

def test_large_scale_many_duplicates():
    # Many duplicate keys on both sides (cartesian explosion)
    n_left = 100
    n_right = 10
    left = pd.DataFrame({'id': [1]*n_left, 'val': list(range(n_left))})
    right = pd.DataFrame({'key': [1]*n_right, 'data': list(range(n_right))})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 7.18ms -> 841μs (753% faster)
    # Check all combinations present
    combos = set((row['val'], row['data']) for _, row in result.iterrows())
    for l in range(n_left):
        for r in range(n_right):
            assert (l, r) in combos

def test_large_scale_column_overlap():
    # Large DataFrames with overlapping column names
    size = 200
    left = pd.DataFrame({'id': list(range(size)), 'shared': [0]*size})
    right = pd.DataFrame({'key': list(range(size)), 'shared': [1]*size})
    codeflash_output = dataframe_merge(left, right, 'id', 'key'); result = codeflash_output # 3.33ms -> 210μs (1481% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-dataframe_merge-mdpei80l and push.

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jul 30, 2025
@codeflash-ai codeflash-ai bot requested a review from aseembits93 July 30, 2025 03:23