⚡️ Speed up function _get_jsonable_obj by 12%
#130
📄 12% (0.12x) speedup for _get_jsonable_obj in mlflow/utils/proto_json_utils.py
⏱️ Runtime: 8.77 milliseconds → 7.81 milliseconds (best of 65 runs)
📝 Explanation and details
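The headline percentage can be reproduced from the two runtimes above. The sketch below assumes the speedup is measured relative to the optimized runtime; that convention is inferred from the reported numbers, not stated in this PR, and the helper name `speedup` is hypothetical.

```python
def speedup(original: float, optimized: float) -> float:
    """Relative speedup, using the optimized runtime as the baseline (assumed convention)."""
    return (original - optimized) / optimized


# Overall runtimes reported above, in milliseconds.
overall = speedup(8.77, 7.81)
print(f"{overall:.1%}")
```

Under this assumption the result rounds to the reported 12% (0.12x).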
The optimized code achieves a 12% speedup by replacing pd.DataFrame(data) with data.to_frame() when processing pandas Series objects. This single change improves performance across all test cases involving Series data.

Key optimization:
- Changed pd.DataFrame(data) to data.to_frame() for pandas Series objects. The to_frame() method is a more direct, optimized pathway that avoids the overhead of the generic DataFrame constructor.

Why this works:
- pd.DataFrame(data) invokes the full DataFrame constructor, which performs type checking and data validation and dispatches across multiple code paths to handle various input types.
- data.to_frame() is a specialized Series method that creates a DataFrame directly with minimal overhead, leveraging the Series' existing structure and metadata.

Performance impact by test category:

Behavioral preservation:
The optimization maintains identical output format and handles all edge cases (empty Series, NaN values, custom indices) correctly. The to_frame() method preserves the Series name as the DataFrame column name, ensuring consistent JSON serialization behavior.

This optimization is especially valuable for ML workflows where Series-to-JSON conversion happens frequently during model training, logging, and data preprocessing pipelines.
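The equivalence described above can be checked directly. This is a minimal sketch, assuming a named Series like the ones in the generated tests; the variable names are illustrative and not taken from the mlflow source.

```python
import pandas as pd

# A named Series, mirroring the inputs used in the regression tests.
s = pd.Series([10, 20], name="x")

# Original path: the generic DataFrame constructor.
via_constructor = pd.DataFrame(s)

# Optimized path: the Series-specific conversion.
via_to_frame = s.to_frame()

# Both paths should yield the same frame; the Series name becomes the column name.
pd.testing.assert_frame_equal(via_constructor, via_to_frame)

# The JSON-oriented output is therefore the same for either path.
records = via_to_frame.to_dict(orient="records")
print(records)
```

Because both frames compare equal, any orient passed to to_dict() should serialize identically for either path.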
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from mlflow.utils.proto_json_utils import _get_jsonable_obj

# unit tests

# -----------------------
# Basic Test Cases
# -----------------------

def test_numpy_array_basic():
    # Test conversion of simple numpy array to list
    arr = np.array([1, 2, 3])
    codeflash_output = _get_jsonable_obj(arr); result = codeflash_output # 3.34μs -> 2.66μs (25.4% faster)

def test_numpy_array_2d():
    # Test conversion of 2D numpy array to nested list
    arr = np.array([[1, 2], [3, 4]])
    codeflash_output = _get_jsonable_obj(arr); result = codeflash_output # 2.92μs -> 2.72μs (7.27% faster)

def test_pandas_dataframe_records():
    # Test conversion of DataFrame with 'records' orient
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='records'); result = codeflash_output # 241μs -> 227μs (6.30% faster)

def test_pandas_dataframe_dict():
    # Test conversion of DataFrame with 'dict' orient
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='dict'); result = codeflash_output # 210μs -> 199μs (5.70% faster)

def test_pandas_series_records():
    # Test conversion of Series with 'records' orient
    s = pd.Series([10, 20], name='x')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='records'); result = codeflash_output # 297μs -> 246μs (20.6% faster)

def test_pandas_series_dict():
    # Test conversion of Series with 'dict' orient
    s = pd.Series([10, 20], name='x')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='dict'); result = codeflash_output # 278μs -> 227μs (22.7% faster)

def test_list_passthrough():
    # Test that lists are returned as is
    data = [1, 2, 3]
    codeflash_output = _get_jsonable_obj(data); result = codeflash_output # 2.68μs -> 2.28μs (17.6% faster)

def test_dict_passthrough():
    # Test that dicts are returned as is
    data = {'foo': 'bar'}
    codeflash_output = _get_jsonable_obj(data); result = codeflash_output # 2.81μs -> 2.51μs (11.8% faster)

def test_str_passthrough():
    # Test that strings are returned as is
    data = "hello"
    codeflash_output = _get_jsonable_obj(data); result = codeflash_output # 2.67μs -> 2.46μs (8.79% faster)

def test_int_passthrough():
    # Test that ints are returned as is
    data = 42
    codeflash_output = _get_jsonable_obj(data); result = codeflash_output # 2.58μs -> 2.48μs (4.08% faster)

def test_float_passthrough():
    # Test that floats are returned as is
    data = 3.14159
    codeflash_output = _get_jsonable_obj(data); result = codeflash_output # 2.52μs -> 2.42μs (4.01% faster)

# -----------------------
# Edge Test Cases
# -----------------------

def test_empty_numpy_array():
    # Test empty numpy array
    arr = np.array([])
    codeflash_output = _get_jsonable_obj(arr); result = codeflash_output # 3.23μs -> 2.47μs (30.6% faster)

def test_empty_dataframe():
    # Test empty DataFrame
    df = pd.DataFrame()
    codeflash_output = _get_jsonable_obj(df); result = codeflash_output # 112μs -> 103μs (8.26% faster)

def test_empty_series():
    # Test empty Series
    s = pd.Series([], name='z')
    codeflash_output = _get_jsonable_obj(s); result = codeflash_output # 330μs -> 261μs (26.4% faster)

def test_numpy_array_with_nan():
    # Test numpy array with NaN values
    arr = np.array([1, np.nan, 3])
    codeflash_output = _get_jsonable_obj(arr); result = codeflash_output # 3.02μs -> 2.62μs (14.9% faster)

def test_dataframe_with_nan():
    # Test DataFrame with NaN values
    df = pd.DataFrame({'a': [1, np.nan], 'b': [3, 4]})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='records'); result = codeflash_output # 243μs -> 228μs (6.36% faster)

def test_series_with_nan():
    # Test Series with NaN values
    s = pd.Series([np.nan, 2], name='y')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='records'); result = codeflash_output # 287μs -> 241μs (19.1% faster)

def test_numpy_array_object_dtype():
    # Test numpy array of dtype object (e.g., strings)
    arr = np.array(['a', 'b', 'c'], dtype=object)
    codeflash_output = _get_jsonable_obj(arr); result = codeflash_output # 2.92μs -> 2.66μs (9.73% faster)

def test_dataframe_with_mixed_types():
    # Test DataFrame with mixed types
    df = pd.DataFrame({'a': [1, 'x'], 'b': [3.0, None]})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='records'); result = codeflash_output # 246μs -> 226μs (8.63% faster)

def test_series_with_custom_index():
    # Test Series with custom index
    s = pd.Series([100, 200], index=['foo', 'bar'], name='z')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='dict'); result = codeflash_output # 253μs -> 214μs (18.3% faster)

def test_unknown_type_passthrough():
    # Test that unknown types are returned as is
    class Custom:
        pass
    obj = Custom()
    codeflash_output = _get_jsonable_obj(obj); result = codeflash_output # 3.03μs -> 2.60μs (16.5% faster)

def test_none_passthrough():
    # Test that None is returned as is
    codeflash_output = _get_jsonable_obj(None); result = codeflash_output # 2.69μs -> 2.41μs (11.7% faster)

def test_numpy_scalar():
    # Test numpy scalar is returned as is
    scalar = np.int32(5)
    codeflash_output = _get_jsonable_obj(scalar); result = codeflash_output # 3.25μs -> 2.88μs (12.8% faster)

def test_pandas_index_passthrough():
    # Test that pandas Index is returned as is (not converted)
    idx = pd.Index([1, 2, 3])
    codeflash_output = _get_jsonable_obj(idx); result = codeflash_output # 2.25μs -> 2.46μs (8.45% slower)

# -----------------------
# Large Scale Test Cases
# -----------------------

def test_large_numpy_array():
    # Test large numpy array conversion
    arr = np.arange(1000)
    codeflash_output = _get_jsonable_obj(arr); result = codeflash_output # 10.4μs -> 9.39μs (11.2% faster)

def test_large_numpy_2d_array():
    # Test large 2D numpy array conversion
    arr = np.arange(1000).reshape(100, 10)
    codeflash_output = _get_jsonable_obj(arr); result = codeflash_output # 12.2μs -> 11.1μs (10.2% faster)

def test_large_dataframe_records():
    # Test large DataFrame conversion with 'records' orient
    df = pd.DataFrame({'a': range(1000), 'b': range(1000, 2000)})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='records'); result = codeflash_output # 553μs -> 508μs (8.88% faster)

def test_large_dataframe_dict():
    # Test large DataFrame conversion with 'dict' orient
    df = pd.DataFrame({'a': range(1000), 'b': range(1000, 2000)})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='dict'); result = codeflash_output # 346μs -> 325μs (6.48% faster)

def test_large_series_records():
    # Test large Series conversion with 'records' orient
    s = pd.Series(range(1000), name='big')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='records'); result = codeflash_output # 499μs -> 445μs (12.1% faster)

def test_large_series_dict():
    # Test large Series conversion with 'dict' orient
    s = pd.Series(range(1000), name='big')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='dict'); result = codeflash_output # 346μs -> 304μs (13.8% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from mlflow.utils.proto_json_utils import _get_jsonable_obj

# unit tests

# ----------------- Basic Test Cases -----------------

def test_numpy_array_basic():
    # Test with a simple 1D numpy array
    arr = np.array([1, 2, 3])
    codeflash_output = _get_jsonable_obj(arr) # 3.53μs -> 2.67μs (32.0% faster)

def test_numpy_array_2d():
    # Test with a 2D numpy array
    arr = np.array([[1, 2], [3, 4]])
    codeflash_output = _get_jsonable_obj(arr) # 2.99μs -> 2.77μs (7.89% faster)

def test_pandas_dataframe_records_orient():
    # Test with a simple pandas DataFrame, default orient='records'
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    expected = [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
    codeflash_output = _get_jsonable_obj(df) # 241μs -> 231μs (4.16% faster)

def test_pandas_dataframe_dict_orient():
    # Test with pandas DataFrame, orient='dict'
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    expected = {'a': {0: 1, 1: 2}, 'b': {0: 3, 1: 4}}
    codeflash_output = _get_jsonable_obj(df, pandas_orient='dict') # 215μs -> 198μs (8.47% faster)

def test_pandas_series_basic():
    # Test with a simple pandas Series
    s = pd.Series([10, 20], name='foo')
    expected = [{'foo': 10}, {'foo': 20}]
    codeflash_output = _get_jsonable_obj(s) # 292μs -> 250μs (16.4% faster)

def test_builtin_types_int():
    # Test with a built-in int type
    codeflash_output = _get_jsonable_obj(42) # 2.84μs -> 2.50μs (13.6% faster)

def test_builtin_types_str():
    # Test with a built-in str type
    codeflash_output = _get_jsonable_obj("hello") # 2.71μs -> 2.39μs (13.5% faster)

def test_builtin_types_list():
    # Test with a built-in list type
    data = [1, 2, 3]
    codeflash_output = _get_jsonable_obj(data) # 2.75μs -> 2.41μs (14.1% faster)

def test_builtin_types_dict():
    # Test with a built-in dict type
    data = {'a': 1, 'b': 2}
    codeflash_output = _get_jsonable_obj(data) # 2.73μs -> 2.37μs (15.3% faster)

# ----------------- Edge Test Cases -----------------

def test_numpy_array_empty():
    # Test with an empty numpy array
    arr = np.array([])
    codeflash_output = _get_jsonable_obj(arr) # 2.78μs -> 2.50μs (10.9% faster)

def test_pandas_dataframe_empty():
    # Test with an empty pandas DataFrame
    df = pd.DataFrame()
    codeflash_output = _get_jsonable_obj(df) # 109μs -> 101μs (8.79% faster)

def test_pandas_series_empty():
    # Test with an empty pandas Series
    s = pd.Series([], dtype=int)
    codeflash_output = _get_jsonable_obj(s) # 227μs -> 187μs (21.6% faster)

def test_numpy_array_object_dtype():
    # Test with numpy array of object dtype
    arr = np.array(['a', 'b', 'c'], dtype=object)
    codeflash_output = _get_jsonable_obj(arr) # 3.36μs -> 2.74μs (22.6% faster)

def test_pandas_dataframe_with_nan():
    # Test with pandas DataFrame containing NaN values
    df = pd.DataFrame({'a': [1, None], 'b': [np.nan, 4]})
    # NaN and None will be preserved as-is in the output
    expected = [{'a': 1, 'b': np.nan}, {'a': None, 'b': 4}]
    codeflash_output = _get_jsonable_obj(df); out = codeflash_output # 242μs -> 226μs (7.11% faster)

def test_pandas_series_with_nan():
    # Test with pandas Series containing NaN
    s = pd.Series([1, np.nan], name='foo')
    codeflash_output = _get_jsonable_obj(s); out = codeflash_output # 288μs -> 244μs (18.0% faster)

def test_custom_object_passthrough():
    # Test with a custom object, should be returned as-is
    class Custom:
        pass
    obj = Custom()
    codeflash_output = _get_jsonable_obj(obj) # 3.06μs -> 2.56μs (19.6% faster)

def test_tuple_passthrough():
    # Test with a tuple, should be returned as-is
    tup = (1, 2, 3)
    codeflash_output = _get_jsonable_obj(tup) # 2.83μs -> 2.33μs (21.5% faster)

def test_set_passthrough():
    # Test with a set, should be returned as-is
    s = {1, 2, 3}
    codeflash_output = _get_jsonable_obj(s) # 2.70μs -> 2.31μs (16.9% faster)

def test_pandas_dataframe_index_and_columns():
    # Test DataFrame with custom index and columns
    df = pd.DataFrame([[10, 20], [30, 40]], index=['x', 'y'], columns=['a', 'b'])
    expected = [{'a': 10, 'b': 20}, {'a': 30, 'b': 40}]
    codeflash_output = _get_jsonable_obj(df) # 238μs -> 222μs (7.04% faster)

def test_pandas_series_with_index():
    # Test Series with custom index
    s = pd.Series([100, 200], index=['foo', 'bar'], name='baz')
    expected = [{'baz': 100}, {'baz': 200}]
    codeflash_output = _get_jsonable_obj(s) # 275μs -> 230μs (19.3% faster)

def test_pandas_dataframe_orient_split():
    # Test DataFrame with orient='split'
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='split'); result = codeflash_output # 241μs -> 223μs (7.99% faster)

def test_pandas_series_orient_split():
    # Test Series with orient='split'
    s = pd.Series([1, 2], name='foo')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='split'); result = codeflash_output # 288μs -> 251μs (14.9% faster)

# ----------------- Large Scale Test Cases -----------------

def test_large_numpy_array():
    # Test with a large 1D numpy array
    arr = np.arange(1000)
    codeflash_output = _get_jsonable_obj(arr); out = codeflash_output # 9.02μs -> 9.15μs (1.43% slower)

def test_large_numpy_array_2d():
    # Test with a large 2D numpy array
    arr = np.arange(1000).reshape(100, 10)
    codeflash_output = _get_jsonable_obj(arr); out = codeflash_output # 11.9μs -> 10.7μs (11.0% faster)

def test_large_pandas_dataframe():
    # Test with a large pandas DataFrame
    df = pd.DataFrame({'a': range(1000), 'b': range(1000, 2000)})
    codeflash_output = _get_jsonable_obj(df); out = codeflash_output # 531μs -> 493μs (7.80% faster)

def test_large_pandas_series():
    # Test with a large pandas Series
    s = pd.Series(range(1000), name='foo')
    codeflash_output = _get_jsonable_obj(s); out = codeflash_output # 518μs -> 452μs (14.4% faster)

def test_large_pandas_dataframe_orient_dict():
    # Test with large DataFrame and orient='dict'
    df = pd.DataFrame({'a': range(1000), 'b': range(1000, 2000)})
    codeflash_output = _get_jsonable_obj(df, pandas_orient='dict'); out = codeflash_output # 340μs -> 323μs (5.50% faster)

def test_large_pandas_series_orient_dict():
    # Test with large Series and orient='dict'
    s = pd.Series(range(1000), name='foo')
    codeflash_output = _get_jsonable_obj(s, pandas_orient='dict'); out = codeflash_output # 354μs -> 306μs (15.6% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
To edit these changes, run git checkout codeflash/optimize-_get_jsonable_obj-mhuj7ybl and push.