automl broken with pandas 2 #1300

Open
jgukelberger opened this issue Apr 18, 2024 · 6 comments

Comments

@jgukelberger
Member

Several of the official examples and tests currently error out due to incompatibility with pandas 2.

For example, when following AutoML - Time series forecast in a fresh Python 3.10 environment, pip install "flaml[automl,ts_forecast]" currently installs pandas 2.2.2. Then, the first example raises

TypeError: cannot infer freq from a non-convertible index of dtype float64

Similarly, 3/9 tests in test/automl/test_forecast.py fail.

There's also #984, but that focuses on supporting pandas 2. In the meantime, the pandas dependency in setup.py should at least be constrained to <2 to avoid users ending up with a broken installation by default.
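As a quick way to see whether an environment already satisfies the proposed `pandas<2` constraint, a small check along the following lines may help. This is only a sketch: the helper names `pandas_major` and `matches_workaround` are hypothetical and are not part of FLAML or pandas, and the check only looks at the major version number.

```python
from importlib import metadata


def pandas_major(version: str) -> int:
    """Return the major component of a dotted version string such as '2.2.2'."""
    return int(version.split(".")[0])


def matches_workaround(version: str) -> bool:
    """True when the version satisfies the 'pandas<2' constraint discussed here."""
    return pandas_major(version) < 2


# Hypothetical usage: report on whatever pandas version is installed, if any.
try:
    installed = metadata.version("pandas")
except metadata.PackageNotFoundError:
    print("pandas is not installed in this environment")
else:
    status = "satisfies" if matches_workaround(installed) else "does not satisfy"
    print(f"pandas {installed} {status} the 'pandas<2' workaround")
```

Whether a given pandas 2.x release actually triggers the error may depend on the rest of the environment, so this check only reports the constraint, not the failure itself.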

@sonichi
Contributor

sonichi commented Apr 18, 2024

Thanks. Feel free to make a PR.
cc @thinkall

@thinkall
Collaborator

Thanks @jgukelberger, @sonichi. I don't see the issue with pandas 2.0.3 or 2.2.2:

numpy 1.24.3
pandas 2.2.2 / 2.0.3

image

@jgukelberger
Member Author

That's interesting @sonichi. Attached is the full pip list output for the environment I'm seeing these errors in. The errors are fixed with pip install "pandas<2".

The only difference between the two environments is pandas:

$ diff pipenv-works.txt pipenv-fails.txt
31c31
< pandas                1.5.3
---
> pandas                2.2.2

Here's the full output of a failing test case:

$ pytest -v test/automl/test_forecast.py -k test_numpy
=============================================== test session starts ================================================
platform linux -- Python 3.10.14, pytest-7.4.0, pluggy-1.0.0 -- /home/jagukelb/opt/miniconda3/envs/flaml-test/bin/python
cachedir: .pytest_cache
rootdir: /home/jagukelb/src/experiments/FLAML
configfile: pyproject.toml
collected 9 items / 7 deselected / 2 selected

test/automl/test_forecast.py::test_numpy FAILED                                                              [ 50%]
test/automl/test_forecast.py::test_numpy_large PASSED                                                        [100%]

===================================================== FAILURES =====================================================
____________________________________________________ test_numpy ____________________________________________________

    def test_numpy():
        X_train = np.arange("2014-01", "2021-01", dtype="datetime64[M]")
        y_train = np.random.random(size=len(X_train))
        automl = AutoML()
>       automl.fit(
            X_train=X_train[:72],  # a single column of timestamp
            y_train=y_train[:72],  # value for each timestamp
            period=12,  # time horizon to forecast, e.g., 12 months
            task="ts_forecast",
            time_budget=3,  # time budget in seconds
            log_file_name="test/ts_forecast.log",
            n_splits=3,  # number of splits
        )

test/automl/test_forecast.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
flaml/automl/automl.py:1664: in fit
    task.validate_data(
flaml/automl/task/time_series_task.py:166: in validate_data
    data = TimeSeriesDataset(
flaml/automl/time_series/ts_data.py:57: in __init__
    self.frequency = pd.infer_freq(train_data[time_col].unique())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

index = array([1.3885344e+09, 1.3912128e+09, 1.3936320e+09, 1.3963104e+09,
       1.3989024e+09, 1.4015808e+09, 1.4041728e+09,...8e+09, 1.5593472e+09, 1.5619392e+09, 1.5646176e+09,
       1.5672960e+09, 1.5698880e+09, 1.5725664e+09, 1.5751584e+09])

    def infer_freq(
        index: DatetimeIndex | TimedeltaIndex | Series | DatetimeLikeArrayMixin,
    ) -> str | None:
        """
        Infer the most likely frequency given the input index.

        Parameters
        ----------
        index : DatetimeIndex, TimedeltaIndex, Series or array-like
          If passed a Series will use the values of the series (NOT THE INDEX).

        Returns
        -------
        str or None
            None if no discernible frequency.

        Raises
        ------
        TypeError
            If the index is not datetime-like.
        ValueError
            If there are fewer than three values.

        Examples
        --------
        >>> idx = pd.date_range(start='2020/12/01', end='2020/12/30', periods=30)
        >>> pd.infer_freq(idx)
        'D'
        """
        from pandas.core.api import DatetimeIndex

        if isinstance(index, ABCSeries):
            values = index._values
            if not (
                lib.is_np_dtype(values.dtype, "mM")
                or isinstance(values.dtype, DatetimeTZDtype)
                or values.dtype == object
            ):
                raise TypeError(
                    "cannot infer freq from a non-convertible dtype "
                    f"on a Series of {index.dtype}"
                )
            index = values

        inferer: _FrequencyInferer

        if not hasattr(index, "dtype"):
            pass
        elif isinstance(index.dtype, PeriodDtype):
            raise TypeError(
                "PeriodIndex given. Check the `freq` attribute "
                "instead of using infer_freq."
            )
        elif lib.is_np_dtype(index.dtype, "m"):
            # Allow TimedeltaIndex and TimedeltaArray
            inferer = _TimedeltaFrequencyInferer(index)
            return inferer.get_freq()

        elif is_numeric_dtype(index.dtype):
>           raise TypeError(
                f"cannot infer freq from a non-convertible index of dtype {index.dtype}"
            )
E           TypeError: cannot infer freq from a non-convertible index of dtype float64

../../../opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/pandas/tseries/frequencies.py:148: TypeError
================================================= warnings summary =================================================
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/src/experiments/FLAML/flaml/automl/time_series/ts_data.py:121: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
    return pd.concat([self.X_train, self.X_val], axis=0)

test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/src/experiments/FLAML/test/automl/test_forecast.py:158: FutureWarning: 'T' is deprecated and will be removed in a future version, please use 'min' instead.
    X_train = pd.date_range("2017-01-01", periods=70000, freq="T")

test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/prophet/models.py:16: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

test/automl/test_forecast.py: 20 warnings
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/lightgbm/basic.py:696: UserWarning: Usage of np.ndarray subset (sliced data) is not recommended due to it will double the peak memory cost in LightGBM.
    _log_warning("Usage of np.ndarray subset (sliced data) is not recommended "

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================= short test summary info ==============================================
FAILED test/automl/test_forecast.py::test_numpy - TypeError: cannot infer freq from a non-convertible index of dtype float64
============================= 1 failed, 1 passed, 7 deselected, 24 warnings in 19.07s ==============================
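The traceback shows `pd.infer_freq` receiving a float64 array whose values look like epoch seconds (1.3885344e+09 corresponds to 2014-01-01). The sketch below mimics that situation outside FLAML; how the time column ends up as floats inside FLAML is not established in this thread, so the conversion here is purely an assumption used for illustration.

```python
import numpy as np
import pandas as pd

# Monthly timestamps like the ones in test_numpy, converted to float epoch
# seconds to mimic the float64 array shown in the traceback (an assumption).
months = np.arange("2014-01", "2020-01", dtype="datetime64[M]")
epoch_seconds = months.astype("datetime64[s]").astype("int64").astype("float64")

try:
    # The pandas 2.2.2 traceback raises TypeError for float64 input; other
    # pandas versions may behave differently, so nothing is asserted here.
    print("inferred from floats:", pd.infer_freq(epoch_seconds))
except TypeError as err:
    print("TypeError:", err)

# Converting back to datetimes first avoids the numeric-dtype branch.
restored = pd.to_datetime(epoch_seconds, unit="s")
inferred = pd.infer_freq(restored)
print("inferred from datetimes:", inferred)
```

On month-start timestamps like these, the datetime path yields `MS`, the same frequency that statsmodels reports in the passing pandas 1.5.3 run.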

And here is the output when it works after downgrading pandas:

$ pytest -v test/automl/test_forecast.py -k test_numpy
=============================================== test session starts ================================================
platform linux -- Python 3.10.14, pytest-7.4.0, pluggy-1.0.0 -- /home/jagukelb/opt/miniconda3/envs/flaml-test/bin/python
cachedir: .pytest_cache
rootdir: /home/jagukelb/src/experiments/FLAML
configfile: pyproject.toml
collected 9 items / 7 deselected / 2 selected

test/automl/test_forecast.py::test_numpy PASSED                                                              [ 50%]
test/automl/test_forecast.py::test_numpy_large PASSED                                                        [100%]

================================================= warnings summary =================================================
test/automl/test_forecast.py::test_numpy
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/prophet/models.py:16: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

test/automl/test_forecast.py: 22 warnings
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:1641: DeprecationWarning: np.find_common_type is deprecated.  Please use `np.result_type` or `np.promote_types`.
  See https://numpy.org/devdocs/release/1.25.0-notes.html and the docs for more information.  (Deprecated NumPy 1.25)
    return np.find_common_type(types, [])

test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy_large
test/automl/test_forecast.py::test_numpy_large
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/lightgbm/basic.py:696: UserWarning: Usage of np.ndarray subset (sliced data) is not recommended due to it will double the peak memory cost in LightGBM.
    _log_warning("Usage of np.ndarray subset (sliced data) is not recommended "

test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
test/automl/test_forecast.py::test_numpy
  /home/jagukelb/opt/miniconda3/envs/flaml-test/lib/python3.10/site-packages/statsmodels/tsa/base/tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
    self._init_dates(dates, freq)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================== 2 passed, 7 deselected, 34 warnings in 20.20s ===================================

pipenv-fails.txt
pipenv-works.txt

@yareyaredesuyo

yareyaredesuyo commented Jun 18, 2024

A similar error happens in both the Kaggle and Colab environments.

numpy: 1.25.2
pandas: 2.0.3
flaml: 2.1.2

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-97dd957c5800> in <cell line: 13>()
     11 y_train = np.random.random(size=84)
     12 automl = AutoML()
---> 13 automl.fit(
     14     X_train=X_train[:84],  # a single column of timestamp
     15     y_train=y_train,  # value for each timestamp

2 frames
/usr/local/lib/python3.10/dist-packages/flaml/automl/time_series/ts_data.py in __init__(self, train_data, time_col, target_names, time_idx, test_data)
     56 
     57         self.frequency = pd.infer_freq(train_data[time_col].unique())
---> 58         assert self.frequency is not None, "Only time series of regular frequency are currently supported."
     59 
     60         float_cols = list(train_data.select_dtypes(include=["floating"]).columns)

AssertionError: Only time series of regular frequency are currently supported.
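The assertion on line 58 of `ts_data.py` fires when `pd.infer_freq` returns `None`. A minimal sketch of when that happens with plain pandas, independent of FLAML, is below; the sample dates are made up for illustration.

```python
import pandas as pd

# Ten consecutive days: a regular daily series.
regular = pd.date_range("2024-01-01", periods=10, freq="D")

# The same series with 2024-01-05 removed, leaving one irregular gap.
with_gap = regular.delete(4)

freq_regular = pd.infer_freq(regular)
freq_gap = pd.infer_freq(with_gap)

print("regular:", freq_regular)
print("with gap:", freq_gap)
```

The regular series is reported as daily, while the series with a missing day has no discernible frequency, which is the condition the FLAML assertion rejects.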

@Seyda1

Seyda1 commented Nov 4, 2024

I encountered a similar error related to the Date column while using the FLAML library. Should I consider downgrading my pandas version to resolve this issue? The error message I received is as follows:

pandas: 2.1.4
numpy: 1.26.4
flaml : 2.1.1

Image

@thinkall
Collaborator

Thanks for reporting the issue, @Seyda1. The error message shows that your data doesn't have a regular frequency. According to @jgukelberger's feedback, pandas<2 doesn't have the issue. Could you try it and let me know if it works for you? Thanks.
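Before downgrading, it may be worth checking whether the date column really is irregular. One possible diagnostic, sketched below, counts how often each spacing between consecutive timestamps occurs. The helper name `gap_summary` is hypothetical and is not part of FLAML or pandas, and the sample dates are made up.

```python
import pandas as pd


def gap_summary(timestamps) -> pd.Series:
    """Count how often each spacing between consecutive sorted timestamps occurs."""
    ordered = pd.to_datetime(pd.Series(timestamps)).sort_values()
    return ordered.diff().dropna().value_counts()


# Hypothetical data: daily values with one missing day (2024-01-04).
example = gap_summary(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-05"])
print(example)
```

If more than one spacing shows up, the series is irregular in the sense of the assertion message, and the less common spacings point to where values are missing or duplicated.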
