Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Weather data has duplicate index values, rows and object column types #669

Open
psychemedia opened this issue Dec 20, 2024 · 5 comments
Labels
bug Something isn't working good first issue Good for newcomers
Milestone

Comments

@psychemedia
Copy link

psychemedia commented Dec 20, 2024

Describe the issue:

The dataframe that contains weather data:

  • may contain duplicate rows;
  • may contain rows with duplicate index values
  • has columns datatypes as "object" even though columns contain uniformly typed elements (eg all np.float, boolean or Timedelta values)

Reproduce the code example:

import fastf1
fastf1.set_log_level('ERROR')

fp1 = fastf1.get_session(2024, 'Bahrain', 'FP1')
fp1.load(laps=True)

w = fp1.laps.get_weather_data()

w.dtypes, w.index.value_counts()

"""
# Temporary hackfix
import pandas as pd
w.drop_duplicates(inplace=True)
w.sort_values(by='Time', inplace=True)
w.reset_index(drop=True, inplace=True)

for c in w.columns:
    if c=='Time':
        w[c] = pd.to_timedelta(w[c])
    else:
        w[c] = w[c].astype(type(w[c].iloc[0]))
"""

Error message:

(Time             object
 AirTemp          object
 Humidity         object
 Pressure         object
 Rainfall         object
 TrackTemp        object
 WindDirection    object
 WindSpeed        object
 dtype: object,
 66    14
 68    12
 18    12
...
@theOehrly
Copy link
Owner

The relevant code just picks the first weather data value for each lap and concatenates all the values. If there are multiple drivers, laps can of course take place simultaneously and the same weather data info can be returned multiple times. I probably didn't consider the case of calling this function on a Laps object that contains multiple drivers when writing this.

Your "hacky" fix seems pretty OK, to be honest. Except that I'd prefer to not reindex. This way, the data stays a slice of Session.weather_data. But drop duplicates and sort by time seems like the way to go here.

Disregarding my opinion here, what did you expect the function to return? Maybe we need to reconsider this and change the behaviour to better match expectations. I'd like to hear your opinion on that.

In general, I think this would be a relatively simple fix. So once we agree on what should be done here, I'm open to a pull request for this :)

@theOehrly theOehrly added bug Something isn't working good first issue Good for newcomers labels Dec 20, 2024
@theOehrly theOehrly added this to the v3.5.0 milestone Dec 20, 2024
@psychemedia
Copy link
Author

My possible use case was a quick weather chart under a chart showing when cars were out on track during a practice session.

The reindex was just to clean things up a bit. One of the major issues with the returned df as is was the duped index values.

@theOehrly
Copy link
Owner

My possible use case was a quick weather chart under a chart showing when cars were out on track during a practice session.

With this approach, there will be gaps in the weather chart whenever no cars are on track. Is this what you intended to achieve?

I'm not sure yet, if this discussion is relevant to the bug itself, but I'm trying to understand the usage of this function a bit more. The bug is easy enough to understand 😅

@psychemedia
Copy link
Author

Re: gaps.. probably not.. (I've only just started exploring how I can use the package, but that's often the time when I'm likely to use things in a "not meant to be used like that" way...;-)

( The way I use data is to look for stories. So the gappy data is useful for talking about weather when a driver is on track, eg on a long run, but maybe not so useful to explain eg why there are no cars running (tho' if it starts raining and everyone pits that's a clue!)

@theOehrly
Copy link
Owner

I have thought about this some more and about potential use cases (and looked at my own documentation😅). I now think that it's best to leave this as it is. There are advantages to having the duplicates, and there are advantages not having them. For example, the documentation shows an example where weather data is concatenated with timing data such that there is one weather data value for each lap. This requires the duplicates to exist.
The intention of this function is/was/will be to return one value for each lap. When laps overlap, there may be duplicate weather data values in the returned data frame. Users can drop them themselves if they need to (as you did).

Changing this behaviour would be an API change that breaks backwards compatibility. I no longer think this is a bug. And I don't see a compelling enough reason to change the API behaviour here. Specifically because there are reasons for both ways to be useful.

Instead, I'll clarify the documentation to make this behaviour more clear so that users aren't surprised or confused by it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants