
Aggregations (BasicStatsOTF) #83


Draft · wants to merge 8 commits into trunk

Conversation

@intarga (Member) commented Mar 29, 2025

No description provided.

// TODO: this is a messy hack, but it's the only way people at met currently have to determine
// time_resolution. Ultimately we intend to store time_resolution info in the database under
// public.timeseries or labels.met. This will be populated by a combination of a script that looks
// at a timeseries's history, and manual editing by content managers.
Collaborator:

I agree; the only way to get correct metadata is to analyze the timeseries, since no generalization holds fully. The list below is a rough first guess. There is also a table in StInfoSys called time_resolution_default, which should provide a more fine-grained estimate by defining a resolution per (typeid, paramid).
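
As an illustration of how that table could be used once loaded into memory, here is a minimal Rust sketch; the struct, field names, and per-paramid fallback are assumptions, not StInfoSys's actual schema:

    use std::collections::HashMap;
    use std::time::Duration;

    // Assumed in-memory form of StInfoSys's time_resolution_default:
    // a default resolution per (typeid, paramid), with a per-paramid
    // fallback when no exact pair is registered.
    struct ResolutionDefaults {
        by_pair: HashMap<(i32, i32), Duration>,
        by_param: HashMap<i32, Duration>,
    }

    impl ResolutionDefaults {
        fn lookup(&self, type_id: i32, param_id: i32) -> Option<Duration> {
            self.by_pair
                .get(&(type_id, param_id))
                .or_else(|| self.by_param.get(&param_id))
                .copied()
        }
    }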

@ketilt (Collaborator) commented Mar 31, 2025

Exciting that work is being done on aggregation. Whenever you need it, I can summarize the most important bits from my implementation in ODA to save you guys some time. (I've probably done so before too.)

This struct defines the parameters I have used to specify how an aggregation should be performed. The offsets are important to reproduce, and fortunately our metadata here is exact and consistent, unlike the resolutions. Code values do appear and should not influence aggregations.
https://gitlab.met.no/oda/oda/-/blob/main/internal/pkg/abakus/main.go?ref_type=heads#L39
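
For readers without GitLab access, here is a hypothetical Rust analogue of the kind of spec ketilt describes; the field names and types are guesses based on his description (method, period, offsets), not the actual ODA definition:

    // Hypothetical analogue of ODA's aggregation spec; the authoritative
    // version is the Go struct linked above.
    enum Method {
        Min,
        Max,
        Sum,
        Avg,
    }

    struct AggregationSpec {
        input_param: String,      // e.g. "TA"
        output_param: String,     // e.g. "TAM"
        method: Method,
        period: chrono::Duration, // length of the aggregation window
        offset: chrono::Duration, // window offset relative to midnight UTC
    }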

This gives an overview of all the different aggregation methods we use in our timeseries products:
https://gitlab.met.no/oda/oda/-/blob/main/internal/pkg/abakus/aggregators.go?ref_type=heads

@intarga (Member, Author) commented Apr 3, 2025

@ketilt As discussed on Tuesday, I probably won't come back to this for a month or two yet, but if you have time, providing answers to the TODOs I left in the code would be a huge help.

Avg,
}

// TODO: Just use RelativeDuration?
@ketilt (Collaborator) commented Apr 3, 2025

Is RelativeDuration a way to enable the variable durations of months and years? In ODA I used a parser for ISO 8601 duration strings which dealt with that.

One challenge was to ensure that a duration crossing a month or year boundary used the appropriate length. When this wasn't done right, the aggregations would end on a different day each month.

To achieve this consistently, I actually added and subtracted a short interval to the steps before calculating where to go next. This ensured that the relative duration from the starting point would end up in the right place.
https://gitlab.met.no/oda/oda/-/blob/main/internal/pkg/abakus/iterators.go?ref_type=heads#L143
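
The linked iterator uses that add-and-subtract trick; as an alternative, a minimal Rust sketch assuming the chrono crate avoids the drift by always measuring whole calendar months from the original anchor instead of from the previous window's end:

    use chrono::{DateTime, Months, Utc};

    // End of the n-th monthly window, always measured from the original
    // anchor so that 28/29/30/31-day months never accumulate drift; each
    // window ends on the anchor's day of the month (clamped at month end).
    fn nth_window_end(anchor: DateTime<Utc>, n: u32) -> DateTime<Utc> {
        anchor
            .checked_add_months(Months::new(n))
            .expect("timestamp out of range")
    }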

Note: I spent a lot of time and code trying to create an "aggregation machine" in ODA, and my impression was that SQL queries could've solved the same need more efficiently with a lot less fuss. We didn't have that option because CQL doesn't do operations like MEAN or MAX.

The PL/SQL triggers from KDVH typically perform COUNT() as well as the other operators, and use the count to decide whether the result is good enough. (As a side note, I think it would be more useful to keep them all and use the count to decide on the fly which ones to give to the user, based on their preferences; see the sketch after the snippet below.)

      SELECT SUM (ff),
             MIN (ff),
             MAX (ff),
             COUNT (ff)
        INTO l_mid, l_min, l_max, l_count  -- the sum lands in l_mid and becomes the mean below
        FROM t_adata
       WHERE stnr = b_stnr
             AND dato BETWEEN TRUNC (b_dato) AND TRUNC (b_dato) + 23 / 24;

      -- Require a complete day (24 hourly values) before publishing the aggregates
      IF (l_count < 24)
      THEN
         l_mid := NULL;
         l_max := NULL;
         l_min := NULL;
      ELSE
         l_mid := l_mid / l_count;
      END IF;
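
Sketching that keep-everything idea in Rust; the struct and the completeness threshold are illustrative assumptions, not code from this PR or from KDVH:

    // Store all aggregates together with the observation count, and let
    // the caller's completeness requirement decide what is shown, instead
    // of NULLing values at write time.
    struct DailyAggregate {
        mean: f64,
        min: f64,
        max: f64,
        count: u32, // number of observations behind the aggregates
    }

    fn visible(agg: &DailyAggregate, required_count: u32) -> bool {
        agg.count >= required_count
    }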

@ketilt (Collaborator) commented Apr 3, 2025

Back in 2019 I wrote this overview of what each of the KDVH triggers do:
https://docs.google.com/document/d/1GUqbrIbXSwOGcSwueTxE2ESulEihQ5qV20Z_07_c7q0/edit?usp=sharing

I also have a 2020 snapshot of the actual PL/SQL files here:
https://drive.google.com/file/d/1b7BkhIHEeHn9eptbRgw5fMENPOIN4L-K/view?usp=drive_link

And this file, which can be opened as a spreadsheet, contains all of the current aggregation products in ODA. It also includes products that are not produced by normal aggregation, but that still output timeseries format.
https://gitlab.met.no/oda/oda/-/blob/main/tools/metproducts/productsfull_add.txt?ref_type=heads

pub async fn aggregation_handler(
State(pool): State<PgConnectionPool>,
// TODO: Should param here be something other than param_id? Perhaps some kind of abstract
// param that represents several specific ones, like temperature instead of TA, TAX, TAM, TAN, etc.
Collaborator:

I haven't fully understood what the code does yet, but I can try to give my two cents on this.

When calculating on the fly, users rarely want all children of a parameter at the same time, so if we calculated all of them at once, most of that effort would usually be wasted. ODA therefore only considers one paramid at a time.

If instead we pregenerate for persistent storage, it might make more sense to follow the chain from the start. This is what KDVH's triggers do: when a new temperature value arrives, they may (re)calculate a whole host of children like TAM, TAN, TAX, MR, SH, VP, GD17, etc. Diurnal, monthly, seasonal, and yearly aggregations may all be affected.

One way or another, we need to keep track of those relations so we know what to invalidate and recalculate.
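
A minimal Rust sketch of such relation tracking; the map entries are just the examples named above, and none of the names are from this PR:

    use std::collections::HashMap;

    // Assumed parent -> children relation: when a new value for the key
    // param arrives, the listed derived params become stale and need
    // recalculation (entries are illustrative).
    fn aggregation_children() -> HashMap<&'static str, Vec<&'static str>> {
        HashMap::from([("TA", vec!["TAM", "TAN", "TAX", "GD17"])])
    }

    fn params_to_invalidate<'a>(
        children: &'a HashMap<&'static str, Vec<&'static str>>,
        new_param: &str,
    ) -> &'a [&'static str] {
        children
            .get(new_param)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }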

})
// Take the one left with the largest resolution
// TODO: Is this right? In the case of a station-aggregated ts it probably is, but if it's
// instantaneous measurements, we probably want the smallest?
@ketilt (Collaborator) commented Apr 9, 2025

I assume this is trying to find the right input data to use for aggregation?

We will remove typeid in the climate filter, so that would resolve some of this issue.

We can still have issues where one output (like diurnal precipitation) could have multiple possible inputs (minute sums, 10min sums, hourly sums, all with unique paramid).

Preferring hourly over higher temporal resolution is usually the way to go in such cases. We don't calculate extremes from instantaneous observations, and, for example, TA, which is used to make TAM, is rarely measured more frequently than hourly, so the potential gain from using higher resolution is small.
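
One possible encoding of that preference in Rust, assuming candidate resolutions are known in minutes; whether the fallback should pick the coarsest or the finest remaining candidate is exactly the open question in the TODO, and the coarsest is chosen here only for illustration:

    // Prefer exactly-hourly input when available; otherwise fall back to
    // the coarsest remaining resolution (one reading of the TODO above).
    fn pick_input_resolution(resolutions_min: &[u32]) -> Option<u32> {
        resolutions_min
            .iter()
            .copied()
            .find(|&r| r == 60)
            .or_else(|| resolutions_min.iter().copied().max())
    }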

)
})?;

// TODO: Rounding of start and end time?
@ketilt (Collaborator) commented Apr 9, 2025

I don't understand this. Is this the start and end of e.g. one 24-hour period to be aggregated?

I usually think of that aggregation interval as (start, end]. So with exact times, not including start but including end.
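
Stated as code, with chrono timestamps assumed:

    use chrono::{DateTime, Utc};

    // Window membership as described: exclusive of start, inclusive of end.
    fn in_window(t: DateTime<Utc>, start: DateTime<Utc>, end: DateTime<Utc>) -> bool {
        start < t && t <= end
    }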

@@ -64,7 +65,16 @@ async fn stations_handler(

     let header = get_timeseries_info(&conn, station_id, param_id)
         .await
-        .map_err(internal_error)?;
+        .map_err(internal_error)?
+        // TODO: find a better way to pick a ts than taking the first
Collaborator:

I would select sensor=0 and (h)level=0. If filtered, that should give a unique ts.

(Later, I'm sure people will want to see aggregated data for other sensor/level combinations as well, so it might be worth being able to define this somehow.)
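
A sketch of that selection rule in Rust; the TsInfo struct and its fields are assumptions about the header metadata, not this PR's actual types:

    // Assumed subset of the timeseries metadata relevant to selection.
    struct TsInfo {
        sensor: i32,
        level: i32,
    }

    // Prefer the default sensor/level pair; fall back to the first entry
    // so the current behaviour is preserved when no default exists.
    fn pick_default(tss: &[TsInfo]) -> Option<&TsInfo> {
        tss.iter()
            .find(|ts| ts.sensor == 0 && ts.level == 0)
            .or_else(|| tss.first())
    }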
