
Aggregations (BasicStatsOTF) #83


Draft · wants to merge 8 commits into trunk

Conversation

@intarga (Member) commented Mar 29, 2025

No description provided.

// TODO: this is a messy hack, but it's the only way people at met currently have to determine
// time_resolution. Ultimately we intend to store time_resolution info in the database under
// public.timeseries or labels.met. This will be populated by a combination of a script that looks
// at a timeseries's history, and manual editing by content managers.
Collaborator:

I agree; the only way to get correct metadata is to analyze the timeseries, since no generalization holds fully. The list below is a rough first guess. There is also a table in StInfoSys called time_resolution_default, which should provide a more fine-grained estimate by defining a resolution per (typeid, paramid).
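
As an illustration of how that table could be used once loaded into memory, here is a minimal Rust sketch; the struct, field names, and per-paramid fallback are assumptions, not StInfoSys's actual schema:

    use std::collections::HashMap;
    use std::time::Duration;

    // Assumed in-memory form of StInfoSys's time_resolution_default:
    // a default resolution per (typeid, paramid), with a per-paramid
    // fallback when no exact pair is registered.
    struct ResolutionDefaults {
        by_pair: HashMap<(i32, i32), Duration>,
        by_param: HashMap<i32, Duration>,
    }

    impl ResolutionDefaults {
        fn lookup(&self, type_id: i32, param_id: i32) -> Option<Duration> {
            self.by_pair
                .get(&(type_id, param_id))
                .or_else(|| self.by_param.get(&param_id))
                .copied()
        }
    }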

@ketilt (Collaborator) commented Mar 31, 2025

Exciting that work is being done on aggregation. Whenever you need it, I can summarize the most important bits from my implementation in ODA to save you guys some time. (I've probably done so before too.)

This struct defines the parameters I have used to specify how an aggregation should be performed. The offsets are important to reproduce, and fortunately our metadata here is exact and consistent, unlike the resolutions. Code values do appear and should not influence aggregations.
https://gitlab.met.no/oda/oda/-/blob/main/internal/pkg/abakus/main.go?ref_type=heads#L39
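
For readers without GitLab access, here is a hypothetical Rust analogue of the kind of spec ketilt describes; the field names and types are guesses based on his description (method, period, offsets), not the actual ODA definition:

    // Hypothetical analogue of ODA's aggregation spec; the authoritative
    // version is the Go struct linked above.
    enum Method {
        Min,
        Max,
        Sum,
        Avg,
    }

    struct AggregationSpec {
        input_param: String,      // e.g. "TA"
        output_param: String,     // e.g. "TAM"
        method: Method,
        period: chrono::Duration, // length of the aggregation window
        offset: chrono::Duration, // window offset relative to midnight UTC
    }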

This gives an overview of all the different aggregation methods we use in our timeseries products:
https://gitlab.met.no/oda/oda/-/blob/main/internal/pkg/abakus/aggregators.go?ref_type=heads

@intarga (Member, Author) commented Apr 3, 2025

@ketilt As discussed on Tuesday, I probably won't come back to this for a month or two yet, but if you have time, providing answers to the TODOs I left in the code would be a huge help.

Avg,
}

// TODO: Just use RelativeDuration?
@ketilt (Collaborator) commented Apr 3, 2025

Is RelativeDuration a way to enable the variable durations of months and years? In ODA I used a parser for ISO 8601 duration strings which dealt with that.

One challenge was to ensure that a duration crossing a month or year boundary used the appropriate length. When this wasn't done right, the aggregations would end on a different day each month.

To achieve this consistently, I actually added and subtracted a short interval to the steps before calculating where to go next. This ensured that the relative duration from the starting point would end up in the right place.
https://gitlab.met.no/oda/oda/-/blob/main/internal/pkg/abakus/iterators.go?ref_type=heads#L143
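
The linked iterator uses that add-and-subtract trick; as an alternative, a minimal Rust sketch assuming the chrono crate avoids the drift by always measuring whole calendar months from the original anchor instead of from the previous window's end:

    use chrono::{DateTime, Months, Utc};

    // End of the n-th monthly window, always measured from the original
    // anchor so that 28/29/30/31-day months never accumulate drift; each
    // window ends on the anchor's day of the month (clamped at month end).
    fn nth_window_end(anchor: DateTime<Utc>, n: u32) -> DateTime<Utc> {
        anchor
            .checked_add_months(Months::new(n))
            .expect("timestamp out of range")
    }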

Note: I spent a lot of time and code trying to create an "aggregation machine" in ODA, and my impression was that SQL queries could've solved the same need more efficiently with a lot less fuss. We didn't have that option because CQL doesn't do operations like MEAN or MAX.

The PL/SQL triggers from KDVH typically perform COUNT() as well as the other operators, and use the count to decide whether the result is good enough. (As a side note, I think it would be more useful to keep them all and use the count to decide on the fly which ones to give to the user, based on their preferences; see the sketch after the snippet below.)

      SELECT SUM (ff),
             MIN (ff),
             MAX (ff),
             COUNT (ff)
        INTO l_mid, l_min, l_max, l_count  -- the sum lands in l_mid and becomes the mean below
        FROM t_adata
       WHERE stnr = b_stnr
             AND dato BETWEEN TRUNC (b_dato) AND TRUNC (b_dato) + 23 / 24;

      -- Require a complete day (24 hourly values) before publishing the aggregates
      IF (l_count < 24)
      THEN
         l_mid := NULL;
         l_max := NULL;
         l_min := NULL;
      ELSE
         l_mid := l_mid / l_count;
      END IF;
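
Sketching that keep-everything idea in Rust; the struct and the completeness threshold are illustrative assumptions, not code from this PR or from KDVH:

    // Store all aggregates together with the observation count, and let
    // the caller's completeness requirement decide what is shown, instead
    // of NULLing values at write time.
    struct DailyAggregate {
        mean: f64,
        min: f64,
        max: f64,
        count: u32, // number of observations behind the aggregates
    }

    fn visible(agg: &DailyAggregate, required_count: u32) -> bool {
        agg.count >= required_count
    }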

@ketilt (Collaborator) commented Apr 3, 2025

Back in 2019 I wrote this overview of what each of the KDVH triggers do:
https://docs.google.com/document/d/1GUqbrIbXSwOGcSwueTxE2ESulEihQ5qV20Z_07_c7q0/edit?usp=sharing

I also have a 2020 snapshot of the actual PL/SQL files here:
https://drive.google.com/file/d/1b7BkhIHEeHn9eptbRgw5fMENPOIN4L-K/view?usp=drive_link

And this file, which can be opened as a spreadsheet, contains all of the current aggregation products in ODA. It also includes products that are not produced by normal aggregation, but that still output timeseries format.
https://gitlab.met.no/oda/oda/-/blob/main/tools/metproducts/productsfull_add.txt?ref_type=heads

pub async fn aggregation_handler(
State(pool): State<PgConnectionPool>,
// TODO: Should param here be something other than param_id? Perhaps some kind of abstract
// param that represents several specific ones, like temperature instead of TA, TAX, TAM, TAN, etc.
Collaborator:

I haven't fully understood what the code does yet, but I can try to give my two cents on this.

When calculating on the fly, users rarely want all children of a parameter at the same time, so if we calculated all of them at once, most of that effort would usually be wasted. ODA therefore only considers one paramid at a time.

If instead we pregenerate for persistent storage, it might make more sense to follow the chain from the start. This is what KDVH's triggers do: when a new temperature value arrives, they may (re)calculate a whole host of children like TAM, TAN, TAX, MR, SH, VP, GD17, etc. Diurnal, monthly, seasonal, and yearly aggregations may all be affected.

One way or another, we need to keep track of those relations so we know what to invalidate and recalculate.
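
A minimal Rust sketch of such relation tracking; the map entries are just the examples named above, and none of the names are from this PR:

    use std::collections::HashMap;

    // Assumed parent -> children relation: when a new value for the key
    // param arrives, the listed derived params become stale and need
    // recalculation (entries are illustrative).
    fn aggregation_children() -> HashMap<&'static str, Vec<&'static str>> {
        HashMap::from([("TA", vec!["TAM", "TAN", "TAX", "GD17"])])
    }

    fn params_to_invalidate<'a>(
        children: &'a HashMap<&'static str, Vec<&'static str>>,
        new_param: &str,
    ) -> &'a [&'static str] {
        children
            .get(new_param)
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }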

})
// Take the one left with the largest resolution
// TODO: Is this right? In the case of a station-aggregated ts it probably is, but if it's
// instantaneous measurements, we probably want the smallest?
@ketilt (Collaborator) commented Apr 9, 2025

I assume this is trying to find the right input data to use for aggregation?

We will remove typeid in the climate filter, so that would resolve some of this issue.

We can still have issues where one output (like diurnal precipitation) could have multiple possible inputs (minute sums, 10min sums, hourly sums, all with unique paramid).

Preferring hourly over higher temporal resolution is usually the way to go in such cases. We don't calculate extremes from instantaneous observations, and, for example, TA, which is used to make TAM, is rarely measured more frequently than hourly, so the potential gain from using higher resolution is small.
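
One possible encoding of that preference in Rust, assuming candidate resolutions are known in minutes; whether the fallback should pick the coarsest or the finest remaining candidate is exactly the open question in the TODO, and the coarsest is chosen here only for illustration:

    // Prefer exactly-hourly input when available; otherwise fall back to
    // the coarsest remaining resolution (one reading of the TODO above).
    fn pick_input_resolution(resolutions_min: &[u32]) -> Option<u32> {
        resolutions_min
            .iter()
            .copied()
            .find(|&r| r == 60)
            .or_else(|| resolutions_min.iter().copied().max())
    }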

)
})?;

// TODO: Rounding of start and end time?
@ketilt (Collaborator) commented Apr 9, 2025

I don't understand this. Is this the start and end of e.g. one 24-hour period to be aggregated?

I usually think of that aggregation interval as (start, end]. So with exact times, not including start but including end.
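
Stated as code, with chrono timestamps assumed:

    use chrono::{DateTime, Utc};

    // Window membership as described: exclusive of start, inclusive of end.
    fn in_window(t: DateTime<Utc>, start: DateTime<Utc>, end: DateTime<Utc>) -> bool {
        start < t && t <= end
    }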

@@ -64,7 +65,16 @@ async fn stations_handler(

     let header = get_timeseries_info(&conn, station_id, param_id)
         .await
-        .map_err(internal_error)?;
+        .map_err(internal_error)?
+        // TODO: find a better way to pick a ts than taking the first
Collaborator:

I would select sensor=0 and (h)level=0. If filtered, that should give a unique ts.

(Later, I'm sure people will want to see aggregated data for other sensor/level combinations as well, so it might be worth being able to define this somehow.)
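
A sketch of that selection rule in Rust; the TsInfo struct and its fields are assumptions about the header metadata, not this PR's actual types:

    // Assumed subset of the timeseries metadata relevant to selection.
    struct TsInfo {
        sensor: i32,
        level: i32,
    }

    // Prefer the default sensor/level pair; fall back to the first entry
    // so the current behaviour is preserved when no default exists.
    fn pick_default(tss: &[TsInfo]) -> Option<&TsInfo> {
        tss.iter()
            .find(|ts| ts.sensor == 0 && ts.level == 0)
            .or_else(|| tss.first())
    }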
