Merged
110 changes: 89 additions & 21 deletions docs/source/user_guide/infilling.rst
@@ -77,21 +77,33 @@ In more detail
==============

The :meth:`~time_stream.TimeFrame.infill` method is the entry point for infilling your
timeseries data in **Time-Stream**. It delegates to well established methods from the `SciPy data science library
<https://docs.scipy.org/doc/scipy/reference/interpolate.html>`_, combined with the time-integrity of your **TimeFrame**.
timeseries data in **Time-Stream**. Various infill methods are available, from using alternative data from
another source to delegating to well-established methods from the `SciPy data science library
<https://docs.scipy.org/doc/scipy/reference/interpolate.html>`_. All methods are combined with the time-integrity
of your **TimeFrame**.

Let's look at the method in more detail:

.. automethod:: time_stream.TimeFrame.infill

Infill methods
--------------

Choose how missing values are estimated by passing a method name as a string. Each method has its strengths,
depending on your data.
The ``infill_method`` parameter lets you choose how missing values are estimated by passing a method name as a string.
Each method has its strengths, depending on your data. The currently available methods are:

Simple infilling techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``"alt_data"`` - **infill using data from an alternative source.**

Either another column in your TimeFrame, or data from a different DataFrame entirely.

Polynomial interpolation
~~~~~~~~~~~~~~~~~~~~~~~~

- ``"linear"`` - **straight-line interpolation between neighbouring points.**

Simple and neutral; best for very short gaps (1–2 steps).
Simple and neutral; best for short gaps.

- ``"quadratic"`` - **second-order polynomial curve.**

@@ -103,7 +115,7 @@ Polynomial interpolation

- ``"bspline"`` - **B-spline interpolation (configurable order).**

Flexible piecewise polynomials; user decides.*
Flexible piecewise polynomials; the user chooses the order.

Shape-preserving methods
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -123,25 +135,21 @@ Shape-preserving methods

.. note::

NaN values at the very beginning and very end of a timeseries will remain NaN; there is no pre- or post- data to
constrain the infilling method.

Column selection
----------------

Specify which column to infill; only this column will be used by the infill function.

For infill methods using interpolation techniques, NaN values at the very beginning and very end of a timeseries
will remain NaN; there is no pre- or post- data to constrain the infilling method.
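
To illustrate why leading and trailing gaps stay empty, here is a minimal pure-Python sketch of linear interpolation between neighbouring known points. It is not the library's implementation (which delegates to SciPy); ``linear_infill`` is a hypothetical helper, and ``None`` stands in for NaN.

```python
def linear_infill(values):
    """Fill interior gaps by linear interpolation (hypothetical helper, not the library API).

    Gaps before the first or after the last known value are left as None,
    because there is no point on one side to constrain the line.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    # Walk over each pair of consecutive known points and fill what lies between them.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out


print(linear_infill([None, 1.0, None, 3.0, None]))  # [None, 1.0, 2.0, 3.0, None]
```

Only the interior gap is bracketed by known values on both sides, so it is the only one filled.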

Column selection
----------------

Specify which column to infill; only this column will be used by the infill function.
The ``column_name`` parameter lets you specify which column to infill; only this column will be used by the infill
function.


Observation interval
--------------------

Specify an observation interval to restrict infilling to a **specific time window**. This is useful when:
Use the ``observation_interval`` parameter to restrict infilling to a **specific time window**. This is useful when:

- You only want to work with a subset of data (e.g. one hydrological year).
- You want to fill recent gaps without touching the historical record.
@@ -164,7 +172,7 @@ This will only attempt infilling **between January and December 2024**; gaps out
Max gap size
------------

Use the maximum gap size to prevent **over-eager interpolation**. Only gaps less than this
Use the ``max_gap_size`` parameter to prevent **over-eager interpolation**. Only gaps less than this
(measured in consecutive missing **steps**) will be infilled.

Example:
@@ -183,11 +191,71 @@ Example:
At 15-minute resolution, ``max_gap_size=2`` = 30 minutes; at daily resolution,
``max_gap_size=2`` = 2 days.
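
The relationship between steps and wall-clock time can be sketched in plain Python. The helpers below are hypothetical and are not part of Time-Stream: ``max_gap_duration`` converts a step count into a duration, and ``gap_lengths`` counts runs of consecutive missing steps (``None`` stands in for NaN).

```python
from datetime import timedelta
from itertools import groupby


def max_gap_duration(max_gap_size: int, resolution: timedelta) -> timedelta:
    """Duration spanned by `max_gap_size` steps at the given resolution."""
    return max_gap_size * resolution


def gap_lengths(values) -> list[int]:
    """Length, in steps, of each run of consecutive missing values."""
    return [
        sum(1 for _ in run)
        for is_missing, run in groupby(values, key=lambda v: v is None)
        if is_missing
    ]


print(max_gap_duration(2, timedelta(minutes=15)))  # 0:30:00
print(max_gap_duration(2, timedelta(days=1)))      # 2 days, 0:00:00
print(gap_lengths([1.0, None, 3.0, None, None, None, None, 8.0]))  # [1, 4]
```

Comparing each run length with ``max_gap_size`` then decides which gaps are candidates for infilling.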

Visualisation of methods
========================
Examples
========

Alternative data infilling
--------------------------

The ``"alt_data"`` infill method allows you to fill missing values in a column using data from an alternative source.

You can specify the alternative data in two ways:

1. **From a column within the same TimeFrame**: If the alternative data is already present as a column in your
current :class:`~time_stream.TimeFrame` object, you can directly reference it.
2. **From a separate DataFrame**: You can provide an entirely separate
Polars DataFrame containing the alternative data.

In both cases, you can also apply a ``correction_factor`` to the alternative data before it's used for infilling.

Infilling from a separate DataFrame
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's say you have a primary dataset with missing "flow" values, and a separate ``alt_df`` with an "alt_flow"
column that can be used to infill these gaps.

**Input:**

.. tab-set::
:class: outline padded-tabs

.. tab-item:: Main Data

.. jupyter-execute::
:hide-code:

import examples_infilling
ts = examples_infilling.alt_data_main()

.. tab-item:: Alternative Data

.. jupyter-execute::
:hide-code:

import examples_infilling
ts = examples_infilling.alt_data_alt()

**Code:**

.. literalinclude:: ../../../src/time_stream/examples/examples_infilling.py
:language: python
:start-after: [start_block_2]
:end-before: [end_block_2]
:dedent:

**Output:**

.. jupyter-execute::
:hide-code:

import examples_infilling
ts = examples_infilling.alt_data_infill()

Visualisation of interpolation methods
======================================

A quick visualisation of the results from the different infill methods is sometimes useful. However, bear in mind
that this is a very simplistic example and the correct method to use is dependent on your data.
A quick visualisation of the results from the different interpolation infill methods is sometimes useful. However,
bear in mind that this is a very simplistic example and the correct method depends on your data.
You should do your research into which is most appropriate.

.. jupyter-execute::
28 changes: 28 additions & 0 deletions src/time_stream/examples/examples_infilling.py
@@ -61,6 +61,34 @@ def create_simple_time_series_with_gaps() -> ts.TimeFrame:
return tf


def alt_data_main() -> pl.DataFrame:
    with suppress_output():
        df = get_example_df(library="polars")
    print(df)
    return df


def alt_data_alt() -> pl.DataFrame:
    with suppress_output():
        alt_df = get_example_df(library="polars", complete=True)
        alt_df = alt_df.with_columns(pl.col("flow").mul(1.25).alias("alt_flow")).drop("flow")
    print(alt_df)
    return alt_df


def alt_data_infill() -> None:
    with suppress_output():
        df = alt_data_main()
        alt_df = alt_data_alt()

    tf = ts.TimeFrame(df, "time", resolution="PT15M", periodicity="PT15M")

    # [start_block_2]
    tf_infill = tf.infill("alt_data", "flow", alt_df=alt_df, correction_factor=0.75, alt_data_column="alt_flow")
    # [end_block_2]
    print(tf_infill.df)

def all_infills() -> pl.DataFrame:
with suppress_output():
tf = create_simple_time_series_with_gaps()
15 changes: 9 additions & 6 deletions src/time_stream/examples/utils.py
@@ -19,19 +19,22 @@ def suppress_output() -> Iterator:
sys.stdout = original_stdout


def get_example_df(library: str = "polars") -> pd.DataFrame:
def get_example_df(library: str = "polars", complete: bool = False) -> pd.DataFrame:
# Create sample data: 15-minute intervals from 2020-09-01 to 2023-11-01 with random flow data
np.random.seed(31)
date_range = pd.date_range(start="2020-09-01", end="2023-11-01", freq="15min")
flow_data = np.random.uniform(10, 100, len(date_range)) + np.sin(np.arange(len(date_range)) * 0.01) * 20
flow_data[[1, 3, 4, 5, 6, -2, -3]] = np.nan
flow_data = np.random.uniform(90, 100, len(date_range)) + np.sin(np.arange(len(date_range)) * 0.01) * 20

# Create input dataframe
df = pd.DataFrame({"time": date_range, "flow": flow_data})

# Add some NaN values to simulate incomplete data
mask = np.random.random(len(df)) > 0.95
df.loc[mask, "flow"] = np.nan
if not complete:
# Add some NaN values to simulate incomplete data
mask = np.random.random(len(df)) > 0.95
df.loc[mask, "flow"] = np.nan
# Target some specific indexes so we can see them on examples
df.iloc[[1, 3, 4, 5, 6, -2, -3], df.columns.get_loc("flow")] = np.nan

if library == "polars":
df = pl.DataFrame(df)
else:
68 changes: 65 additions & 3 deletions src/time_stream/infill.py
@@ -44,12 +44,13 @@ def _infilled_column_name(self, infill_column: str) -> str:
return f"{infill_column}_{self.name}"

@abstractmethod
def _fill(self, df: pl.DataFrame, infill_column: str) -> pl.DataFrame:
def _fill(self, df: pl.DataFrame, infill_column: str, ctx: InfillCtx) -> pl.DataFrame:
"""Return the Polars dataframe containing infilled data.

Args:
df: The DataFrame to infill.
infill_column: The column to infill.
ctx: The infill context.

Returns:
pl.DataFrame with infilled values
@@ -119,7 +120,7 @@ def execute(self) -> pl.DataFrame:
return self.ctx.df

# Apply the specific infill logic from the child class
df_infilled = self.infill_method._fill(df, self.column)
df_infilled = self.infill_method._fill(df, self.column, self.ctx)
infilled_column = self.infill_method._infilled_column_name(self.column)

# Limit the infilled data to where the infill mask is True
@@ -215,7 +216,7 @@ def min_points_required(self) -> int:
"""Minimum number of data points required for this interpolation method."""
pass

def _fill(self, df: pl.DataFrame, infill_column: str) -> pl.DataFrame:
def _fill(self, df: pl.DataFrame, infill_column: str, ctx: InfillCtx) -> pl.DataFrame:
"""Apply scipy interpolation to fill missing values in the specified column.

This method handles the common scipy interpolation workflow:
@@ -229,6 +230,7 @@ def _fill(self, df: pl.DataFrame, infill_column: str) -> pl.DataFrame:
Args:
df: The DataFrame to infill.
infill_column: The column to infill.
ctx: The infill context.

Returns:
pl.DataFrame with infilled values
@@ -356,3 +358,63 @@ class PchipInterpolation(ScipyInterpolation):
def _create_interpolator(self, x_valid: np.ndarray, y_valid: np.ndarray) -> Any:
"""Create scipy PCHIP interpolator."""
return PchipInterpolator(x_valid, y_valid, **self.scipy_kwargs)


@InfillMethod.register
class AltData(InfillMethod):
    """Infill from an alternative data source, with optional correction factor."""

    name = "alt_data"

    def __init__(self, alt_data_column: str, correction_factor: float = 1.0, alt_df: pl.DataFrame | None = None):
        """Initialize the alternative data infill method.

        Args:
            alt_data_column: The name of the column providing the alternative data.
            correction_factor: An optional correction factor to apply to the alternative data.
            alt_df: The DataFrame containing the alternative data.
        """
        self.alt_data_column = alt_data_column
        self.correction_factor = correction_factor
        self.alt_df = alt_df

    def _fill(self, df: pl.DataFrame, infill_column: str, ctx: InfillCtx) -> pl.DataFrame:
        """Fill missing values using data from the alternative column.

        Args:
            df: The DataFrame to infill.
            infill_column: The column to infill.
            ctx: The infill context.

        Returns:
            pl.DataFrame with infilled values.
        """
        if self.alt_df is None:
            check_columns_in_dataframe(df, [self.alt_data_column])
            alt_data_column_name = self.alt_data_column
        else:
            time_column_name = ctx.time_name
            check_columns_in_dataframe(self.alt_df, [time_column_name, self.alt_data_column])
            alt_data_column_name = f"__ALT_DATA__{self.alt_data_column}"
            alt_df = self.alt_df.select([time_column_name, self.alt_data_column]).rename(
                {self.alt_data_column: alt_data_column_name}
            )

            df = df.join(
                alt_df,
                on=time_column_name,
                how="left",
                suffix="_alt",
            )

        infilled = df.with_columns(
            pl.when(pl.col(infill_column).is_null())
            .then(pl.col(alt_data_column_name) * self.correction_factor)
            .otherwise(pl.col(infill_column))
            .alias(self._infilled_column_name(infill_column))
        )

        if self.alt_df is not None:
            infilled = infilled.drop(alt_data_column_name)

        return infilled
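
For readers who want the core of ``AltData._fill`` without Polars, the logic can be sketched in pure Python: match rows on the time column (the left join), then, where the main value is missing, take the alternative value multiplied by the correction factor. This is an illustrative sketch only; ``infill_from_alt`` and the sample values are hypothetical, and ``None`` stands in for a null.

```python
from datetime import datetime, timedelta


def infill_from_alt(main, alt, correction_factor=1.0):
    """Fill missing values in `main` from `alt`, matched on timestamp.

    Both arguments map timestamps to values. Timestamps with no match in
    `alt` stay missing, mirroring the behaviour of a left join.
    """
    filled = {}
    for timestamp, value in main.items():
        if value is not None:
            # Existing observations are never overwritten.
            filled[timestamp] = value
            continue
        alt_value = alt.get(timestamp)
        filled[timestamp] = None if alt_value is None else alt_value * correction_factor
    return filled


start = datetime(2020, 9, 1)
times = [start + i * timedelta(minutes=15) for i in range(3)]
main = dict(zip(times, [95.0, None, 97.0]))
alt = dict(zip(times, [118.75, 120.0, 121.25]))

result = infill_from_alt(main, alt, correction_factor=0.75)
print(list(result.values()))  # [95.0, 90.0, 97.0]
```

Unlike this sketch, the Polars version writes the result to a new column named by ``_infilled_column_name`` rather than returning a replacement series. In the documentation example, ``alt_flow`` is 1.25 times the complete flow and the correction factor is 0.75, so infilled values come out at 1.25 × 0.75 = 0.9375 of the underlying flow.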