Symbolic imputation for missing data #951

gm89uk · 2025-06-02T21:37:15Z

Hi Miles (and the PySR/SymbolicRegression.jl community),
Considering the versatility of template expressions, perhaps 'symbolic data imputation' for missing data could now be feasible.

What if an expression spec could define a structure that simultaneously evolves imputation expressions for missing data and the remaining functions? E.g.:
Let's say we want to evolve f(x1,x2,x3,x4) but some x1 values are missing.
- There could be an imputation sub-expression, e.g. x1.na = f_sr_impute(x2,x3,x4), which would evolve an expression using all rows with full data, with x1 as the target for that function. The now-completed x1 would then be used as normal in the remaining code, e.g. return f(x1,x2,x3,x4), where x1 now holds both the observed and the imputed values in a new ValidVector.

By co-evolving the imputation and the main functions, the imputation function would hopefully overfit less and stay simple, by the nature of SR and the loss between pred and y. We'd essentially get imputation logic tailored and simplified for the end task. Perhaps SR would choose an optimised constant as a stand-in for missing data, but it could also leverage any relationship between the variables.
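To make the idea concrete, here is a minimal sketch in plain Python of one reading of it: the imputation expression is never scored on its own, only through the final loss between pred and y. The names `MISSING`, `fill_x1` and `joint_loss` are made up for illustration, and the lambdas stand in for expressions that SR would evolve. None of this is PySR or SymbolicRegression.jl API.

```python
MISSING = None  # hypothetical marker for a missing x1 value


def fill_x1(rows, impute):
    """Return x1 for every row, calling impute(x2, x3, x4) only where x1 is missing."""
    return [impute(x2, x3, x4) if x1 is MISSING else x1
            for x1, x2, x3, x4 in rows]


def joint_loss(rows, y, impute, f):
    """Mean squared error between y and f(x1_filled, x2, x3, x4).

    The imputation candidate is judged only through this end-task loss;
    y is never passed to impute, so no target information leaks into it.
    """
    x1_filled = fill_x1(rows, impute)
    preds = [f(x1, x2, x3, x4)
             for x1, (_, x2, x3, x4) in zip(x1_filled, rows)]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


# Toy data where x1 == x2 and y == 2*x1 + x3; the second row is missing x1.
rows = [(1.0, 1.0, 0.0, 0.0), (MISSING, 2.0, 0.0, 0.0), (3.0, 3.0, 0.0, 0.0)]
y = [2.0, 4.0, 6.0]

main_f = lambda x1, x2, x3, x4: 2 * x1 + x3
impute_relation = lambda x2, x3, x4: x2   # uses a relationship between variables
impute_constant = lambda x2, x3, x4: 0.0  # a constant stand-in

loss_relation = joint_loss(rows, y, impute_relation, main_f)
loss_constant = joint_loss(rows, y, impute_constant, main_f)
```

On this toy data the relation-based candidate gets a lower end-task loss than the constant one, which is the pressure that would steer the search towards useful imputation logic.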

Obviously this would significantly increase computation, and perhaps a two-step approach is more practical, which is what I've been doing so far.
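For comparison, a two-step approach could look roughly like the sketch below: first fit an imputation model on the complete rows only, then fill the gaps and run the main search on the completed data. A one-variable least-squares line stands in for the first SR run here purely to keep the example self-contained; `fit_line` and `two_step_fill` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_step_fill(x1, x2, missing=None):
    """Step 1: fit x1 from x2 on complete rows. Step 2: fill the missing x1 values."""
    complete = [(b, a) for a, b in zip(x1, x2) if a is not missing]
    slope, intercept = fit_line([b for b, _ in complete],
                                [a for _, a in complete])
    return [slope * b + intercept if a is missing else a
            for a, b in zip(x1, x2)]


# Toy data where x1 == 2 * x2; the second x1 value is missing.
x1 = [2.0, None, 6.0, 8.0]
x2 = [1.0, 2.0, 3.0, 4.0]
x1_filled = two_step_fill(x1, x2)
```

The completed `x1_filled` would then be passed to the main regression as an ordinary feature, so the imputation is fixed before the main search starts rather than being refined alongside it.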

I've specifically avoided using y to predict missing x1 values because I want to avoid data leakage and to have a model that handles real-life cases of missing data where y is not available (e.g. some older equipment doesn't measure x1 but you want your model to be compatible with it, a case where other imputation techniques like MICE would be less useful).

MilesCranmer · 2025-06-18T17:36:15Z

Maintainer

Cool idea. I think it should be easy to set up by making a template expression, and in the combiner function, have lines that replace the missing data with imputed data.

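magic_number = -1000  # Sentinel: missing entries are pre-filled with exactly -1000.0

# Impute x1 from the other features
x1_imput = f_x1(x2, x3, x4)
# Keep observed values; use the imputed value wherever the sentinel appears
x1_fixed = ValidVector(
    map(i -> x1.x[i] == magic_number ? x1_imput.x[i] : x1.x[i], eachindex(x1.x, x1_imput.x)),
    x1.valid && x1_imput.valid
)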

x2_imput = f_x2(x1_fixed, x3, x4)

#= ... =#

f(x1_fixed, x2_fixed, x3_fixed, x4_fixed)  # With imputed data applied

3 replies

MilesCranmer Jun 18, 2025
Maintainer

Hm... Although it is tricky because now f_x1 has missing data in the input! So might need to rely on multiple stages of imputation, or features that you always have available when all others are missing?
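One way to picture the multi-stage option is the plain-Python sketch below. It assumes x3 and x4 are always observed (an assumption for illustration, not something stated in this thread), so the first stage imputes x1 from those alone and the second stage can then safely use the repaired x1. `select` and `staged_impute` are hypothetical names, and the lambdas stand in for evolved sub-expressions.

```python
MISSING = -1000.0  # the sentinel value used in the snippet above


def select(observed, imputed):
    """Keep observed values; take the imputed one wherever the sentinel appears."""
    return [i if o == MISSING else o for o, i in zip(observed, imputed)]


def staged_impute(x1, x2, x3, x4, f_x1, f_x2):
    # Stage 1: impute x1 using only features assumed to be always present.
    x1_fixed = select(x1, [f_x1(c, d) for c, d in zip(x3, x4)])
    # Stage 2: impute x2; x1_fixed no longer contains any sentinel values.
    x2_fixed = select(x2, [f_x2(a, c, d)
                           for a, c, d in zip(x1_fixed, x3, x4)])
    return x1_fixed, x2_fixed


# The second row is missing both x1 and x2.
x1 = [1.0, MISSING]
x2 = [MISSING, MISSING]
x3 = [1.0, 2.0]
x4 = [0.0, 0.0]

x1_fixed, x2_fixed = staged_impute(
    x1, x2, x3, x4,
    f_x1=lambda c, d: c + d,
    f_x2=lambda a, c, d: 2 * a,
)
```

The ordering matters: if stage 2 read the raw x1 instead of `x1_fixed`, the sentinel would flow straight into `f_x2` for the second row.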

gm89uk Jun 19, 2025
Author

Thanks Miles!

That worked well, although convergence is slow. With multiple variables missing, I think traditional imputation techniques are probably more efficient, since with multi-tiered imputation the residual error propagates through the model (although this is probably a feature if the algorithm is left to run long enough).

Still, it's great to see this is possible as is with a template spec!

gm89uk Jun 20, 2025
Author

Having tried it a bit with one variable with missing data, it actually works great! Sometimes f(...) will ignore x1_fixed after many iterations, but in most runs it does find a good imputation for it. It's very cool to see it start with a constant that's in the right ballpark and then refine it.

I think standardising the data using the complete rows will help a lot here.
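A rough sketch of that standardisation step in plain Python: the mean and standard deviation of each column are computed from the complete rows only, then applied to every observed value, while missing entries are left untouched. `standardise_on_complete_rows` is a hypothetical helper, and `None` is used as the missing marker here rather than the -1000 sentinel.

```python
from statistics import mean, pstdev


def standardise_on_complete_rows(columns, missing=None):
    """Z-score each column using statistics from rows with no missing values."""
    n_rows = len(columns[0])
    complete = [i for i in range(n_rows)
                if all(col[i] is not missing for col in columns)]
    scaled = []
    for col in columns:
        values = [col[i] for i in complete]
        mu = mean(values)
        sigma = pstdev(values) or 1.0  # guard against constant columns
        scaled.append([v if v is missing else (v - mu) / sigma for v in col])
    return scaled


# x1 has one missing value; x2 is fully observed.
x1 = [1.0, None, 3.0, 5.0]
x2 = [10.0, 20.0, 30.0, 50.0]
x1_std, x2_std = standardise_on_complete_rows([x1, x2])
```

If a numeric sentinel is written back in afterwards, it needs to sit well outside the standardised range so that it cannot collide with a real value.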
