BUG: find_constants helper function is inconsistent #7851

Open
@OriolAbril

Description

Describe the issue:

The find_constants helper in backends/arviz.py returns the elements that should go into the constant_data group, and it includes logic to exclude pm.Data instances that are later used as observed= so they don't end up duplicated in both observed_data and constant_data.

This exclusion doesn't always work, though. With pm.Categorical as the observed distribution, the observed data ends up duplicated in both the observed_data and constant_data groups.

Reproducible code example:

import numpy as np
import pymc as pm

rng = np.random.default_rng(1234)

obs_value = rng.choice(list("ABCDE"), 100, replace=True)

groups, groups_idx = np.unique(obs_value, return_inverse=True)

coords = {
    "obs_idx": np.arange(100),
    "groups": groups,
}

with pm.Model(coords=coords) as model:
    y_pt = pm.Data("y_observed", groups_idx, dims="obs_idx")
    beta = pm.Normal("beta", dims="groups")
    p = pm.Deterministic("p", pm.math.softmax(beta), dims="groups")
    pm.Categorical("y", p=p, observed=y_pt, dims="obs_idx")

with model:
    idata = pm.sample()

np.allclose(idata.constant_data["y_observed"], idata.observed_data["y"])
# True

Error message:

No error is raised; the observed data is silently duplicated (the np.allclose check above returns True).

PyMC version information:

PyMC: 5.23.0 installed from conda.

I'm not sure if this has always happened or if it is something new, as I generally don't register the observed values as pm.Data.

Context for the issue:

This means, for example, that in more complex scenarios where we have both actual constant data and observed data, it is not possible to update every pm.Data container except y_observed to a different length and then call pm.sample_posterior_predictive. Because y_observed is still being pulled into constant_data, the conversion to InferenceData after sampling the posterior predictive fails with incompatible shapes (see the sketch below).
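
A hypothetical sketch of that workflow, just to make the scenario concrete: the extra predictor x, the new values x_new, and the way p depends on x are illustrative additions, not part of the reproducer above.

# hypothetical extension of the reproducer with an extra predictor "x"
with pm.Model(coords=coords) as model_pp:
    x = pm.Data("x", rng.normal(size=100), dims="obs_idx")
    y_pt = pm.Data("y_observed", groups_idx, dims="obs_idx")
    beta = pm.Normal("beta", dims="groups")
    p = pm.Deterministic(
        "p", pm.math.softmax(beta + x[:, None], axis=-1), dims=("obs_idx", "groups")
    )
    pm.Categorical("y", p=p, observed=y_pt, dims="obs_idx")
    idata_pp = pm.sample()

x_new = rng.normal(size=50)
with model_pp:
    # resize every pm.Data container except y_observed for out-of-sample prediction
    pm.set_data({"x": x_new}, coords={"obs_idx": np.arange(50)})
    # y_observed (length 100) is still pulled into constant_data, so converting
    # the posterior predictive samples to InferenceData fails with a shape
    # mismatch against the new obs_idx coordinate (length 50)
    pm.sample_posterior_predictive(idata_pp, predictions=True)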

This seems to come from set(model.rvs_to_values.values()) containing a Cast of y_observed instead of y_observed itself.

# for the model above:
set(model.rvs_to_values.values())
# {Cast{int64}.0, beta}

# whereas for a model with Normal as the observed distribution
with pm.Model(coords={"obs_idx": np.arange(100)}) as model:
    y_pt = pm.Data("y_observed", rng.normal(size=100), dims="obs_idx")
    sigma = pm.Normal("sigma")
    pm.Normal("y", mu=0, sigma=sigma, observed=y_pt)

set(model.rvs_to_values.values())
# {sigma, y_observed}

# another discrete model has the same issue
with pm.Model(coords={"obs_idx": np.arange(100)}) as model:
    y_pt = pm.Data("y_observed", rng.poisson(size=100), dims="obs_idx")
    lam = pm.Normal("lambda")
    pm.Poisson("y", lam, observed=y_pt)

set(model.rvs_to_values.values())
# {Cast{int64}.0, lambda}
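
A possible direction, assuming the stray entry really is an Elemwise Cast wrapping the original pm.Data variable, would be to unwrap such casts before building the exclusion set. This is only a sketch; _unwrap_cast is a hypothetical helper, not the current find_constants implementation.

from pytensor.scalar.basic import Cast
from pytensor.tensor.elemwise import Elemwise

def _unwrap_cast(var):
    # if the value variable is the output of an elemwise Cast, return the
    # cast's input (the underlying pm.Data variable); otherwise return var
    node = var.owner
    if (
        node is not None
        and isinstance(node.op, Elemwise)
        and isinstance(node.op.scalar_op, Cast)
    ):
        return node.inputs[0]
    return var

value_vars = {_unwrap_cast(v) for v in model.rvs_to_values.values()}
# for the Categorical and Poisson models above this set would now contain
# y_observed itself, so the exclusion in find_constants could match it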
