compute_log_likelihood for large datasets #6864

@danjenson

Description

Describe the issue:

Process memory grows steadily while computing the log likelihood, until it consumes all available memory (and swap). Reproduced on both Linux and an M1 Mac.

Linux system:

Void Linux
Kernel 6.3.12_1
64 GB DDR5 RAM (64 GB swap)
24 GB RTX 4090 GPU
AMD Ryzen 9 7950X, 16 cores / 32 threads

Mac System:

16 GB memory
8 cores

Dataset: ~161 MB total.

Reproducible code example:

#!/usr/bin/env python3
import numpy as np
import pandas as pd
import pymc as pm


def pymc_bayes(df: pd.DataFrame):
    a, b, c, i = df.a.values, df.b.values, df.c.values, df.i.values
    n_i = int(i.max() + 1)
    with pm.Model() as m:
        alpha = pm.Normal("alpha", 0, 1, shape=[n_i])
        beta_b = pm.HalfNormal("beta_b", 1)
        beta_c = pm.HalfNormal("beta_c", 1)
        beta_int = pm.Normal("beta_int", 0, 1)
        mu = alpha[i] + beta_b * b + beta_c * c + beta_int * b * c
        sigma = pm.Exponential("sigma", 1)
        a_hat = pm.Normal("a_hat", mu, sigma, observed=a)
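        # idata_kwargs={"log_likelihood": True} stores the pointwise log
        # likelihood for every chain, draw, and observation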
        idata = pm.sample(mp_ctx="spawn", idata_kwargs={"log_likelihood": True})
        idata.to_netcdf("pymc_bayes.nc")
    print("finished!")


if __name__ == "__main__":
    n, n_int = 2618018, 17  # to match the real dataset I care about
    df = pd.DataFrame(np.random.randn(n, 3), columns=["a", "b", "c"])
    df["i"] = np.random.randint(0, n_int, size=n)
    pymc_bayes(df)

Error message:

Killed by the OS (out of memory).
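A rough size check explains the kill: the pointwise log-likelihood array is laid out as (chains, draws, observations) in float64, so assuming the default 4 chains of 1,000 draws (the script does not override them), the array alone exceeds the 64 GB of RAM:

n_obs, chains, draws = 2_618_018, 4, 1_000
gb = n_obs * chains * draws * 8 / 1e9  # 8 bytes per float64 value
print(f"pointwise log_likelihood: ~{gb:.0f} GB")  # ~84 GB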

PyMC version information:

PyMC version: 5.7.2

Context for the issue:

Trying to use the stored log likelihood with arviz.compare(...); a lower-memory sketch follows.
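For reference, a possible lower-memory sketch of the same workflow: sample without storing the pointwise log likelihood, thin the posterior, then add the log_likelihood group afterwards with pm.compute_log_likelihood (available in PyMC 5.x). The thinning factor and the second model name in az.compare are illustrative assumptions, not part of the original report.

#!/usr/bin/env python3
import numpy as np
import pymc as pm

n, n_int = 2_618_018, 17
a, b, c = np.random.randn(3, n)
i = np.random.randint(0, n_int, size=n)

with pm.Model() as m:
    alpha = pm.Normal("alpha", 0, 1, shape=[n_int])
    beta_b = pm.HalfNormal("beta_b", 1)
    beta_c = pm.HalfNormal("beta_c", 1)
    beta_int = pm.Normal("beta_int", 0, 1)
    mu = alpha[i] + beta_b * b + beta_c * c + beta_int * b * c
    sigma = pm.Exponential("sigma", 1)
    pm.Normal("a_hat", mu, sigma, observed=a)
    # sample WITHOUT idata_kwargs={"log_likelihood": True}
    idata = pm.sample(mp_ctx="spawn")

# keep every 10th draw: the pointwise array shrinks from ~84 GB to ~8.4 GB
idata_thin = idata.sel(draw=slice(None, None, 10))
pm.compute_log_likelihood(idata_thin, model=m)

# az.loo / az.compare read the log_likelihood group; "idata_other" is a
# hypothetical second model fitted the same way
# import arviz as az
# comparison = az.compare({"interaction": idata_thin, "other": idata_other})

Thinning reduces the number of draws behind the PSIS-LOO estimates, so the az.compare results get somewhat noisier; whether that trade-off is acceptable depends on the application.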
