
Use dill.dumps to generate consistent hashes of Python objects #687

Open
muhlbach opened this issue Oct 23, 2024 · 2 comments
@muhlbach

I'm looking into using dill as a key ingredient for generating consistent hashes of Python objects, especially scikit-learn Pipeline objects.

I would use it like this:

def compute_hash(x) -> str:
    return hashlib.sha256(dill.dumps(x)).hexdigest()

This works well in many cases; however, if I serialize an object and then deserialize it, the hash changes.
Is there a way to use dill to get consistent hashes before and after serialization/deserialization?

Full example below:

import hashlib
import tempfile
from pathlib import Path

import dill
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def compute_hash(x) -> str:
    return hashlib.sha256(dill.dumps(x)).hexdigest()

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("estimator", LinearRegression()),
    ]
)

N, p = 10, 2
X, y = np.random.rand(N, p), np.random.rand(N)
pipeline.fit(X, y)

hash_original_pipeline = compute_hash(pipeline)

with tempfile.TemporaryDirectory() as temp_dir:
    path = Path(temp_dir) / "pipeline.pkl"
    with open(path, "wb") as file:
        dill.dump(pipeline, file)
    with open(path, "rb") as file:
        pipeline_from_file = dill.load(file)

hash_deserialized_pipeline = compute_hash(pipeline_from_file)

assert hash_original_pipeline == hash_deserialized_pipeline, f"{hash_original_pipeline} != {hash_deserialized_pipeline}"
@mmckerns
Member

I tend to use tools like klepto, which can leverage dill similarly to what I think you want to do:
https://github.com/uqfoundation/klepto/blob/master/klepto/keymaps.py

dill doesn't guarantee a unique hash for each different instance; it only guarantees that each instance can be turned into a unique string and then reverted to an instance whose state is equivalent to the original's.
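As a rough, untested sketch (the exact constructor options live in the keymaps module linked above, and the 'sha256' algorithm name here is my assumption), combining a picklemap that serializes with dill and a hashmap that hashes the result would look something like this:

from klepto.keymaps import hashmap, picklemap

# Sketch only: dill-serialize the object, then hash the serialized string.
# The keyword arguments are assumptions -- see klepto/keymaps.py for the real options.
serialize = picklemap(serializer='dill')
digest = hashmap(algorithm='sha256')

key = digest(serialize(pipeline))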

@muhlbach
Author

@mmckerns thanks, that is helpful. Would you be able to share a minimal example of how you would use klepto to create a consistent hash in this case?

Also, instead of using dill as highlighted above, I have found the approach below to be consistent across many cases. Other than being a bit slow, do you see any issues with it?

import hashlib
from typing import Any

import dill

def compute_hash(x: Any) -> str:
    # Round-trip through dill (dumps -> loads -> dumps) to normalize the
    # object's state before hashing the serialized bytes.
    return hashlib.sha256(
        string=dill.dumps(
            obj=dill.loads(dill.dumps(x)),
            protocol=None,
            byref=False,
            fmode=None,
            recurse=None,
        ),
        usedforsecurity=True,
    ).hexdigest()
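Using the pipeline and pipeline_from_file objects from the full example above, this version has given me matching hashes (in the cases I have tried, at least):

# Repeat the round-trip check from the full example, now with the normalized compute_hash.
hash_original = compute_hash(pipeline)
hash_roundtrip = compute_hash(pipeline_from_file)
assert hash_original == hash_roundtrip, f"{hash_original} != {hash_roundtrip}"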
