
Use dill.dumps to generate consistent hashes of Python objects #687

Open
muhlbach opened this issue Oct 23, 2024 · 2 comments
@muhlbach

I'm looking into using dill as a key ingredient for generating consistent hashes of Python objects, especially scikit-learn Pipeline objects.

I would use it like this:

def compute_hash(x) -> str:
    return hashlib.sha256(dill.dumps(x)).hexdigest()

This works well in many cases; however, if I serialize an object and then deserialize it, the hash changes.
Is there a way to use dill to get consistent hashes before and after serialization/deserialization?

Full example below:

import hashlib
import tempfile
from pathlib import Path

import dill
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def compute_hash(x) -> str:
    return hashlib.sha256(dill.dumps(x)).hexdigest()

pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("estimator", LinearRegression()),
    ]
)

N, p = 10, 2
X, y = np.random.rand(N, p), np.random.rand(N)
pipeline.fit(X, y)

hash_original_pipeline = compute_hash(pipeline)

with tempfile.TemporaryDirectory() as temp_dir:
    path = Path(temp_dir) / "pipeline.pkl"
    with open(path, "wb") as file:
        dill.dump(pipeline, file)
    with open(path, "rb") as file:
        pipeline_from_file = dill.load(file)

hash_deserialized_pipeline = compute_hash(pipeline_from_file)

assert hash_original_pipeline == hash_deserialized_pipeline, f"{hash_original_pipeline} != {hash_deserialized_pipeline}"
@mmckerns
Member

I tend to use tools like klepto, which can leverage dill similarly to what I think you want to do:
https://github.com/uqfoundation/klepto/blob/master/klepto/keymaps.py

dill doesn't guarantee a unique hash for each different instance; it only guarantees that each instance can be turned into a unique string and then reverted to an instance whose state is equivalent to the original's.
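As a rough, untested sketch (the exact constructor options live in the keymaps module linked above, and the 'sha256' algorithm name here is my assumption), combining a picklemap that serializes with dill and a hashmap that hashes the result would look something like this:

from klepto.keymaps import hashmap, picklemap

# Sketch only: dill-serialize the object, then hash the serialized string.
# The keyword arguments are assumptions -- see klepto/keymaps.py for the real options.
serialize = picklemap(serializer='dill')
digest = hashmap(algorithm='sha256')

key = digest(serialize(pipeline))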

@muhlbach
Author

@mmckerns thanks, that is helpful. Would you be able to share a minimal example of how you would use klepto to create a consistent hash in this case?

Also, instead of using dill as highlighted above, I have found the approach below to be consistent across many cases. Other than being a bit slow, do you see any issues with it?

import hashlib
from typing import Any

import dill

def compute_hash(x: Any) -> str:
    # Round-trip through dill (dumps -> loads -> dumps) to normalize the
    # object's state before hashing the serialized bytes.
    return hashlib.sha256(
        string=dill.dumps(
            obj=dill.loads(dill.dumps(x)),
            protocol=None,
            byref=False,
            fmode=None,
            recurse=None,
        ),
        usedforsecurity=True,
    ).hexdigest()
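Using the pipeline and pipeline_from_file objects from the full example above, this version has given me matching hashes (in the cases I have tried, at least):

# Repeat the round-trip check from the full example, now with the normalized compute_hash.
hash_original = compute_hash(pipeline)
hash_roundtrip = compute_hash(pipeline_from_file)
assert hash_original == hash_roundtrip, f"{hash_original} != {hash_roundtrip}"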
