This works well in many cases. However, if I serialize an object and deserialize it again, the hash changes.
Is there a way to use dill to get consistent hashes before and after serialization/deserialization?
Full example below:
import hashlib
import tempfile
from pathlib import Path

import dill
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def compute_hash(x) -> str:
    # Hash the byte stream that dill produces for the object.
    return hashlib.sha256(dill.dumps(x)).hexdigest()


pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("estimator", LinearRegression()),
    ]
)

N, p = 10, 2
X, y = np.random.rand(N, p), np.random.rand(N)
pipeline.fit(X, y)
hash_original_pipeline = compute_hash(pipeline)

# Round-trip the fitted pipeline through a file and hash it again.
with tempfile.TemporaryDirectory() as temp_dir:
    path = Path(temp_dir) / "pipeline.pkl"
    with open(path, "wb") as file:
        dill.dump(pipeline, file)
    with open(path, "rb") as file:
        pipeline_from_file = dill.load(file)

hash_deserialized_pipeline = compute_hash(pipeline_from_file)
assert hash_original_pipeline == hash_deserialized_pipeline, f"{hash_original_pipeline} != {hash_deserialized_pipeline}"
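
For anyone debugging a mismatch like this, here is a rough sketch of how one might locate where two byte streams diverge. It uses only the standard library (`pickle`, `pickletools`) and assumes the stream being inspected is a regular pickle stream; I have not checked it against every dill output. The helper names `disassemble` and `first_divergence` are made up for this illustration. The toy data shows one way equal objects can yield different bytes: pickle memoizes by object identity, so a list that appears twice is written differently from two equal copies. That may or may not be what happens with the pipeline above.

```python
import io
import itertools
import pickle
import pickletools


def disassemble(data: bytes) -> list:
    """Return the opcode listing of a pickle byte stream, one text line per opcode."""
    out = io.StringIO()
    pickletools.dis(data, out=out)
    return out.getvalue().splitlines()


def first_divergence(a: bytes, b: bytes):
    """Return the first pair of opcode lines that differ, or None if the listings match."""
    pairs = itertools.zip_longest(disassemble(a), disassemble(b), fillvalue="<end of stream>")
    for line_a, line_b in pairs:
        if line_a != line_b:
            return line_a, line_b
    return None


item = [1.5]
shared = [item, item]    # the same list object appears twice
copies = [[1.5], [1.5]]  # two equal but distinct list objects

assert shared == copies                              # equal state...
assert pickle.dumps(shared) != pickle.dumps(copies)  # ...but different byte streams
print(first_divergence(pickle.dumps(shared), pickle.dumps(copies)))
```

Passing `dill.dumps(pipeline)` and `dill.dumps(pipeline_from_file)` to `first_divergence` would be the analogous call for the example above.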
dill doesn't guarantee a unique hash for each different instance; it only guarantees that each instance can be turned into a unique string and then reverted to an instance whose state is equivalent to the original's state.
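
To illustrate the difference between "equivalent state" and "identical bytes", here is a minimal sketch of hashing an object's values directly instead of hashing its serialized form. This is not part of dill or klepto; `structural_digest` and `structural_hash` are hypothetical helpers written for this comment. It assumes the object is built from plain Python values, containers, NumPy-like numeric arrays, and instances with a `__dict__`. It does not handle reference cycles, `__slots__`, or object-dtype arrays, and it does not distinguish functions or classes in any meaningful way.

```python
import hashlib
import pickle
from types import SimpleNamespace


def _node(tag: str, payload: bytes) -> bytes:
    # Prefix every payload with a type tag so that, e.g., [1] and (1,) differ.
    return hashlib.sha256(tag.encode() + b"\x00" + payload).digest()


def structural_digest(obj) -> bytes:
    """Digest an object's values only; object identity and dict order are ignored."""
    if obj is None or isinstance(obj, (bool, int, float, str, bytes)):
        return _node(type(obj).__name__, repr(obj).encode())
    if isinstance(obj, (list, tuple)):
        return _node(type(obj).__name__, b"".join(structural_digest(v) for v in obj))
    if isinstance(obj, (set, frozenset)):
        return _node(type(obj).__name__, b"".join(sorted(structural_digest(v) for v in obj)))
    if isinstance(obj, dict):
        pairs = sorted(structural_digest(k) + structural_digest(v) for k, v in obj.items())
        return _node("dict", b"".join(pairs))
    if all(hasattr(obj, name) for name in ("dtype", "shape", "tobytes")):
        # Assumption: a NumPy-like numeric array or scalar (object dtypes would not be stable).
        header = f"{obj.dtype}|{tuple(obj.shape)}|".encode()
        return _node("array", header + obj.tobytes())
    if hasattr(obj, "__dict__"):
        cls = type(obj)
        return _node(f"{cls.__module__}.{cls.__qualname__}", structural_digest(vars(obj)))
    raise TypeError(f"no structural digest defined for {type(obj).__name__}")


def structural_hash(obj) -> str:
    return structural_digest(obj).hex()


# Hypothetical stand-in for a fitted model; a real Pipeline is not used here.
inner = [1.5, 2.5]
model = SimpleNamespace(name="demo", coef=inner, history=[inner, inner], params={"b": 2, "a": 1})
restored = pickle.loads(pickle.dumps(model))  # dill.loads(dill.dumps(model)) is the analogous call

assert structural_hash(model) == structural_hash(restored)
```

Because each node hashes only a type tag and its children's digests, two objects with equal values get the same hash regardless of how the serializer chose to lay them out. The cost is that every supported type has to be listed explicitly, and anything unlisted raises `TypeError` rather than being hashed silently.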
@mmckerns thanks, that is helpful. Would you be able to share a minimal example of how you would use klepto to create a consistent hash in this case?
Also, instead of using dill as highlighted above, I have found the approach below to be consistent across many cases. Other than being a bit slow, do you see any issues with that approach?
I'm looking into using dill as a key ingredient in generating consistent hashes of Python objects, especially scikit-learn Pipeline.
I would use it as in the `compute_hash` function from the example above.