
Latest dill release raises exception #4379

Closed
albertvillanova opened this issue May 20, 2022 · 8 comments · Fixed by #4380
Labels: bug (Something isn't working)
@albertvillanova (Member) commented:
Describe the bug

As reported by @sgugger, the latest dill release is breaking things in Datasets.

______________ ExamplesTests.test_run_speech_recognition_seq2seq _______________


self = <multiprocess.pool.ApplyResult object at 0x7fa5981a1cd0>, timeout = None

    def get(self, timeout=None):
        self.wait(timeout)
        if not self.ready():
            raise TimeoutError
        if self._success:
            return self._value
        else:
>           raise self._value
E           TypeError: '>' not supported between instances of 'NoneType' and 'float'
@albertvillanova albertvillanova added the bug Something isn't working label May 20, 2022
@albertvillanova albertvillanova self-assigned this May 20, 2022
@albertvillanova albertvillanova linked a pull request May 20, 2022 that will close this issue
@albertvillanova (Member, Author) commented:
Fixed by:

@gugarosa (Contributor) commented:
Just an additional insight, the latest dill (either 0.3.5 or 0.3.5.1) also broke the hashing/fingerprinting of any mapping function.

For example:

from datasets import load_dataset

d = load_dataset("rotten_tomatoes")
d.map(lambda x: x)

Returns the standard non-dillable error:

Parameter 'function'=<function <lambda> at 0x7fe7d18c9560> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly....
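For context, a minimal sketch of the fingerprinting idea behind that error (names here are hypothetical, not the actual datasets implementation): the mapping function is serialized and the bytes are hashed, and if serialization fails or is non-deterministic, the caching/fingerprinting step breaks as shown above.

```python
import hashlib

# Hypothetical sketch: hash stable parts of the function's code object.
# The real datasets implementation serializes the whole function (closure,
# globals, ...) with dill, which is where the breakage occurs.
def fingerprint(func):
    code = func.__code__
    payload = (
        code.co_code
        + repr(code.co_consts).encode()
        + repr(code.co_names).encode()
    )
    return hashlib.md5(payload).hexdigest()

print(fingerprint(lambda x: x))
```

The key property is that the same function must always produce the same digest; dill 0.3.5 broke that pipeline for datasets' copied serialization code.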

@anivegesana commented:
@albertvillanova Which file is ExamplesTests.test_run_speech_recognition_seq2seq in?

@albertvillanova (Member, Author) commented:
Thanks a lot @gugarosa for the insight: we will incorporate it in our CI as regression testing for future dill releases.

@albertvillanova (Member, Author) commented:

@anivegesana commented on May 21, 2022:
@albertvillanova

I did a deep dive into @gugarosa's problem, found the issue, and it might be related to the one @sgugger discovered. In dill 0.3.5(.1), I introduced a new save_function that fixes a bug in dill that prevented pickling recursive inner functions. It was a more complete solution to the problem that dill._dill.stack tried to solve in dill's internal API, so, since dill._dill.stack was no longer needed, I removed it. Because datasets copies save_function directly from the dill internals, it stops working with the new dill version: dill._dill.stack is no longer present, and save_function has been rewritten with new code.

@pklregister(FunctionType)
def save_function(pickler, obj):
    """
    From dill._dill.save_function
    This is a modified version that make globs deterministic since the order of
    the keys in the output dictionary of globalvars can change.
    """
    if not dill._dill._locate_function(obj):
        dill._dill.log.info(f"F1: {obj}")
        if getattr(pickler, "_recurse", False):
            # recurse to get all globals referred to by obj
            globalvars = dill.detect.globalvars
            globs = globalvars(obj, recurse=True, builtin=True)
            if id(obj) in dill._dill.stack:
                globs = obj.__globals__ if dill._dill.PY3 else obj.func_globals
        else:
            globs = obj.__globals__ if dill._dill.PY3 else obj.func_globals
        # globs is a dictionary with keys = var names (str) and values = python objects
        # however the dictionary is not always loaded in the same order
        # therefore we have to sort the keys to make deterministic.
        # This is important to make `dump` deterministic.
        # Only this line is different from the original implementation:
        globs = {k: globs[k] for k in sorted(globs.keys())}
        # The rest is the same as in the original dill implementation
        _byref = getattr(pickler, "_byref", None)
        _recurse = getattr(pickler, "_recurse", None)
        _memo = (id(obj) in dill._dill.stack) and (_recurse is not None)
        dill._dill.stack[id(obj)] = len(dill._dill.stack), obj
        if dill._dill.PY3:
            _super = ("super" in getattr(obj.__code__, "co_names", ())) and (_byref is not None)
            if _super:
                pickler._byref = True
            if _memo:
                pickler._recurse = False
            fkwdefaults = getattr(obj, "__kwdefaults__", None)
            pickler.save_reduce(
                dill._dill._create_function,
                (obj.__code__, globs, obj.__name__, obj.__defaults__, obj.__closure__, obj.__dict__, fkwdefaults),
                obj=obj,
            )
        else:
            _super = (
                ("super" in getattr(obj.func_code, "co_names", ()))
                and (_byref is not None)
                and getattr(pickler, "_recurse", False)
            )
            if _super:
                pickler._byref = True
            if _memo:
                pickler._recurse = False
            pickler.save_reduce(
                dill._dill._create_function,
                (obj.func_code, globs, obj.func_name, obj.func_defaults, obj.func_closure, obj.__dict__),
                obj=obj,
            )
        if _super:
            pickler._byref = _byref
        if _memo:
            pickler._recurse = _recurse
        if (
            dill._dill.OLDER
            and not _byref
            and (_super or (not _super and _memo) or (not _super and not _memo and _recurse))
        ):
            pickler.clear_memo()
        dill._dill.log.info("# F1")
    else:
        dill._dill.log.info(f"F2: {obj}")
        name = getattr(obj, "__qualname__", getattr(obj, "__name__", None))
        dill._dill.StockPickler.save_global(pickler, obj, name=name)
        dill._dill.log.info("# F2")
    return

If the dill version is below 0.3.5, you should keep this function. If it is 0.3.5 or later, you would need to update your copy of save_function to use the new code I introduced, or manually add a stack variable to dill._dill if it doesn't exist. Fortunately, in Python 3.7+ dictionaries always preserve insertion order, and dill no longer supports Python 3.6 or older. So any globals dictionary saved by dill 0.3.5+ will be deterministic as long as the dill version is held constant, and this save_function is unnecessary for newer versions of dill.
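The two ordering guarantees mentioned above can be illustrated in plain Python, no dill required:

```python
# Python 3.7+ guarantees insertion order for dicts, so a globals mapping built
# in a fixed order is already reproducible within one dill version. Sorting by
# key additionally makes the result independent of definition order in the file.
globs = {"zeta": 1, "alpha": 2, "mid": 3}
assert list(globs) == ["zeta", "alpha", "mid"]              # insertion order kept
sorted_globs = {k: globs[k] for k in sorted(globs.keys())}  # the datasets tweak
assert list(sorted_globs) == ["alpha", "mid", "zeta"]       # source-order independent
```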

Ah, I see what is happening. I guess a different copy of the function code is needed, one that sorts the global variables by name.

# Compare numeric version components; a plain string comparison would
# misorder releases like '0.3.10'.
if tuple(int(p) for p in dill.__version__.split('.')[:3]) < (0, 3, 5):
    # current save_function code inside here
else:
    # new save_function code inside here, with the following line inserted
    # after the globals dictionary is created:
    globs = {k: globs[k] for k in sorted(globs.keys())}

Will look into the test case @sgugger pointed out after that and verify if this is causing the problem.

I am actually looking into rewriting the global variables code in uqfoundation/dill#466 and will keep this in mind and will try to create an easy way to modify the global variables in dill 0.3.6 (for example, sort them by key like datasets does).

@albertvillanova (Member, Author) commented:
Thanks a lot for your investigation @anivegesana.

Yes, we copy-pasted the old save_function function from dill, just adding one line to make the order of the global variables globs deterministic.

However, this function has changed a lot as of version 0.3.5, after your PR (thank you for the recursiveness fix, indeed):

We have to address this change.

If your PR to sort global variables is finally merged into dill 0.3.6, that will make our life easier, as the tweak will no longer be necessary. ;)

I have included a regression test so that we are sure future releases of dill do not break datasets:

@anivegesana commented on May 21, 2022:
I should note that because Python 3.6 and older are now deprecated and Python 3.7 has insertion-order dictionaries, the globals in dill will have a deterministic order, just not a sorted one. I would still keep them sorted as you have it, to help with stability (for example, if someone reorders variables in a file, sorting the globals would not invalidate the cache).

It seems that the order is not quite deterministic in IPython. Hugging Face datasets seems to do well in Jupyter regardless, so it is not a good idea to remove the sorting. uqfoundation/dill#19
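A small sketch of the cache-stability point: with sorted globals, two source files that define the same variables in a different order still produce the same digest (the hashing scheme here is illustrative, not datasets' actual one).

```python
import hashlib

def digest(globs):
    # Sort keys before hashing so the digest ignores definition order.
    ordered = {k: globs[k] for k in sorted(globs)}
    return hashlib.md5(repr(ordered).encode()).hexdigest()

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # same variables, reordered in the source file
assert digest(a) == digest(b)      # cache entry stays valid
assert digest(a) != digest({"x": 9, "y": 2})  # real changes still invalidate it
```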

3 participants