Inconsistent caching behaviour when using Dataset.map() with a new_fingerprint and num_proc>1 #3044
Comments
Following the discussion in #3045, it would be nice to have a way to let users have a nice experience with caching even if the function is not hashable. Currently a workaround is to make the function picklable. This can be done by implementing a callable class instead, which can be pickled by implementing custom pickling methods. However, that sounds pretty complicated for a simple thing. Maybe one idea would be to have something similar to streamlit: they allow users to register the hashing of their own objects. See the documentation about their `hash_funcs` parameter. Here is the example they give:

```python
import os

import streamlit as st


class FileReference:
    def __init__(self, filename):
        self.filename = filename


def hash_file_reference(file_reference):
    filename = file_reference.filename
    return (filename, os.path.getmtime(filename))


@st.cache(hash_funcs={FileReference: hash_file_reference})
def func(file_reference):
    ...
```
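The picklable callable-class workaround mentioned above can be sketched as follows. This is a minimal illustration, not a `datasets` API: the class name, its configuration, and the example function body are all my own.

```python
import pickle


class Tokenize:
    """A callable class used instead of a closure so that it can be pickled.

    Instances with the same configuration pickle to the same bytes, so a
    caching mechanism can hash them deterministically.
    """

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def __call__(self, example):
        text = example["text"]
        if self.lowercase:
            text = text.lower()
        return {"tokens": text.split()}


fn = Tokenize()
# The instance round-trips through pickle, unlike a lambda or local closure.
restored = pickle.loads(pickle.dumps(fn))
print(restored({"text": "Hello World"}))  # {'tokens': ['hello', 'world']}
```

An instance like `Tokenize()` can then be passed to `Dataset.map()` in place of a lambda, at the cost of the boilerplate the comment complains about.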
My solution was to generate a custom hash, and use the hash as a `new_fingerprint`. @lhoestq, this approach is very neat; it would make the whole caching mechanic more explicit. I don't have much time to look into this right now, but I might give it a try in the future. |
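One way to build such a custom hash can be sketched as below. Only `new_fingerprint` is an actual `Dataset.map()` parameter; the helper name and the hashing scheme (bytecode plus keyword configuration) are assumptions of mine, one possible choice among many.

```python
import hashlib


def make_fingerprint(func, **config):
    """Build a deterministic hex digest from a function's bytecode and a
    keyword configuration, suitable for passing as new_fingerprint."""
    payload = func.__code__.co_code + repr(sorted(config.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]


def add_prefix(example):
    return {"text": "prefix: " + example["text"]}


fp = make_fingerprint(add_prefix, version=1)
# Same function and config always yield the same fingerprint,
# so a call like dataset.map(add_prefix, new_fingerprint=fp)
# would produce a stable cache key across runs.
```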
Almost a year later and I'm in a similar boat: using custom fingerprints, and when using multiprocessing the cached datasets are saved with a template at the end of the filename (something like "000001_of_000008" for every process of num_proc). So if the next time you run the script you set num_proc to a different number, the cache cannot be used. Is there any way to get around this? I am processing a huge dataset, so I do the processing on one machine and then transfer the processed data to another machine's cache dir, but currently that's not possible due to the num_proc mismatch. |
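The mismatch described above can be illustrated with a small sketch of how such per-process cache names are formed. The suffix pattern is taken from the comments in this thread; the helper function itself is mine, not the actual `datasets` implementation.

```python
def shard_cache_name(fingerprint, rank, num_proc):
    """Build a per-process cache file name with a _NNNNN_of_NNNNN suffix.

    With num_proc workers, shard `rank` gets a suffix appended to the
    fingerprinted base name, so the set of expected file names depends
    on num_proc itself.
    """
    return f"cache-{fingerprint}_{rank:05d}_of_{num_proc:05d}.arrow"


fp = "3b163736cf4505085d8b5f9b4c266c26"
names_8 = [shard_cache_name(fp, r, 8) for r in range(8)]
names_2 = [shard_cache_name(fp, r, 2) for r in range(2)]
# No overlap: rerunning with a different num_proc looks for different files,
# which is why a cache written with num_proc=8 is missed with num_proc=2.
print(set(names_8) & set(names_2))  # set()
```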
In your example the cache files are present:

```
$ tree ~/.cache/huggingface/datasets/json/.../
~/.cache/huggingface/datasets/json/.../
├── cache-3b163736cf4505085d8b5f9b4c266c26_00000_of_00002.arrow
└── cache-3b163736cf4505085d8b5f9b4c266c26_00001_of_00002.arrow
```

You can also see there isn't a 2nd progress bar, so it is definitely using the cache on the second call to `map()`. |
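To check for cache reuse the same way on your own machine, a quick stdlib listing of the shard files can help. In this sketch the cache directory is a temporary placeholder standing in for `~/.cache/huggingface/datasets/json/.../`, and the fingerprint is the one quoted above.

```python
import tempfile
from pathlib import Path

# Placeholder cache directory, standing in for the real datasets cache dir.
cache_dir = Path(tempfile.mkdtemp())
fp = "3b163736cf4505085d8b5f9b4c266c26"
for rank in range(2):
    # Simulate the two shards a num_proc=2 map() run would leave behind.
    (cache_dir / f"cache-{fp}_{rank:05d}_of_{2:05d}.arrow").touch()

shards = sorted(p.name for p in cache_dir.glob(f"cache-{fp}_*_of_*.arrow"))
print(shards)
# If both shard files are present, a second map() call with the same
# fingerprint and num_proc should reuse them (hence no second progress bar).
```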
Describe the bug
Caching does not work when using `Dataset.map()` with `num_proc>1` and a `new_fingerprint`. This means that the dataset will be mapped with the function for each and every call, which does not happen if `num_proc==1`. In that case (`num_proc==1`) subsequent calls will load the transformed dataset from the cache, which is the expected behaviour. The example can easily be translated into a unit test. I have a fix and will submit a pull request asap.
Steps to reproduce the bug
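The original reproduction script did not survive this scrape. As a stand-in, here is a stdlib-only sketch of the logic it exercised: the function name `process_dataset_with_cache` and the "=== Cache does not exist! ====" printout are taken from the report below, while the cache-file simulation is my own and only models the expected (correct) behaviour, not the buggy one.

```python
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()


def cache_files(fingerprint, num_proc):
    # Mimics the per-process shard names written for num_proc>1.
    return [
        os.path.join(
            CACHE_DIR, f"cache-{fingerprint}_{r:05d}_of_{num_proc:05d}.arrow"
        )
        for r in range(num_proc)
    ]


def process_dataset_with_cache(fingerprint, num_proc):
    files = cache_files(fingerprint, num_proc)
    if not all(os.path.exists(f) for f in files):
        print("=== Cache does not exist! ====")
        for f in files:  # stand-in for the actual map() computation
            open(f, "w").close()
    return files


# First call computes and writes the shards; the second finds them.
process_dataset_with_cache("abc123", num_proc=2)
# With the reported bug, datasets printed the line again here;
# this sketch shows the expected behaviour (no second print).
process_dataset_with_cache("abc123", num_proc=2)
```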
Expected results
In the above python example, with `num_proc=2`, the cache file should exist in the second call of `process_dataset_with_cache` ("=== Cache does not exist! ====" should not be printed). When the cache is successfully created, `map()` is called only one time.

Actual results
In the above python example, with `num_proc=2`, the cache does not exist in the second call of `process_dataset_with_cache` (this results in printing "=== Cache does not exist! ===="). Because the cache doesn't exist, the `map()` method is executed a second time and the dataset is not loaded from the cache.

Environment info
datasets version: 1.12.1