
Inconsistent caching behaviour when using Dataset.map() with a new_fingerprint and num_proc>1 #3044


Open
vlievin opened this issue Oct 8, 2021 · 4 comments
Labels
bug Something isn't working

Comments

vlievin commented Oct 8, 2021

Describe the bug

Caching does not work when using Dataset.map() with:

  1. a function that cannot be deterministically fingerprinted
  2. num_proc>1
  3. a custom fingerprint set via the new_fingerprint argument.

This means that the dataset will be mapped with the function for each and every call, which does not happen if num_proc==1. In that case (num_proc==1) subsequent calls will load the transformed dataset from the cache, which is the expected behaviour. The example can easily be translated into a unit test.

I have a fix and will submit a pull request asap.

Steps to reproduce the bug

import hashlib
import json
import os
from typing import Dict, Any

import numpy as np
from datasets import load_dataset, Dataset

Batch = Dict[str, Any]
filename = 'example.json'


class Transformation():
    """A transformation with a random state that cannot be fingerprinted"""

    def __init__(self):
        self.state = np.random.random()

    def __call__(self, batch: Batch) -> Batch:
        batch['x'] = [np.random.random() for _ in batch['x']]
        return batch


def generate_dataset():
    """generate a simple dataset"""
    rgn = np.random.RandomState(24)
    data = {
        'data': [{'x': float(y), 'y': -float(y)} for y in
                 rgn.random(size=(1000,))]}
    if not os.path.exists(filename):
        with open(filename, 'w') as f:
            f.write(json.dumps(data))

    return filename


def process_dataset_with_cache(num_proc=1, remove_cache=False,
                               cache_expected_to_exist=False):

    # load the generated dataset
    dset: Dataset = next(
        iter(load_dataset('json', data_files=filename, field='data').values()))
    new_fingerprint = hashlib.md5("static-id".encode("utf8")).hexdigest()

    # get the expected cached path
    cache_path = dset._get_cache_file_path(new_fingerprint)
    if remove_cache and os.path.exists(cache_path):
        os.remove(cache_path)

    # check that the cache exists, and print a statement
    # if it was actually expected to exist
    cache_exist = os.path.exists(cache_path)
    print(f"> cache file exists={cache_exist}")
    if cache_expected_to_exist and not cache_exist:
        print("=== Cache does not exist! ====")

    # apply the transformation with the new fingerprint
    dset = dset.map(
        Transformation(),
        batched=True,
        num_proc=num_proc,
        new_fingerprint=new_fingerprint,
        desc="mapping dataset with transformation")


generate_dataset()

for num_proc in [1, 2]:
    print(f"# num_proc={num_proc}, first pass")
    # first pass to generate the cache (always create a new cache here)
    process_dataset_with_cache(remove_cache=True,
                               num_proc=num_proc,
                               cache_expected_to_exist=False)
    print(f"# num_proc={num_proc}, second pass")
    # second pass, expects the cache to exist
    process_dataset_with_cache(remove_cache=False,
                               num_proc=num_proc,
                               cache_expected_to_exist=True)

os.remove(filename)

Expected results

In the above python example, with num_proc=2, the cache file should exist in the second call of process_dataset_with_cache ("=== Cache does not exist! ====" should not be printed).
When the cache is successfully created, map() is called only one time.

Actual results

In the above python example, with num_proc=2, the cache does not exist in the second call of process_dataset_with_cache (this results in printing "=== Cache does not exist! ====").
Because the cache doesn't exist, the map() method is executed a second time and the dataset is not loaded from the cache.

Environment info

  • datasets version: 1.12.1
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.8
  • PyArrow version: 5.0.0
@vlievin vlievin added the bug Something isn't working label Oct 8, 2021
vlievin added a commit to vlievin/datasets that referenced this issue Oct 8, 2021
vlievin added a commit to vlievin/datasets that referenced this issue Oct 8, 2021
lhoestq (Member) commented Oct 21, 2021

Following the discussion in #3045, it would be nice to have a way to give users a good caching experience even if the function is not hashable.

Currently a workaround is to make the function picklable. This can be done by implementing a callable class instead, which can be pickled by implementing a custom __getstate__ method, for example.
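
A minimal sketch of that workaround, assuming the random state is the only non-deterministic attribute (the class below is illustrative and not part of the datasets API):

import numpy as np


class PicklableTransformation:
    """A callable whose pickled form is deterministic, so it can be fingerprinted."""

    def __init__(self):
        self.state = np.random.random()  # runtime-only, non-deterministic state

    def __call__(self, batch):
        batch['x'] = [np.random.random() for _ in batch['x']]
        return batch

    def __getstate__(self):
        # Exclude the random state from the pickled representation so that
        # two instances pickle (and therefore hash) identically across runs.
        state = self.__dict__.copy()
        state.pop('state', None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.state = np.random.random()  # re-create the runtime state after unpickling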

However, this sounds pretty complicated for a simple thing. One idea would be to have something similar to Streamlit: they allow users to register hashing functions for their own objects.

See the documentation about their hash_funcs here: https://docs.streamlit.io/library/advanced-features/caching#the-hash_funcs-parameter

Here is the example they give:

class FileReference:
    def __init__(self, filename):
        self.filename = filename

def hash_file_reference(file_reference):
    filename = file_reference.filename
    return (filename, os.path.getmtime(filename))

@st.cache(hash_funcs={FileReference: hash_file_reference})
def func(file_reference):
    ...
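
For comparison, here is a purely hypothetical sketch of what an equivalent could look like for Dataset.map(), reusing Transformation and dset from the reproduction script above; hash_funcs is not an existing parameter of map() and only illustrates the idea:

def hash_transformation(t: Transformation) -> tuple:
    # Return only the values that actually determine the transform's output;
    # here the output depends on nothing but the class itself.
    return (type(t).__name__,)

# hash_funcs is a hypothetical parameter, not part of the current datasets API
dset = dset.map(
    Transformation(),
    batched=True,
    hash_funcs={Transformation: hash_transformation})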

vlievin (Author) commented Oct 27, 2021

My solution was to generate a custom hash and use that hash as the new_fingerprint argument to the map() method to enable caching, as sketched below. This works, but is quite hacky.

@lhoestq, this approach is very neat, this would make the whole caching mechanic more explicit. I don't have so much time to look into this right now, but I might give it a try in the future.
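
A minimal sketch of that new_fingerprint workaround, reusing dset and Transformation from the reproduction script above and assuming you can enumerate everything that determines the transform's output:

import hashlib
import json


def make_fingerprint(*parts) -> str:
    # Build a deterministic fingerprint from values you control.
    payload = json.dumps(parts, sort_keys=True, default=str)
    return hashlib.md5(payload.encode('utf8')).hexdigest()


# e.g. hash the transform's name and the configuration that affects its output
fingerprint = make_fingerprint('Transformation', {'version': 1})
dset = dset.map(Transformation(), batched=True, new_fingerprint=fingerprint)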

BramVanroy (Contributor) commented

Almost a year later and I'm in a similar boat. When using custom fingerprints with multiprocessing, the cached datasets are saved with a per-process suffix at the end of the filename (something like "000001_of_000008" for every process of num_proc). So if the next time you run the script you set num_proc to a different number, the cache cannot be used.

Is there any way to get around this? I am processing a huge dataset, so I do the processing on one machine and then transfer the processed data to another machine's cache dir, but currently that's not possible due to the num_proc mismatch.

ringohoffman (Contributor) commented Mar 4, 2025

Expected results

In the above python example, with num_proc=2, the cache file should exist in the second call of process_dataset_with_cache ("=== Cache does not exist! ====" should not be printed). When the cache is successfully created, map() is called only one time.

Actual results

In the above python example, with num_proc=2, the cache does not exist in the second call of process_dataset_with_cache (this results in printing "=== Cache does not exist! ===="). Because the cache doesn't exist, the map() method is executed a second time and the dataset is not loaded from the cache.

In your example

cache_path = "~/.cache/huggingface/datasets/json/.../cache-3b163736cf4505085d8b5f9b4c266c26.arrow"

but

$ tree ~/.cache/huggingface/datasets/json/.../
~/.cache/huggingface/datasets/json/.../
├── cache-3b163736cf4505085d8b5f9b4c266c26_00000_of_00002.arrow
├── cache-3b163736cf4505085d8b5f9b4c266c26_00001_of_00002.arrow

When num_proc > 1, the cache files are sharded and not saved under cache_path. Instead, a per-process suffix is appended, so it is expected that os.path.exists(cache_path) is False and that "=== Cache does not exist! ====" is printed.

You can also see that there isn't a second progress bar, so the cache is definitely being used on the second call to process_dataset_with_cache, with both num_proc=1 and num_proc=2.
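
A small sketch of how the reproduction script's existence check could account for the sharded files; the _*_of_* pattern matches the filenames shown above, and dset and new_fingerprint come from the script:

import glob
import os

cache_path = dset._get_cache_file_path(new_fingerprint)
base, ext = os.path.splitext(cache_path)
# with num_proc > 1 each worker writes its own shard,
# e.g. cache-<fingerprint>_00000_of_00002.arrow
shards = glob.glob(f"{base}_*_of_*{ext}")
cache_exists = os.path.exists(cache_path) or len(shards) > 0
print(f"> cache file exists={cache_exists} (shards found: {len(shards)})")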
