
Inconsistent caching behaviour when using Dataset.map() with a new_fingerprint and num_proc>1 #3044


Open
vlievin opened this issue Oct 8, 2021 · 4 comments
Labels
bug Something isn't working

Comments

vlievin commented Oct 8, 2021

Describe the bug

Caching does not work when using Dataset.map() with:

  1. a function that cannot be deterministically fingerprinted
  2. num_proc>1
  3. a custom fingerprint set via the new_fingerprint argument.

This means that the dataset will be mapped with the function for each and every call, which does not happen if num_proc==1. In that case (num_proc==1) subsequent calls will load the transformed dataset from the cache, which is the expected behaviour. The example can easily be translated into a unit test.

I have a fix and will submit a pull request asap.

Steps to reproduce the bug

import hashlib
import json
import os
from typing import Dict, Any

import numpy as np
from datasets import load_dataset, Dataset

Batch = Dict[str, Any]
filename = 'example.json'


class Transformation():
    """A transformation with a random state that cannot be fingerprinted"""

    def __init__(self):
        self.state = np.random.random()

    def __call__(self, batch: Batch) -> Batch:
        batch['x'] = [np.random.random() for _ in batch['x']]
        return batch


def generate_dataset():
    """generate a simple dataset"""
    rgn = np.random.RandomState(24)
    data = {
        'data': [{'x': float(y), 'y': -float(y)} for y in
                 rgn.random(size=(1000,))]}
    if not os.path.exists(filename):
        with open(filename, 'w') as f:
            f.write(json.dumps(data))

    return filename


def process_dataset_with_cache(num_proc=1, remove_cache=False,
                               cache_expected_to_exist=False):

    # load the generated dataset
    dset: Dataset = next(
        iter(load_dataset('json', data_files=filename, field='data').values()))
    new_fingerprint = hashlib.md5("static-id".encode("utf8")).hexdigest()

    # get the expected cached path
    cache_path = dset._get_cache_file_path(new_fingerprint)
    if remove_cache and os.path.exists(cache_path):
        os.remove(cache_path)

    # check that the cache exists, and print a statement
    # if it was actually expected to exist
    cache_exist = os.path.exists(cache_path)
    print(f"> cache file exists={cache_exist}")
    if cache_expected_to_exist and not cache_exist:
        print("=== Cache does not exist! ====")

    # apply the transformation with the new fingerprint
    dset = dset.map(
        Transformation(),
        batched=True,
        num_proc=num_proc,
        new_fingerprint=new_fingerprint,
        desc="mapping dataset with transformation")


generate_dataset()

for num_proc in [1, 2]:
    print(f"# num_proc={num_proc}, first pass")
    # first pass to generate the cache (always create a new cache here)
    process_dataset_with_cache(remove_cache=True,
                               num_proc=num_proc,
                               cache_expected_to_exist=False)
    print(f"# num_proc={num_proc}, second pass")
    # second pass, expects the cache to exist
    process_dataset_with_cache(remove_cache=False,
                               num_proc=num_proc,
                               cache_expected_to_exist=True)

os.remove(filename)

Expected results

In the above python example, with num_proc=2, the cache file should exist in the second call of process_dataset_with_cache ("=== Cache does not exist! ====" should not be printed).
When the cache is successfully created, map() is called only one time.

Actual results

In the above python example, with num_proc=2, the cache does not exist in the second call of process_dataset_with_cache (this results in printing "=== Cache does not exist! ====").
Because the cache doesn't exist, the map() method is executed a second time and the dataset is not loaded from the cache.

Environment info

  • datasets version: 1.12.1
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.8
  • PyArrow version: 5.0.0
@vlievin vlievin added the bug Something isn't working label Oct 8, 2021
vlievin added a commit to vlievin/datasets that referenced this issue Oct 8, 2021
vlievin added a commit to vlievin/datasets that referenced this issue Oct 8, 2021
lhoestq (Member) commented Oct 21, 2021

Following the discussion in #3045, it would be nice to have a way to give users a good caching experience even if the function is not hashable.

Currently a workaround is to make the function picklable. This can be done by implementing a callable class instead, which can be pickled by implementing a custom __getstate__ method, for example.
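
A minimal sketch of that workaround, assuming the random state is the only non-deterministic attribute (the class below is illustrative and not part of the datasets API):

import numpy as np


class PicklableTransformation:
    """A callable whose pickled form is deterministic, so it can be fingerprinted."""

    def __init__(self):
        self.state = np.random.random()  # runtime-only, non-deterministic state

    def __call__(self, batch):
        batch['x'] = [np.random.random() for _ in batch['x']]
        return batch

    def __getstate__(self):
        # Exclude the random state from the pickled representation so that
        # two instances pickle (and therefore hash) identically across runs.
        state = self.__dict__.copy()
        state.pop('state', None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.state = np.random.random()  # re-create the runtime state after unpickling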

However, this sounds pretty complicated for a simple thing. One idea would be to have something similar to Streamlit: they allow users to register hashing functions for their own objects.

See the documentation about their hash_funcs here: https://docs.streamlit.io/library/advanced-features/caching#the-hash_funcs-parameter

Here is the example they give:

class FileReference:
    def __init__(self, filename):
        self.filename = filename

def hash_file_reference(file_reference):
    filename = file_reference.filename
    return (filename, os.path.getmtime(filename))

@st.cache(hash_funcs={FileReference: hash_file_reference})
def func(file_reference):
    ...
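
For comparison, here is a purely hypothetical sketch of what an equivalent could look like for Dataset.map(), reusing Transformation and dset from the reproduction script above; hash_funcs is not an existing parameter of map() and only illustrates the idea:

def hash_transformation(t: Transformation) -> tuple:
    # Return only the values that actually determine the transform's output;
    # here the output depends on nothing but the class itself.
    return (type(t).__name__,)

# hash_funcs is a hypothetical parameter, not part of the current datasets API
dset = dset.map(
    Transformation(),
    batched=True,
    hash_funcs={Transformation: hash_transformation})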

vlievin (Author) commented Oct 27, 2021

My solution was to generate a custom hash and use that hash as the new_fingerprint argument to the map() method to enable caching, as sketched below. This works, but is quite hacky.

@lhoestq, this approach is very neat, this would make the whole caching mechanic more explicit. I don't have so much time to look into this right now, but I might give it a try in the future.
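
A minimal sketch of that new_fingerprint workaround, reusing dset and Transformation from the reproduction script above and assuming you can enumerate everything that determines the transform's output:

import hashlib
import json


def make_fingerprint(*parts) -> str:
    # Build a deterministic fingerprint from values you control.
    payload = json.dumps(parts, sort_keys=True, default=str)
    return hashlib.md5(payload.encode('utf8')).hexdigest()


# e.g. hash the transform's name and the configuration that affects its output
fingerprint = make_fingerprint('Transformation', {'version': 1})
dset = dset.map(Transformation(), batched=True, new_fingerprint=fingerprint)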

BramVanroy (Contributor) commented

Almost a year later and I'm in a similar boat. When using custom fingerprints with multiprocessing, the cached datasets are saved with a per-process suffix at the end of the filename (something like "000001_of_000008" for every process of num_proc). So if the next time you run the script you set num_proc to a different number, the cache cannot be used.

Is there any way to get around this? I am processing a huge dataset, so I do the processing on one machine and then transfer the processed data to another machine's cache dir, but currently that's not possible due to the num_proc mismatch.

ringohoffman (Contributor) commented Mar 4, 2025

Expected results

In the above python example, with num_proc=2, the cache file should exist in the second call of process_dataset_with_cache ("=== Cache does not exist! ====" should not be printed). When the cache is successfully created, map() is called only one time.

Actual results

In the above python example, with num_proc=2, the cache does not exist in the second call of process_dataset_with_cache (this results in printing "=== Cache does not exist! ===="). Because the cache doesn't exist, the map() method is executed a second time and the dataset is not loaded from the cache.

In your example

cache_path = "~/.cache/huggingface/datasets/json/.../cache-3b163736cf4505085d8b5f9b4c266c26.arrow"

but

$ tree ~/.cache/huggingface/datasets/json/.../
~/.cache/huggingface/datasets/json/.../
├── cache-3b163736cf4505085d8b5f9b4c266c26_00000_of_00002.arrow
├── cache-3b163736cf4505085d8b5f9b4c266c26_00001_of_00002.arrow

When num_proc > 1, the cache files are sharded and not saved under cache_path. Instead, a per-process suffix is appended, so it is expected that os.path.exists(cache_path) is False and that "=== Cache does not exist! ====" is printed.

You can also see that there isn't a second progress bar, so the cache is definitely being used on the second call to process_dataset_with_cache, with both num_proc=1 and num_proc=2.
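
A small sketch of how the reproduction script's existence check could account for the sharded files; the _*_of_* pattern matches the filenames shown above, and dset and new_fingerprint come from the script:

import glob
import os

cache_path = dset._get_cache_file_path(new_fingerprint)
base, ext = os.path.splitext(cache_path)
# with num_proc > 1 each worker writes its own shard,
# e.g. cache-<fingerprint>_00000_of_00002.arrow
shards = glob.glob(f"{base}_*_of_*{ext}")
cache_exists = os.path.exists(cache_path) or len(shards) > 0
print(f"> cache file exists={cache_exists} (shards found: {len(shards)})")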
