Dataset.map ignores existing caches and remaps when run with different num_proc #7433

Closed
ringohoffman opened this issue Mar 3, 2025 · 2 comments · Fixed by #7434

Comments

ringohoffman commented Mar 3, 2025

Describe the bug

If you map a dataset and save it to a specific cache_file_name with a specific num_proc, and then call map again with that same existing cache_file_name but a different num_proc, the dataset will be re-mapped.

Steps to reproduce the bug

1. Download a dataset:
import datasets

dataset = datasets.load_dataset("ylecun/mnist")
Generating train split: 100%|██████████| 60000/60000 [00:00<00:00, 116429.85 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 103310.27 examples/s]
2. Map and cache it with a specific num_proc:
cache_file_name = "./cache/train.map"
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=2)
Map (num_proc=2): 100%|██████████| 60000/60000 [00:01<00:00, 53764.03 examples/s]
3. Map it with a different num_proc and the same cache_file_name as before:
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=3)
Map (num_proc=3): 100%|██████████| 60000/60000 [00:00<00:00, 65377.12 examples/s]
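
For context, the remap happens because each multiprocessing run derives its shard file names from num_proc. Below is a minimal sketch of the naming, assuming the default suffix_template of Dataset.map ("_{rank:05d}_of_{num_proc:05d}"):

```python
# Sketch of how per-process cache shard names depend on num_proc, assuming
# datasets' default suffix_template = "_{rank:05d}_of_{num_proc:05d}".
import os

suffix_template = "_{rank:05d}_of_{num_proc:05d}"
base, ext = os.path.splitext("./cache/train.map")

for num_proc in (2, 3):
    shards = [base + suffix_template.format(rank=rank, num_proc=num_proc) + ext
              for rank in range(num_proc)]
    print(num_proc, shards)

# num_proc=2 writes ./cache/train_00000_of_00002.map and ./cache/train_00001_of_00002.map;
# the num_proc=3 run looks for *_of_00003.map shards, finds none, and remaps.
```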

Expected behavior

If I specify an existing cache_file_name, I don't expect that using a different num_proc than the one used to generate it will cause the dataset to be re-mapped.

Environment info

$ datasets-cli env

- `datasets` version: 3.3.2
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0
ringohoffman changed the title from "Dataset.map ignores cache_file_name when run with different num_proc" to "Dataset.map ignores existing caches and remaps when run with different num_proc" on Mar 3, 2025

lhoestq commented Mar 3, 2025

This feels related: #3044

ringohoffman commented

@lhoestq This comment specifically, I agree:

> Almost a year later and I'm in a similar boat. Using custom fingerprints and when using multiprocessing the cached datasets are saved with a template at the end of the filename (something like "000001_of_000008" for every process of num_proc). So if in the next time you run the script you set num_proc to a different number, the cache cannot be used.
>
> Is there any way to get around this? I am processing a huge dataset so I do the processing on one machine and then transfer the processed data to another in its cache dir but currently that's not possible due to num_proc mismatch.

ringohoffman added a commit to ringohoffman/datasets that referenced this issue Mar 4, 2025
Fixes huggingface#7433

This refactor unifies the `num_proc is None or num_proc == 1` and `num_proc > 1` code paths. Instead of handling them completely separately (one using a list of kwargs and shards, the other a single set of kwargs and `self`), the `num_proc == 1` case is wrapped in a list so that the only remaining difference is whether a pool is used. Either case can then load the other's cache files just by changing `num_shards`: with `num_proc == 1`, the shards of a dataset mapped with `num_shards > 1` are loaded sequentially, and any missing shards are mapped sequentially.

Other than the structural refactor, the main contribution of this PR is `get_existing_cache_file_map`, which uses a regex built from `cache_file_name` and `suffix_template` to find existing cache files, grouped by their `num_shards`. Using this data structure, we can reset `num_shards` to match an existing set of cache files and load them accordingly.
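
For illustration only, here is a minimal sketch of the discovery idea described above; the function name and the default suffix template are assumptions based on this description, not the PR's actual code:

```python
# Hypothetical sketch: scan for shard files matching the suffix template and
# group them by the num_proc (i.e. num_shards) they were written with.
import glob
import os
import re
from collections import defaultdict

def get_existing_cache_file_map(
    cache_file_name: str,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",  # assumed default
) -> dict[int, list[str]]:
    base, ext = os.path.splitext(cache_file_name)
    # Turn the suffix template into a regex with named groups for rank/num_proc.
    suffix_regex = (
        re.escape(suffix_template)
        .replace(re.escape("{rank:05d}"), r"(?P<rank>\d{5})")
        .replace(re.escape("{num_proc:05d}"), r"(?P<num_proc>\d{5})")
    )
    pattern = re.compile(re.escape(base) + suffix_regex + re.escape(ext) + r"$")

    grouped: dict[int, list[str]] = defaultdict(list)
    for path in glob.glob(base + "*" + ext):
        match = pattern.match(path)
        if match:
            grouped[int(match.group("num_proc"))].append(path)
    return dict(grouped)

# With shards from a num_proc=2 run on disk, a num_proc=3 run can see
# {2: ["./cache/train_00000_of_00002.map", "./cache/train_00001_of_00002.map"]}
# and reset num_shards to 2 to load them instead of remapping.
```
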
lhoestq added a commit that referenced this issue May 12, 2025
Refactor Dataset.map to reuse cache files mapped with different `num_proc` (#7434)

* Refactor Dataset.map to reuse cache files mapped with different num_proc

Fixes #7433

This refactor unifies the `num_proc is None or num_proc == 1` and `num_proc > 1` code paths. Instead of handling them completely separately (one using a list of kwargs and shards, the other a single set of kwargs and `self`), the `num_proc == 1` case is wrapped in a list so that the only remaining difference is whether a pool is used. Either case can then load the other's cache files just by changing `num_shards`: with `num_proc == 1`, the shards of a dataset mapped with `num_shards > 1` are loaded sequentially, and any missing shards are mapped sequentially.

Other than the structural refactor, the main contribution of this PR is `get_existing_cache_file_map`, which uses a regex built from `cache_file_name` and `suffix_template` to find existing cache files, grouped by their `num_shards`. Using this data structure, we can reset `num_shards` to match an existing set of cache files and load them accordingly.

* Only give the reprocessing message when doing a partial remap

Also fix the spacing in the message

* Update the logging message to account for whether a cache file will be written at all, and whether it will be written by the main process or not

* Refactor string_to_dict to return None if there is no match instead of raising ValueError

Instead of the pattern of using try-except to handle when there is no match, we can check whether the return value is None; we can also assert that the return value is not None when we know that must be true.
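
A minimal sketch of this contract change; the real `string_to_dict` in datasets parses "{field}"-style patterns, so the regex-based stand-in below is an illustrative assumption:

```python
import re
from typing import Optional

# Illustrative stand-in for string_to_dict's new contract: return the parsed
# fields, or None when there is no match, instead of raising ValueError.
def string_to_dict(string: str, pattern: str) -> Optional[dict]:
    match = re.match(pattern, string)
    return match.groupdict() if match else None

# Callers swap try/except ValueError for a simple None check:
fields = string_to_dict(
    "train_00001_of_00003.map",
    r"train_(?P<rank>\d{5})_of_(?P<num_proc>\d{5})\.map",
)
if fields is None:
    ...  # no match: fall back instead of catching an exception
else:
    assert fields == {"rank": "00001", "num_proc": "00003"}
```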

* Simplify existing existing_cache_file_map with string_to_dict

#7434 (comment)

* Set initial value if there are already existing cache files

#7434 (comment)

* Allow for source_url_fields to be None

They can be local file paths here:

https://github.com/huggingface/datasets/actions/runs/13683185040/job/38380924390?pr=7435#step:10:9731

* Add unicode escape to handle parsing string_to_dict in Windows paths

* Remove glob_pattern_to_regex

All the tests still pass when it is removed; I think the unicode escaping must do some of the work that glob_pattern_to_regex was doing here before

* fix dependencies

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>