Dataset.map ignores existing caches and remaps when run with different num_proc #7433

Closed
ringohoffman opened this issue Mar 3, 2025 · 2 comments · Fixed by #7434

Comments

ringohoffman commented Mar 3, 2025

Describe the bug

If you map a dataset and save it to a specific cache_file_name with a specific num_proc, and then call map again with that same existing cache_file_name but a different num_proc, the dataset will be re-mapped.

Steps to reproduce the bug

1. Download a dataset:
import datasets

dataset = datasets.load_dataset("ylecun/mnist")
Generating train split: 100%|██████████| 60000/60000 [00:00<00:00, 116429.85 examples/s]
Generating test split: 100%|██████████| 10000/10000 [00:00<00:00, 103310.27 examples/s]
2. Map and cache it with a specific num_proc:
cache_file_name = "./cache/train.map"
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=2)
Map (num_proc=2): 100%|██████████| 60000/60000 [00:01<00:00, 53764.03 examples/s]
3. Map it with a different num_proc and the same cache_file_name as before:
dataset["train"].map(lambda x: x, cache_file_name=cache_file_name, num_proc=3)
Map (num_proc=3): 100%|██████████| 60000/60000 [00:00<00:00, 65377.12 examples/s]
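
For context, the remap happens because each multiprocessing run derives its shard file names from num_proc. Below is a minimal sketch of the naming, assuming the default suffix_template of Dataset.map ("_{rank:05d}_of_{num_proc:05d}"):

```python
# Sketch of how per-process cache shard names depend on num_proc, assuming
# datasets' default suffix_template = "_{rank:05d}_of_{num_proc:05d}".
import os

suffix_template = "_{rank:05d}_of_{num_proc:05d}"
base, ext = os.path.splitext("./cache/train.map")

for num_proc in (2, 3):
    shards = [base + suffix_template.format(rank=rank, num_proc=num_proc) + ext
              for rank in range(num_proc)]
    print(num_proc, shards)

# num_proc=2 writes ./cache/train_00000_of_00002.map and ./cache/train_00001_of_00002.map;
# the num_proc=3 run looks for *_of_00003.map shards, finds none, and remaps.
```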

Expected behavior

If I specify an existing cache_file_name, I don't expect that using a different num_proc than the one used to generate it will cause the dataset to be re-mapped.

Environment info

$ datasets-cli env

- `datasets` version: 3.3.2
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- `huggingface_hub` version: 0.29.1
- PyArrow version: 19.0.1
- Pandas version: 2.2.3
- `fsspec` version: 2024.12.0
ringohoffman changed the title from "Dataset.map ignores cache_file_name when run with different num_proc" to "Dataset.map ignores existing caches and remaps when run with different num_proc" on Mar 3, 2025

lhoestq commented Mar 3, 2025

This feels related: #3044

ringohoffman commented

@lhoestq This comment specifically, I agree:

> Almost a year later and I'm in a similar boat. Using custom fingerprints and when using multiprocessing the cached datasets are saved with a template at the end of the filename (something like "000001_of_000008" for every process of num_proc). So if in the next time you run the script you set num_proc to a different number, the cache cannot be used.
>
> Is there any way to get around this? I am processing a huge dataset so I do the processing on one machine and then transfer the processed data to another in its cache dir but currently that's not possible due to num_proc mismatch.

ringohoffman added a commit to ringohoffman/datasets that referenced this issue Mar 4, 2025
Fixes huggingface#7433

This refactor unifies the `num_proc is None or num_proc == 1` and `num_proc > 1` code paths. Instead of handling them completely separately (one using a list of kwargs and shards, the other a single set of kwargs and `self`), the `num_proc == 1` case is wrapped in a list so that the only remaining difference is whether a pool is used. Either case can then load the other's cache files just by changing `num_shards`: with `num_proc == 1`, the shards of a dataset mapped with `num_shards > 1` are loaded sequentially, and any missing shards are mapped sequentially.

Other than the structural refactor, the main contribution of this PR is `get_existing_cache_file_map`, which uses a regex built from `cache_file_name` and `suffix_template` to find existing cache files, grouped by their `num_shards`. Using this data structure, we can reset `num_shards` to match an existing set of cache files and load them accordingly.
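
For illustration only, here is a minimal sketch of the discovery idea described above; the function name and the default suffix template are assumptions based on this description, not the PR's actual code:

```python
# Hypothetical sketch: scan for shard files matching the suffix template and
# group them by the num_proc (i.e. num_shards) they were written with.
import glob
import os
import re
from collections import defaultdict

def get_existing_cache_file_map(
    cache_file_name: str,
    suffix_template: str = "_{rank:05d}_of_{num_proc:05d}",  # assumed default
) -> dict[int, list[str]]:
    base, ext = os.path.splitext(cache_file_name)
    # Turn the suffix template into a regex with named groups for rank/num_proc.
    suffix_regex = (
        re.escape(suffix_template)
        .replace(re.escape("{rank:05d}"), r"(?P<rank>\d{5})")
        .replace(re.escape("{num_proc:05d}"), r"(?P<num_proc>\d{5})")
    )
    pattern = re.compile(re.escape(base) + suffix_regex + re.escape(ext) + r"$")

    grouped: dict[int, list[str]] = defaultdict(list)
    for path in glob.glob(base + "*" + ext):
        match = pattern.match(path)
        if match:
            grouped[int(match.group("num_proc"))].append(path)
    return dict(grouped)

# With shards from a num_proc=2 run on disk, a num_proc=3 run can see
# {2: ["./cache/train_00000_of_00002.map", "./cache/train_00001_of_00002.map"]}
# and reset num_shards to 2 to load them instead of remapping.
```
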
lhoestq added a commit that referenced this issue May 12, 2025
Refactor Dataset.map to reuse cache files mapped with different `num_proc` (#7434)

* Refactor Dataset.map to reuse cache files mapped with different num_proc

Fixes #7433

This refactor unifies the `num_proc is None or num_proc == 1` and `num_proc > 1` code paths. Instead of handling them completely separately (one using a list of kwargs and shards, the other a single set of kwargs and `self`), the `num_proc == 1` case is wrapped in a list so that the only remaining difference is whether a pool is used. Either case can then load the other's cache files just by changing `num_shards`: with `num_proc == 1`, the shards of a dataset mapped with `num_shards > 1` are loaded sequentially, and any missing shards are mapped sequentially.

Other than the structural refactor, the main contribution of this PR is `get_existing_cache_file_map`, which uses a regex built from `cache_file_name` and `suffix_template` to find existing cache files, grouped by their `num_shards`. Using this data structure, we can reset `num_shards` to match an existing set of cache files and load them accordingly.

* Only give the reprocessing message when doing a partial remap

Also fix the spacing in the message

* Update the logging message to account for whether a cache file will be written at all, and whether it will be written by the main process or not

* Refactor string_to_dict to return None if there is no match instead of raising ValueError

Instead of the pattern of using try-except to handle when there is no match, we can check whether the return value is None; we can also assert that the return value is not None when we know that must be true.
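
A minimal sketch of this contract change; the real `string_to_dict` in datasets parses "{field}"-style patterns, so the regex-based stand-in below is an illustrative assumption:

```python
import re
from typing import Optional

# Illustrative stand-in for string_to_dict's new contract: return the parsed
# fields, or None when there is no match, instead of raising ValueError.
def string_to_dict(string: str, pattern: str) -> Optional[dict]:
    match = re.match(pattern, string)
    return match.groupdict() if match else None

# Callers swap try/except ValueError for a simple None check:
fields = string_to_dict(
    "train_00001_of_00003.map",
    r"train_(?P<rank>\d{5})_of_(?P<num_proc>\d{5})\.map",
)
if fields is None:
    ...  # no match: fall back instead of catching an exception
else:
    assert fields == {"rank": "00001", "num_proc": "00003"}
```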

* Simplify existing existing_cache_file_map with string_to_dict

#7434 (comment)

* Set initial value if there are already existing cache files

#7434 (comment)

* Allow for source_url_fields to be None

They can be local file paths here:

https://github.com/huggingface/datasets/actions/runs/13683185040/job/38380924390?pr=7435#step:10:9731

* Add unicode escape to handle parsing string_to_dict in Windows paths

* Remove glob_pattern_to_regex

All the tests still pass when it is removed; I think the unicode escaping must do some of the work that glob_pattern_to_regex was doing here before

* fix dependencies

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>