`Dataset.map` ignores existing caches and remaps when run with different `num_proc` #7433
This feels related: #3044

@lhoestq This comment specifically, I agree:
Fixes huggingface#7433

This refactor unifies the `num_proc is None or num_proc == 1` and `num_proc > 1` code paths. Previously they were handled completely separately: one used a list of kwargs and shards, while the other used a single set of kwargs and `self`. By wrapping the `num_proc == 1` case in a list, the only remaining difference is whether or not a pool is used. Either case can then load the other's cache files just by changing `num_shards`; `num_proc == 1` can sequentially load the shards of a dataset mapped with `num_shards > 1` and sequentially map any missing shards.

Other than the structural refactor, the main contribution of this PR is `get_existing_cache_file_map`, which uses a regex built from `cache_file_name` and `suffix_template` to find existing cache files, grouped by their `num_shards`. Using this data structure, we can reset `num_shards` to match an existing set of cache files and load them accordingly.
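To make the grouping idea concrete, here is a minimal standalone sketch. It is not the PR's `get_existing_cache_file_map`; the helper names `group_cache_files_by_num_shards` and `complete_shard_counts` are hypothetical, and the suffix layout `_{rank}_of_{num_shards}` with five-digit fields is an assumption about how the shard files are named.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed shard suffix layout, e.g. "cache_00001_of_00004.arrow".
# Adjust this if your suffix template differs.
SUFFIX_REGEX = r"_(?P<rank>\d{5})_of_(?P<num_shards>\d{5})"


def group_cache_files_by_num_shards(cache_file_name):
    """Hypothetical helper: return {num_shards: {rank: path}} for shard files on disk."""
    path = Path(cache_file_name)
    # Build "<stem><suffix pattern><extension>" with the literal parts escaped.
    pattern = re.compile(re.escape(path.stem) + SUFFIX_REGEX + re.escape(path.suffix))
    groups = defaultdict(dict)
    if not path.parent.is_dir():
        return {}
    for candidate in path.parent.iterdir():
        match = pattern.fullmatch(candidate.name)
        if match:
            groups[int(match["num_shards"])][int(match["rank"])] = candidate
    return dict(groups)


def complete_shard_counts(groups):
    """Return the shard counts for which every rank 0..n-1 has a file."""
    return sorted(n for n, shards in groups.items() if set(shards) == set(range(n)))
```

With a structure like this, a caller could pick a shard count that already has a complete set of files and load those instead of remapping, or map only the missing ranks of a partial set.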
…m_proc` (#7434)

* Refactor Dataset.map to reuse cache files mapped with different num_proc

  Fixes #7433
* Only give the reprocessing message when doing a partial remap; also fix spacing in the message
* Update logging message to account for whether a cache file will be written at all, and whether it is written by the main process or not
* Refactor string_to_dict to return None if there is no match instead of raising ValueError. Instead of using try-except to handle the no-match case, we can check whether the return value is None; we can also assert that the return value is not None when we know that should be true
* Simplify existing_cache_file_map with string_to_dict #7434 (comment)
* Set initial value if there are already existing cache files #7434 (comment)
* Allow for source_url_fields to be None; they can be local file paths here https://github.com/huggingface/datasets/actions/runs/13683185040/job/38380924390?pr=7435#step:10:9731
* Add unicode escape to handle parsing string_to_dict in Windows paths
* Remove glob_pattern_to_regex. All the tests still pass when it is removed; I think the unicode escaping must do some of the work that glob_pattern_to_regex was doing here before
* fix dependencies

---------

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
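The `string_to_dict` change in the commit message above follows a common pattern: return `None` on no match and let the caller check, rather than raising and catching an exception. The sketch below illustrates that pattern only; `match_template` is a hypothetical stand-in written for this example, not the library's `string_to_dict`, and it assumes each `{field}` name appears once in the template.

```python
import re
from typing import Optional


def match_template(string: str, pattern: str) -> Optional[dict]:
    """Hypothetical stand-in: parse `string` against a "{field}"-style template.

    Returns a dict of captured fields, or None when the string does not match.
    """
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, string)
    return match.groupdict() if match else None


template = "cache_{rank}_of_{num_shards}.arrow"
fields = match_template("cache_00001_of_00004.arrow", template)
if fields is not None:
    # No try/except needed: a missing match is just a None return value.
    rank, num_shards = int(fields["rank"]), int(fields["num_shards"])
```

Call sites that know a match must exist can assert `fields is not None` instead of wrapping the call in a try-except.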
Describe the bug

If you `map` a dataset and save it to a specific `cache_file_name` with a specific `num_proc`, and then call `map` again with that same existing `cache_file_name` but a different `num_proc`, the dataset will be re-mapped.

Steps to reproduce the bug

* `map` a dataset and cache it with a specific `num_proc`
* `map` it with a different `num_proc` and the same `cache_file_name` as before

Expected behavior

If I specify an existing `cache_file_name`, I don't expect using a different `num_proc` than the one that was used to generate it to cause the dataset to have to be re-mapped.

Environment info
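The reproduction steps described in this issue can be sketched roughly as follows. This is an illustrative script, not the reporter's original code: the toy data, the `add_one` function, the `reproduce` wrapper, and the file name `mapped.arrow` are all made up for the example.

```python
import os


def add_one(example):
    # Trivial transform standing in for an expensive map function.
    return {"y": example["x"] + 1}


def reproduce(cache_dir):
    # Imported here so this module can be loaded without the library installed.
    from datasets import Dataset

    dataset = Dataset.from_dict({"x": list(range(8))})
    cache_file_name = os.path.join(cache_dir, "mapped.arrow")

    # First run: map with two worker processes and cache under cache_file_name.
    first = dataset.map(add_one, num_proc=2, cache_file_name=cache_file_name)

    # Second run: same cache_file_name, different num_proc.
    # Per the report, this recomputes the map instead of reusing the cache.
    second = dataset.map(add_one, num_proc=1, cache_file_name=cache_file_name)
    return first, second
```

Call `reproduce` with a writable directory from under an `if __name__ == "__main__":` guard, since `num_proc > 1` starts worker processes. Watching the progress bars or log output of the second call is one way to see whether the map was recomputed.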