Question regarding batch processing #340
Comments
Hi @stilley2 - thank you for the report.
Yep, this is a less common but entirely possible pitfall with this approach. It could be remedied by creating a lockfile whenever one of these top-level files is modified, to halt concurrent access, and then removing it so the next process can proceed. Would you have any interest in adding this feature?
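A minimal sketch of that lockfile idea (not heudiconv's actual code; the `locked_update` helper and its arguments are hypothetical), using atomic file creation as the mutual-exclusion primitive:

```python
import os
import time
from pathlib import Path

def locked_update(target: Path, update_fn, timeout: float = 60.0, poll: float = 0.5):
    """Acquire a sibling .lock file via atomic create, run update_fn, then release."""
    lock_path = target.with_suffix(target.suffix + ".lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists, so only
            # one process at a time "wins" the creation and may proceed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        update_fn(target)          # e.g. append a row to participants.tsv
    finally:
        os.remove(lock_path)       # release so the next process can proceed
```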
Sure, I'll take a stab at it. In the meantime, is my assumption correct that the subject-level files are not altered when the top-level files are generated? That would allow me to proceed with the current version and just ignore the top-level files.
I looked into file locking, and it seems tricky to come up with a cross-platform solution that works on all operating systems and on distributed filesystems such as those in a cluster environment. I think the best plan is to just add a flag that turns off writing these top-level files, and then the user can use the …
@stilley2 - try this one: https://pypi.org/project/filelock/
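For reference, a minimal sketch of how filelock could guard such an update (the lock filename, timeout, and appended row are illustrative choices, not heudiconv behavior):

```python
from filelock import FileLock

# Every parallel process blocks here until it can acquire the lock file,
# so only one at a time rewrites the shared top-level file.
with FileLock("participants.tsv.lock", timeout=60):
    with open("participants.tsv", "a") as f:
        f.write("sub-01\t25\tM\n")
```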
@satra, I just tested that in a cluster environment, and it only worked when the processes were on the same node. So on the system I have access to, this would not work with SLURM when requesting more than one node. I understand this is likely filesystem-dependent, but I don't think we want to rely on such behavior.
@stilley2 - thank you for testing. did you use …
Ah, good catch. I personally still think the strategy in #344 is the way to go.
@stilley2 - while the solution in #344 is fine for this scenario, we also have this scenario when creating datalad datasets. the locking option would work for that as well. i.e. any scenario which needs to update a global state of a dataset across distributed processes would benefit from the locking option.
OK, I agree it sounds like a good feature to have regardless of #344. I'll try to implement it when I get time, but if anyone wants to jump in and do it I won't feel bad :)
I think for 0.10.0 we addressed/mitigated all known "collisions" between parallel tasks, and I would expect this issue to be resolved now. But feel welcome to reopen if you still observe such situations with 0.10.0.
Summary
The docs say that batch conversion with multiple parallel calls to heudiconv is possible. However, it's not clear what this means for files in the top level of the output directory (e.g. participants.tsv, task*.json). It would seem to me that there would be race conditions updating these files.
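To make the concern concrete, this is the generic read-modify-write pattern that such top-level updates imply (an illustrative sketch, not heudiconv's actual code):

```python
# Each parallel conversion process effectively does a read-modify-write:
rows = open("participants.tsv").read().splitlines()    # 1. read existing rows
rows.append("sub-A\t...")                               # 2. append its own subject
with open("participants.tsv", "w") as f:                # 3. rewrite the whole file
    f.write("\n".join(rows) + "\n")
# If two processes both read at step 1 before either reaches step 3, the
# later writer overwrites the earlier one and a subject row is silently lost.
```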
I skimmed through the code, and a potential workaround seems to be to ignore all the top-level files after a batch run and repopulate them manually. This requires that the lower-level JSON files are not altered based on the top-level files; in other words, it requires that the subject subdirectories do not rely on inheritance from the top-level directory. Is this an accurate assumption?
Assuming my understanding of all of this is correct, it seems the easiest way to allow batch processing is to include a flag that prevents writing of top-level files, and a subcommand that can generate these after all the batches have completed. Thoughts?
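As a rough illustration of that second step (all names are hypothetical, not an existing heudiconv command), a post-hoc pass could rebuild participants.tsv from the subject directories once every batch job has finished:

```python
from pathlib import Path

def rebuild_participants_tsv(bids_root: Path) -> None:
    """Regenerate participants.tsv from the sub-* directories already on disk."""
    subjects = sorted(p.name for p in bids_root.glob("sub-*") if p.is_dir())
    out = bids_root / "participants.tsv"
    with out.open("w") as f:
        f.write("participant_id\n")
        for sub in subjects:
            f.write(f"{sub}\n")

# run once, after all parallel heudiconv jobs have completed, e.g.:
# rebuild_participants_tsv(Path("/data/bids"))
```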
Thanks!
Platform details:
Heudiconv version: 0.5.4