Question regarding batch processing #340
Comments
Hi @stilley2 - thank you for the report.
Yep, this is a less common but entirely possible pitfall with this approach. It could be remedied by creating a lockfile whenever one of these top-level files is modified, to halt concurrent access, and then removing it so the next process can proceed. Would you have any interest in adding this feature?
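A minimal sketch of that lockfile idea (not heudiconv's actual code; the `locked_update` helper and its arguments are hypothetical), using atomic file creation as the mutual-exclusion primitive:

```python
import os
import time
from pathlib import Path

def locked_update(target: Path, update_fn, timeout: float = 60.0, poll: float = 0.5):
    """Acquire a sibling .lock file via atomic create, run update_fn, then release."""
    lock_path = target.with_suffix(target.suffix + ".lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists, so only
            # one process at a time "wins" the creation and may proceed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        update_fn(target)          # e.g. append a row to participants.tsv
    finally:
        os.remove(lock_path)       # release so the next process can proceed
```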
Sure, I'll take a stab at it. In the meantime, is my assumption correct that the subject-level files are not altered when the top-level files are generated? That would allow me to proceed with the current version and just ignore the top-level files.
I looked into file locking, and it seems tricky to come up with a cross-platform solution that works on all operating systems and on distributed filesystems such as those in a cluster environment. I think the best plan is to just add a flag that turns off writing these top-level files, and then the user can use the …
@stilley2 - try this one: https://pypi.org/project/filelock/
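For reference, a minimal sketch of how filelock could guard such an update (the lock filename, timeout, and appended row are illustrative choices, not heudiconv behavior):

```python
from filelock import FileLock

# Every parallel process blocks here until it can acquire the lock file,
# so only one at a time rewrites the shared top-level file.
with FileLock("participants.tsv.lock", timeout=60):
    with open("participants.tsv", "a") as f:
        f.write("sub-01\t25\tM\n")
```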
@satra, I just tested that in a cluster environment, and it only worked when the processes were on the same node. So on the system I have access to, this would not work with SLURM when requesting more than one node. I understand this is likely filesystem-dependent, but I don't think we want to rely on such behavior.
@stilley2 - thank you for testing. did you use …
Ah, good catch. I personally still think the strategy in #344 is the way to go.
@stilley2 - while the solution in #344 is fine for this scenario, we also have this scenario when creating datalad datasets. the locking option would work for that as well. i.e. any scenario which needs to update a global state of a dataset across distributed processes would benefit from the locking option.
OK, I agree it sounds like a good feature to have regardless of #344. I'll try to implement it when I get time, but if anyone wants to jump in and do it I won't feel bad :)
I think for 0.10.0 we addressed/mitigated all known "collisions" between parallel tasks, and I would expect this issue to be resolved now. But feel welcome to reopen if you still observe such situations with 0.10.0.
Summary
The docs say that batch conversion with multiple parallel calls to heudiconv is possible. However, it's not clear what this means for files in the top level of the output directory (e.g. participants.tsv, task*.json). It would seem to me that there would be race conditions updating these files.
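To make the concern concrete, this is the generic read-modify-write pattern that such top-level updates imply (an illustrative sketch, not heudiconv's actual code):

```python
# Each parallel conversion process effectively does a read-modify-write:
rows = open("participants.tsv").read().splitlines()    # 1. read existing rows
rows.append("sub-A\t...")                               # 2. append its own subject
with open("participants.tsv", "w") as f:                # 3. rewrite the whole file
    f.write("\n".join(rows) + "\n")
# If two processes both read at step 1 before either reaches step 3, the
# later writer overwrites the earlier one and a subject row is silently lost.
```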
I skimmed through the code, and a potential workaround seems to be to ignore all the top-level files after a batch run and repopulate them manually. This requires that the lower-level JSON files are not altered based on the top-level files; in other words, it requires that the subject subdirectories do not rely on inheritance from the top-level directory. Is this an accurate assumption?
Assuming my understanding of all of this is correct, it seems the easiest way to allow batch processing is to include a flag that prevents writing of top-level files, and a subcommand that can generate these after all the batches have completed. Thoughts?
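As a rough illustration of that second step (all names are hypothetical, not an existing heudiconv command), a post-hoc pass could rebuild participants.tsv from the subject directories once every batch job has finished:

```python
from pathlib import Path

def rebuild_participants_tsv(bids_root: Path) -> None:
    """Regenerate participants.tsv from the sub-* directories already on disk."""
    subjects = sorted(p.name for p in bids_root.glob("sub-*") if p.is_dir())
    out = bids_root / "participants.tsv"
    with out.open("w") as f:
        f.write("participant_id\n")
        for sub in subjects:
            f.write(f"{sub}\n")

# run once, after all parallel heudiconv jobs have completed, e.g.:
# rebuild_participants_tsv(Path("/data/bids"))
```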
Thanks!
Platform details:
Heudiconv version: 0.5.4