
Question regarding batch processing #340


Closed · stilley2 opened this issue May 13, 2019 · 10 comments

Comments

stilley2 (Contributor) commented May 13, 2019

Summary

The docs say that batch conversion with multiple parallel calls to heudiconv is possible. However, it's not clear what this means for files in the top level of the output directory (e.g. participants.tsv, task*.json). It seems to me that there would be race conditions when updating these files.
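To make the concern concrete, here is the kind of unsynchronized read-modify-write I have in mind (a hypothetical sketch, not heudiconv's actual code):

```python
# Hypothetical sketch (not heudiconv's code): two parallel conversions
# updating participants.tsv with a plain read-modify-write can interleave,
# and the second write silently discards the first one's new row.
import csv

def add_participant(tsv_path, participant_id):
    with open(tsv_path, newline="") as f:           # process A and B both read here
        rows = list(csv.reader(f, delimiter="\t"))
    if [participant_id] not in rows:
        rows.append([participant_id])
    with open(tsv_path, "w", newline="") as f:      # last writer wins
        csv.writer(f, delimiter="\t").writerows(rows)
```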

I skimmed through the code, and a potential workaround seems to be to ignore all the top-level files after a batch run and repopulate them manually. This requires that the lower-level JSON files are not altered based on the top-level files; in other words, that the subject subdirectories do not rely on inheritance from the top-level directory. Is this an accurate assumption?

Assuming my understanding of all of this is correct, it seems the easiest way to allow batch processing is to include a flag that prevents writing of the top-level files, plus a subcommand that can generate them after all the batches have completed. Thoughts?

Thanks!

Platform details:

Choose one:

  • Local environment
  • Container

Heudiconv version: 0.5.4

mgxd (Member) commented May 13, 2019

Hi @stilley2 - thank you for the report.

It seems to me that there would be race conditions when updating these files.

Yep, this is a less common but entirely possible pitfall with this approach. It could be remedied by creating a lockfile whenever a top-level file is modified, to block concurrent access, and then removing it so the next process can proceed.
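Roughly like this (untested sketch; the helper name and lock path are just illustrative), relying on the atomicity of O_CREAT | O_EXCL:

```python
# Untested sketch of the lockfile idea: atomically create the lock file,
# update the shared file, then remove the lock so the next process can go.
import contextlib
import os
import time

@contextlib.contextmanager
def lockfile(path, poll=0.1):
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL)  # fails if lock exists
            break
        except FileExistsError:
            time.sleep(poll)  # another process holds the lock; retry shortly
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)  # release, letting the next process proceed

# with lockfile("participants.tsv.lock"):
#     ...  # modify participants.tsv / task*.json here
```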

Would you have any interest in adding this feature?

stilley2 (Contributor, Author) commented:

Sure, I'll take a stab at it. In the meantime, is my assumption correct that the subject-level files are not altered when the top-level files are generated? That would allow me to proceed with the current version and just ignore the top-level files.

stilley2 (Contributor, Author) commented:

I looked into file locking, and it seems tricky to come up with a solution that works across operating systems and on distributed filesystems such as those in a cluster environment. I think the best plan is to just add a flag that turns off writing these top-level files; the user can then run the populate-templates command after the batch processing. I've created #344 to do this.
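To sketch what I mean (the --no-top-level flag is hypothetical until #344 lands; the other options follow heudiconv's existing CLI, though the exact populate-templates invocation may differ):

```python
# Sketch of the proposed batch workflow. The --no-top-level flag is
# hypothetical (pending #344); the remaining options mirror heudiconv's CLI.
import subprocess
from concurrent.futures import ProcessPoolExecutor

SUBJECTS = ["01", "02", "03"]

def convert(subject):
    subprocess.run(
        ["heudiconv",
         "-d", "dicom/{subject}/*/*.dcm",
         "-s", subject,
         "-f", "heuristic.py",
         "-c", "dcm2niix", "-b",
         "-o", "bids_out",
         "--no-top-level"],   # hypothetical: skip participants.tsv etc.
        check=True)

with ProcessPoolExecutor() as pool:
    list(pool.map(convert, SUBJECTS))

# One serial pass at the end regenerates the top-level files
# (exact arguments for this step may differ).
subprocess.run(["heudiconv", "-f", "heuristic.py", "-o", "bids_out",
                "--command", "populate-templates"], check=True)
```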

satra (Member) commented May 14, 2019

@stilley2 - try this one: https://pypi.org/project/filelock/
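basic usage is something like this (minimal sketch; the lock file name is arbitrary):

```python
# minimal filelock sketch: serialize access to a shared top-level file
from filelock import FileLock, Timeout

lock = FileLock("participants.tsv.lock", timeout=60)  # wait up to 60 s
try:
    with lock:
        pass  # read/modify/write participants.tsv here, one process at a time
except Timeout:
    print("could not acquire the lock within 60 s")
```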

stilley2 (Contributor, Author) commented:

@satra, I just tested that in a cluster environment, and it only worked when the processes were on the same node. So on the system I have access to, this would not work with SLURM when requesting more than one node. I understand this is likely filesystem-dependent, but I don't think we want to rely on such behavior.

satra (Member) commented May 15, 2019

@stilley2 - thank you for testing. did you use SoftFileLock?

stilley2 (Contributor, Author) commented:

Ah good catch, SoftFileLock works! I guess I should have read further down the examples page :)
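For the record, this variant is what worked across nodes (SoftFileLock only checks whether the lock file exists rather than taking an OS-level lock, so it behaves on shared filesystems; the trade-off is that a killed job can leave a stale lock behind):

```python
# SoftFileLock variant that worked across nodes on the shared filesystem:
# it only tests for the lock file's existence, so no OS-level flock is used.
from filelock import SoftFileLock

with SoftFileLock("participants.tsv.lock", timeout=300):
    pass  # update the shared top-level files here
# caveat: if a job dies while holding the lock, the stale lock file
# must be removed manually.
```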

I personally still think the strategy in #344 is the way to go.

satra (Member) commented May 15, 2019

@stilley2 - while the solution in #344 is fine for this scenario, we also hit this scenario when creating datalad datasets, and the locking option would work for that as well. i.e. any scenario that needs to update the global state of a dataset across distributed processes would benefit from locking.

stilley2 (Contributor, Author) commented:

Ok, I agree it sounds like a good feature to have regardless of #344. I'll try to implement it when I get time, but if anyone wants to jump in and do it, I won't feel bad :)

yarikoptic (Member) commented:

I think for 0.10.0 we addressed/mitigated all known "collisions" between parallel tasks, and I would expect this issue to be resolved now. But feel welcome to reopen if you still observe such situations with 0.10.0.
