Skip to content

[FIX] immunize shutil.rmtree to node non-existence for remove_node_di… #3148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

dPys
Copy link

@dPys dPys commented Jan 6, 2020

@effigies -- since it seems you're still actively working on the 1.4.x merge, I just went ahead and opened a fresh PR directly onto that branch since rebasing was causing issues...

@dPys
Copy link
Author

dPys commented Jan 6, 2020

;-)

@effigies
Copy link
Member

effigies commented Jan 6, 2020

Thanks. Do you think you could put together a small regression test?

@codecov
Copy link

codecov bot commented Jan 6, 2020

Codecov Report

Merging #3148 into maint/1.4.x will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@              Coverage Diff               @@
##           maint/1.4.x   #3148      +/-   ##
==============================================
+ Coverage        67.59%   67.6%   +<.01%     
==============================================
  Files              299     299              
  Lines            39499   39499              
  Branches          5220    5220              
==============================================
+ Hits             26700   26703       +3     
+ Misses           12086   12081       -5     
- Partials           713     715       +2
Flag Coverage Δ
#smoketests 52.82% <ø> (-0.22%) ⬇️
#unittests 64.87% <ø> (+0.02%) ⬆️
Impacted Files Coverage Δ
nipype/conftest.py 95.65% <ø> (ø) ⬆️
nipype/utils/provenance.py 82.58% <0%> (-1.3%) ⬇️
nipype/pipeline/engine/nodes.py 83.33% <0%> (-0.17%) ⬇️
nipype/pipeline/engine/utils.py 80.61% <0%> (+0.11%) ⬆️
nipype/pipeline/plugins/base.py 61.91% <0%> (+1.91%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7dd3b37...b473b6b. Read the comment docs.

@dPys
Copy link
Author

dPys commented Jan 6, 2020

@effigies , Not really sure what you have in mind RE: a regression test, but let me know if this is approaching it:

@pytest.mark.regression
@pytest.mark.parametrize('remove_nodes', 
    [pytest.param('false', marks=pytest.mark.xfail), 'true']
)
def test_remove_nodes(tmpdir, remove_nodes):
    import os
    import nipype.interfaces.utility as niu
    import nipype.pipeline.engine as pe

    wf = pe.Workflow(name='test')

    def func(arg1):
        try:
            if arg1 == 2:
                raise Exception('arg cannot be ' + str(arg1))
        except:
            pass
        return arg1

    funkynode = pe.MapNode(niu.Function(function=func, input_names=['arg1'],
                                        output_names=['out']),
                           iterfield=['arg1'],
                           name = 'functor')
    funkynode.inputs.arg1 = [1,2]

    wf.add_nodes([funkynode])
    wf.base_dir = tmpdir
    wf.config['execution']['remove_node_directories'] = remove_nodes
    wf.config['execution']['stop_on_first_crash'] = 'false'
    wf.config['logging']['crashdump_dir'] = tmpdir
    res = wf.run(plugin='MultiProc')
    
    assert os.path.isdir(Path(tmpdir)/'test/functor') is False

@effigies
Copy link
Member

effigies commented Jan 6, 2020

Yup, something of the sort. You'll want to make sure that the test fails before your fix and passes after it.

You can use the tmp_path pytest fixture to avoid polluting /tmp, but if you're not very familiar with pytest, go ahead and code up a test, and I can make suggestions to clean it up.

@dPys
Copy link
Author

dPys commented Jan 6, 2020

Yup, something of the sort. You'll want to make sure that the test fails before your fix and passes after it.

You can use the tmp_path pytest fixture to avoid polluting /tmp, but if you're not very familiar with pytest, go ahead and code up a test, and I can make suggestions to clean it up.

With the exception of the regression, the above update should be a bit closer to what we want. Not sure how you want to approach the regression exactly-- i.e. across previous nipype versions, by overriding the module with a paramaterized fixture for rmtree with the additional flag added?

@effigies
Copy link
Member

effigies commented Jan 6, 2020

No need to parameterize. Just test on master and on this branch. It should fail and pass, respectively.

…rectories=True in the case that stop_on_first_crash=False

*Add regression test
@dPys
Copy link
Author

dPys commented Jan 7, 2020

Removed parameterization. Tested on maint/1.4.x and it passes. Does not yet seem to fail on master as expected so we might still need to tweak the test to add more nodes to workflow (i.e. nodes after the first node that crashes, yet the workflow continues to run and attempt node directory removal for directories that never get created)...

@effigies
Copy link
Member

effigies commented Jan 7, 2020

Perhaps you could explain more how you were running into this issue? Were you running multiple workflows in parallel?

@dPys
Copy link
Author

dPys commented Jan 7, 2020

Perhaps you could explain more how you were running into this issue? Were you running multiple workflows in parallel?

Hmm, it occurred repeatedly when running a single workflow (with nested workflows) using MultiProc, and adding ignore_errors=True completely inoculated the issue. Hard to say what the fail context is precisely, but the solution was surefire.

@satra
Copy link
Member

satra commented Jan 7, 2020

could this have something to do with the mapnode fix from yesterday?

@effigies
Copy link
Member

effigies commented Jan 7, 2020

I tried this test on 1.4.0, and it still passes. So whatever it's testing, it's not the error that was being hit.

Checking the coverage, however, it does at least hit the shutil.rmtree line, so that's encouraging. I'm just not sure how the node directory is supposed to get deleted prior to that line. It feels like a race condition, but I also don't see how that's going to happen if jobs aren't being run in parallel.

Copy link
Member

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes on the test. Also, just so you know, we now use the black styler. If you install pre-commit and run it, it will run black for you when you try to make a commit.

if arg1 == 2:
raise Exception('arg cannot be ' + str(arg1))
except:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the goal to fail or not? If not, we can skip the entire try/except block. If so, we should not catch the exception.

return arg1

funkynode = pe.MapNode(niu.Function(function=func, input_names=['arg1'],
output_names=['out']),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant:

funkynode = pe.MapNode(niu.Function(function=func),

Also, is there something intrinsic about the problem and MapNodes? If not, then perhaps just make it a Node, to reduce the scope as small as possible.

dPys and others added 3 commits January 7, 2020 09:08
Co-Authored-By: Chris Markiewicz <effigies@gmail.com>
Co-Authored-By: Chris Markiewicz <effigies@gmail.com>
Co-Authored-By: Chris Markiewicz <effigies@gmail.com>
@dPys
Copy link
Author

dPys commented Jan 7, 2020

I tried this test on 1.4.0, and it still passes. So whatever it's testing, it's not the error that was being hit.

Checking the coverage, however, it does at least hit the shutil.rmtree line, so that's encouraging. I'm just not sure how the node directory is supposed to get deleted prior to that line. It feels like a race condition, but I also don't see how that's going to happen if jobs aren't being run in parallel.

It does feel very much like a race condition.

@dPys
Copy link
Author

dPys commented Jan 7, 2020

@effigies -- I'm going to take another look at this on Friday to see if I can't pinpoint exactly what's going on. Things go a little 'nuts' when the iterable expansion starts hitting 10's of thousands of threads...

Stay tuned.

@effigies
Copy link
Member

Let's rebase and reopen if this becomes an issue again.

@effigies effigies closed this Nov 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants