WIP: use args.grouping #200

Closed - wants to merge 20 commits into master from grouping_fix

Conversation

@leej3 (Contributor) commented Apr 26, 2018

Fixes crawling/grouping mentioned by @yarikoptic in #198

@yarikoptic (Member):

oh, so it wasn't passed through? Thanks @leej3!
Ideally we should accompany this with a dedicated test so things do not get broken again in refactoring.
Note that it already breaks the tests (I guess it changes the current/default behavior).

@leej3 (Contributor Author) commented Apr 26, 2018 via email

@leej3 (Contributor Author) commented May 18, 2018

Many scans I encounter do not pass the assertion that studyUID == file_studyUID, so using accession_number for grouping is the only way I can get this working.

I'm under the impression that this is still the default behaviour: "grouping" is None either way. It's only when one changes grouping to accession_number that the behaviour is altered.

I've rebased this onto master.

EDIT: I've since determined that all of these cases arise when I receive two studies tarred up together. Perhaps not something that needs to be addressed. Although, if grouping is studyUID instead of None and the heuristics file is able to deal with a dict of seqinfo objects, heudiconv can still handle such tars.

@leej3 changed the title from "use args.grouping" to "WIP: use args.grouping" on May 22, 2018
@leej3 force-pushed the grouping_fix branch 2 times, most recently from 0660dd0 to 05de1d8 on May 23, 2018
@leej3 (Contributor Author) commented May 24, 2018

@yarikoptic, perhaps I can get your help with this. I think what was causing confusion for me is that for

args = gen_heudiconv_args(datadir, outdir, subject, heuristic)
runner(args) # run conversion

in test_regression.py, the args object is not getting passed to runner() as I expect it to be. args is not the same before and after parse_args(argv). Perhaps you could confirm this?

I changed the keyword argument to main() to be args to prevent this. I'm now passing the tests that I was previously failing, but now failing other tests.

I think it's because the object returned by group_dicoms_into_seqinfos changes depending on whether grouping is None or not: it alternates between (a) a dict with seqinfo objects as keys and files as values, and (b) a dict of the aforementioned dicts. Is there a reason for this behaviour, or could we make it uniform?
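
For reference, a minimal sketch of the two shapes described above; the names and the normalization helper are illustrative assumptions, not heudiconv's actual API:

# Shape (a): grouping disabled -- {seqinfo: [dicom_files]}
# Shape (b): grouping enabled  -- {grouping_key: {seqinfo: [dicom_files]}}

def as_grouped(seqinfos, grouping_key=None):
    """Wrap a flat shape-(a) dict into shape (b) under a single key,
    so downstream code can always assume one level of grouping."""
    first = next(iter(seqinfos.values()), None)
    if isinstance(first, dict):        # already shape (b)
        return seqinfos
    return {grouping_key: seqinfos}    # (a) -> (b) with one group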

@yarikoptic (Member):

Thanks for pushing this @leej3. I will look at it with a fresh eye tomorrow morning.
Cheers

@leej3 (Contributor Author) commented May 24, 2018

Great, thanks. I've pushed an example of turning everything into the aforementioned (b) and then, from there, to (a), which is currently used downstream in a few places. It seems to be better at passing the tests.

@codecov-io commented May 24, 2018

Codecov Report

Merging #200 into master will decrease coverage by 0.09%.
The diff coverage is 81.25%.


@@            Coverage Diff            @@
##           master     #200     +/-   ##
=========================================
- Coverage   67.81%   67.72%   -0.1%     
=========================================
  Files          30       31      +1     
  Lines        2411     2429     +18     
=========================================
+ Hits         1635     1645     +10     
- Misses        776      784      +8
Impacted Files           Coverage Δ
heudiconv/convert.py     78.2% <100%> (+1.07%) ⬆️
tests/test_grouping.py   100% <100%> (ø)
heudiconv/parser.py      87.2% <100%> (+0.46%) ⬆️
heudiconv/cli/run.py     74.05% <100%> (+0.16%) ⬆️
heudiconv/dicoms.py      56.48% <64%> (-3.01%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ff09b1...09e8850.

-parser.add_argument('-g', '--grouping', default='studyUID',
-                    choices=('studyUID', 'accession_number'),
+parser.add_argument('-g', '--grouping', default='None',
+                    choices=('studyUID', 'accession_number', 'None'),
@yarikoptic (Member):

@leej3 - what do you want to say/achieve when grouping is 'None'? Do you want

  • to convert all provided DICOMs as part of a single study, as if their studyUID was the same? (then I would call it "all" or something like that)
  • to convert every sequence as if it was a study on its own? (that is where None makes sense to me "semantically")

Also, I would prefer to keep studyUID as the default for a number of reasons:

  • consistent behavior with previous versions
  • it is the most typical case AFAIK

@leej3 (Contributor Author):

Yes, a default of studyUID sounds good. And agreed, 'all' might be more intuitive; I'm not sure what I would instinctively interpret 'None' to mean.

@yarikoptic (Member) left a comment:

OK, I think you wanted "to convert all provided DICOMs as part of a single study, as if their studyUID was the same". Let me try to push (and revert where needed) forward a bit and see where I get.

Re unified output of group_dicoms_into_seqinfos -- indeed, that would make sense... I guess I kept it returning the original structure whenever there is no grouping, for compatibility or something like that. But making the output uniform should indeed (but who knows) simplify the code. Let me see.

print("Trying to construct Seqinfo object:",
"{ob}".format(ob=SeqInfo.__doc__),
"\n However the arguments supplied were:")
print([repr(v) for v in seqinfo_args])
@yarikoptic (Member):

Use lgr.error, not plain print.
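
For instance, a minimal sketch using the module-level logger (assuming lgr is the logging.Logger instance already used in this module):

lgr.error("Trying to construct SeqInfo object: %s\n"
          "However the arguments supplied were: %s",
          SeqInfo.__doc__, [repr(v) for v in seqinfo_args])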

@leej3 (Contributor Author):

Yes, multiple accession_numbers/studyUIDs being converted as a single study, i.e. the scanner crashes and two "unique sessions" should be merged to prevent needless complication.

@leej3 (Contributor Author):

Yep, it seems easy enough to always return the uniform object. The only problem arises when you create filegroup_json: if you have duplicate series_ids it will fail, which is undesirable for a large number of DICOMs passed in that are trivially separable by the combination of series_id and grouping variable (or even just their seqinfo object). We could alter the filegroup json object to have an additional layer of nesting, fail with an assertion like I've done, or perhaps something more inventive.

@yarikoptic (Member):

re: "scanner crashes and two "unique sessions" should be merged" -- if accession was the same, wouldn't current --groupping=accession_number be sufficient? (just trying to get a fuller picture)

@yarikoptic (Member):

re: "problem ... filegroup_json. If you have duplicate series_ids " ""unique sessions" should be merged" ... "undesirable for a large number of dicoms passed in". Let me ouline

  1. I think all DICOMs should be passed in and no magical "merge" (or "substitute") should happen automagically
  2. indeed it is a problem (and I think we guard for it) if duplicate series_id comes in since filegroup*.json contains a dict. We should make them "non-conflicting" (indeed question how and when?)
  3. "by combination of series_id and grouping variable", "additionaly layer of nesting" -- too complex imho and then would not be uniform with already existing setup. May be we should just disambiguate by adding a "+1", "+2" or alike whenever conflict arises (and issue a warning). The question is how early such analysis should happen

Still digging... I will need to switch to other things soon, though. Question: do you have (or could you generate, e.g. using nifti2dicom) any sample data you could possibly share to demonstrate your use case?

@yarikoptic (Member):

@satra @mgxd -- I just want to make sure: for the use case when a dicom path template is provided, should there be any additional splitting into sessions/studies within each of those subject(/session) pairs, or should all matching dicoms just belong to that subject(/session)? What would happen with duplicates (canceled, restarted sessions)?

@leej3 (Contributor Author):

re: "wouldn't current --groupping=accession_number be sufficient", the accession_numbers are different too.

re: "(indeed question how and when?)", not sure I know enough about the different ways in which the pipeline can be run. For example an especially challenging condition is if two different tars, targeted to the same output directory, run at the same time. I guess when filegroup_file is written (save_json(filegroup_file, filegroup)) we could check if it already exists and append to it. This would require a check for duplication and modification of the seqinfo object series_ids too. May be a little bit too challenging. A simpler case to accommodate is the one I outlined where it is all contained in the same tar.

re: "sample data", I was given this data so I don't know for sure about how it was created. From what I can tell 2 study dicom tars were untarred and the two directories were used to create a new tar. So this process would reasonably replicate the data (provided they have some matching series_ids).

lgr.warning("Nothing to be done - displaying usage help")
parser.print_help()
sys.exit(1)
argv = sys.argv[1:]
@yarikoptic (Member):

It is not yet clear to me why we need to overload those, and why it should all make a difference. args should be the arguments to be parsed by the parser, whether they come from the tests or from the cmdline (via sys.argv); main shouldn't get args as a structure to be overwritten later on, etc.

@leej3 (Contributor Author):

OK, I may be entirely wrong and those changes may not be required. It's all a bit confusing when debugging the regression test test_conversion (I've modified it slightly; posted below). Perhaps it could be changed simply to avoid that. Here's what I observe:

  • args exists as an object from pytest at the start of the test. Printing this object fails.
  • After args = gen_heudiconv_args(datadir, outdir, subject, heuristic) is executed, print now works and outputs a list.
  • Stepping into runner(args), args is now:

argv = ['-d', '/tmp/pytest-of-root/QA/sourcedata/{subject}/*/*/*.tgz', '-c', 'dcm2niix', '-o', '/tmp/pytest-of-root/out', '-s', 'sub-sid000143', '-f', '/src/heudiconv/heudiconv/heuristics/reproin.py', '--bids']

Running parser.parse_args(argv) outputs the namespace object we want, but running

args = parser.parse_args(argv)

does not assign the result to args. Subsequent tests using its attributes are not really going to work; for example, bool(args.grouping) at this point will yield True.

Do you observe this?

test_conversion:

"""Testing conversion with conversion saved on datalad"""
import json
from glob import glob

import pytest

have_datalad = True
try:
    from datalad import api # to pull and grab data
    from datalad.support.exceptions import IncompleteResultsError
except ImportError:
    have_datalad = False

import heudiconv
from heudiconv.cli.run import main as runner
# testing utilities
from .utils import fetch_data, gen_heudiconv_args
import py
import os
import shutil

@pytest.mark.parametrize('tmpdir', [py._path.local.LocalPath('/tmp/pytest-of-root')])
@pytest.mark.parametrize('datadir', ['pytest-3'])
@pytest.mark.parametrize('subject', ['sub-sid000143'])
@pytest.mark.parametrize('heuristic', ['reproin.py'])
@pytest.mark.skipif(not have_datalad, reason="no datalad")
def test_conversion(tmpdir, subject, heuristic, datadir):
    tmpdir.chdir()
    if os.path.exists('out'):
        shutil.rmtree('out')
    if os.path.exists(datadir):
        outdir = os.path.join(tmpdir.strpath, 'out')
    else:
        try:
            datadir = fetch_data(tmpdir.strpath, subject)
        except IncompleteResultsError as exc:
            pytest.skip("Failed to fetch test data: %s" % str(exc))
        outdir = tmpdir.mkdir('out').strpath

    import pdb; pdb.set_trace()
    args = gen_heudiconv_args(datadir, outdir, subject, heuristic)
    print(args)
    # import pdb;pdb.set_trace()
    runner(args) # run conversion

    # verify functionals were converted
    assert glob('{}/{}/func/*'.format(outdir, subject)) == \
           glob('{}/{}/func/*'.format(datadir, subject))

    # compare some json metadata
    json_ = '{}/task-rest_acq-24mm64sl1000tr32te600dyn_bold.json'.format
    orig, conv = (json.load(open(json_(datadir))),
                  json.load(open(json_(outdir))))
    keys = ['EchoTime', 'MagneticFieldStrength', 'Manufacturer', 'SliceTiming']
    for key in keys:
        assert orig[key] == conv[key]

@leej3 (Contributor Author):

OK, part of the confusion is that args is a pdb command. That may solve it.

@leej3 (Contributor Author):

Naming it something other than args would be helpful for any future pdb debugging sessions; I forgot args was a pdb command. But yes, provided you aren't stepping through the code with pdb, it all makes sense.
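
For anyone hitting the same trap: inside pdb, args (alias a) is a built-in command that prints the current function's arguments, shadowing any local variable of that name, and prefixing a statement with ! forces pdb to execute it as Python. A minimal session:

(Pdb) args                              # pdb built-in: prints the current function's arguments
(Pdb) !args = parser.parse_args(argv)   # '!' executes this as a Python statement
(Pdb) p args.grouping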

@yarikoptic (Member):

I didn't even know about the args pdb command ;-)

leej3 and others added 9 commits June 7, 2018 19:12
this will fail if a tar contains multiple uids.
don't think this is required
* gh-leej3/grouping_fix:
  remove assertion about unique series_ids
  restore default grouping to None
  report grouping
  add grouping=all option
  use original group_map order
  make run order the same as the archive
  tidy print
  change back code causing nipype error
  provide some help upon SeqInfo object construction failure
  fix args parsing for running heudiconv from cmdline

 Conflicts:
	heudiconv/cli/run.py - took my version, TODO: rename args
	heudiconv/convert.py - OrderedDict, took leej3's version + my multilining
	heudiconv/dicoms.py - minor doc conflict
* origin/master:
  BF: for datalad 0.10.0  - check/populate .gitattributes correctly
  BF+ENH: set dist-restrictions metadata correctly
@yarikoptic mentioned this pull request Jun 21, 2018
@leej3 (Contributor Author) commented Jun 21, 2018

Before this pull request, the use cases (and their implementation) were:

  • Template-based handling

    • Groups can be specified by -s and -ss. The default grouping here was None (because args.grouping was not passed through). When -ss was not specified, the tarball defines the session. If multiple tarballs are globbed by the dicom template, currently a warning is logged if a "bids"-session is explicitly specified (with -ss). This should just fail, which will be a change in behaviour.

  • Files-based handling

    • By default, grouping is studyUID.
    • grouping is passed through to group_dicoms_into_seqinfos.
    • The default is a single session unless the heuristic changes this.
    • Any tarball grouping is ignored.

We need to implement tests to confirm that this is the logic we want. For this, an automagic IO context manager for the files used in tests would help. We also want to cache the fetched test data (done with a fixture; a sketch follows below).
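
A hedged sketch of such a caching fixture: session-scoped so the data is fetched only once, with fetch_data being the helper from tests/utils used elsewhere in this thread, and the subject ID taken from the test above:

import pytest

from .utils import fetch_data  # test helper, as used in test_conversion above

@pytest.fixture(scope="session")
def cached_datadir(tmp_path_factory):
    """Fetch the test dataset once per pytest session and reuse it."""
    basedir = tmp_path_factory.mktemp("data")
    return fetch_data(str(basedir), "sub-sid000143")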

@yarikoptic (Member):

Hi @leej3 ... so did -g accession_number --files work for you?

@leej3 (Contributor Author) commented Jun 26, 2018 via email

leej3 added 2 commits July 8, 2018 11:39
…_fix

* commit '2ff09b1a1b77282c9a10a74799d280db1bc4f85b':
  rel: bump to next version
  rel: proper release day
  Fix links to heuristcs in README.md
  doc+enh: update docker dcm2niix, add entry to changelog
  doc: changelog + version bump
  fix: do not load phasediff until exists
  BF: workaround for elderly buggy python 2.7 exec
@leej3 (Contributor Author) commented Jul 9, 2018

@yarikoptic, just to sync back up:

I have a tar containing two different studies (2 studyUIDs and 2 unique accession numbers; my center does that last bit differently from yours). I would like to be able to pass this single tar to heudiconv and get a single session as output. Currently this works given certain input parameters (I believe erroneously). I can pass --grouping accession_number for the template usage or --files usage and get back a single session. When using the template I explicitly define the session (-ss session_tag). I'm not sure if I'm using --files correctly, but I just got whatever session I defined with infotoids in the heuristics file.

We talked about this previously, but just to reiterate: to move this pull request forward, shall we drop the template interface and focus on consolidating what is in --files (and I'll learn to use that better)?

This would mean that one can use the --files flag and obtain the following behaviours (a few hypothetical invocations are sketched after this list):

  • Specifying subject: this could be specified with the -s argument, or by using infotoids.

  • Specifying session:

    • specify with a -ss flag? Not sure about this option. It would be a simple way to define the session for all globbed files, and would be particularly useful with grouping=all.
    • grouping=all: all globbed files are merged into a single session, specified by the -ss flag or an arbitrary first session name.
    • grouping=tar (or file source): this behaviour would match what Satra described for the template usage: glob some tars, and each tar becomes a new session, with the session number incrementing each time.
    • grouping=studyUID (or grouping=accession_number): sessions are defined by the studyUID (or the accession number, respectively).
    • grouping=override: not sure what to name this state, but Michael Hanke described using the -a flag to prevent heudiconv from altering the grouping and have it defined externally instead.
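
For concreteness, hypothetical invocations of the proposals above (the grouping values all and tar are proposals from this discussion, not released options; the other flags appear earlier in this thread):

heudiconv --files archive.tgz -s sid01 -ss ses01 -g all -f reproin.py --bids
heudiconv --files tars/*.tgz -s sid01 -g tar -f reproin.py --bids
heudiconv --files archive.tgz -s sid01 -g studyUID -f reproin.py --bids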

I've pushed a set of proposed tests for this behaviour. I wasn't sure how to use AutomagicIO; could you help with that?

Do you think this is a useful way forward? I think if we can pass all these tests, we won't remove any functionality that we know someone is using, but will consolidate the interface (by removing the template usage) to reduce the complexity of the code.

@yarikoptic (Member):

Thanks @leej3! I will try to get back to it in the upcoming days. I also still have some changes I was trying to meld into yours while we went through all that logic (or, more appropriately, "confusion") analysis at OHBM.

@mgxd mentioned this pull request Jan 4, 2019
@mgxd mentioned this pull request Apr 16, 2019
@mgxd mentioned this pull request Jul 9, 2019