Hyperspectral extractor #161

Merged: 10 commits merged into master on Oct 14, 2016

Conversation

Zodiase
Contributor

@Zodiase Zodiase commented Sep 1, 2016

Description

Add the hyperspectral extractor and its dockerfile.
This extractor responds to events when new files are added to a dataset. It then decides whether to run by checking that all required input files are present and that the output file does not already exist; the output file name is derived from the input files. If the check passes, it runs the workflow script to produce the output file and uploads it to the dataset.
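As a rough illustration of the trigger condition described above, here is a minimal sketch. It is not the extractor's actual code: the required file suffixes, the output-naming rule, and the function names are assumptions made for the example.

```python
# Hypothetical required inputs; the real extractor's file list may differ.
REQUIRED_SUFFIXES = ("_raw", "_raw.hdr", "_metadata.json")

def output_name(raw_name):
    # The output .nc name is derived from the raw input name (per the description above).
    base = raw_name[:-len("_raw")] if raw_name.endswith("_raw") else raw_name
    return base + ".nc"

def should_run(dataset_filenames):
    """Run only when every expected input is present and the output is not yet there."""
    raws = [f for f in dataset_filenames if f.endswith("_raw")]
    if not raws:
        return False
    stem = raws[0][:-len("_raw")]
    inputs_present = all(stem + sfx in dataset_filenames for sfx in REQUIRED_SUFFIXES)
    already_done = output_name(raws[0]) in dataset_filenames
    return inputs_present and not already_done
```

When `should_run` returns True, the extractor runs hyperspectral_workflow.sh to produce the .nc file and uploads it back to the dataset.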

Motivation and Context

Resolves issue #81.

How Has This Been Tested?

Tested locally by building and running the Docker container, uploading the test input files, and checking that the output file is uploaded. I'm not sure how to update test.sh to reproduce this testing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@max-zilla max-zilla self-assigned this Sep 8, 2016
@ghost ghost added this to the September 2016 milestone Sep 27, 2016
@max-zilla
Contributor

I'm testing this now, but similar to other completed extractors, the code will move here: https://github.com/terraref/extractors-hyperspectral/tree/master/hyperspectral

I will close this pull request once tested/moved.

@max-zilla
Contributor

I think it's just about working. I'm running into "Cannot allocate memory" errors from ncap2, but I think I just need to create a new Docker machine with 4 gigs RAM instead of 2.

@yanliu-chn

Maybe 8GB to be safe.

@max-zilla
Contributor

max-zilla commented Oct 6, 2016

@yanliu-chn @FlyingWithJerome @Zodiase @czender have you encountered this bug before? I increased memory to 8 GB but calibration is still running out of memory:

Terraref hyperspectral data workflow invoked with:
hyperspectral_workflow.sh -d 1 -i /home/ubuntu/input/73d03a9d-a7e0-4147-864d-801f4e5d0083_raw -o /home/ubuntu/sites/ua-mac/Level_1/hyperspectral/2016-05-10/2016-05-10__14-22-30-123/73d03a9d-a7e0-4147-864d-801f4e5d0083.nc
Hyperspectral workflow scripts in directory /home/ubuntu
NCO version "4.6.1" from directory /srv/sw/nco-4.6.1/bin
Intermediate/temporary files written to directory /tmp
Final output stored in directory /home/ubuntu/sites/ua-mac/Level_1/hyperspectral/2016-05-10/2016-05-10__14-22-30-123
Input #00: /home/ubuntu/input/73d03a9d-a7e0-4147-864d-801f4e5d0083_raw
trn(in)  : /home/ubuntu/input/73d03a9d-a7e0-4147-864d-801f4e5d0083_raw
trn(out) : /tmp/terraref_tmp_trn.nc.pid16.fl00.tmp
ncks -O --trr_wxy=955,1600,0 --trr typ_in=NC_USHORT --trr typ_out=NC_USHORT --trr ntl_in=bil --trr ntl_out=bsq --trr_in=/home/ubuntu/input/73d03a9d-a7e0-4147-864d-801f4e5d0083_raw /home/ubuntu/hyperspectral_dummy.nc /tmp/terraref_tmp_trn.nc.pid16.fl00.tmp
att(in)  : /tmp/terraref_tmp_trn.nc.pid16.fl00.tmp
att(out) : /tmp/terraref_tmp_att.nc.pid16.fl00.tmp
ncatted -O --gaa terraref_script=hyperspectral_workflow.sh --gaa terraref_hostname=c96e273ec257 --gaa terraref_version="4.6.1" -a "Conventions,global,o,c,CF-1.5" -a "Project,global,o,c,TERRAREF" --gaa history="Thu Oct 6 15:12:23 UTC 2016: hyperspectral_workflow.sh -d 1 -i /home/ubuntu/input/73d03a9d-a7e0-4147-864d-801f4e5d0083_raw -o /home/ubuntu/sites/ua-mac/Level_1/hyperspectral/2016-05-10/2016-05-10__14-22-30-123/73d03a9d-a7e0-4147-864d-801f4e5d0083.nc" /tmp/terraref_tmp_trn.nc.pid16.fl00.tmp /tmp/terraref_tmp_att.nc.pid16.fl00.tmp
jsn(in)  : /home/ubuntu/input/73d03a9d-a7e0-4147-864d-801f4e5d0083_raw
jsn(out) : /tmp/terraref_tmp_jsn.nc.pid16
python /home/ubuntu/hyperspectral_metadata.py /home/ubuntu/input/73d03a9d-a7e0-4147-864d-801f4e5d0083_raw /tmp/terraref_tmp_jsn.nc.pid16.fl00.tmp
Processing ...
Done.
mrg(in)  : /tmp/terraref_tmp_jsn.nc.pid16.fl00.tmp
mrg(out) : /tmp/terraref_tmp_att.nc.pid16.fl00.tmp
ncks -A /tmp/terraref_tmp_jsn.nc.pid16.fl00.tmp /tmp/terraref_tmp_att.nc.pid16.fl00.tmp
clb(in)  : /tmp/terraref_tmp_att.nc.pid16.fl00.tmp
clb(out) : /tmp/terraref_tmp_clb.nc.pid16.fl00.tmp
ncap2 -A -S /home/ubuntu/hyperspectral_calibration.nco /tmp/terraref_tmp_att.nc.pid16.fl00.tmp /tmp/terraref_tmp_att.nc.pid16.fl00.tmp
ncap2: ERROR nco_malloc() unable to allocate 11673920000 B = 11400312 kB = 11133 MB = 10 GB
ncap2: INFO NCO has reported a malloc() failure. malloc() failures usually indicate that your machine does not have enough free memory (RAM+swap) to perform the requested operation. As such, malloc() failures result from the physical limitations imposed by your hardware. Read http://nco.sf.net/nco.html#mmr for a description of NCO memory usage. The likiest case is that this problem is caused by inadequate RAM on your system, and is not an NCO bug. If so, there are two potential workarounds: First is to process your data in smaller chunks, e.g., smaller or more hyperslabs. The second is to use a machine with more free memory, so that malloc() succeeds.

Large tasks may uncover memory leaks in NCO. This is likeliest to occur with ncap2. ncap2 scripts are completely dynamic and may be of arbitrary length and complexity. A script that contains many thousands of operations may uncover a slow memory leak even though each single operation consumes little additional memory. Memory leaks are usually identifiable by their memory usage signature. Leaks cause peak memory usage to increase monotonically with time regardless of script complexity. Slow leaks are very difficult to find. Sometimes a malloc() failure is the only noticeable clue to their existence. If you have good reasons to believe that your malloc() failure is ultimately due to an NCO memory leak (rather than inadequate RAM on your system), then we would like to receive a detailed bug report.
hyperspectral_workflow.sh: ERROR Failed to calibrate data in ncap2. Debug this:
ncap2 -A -S /home/ubuntu/hyperspectral_calibration.nco /tmp/terraref_tmp_att.nc.pid16.fl00.tmp /tmp/terraref_tmp_att.nc.pid16.fl00.tmp

Specifically this part:

ncap2: ERROR nco_malloc() unable to allocate 11673920000 B = 11400312 kB = 11133 MB = 10 GB

Is that correct? 10 GB of RAM? That seems like a lot...

@Zodiase
Contributor Author

Zodiase commented Oct 6, 2016

@max-zilla According to this comment #81 (comment), the memory consumption can become very large depending on the input size.

@yanliu-chn

what is the input image size?

@max-zilla
Contributor

max-zilla commented Oct 6, 2016

Very big.
[screenshot: file listing showing the size of the raw input file]

...so I guess this implies we'd want ~60 GB of RAM for this? Seems like that could be a problem. @robkooper tagging you just FYI

Comments from Zender in #81:

I reduced the memory required by ncap2 from 5x sizeof(raw) to 4x sizeof(raw). best to allot 4.1x so as not to cut it too close. the footprint cannot be further reduced without significantly slowing the processing. half the memory holds the original image (promoted from short to float) and the other half holds the computed reflectance (float). so 4x raw image is the smallest natural size of the computation without using loops.
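Zender's rule of thumb above can be turned into a quick back-of-the-envelope estimate. A minimal sketch, using the 4.1x allotment he suggests and the raw file sizes that come up in this thread:

```python
def ram_needed_gb(raw_size_gb, factor=4.1):
    """Peak ncap2 RAM is roughly 4x sizeof(raw); allot 4.1x per the comment above."""
    return raw_size_gb * factor

for raw_gb in (14, 62, 68):  # raw sizes (GB) mentioned in this thread
    print(f"{raw_gb:>3} GB raw -> ~{ram_needed_gb(raw_gb):.0f} GB RAM")

# 14 GB raw -> ~57 GB RAM   (the ~60 GB figure above)
# 62 GB raw -> ~254 GB RAM
# 68 GB raw -> ~279 GB RAM
```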

@yanliu-chn

I see. I guess we have to let @czender Charlie know about this. We need to get a sense of how large an input image can be.

@czender
Contributor

czender commented Oct 6, 2016

@yanliu-chn and @max-zilla i don't know what the upper limit on input (raw) image size is. maybe @solmazhajmohammadi can comment. a 14 GB image requires ~60 GB RAM to process now. is the workflow occurring on compute nodes with <64 GB RAM? if so, we could discuss further reductions in RAM. one possibility would be to perform all math using short integers (rather than floating point) and store reflectance as a short integer (with a scale factor) or as a float. Those two options would reduce RAM overhead to 2x sizeof(raw) and 3x sizeof(raw), respectively. If possible it would be better to run on fat memory nodes with, e.g., 128 GB RAM though...

@dlebauer
Member

dlebauer commented Oct 6, 2016

A few options:

@solmazhajmohammadi

@czender @max-zilla @yanliu-chn
The raw file from the VNIR camera for a 4 m scan is about 12.22 GB, so a full row scan would be around 68 GB per file.

@smarshall-bmr
Collaborator

@dlebauer I don't see any reason why I couldn't collect 2m swaths. I'll ask LemnaTec how this might work and get back to you.

@czender
Contributor

czender commented Oct 6, 2016

thank you @solmazhajmohammadi it is good to know the upper bound for input filesize. i suppose we will let @dlebauer and @max-zilla and @yanliu-chn decide if they ever want to actually see files that big in the workflow and, if not, adjust the typical scan length accordingly.

@max-zilla
Contributor

@robkooper @jdmaloney the Terra project on Roger is using 32 GB / 125 GB total on Roger Openstack (5 / 10 instances). The compute-medium flavor instance has 64 GB RAM which would put us at 96/125 total, but I don't immediately know of a better solution - especially since we want this on Roger so we don't have to be sending 14 GB files around for every single VNIR extraction. thoughts?

@solmazhajmohammadi

@dlebauer Yes, we can program the gantry to store the data every 2m.

@robkooper
Member

Why not run this on a nightly basis as a batch job, so we can get a bunch of nodes with large memory? As long as the queue is created we can process the messages at any time.

@solmazhajmohammadi

@dlebauer We started the full row scan last night. So 65 GB files will pop up soon.
@czender Another option would be dividing the data into different wavelength ranges and saving them in multiple netCDF files. I put the Octave script for extracting a specific bandwidth from the hypercube on the gantry cache server. Smaller files to deal with...
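For reference, the wavelength-splitting idea could look roughly like this in Python (the script Solmaz mentions is Octave). It is only a sketch under assumptions: the dimension name 'wavelength' is a placeholder, not necessarily what the TERRA-REF .nc output uses, and error handling is omitted.

```python
from netCDF4 import Dataset

def extract_band_range(src_path, dst_path, band_slice=slice(0, 100)):
    """Copy only a subset of the wavelength dimension into a smaller netCDF file."""
    with Dataset(src_path) as src, Dataset(dst_path, "w") as dst:
        n_bands = len(range(*band_slice.indices(len(src.dimensions["wavelength"]))))
        # Recreate dimensions, shrinking only the wavelength axis.
        for name, dim in src.dimensions.items():
            size = n_bands if name == "wavelength" else (None if dim.isunlimited() else len(dim))
            dst.createDimension(name, size)
        # Copy variables, slicing any that depend on wavelength.
        for name, var in src.variables.items():
            out = dst.createVariable(name, var.dtype, var.dimensions)
            out.setncatts({k: var.getncattr(k) for k in var.ncattrs() if k != "_FillValue"})
            if var.dimensions:
                idx = tuple(band_slice if d == "wavelength" else slice(None) for d in var.dimensions)
                out[:] = var[idx]
            else:
                out.assignValue(var.getValue())
        dst.setncatts({k: src.getncattr(k) for k in src.ncattrs()})
```

Each wavelength block written this way is a much smaller file, at the cost of having to stitch the bands back together for any analysis that needs the full spectrum.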

@czender
Contributor

czender commented Oct 7, 2016

The hyperspectral workflow scripts (though not necessarily the extractor) should work as is on nodes with enough memory (~256 GB) to handle 64 GB raw files. If it is desirable to cap the maximum file size at something smaller, then my preference is to keep the images intact by reducing the scan length, not by splitting wavelengths. But other considerations may point to splitting wavelengths into different files, so it is good to know that it's possible to do that upstream...

@max-zilla
Contributor

@czender @solmazhajmohammadi the other option besides using qsub -i nightly on beefy nodes on Roger (which @yanliu-chn has done with PlantCV I think) is to reduce the scan size. If there are no big issues I think it makes sense to do this so we can deploy current code with minimal modification.

Last Friday's full row scan was ~65 GB, which implies ~267 GB of RAM; Charlie is going to try a manual test on this big ol' file just to make sure it works.

@dlebauer
Member

@smarshall-bmr can you adjust the hyperspectral scanner to generate one file per plot (e.g. stop and restart at column borders) instead of doing a full swath? How hard would that be?

@TinoDornbusch
Contributor

@dlebauer One could do a go-and-shoot approach with the hyperspectral camera, going to specific plots and using the mirror of the hyperspec to take images. Same approach as in the Rothamsted system.

@czender
Contributor

czender commented Oct 13, 2016

I just tried the HS workflow on a 62 GB file on Roger with

hyperspectral_workflow.sh -d 1 -i /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2016-10-07/2016-10-07__12-12-09-294/755e5eca-55b7-4412-a145-e8d1d4833b3f_raw

The workflow exited with an error on a new issue not related to input size, possibly due to altered metadata that choked the Python script. @FlyingWithJerome please investigate this, and fix if possible:

hyperspectral_workflow.sh: ERROR Failed to parse JSON metadata. Debug this:
python /gpfs/smallblockFS/home/zender/terraref/computing-pipeline/scripts/hyperspectral/hyperspectral_metadata.py /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2016-10-07/2016-10-07__12-12-09-294/755e5eca-55b7-4412-a145-e8d1d4833b3f_raw /gpfs_scratch/arpae/imaging_spectrometer/terraref_tmp_jsn.nc.pid142261.fl00.tmp

A related (though non-fatal) problem for Lemnatec to address: our workflow identifies duplicate keys in the JSON metadata (details in the next comment).

@czender
Contributor

czender commented Oct 13, 2016

Pinging Lemnatec folks @LTBen @solmazhajmohammadi and @TinoDornbusch

A related (though not fatal) problem with the workflow has occurred with most hyperspectral products for many months. The JSON metadata contains multiple keys with the same name. Running the above workflow yields, e.g.,

--> Warning: Multiple keys are mapped to a single value; such illegal mapping may cause the loss of important data.
--> The file path is /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2016-10-07/2016-10-07__12-12-09-294/755e5eca-55b7-4412-a145-e8d1d4833b3f_metadata.json, and the key is "instrument"

this is caused by these two lines in the JSON file

  "instrument": "gantry at Maricopa phenotyping facility",
  "instrument": "field gantry",

Please combine/fix/change one of them so there are no duplicate keys in the same group.
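For anyone reproducing this: Python's json module silently keeps only the last value when a key is duplicated, so the check has to happen while parsing. A sketch along these lines (not the actual hyperspectral_metadata.py code) is what surfaces the warning:

```python
import json
from collections import Counter

def warn_on_duplicate_keys(pairs):
    """object_pairs_hook that flags duplicated keys instead of dropping them silently."""
    counts = Counter(key for key, _ in pairs)
    for key, n in counts.items():
        if n > 1:
            print(f'Warning: key "{key}" appears {n} times in one object; only the last value survives.')
    return dict(pairs)

with open("755e5eca-55b7-4412-a145-e8d1d4833b3f_metadata.json") as fp:
    metadata = json.load(fp, object_pairs_hook=warn_on_duplicate_keys)
# For the metadata file quoted above, this warns about the duplicated "instrument" key.
```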

@solmazhajmohammadi

@czender Sure, will follow up to fix it

@FlyingWithJerome
Member

@czender
The problem is caused by this line:
"date of installation": "april 2016"
This is the only line that does not follow the same format as the other keys containing the keyword "date", so I concluded that Python has a problem matching it with the regular expressions. I will treat this as an exception.

@czender
Contributor

czender commented Oct 13, 2016

Thank you @FlyingWithJerome. Once these extractors are really running every day they should catch problems like this. Until then, we have to do it by hand. Let me know when you have pushed a fix so I can continue testing the 62 GB raw workflow. @hmb1 please reproduce the problem @FlyingWithJerome is working on so you can see the workflow in its full glory :)

@FlyingWithJerome
Member

@czender I had already pushed the updates for the hyperspectral_metadata.py
I set two keys as normal string variables, "date of installation" and "date of handover", since

...
"date of installation": "april 2016",
"date of handover"   : "todo",
...

are not legal.
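A minimal illustration of the exception being described, under assumptions: the date-matching pattern and function names below are hypothetical, not the actual hyperspectral_metadata.py implementation.

```python
import re

# Hypothetical pattern for machine-readable date values; the real script's
# regular expressions are not shown in this thread.
DATE_VALUE_RE = re.compile(r"\d{1,4}[-/]\d{1,2}[-/]\d{1,4}")

# Keys whose values are free-form text rather than parseable dates.
PLAIN_STRING_KEYS = {"date of installation", "date of handover"}

def handle_date_field(key, value):
    if key in PLAIN_STRING_KEYS or not DATE_VALUE_RE.search(value):
        return value                   # keep as an ordinary string attribute
    return parse_timestamp(value)      # well-formed dates go through normal parsing

def parse_timestamp(value):
    # Placeholder for the real timestamp parsing; included only to keep the sketch runnable.
    return value.strip()
```

With this, values like "april 2016" and "todo" are stored as plain strings instead of tripping the date parser.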

@czender
Contributor

czender commented Oct 14, 2016

@Zodiase this pull request (#161) has been waiting for more than a month. It does not touch the scripts I work on. Someone more knowledgeable (you or @max-zilla ?) should go ahead and merge it if it is still up-to-date... I think everyone may be waiting for someone else to merge it. Go ahead!

@czender
Contributor

czender commented Oct 14, 2016

In interactive mode on a Roger login node, the 62 GB raw file crashes the workflow with:

ncap2: ERROR nco_malloc() unable to allocate 249919680000 B = 244062187 kB = 238341 MB = 232 GB

@Zodiase or @yanliu-chn what is the command to get interactive access to a dedicated compute node with all its memory (at least 256 GB)? I tried "qsub -I" but it just hung with

zender@cg-gpu01:~$ qsub -I
Job submitted to account: arpae
qsub: waiting for job 44126.cg-gpu01 to start

@jdmaloney
Contributor

@czender That output means all slots for interactive jobs are currently in use on the system. That command will return a prompt once a node becomes available (how long that takes depends on system utilization). If you want a dedicated node to run a job and don't want to wait around on the command line for it to become available, you can submit a regular job to the job queue. It will wait in line, run once a slot opens up, and can email you when finished. I outlined this for @ZongyangLi in Issue #181 recently, which hopefully gives you a place to start. If not, feel free to keep posting questions/outputs.

Running on the login node is not advised, as that is where the scheduler resides and it is the sole user entry point for the system. Depending on how your resource utilization goes, if you crash the node (generally by running it out of memory) you would disrupt job scheduling across the ROGER resource for all users. That node's primary function is to serve as the cluster's entry point and a place where code can be compiled, scripts edited, and small file movements made, either within the cluster or small enough that Globus isn't really necessary (think rsyncing/scp'ing scripts from your local machine or similar).

@max-zilla
Contributor

@czender I had been waiting to merge this due to the ongoing discussion, but I think that can continue even after merge.

Also in accordance with other extractors this will be moved into its own repo I've created:
https://github.com/terraref/extractors-hyperspectral
...but I need to do a couple more things to properly carry authorship of files to that new repo before we really make the switch. At that point I'll make sure the new repo is sync'd with latest code from this one and we can move ongoing discussion/development there.

@max-zilla max-zilla merged commit c63e094 into master Oct 14, 2016
@czender
Contributor

czender commented Oct 14, 2016

@max-zilla @FlyingWithJerome @jdmaloney @hmb1 The ROGER queue was light this morning and I just tested the 62 GB raw file mentioned above on a dedicated compute node. It failed to malloc() 232 GB just like the login node. The HS workflow will need nodes with more RAM, or HS scans must produce raw files < 62 GB, or the calibration procedure must be modified to consume less RAM. The silver lining was that I fixed a problem that prevented the workflow from completing for raw files of any size when run on compute nodes. Now that this pull request has been merged, I will stop commenting on this thread. Still getting used to GitHub.

@max-zilla
Contributor

@czender we can keep discussing here - i wasn't sure if merging would close the discussion thread automatically :)

@max-zilla
Contributor

@czender have you tried running on Roger since you fixed the 8x / 4x allocation issue? In discussions with @robkooper, @yanliu-chn, and @jterstriep we think you should be able to allocate 232 GB on an empty node since they have 256 GB.

@czender
Contributor

czender commented Oct 28, 2016

It does not seem to have a problem allocating the memory. But the file write never completes in the devel queue (1 hr). I tried the non-interactive batch queue but the job won't start. Can you tell me why?

zender@cg-gpu01:~$ qsub -I -A arpae -l walltime=03:00:00 -N hyperspectral -q batch -j oe -m e -o ~/hyperspectral.out ~/hyperspectral.pbs
Job submitted to account: arpae
Job submitted to account: arpae
qsub: waiting for job 44751.cg-gpu01 to start
qsub: job 44751.cg-gpu01 ready

total 16
drwx------. 2 root root 16384 Sep  8 09:18 lost+found
zender@cg-cmp01:~$ qstat -a
socket_connect_unix failed: 15137
socket_connect_unix failed: 15137
socket_connect_unix failed: 15137
qstat: cannot connect to server (null) (errno=15137) could not connect to trqauthd
qstat: Error (15137 - could not connect to trqauthd) 

@yanliu-chn

@czender could you paste the command you ran, your module list, and the necessary environment? We can try it on the normal queue and run it for longer to see if it finishes.

@czender
Contributor

czender commented Oct 28, 2016

@yanliu-chn my fault, i was on a compute node trying to submit to the batch queue.

@czender
Contributor

czender commented Oct 28, 2016

The HS workflow completed processing a 62 GB raw file in the batch queue. I had thought it was hanging because it did not finish in < 1 hr. I resubmitted with a 3 hour limit and it finished in 94 minutes. Setting the batch time to 2 hours should be sufficient.
