Hyperspectral extractor #161
Conversation
Remove terraref.sh from extractor directory
…ill tries to delete the output file. Now using the return code to detect errors in the subprocess and only uploading the output file when no error is detected.
I'm testing this now, but, similar to other completed extractors, the code will move here: https://github.com/terraref/extractors-hyperspectral/tree/master/hyperspectral. I will close this pull request once tested/moved.
I think it's just about working. I'm running into "Cannot allocate memory" errors from ncap2, but I think I just need to create a new Docker machine with 4 GB of RAM instead of 2.
Maybe 8 GB to be safe.
@yanliu-chn @FlyingWithJerome @Zodiase @czender have you encountered this bug before? I increased memory to 8 GB but there's a problem with calibration still running out of memory:
Specifically this part:
Is that correct? 10 GB of RAM? That seems like a lot...
@max-zilla According to this comment on #81, the memory consumption could get really bad depending on the input size.
What is the input image size?
...so I guess this implies we'd want ~60 GB of RAM for this? Seems like that could be a problem. @robkooper tagging you just FYI. Comments from Zender in #81:
I see. I guess we have to let Charlie (@czender) know about this. We need to get a sense of how large an input image can be.
@yanliu-chn and @max-zilla I don't know what the upper limit on input (raw) image size is; maybe @solmazhajmohammadi can comment. A 14 GB image requires ~60 GB RAM to process now. Is the workflow occurring on compute nodes with <64 GB RAM? If so, we could discuss further reductions in RAM. One possibility would be to perform all math using short integers (rather than floating point) and store reflectance as a short integer (with a scale factor) or as a float. Those two options would reduce RAM overhead to 2x sizeof(raw) and 3x sizeof(raw), respectively. If possible, it would be better to run on fat-memory nodes with, e.g., 128 GB RAM, though...
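To make those multipliers concrete, here is a back-of-the-envelope sketch; the 2x/3x factors are the ones quoted above, and the ~4x factor for the current pipeline is an approximation inferred from the 14 GB → ~60 GB observation, not a measured value.

```python
# Rough peak-RAM estimates for the calibration step, using the multipliers
# discussed above: ~4x sizeof(raw) for the current float pipeline, 3x for
# short-int math with float output, 2x for short-int math with short-int
# output. Purely illustrative.
OPTIONS = [
    ("current float pipeline", 4),
    ("short-int math, float output", 3),
    ("short-int math, short-int output", 2),
]

def estimated_ram_gb(raw_gb, factor):
    return raw_gb * factor

for raw_gb in (14, 62):
    for label, factor in OPTIONS:
        print("%2d GB raw | %-33s -> ~%3d GB RAM"
              % (raw_gb, label, estimated_ram_gb(raw_gb, factor)))
```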
A few options:
@czender @max-zilla @yanliu-chn
@dlebauer I don't see any reason why I couldn't collect 2m swaths. I'll ask LemnaTec how this might work and get back to you.
Thank you @solmazhajmohammadi, it is good to know the upper bound for input file size. I suppose we will let @dlebauer, @max-zilla, and @yanliu-chn decide whether they ever want to actually see files that big in the workflow and, if not, adjust the typical scan length accordingly.
@robkooper @jdmaloney the Terra project on Roger is using 32 GB / 125 GB total on Roger OpenStack (5 / 10 instances). The compute-medium flavor instance has 64 GB RAM, which would put us at 96/125 total, but I don't immediately know of a better solution, especially since we want this on Roger so we don't have to send 14 GB files around for every single VNIR extraction. Thoughts?
@dlebauer Yes, we can program the gantry to store the data every 2m.
Why not run this on a nightly basis as a batch job, so we can get a bunch of nodes with large memory? As long as the queue is created we can process the messages at any time.
@dlebauer We started the full row scan last night. So 65 GB files will pop up soon.
The hyperspectral workflow scripts (though not necessarily the extractor) should work as-is on nodes with enough memory (~256 GB) to handle 64 GB raw files. If it is desirable to cap the maximum file size at something smaller, then my preference is to keep the images intact by reducing the scan length, not by splitting wavelengths. But other considerations may point to splitting wavelengths into different files, so it is good to know that it's possible to do that upstream...
@czender @solmazhajmohammadi the other option besides using qsub -i nightly on beefy nodes on Roger (which @yanliu-chn has done with PlantCV, I think) is to reduce the scan size. If there are no big issues I think it makes sense to do this so we can deploy the current code with minimal modification. Last Friday the full row scan was ~65 GB, implying ~267 GB RAM; Charlie is going to try a manual test on this big ol' file just to make sure it works.
@smarshall-bmr can you adjust the hyperspectral scanner to generate one file per plot (e.g. stop and restart at column borders) instead of doing a full swath? How hard would that be?
@dlebauer One could do a go-and-shoot approach with the hyperspectral camera, going to specific plots and using the mirror of the hyperspec to take images. Same approach as in the Rothamsted system.
I just tried the HS workflow on a 62 GB file on Roger with hyperspectral_workflow.sh -d 1 -i /projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2016-10-07/2016-10-07__12-12-09-294/755e5eca-55b7-4412-a145-e8d1d4833b3f_raw The workflow exited with an error on a new issue not related to input size, possibly due to altered metadata that choked the Python script. @FlyingWithJerome please investigate this, and fix if possible: hyperspectral_workflow.sh: ERROR Failed to parse JSON metadata. Debug this: A related (though non-fatal) problem for LemnaTec to address: our workflow identifies
Pinging LemnaTec folks @LTBen @solmazhajmohammadi and @TinoDornbusch. A related (though not fatal) problem with the workflow has occurred with most hyperspectral products for many months. The JSON metadata contains multiple keys with the same name. Running the above workflow yields, e.g., --> Warning: Multiple keys are mapped to a single value; such illegal mapping may cause the loss of important data. This is caused by these two lines in the JSON file:
Please combine/fix/change one of them so there are no duplicate keys in the same group.
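For reference, a minimal sketch of how duplicated keys can be spotted while parsing, using only Python's standard json module; this script is illustrative and is not part of hyperspectral_metadata.py.

```python
# find_duplicate_json_keys.py -- minimal sketch, not part of the workflow.
# Python's json module silently keeps the last value when an object repeats a
# key, so an object_pairs_hook is used to report duplicates before they are lost.
import json
import sys
from collections import Counter

duplicates = []

def record_duplicates(pairs):
    counts = Counter(key for key, _ in pairs)
    dups = sorted(key for key, count in counts.items() if count > 1)
    if dups:
        duplicates.append(dups)
    return dict(pairs)

with open(sys.argv[1]) as handle:
    json.load(handle, object_pairs_hook=record_duplicates)

for dups in duplicates:
    print("duplicate keys within one object: " + ", ".join(dups))
```

Running it against a metadata file prints one line per object that repeats a key, which is the condition the workflow currently warns about.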
@czender Sure, will follow up to fix it.
@czender
Thank you @FlyingWithJerome. Once these extractors are really running every day they should catch problems like this. Until then, we have to do it by hand. Let me know when you have pushed a fix so I can continue testing the 62 GB raw workflow. @hmb1 please reproduce the problem @FlyingWithJerome is working on so you see the workflow in its full glory :)
@czender I had already pushed the updates for the hyperspectral_metadata.py ...
"date of installation": "april 2016",
"date of handover" : "todo",
... are not legal.
@Zodiase this pull request (#161) has been waiting for more than a month. It does not touch the scripts I work on. Someone more knowledgeable (you or @max-zilla?) should go ahead and merge it if it is still up-to-date... I think everyone may be waiting for someone else to merge it. Go ahead!
In interactive mode on a Roger login node, the 62 GB raw file crashes the workflow with:
ncap2: ERROR nco_malloc() unable to allocate 249919680000 B = 244062187 kB = 238341 MB = 232 GB
@Zodiase or @yanliu-chn what is the command to get interactive access to a dedicated compute node with all its memory (at least 256 GB)? I tried "qsub -I" but it just hung with
@czender That output means all slots for interactive jobs are currently in use on the system. The command will return a prompt once a node becomes available (how long that takes depends on system utilization). If you want a dedicated node to run a job and don't want to wait around on the command line for it to become available, you can submit a regular job to the job queue. It will wait in line and run once a slot opens up, and it can email you when finished. I outlined this for @ZongyangLi in issue #181 a bit ago, which hopefully gives you a place to start; if not, feel free to keep posting questions/outputs. Running on the login node is not advised, as that is where the scheduler resides and it is the sole user entry point for the system. Depending on how your resource utilization goes, if you crash the node (generally by running it out of memory) it would disrupt job scheduling across the ROGER resource for all users. That node's primary function is to serve as the cluster's entry point and a place where code can be compiled, scripts edited, and some basic file movement done, either within the cluster or small enough that Globus isn't really necessary (think rsyncing/scp'ing scripts from your local machine or similar).
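As a concrete illustration of that route, here is a sketch that wraps the workflow command from this thread in a PBS batch job and submits it with qsub; the resource requests (ppn, the 2-hour walltime) are assumptions for illustration, not confirmed ROGER limits.

```python
# Sketch: submit the hyperspectral workflow as a regular (non-interactive)
# batch job instead of waiting for a qsub -I slot. The PBS resource requests
# below are assumptions; check the site documentation for actual queue limits.
import subprocess
import tempfile

raw_file = ("/projects/arpae/terraref/sites/ua-mac/raw_data/VNIR/2016-10-07/"
            "2016-10-07__12-12-09-294/755e5eca-55b7-4412-a145-e8d1d4833b3f_raw")

job_script = """#!/bin/bash
#PBS -N hs_workflow
#PBS -l nodes=1:ppn=20,walltime=02:00:00
#PBS -m ae
cd "$PBS_O_WORKDIR"
./hyperspectral_workflow.sh -d 1 -i %s
""" % raw_file

with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as handle:
    handle.write(job_script)
    script_path = handle.name

# qsub prints the job id; the job waits in the queue and runs when a node frees up.
subprocess.check_call(["qsub", script_path])
```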
@czender I had been waiting to merge this due to the ongoing discussion, but I think that can continue even after the merge. Also, in accordance with other extractors, this will be moved into its own repo I've created:
@max-zilla @FlyingWithJerome @jdmaloney @hmb1 The Roger queue was light this morning and I just tested the 62 GB raw file mentioned above on a dedicated compute node. It failed to malloc() 232 GB just like the login node. The HS workflow will need nodes with more RAM, or HS scans must produce raw files < 62 GB, or the calibration procedure must be modified to consume less RAM. The silver lining was that I fixed a problem that prevented the workflow from completing for raw files of any size when run on compute nodes. Now that this pull request has been merged, I will stop commenting on this thread. Still getting used to GitHub.
@czender we can keep discussing here - I wasn't sure if merging would close the discussion thread automatically :)
@czender have you tried running on Roger since you fixed the 8x / 4x allocation issue? After discussions with @robkooper, @yanliu-chn, and @jterstriep, we think you should be able to allocate 232 GB on an empty node since they have 256 GB.
It does not seem to have a problem allocating the memory. But the file write never completes in the devel queue (1 hr). I tried the non-interactive batch queue but the job won't start. Can you tell me why?
@czender could you paste the command you ran and your
@yanliu-chn my fault, I was on a compute node trying to submit to the batch queue.
The HS workflow completed processing a 62 GB raw file in the batch queue. I had thought it was hanging because it did not finish in < 1 hr. I resubmitted with a 3-hour limit and it finished in 94 minutes. Setting the batch time to 2 hours should be sufficient.
Description
Add the hyperspectral extractor and its Dockerfile.
This extractor responds to events when new files are added to a dataset. It then checks whether it should run by verifying that all the input files are present and that the output file is not already present. The output file name is determined from the input files. It then runs the script to produce the output file and uploads it.
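A condensed sketch of that control flow is below; the required_inputs list, the output-name derivation, and the upload_output() helper are placeholders for illustration, not the extractor's actual API.

```python
# Condensed sketch of the extractor's decision logic described above. The
# required_inputs list, output naming, and upload_output() are placeholders.
import os
import subprocess

def upload_output(path):
    # Placeholder for the step that uploads the result back to the dataset.
    print("would upload " + path)

def maybe_process(dataset_files, required_inputs):
    # Run only when every expected input is present...
    if not all(name in dataset_files for name in required_inputs):
        return
    # ...and the output (here assumed to be named after the raw input) is absent.
    output_path = dataset_files["raw"] + ".nc"
    if os.path.exists(output_path):
        return

    # The workflow script signals failure through its exit code, so the output
    # file is only uploaded when the subprocess succeeds.
    rc = subprocess.call(["./hyperspectral_workflow.sh", "-d", "1",
                          "-i", dataset_files["raw"]])
    if rc == 0 and os.path.exists(output_path):
        upload_output(output_path)
```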
Motivation and Context
Resolves issue #81.
How Has This Been Tested?
Tested locally by building and running the Docker container, uploading the test input files, and checking that the output file is uploaded. I'm not sure how to update test.sh to reproduce the testing.
Types of changes
Checklist: