Define pipeline for converting bin files to NetCDF/HDF5 data products and transferring from MAC to NCSA #38

dlebauer opened this issue Dec 5, 2015 · 15 comments

@dlebauer
Member

dlebauer commented Dec 5, 2015

Description

The goal is to convert, compress, and efficiently write data from the imaging spectrometers to NetCDF / HDF5 data products. The data product format is specified in terraref/reference-data#14; a longer discussion of related issues, including data volume, transfer, and bottlenecks, is posted on the terraref website.

Implementation of this feature should be coordinated with, and in support of, the .bil->.nc and .bil->.hdf BrownDog DTS (Data Tilling Service) described in BrownDog issue 852, assigned to Eugene Roeder.

Based on discussion to date, this is the current draft.

  1. Data is collected by the spectrometer sensors as .bil + .hdr files
  2. Data is written to a gantry-mounted computer
    • 2 x 1 TB solid-state drives (fast read/write)
    • writes to one drive while reading/sending data from the other
  3. Data goes over a 10 Gigabit line to the MAC server
    • 70 TB cache server installed January 2016
    • 2-week cache, plus anything needed to support compression
  4. Data goes over a dedicated 1 Gigabit line (available mid-January 2016) to U. Arizona
  5. Globus (?) transfer to Roger
  6. Clowder pipeline triggered

The output file should follow the format specified in terraref/reference-data#14 (see 'TODO' below).

Example raw data

These are from the Headwall sensors; the format is what we expect, but the content is very different from what we will observe.

References:

TODO

  • Define necessary and preferred hardware / software at MAC cache server
  • Standards committee discussion of the proposed format for hyperspectral and other imaging data (reference-data#14), 2015-12-09 14:00 CST
  • Early 2016 finalize pipeline specification (C Zender, BrownDog/ISDA, MAC, Lemnatec, others)
  • Begin implementation with a script to convert from .bil to .nc, and create new issues as needed regarding implementation

Contacts

  • Kenton McHenry, collaborator and BrownDog PI @mchenry
  • Rob Kooper, Sr. Res. Programmer (BrownDog, TERRA Ref) @robkooper
  • Eugene Roeder, leading implementation for BrownDog @ch1eroe1
  • Charlie Zender, Special Operations @czender
  • Bob Strand, Lemnatec Engineer @rjstrand
@dlebauer
Member Author

(from #2)

My inclination is to parse all this JSON metadata into an attribute tree in the netCDF4/HDF file.
The file's level-0 (root) group would contain a level-1 group called "lemnatec_measurement_metadata", which would contain six level-2 groups, "user_given_data"..."measurement_additional_data", and each of those groups would contain group attributes for the fields listed above. We will use the appropriate atomic data type for each value encountered, e.g., string for most text, float for 32-bit reals, unsigned byte for booleans, etc. Some of the "gantry variable data" (like the x,y,z location) will need to be variables instead of (or as well as) attributes, so that their time-varying values can be easily manipulated by data-processing tools. They may become record variables with time as the unlimited dimension.

I think you have the right idea in parsing this to attributes, but I will note that the .json files are not designed to meet a standard metadata convention. Presumably a CF-compliant file will? Ultimately, we will want the files to be compliant or interoperable with an FGDC-endorsed ISO standard (https://www.fgdc.gov/metadata/geospatial-metadata-standards). Does that sound reasonable?

Regarding gantry variable data like the x,y,z location and time: I think it would be useful to store these as metadata attributes in addition to either dimensions or variables. When you say 'variables', do you mean storing the single x,y,z value in the metadata as a set of variables? Ultimately these will be used to define the coordinates of each pixel. This is something I don't understand well, and I don't know whether there is an easy answer. As I understand it, we could transform the images to a flat x-y plane, which would allow gridded dimensions, but if we map to x,y,z then the coordinates would be treated as variables. I'd appreciate your thoughts on this; if you want to chat offline, let me know.
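
For concreteness, here is a minimal sketch, in Python with the netCDF4 library, of the JSON-to-group/attribute mapping proposed above. The file names and the recursion are illustrative assumptions, not the pipeline's actual converter:

    import json
    import netCDF4
    import numpy as np

    def json_to_groups(parent, tree):
        # Sub-dicts become netCDF4 groups; scalar values become group attributes.
        for key, value in tree.items():
            if isinstance(value, dict):
                json_to_groups(parent.createGroup(key), value)
            else:
                if isinstance(value, bool):
                    value = np.uint8(value)  # "unsigned byte for boolean"
                parent.setncattr(key, value)

    with open("metadata.json") as f:  # hypothetical LemnaTec metadata file
        meta = json.load(f)

    nc = netCDF4.Dataset("output.nc", "w", format="NETCDF4")
    json_to_groups(nc.createGroup("lemnatec_measurement_metadata"), meta)
    nc.close()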

@dlebauer
Member Author

Sorry, the details of the hyperspectral data formats are specified in issue terraref/reference-data#14.

@czender
Contributor

czender commented Jan 26, 2016

I will treat this (issue #38) as the correct place to discuss development of the pipeline. IMHO, aim to produce a CF-compliant file now. Later map that to whatever ISO flavor you want.

My understanding is that we will receive much metadata in JSON format and need to store it in the final product. Once the JSON is in the netCDF file, the JSON file will be redundant. JSON is in no way the final product; it is "just" a useful way of transmitting structured information in key/value syntax.

Putting something (like x, y, z) in both data and metadata is prone to error, because people often manipulate one but not the other. If you want a spatial grid, then x,y,z information must be present as variables to facilitate spatial hyperslabbing. The same goes for time: time needs to be a variable for the sake of hyperslabbing.
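
A minimal netCDF4-python sketch of that layout, with time as the unlimited record dimension and the gantry position stored as variables (names and units are placeholders, not the product's final schema):

    import netCDF4

    nc = netCDF4.Dataset("gantry.nc", "w", format="NETCDF4")
    nc.createDimension("time", None)  # unlimited (record) dimension

    time = nc.createVariable("time", "f8", ("time",))
    time.units = "seconds since 2016-01-01 00:00:00"
    for name in ("x", "y", "z"):  # position as variables, not just attributes
        pos = nc.createVariable(name, "f4", ("time",))
        pos.units = "m"

    # Appending a record extends every record variable; tools can then
    # hyperslab along time, e.g. with NCO: ncks -d time,0,9 gantry.nc out.nc
    time[0] = 0.0
    nc.variables["x"][0] = 12.3
    nc.variables["y"][0] = 4.5
    nc.variables["z"][0] = 2.0
    nc.close()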

@dlebauer
Member Author

I see your point, but maybe I wasn't clear and/or am not familiar enough with the technical aspects. I think we need to save the x,y,z location of the camera at the time of capture, which is distinct from the x,y,z of each pixel. The camera position seems like immutable metadata required for 'level 0' products, while downstream projection and calibration will be required to assign coordinates to pixels (i.e., dimensions or variables).

As you suggest, we will likely revise the projection algorithms, and thus the dimensions, over time. So the key will be having enough metadata to support this.

Getting a prototype out for feedback is probably the best way forward.

@dlebauer dlebauer changed the title Define pipeline for converting bil files to NetCDF/HDF5 data products Define pipeline for converting bil files to NetCDF/HDF5 data products and transferring from MAC to NCSA Jan 26, 2016
@dlebauer
Member Author

@czender

A few notes

  1. To clarify: above you proposed passing the JSON tree to netCDF attributes, preserving the same nested key-value structure. That should make it easy, and flexible as we refine the format, correct?
  2. The basic structure of the sensor metadata files has a good chance of being stable; major changes should be limited to the addition of fields (Determine meta-data format for raw data from Lemnatec, reference-data#2).
  3. In particular, location data will be passed using GeoJSON, as described by @max-zilla in #2 (see the sketch below).
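
As an illustration only (the feature structure below is assumed; the actual schema is being settled in #2), reading such a GeoJSON location is straightforward:

    import json

    # Hypothetical GeoJSON Feature for a gantry/camera position
    feature = json.loads("""
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-111.975, 33.076, 353.0]},
     "properties": {"sensor": "SWIR"}}
    """)
    lon, lat, alt = feature["geometry"]["coordinates"]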

Let me know if you have any questions or want to discuss via phone etc.

@ghost ghost added 2 - Working <= 5 and removed 1 - Ready labels Jan 26, 2016
@ghost

ghost commented Jan 26, 2016

@czender Can you use GDAL? Talk to @robkooper.

@robkooper
Member

@ch1eroe1 has done some work on this as well and is working on some code to add it to BrownDog. I believe we just used gdal_translate to convert the .bil/.hdr pair to a netCDF file.

@dlebauer
Member Author

@ch1eroe1 do you have a link to the code that you wrote? (Or, if it is a one-liner, paste it here.)

@czender It's not necessary to use GDAL, but it will be useful to coordinate with @ch1eroe1, @robkooper, and the BrownDog team. From what I understand, the GDAL tool converts the file type but does not address optimization, handling of metadata, or developing data products.

@czender
Contributor

czender commented Jan 26, 2016

@dlebauer on the clarification

  1. Yes. If we have JSON as input, then we will transfer its structure straight into the metadata. Changes in structure can be made upstream (to the JSON, by Lemnatec). We're assuming they've already grouped the metadata logically.
  2. That's all fine.
  3. Getting the location into a standard form that is simultaneously useful for analysis will require care, and probably some iteration.

@czender
Contributor

czender commented Jan 26, 2016

@robkooper and @ch1eroe1 Yes, just looked, and we can try gdal_translate. If you already have a command to convert .hdr files (or similar), please post it with a link to a file that it works on, and we will modify it to work on the reference images above. Thanks!

@robkooper
Member

    gdal_translate -of netCDF test_envi_class.envi test_envi_class.nc

This converts a BIL file to netCDF. Specify the BIL file as the first argument; a second file with the .hdr extension is assumed to sit alongside it. The second argument is the output. -of netCDF makes the output netCDF instead of GeoTIFF.
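
For scripting the same conversion, a hedged Python wrapper; the -co creation options come from the GDAL netCDF driver and are an assumption about what we would want here (verify against the installed GDAL version):

    import subprocess

    def bil_to_netcdf(bil_path, nc_path):
        # The companion .hdr file is found automatically by GDAL's ENVI driver.
        subprocess.check_call(
            ["gdal_translate", "-of", "netCDF",
             "-co", "FORMAT=NC4",        # netCDF-4 (HDF5-based) output
             "-co", "COMPRESS=DEFLATE",  # zlib compression
             bil_path, nc_path])

    bil_to_netcdf("test_envi_class.envi", "test_envi_class.nc")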

@czender
Contributor

czender commented Jan 28, 2016

Thanks Rob. I have this working now.

@dlebauer
Member Author

@czender Sorry about the trouble with the Box link. I've put the SWIR sample files (~600MB) here: http://file-server.igb.illinois.edu/~dlebauer/terraref/

@ghost ghost mentioned this issue Mar 3, 2016
@dlebauer dlebauer changed the title Define pipeline for converting bil files to NetCDF/HDF5 data products and transferring from MAC to NCSA Define pipeline for converting bin files to NetCDF/HDF5 data products and transferring from MAC to NCSA Mar 11, 2016
@dlebauer
Member Author

@czender: please create additional issues, add documentation and links to scripts in github.com/terraref/documentation (make a new file called hyperspectral_data_pipeline.md, or similar), and then close this:

  1. How does this tie into the bigger picture?
    • Where should files land, and where should the outputs go?
    • inputs: /projects/arpae/terraref/raw_data/lemnatec_field/
    • outputs: /projects/arpae/terraref/outputs/lemnatec_field/
  2. How to speed up and compress? (see the sketch below)
  3. Anything else?
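
On item 2, one possible approach, sketched with netCDF4-python: copy each variable with zlib/DEFLATE compression enabled. The compression level is illustrative; the nccopy utility or NCO's ncks -L give the same result from the command line:

    import netCDF4

    src = netCDF4.Dataset("raw.nc")
    dst = netCDF4.Dataset("compressed.nc", "w", format="NETCDF4")

    for name, dim in src.dimensions.items():
        dst.createDimension(name, None if dim.isunlimited() else len(dim))

    for name, var in src.variables.items():
        out = dst.createVariable(name, var.dtype, var.dimensions,
                                 zlib=True, complevel=4)  # DEFLATE level 4
        # _FillValue can only be set at creation time, so skip it here
        out.setncatts({a: var.getncattr(a) for a in var.ncattrs()
                       if a != "_FillValue"})
        out[:] = var[:]

    dst.setncatts({a: src.getncattr(a) for a in src.ncattrs()})
    src.close()
    dst.close()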

@czender
Contributor

czender commented Mar 14, 2016

An alpha version of the pipeline now exists. As requested, I will close this issue and open a new issue at terraref/documentation#6

@czender czender closed this as completed Mar 14, 2016
@ghost ghost added sensor-data and removed sensor data labels Jan 3, 2017
@ghost ghost added 4 - Done and removed 3 - Review labels Jan 3, 2017