
Mass conversion of raw data to plot-level tiles #265


Closed
max-zilla opened this issue Feb 27, 2017 · 62 comments

Comments

@max-zilla
Contributor

In a brief discussion between @dlebauer and me last Friday, we talked through the pros & cons of just splitting all our imagery data into plot tiles, i.e. for every sensor we would:

  1. transfer raw data to Roger
  2. level 1 extractors convert raw data into GeoTIFF
  3. run a process that scans a day of GeoTIFFs from a sensor and creates plot-level tiles
  4. level 2+ extractors operate on plot GeoTIFFs
  5. later on, remove the original raw files and only keep the plot-level tiles

The idea is that in step 3 we use a more generalized version of the stereoTop full-field mosaic extractor that operates on GeoTIFFs. Thus we funnel all extractors -> GeoTIFF format (we can still make JPGs for quick previewing). To kick things off:

PROS

  • scientists are likely to care about plot-level analysis in most cases; plot tiles are easier to work with, and it's easier to understand what you're seeing
  • easier to compare the same plot location across time/sensors with aligned imagery
  • reduces the stitching/overlap handling required on the data scientist's side (since we perform this clip for them), or reduces the computational overhead we would otherwise incur dynamically querying plots from netCDF files (for example)
  • we could consistently organize data by sensor x time x plot across the entire project structure (e.g. in Clowder and in filesystem directories), not just in the geostreams/BETY GIS databases

CONS

  • muddles data organization somewhat; the dataset organization on Roger disk and in Clowder would no longer apply. Instead, each plot could include 2+ distinct timestamps that were stitched together to form the plot image you're seeing, so we'd have:
raw_data/
  stereoTop/
    2016-02-02/
      2016-02-02_12-32-32-123 Range 23 Pass 6/
        files

Now, maybe if we apply some givens about the gantry we could assume we won't have more than 1 capture per plot, per sensor, per minute, so we drop the milliseconds from the timestamp or something. Or we sort by plot:

raw_data/
  stereoTop/
    2016-02-02/
      Range 23 Pass 6/
        2016-02-02_12-32-32-123 
          files
  • impact on storage plans. Effectively, the plot tiles would become the de facto "lowest level raw data" included in the data release, so the actual raw imagery could be archived and deleted. This would be a complex process, with all the issues JD and I face elsewhere in terms of making sure a full day of data is present before performing the tiling.
  • less flexibility if, theoretically, one wanted to query some arbitrary ground footprint later on that is not a specific plot boundary - if we don't implement mechanisms to dynamically query netCDF, for example. If we DON'T precompute the tiles, perhaps we can offer a dynamic way to chop things on the fly.

I think one way or another we need to provide imagery at the plot level; either pre-clipped, or some means to quickly clip on demand (e.g. for plot-level extractors). So maybe a better question is: what's easier, a bunch of complexity to achieve the clipping above, or something like THREDDS to juggle big netCDFs on the fly? As I write this out, pre-clipping seems like a tall order, but maybe it saves us headaches later on... decided to just toss it on the table.

tagging @jdmaloney @jterstriep @yanliu-chn as FYI also.

@max-zilla
Contributor Author

Note that the script I linked does NOT create plot-level tiles. The basic logic for the mosaic extractor is:

  • convert stereoTop .BIN files to GeoTIFF (this shouldn't be necessary since demosaic already does that - we should update the extractor to check for existing GeoTIFFs first)
  • get a list of all GeoTIFFs inside the field bounds for one day, and create a VRT (virtual TIFF) that combines them all
  • feed VRT to gdal2tiles_parallel.py which prepares for Google Maps global tiles

The theoretical plot clipping extractor would chop the VRT into new TIFs based on each polygon in our plot definition.
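
For illustration, a rough sketch of the VRT and tiling steps using the GDAL Python bindings; the paths here are hypothetical, options are omitted, and the real pipeline drives gdal2tiles_parallel.py rather than plain gdal2tiles:

```python
# Sketch only: mosaic one day of GeoTIFFs into a VRT and tile it.
import glob
import subprocess
from osgeo import gdal

day_dir = "/sites/ua-mac/Level_1/demosaic/2017-02-07"            # hypothetical layout
tifs = sorted(glob.glob(day_dir + "/**/*.tif", recursive=True))

# Build a virtual mosaic ("virtual TIFF") referencing every GeoTIFF from the day
vrt = gdal.BuildVRT(day_dir + "/fullfield.vrt", tifs)
vrt = None  # close the dataset so the .vrt is flushed to disk

# Hand the VRT to the tiler (the extractor uses gdal2tiles_parallel.py with its own flags)
subprocess.check_call(["gdal2tiles.py", day_dir + "/fullfield.vrt", day_dir + "/tiles"])
```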

@max-zilla
Contributor Author

max-zilla commented Mar 15, 2017

@robkooper @jdmaloney @yanliu-chn @dlebauer @czender @craig-willis

Just spoke with Rob about moving this forward. I basically suggested:

  • we have extractors that convert raw sensor images into standard geocoded GeoTIFF format
  • we have a work script we run directly on the Roger filesystem outside Clowder, probably on a worker with some hefty memory.
    • loop across each sensor x day directory of GeoTIFFs
    • stitch the day's images together into one full field image (using VRT, or streaming or something)
    • clip the full field image into images by plot (see the sketch after this list)
    • create aggregate metadata for the new plot image (e.g. timestamp down to the minute it was captured, even if multiple captures are blended into one plot)
    • organize data in Clowder by plot & date; once you get to a single day's collection you see a nice list with one plot-level image per sensor for that plot
    • we no longer store raw data/metadata in Clowder - the plot-level GeoTIFFs are "base level" data products for the outside world
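
A minimal sketch of that clip step, assuming the plot polygons have been exported from PostGIS to a GeoJSON file with a plot_name property (the file, property name, and paths below are all hypothetical):

```python
# Sketch only: clip one day's full-field mosaic into per-plot GeoTIFFs.
import json
from osgeo import gdal

fullfield = "/sites/ua-mac/Level_1/fullfield/2017-02-07/fullfield.vrt"   # hypothetical
plots_geojson = "sorghum_season2_plots.geojson"                          # hypothetical PostGIS export

with open(plots_geojson) as f:
    features = json.load(f)["features"]

for feat in features:
    plot_name = feat["properties"]["plot_name"]                          # hypothetical property
    out_tif = "plots/2017-02-07/%s.tif" % plot_name.replace(" ", "_")
    # Cut the mosaic with the plot polygon and crop the raster extent to it
    gdal.Warp(out_tif, fullfield,
              cutlineDSName=plots_geojson,
              cutlineWhere="plot_name = '%s'" % plot_name,
              cropToCutline=True,
              dstNodata=0)
```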

Main questions/concerns:

  • What does this mean for data usage projections, i.e. our 10 PB estimate? We need to get a basic sample of the code working on one of the datasets to see how big the plot images are - the existing stereoTop mosaic code isn't a perfect comparison
  • How quickly can we be confident we have all the images from a day to be able to stitch & chop? 48 hours? 36?
  • Do we need a buffer around plot boundaries, and is there concern about the non-overlap between polygons in the shapefile? This could be both a boon in terms of overall file size for a day, and a negative if there is anything of interest in between:
    [screenshot]
    Eyeballing it, it feels like 50% of the pixels for the width of the field go away if we just keep the plots.

@czender
Contributor

czender commented Mar 15, 2017

Mentioning @hmb1 since he is looking into efficient means of clipping netCDF to shapefiles...

@max-zilla
Contributor Author

@czender thanks. Forgot to mention that I tagged you because I imagine the hyperspectral data might do this slightly differently than the others... e.g. maybe we bypass the GeoTIFF and modify your script to generate the .nc files per plot instead?

@czender
Contributor

czender commented Mar 15, 2017

That's one possibility. If you have an ESRI shapefile or geoTIFF for a plot for which there's a corresponding hyperspectral image, please send it our way for experimentation...

@dlebauer
Member

@max-zilla To answer some questions

What does this mean for data usage projections i.e. our 10 PB estimate?

we are well below the 10 PB estimate at this point, and if we do get closer we can eliminate some of the intermediate data products; if we still need to save space after that, we can use lossy compression / downscaling

How quickly can we be confident we have all the images from a day to be able to stitch & chop? 48 hours? 36?

Isn't this information available as the difference between collection and transfer dates? (or maybe the more important time range is the beginning to end of transfer of one day's data)

(ignoring outliers caused by transfer downtime)

do we need a buffer around plot boundaries/is there concern about the non-overlap between polys in the shapefile? this could be both a boon in terms of overall file size for a day, and a negative if there is anything of interest in between:

You are correct that ~50% of the area goes away if you just keep the plots. This is because, of the four rows, only the two in the middle are kept; the outside rows are 'border rows'. While the border rows do provide useful information prior to canopy closure, we have decided to start with these plot boundaries (#187). Although the discussion of plot definitions is ongoing (terraref/reference-data#60 (comment)), I suggest we move forward with these plot definitions and exclude the alleys and border rows.

I am still confused about the use of a shapefile. I know we have discussed this before, but I was under the impression that our workflows would be based on OGC-compliant data formats, and that we would import the plot boundaries into the PostGIS database where they can be queried in GeoJSON or WKT format.

@dlebauer
Member

@czender @hmb1 are these files sufficiently aligned to the x,y grid that ncks -d x,xmin,xmax -d y,ymin,ymax in.nc out.nc would be sufficient and efficient?

@NewcombMaria

@dlebauer and @max-zilla the 4-row plot arrangement and the issue of the 2 outer rows of each 4-row plot being 'border rows' only apply to Sorghum-season2. Other seasons (Sorghum-season1, Wheat-season3, and the upcoming Sorghum-season4) are all planted as 2-row plots with both rows considered data-collection rows. The big difference between the 2-row and 4-row plot arrangements is why Roman suggested making single rows the smallest units and building 'plots' by combining the single rows of interest.

@max-zilla
Contributor Author

@dlebauer we have the plot boundaries in PostGIS and can build on that; I just used the shapefile as representative of the geometry we're working with so I could include that little picture.

@NewcombMaria that is helpful, thanks.

@czender
Contributor

czender commented Mar 15, 2017

@dlebauer @hmb1 the rectangles above look rectangular to me, so ncks will be efficient and fast on them. I am thinking of how best to use shapefiles at the "plant scale", e.g., how to mask the irregular regions of soil (or leaves) within a plot, as this seems like a useful capability for manipulating the hyperspectral data. We do not yet have a specific request like "can we get a netCDF file with valid values within this shapefile and _FillValues elsewhere", though it seems only a matter of time and we want to stay ahead of the curve...

@dlebauer
Member

@czender: @remotesensinglab is working on per-pixel classifications of sun/shade and plant/soil. That should be easier than irregular polygons (unlike plots, these irregular shapes will change in each image).

@NewcombMaria good point. We should re-visit Roman's proposal moving forward. For now we will keep these plot definitions for Sorghum Season 2.

@czender
Contributor

czender commented Mar 15, 2017

OK. Then the ncks -d x,xmin,xmax -d y,ymin,ymax in.nc out.nc method should suffice for HS imagery once @FlyingWithJerome commits the branch with the correct downsampled coordinates...
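
For concreteness, a small sketch of driving that hyperslab subset from a script, assuming the plot's x/y bounds are already known in the same units as the file's coordinate variables (filenames and bounds below are made up):

```python
# Sketch only: subset a hyperspectral netCDF to one plot's bounding box with ncks.
import subprocess

in_nc = "2017-02-07_hyperspectral.nc"    # hypothetical input
out_nc = "Range_10_Pass_2.nc"            # hypothetical per-plot output
xmin, xmax = 204.0, 211.0                # hypothetical bounds in the file's x/y units
ymin, ymax = 2.0, 6.0

# Decimal values make ncks interpret the -d limits as coordinate values, not indices
subprocess.check_call([
    "ncks", "-O",
    "-d", "x,%.6f,%.6f" % (xmin, xmax),
    "-d", "y,%.6f,%.6f" % (ymin, ymax),
    in_nc, out_nc,
])
```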

@dlebauer
Member

PS what I envision is that the classification algorithm could add new boolean variables like 'sun_leaf', 'shade_soil', etc. to the reflectance index data product. @remotesensinglab does that make sense?

@remotesensinglab

Hi David,
Yes, that is what I have in mind as well. Thanks,

Wasit

@max-zilla
Contributor Author

max-zilla commented Mar 16, 2017

@czender that min/max arg looks perfect. FWIW you can grab the shapefile from my image above here:
https://github.com/terraref/extractors-metadata/tree/master/sensorposition/shp/sorghumexpfall2016v5

...we can (and soon should) use the geostreams API to get the boundaries of each plot like so:

GET https://terraref.ncsa.illinois.edu/clowder/api/geostreams/sensors?sensor_name=Range 10 Pass 2

RESPONSE
[
  {
    "id": 288,
    "name": "Range 10 Pass 2",
    "created": "2017-02-02T18:04:52Z",
    "type": "Feature",
    "properties": {
      "type": {
        "id": "MAC Field Scanner",
        "title": "MAC Field Scanner",
        "sensorType": 4
      },
      "name": "Range 10 Pass 2",
      "popupContent": "Range 10 Pass 2",
      "region": "Maricopa"
    },
    "geometry": {
      "type": "Point",
      "coordinates": [
        -111.974990376,
        33.0748710865,
        0
      ]
    },
    "min_start_time": "2016-06-30T18:11:40Z",
    "max_end_time": "2016-12-31T21:58:08Z",
    "parameters": [
      "centroid",
      "file_ids",
      "fov",
      "sources"
    ]
  }
]

Right now it just has the centroid point in geometry but I have a script to run soon that'll store the polygon of each plot in these. Range 10 Pass 2 is the plot in this case.
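
Once the polygons are stored, fetching a plot boundary programmatically could look roughly like this (a sketch against the sensors endpoint shown above; it assumes geometry has been upgraded from the centroid point to a polygon):

```python
# Sketch only: look up a plot's geometry from the Clowder geostreams API.
import requests

BASE = "https://terraref.ncsa.illinois.edu/clowder/api/geostreams"

resp = requests.get(BASE + "/sensors", params={"sensor_name": "Range 10 Pass 2"})
resp.raise_for_status()
sensors = resp.json()

if sensors:
    plot = sensors[0]
    # Today geometry is a Point centroid; after the polygon script runs it would be a Polygon
    print(plot["name"], plot["geometry"]["type"], plot["geometry"]["coordinates"])
```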

@max-zilla
Contributor Author

max-zilla commented Mar 16, 2017

As a first target of this effort, I would propose:

@yanliu-chn do you have suggestions for who could write the script in item 3? Also tagging @ZongyangLi @solmazhajmohammadi as they haven't been mentioned in this thread yet

@ghost

ghost commented Mar 16, 2017

@solmazhajmohammadi, @remotesensinglab, @Paheding, @pless, @ZongyangLi - what is the standard way of handling data for overlapping images? Is the data averaged?

@max-zilla
Contributor Author

@yanliu-chn suggests stitching together the raw data as a non-georeferenced image.

@yanliu-chn

@ZongyangLi Yes.

@yanliu-chn

@ZongyangLi let me explain. The gantry->MAC transform gets you the coordinate in meters; then you convert that coordinate to lat/lon (UTM Zone 12 -> EPSG:4326); then use the MAC->USDA formula to get the correct lat/lon.

You can directly adjust the 2.5 m in UTM Zone 12 along the x and y directions if you know how many meters 0.000015258894 degrees of latitude and 0.000020308287 degrees of longitude are.
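
As a rough illustration of the UTM step only (the gantry->MAC and MAC->USDA offsets are not shown), here is a pyproj sketch; the sample easting/northing values are made up:

```python
# Sketch only: convert a coordinate from UTM Zone 12N (meters) to EPSG:4326 lat/lon.
from pyproj import Transformer

# EPSG:32612 = WGS 84 / UTM zone 12N, the zone covering the Maricopa site
utm12_to_wgs84 = Transformer.from_crs("EPSG:32612", "EPSG:4326", always_xy=True)

easting, northing = 409000.0, 3660000.0   # hypothetical coordinate in meters
lon, lat = utm12_to_wgs84.transform(easting, northing)
print(lat, lon)
```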

@ZongyangLi
Contributor

@max-zilla I have just uploaded a bounding box method with the formula to the repo.

https://github.com/terraref/extractors-stereo-rgb/tree/master/demosaic

@max-zilla
Contributor Author

@ZongyangLi this looks close to correct based on the formula I applied above (original is the lower image, fixed is upper):
[screenshot]
Again, for:

gantry_system_variable_metadata:
position x [m]: 207
position y [m]: 4.501
position z [m]: 0.631

location in camera box x [m]: 0.877
location in camera box y [m]: 2.276
location in camera box z [m]: 0.578

This seems to agree with our basic plot model:

# GANTRY GEOM (GANTRY CRS) ############
#			      N(x)             
#			      ^                   
#			      |                   
#			      |                   
#			      |                  
#      W(y)<------SE                  
#                                     
# NW: (207.3,	22.135,	5.5)      
# SE: (3.8,	0.0,	0.0)              
#######################################

North corner X is 207.3 and our gantry position is 207; since in this CRS X is along the north/south axis and increases as we move north, we expect to be just below the northern boundary.

Similarly, East Y is 0 (aka the rightmost boundary) and West Y is 22.135; ours is 4.5 so we expect to be roughly 1/5 of the way across the field.

So this looks reasonable, although I notice the left/right images look identical - does that surprise you?

I think the next step is to trigger this new version of the extractor on a whole day of datasets and bring them all into a map to see how it looks, then start to work on stitching. I'm also going to look at the metadata work @craig-willis has done to see if we can pull the parameters in this script from a standard location instead of having them hardcoded.

@ghost ghost removed the help wanted label Mar 29, 2017
@ghost ghost added this to the March 2017 milestone Mar 29, 2017
@max-zilla
Contributor Author

@ZongyangLi @yanliu-chn I am generating a full day of demosaic TIFFs here:

/sites/ua-mac/Level_1/demosaic/2017-02-07/

there are ~4,200 of these to generate; they might be completed by the time of the TERRA meeting, but I'm not sure.

@max-zilla
Contributor Author

max-zilla commented Mar 30, 2017

@pless and @ZongyangLi can focus on the stitching component of this: getting one big GeoTIFF for Feb 7.

The field of view of each sensor is really important. @yanliu-chn's formula does not take the FOV into account - we need to solve this problem. We need someone like @smarshall-bmr to determine the true field of view for each sensor; the stereo camera FOV parameter is not accurate, for example. @craig-willis indicates that Stuart has provided updated metadata for some sensors but not all.

@jterstriep has volunteered to handle the clipping to individual plot boundaries from that big GeoTIFF.

@ghost ghost removed the help wanted label Mar 30, 2017
@ghost

ghost commented Mar 30, 2017

FOV issue is #126

@max-zilla
Contributor Author

@ZongyangLi @yanliu-chn @jterstriep @smarshall-bmr looks better w/ new calculations, but we can see where we need the FOV correction:
[screenshot]

Four images of the same basic part of the field with different images turned on/off:

Image A:
[screenshot]

Image A+B (see split in left tire tread):
[screenshot]

Image A+B+C (see another split in center dirt):
[screenshot]

Image A+C (gap between images despite areas that should overlap):
[screenshot]

In a phone discussion with Zongyang, the thinking was that the FOV correction would fix this.

@dlebauer
Member

dlebauer commented Apr 6, 2017

What is a robust, simple approach that we can use at this point for merging images?
Should we differentiate between images before and after canopy closure?

  • when plants are small and lots of ground is visible, adjust the plot boundary for scanner height so that the ground footprint is consistent (see the footprint sketch below)
  • when the canopy is fully closed, use the plot boundary at canopy height?

Does it help if we stitch using both the image features and geospatial (field of view) information?
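
To make the scanner-height point concrete, the ground footprint of a nadir-looking camera scales with its height above the target; a toy calculation (the FOV angle below is a placeholder, not a measured value - see #126):

```python
# Sketch only: ground footprint width of a nadir camera vs. distance to the target.
import math

def footprint_width(distance_m, fov_deg):
    """Width of the strip seen by a camera with the given full field-of-view angle,
    at the given distance below the camera."""
    return 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)

fov = 24.0                           # placeholder FOV in degrees; true values are issue #126
print(footprint_width(2.0, fov))     # footprint on bare ground, camera ~2 m above it
print(footprint_width(1.0, fov))     # footprint at canopy height once plants reach ~1 m
```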

@solmazhajmohammadi could you please update this issue with a summary of your findings to date?

@solmazhajmohammadi

Feature-based registration without considering the z height works for registration over a small area. When the image area gets bigger, the number of features increases and it becomes hard to register the images after that (due to false matches).
I did a bit of research on this: the ASIFT algorithm seems to be a solution for this problem. It is a great open-source work that forces the algorithm to only consider affine transformations, so we don't get arbitrary false matches.
http://www.ipol.im/pub/art/2011/my-asift/article_lr.pdf

@Paheding

Paheding commented Apr 6, 2017

@solmazhajmohammadi SIFT-based algorithms are very robust to image translation, rotation, and scaling; however, SIFT might still be patented. Alternatively, we can use the SURF method (article:
http://www.vision.ee.ethz.ch/~surf/eccv06.pdf). As for mismatched features, the RANSAC method (http://www.cs.columbia.edu/~belhumeur/courses/compPhoto/ransac.pdf) can be applied to the putative matches to eliminate outliers, so that the false matches can be resolved.
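
For anyone who wants to experiment, a bare-bones sketch of that feature-match-plus-RANSAC idea in OpenCV (ORB is used below because SURF lives in the non-free contrib module; filenames and parameters are arbitrary):

```python
# Sketch only: estimate a homography between two overlapping images with ORB + RANSAC.
import cv2
import numpy as np

img1 = cv2.imread("tile_a.tif", cv2.IMREAD_GRAYSCALE)   # hypothetical inputs
img2 = cv2.imread("tile_b.tif", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=5000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC rejects the false correspondences that raw descriptor matching lets through
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=5.0)
print("inlier ratio:", float(inliers.mean()) if inliers is not None else 0.0)
```

If only affine motion should be allowed, as suggested above, cv2.estimateAffinePartial2D with the RANSAC flag is a drop-in alternative to the homography fit.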

@solmazhajmohammadi

@Paheding take a look at figure 26 in the paper I posted. I have not tried ASIFT on the big plot-size images yet; I think this approach guarantees an affine transformation between the features. When the number of registered images got higher, RANSAC did not help with the false matches.

@Paheding

Paheding commented Apr 6, 2017

@solmazhajmohammadi Correct, there are some drawbacks to RANSAC. Here is an optimized RANSAC algorithm https://pdfs.semanticscholar.org/6dc8/6d312ca1c5b18e53ecb98fa6b5fc1053e023.pdf, which tends to be better than the classic RANSAC. Source code: http://www.cb.uu.se/~aht/code.html. A combination of the optimized RANSAC and ASIFT would be more robust.
Besides, another quick solution could be to set a threshold that restricts the number of registered images.

@dlebauer
Member

Is this ready to be closed? #306 covers some remaining issues.

@max-zilla
Contributor Author

Yes, I believe so - the current plan is for @jterstriep to develop a service to query plots on demand instead of mass conversion ahead of time. There are other issues where this is being discussed.

@dlebauer
Member

@jterstriep could you link to or create the issue that covers the on-demand plot subsetting?

@ghost ghost removed the help wanted label May 18, 2017