Feat: Async/Cloud GeoParquet reader via object-store #492

Closed

kylebarron opened this issue Feb 1, 2024 · 4 comments · Fixed by #575

kylebarron commented Feb 1, 2024

Right now our read_geoparquet function is relatively simple (see the outline after this list):

  1. Take in a reader that implements ChunkReader
  2. Infer the output arrow schema and target geo data type from the geoparquet metadata.
  3. Load each Arrow batch into memory
  4. Call GeoTable::from_arrow, which parses the geometry columns into GeoArrow columns.
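
A hedged outline of that flow (not the actual implementation; the geo-specific steps are left as comments, and the error type is a stand-in):

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::file::reader::ChunkReader;

// Hedged outline only: GeoTable is geoarrow's table type, and the GeoParquet
// metadata parsing and GeoTable construction are elided.
pub fn read_geoparquet<T: ChunkReader + 'static>(reader: T) -> parquet::errors::Result<GeoTable> {
    // Steps 1-2: open a builder over the reader and inspect the Parquet
    // metadata (the GeoParquet "geo" key lives in the file metadata).
    let builder = ParquetRecordBatchReaderBuilder::try_new(reader)?;
    let _file_metadata = builder.metadata().file_metadata();
    // Step 3: load each Arrow batch into memory.
    let _batches: Vec<_> = builder.build()?.collect::<Result<_, _>>()?;
    // Step 4: GeoTable::from_arrow(...) parses the geometry columns.
    todo!()
}
```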

So, the changes:

  1. Create a new function pub async fn read_geoparquet_async with a bound on T: AsyncFileReader + Send + 'static (see the sketch after this list). This punts to the user the task of creating a reader that implements AsyncFileReader. The ParquetObjectReader struct, which wraps an ObjectStore, implements it, so we can give an example of that.
  2. Open an async ParquetRecordBatchStreamBuilder instead of a synchronous one.
  3. Parsing the GeoParquet metadata and inferring the output arrow schema should work the same. To simplify, we can refactor that into a function with a bound on ArrowReaderBuilder, which both the sync ParquetRecordBatchReaderBuilder and the async ParquetRecordBatchStreamBuilder implement (so the same function can be used from both the async and sync paths).
  4. Convert the sync iterator to an async iterator.
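
A hedged sketch of what steps 1 and 2 could look like (the body and error type are placeholders, not the final implementation):

```rust
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStreamBuilder};

// Hedged sketch of the proposed entry point; GeoParquetReaderOptions and
// GeoTable are the crate's types, and the error type is a stand-in.
pub async fn read_geoparquet_async<T: AsyncFileReader + Send + 'static>(
    reader: T,
    options: GeoParquetReaderOptions,
) -> parquet::errors::Result<GeoTable> {
    // Step 2: open the async builder instead of the sync one.
    let builder = ParquetRecordBatchStreamBuilder::new(reader).await?;
    // Step 3: metadata/schema inference shared with the sync path via a
    // helper generic over ArrowReaderBuilder.
    // Step 4: the built stream yields RecordBatches to collect asynchronously.
    let _stream = builder.build()?;
    todo!()
}
```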


kylebarron added a commit that referenced this issue Feb 5, 2024
Implementing an asynchronous GeoParquet file reader using [`ParquetRecordBatchStream`](https://docs.rs/parquet/50.0.0/parquet/arrow/async_reader/struct.ParquetRecordBatchStream.html).

TODO:
- [x] Initial implementation in `src/io/parquet/reader.rs`
- [x] Fix trait bounds
- [x] Refactor to have both `read_geoparquet` and `read_geoparquet_async` functions parse the GeoParquet metadata using the same function
- [ ] Bring in `object-store` crate to read from URL (if it gets complicated, maybe split it into a separate PR)
- [x] Document new function
- [x] Add unit test

Addresses #492

P.S. This is my first ever Rust PR, so take it easy 🙈

---------

Co-authored-by: Kyle Barron <[email protected]>

weiji14 commented Feb 6, 2024

Ok, the initial implementation of read_geoparquet_async on the Rust side is done in #493. Example usage, passing in a tokio::fs::File:

```rust
use geoarrow::io::parquet::read_geoparquet_async;
use geoarrow::io::parquet::GeoParquetReaderOptions;
use tokio::fs::File;

#[tokio::main]
async fn main() {
    let file = File::open("fixtures/geoparquet/nybb.parquet")
        .await
        .unwrap();
    let options = GeoParquetReaderOptions::new(65536, Default::default());
    let output_geotable = read_geoparquet_async(file, options).await.unwrap();
    println!("GeoTable schema: {}", output_geotable.schema());
}
```

Another example with object-store from #493 (comment):

```rust
// Imports added for context; this snippet assumes an async context and the
// `options` value from the previous example.
use std::sync::Arc;

use object_store::{azure::MicrosoftAzureBuilder, path::Path, ObjectStore};
use parquet::arrow::async_reader::ParquetObjectReader;

let storage_container = Arc::new(MicrosoftAzureBuilder::from_env().build().unwrap());
let location = Path::from("path/to/blob.parquet");
let meta = storage_container.head(&location).await.unwrap();
println!("Found Blob with {}B at {}", meta.size, meta.location);

let reader = ParquetObjectReader::new(storage_container, meta);
let table = read_geoparquet_async(reader, options).await?;
```

Next steps are to work out what the pyo3 API should look like. Copying from the threads in the PR:

My idea with bringing in object-store was more about applying it on the Pyo3/Python side, around here:

```rust
pub fn read_parquet(path: String, batch_size: usize) -> PyGeoArrowResult<GeoTable> {
    let file = File::open(path).map_err(|err| PyFileNotFoundError::new_err(err.to_string()))?;
    let options = GeoParquetReaderOptions::new(batch_size, Default::default());
    let table = _read_geoparquet(file, options)?;
    Ok(GeoTable(table))
}
```

Specifically, making it possible for a user to pass in either a local path or a remote path like s3://bucket/example.geoparquet directly, with a match statement to handle the different object storage services (s3/az/gs/etc.). There's an object_store::parse_url function that can be used to build the ObjectStore. But let's discuss the implementation in #492 first; I'm not entirely sure whether it's better to have this parsing logic in geoarrow-rs or to leave it to the user to handle the object-store part themselves.

Originally posted by @weiji14 in #493 (review)
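
For illustration, a hedged sketch of that dispatch using parse_url (the read_from_url helper and its error handling are assumptions, not code from the PR):

```rust
use std::sync::Arc;

use object_store::{parse_url, ObjectStore};
use parquet::arrow::async_reader::ParquetObjectReader;
use url::Url;

// Hypothetical helper: parse_url inspects the URL scheme (s3://, az://,
// gs://, file://, ...) and builds the matching ObjectStore, replacing a
// hand-written match on the scheme.
async fn read_from_url(href: &str, options: GeoParquetReaderOptions) -> anyhow::Result<GeoTable> {
    let url = Url::parse(href)?;
    let (store, path) = parse_url(&url)?;
    let store: Arc<dyn ObjectStore> = Arc::from(store);
    let meta = store.head(&path).await?;
    let reader = ParquetObjectReader::new(store, meta);
    Ok(read_geoparquet_async(reader, options).await?)
}
```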

Yeah, let's work out the Python bits in a different PR. In particular, I think it makes sense to have both an async def read_parquet_async for power users building a web server or something on top of geoarrow, and also a def read_parquet that handles stuff automatically under the hood.

I'm also not exactly sure what public Python API is ideal, so we should come back to that.

Originally posted by @kylebarron in #493 (comment)

Do we want both async and sync functions? Should paths starting with s3://, az://, etc. be handled directly by the read_parquet function, or should it be left to the user to use object-store's parse_url or some fsspec-like parser to handle the connection?

kylebarron commented

> Do we want both async and sync functions? Should paths starting with s3://, az://, etc. be handled directly by the read_parquet function, or should it be left to the user to use object-store's parse_url or some fsspec-like parser to handle the connection?

I think the Rust API should be generic over the existing trait bounds. The Python API might want a higher-level interface. See also roeap/object-store-python#3
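
For concreteness, a hedged sketch of what such a higher-level def read_parquet could look like on the pyo3 side, blocking on the async reader internally (an assumption about the eventual API, not code from the repo):

```rust
use pyo3::exceptions::PyRuntimeError;
use pyo3::prelude::*;

// Hypothetical sync entry point: it reuses the async reader by blocking on a
// tokio runtime, so Python callers never see a coroutine; read_geoparquet_async,
// GeoParquetReaderOptions, and the GeoTable wrapper are the pieces from the
// snippets above.
#[pyfunction]
pub fn read_parquet(py: Python, path: String, batch_size: usize) -> PyResult<GeoTable> {
    // Release the GIL while Rust does the blocking I/O.
    py.allow_threads(|| {
        let runtime = tokio::runtime::Runtime::new()
            .map_err(|err| PyRuntimeError::new_err(err.to_string()))?;
        runtime.block_on(async {
            let file = tokio::fs::File::open(&path)
                .await
                .map_err(|err| PyRuntimeError::new_err(err.to_string()))?;
            let options = GeoParquetReaderOptions::new(batch_size, Default::default());
            let table = read_geoparquet_async(file, options)
                .await
                .map_err(|err| PyRuntimeError::new_err(err.to_string()))?;
            Ok(GeoTable(table))
        })
    })
}
```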


sunng87 commented Feb 19, 2024

Instead of building the data access layer yourself, I'd recommend taking a look at opendal, which provides an abstraction over a number of storage backends.

kylebarron commented

I don't follow: opendal and object-store are broadly equivalent, no? Given that I'm working with the parquet crate, which already integrates with object-store, it seems more straightforward to use object-store directly.

kylebarron added a commit that referenced this issue Mar 22, 2024
Follow-up to #552.

Also closes #492. Closes #551.