Feat: Async/Cloud GeoParquet reader via object-store #492
Implementing an asynchronous GeoParquet file reader using [`ParquetRecordBatchStream`](https://docs.rs/parquet/50.0.0/parquet/arrow/async_reader/struct.ParquetRecordBatchStream.html).

TODO:

- [x] Initial implementation in `src/io/parquet/reader.rs`
- [x] Fix trait bounds
- [x] Refactor to have both `read_geoparquet` and `read_geoparquet_async` functions parse the GeoParquet metadata using the same function
- [ ] Bring in `object-store` crate to read from URL (if it gets complicated, maybe split it into a separate PR)
- [x] Document new function
- [x] Add unit test

Addresses #492

P.S. This is my first ever Rust PR, so take it easy 🙈

---------

Co-authored-by: Kyle Barron <[email protected]>
Ok, the initial implementation of `read_geoparquet_async` works:

```rust
use geoarrow::io::parquet::read_geoparquet_async;
use geoarrow::io::parquet::GeoParquetReaderOptions;
use tokio::fs::File;

#[tokio::main]
async fn main() {
    let file = File::open("fixtures/geoparquet/nybb.parquet")
        .await
        .unwrap();
    let options = GeoParquetReaderOptions::new(65536, Default::default());
    let output_geotable = read_geoparquet_async(file, options).await.unwrap();
    println!("GeoTable schema: {}", output_geotable.schema());
}
```

Another example using object-store's `ParquetObjectReader` against Azure Blob Storage:

```rust
let storage_container = Arc::new(MicrosoftAzureBuilder::from_env().build().unwrap());
let location = Path::from("path/to/blob.parquet");
let meta = storage_container.head(&location).await.unwrap();
println!("Found Blob with {}B at {}", meta.size, meta.location);
let reader = ParquetObjectReader::new(storage_container, meta);
let table = read_geoparquet_async(reader, options).await?;
```

Next steps are to work out what the pyo3 API should look like. Copying from the threads in the PR:
Originally posted by @weiji14 in #493 (review)

Originally posted by @kylebarron in #493 (comment):

> Do we want both an async and sync function? Should paths starting with […]

I think the rust api should be generic over the existing trait bounds. The Python API might want a higher level API. See also roeap/object-store-python#3
> Instead of building the data access layer yourself, I'd recommend taking a look at opendal, which is working on an abstraction over several storage backends.

I don't follow: opendal and object-store are primarily equivalent, no? Given that I'm working with the […]
Right now our `read_geoparquet` function is relatively simple. It takes an input bounded on `ChunkReader` and hands the parsed Arrow data to `GeoTable::from_arrow`, which parses the geometry columns into GeoArrow columns.

So changes:

- Add `pub async fn read_geoparquet_async` with a bound on `T: AsyncFileReader + Send + 'static`. This punts to the user the task of creating a reader that implements `AsyncFileReader`. The Object Store struct implements it, so we can give an example of that.
- Use `ParquetRecordBatchStreamBuilder` instead of a synchronous one.
- Move the shared logic onto `ArrowReaderBuilder`, which both the sync `ParquetRecordBatchReaderBuilder` and the async `ParquetRecordBatchStreamBuilder` implement (so the same function can be used from the async and sync functions).

@weiji14