feat: `to_gbq` uses Parquet by default, use `api_method="load_csv"` for old behavior #413
Conversation
Looks like we need to bump the minimum numpy. Per NEP 29 (https://numpy.org/neps/nep-0029-deprecation_policy.html), we should be on NumPy 1.18 already, so requiring it seems reasonable.
Looks like most of the CircleCI failures are caused by the same issue.
The same tests pass with the latest versions of packages in the 3.9 tests. We don't run the system tests with 3.7 in the Kokoro session, so I can't tell if the same issue happens with the pip version. Will update the noxfile to run with 3.7.
Looks like we're not the only ones encountering this: https://stackoverflow.com/questions/59682833/pyarrow-lib-arrowinvalid-casting-from-timestampns-to-timestampms-would-los I wonder if bumping the minimum pyarrow to 4.0.0 would fix it?
I've tried pyarrow 4, 5, and 6, none of which fixed it. Possibly a problem with pandas?
Tests pass with
I'll try different versions of pandas.
This issue is fixed by upgrading to pandas 1.1.0+. Looking at the pandas 1.1.0 changelog, there have been several bug fixes relating to timestamp data. I'm not sure which one in particular helped here, but potentially it was the fix for mixing and matching different timezones: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#parsing-timezone-aware-format-with-different-timezones-in-to-datetime

I don't think we want to require pandas 1.1.0 just yet. Perhaps the tests could be updated not to mix and match timezones, since that isn't actually supported by pandas until 1.1.0?
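As a sketch of the area that changelog entry covers (these sample values are invented for illustration, not taken from the test suite), parsing strings with different UTC offsets works across pandas versions when `utc=True` normalizes everything to a single timezone:

```python
import pandas as pd

# Two timestamps with different UTC offsets; utc=True converts both to UTC,
# avoiding the mixed-timezone parsing that pandas only fixed in 1.1.0.
parsed = pd.to_datetime(
    ["2021-01-01 00:00:00+00:00", "2021-01-01 00:00:00+05:00"],
    utc=True,
)
print(parsed)
```

Rewriting the tests in this style would sidestep the pre-1.1.0 behavior without bumping the minimum pandas version.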
I can confirm the fix gets rid of the linked issue.
Overall it looks good; the comments are just some nits and one possible refactoring opportunity.
LGTM.
The Python 3.10 check fails because the BigQuery client does not yet support Python 3.10 and cannot be installed as a dependency.
```python
if schema is not None:
    schema = pandas_gbq.schema.remove_policy_tags(schema)
    job_config.schema = pandas_gbq.schema.to_google_cloud_bigquery(schema)
# If not, let BigQuery determine schema unless we are encoding the CSV files ourselves.
elif not FEATURES.bigquery_has_from_dataframe_with_csv:
    schema = pandas_gbq.schema.generate_bq_schema(dataframe)
```
This may introduce a failure when `schema` is None and `generate_bq_schema` is left unused: the Parquet conversion may succeed, but the actual BQ table schema types may not match the types in the resulting file.
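One hypothetical way such a mismatch could arise (this example is invented for illustration, not taken from the test suite): a nullable integer column falls back to `float64` in pandas, so the Parquet file would carry FLOAT values even if the intended BigQuery table column is INTEGER.

```python
import pandas as pd

# With a None present, pandas stores the column as float64 by default,
# so the serialized Parquet type would be a floating-point type.
df = pd.DataFrame({"num": [1, 2, None]})
print(df["num"].dtype)
```

An explicit schema (or a nullable `Int64` dtype) would be needed to keep the column as an integer end to end.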
Interesting that our tests wouldn't have caught that. Do you have an example of a dataframe that demonstrates this?
FWIW, the reason we don't have this here is that the google-cloud-bigquery library does similar dataframe-to-BQ-schema conversion logic if the schema is not populated on the job config: https://github.com/googleapis/python-bigquery/blob/66b3dd9f9aec3fda9610a3ceec8d8a477f2ab3b9/google/cloud/bigquery/client.py#L2625
Fixes #366 🦕