
BUG: Fix uploading of dataframes containing int64 and float64 columns #117


Merged: 7 commits merged into googleapis:master from issue-116-int-col on Feb 12, 2018

Conversation

@tswast (Collaborator) commented Feb 10, 2018

Fixes #116 and #96 by loading data in CSV chunks.
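In rough terms (an illustrative sketch only, assuming the google-cloud-bigquery client; the helper and variable names here are hypothetical, not the PR's actual code), the approach serializes each chunk of rows to an in-memory CSV buffer and uploads it with its own load job, so int64 and float64 columns arrive as plain CSV values:

import six
from google.cloud import bigquery


def upload_in_csv_chunks(client, dataframe, table_ref, chunksize):
    # Hypothetical sketch of chunked CSV loading.
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'CSV'
    for start in range(0, len(dataframe), chunksize):
        chunk = dataframe.iloc[start:start + chunksize]
        # Serialize the chunk to CSV in memory, without the header row.
        csv_buffer = six.StringIO()
        chunk.to_csv(csv_buffer, index=False, header=False)
        body = csv_buffer.getvalue().encode('utf-8')
        # One load job per chunk; block until it finishes.
        client.load_table_from_file(
            six.BytesIO(body), table_ref, job_config=job_config).result()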

@max-sixty (Contributor) left a comment

This looks superb!

return six.BytesIO(body)


def encode_chunks(dataframe, chunksize):
@max-sixty (Contributor) commented on the diff:

Why the multiple chunks, rather than using a single chunk? Is it a memory issue? A UI / status bar updating issue?

@tswast (Collaborator, Author) replied:

Because previously chunksize was required and I wasn't ready to make it optional. I've just added a commit to this PR to make it optional. We'll want to update the default in pandas after we release a package with this change.
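A minimal sketch of what an optional chunksize could look like (hypothetical code, not the committed change): treat a missing chunksize as "encode everything as a single chunk".

import six


def encode_chunks(dataframe, chunksize=None):
    # Hypothetical sketch: when chunksize is None, emit one chunk for the
    # whole dataframe.
    if chunksize is None:
        chunksize = len(dataframe) or 1
    for start in range(0, len(dataframe), chunksize):
        chunk = dataframe.iloc[start:start + chunksize]
        csv_buffer = six.StringIO()
        chunk.to_csv(csv_buffer, index=False, header=False)
        yield six.BytesIO(csv_buffer.getvalue().encode('utf-8'))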

@@ -0,0 +1,26 @@

@max-sixty (Contributor) commented on the diff:

FYI this file name currently has two underscores

@tswast (Collaborator, Author) replied:

Thanks, I'm aware. I'm following the convention that the filename should be test_ plus the filename of the file under test.

@codecov-io commented Feb 10, 2018

Codecov Report

Merging #117 into master will increase coverage by 2.74%.
The diff coverage is 71.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
+ Coverage   28.25%   30.99%   +2.74%     
==========================================
  Files           4        8       +4     
  Lines        1561     1626      +65     
==========================================
+ Hits          441      504      +63     
- Misses       1120     1122       +2
Impacted Files                       Coverage Δ
pandas_gbq/tests/test__schema.py     100% <100%> (ø)
pandas_gbq/tests/test__load.py       100% <100%> (ø)
pandas_gbq/_schema.py                100% <100%> (ø)
pandas_gbq/tests/test_gbq.py         26.88% <14.28%> (-0.99%) ⬇️
pandas_gbq/gbq.py                    20.56% <46.66%> (+1.94%) ⬆️
pandas_gbq/_load.py                  62.5% <62.5%> (ø)
... and 2 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f040c18...1ce8b0b.

@tswast (Collaborator, Author) commented Feb 10, 2018

Full Travis build with system tests running at https://travis-ci.org/tswast/pandas-gbq/builds/339826702

@tswast (Collaborator, Author) commented Feb 10, 2018

Ah, Travis uncovered a potential problem with using CSV.

=================================== FAILURES ===================================
 TestToGBQIntegrationWithServiceAccountKeyPath.test_upload_subset_columns_if_table_exists_append 
self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f28719af610>
    def test_upload_subset_columns_if_table_exists_append(self):
        # Issue 24: Upload is succesful if dataframe has columns
        # which are a subset of the current schema
        test_id = "16"
        test_size = 10
        df = make_mixed_dataframe_v2(test_size)
        df_subset_cols = df.iloc[:, :2]
    
        # Initialize table with sample data
        gbq.to_gbq(df, self.destination_table + test_id, _get_project_id(),
                   chunksize=10000, private_key=_get_private_key_path())
    
        # Test the if_exists parameter with value 'append'
        gbq.to_gbq(df_subset_cols,
                   self.destination_table + test_id, _get_project_id(),
>                  if_exists='append', private_key=_get_private_key_path())
pandas_gbq/tests/test_gbq.py:1096: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas_gbq/gbq.py:978: in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize=chunksize)
pandas_gbq/gbq.py:572: in load_data
    self.process_http_error(ex)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ex = BadRequest(u'Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1.',)
    @staticmethod
    def process_http_error(ex):
        # See `BigQuery Troubleshooting Errors
        # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
    
>       raise GenericGBQException("Reason: {0}".format(ex))
E       GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
pandas_gbq/gbq.py:455: GenericGBQException
 TestToGBQIntegrationWithServiceAccountKeyPath.test_upload_data_flexible_column_order 
self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f2872fff310>
    def test_upload_data_flexible_column_order(self):
        test_id = "13"
        test_size = 10
        df = make_mixed_dataframe_v2(test_size)
    
        # Initialize table with sample data
        gbq.to_gbq(df, self.destination_table + test_id, _get_project_id(),
                   chunksize=10000, private_key=_get_private_key_path())
    
        df_columns_reversed = df[df.columns[::-1]]
    
        gbq.to_gbq(df_columns_reversed, self.destination_table + test_id,
                   _get_project_id(), if_exists='append',
>                  private_key=_get_private_key_path())

I think I probably need to include the schema definition in the load job, since we want to be able to upload a data frame even if the columns are out of order or there are extra columns.
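For illustration, passing an explicit schema to each load job might look roughly like this (a hypothetical sketch assuming the google-cloud-bigquery client; load_chunk and schema_fields are made-up names, and schema_fields is assumed to look like the 'fields' list produced by _generate_bq_schema):

from google.cloud import bigquery


def load_chunk(client, chunk_buffer, table_ref, schema_fields):
    # Hypothetical sketch: attach an explicit schema to the CSV load job.
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'CSV'
    job_config.write_disposition = 'WRITE_APPEND'
    job_config.schema = [
        bigquery.SchemaField(field['name'], field['type'])
        for field in schema_fields
    ]
    return client.load_table_from_file(
        chunk_buffer, table_ref, job_config=job_config)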

@max-sixty (Contributor) commented:

> I think I probably need to include the schema definition in the load job, since we want to be able to upload a data frame even if the columns are out of order or there are extra columns.

Yes, for sure. I think you can do that fairly easily with schema=_generate_bq_schema(df).

(We have this helper, but I can't remember why we do the object construction:)

def _bq_schema(df):
    schema_dict = _gbq._generate_bq_schema(df)
    schema = [bigquery.schema.SchemaField(x['name'], x['type'])
              for x in schema_dict['fields']]
    return schema

@tswast (Collaborator, Author) commented Feb 12, 2018

I imagine you had to do the object construction as a workaround for googleapis/google-cloud-python#4456

@tswast (Collaborator, Author) commented Feb 12, 2018

Okay, I think I got it this time. Full build in-progress at https://travis-ci.org/tswast/pandas-gbq/builds/340616570

@tswast merged commit 62ec85b into googleapis:master on Feb 12, 2018
@tswast deleted the issue-116-int-col branch on February 12, 2018 at 19:30
@max-sixty (Contributor) commented:

Congrats @tswast ! Thanks for pushing this through!

@tswast (Collaborator, Author) commented Feb 12, 2018

Yeah! I'll plan to do a release this week to get all of these to_gbq fixes out there.
