
BUG: Fix uploading of dataframes containing int64 and float64 columns #117


Merged: 7 commits merged into googleapis:master from issue-116-int-col on Feb 12, 2018

Conversation

@tswast (Collaborator) commented Feb 10, 2018

Fixes #116 and #96 by loading data in CSV chunks.
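In rough terms (an illustrative sketch only, assuming the google-cloud-bigquery client; the helper and variable names here are hypothetical, not the PR's actual code), the approach serializes each chunk of rows to an in-memory CSV buffer and uploads it with its own load job, so int64 and float64 columns arrive as plain CSV values:

import six
from google.cloud import bigquery


def upload_in_csv_chunks(client, dataframe, table_ref, chunksize):
    # Hypothetical sketch of chunked CSV loading.
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'CSV'
    for start in range(0, len(dataframe), chunksize):
        chunk = dataframe.iloc[start:start + chunksize]
        # Serialize the chunk to CSV in memory, without the header row.
        csv_buffer = six.StringIO()
        chunk.to_csv(csv_buffer, index=False, header=False)
        body = csv_buffer.getvalue().encode('utf-8')
        # One load job per chunk; block until it finishes.
        client.load_table_from_file(
            six.BytesIO(body), table_ref, job_config=job_config).result()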

@max-sixty (Contributor) left a comment

This looks superb!

return six.BytesIO(body)


def encode_chunks(dataframe, chunksize):
@max-sixty (Contributor) commented on the diff:

Why the multiple chunks, rather than using a single chunk? Is it a memory issue? A UI / status bar updating issue?

@tswast (Collaborator, Author) replied:

Because previously chunksize was required and I wasn't ready to make it optional. I've just added a commit to this PR to make it optional. We'll want to update the default in pandas after we release a package with this change.
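A minimal sketch of what an optional chunksize could look like (hypothetical code, not the committed change): treat a missing chunksize as "encode everything as a single chunk".

import six


def encode_chunks(dataframe, chunksize=None):
    # Hypothetical sketch: when chunksize is None, emit one chunk for the
    # whole dataframe.
    if chunksize is None:
        chunksize = len(dataframe) or 1
    for start in range(0, len(dataframe), chunksize):
        chunk = dataframe.iloc[start:start + chunksize]
        csv_buffer = six.StringIO()
        chunk.to_csv(csv_buffer, index=False, header=False)
        yield six.BytesIO(csv_buffer.getvalue().encode('utf-8'))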

@@ -0,0 +1,26 @@

@max-sixty (Contributor) commented on the diff:

FYI this file name currently has two underscores

@tswast (Collaborator, Author) replied:

Thanks, I'm aware. I'm following the convention that the filename should be test_ plus the filename of the file under test.

@codecov-io commented Feb 10, 2018

Codecov Report

Merging #117 into master will increase coverage by 2.74%.
The diff coverage is 71.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #117      +/-   ##
==========================================
+ Coverage   28.25%   30.99%   +2.74%     
==========================================
  Files           4        8       +4     
  Lines        1561     1626      +65     
==========================================
+ Hits          441      504      +63     
- Misses       1120     1122       +2
Impacted Files                       Coverage Δ
pandas_gbq/tests/test__schema.py     100% <100%> (ø)
pandas_gbq/tests/test__load.py       100% <100%> (ø)
pandas_gbq/_schema.py                100% <100%> (ø)
pandas_gbq/tests/test_gbq.py         26.88% <14.28%> (-0.99%) ⬇️
pandas_gbq/gbq.py                    20.56% <46.66%> (+1.94%) ⬆️
pandas_gbq/_load.py                  62.5% <62.5%> (ø)
... and 2 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f040c18...1ce8b0b.

@tswast (Collaborator, Author) commented Feb 10, 2018

Full Travis build with system tests running at https://travis-ci.org/tswast/pandas-gbq/builds/339826702

@tswast (Collaborator, Author) commented Feb 10, 2018

Ah, Travis uncovered a potential problem with using CSV.

=================================== FAILURES ===================================
 TestToGBQIntegrationWithServiceAccountKeyPath.test_upload_subset_columns_if_table_exists_append 
self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f28719af610>
    def test_upload_subset_columns_if_table_exists_append(self):
        # Issue 24: Upload is succesful if dataframe has columns
        # which are a subset of the current schema
        test_id = "16"
        test_size = 10
        df = make_mixed_dataframe_v2(test_size)
        df_subset_cols = df.iloc[:, :2]
    
        # Initialize table with sample data
        gbq.to_gbq(df, self.destination_table + test_id, _get_project_id(),
                   chunksize=10000, private_key=_get_private_key_path())
    
        # Test the if_exists parameter with value 'append'
        gbq.to_gbq(df_subset_cols,
                   self.destination_table + test_id, _get_project_id(),
>                  if_exists='append', private_key=_get_private_key_path())
pandas_gbq/tests/test_gbq.py:1096: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas_gbq/gbq.py:978: in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize=chunksize)
pandas_gbq/gbq.py:572: in load_data
    self.process_http_error(ex)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ex = BadRequest(u'Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1.',)
    @staticmethod
    def process_http_error(ex):
        # See `BigQuery Troubleshooting Errors
        # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
    
>       raise GenericGBQException("Reason: {0}".format(ex))
E       GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 1; errors: 1.
pandas_gbq/gbq.py:455: GenericGBQException
 TestToGBQIntegrationWithServiceAccountKeyPath.test_upload_data_flexible_column_order 
self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7f2872fff310>
    def test_upload_data_flexible_column_order(self):
        test_id = "13"
        test_size = 10
        df = make_mixed_dataframe_v2(test_size)
    
        # Initialize table with sample data
        gbq.to_gbq(df, self.destination_table + test_id, _get_project_id(),
                   chunksize=10000, private_key=_get_private_key_path())
    
        df_columns_reversed = df[df.columns[::-1]]
    
        gbq.to_gbq(df_columns_reversed, self.destination_table + test_id,
                   _get_project_id(), if_exists='append',
>                  private_key=_get_private_key_path())

I think I probably need to include the schema definition in the load job, since we want to be able to upload a data frame even if the columns are out of order or there are extra columns.
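For illustration, passing an explicit schema to each load job might look roughly like this (a hypothetical sketch assuming the google-cloud-bigquery client; load_chunk and schema_fields are made-up names, and schema_fields is assumed to look like the 'fields' list produced by _generate_bq_schema):

from google.cloud import bigquery


def load_chunk(client, chunk_buffer, table_ref, schema_fields):
    # Hypothetical sketch: attach an explicit schema to the CSV load job.
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = 'CSV'
    job_config.write_disposition = 'WRITE_APPEND'
    job_config.schema = [
        bigquery.SchemaField(field['name'], field['type'])
        for field in schema_fields
    ]
    return client.load_table_from_file(
        chunk_buffer, table_ref, job_config=job_config)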

@max-sixty (Contributor) commented:

> I think I probably need to include the schema definition in the load job, since we want to be able to upload a data frame even if the columns are out of order or there are extra columns.

Yes, for sure. I think you can do that fairly easily with schema=_generate_bq_schema(df).

(We have this helper, but I can't remember why we do the object construction:)

def _bq_schema(df):
    schema_dict = _gbq._generate_bq_schema(df)
    schema = [bigquery.schema.SchemaField(x['name'], x['type'])
              for x in schema_dict['fields']]
    return schema

@tswast (Collaborator, Author) commented Feb 12, 2018

I imagine you had to do the object construction as a workaround for googleapis/google-cloud-python#4456

@tswast (Collaborator, Author) commented Feb 12, 2018

Okay, I think I got it this time. Full build in-progress at https://travis-ci.org/tswast/pandas-gbq/builds/340616570

@tswast merged commit 62ec85b into googleapis:master on Feb 12, 2018
@tswast deleted the issue-116-int-col branch on February 12, 2018 at 19:30
@max-sixty (Contributor) commented:

Congrats @tswast ! Thanks for pushing this through!

@tswast (Collaborator, Author) commented Feb 12, 2018

Yeah! I'll plan to do a release this week to get all of these to_gbq fixes out there.
