Encode before uploading #108

Merged: 12 commits into master, Jan 17, 2018

Conversation

@max-sixty (Contributor)

Potential fix for #106

...but someone with better bytes / str understanding needs to review

@codecov-io

codecov-io commented Jan 12, 2018

Codecov Report

Merging #108 into master will decrease coverage by 45.65%.
The diff coverage is 9.09%.


@@             Coverage Diff             @@
##           master     #108       +/-   ##
===========================================
- Coverage   73.92%   28.26%   -45.66%     
===========================================
  Files           4        4               
  Lines        1507     1560       +53     
===========================================
- Hits         1114      441      -673     
- Misses        393     1119      +726
Impacted Files Coverage Δ
pandas_gbq/gbq.py 20.37% <0%> (-56.53%) ⬇️
pandas_gbq/tests/test_gbq.py 26.84% <10.2%> (-55.88%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61bc28f...42882a2. Read the comment docs.

@max-sixty (Contributor, Author)

Travis passes. Not sure what's going on with codecov, but I think the PR should be good

self.destination_table + test_id),
project_id=_get_project_id())

assert result['num_rows'][0] == test_si
Contributor:

Typo: test_size

Contributor:

Also, perhaps a test that downloads the uploaded dataframe and verifies that the special characters are preserved would be nice?

@jasonqng (Contributor)

jasonqng commented Jan 14, 2018

Currently fails in my Python 2 env (error below) because the returned object from '{}\n'.format('\n'.join(rows)) is a byte string (str), which then fails to encode. Not sure if this is the most elegant way to handle it, but you can check the joined object and, if it is a str, decode it into a unicode object:

                joined_rows = '{}\n'.format('\n'.join(rows))
                if isinstance(joined_rows, str):
                    joined_rows = joined_rows.decode('utf-8')
                body = BytesIO(joined_rows.encode('utf-8'))

Above tweak works for me, and both your tests now pass. Verified data is good in GUI and reading back via pandas-gbq.

>>> gbq.read_gbq("select * from `XXXXX.ad_hoc.jn_test`", 'XXXXX', verbose=False, dialect='standard')
                  Date  integer      string
0  2017-12-13 17:40:39      300        lego
1  2017-12-13 17:40:39      200  Skywalker™
2  2017-12-13 17:40:39      400       hülle

Nice catch and fix!

Error:

>>> import sys; print sys.version_info
sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
>>> from pandas import DataFrame
>>> import gbq
>>> df = DataFrame({
... 'string': ['Skywalker™', 'lego', 'hülle'],
... 'integer': [200, 300, 400],
... 'Date': [
...     '2017-12-13 17:40:39', '2017-12-13 17:40:39',
...     '2017-12-13 17:40:39'
... ]
... })
>>> gbq.to_gbq(df, "ad_hoc.jn_test", project_id='XXXXX', if_exists='replace')



Load is 100% Complete
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "gbq.py", line 989, in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
  File "gbq.py", line 584, in load_data
    body = BytesIO('{}\n'.format('\n'.join(rows)).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 63: ordinal not in range(128)

@max-sixty (Contributor, Author)

@jasonqng thank you very much. Even if it's a bit of a hack, I think it's good for the moment.

project_id=_get_project_id())

assert result['num_rows'][0] == test_size
tm.assert_series_equal(result['string'], df['string'])
@jasonqng (Contributor) Jan 14, 2018:

You'll want to either sort the result dataframe by integer or do an order by in the query to ensure that the assertion will pass (otherwise, the rows of result could be in a different order than your original df).
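A minimal sketch of that ordering fix, assuming pandas is available (the column names and the shuffle that stands in for BigQuery's arbitrary row order are illustrative, not the PR's actual test):

```python
import pandas as pd

df = pd.DataFrame({
    'integer': [200, 300, 400],
    'string': ['Skywalker', 'lego', 'hulle'],
})

# BigQuery may return rows in any order; simulate a shuffled result
result = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Sort both sides on a stable key before comparing, so the assertion
# does not depend on the order rows happen to come back in
result = result.sort_values('integer').reset_index(drop=True)
expected = df.sort_values('integer').reset_index(drop=True)

pd.testing.assert_series_equal(result['string'], expected['string'])
```

Alternatively, an ORDER BY in the query itself achieves the same thing server-side.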

@max-sixty (Contributor, Author)

@jasonqng on second try, unfortunately that solution didn't work for Python3. I've pushed something that's not elegant either, but it works.

The one thing that doesn't work is the string comparison when testing in Py2. If you look at the df it produces, it looks good, but comparing the strings is not successful. So I've skipped a subset of the test for the moment.

@max-sixty (Contributor, Author)

@tswast @jreback this is ready to go. The original reporter in #106 confirmed the fix works for them.

@@ -581,7 +581,11 @@ def load_data(self, dataframe, dataset_id, table_id, chunksize):
self._print("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))

-                body = StringIO('{}\n'.format('\n'.join(rows)))
+                body = '{}\n'.format('\n'.join(rows))
@tswast (Collaborator) Jan 17, 2018:

If you use u'{}\n'.format(u'\n'.join(rows)) is the if statement checking for bytes necessary?

Contributor (Author):

Unfortunately not:


>               body = u'{}\n'.format(u'\n'.join(rows))
E               UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 77: ordinal not in range(128)

I think the nub of the problem is that row.to_json comes out as either bytes or str depending on the Python version, so we need some branching somewhere. Unless there's a function in Python that can deal with both (this all seems a bit inelegant).

(I also tried decoding the row first on line 576, which made Py2 pass, but then Python 3 failed, because it can't decode unicode.)
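One version-agnostic way to express that branching is to normalize each row to text before encoding once at the end. This is a sketch with a hypothetical helper name, not the code the PR ended up with:

```python
from io import BytesIO


def rows_to_body(rows):
    """Normalize rows to text, then encode once to UTF-8 bytes.

    Handles rows that arrive as bytes (as row.to_json does on Python 2)
    or as text, avoiding the implicit ASCII decode that raises
    UnicodeDecodeError on non-ASCII input.
    """
    decoded = [
        r.decode('utf-8') if isinstance(r, bytes) else r
        for r in rows
    ]
    return BytesIO(u'{}\n'.format(u'\n'.join(decoded)).encode('utf-8'))


# Mixed text/bytes input, as might occur across Python versions
body = rows_to_body([u'{"string": "h\u00fclle"}', b'{"string": "lego"}'])
```

The branching happens per row rather than on the joined string, so the final format/join/encode path is the same on both Python versions.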

@tswast (Collaborator)

tswast commented Jan 17, 2018

Please add this to the changelog at https://github.com/pydata/pandas-gbq/blob/master/docs/source/changelog.rst under a new section heading 0.3.1 / (unreleased).

@max-sixty (Contributor, Author)

@tswast added whatsnew

@tswast tswast merged commit 3b112bf into googleapis:master Jan 17, 2018
@max-sixty max-sixty deleted the encoding branch January 17, 2018 22:30
@tswast (Collaborator)

tswast commented Jan 18, 2018

The build failed for this after merging because the test was waiting on user credentials:

pandas_gbq/tests/test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_chinese_unicode_data Please visit this URL to authorize this application:
...
Enter the authorization code: 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

I'm not entirely sure why this is happening, since this test is in the TestToGBQIntegrationWithServiceAccountKeyPath class.

@max-sixty (Contributor, Author)

Hmmm. I did add some tests to that class. I've removed them in this branch: https://github.com/maxim-lian/pandas-gbq/tree/test-fix (and included a fix that I didn't carry over to the tests in the ServiceAccount class).

Could you try running that in Travis?

@skion

skion commented Jan 18, 2018

Confirming that this made the following (note the accented character) work again for me on Python 3.6.3...

pd.DataFrame({'my_string': ['María']}).to_gbq('xxx.yyy', 'project-id')

@tswast (Collaborator)

tswast commented Jan 18, 2018

@maxim-lian That looks like it should fix it, thanks. Build running at https://travis-ci.org/tswast/pandas-gbq/builds/330577200

Note: there are instructions at https://pandas-gbq.readthedocs.io/en/latest/contributing.html#running-google-bigquery-integration-tests for setting up your personal fork to build with integration tests on Travis.

@max-sixty (Contributor, Author)

Ah great! Not sure how I missed that. That's super.

@tswast (Collaborator)

tswast commented Jan 18, 2018

Ugh. Still failing with a timeout on Travis.

pandas_gbq/tests/test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_chinese_unicode_data 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

It's not asking for an authorization code anymore, so I'm not sure how it's getting stuck.

@tswast (Collaborator)

tswast commented Jan 18, 2018

I think I figured it out.

tswast@fb6a2f6

These tests need a private key to be manually set. Building at https://travis-ci.org/tswast/pandas-gbq/builds/330598420

@max-sixty (Contributor, Author)

Ooof, sorry to leave you with that. #109 would solve all our problems forevermore
