Encode before uploading #108

Merged: 12 commits into master, Jan 17, 2018

Conversation

@max-sixty (Contributor)

Potential fix for #106

...but someone with better bytes / str understanding needs to review

@codecov-io

codecov-io commented Jan 12, 2018

Codecov Report

Merging #108 into master will decrease coverage by 45.65%.
The diff coverage is 9.09%.


@@             Coverage Diff             @@
##           master     #108       +/-   ##
===========================================
- Coverage   73.92%   28.26%   -45.66%     
===========================================
  Files           4        4               
  Lines        1507     1560       +53     
===========================================
- Hits         1114      441      -673     
- Misses        393     1119      +726
Impacted Files Coverage Δ
pandas_gbq/gbq.py 20.37% <0%> (-56.53%) ⬇️
pandas_gbq/tests/test_gbq.py 26.84% <10.2%> (-55.88%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61bc28f...42882a2. Read the comment docs.

@max-sixty (Contributor, Author)

Travis passes. Not sure what's going on with codecov, but I think the PR should be good

self.destination_table + test_id),
project_id=_get_project_id())

assert result['num_rows'][0] == test_si
Contributor:

Typo: test_size

Contributor:

Also, perhaps a test that downloads the uploaded dataframe and verifies that the special characters are preserved would be nice?

@jasonqng (Contributor)

jasonqng commented Jan 14, 2018

Currently fails in my Python 2 env (error below) because the returned object from '{}\n'.format('\n'.join(rows)) is a byte string (str), which then fails to encode. Not sure if this is the most elegant way to handle it, but you can check the joined object and, if it is a str, decode it into a unicode object:

                joined_rows = '{}\n'.format('\n'.join(rows))
                if isinstance(joined_rows, str):
                    joined_rows = joined_rows.decode('utf-8')
                body = BytesIO(joined_rows.encode('utf-8'))

Above tweak works for me, and both your tests now pass. Verified data is good in GUI and reading back via pandas-gbq.

>>> gbq.read_gbq("select * from `XXXXX.ad_hoc.jn_test`", 'XXXXX', verbose=False, dialect='standard')
                  Date  integer      string
0  2017-12-13 17:40:39      300        lego
1  2017-12-13 17:40:39      200  Skywalker™
2  2017-12-13 17:40:39      400       hülle

Nice catch and fix!

Error:

>>> import sys; print sys.version_info
sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
>>> from pandas import DataFrame
>>> import gbq
>>> df = DataFrame({
... 'string': ['Skywalker™', 'lego', 'hülle'],
... 'integer': [200, 300, 400],
... 'Date': [
...     '2017-12-13 17:40:39', '2017-12-13 17:40:39',
...     '2017-12-13 17:40:39'
... ]
... })
>>> gbq.to_gbq(df, "ad_hoc.jn_test", project_id='XXXXX', if_exists='replace')



Load is 100% Complete
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "gbq.py", line 989, in to_gbq
    connector.load_data(dataframe, dataset_id, table_id, chunksize)
  File "gbq.py", line 584, in load_data
    body = BytesIO('{}\n'.format('\n'.join(rows)).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 63: ordinal not in range(128)

@max-sixty (Contributor, Author)

@jasonqng thank you very much. Even if it's a bit of a hack, I think it's good for the moment.

project_id=_get_project_id())

assert result['num_rows'][0] == test_size
tm.assert_series_equal(result['string'], df['string'])
@jasonqng (Contributor) Jan 14, 2018:

You'll want to either sort the result dataframe by integer or do an order by in the query to ensure that the assertion will pass (otherwise, the rows of result could be in a different order than your original df).
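A minimal sketch of that ordering fix, assuming pandas is available (the column names and the shuffle that stands in for BigQuery's arbitrary row order are illustrative, not the PR's actual test):

```python
import pandas as pd

df = pd.DataFrame({
    'integer': [200, 300, 400],
    'string': ['Skywalker', 'lego', 'hulle'],
})

# BigQuery may return rows in any order; simulate a shuffled result
result = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Sort both sides on a stable key before comparing, so the assertion
# does not depend on the order rows happen to come back in
result = result.sort_values('integer').reset_index(drop=True)
expected = df.sort_values('integer').reset_index(drop=True)

pd.testing.assert_series_equal(result['string'], expected['string'])
```

Alternatively, an ORDER BY in the query itself achieves the same thing server-side.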

@max-sixty (Contributor, Author)

@jasonqng on second try, unfortunately that solution didn't work for Python3. I've pushed something that's not elegant either, but it works.

The one thing that doesn't work is the string comparison when testing in Py2. If you look at the df it produces, it looks good, but comparing the strings is not successful. So I've skipped a subset of the test for the moment.

@max-sixty (Contributor, Author)

@tswast @jreback this is ready to go. The original reporter in #106 confirmed the fix works for them.

@@ -581,7 +581,11 @@ def load_data(self, dataframe, dataset_id, table_id, chunksize):
self._print("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))

-                body = StringIO('{}\n'.format('\n'.join(rows)))
+                body = '{}\n'.format('\n'.join(rows))
@tswast (Collaborator) Jan 17, 2018:

If you use u'{}\n'.format(u'\n'.join(rows)) is the if statement checking for bytes necessary?

Contributor (Author):

Unfortunately not:


>               body = u'{}\n'.format(u'\n'.join(rows))
E               UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 77: ordinal not in range(128)

I think the nub of the problem is that row.to_json comes out as either bytes or str depending on the Python version, so we need some branching somewhere. Unless there's a function in Python that can deal with both (this all seems a bit inelegant).

(I also tried decoding the row first on line 576, which made Py2 pass, but then Python 3 failed, because it can't decode unicode.)
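One version-agnostic way to express that branching is to normalize each row to text before encoding once at the end. This is a sketch with a hypothetical helper name, not the code the PR ended up with:

```python
from io import BytesIO


def rows_to_body(rows):
    """Normalize rows to text, then encode once to UTF-8 bytes.

    Handles rows that arrive as bytes (as row.to_json does on Python 2)
    or as text, avoiding the implicit ASCII decode that raises
    UnicodeDecodeError on non-ASCII input.
    """
    decoded = [
        r.decode('utf-8') if isinstance(r, bytes) else r
        for r in rows
    ]
    return BytesIO(u'{}\n'.format(u'\n'.join(decoded)).encode('utf-8'))


# Mixed text/bytes input, as might occur across Python versions
body = rows_to_body([u'{"string": "h\u00fclle"}', b'{"string": "lego"}'])
```

The branching happens per row rather than on the joined string, so the final format/join/encode path is the same on both Python versions.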

@tswast (Collaborator)

tswast commented Jan 17, 2018

Please add this to the changelog at https://github.com/pydata/pandas-gbq/blob/master/docs/source/changelog.rst under a new section heading 0.3.1 / (unreleased).

@max-sixty (Contributor, Author)

@tswast added whatsnew

@tswast tswast merged commit 3b112bf into googleapis:master Jan 17, 2018
@max-sixty max-sixty deleted the encoding branch January 17, 2018 22:30
@tswast (Collaborator)

tswast commented Jan 18, 2018

The build failed for this after merging because the test was waiting on user credentials:

pandas_gbq/tests/test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_chinese_unicode_data Please visit this URL to authorize this application:
...
Enter the authorization code: 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

I'm not entirely sure why this is happening, since this test is in the TestToGBQIntegrationWithServiceAccountKeyPath class.

@max-sixty (Contributor, Author)

Hmmm. I did add some tests to that class. I've removed them in this branch: https://github.com/maxim-lian/pandas-gbq/tree/test-fix (and included a fix that I didn't carry over to the tests in the ServiceAccount class).

Could you try running that in Travis?

@skion

skion commented Jan 18, 2018

Confirming that this made the following (note the accented character) work again for me on Python 3.6.3...

pd.DataFrame({'my_string': ['María']}).to_gbq('xxx.yyy', 'project-id')

@tswast (Collaborator)

tswast commented Jan 18, 2018

@maxim-lian That looks like it should fix it, thanks. Build running at https://travis-ci.org/tswast/pandas-gbq/builds/330577200

Note: there are instructions at https://pandas-gbq.readthedocs.io/en/latest/contributing.html#running-google-bigquery-integration-tests for setting up your personal fork to build with integration tests on Travis.

@max-sixty (Contributor, Author)

Ah great! Not sure how I missed that. That's super.

@tswast (Collaborator)

tswast commented Jan 18, 2018

Ugh. Still failing with a timeout on Travis.

pandas_gbq/tests/test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_chinese_unicode_data 

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

It's not asking for an authorization code anymore, so I'm not sure how it's getting stuck.

@tswast (Collaborator)

tswast commented Jan 18, 2018

I think I figured it out.

tswast@fb6a2f6

These tests need a private key to be manually set. Building at https://travis-ci.org/tswast/pandas-gbq/builds/330598420

@max-sixty (Contributor, Author)

Ooof, sorry to leave you with that. #109 would solve all our problems forevermore
