to_gbq result in UnicodeEncodeError #106
Comments
Can you slim the dataframe down to a small subset and post it? |
Sure, the following would fail when I try to post to BigQuery using to_gbq( ) on heroku but seems to run fine on my Mac.
|
It works fine on Linux for me (though, oddly, it won't print the characters on screen, whereas my Mac will). |
Thanks. I've checked SO and have even tried to set the locale. I think the issue is earlier. I'm not sure why, by the time the dataframe gets to http.client, it's still a str. Shouldn't it have been converted to bytes much earlier? The error results from http.client having to encode the str as latin-1, which fails on Chinese characters. |
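To illustrate the failure mode described in this comment, here is a minimal sketch that uses only built-in str methods. The sample string is the one quoted in the traceback from this issue; nothing here is pandas-gbq code. The same text encodes cleanly as UTF-8 but cannot be encoded as latin-1, the codec named in the error.

```python
# Sample text taken from the error message reported in this issue.
body = "信用卡"

# Encoding explicitly as UTF-8 succeeds and yields bytes.
utf8_bytes = body.encode("utf-8")

# Encoding as latin-1 fails because these characters are outside its range.
try:
    body.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError as err:
    latin1_failed = True
    # err.start and err.end delimit the run of characters that could not be encoded.
    failing_text = body[err.start:err.end]
```

This is why a str body that reaches the HTTP layer without an explicit encoding step can work for ASCII-only data and break as soon as other characters appear.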
Same issue here with version 0.3.0; it works on version 0.2.1 with exactly the same data. I tried pushing the same df loaded from a CSV with encoding='UTF-8' and still get the same error. |
Are you on Py2 or Py3? |
@2legit that makes sense. Do you happen to know where it's attempting to convert to latin-1? |
Hi, yes, in the stack trace you can see it happens in _encode() on line 161 of client.py. I'm using Python 3.6.3. python3.6/http/client.py", line 161, in _encode |
I think @northlaender is right. Mac vs. Linux is a red herring. Something is broken in pandas-gbq v0.3.0 that was working fine in v0.2.1. I'm using the latter on my Mac, but Heroku pulls the latest pandas-gbq since I didn't pin the version. I just checked, and Heroku is running v0.3.0, which breaks on non-Latin characters. |
@maxim-lian I'm on Python 3.6.2 |
If you want to try the fix, do |
hi @maxim-lian just did a quick test with the same dataset and still no success.
gbq.__version__ '0.1.2+66.g61bc28f'
Load is 100.0% Complete
UnicodeEncodeError Traceback (most recent call last)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\pandas_gbq\gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\cloud\bigquery\client.py in load_table_from_file(self, file_obj, destination, rewind, size, num_retries, job_id, job_id_prefix, job_config)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\cloud\bigquery\client.py in _do_resumable_upload(self, stream, metadata, num_retries)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\requests\upload.py in transmit_next_chunk(self, transport)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\requests\_helpers.py in http_request(transport, method, url, data, headers, retry_strategy, **transport_kwargs)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\resumable_media\_helpers.py in wait_and_retry(func, get_status_code, retry_strategy)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\google\auth\transport\requests.py in request(self, method, url, data, headers, **kwargs)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\sessions.py in send(self, request, **kwargs)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\requests\adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\urllib3\connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
c:\users\northlaender\stack\python-stack\p3w\lib\site-packages\urllib3\connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in request(self, method, url, body, headers, encode_chunked)
c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in _send_request(self, method, url, body, headers, encode_chunked)
c:\users\northlaender\stack\python-stack\p3w\lib\http\client.py in _encode(data, name)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2122' in position 163736: Body ('™') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8. |
Thanks for testing. Do you have a reproducible example? |
hi @maxim-lian happy to help. to_gbq breaks on the ™ |
@northlaender I've made a change and added that exact case as a test - if you get a moment it would be super if you could try again. Thank you! |
hi @maxim-lian the latest fix didn't yield any other results. |
@northlaender Thanks for checking. When I run the test, though, I can see the Unicode in gbq on Python 2, and the test passes fine on Python 3. Would you be able to paste a stack trace? (To be clear, you need to rerun pip3 install git+https://github.com/maxim-lian/pandas-gbq.git#encoding for the new version to install.) |
Hi @maxim-lian issue was clearly with updating the code via pip. I now rechecked and it works ! 👍 |
I can't get this to work :( I'm installing with pip as I don't have pip3 and can't figure out if it's really different given my root install is Python3. |
@DanielWFrancis what do you see when you run This will confirm where Python is sourcing the library from:
In [1]: import pandas_gbq
In [2]: pandas_gbq.__file__
Out[2]: '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas_gbq/__init__.py' |
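The same check can be wrapped in a small helper that also reports the installed version, which helps when a pip reinstall may not have taken effect. This is only a sketch: `describe_install` is a hypothetical name, and it assumes Python 3.8 or later for `importlib.metadata`, which is newer than the interpreters discussed in this thread.

```python
import importlib
from importlib import metadata


def describe_install(module_name, dist_name=None):
    """Return (file path, version) for an importable module.

    dist_name is the distribution name used by pip when it differs from
    the module name; version is None when no metadata is found.
    """
    module = importlib.import_module(module_name)
    path = getattr(module, "__file__", None)
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        version = None
    return path, version
```

For this issue the call would be `describe_install("pandas_gbq", "pandas-gbq")`; if the path or version is not what the reinstall should have produced, the interpreter is picking up an older copy.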
Hi, Python 3.6 |
@smred Can you post your versions? |
@smred I worked around by opening the JSON file that I built my dataframe from with
I presume this will work for any data source you might be using. |
Hi, I'm using Heroku to run a Python-based ETL process where I'm pushing the contents of a pandas dataframe into Google BQ using to_gbq. However, it's generating a UnicodeEncodeError with the following stack trace, due to some non-Latin characters.
Strangely, this works fine on my Mac, but it fails when I run it on Heroku. It seems that, for some reason, http.client is getting an unencoded str rather than bytes, so it tries to encode the body with latin-1, which is the default but chokes on anything outside Latin-1, such as Chinese characters.
2018-01-08T04:54:17.307496+00:00 app[run.2251]: Load is 100.0% Complete
2018-01-08T04:54:20.443238+00:00 app[run.2251]: Traceback (most recent call last):
2018-01-08T04:54:20.443267+00:00 app[run.2251]: File "AllCostAndRev.py", line 534, in <module>
2018-01-08T04:54:20.443708+00:00 app[run.2251]: main(yaml.dump(data=ads_dict))
2018-01-08T04:54:20.443710+00:00 app[run.2251]: File "AllCostAndRev.py", line 475, in main
2018-01-08T04:54:20.443915+00:00 app[run.2251]: private_key=environ['skynet_bq_pk']
2018-01-08T04:54:20.443917+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 989, in to_gbq
2018-01-08T04:54:20.444390+00:00 app[run.2251]: connector.load_data(dataframe, dataset_id, table_id, chunksize)
2018-01-08T04:54:20.444391+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/pandas_gbq/gbq.py", line 590, in load_data
2018-01-08T04:54:20.444653+00:00 app[run.2251]: job_config=job_config).result()
2018-01-08T04:54:20.444656+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 748, in load_table_from_file
2018-01-08T04:54:20.444942+00:00 app[run.2251]: file_obj, job_resource, num_retries)
2018-01-08T04:54:20.444943+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/google/cloud/bigquery/client.py", line 777, in _do_resumable_upload
2018-01-08T04:54:20.445248+00:00 app[run.2251]: response = upload.transmit_next_chunk(transport)
2018-01-08T04:54:20.445250+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/upload.py", line 395, in transmit_next_chunk
2018-01-08T04:54:20.445457+00:00 app[run.2251]: retry_strategy=self._retry_strategy)
2018-01-08T04:54:20.445458+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/requests/_helpers.py", line 101, in http_request
2018-01-08T04:54:20.445592+00:00 app[run.2251]: func, RequestsMixin._get_status_code, retry_strategy)
2018-01-08T04:54:20.445594+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/google/resumable_media/_helpers.py", line 146, in wait_and_retry
2018-01-08T04:54:20.445725+00:00 app[run.2251]: response = func()
2018-01-08T04:54:20.445726+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/google/auth/transport/requests.py", line 186, in request
2018-01-08T04:54:20.445866+00:00 app[run.2251]: method, url, data=data, headers=request_headers, **kwargs)
2018-01-08T04:54:20.445867+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
2018-01-08T04:54:20.446099+00:00 app[run.2251]: resp = self.send(prep, **send_kwargs)
2018-01-08T04:54:20.446101+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 618, in send
2018-01-08T04:54:20.446456+00:00 app[run.2251]: r = adapter.send(request, **kwargs)
2018-01-08T04:54:20.446457+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/requests/adapters.py", line 440, in send
2018-01-08T04:54:20.446728+00:00 app[run.2251]: timeout=timeout
2018-01-08T04:54:20.446730+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
2018-01-08T04:54:20.446969+00:00 app[run.2251]: chunked=chunked)
2018-01-08T04:54:20.446970+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 357, in _make_request
2018-01-08T04:54:20.447229+00:00 app[run.2251]: conn.request(method, url, **httplib_request_kw)
2018-01-08T04:54:20.447231+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/http/client.py", line 1239, in request
2018-01-08T04:54:20.447689+00:00 app[run.2251]: self._send_request(method, url, body, headers, encode_chunked)
2018-01-08T04:54:20.447690+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/http/client.py", line 1284, in _send_request
2018-01-08T04:54:20.448232+00:00 app[run.2251]: body = _encode(body, 'body')
2018-01-08T04:54:20.448234+00:00 app[run.2251]: File "/app/.heroku/python/lib/python3.6/http/client.py", line 161, in _encode
2018-01-08T04:54:20.448396+00:00 app[run.2251]: (name.title(), data[err.start:err.end], name)) from None
2018-01-08T04:54:20.448405+00:00 app[run.2251]: UnicodeEncodeError: 'latin-1' codec can't encode characters in position 553626-553628: Body ('信用卡') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.
2018-01-08T04:54:20.621819+00:00 heroku[run.2251]: State changed from up to complete
2018-01-08T04:54:20.609814+00:00 heroku[run.2251]: Process exited with status 1
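The diagnosis in this report is that the upload body reaches http.client as a str instead of bytes. One general way to avoid that is to encode the serialized data explicitly before handing it to any upload call. The sketch below uses only the standard library; `rows_to_utf8_buffer` is a hypothetical helper for illustration, not the change that was made in pandas-gbq.

```python
import csv
import io


def rows_to_utf8_buffer(rows):
    """Serialize rows to CSV and return a binary buffer of UTF-8 bytes."""
    text_buffer = io.StringIO()
    csv.writer(text_buffer).writerows(rows)
    # Encode once, explicitly, so no later layer has to guess a codec.
    return io.BytesIO(text_buffer.getvalue().encode("utf-8"))


# Example rows containing the characters from the traceback above.
buffer = rows_to_utf8_buffer([["name", "note"], ["card", "信用卡"]])
payload = buffer.read()
```

Because `payload` is already bytes, an HTTP client has nothing left to encode and never falls back to latin-1.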