BUG: fix read_gbq lost precision for longs above 2^53 and floats above 10k #14064
820a4a6
to
a6543ec
Compare
Codecov Report
@@ Coverage Diff @@
## master #14064 +/- ##
==========================================
- Coverage 86.32% 86.32% -0.01%
==========================================
Files 141 141
Lines 51165 51170 +5
==========================================
+ Hits 44169 44170 +1
- Misses 6996 7000 +4
pandas/io/gbq.py
Outdated
    return float(field_value)
elif field_type == 'TIMESTAMP':
    timestamp = datetime.utcfromtimestamp(float(field_value))
    return np.datetime64(timestamp)
why are you changing this?
For two reasons:
- it is independent of the datetime package
- it is 4x faster
In [1]:
import timeit
timeit.timeit('import numpy as np; from datetime import datetime; np.datetime64(datetime.utcfromtimestamp(float(1234567890.123)))', number=1000000)
Out[1]:
6.163272747769952
In [2]:
timeit.timeit('import numpy as np; np.datetime64(int(float(1234567890.123)*1e6), "us")', number=1000000)
Out[2]:
1.5235848873853683
- doesn't really matter, but ok
- sure, for a single value. You are much better off leaving these as floats, then coercing them all in one go if you care about perf (using to_datetime)
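The suggestion above (keep the raw values as floats and convert the whole column at once) can be sketched as follows. This is only an illustration, not code from the PR: the sample values are made up, and it assumes the timestamps arrive as seconds since the Unix epoch.

```python
import numpy as np
import pandas as pd

# Hypothetical raw TIMESTAMP values: seconds since the epoch, kept as floats,
# with NaN standing in for a NULL.
raw_seconds = pd.Series([1234567890.5, 0.0, np.nan])

# One vectorized conversion for the whole column, instead of one
# np.datetime64(...) call per value.
timestamps = pd.to_datetime(raw_seconds, unit="s")

print(timestamps)
```

Missing values come out as `NaT`, so no special-casing of NULLs is needed before the conversion.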
It's not the point of this PR. I can restore the previous version or leave mine - what do you prefer?
you are changing lots of things, but have minimal tests for this. pls add some.
is this meant to close #14020?
I didn't see #14020 before, but this PR seems to answer @aschmolck's request
pandas/io/tests/test_gbq.py
Outdated
@@ -51,10 +51,6 @@ def _skip_if_no_private_key_contents():
        raise nose.SkipTest("Cannot run integration tests without a "
                            "private key json contents")

_skip_if_no_project_id()
_skip_if_no_private_key_path()
pls restore
needs more tests for datetimes
this PR is all about numeric precision loss. I'll restore my datetime improvement. It can be added as a separate PR, I think.
a6543ec
to
8e44062
Compare
8e44062
to
e36b30b
Compare
Ok, I've rebased and added full tests for this change (+ test clean up)
e36b30b
to
505792e
Compare
cc @parthea
@@ -4428,16 +4428,11 @@ DataFrame with a shape and data types derived from the source table.
Additionally, DataFrames can be inserted into new BigQuery tables or appended
to existing tables.

You will need to install some additional dependencies:
is this not true anymore?
see the Dependencies sub-chapter
then you need a pointer from the install.rst to the deps section
you need to add a pointer from here to the Dependency section of the docs (that you added below).
doc/source/io.rst
Outdated
@@ -38,7 +38,7 @@ object.
* :ref:`read_json<io.json_reader>`
* :ref:`read_msgpack<io.msgpack>` (experimental)
* :ref:`read_html<io.read_html>`
* :ref:`read_gbq<io.bigquery_reader>` (experimental)
* :ref:`read_gbq<io.bigquery>` (experimental)
why did you change this?
because I think a new user should start reading about BQ support from the beginning of the section, to learn about all the prerequisites (e.g. additional deps)
doc/source/io.rst
Outdated
Pandas supports these all `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
``STRING``, ``INTEGER`` (64bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
``TIMESTAMP`` (microsecond precision). Data types ``BYTES`` and ``RECORD``
are not supported.
are these unsupported types validated?
there is no need to validate them:
- read_gbq cannot encounter RECORD (because a query cannot return a RECORD column), and BYTES are stored as object (so, yes, there is some kind of support for it)
- to_gbq does not generate those types within schema generation (see _generate_bq_schema)
doc/source/io.rst
Outdated
are not supported.

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++
needs to be the same length as the heading
ok
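The length rule mentioned above can be checked mechanically. A minimal sketch: the helper name `underline_matches` is made up for illustration and is not part of pandas or Sphinx (reStructuredText itself only complains when the underline is shorter than the title).

```python
def underline_matches(title: str, underline: str) -> bool:
    """Return True if the underline is a single repeated character
    with exactly the same length as the heading title."""
    return (
        len(underline) > 0
        and len(set(underline)) == 1
        and len(underline) == len(title)
    )


title = "Integer and boolean ``NA`` handling"
fixed_underline = "+" * len(title)      # same length as the title
short_underline = fixed_underline[:-4]  # a few characters too short

print(underline_matches(title, fixed_underline))
print(underline_matches(title, short_underline))
```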
doc/source/io.rst
Outdated
++++++++++++

This module requires these additional dependencies:
ahh I c, ok well then make sure the pointer from the installation page goes here.
from where exactly - what file?
doc/source/io.rst
Outdated
.. _io.bigquery_authentication:

Authentication
''''''''''''''

.. versionadded:: 0.18.0
don't change things that are not related to this PR.
ok
doc/source/io.rst
Outdated
Authentication via ``application default credentials`` is also possible. This is only valid
if the parameter ``private_key`` is not provided. This method also requires that
the credentials can be fetched from the environment the code is running in.
Otherwise, the OAuth2 client-side authentication is used.
Additional information on
`application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__.

.. versionadded:: 0.19.0
why are you taking things out like this? pls don't
because I don't see them in the resulting doc... can this 'thing' cause something to be displayed in the generated html?
doc/source/whatsnew/v0.19.0.txt
Outdated
@@ -396,8 +396,9 @@ For ``MultiIndex``, values are dropped if any level is missing by default. Speci

Google BigQuery Enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :func:`pandas.io.gbq.read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
The :func:`pandas.io.gbq.to_gbq` method now allows the DataFrame column order to differ from the destination table schema (:issue:`11359`).
- The :func:`pandas.io.gbq.read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
move to 0.19.1 (only this particular change)
Or rather 0.20.0? The current state of this PR changes the behaviour for certain integer columns, so I don't think it needs to go into 0.19.1.
@jreback what's your final preference here?
@jorisvandenbossche makes a good point, so 0.20.0
ok
This is opposite to default pandas behaviour which will promote integer
type to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
for detailed explaination.
actually I disagree with this approach. I would only cast to object if the values are not representable (IOW big), otherwise follow the pandas standard and promote to float. It is much more natural from a user point of view.
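The trade-off being discussed can be seen in a few lines. This is a small sketch of general pandas behaviour with made-up values, not code from the PR: a missing value makes an integer column float64 by default, which silently rounds integers above 2**53, while an object column keeps them intact.

```python
import pandas as pd

big = 2**53 + 1  # smallest positive integer that float64 cannot represent exactly

# Default pandas behaviour: the missing value forces the column to float64.
as_float = pd.Series([big, None])

# Explicit object dtype keeps the original Python int untouched.
as_object = pd.Series([big, None], dtype=object)

print(as_float.dtype, int(as_float.iloc[0]) == big)
print(as_object.dtype, as_object.iloc[0] == big)
```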
I was trying to get your opinion on this approach about a month ago, but there was silence, which I took as a sign that you had no objections.
Now that the code is ready and tested, I see this. That's one point.
Second point. If I understand correctly, your proposed approach will use three data types to store BQ INTEGER columns:
- int (or int64) by default
- float when there are nulls and all other values are less than 2**53
- object when there are nulls and values greater than 2**53
Are you sure it is worth complicating things which can be much simpler (see my solution)?
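For reference, the three-way rule listed above could look roughly like this. It is a hypothetical sketch, not the implementation in the PR: `choose_integer_storage` and `FLOAT64_EXACT_LIMIT` are invented names, NULLs are modelled as `None`, and a value of exactly 2**53 is sent to `object` simply because the list only says "less than".

```python
from typing import Optional, Sequence

# Above this magnitude, not every integer has an exact float64 representation.
FLOAT64_EXACT_LIMIT = 2**53


def choose_integer_storage(values: Sequence[Optional[int]]) -> str:
    """Pick a storage type name for one INTEGER column."""
    has_null = any(v is None for v in values)
    if not has_null:
        return "int64"  # no NULLs: plain integers are enough
    fits_float = all(
        abs(v) < FLOAT64_EXACT_LIMIT for v in values if v is not None
    )
    # NULLs present: float64 if every value survives the round trip, else object.
    return "float64" if fits_float else "object"


print(choose_integer_storage([1, 2, 3]))
print(choose_integer_storage([1, None]))
print(choose_integer_storage([2**53 + 1, None]))
```

The extra branch is the complication the comment refers to: the resulting dtype depends on the data returned by a query, not only on the column type.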
@tworec we have 96 PR's open at the moment. I don't always comment till someone indicates things are ready.
can you see whether the column is nullable or not a-priori?
ok, I understand :)
no, all columns in BigQuery queries are nullable
pandas/io/tests/test_gbq.py
Outdated
@@ -493,6 +497,39 @@ def test_should_read_as_service_account_with_key_contents(self):
private_key=_get_private_key_contents())
tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']}))


class TestReadGBQIntegrationWithServiceAccountKeyPath(tm.TestCase):
just make a new class instead of changing existing code.
@parthea changed this class to use the private key path to authorize (before, it used a user account) but did not change the name of the class. This rename is meant to fix this inconsistency.
505792e
to
f32edcb
Compare
I think it might be better to force the user to specify a
I offer a solution with default behavior. Can we stop here?
@jorisvandenbossche I've corrected the docs. Regarding float storage: my proposal is described in the comment and implemented with tests
gentlemen @jreback @jorisvandenbossche 😃
@tworec sorry we got a bit lost in the shuffle. can you rebase and we can get this in.
f641fad
to
aa248da
Compare
@jreback ok, I've rebased
aa248da
to
f97fcdb
Compare
@parthea thoughts?
f97fcdb
to
5a22fca
Compare
hi @jreback, can we finally merge this? Please decide if you want this change or drop it.
I was waiting for @parthea to come back on this.
oh, I c. We'll wait then.
I added a few minor comments. I think _get_private_key_path() should not be called in TestToGBQIntegrationWithLocalUserAccountAuth.
Separately, it would be great if you could follow the steps in the contributing docs to run the integration tests from your fork on Travis-CI. You only need to do the Travis configuration described in the contributing docs once and you're all set for future PRs.
doc/source/io.rst
Outdated
Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++++++

.. versionadded:: 0.19
Please update the .. versionadded::
pandas/io/tests/test_gbq.py
Outdated
from math import pi
query = '''select * from
(select PI() * POW(10, 307) as NULLABLE_DOUBLE),
(select null as NULLABLE_DOUBLE)'''
SQL reserved words should be upper case for consistency with existing code. We should either use the following or change existing code if there is a preference to change the SQL formatting.
query = '''SELECT * FROM
(SELECT PI() * POW(10, 307) AS Nullable_Double),
(SELECT NULL AS Nullable_Double)'''
done
pandas/io/tests/test_gbq.py
Outdated
from math import pi
query = '''select * from
(select PI() as NULLABLE_FLOAT),
(select null as NULLABLE_FLOAT)'''
SQL keywords should be uppercase for consistency with existing code.
Oh, much of old code was not consistent in that sense. But I've tried to fix all inconsistencies. Please check me :)
pandas/io/tests/test_gbq.py
Outdated
@@ -1124,7 +1228,7 @@ def setUpClass(cls):
# executing *ALL* tests described below.

_skip_if_no_project_id()
_skip_if_no_private_key_path()
_skip_local_auth_if_in_travis_env()

_setup_common()
clean_gbq_environment(_get_private_key_path())
Since we are testing local user account auth, can we change this to clean_gbq_environment()? We should also make the same change in tearDownClass.
you're right. done.
5a22fca
to
5e476d6
Compare
I fixed things as you suggested, @parthea. Please review once again.
I don't fully trust Travis CI, so I need to create a separate empty project for this purpose. And I'm not on my own: my company needs to set up billing for it. It will take time.
doc/source/io.rst
Outdated
Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++++++

.. versionadded:: 0.19
thx, I've missed it
7d978d9
to
65e3327
Compare
@parthea please review, because every time I look at this change it needs a manual rebase (conflicts).
@parthea my Travis CI setup is broken; it causes errors with pickles, e.g. https://travis-ci.org/RTBHOUSE/pandas/jobs/199354963 Can you help?
@tworec I'm sorry for the delay. The latest commit looks good. Could you try running
@parthea thx, I've done it. Output is below. Why is this needed -- tags should not affect building, right?
65e3327
to
50d31ce
Compare
fixes: - lost precision for longs above 2^53 - and floats above 10k
50d31ce
to
788ccee
Compare
I've squashed
@tworec when git clones a repo it doesn't clone the tags, so they are from your master whenever it was cloned, NOT the current one. This behavior baffles me, but that's how it works :<
ok, now my travis build works well
thanks for the patience @tworec. You can also put up a PR to: https://github.com/pydata/pandas-gbq (will need to take out the documentation changes, but otherwise should be clean).
…e 10k
closes pandas-dev#14020
closes pandas-dev#14305
Author: Piotr Chromiec <piotr.chromiec@rtbhouse.com>
Closes pandas-dev#14064 from tworec/read_gbq_full_long_support and squashes the following commits:
788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision
fixes:
Also contains test_gbq.py clean up
git diff upstream/master | flake8 --diff