BUG: fix read_gbq lost precision for longs above 2^53 and floats above 10k #14064

Closed

Conversation


@tworec tworec commented Aug 22, 2016

fixes:

  • lost precision for longs above 2^53
  • and floats above 10^4

Also contains test_gbq.py clean up
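For context, a minimal sketch (not code from this PR) of the integer precision loss described above: float64 has a 53-bit significand, so any reader that round-trips BigQuery INTEGER values through float silently corrupts values above 2^53.

    import numpy as np

    big = 2**53 + 1                 # 9007199254740993
    roundtripped = int(float(big))  # 9007199254740992 after the float round-trip
    assert roundtripped != big      # precision is silently lost

    # Adjacent integers beyond 2**53 collapse to the same float64 value.
    arr = np.array([2**53, 2**53 + 1], dtype=np.float64)
    assert arr[0] == arr[1]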

@tworec tworec force-pushed the read_gbq_full_long_support branch from 820a4a6 to a6543ec on August 22, 2016 16:22

codecov-io commented Aug 22, 2016

Codecov Report

Merging #14064 into master will decrease coverage by 0.01%.

@@            Coverage Diff             @@
##           master   #14064      +/-   ##
==========================================
- Coverage   86.32%   86.32%   -0.01%     
==========================================
  Files         141      141              
  Lines       51165    51170       +5     
==========================================
+ Hits        44169    44170       +1     
- Misses       6996     7000       +4
Impacted Files Coverage Δ
pandas/io/gbq.py 17.21% <ø> (-0.19%)
pandas/core/common.py 91.36% <ø> (+0.33%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 153da50...788ccee.

pandas/io/gbq.py Outdated
return float(field_value)
elif field_type == 'TIMESTAMP':
timestamp = datetime.utcfromtimestamp(float(field_value))
return np.datetime64(timestamp)
Contributor:

why are you changing this?

@tworec tworec (Contributor Author) Aug 26, 2016:

For two reasons:

  1. it is independent of the datetime package
  2. it is 4x faster:
In [1]:
import timeit
timeit.timeit('import numpy as np; from datetime import datetime; np.datetime64(datetime.utcfromtimestamp(float(1234567890.123)))', number=1000000)
Out[1]:
6.163272747769952

In [2]:
timeit.timeit('import numpy as np; np.datetime64(int(float(1234567890.123)*1e6), "us")', number=1000000)
Out[2]:
1.5235848873853683

Contributor:

  1. doesn't really matter, but ok
  2. sure, for a single value. You are much better off leaving these as floats, then coercing them all in one go if you care about perf (using to_datetime), as sketched below.
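A sketch of that vectorized approach, assuming the raw values arrive as a column of epoch-seconds floats (the sample values are illustrative, not from the PR):

    import pandas as pd

    # Hypothetical raw column: epoch seconds as floats, as returned by the API.
    raw = pd.Series([1234567890.123, 1234567891.5])

    # Coerce the whole column in one go instead of one np.datetime64 per value.
    timestamps = pd.to_datetime(raw, unit='s')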

Contributor Author:

It's not the point of this PR. I can restore the previous version or keep mine; which do you prefer?


jreback commented Aug 25, 2016

you are changing lots of things, but have minimal tests for this. pls add some.


jreback commented Aug 25, 2016

is this meant to close #14020?


tworec commented Aug 26, 2016

I didn't see #14020 before, but this PR seems to answer @aschmolck's request

@@ -51,10 +51,6 @@ def _skip_if_no_private_key_contents():
raise nose.SkipTest("Cannot run integration tests without a "
"private key json contents")

_skip_if_no_project_id()
_skip_if_no_private_key_path()
Contributor:

pls restore

Contributor Author:

it's junk code; @parthea already removed these lines in f92cd7e


jreback commented Sep 8, 2016

needs more tests for datetimes


tworec commented Sep 8, 2016

this PR is all about numeric precision loss. I'll revert my datetime improvement; it can be added as a separate PR, I think.

@tworec tworec force-pushed the read_gbq_full_long_support branch from a6543ec to 8e44062 on September 8, 2016 13:35
@tworec tworec force-pushed the read_gbq_full_long_support branch from 8e44062 to e36b30b on September 27, 2016 14:06

tworec commented Sep 27, 2016

Ok, I've rebased and added full tests for this change (+ test clean up)

@tworec tworec force-pushed the read_gbq_full_long_support branch from e36b30b to 505792e on September 27, 2016 15:22
@jorisvandenbossche (Member):

cc @parthea


tworec commented Oct 5, 2016

@jreback, @parthea please review

@@ -4428,16 +4428,11 @@ DataFrame with a shape and data types derived from the source table.
Additionally, DataFrames can be inserted into new BigQuery tables or appended
to existing tables.

You will need to install some additional dependencies:

Contributor:

is this not true anymore?

Contributor Author:

see Dependencies sub-chapter

Contributor:

then you need a pointer from the install.rst to the deps section

Contributor:

you need to add a pointer from here to the Dependency section of the docs (that you added below).

@@ -38,7 +38,7 @@ object.
* :ref:`read_json<io.json_reader>`
* :ref:`read_msgpack<io.msgpack>` (experimental)
* :ref:`read_html<io.read_html>`
* :ref:`read_gbq<io.bigquery_reader>` (experimental)
* :ref:`read_gbq<io.bigquery>` (experimental)
Contributor:

why did you change this?

@tworec tworec (Contributor Author) Oct 6, 2016:

because I think a new user should start reading about BQ support from the beginning of the section, to learn all the prerequisites (e.g. additional deps)

Pandas supports all of these `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
``TIMESTAMP`` (microsecond precision). Data types ``BYTES`` and ``RECORD``
are not supported.
Contributor:

are these unsupported types validated?

@tworec tworec (Contributor Author) Oct 6, 2016:

there is no need to validate them:

  • read_gbq cannot encounter RECORD (because a query cannot return a RECORD column), and BYTES are stored as object (so, yes, there is some kind of support for them)
  • to_gbq does not generate those types during schema generation (see _generate_bq_schema; the schema shape is sketched below)
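For reference, a sketch of the schema shape in question (the field names are illustrative, and this is not code from the PR):

    # Illustrative: the shape of a table schema produced by _generate_bq_schema.
    # Per the comment above, BYTES and RECORD are never emitted here.
    schema = {'fields': [
        {'name': 'id', 'type': 'INTEGER'},
        {'name': 'price', 'type': 'FLOAT'},
        {'name': 'name', 'type': 'STRING'},
    ]}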

are not supported.

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++
Contributor:

needs to be the same length as the heading

Contributor Author:

ok

++++++++++++

This module requires these additional dependencies:

Contributor:

ahh I c, ok well then make sure the pointer from the installation page goes here.

Contributor Author:

from where exactly - what file?


.. _io.bigquery_authentication:

Authentication
''''''''''''''

.. versionadded:: 0.18.0
Contributor:

don't change things that are not related to this PR.

Contributor Author:

ok

Authentication via ``application default credentials`` is also possible. This is only valid
if the parameter ``private_key`` is not provided. This method also requires that
the credentials can be fetched from the environment the code is running in.
Otherwise, the OAuth2 client-side authentication is used.
Additional information on
`application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__.

.. versionadded:: 0.19.0
Contributor:

why are you taking things out like this? pls don't

@tworec tworec (Contributor Author) Oct 6, 2016:

because I don't see them in the resulting doc... can this 'thing' cause something to be displayed in the generated HTML?

@@ -396,8 +396,9 @@ For ``MultiIndex``, values are dropped if any level is missing by default. Speci

Google BigQuery Enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :func:`pandas.io.gbq.read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
The :func:`pandas.io.gbq.to_gbq` method now allows the DataFrame column order to differ from the destination table schema (:issue:`11359`).
- The :func:`pandas.io.gbq.read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
Contributor:

move to 0.19.1 (only this particular change)

Member:

Or rather 0.20.0? The current state of this PR changes the behaviour for certain integer columns, so I don't think it needs to go into 0.19.1.

@tworec tworec (Contributor Author) Oct 6, 2016:

@jreback what's your final preference here?

Contributor:

@jorisvandenbossche makes a good point, so 0.20.0

Contributor Author:

ok


This is opposite to the default pandas behaviour, which promotes integer
types to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
for a detailed explanation.
Contributor:

actually I disagree with this approach. I would only cast to object if the values are not representable (IOW big); otherwise follow the pandas standard and promote to float. It is much more natural from a user's point of view.

@tworec tworec (Contributor Author) Oct 6, 2016:

I was trying to get your opinion on this approach about a month ago, but there was silence, which I took as a sign that you had no objections.
Now that the code is ready and tested, I see this. That's one point.

Second point. If I understand correctly, your proposed approach would use three data types to store BQ INTEGER columns:

  • int (or int64) by default
  • float when there are nulls and all other values are less than 2**53
  • object when there are nulls and values greater than 2**53

Are you sure it is worth complicating things that can be much simpler (see my solution, and the sketch below)?
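To make the trade-off concrete, here is a small sketch (illustrative, not code from this PR) contrasting the two behaviours for a nullable INTEGER column:

    import numpy as np
    import pandas as pd

    # Default pandas behaviour: promoting an integer column with NA to float64
    # loses exactness above 2**53.
    promoted = pd.Series([2**53 + 1, None], dtype=object).astype(float)
    assert promoted[0] != 2**53 + 1   # value corrupted by the promotion

    # This PR's behaviour: keep the column as object dtype, preserving exact
    # Python ints alongside NaN.
    preserved = pd.Series([2**53 + 1, np.nan], dtype=object)
    assert preserved[0] == 2**53 + 1  # exact value retained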

Contributor:

@tworec we have 96 PRs open at the moment. I don't always comment till someone indicates things are ready.

can you see whether the column is nullable or not a priori?

Contributor Author:

ok, I understand :)

no, all columns in BigQuery queries are nullable

@@ -493,6 +497,39 @@ def test_should_read_as_service_account_with_key_contents(self):
private_key=_get_private_key_contents())
tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']}))


class TestReadGBQIntegrationWithServiceAccountKeyPath(tm.TestCase):
Contributor:

just make a new class instead of changing existing code.

@tworec tworec (Contributor Author) Oct 6, 2016:

@parthea changed this class to use a private key path to authorize (before, it used a user account) but did not change the name of the class. This rename is meant to fix that inconsistency.

@tworec tworec force-pushed the read_gbq_full_long_support branch from 505792e to f32edcb on October 6, 2016 18:32

jreback commented Oct 7, 2016

I think it might be better to force the user to specify a dtype keyword to override / set behavior for particular columns (like we do in read_csv; we sort of do this with coerce_float in read_sql).

@jorisvandenbossche
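For illustration, such a keyword might look like the following; the ``dtype`` parameter shown here is hypothetical (read_gbq had no such argument at the time), and the query and project id are placeholders:

    import pandas as pd

    # Hypothetical per-column override, mirroring read_csv's dtype keyword.
    df = pd.read_gbq('SELECT id, amount FROM [my_dataset.my_table]',
                     project_id='my-project',
                     dtype={'id': 'object', 'amount': 'float64'})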


tworec commented Oct 7, 2016

I offer a solution with default behavior.
Specifying dtype is cool, but it's a new feature which I do not want to focus on right now. This PR is meant to fix things up, not to add new features.

Can we stop here?


tworec commented Oct 25, 2016

@jorisvandenbossche I've corrected the docs.

regarding float storage: my proposal is described in a comment and implemented with tests.
what's your proposal?


tworec commented Nov 4, 2016

gentlemen @jreback @jorisvandenbossche 😃
please decide if we go with the current shape or need to handle it another way


jreback commented Dec 21, 2016

@tworec sorry, this got a bit lost in the shuffle.

can you rebase and we can get this in.

@tworec tworec force-pushed the read_gbq_full_long_support branch from f641fad to aa248da on December 29, 2016 18:11

tworec commented Dec 29, 2016

@jreback ok, I've rebased

@tworec tworec force-pushed the read_gbq_full_long_support branch from aa248da to f97fcdb on December 30, 2016 13:26

jreback commented Dec 30, 2016

@parthea thoughts?

@tworec tworec force-pushed the read_gbq_full_long_support branch from f97fcdb to 5a22fca on January 3, 2017 16:34

tworec commented Jan 30, 2017

hi @jreback, can we finally merge this? Please decide whether you want this change or will drop it.


jreback commented Jan 30, 2017

I was waiting for @parthea to come back on this.


tworec commented Jan 31, 2017

oh, I c. We'll wait then.
@parthea can you please check this out? :)

@parthea parthea (Contributor) left a comment:

I added a few minor comments. I think _get_private_key_path() should not be called in TestToGBQIntegrationWithLocalUserAccountAuth.

Separately, it would be great if you could follow the steps in the contributing docs to run the integration tests from your fork on Travis-CI. You only need to do the Travis configuration described in the contributing docs once and you're all set for future PRs.

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++++++

.. versionadded:: 0.19
Contributor:

Please update the .. versionadded::

from math import pi
query = '''select * from
(select PI() * POW(10, 307) as NULLABLE_DOUBLE),
(select null as NULLABLE_DOUBLE)'''
Contributor:

SQL reserved words should be upper case for consistency with existing code. We should either use the following or change existing code if there is a preference to change the SQL formatting.

        query = '''SELECT * FROM
                    (SELECT PI() * POW(10, 307) AS Nullable_Double),
                    (SELECT NULL AS Nullable_Double)'''

Contributor Author:

done

from math import pi
query = '''select * from
(select PI() as NULLABLE_FLOAT),
(select null as NULLABLE_FLOAT)'''
Contributor:

SQL keywords should be uppercase for consistency with existing code.

Contributor Author:

Oh, much of the old code was not consistent in that sense. But I've tried to fix all the inconsistencies. Please check my changes :)

@@ -1124,7 +1228,7 @@ def setUpClass(cls):
# executing *ALL* tests described below.

_skip_if_no_project_id()
_skip_if_no_private_key_path()
_skip_local_auth_if_in_travis_env()

_setup_common()
clean_gbq_environment(_get_private_key_path())
Contributor:

Since we are testing local user account auth, can we change this to clean_gbq_environment()? We should also make the same change in tearDownClass.

Contributor Author:

you're right. done.

@tworec tworec force-pushed the read_gbq_full_long_support branch from 5a22fca to 5e476d6 on February 1, 2017 14:00
@tworec tworec (Contributor Author) left a comment:

I fixed things as you suggested, @parthea. Please review once again.

I don't fully trust Travis CI, so I need to create a separate empty project for this purpose. And I'm not on my own: my company needs to set up billing for it. It will take time.

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++++++

.. versionadded:: 0.19

Contributor Author:

thx, I've missed it
@tworec tworec force-pushed the read_gbq_full_long_support branch 3 times, most recently from 7d978d9 to 65e3327 on February 7, 2017 19:57

tworec commented Feb 7, 2017

@parthea please review, because every time I look at this change it needs a manual rebase (conflicts).


tworec commented Feb 8, 2017

@parthea my Travis CI setup is broken; it causes errors with pickles, e.g.
ERROR: pandas.io.tests.test_pickle.TestPickle.test_pickles('0.14.0',)

https://travis-ci.org/RTBHOUSE/pandas/jobs/199354963

Can you help?


parthea commented Feb 8, 2017

@tworec I'm sorry for the delay. The latest commit looks good.

Could you try running git push --tags on your master branch to see if that resolves the issue? Let me know if this works. I will submit a PR to update the BigQuery integration testing steps in the contributing docs to make it clear that you need to push tags to your master branch.


tworec commented Feb 8, 2017

@parthea thx
@jreback can you merge it? :)

I've done it. Output is below. Why is this needed? Tags should not affect building, right?
I guess these 141 commits are/were the problem. We'll see.

$ g co master 
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 141 commits.
  (use "git push" to publish your local commits)

$ g ps --tags
To git@github.com:RTBHOUSE/pandas.git
 * [new tag]         v0.16.2 -> v0.16.2
 * [new tag]         v0.17.0 -> v0.17.0
 * [new tag]         v0.17.0rc1 -> v0.17.0rc1
 * [new tag]         v0.17.0rc2 -> v0.17.0rc2
 * [new tag]         v0.17.1 -> v0.17.1
 * [new tag]         v0.18.0 -> v0.18.0
 * [new tag]         v0.18.0rc1 -> v0.18.0rc1
 * [new tag]         v0.18.0rc2 -> v0.18.0rc2
 * [new tag]         v0.18.1 -> v0.18.1
 * [new tag]         v0.19.0 -> v0.19.0
 * [new tag]         v0.19.0rc1 -> v0.19.0rc1
 * [new tag]         v0.19.1 -> v0.19.1
 * [new tag]         v0.19.2 -> v0.19.2

@tworec tworec force-pushed the read_gbq_full_long_support branch from 65e3327 to 50d31ce on February 8, 2017 12:37
fixes:
- lost precision for longs above 2^53
- and floats above 10k
@tworec tworec force-pushed the read_gbq_full_long_support branch from 50d31ce to 788ccee on February 8, 2017 12:38

tworec commented Feb 8, 2017

I've squashed


jreback commented Feb 8, 2017

@tworec when git clones a repo it doesn't clone the tags, so they are from your master whenever it was cloned, NOT the current one. This behavior baffles me, but that's how it works :<


tworec commented Feb 8, 2017

ok, now my Travis build works well.
thx for the explanation

@jreback jreback closed this in c23b1a4 Feb 9, 2017

jreback commented Feb 9, 2017

thanks for the patience @tworec

you can also put up a PR to: https://github.com/pydata/pandas-gbq (you will need to take out the documentation changes, but otherwise it should be clean).

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
…e 10k

closes pandas-dev#14020
closes pandas-dev#14305

Author: Piotr Chromiec <piotr.chromiec@rtbhouse.com>

Closes pandas-dev#14064 from tworec/read_gbq_full_long_support and squashes the following commits:

788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision
Successfully merging this pull request may close these issues.

gbq.py: silently downcasting INTEGER columns to FLOAT is problematic