BUG: fix read_gbq lost precision for longs above 2^53 and floats above 10k #14064

Closed

Conversation


@tworec tworec commented Aug 22, 2016

fixes:

  • lost precision for longs above 2^53
  • and floats above 10^4

Also contains test_gbq.py clean up
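For context, a minimal sketch (not code from this PR) of the integer precision loss described above: float64 has a 53-bit significand, so any reader that round-trips BigQuery INTEGER values through float silently corrupts values above 2^53.

    import numpy as np

    big = 2**53 + 1                 # 9007199254740993
    roundtripped = int(float(big))  # 9007199254740992 after the float round-trip
    assert roundtripped != big      # precision is silently lost

    # Adjacent integers beyond 2**53 collapse to the same float64 value.
    arr = np.array([2**53, 2**53 + 1], dtype=np.float64)
    assert arr[0] == arr[1]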

@tworec tworec force-pushed the read_gbq_full_long_support branch from 820a4a6 to a6543ec on August 22, 2016 16:22

codecov-io commented Aug 22, 2016

Codecov Report

Merging #14064 into master will decrease coverage by 0.01%.

@@            Coverage Diff             @@
##           master   #14064      +/-   ##
==========================================
- Coverage   86.32%   86.32%   -0.01%     
==========================================
  Files         141      141              
  Lines       51165    51170       +5     
==========================================
+ Hits        44169    44170       +1     
- Misses       6996     7000       +4
Impacted Files Coverage Δ
pandas/io/gbq.py 17.21% <ø> (-0.19%)
pandas/core/common.py 91.36% <ø> (+0.33%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 153da50...788ccee.

pandas/io/gbq.py Outdated
return float(field_value)
elif field_type == 'TIMESTAMP':
timestamp = datetime.utcfromtimestamp(float(field_value))
return np.datetime64(timestamp)
Contributor:

why are you changing this?

@tworec tworec (Contributor Author) Aug 26, 2016:

For two reasons:

  1. it is independent of the datetime package
  2. it is 4x faster:
In [1]:
import timeit
timeit.timeit('import numpy as np; from datetime import datetime; np.datetime64(datetime.utcfromtimestamp(float(1234567890.123)))', number=1000000)
Out[1]:
6.163272747769952

In [2]:
timeit.timeit('import numpy as np; np.datetime64(int(float(1234567890.123)*1e6), "us")', number=1000000)
Out[2]:
1.5235848873853683

Contributor:

  1. doesn't really matter, but ok
  2. sure, for a single value. You are much better off leaving these as floats, then coercing them all in one go if you care about perf (using to_datetime), as sketched below.
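A sketch of that vectorized approach, assuming the raw values arrive as a column of epoch-seconds floats (the sample values are illustrative, not from the PR):

    import pandas as pd

    # Hypothetical raw column: epoch seconds as floats, as returned by the API.
    raw = pd.Series([1234567890.123, 1234567891.5])

    # Coerce the whole column in one go instead of one np.datetime64 per value.
    timestamps = pd.to_datetime(raw, unit='s')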

Contributor Author:

It's not the point of this PR. I can restore the previous version or keep mine; which do you prefer?


jreback commented Aug 25, 2016

you are changing lots of things, but have minimal tests for this. pls add some.


jreback commented Aug 25, 2016

is this meant to close #14020?


tworec commented Aug 26, 2016

I didn't see #14020 before, but this PR seems to answer @aschmolck's request

@@ -51,10 +51,6 @@ def _skip_if_no_private_key_contents():
raise nose.SkipTest("Cannot run integration tests without a "
"private key json contents")

_skip_if_no_project_id()
_skip_if_no_private_key_path()
Contributor:

pls restore

Contributor Author:

it's junk code; @parthea already removed these lines in f92cd7e


jreback commented Sep 8, 2016

needs more tests for datetimes


tworec commented Sep 8, 2016

this PR is all about numeric precision loss. I'll revert my datetime improvement; it can be added as a separate PR, I think.

@tworec tworec force-pushed the read_gbq_full_long_support branch from a6543ec to 8e44062 on September 8, 2016 13:35
@tworec tworec force-pushed the read_gbq_full_long_support branch from 8e44062 to e36b30b on September 27, 2016 14:06

tworec commented Sep 27, 2016

Ok, I've rebased and added full tests for this change (+ test clean up)

@tworec tworec force-pushed the read_gbq_full_long_support branch from e36b30b to 505792e on September 27, 2016 15:22
@jorisvandenbossche (Member):

cc @parthea


tworec commented Oct 5, 2016

@jreback, @parthea please review

@@ -4428,16 +4428,11 @@ DataFrame with a shape and data types derived from the source table.
Additionally, DataFrames can be inserted into new BigQuery tables or appended
to existing tables.

You will need to install some additional dependencies:

Contributor:

is this not true anymore?

Contributor Author:

see Dependencies sub-chapter

Contributor:

then you need a pointer from the install.rst to the deps section

Contributor:

you need to add a pointer from here to the Dependency section of the docs (that you added below).

@@ -38,7 +38,7 @@ object.
* :ref:`read_json<io.json_reader>`
* :ref:`read_msgpack<io.msgpack>` (experimental)
* :ref:`read_html<io.read_html>`
* :ref:`read_gbq<io.bigquery_reader>` (experimental)
* :ref:`read_gbq<io.bigquery>` (experimental)
Contributor:

why did you change this?

@tworec tworec (Contributor Author) Oct 6, 2016:

because I think a new user should start reading about BQ support from the beginning of the section, to learn all the prerequisites (e.g. additional deps)

Pandas supports all of these `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
``TIMESTAMP`` (microsecond precision). Data types ``BYTES`` and ``RECORD``
are not supported.
Contributor:

are these unsupported types validated?

@tworec tworec (Contributor Author) Oct 6, 2016:

there is no need to validate them:

  • read_gbq cannot encounter RECORD (because a query cannot return a RECORD column), and BYTES are stored as object (so, yes, there is some kind of support for them)
  • to_gbq does not generate those types during schema generation (see _generate_bq_schema; the schema shape is sketched below)
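For reference, a sketch of the schema shape in question (the field names are illustrative, and this is not code from the PR):

    # Illustrative: the shape of a table schema produced by _generate_bq_schema.
    # Per the comment above, BYTES and RECORD are never emitted here.
    schema = {'fields': [
        {'name': 'id', 'type': 'INTEGER'},
        {'name': 'price', 'type': 'FLOAT'},
        {'name': 'name', 'type': 'STRING'},
    ]}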

are not supported.

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++
Contributor:

needs to be the same length as the heading

Contributor Author:

ok

++++++++++++

This module requires these additional dependencies:

Contributor:

ahh I c, ok well then make sure the pointer from the installation page goes here.

Contributor Author:

from where exactly - what file?


.. _io.bigquery_authentication:

Authentication
''''''''''''''

.. versionadded:: 0.18.0
Contributor:

don't change things that are not related to this PR.

Contributor Author:

ok

Authentication via ``application default credentials`` is also possible. This is only valid
if the parameter ``private_key`` is not provided. This method also requires that
the credentials can be fetched from the environment the code is running in.
Otherwise, the OAuth2 client-side authentication is used.
Additional information on
`application default credentials <https://developers.google.com/identity/protocols/application-default-credentials>`__.

.. versionadded:: 0.19.0
Contributor:

why are you taking things out like this? pls don't

@tworec tworec (Contributor Author) Oct 6, 2016:

because I don't see them in the resulting doc... can this 'thing' cause something to be displayed in the generated HTML?

@@ -396,8 +396,9 @@ For ``MultiIndex``, values are dropped if any level is missing by default. Speci

Google BigQuery Enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The :func:`pandas.io.gbq.read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
The :func:`pandas.io.gbq.to_gbq` method now allows the DataFrame column order to differ from the destination table schema (:issue:`11359`).
- The :func:`pandas.io.gbq.read_gbq` method has gained the ``dialect`` argument to allow users to specify whether to use BigQuery's legacy SQL or BigQuery's standard SQL. See the :ref:`docs <io.bigquery_reader>` for more details (:issue:`13615`).
Contributor:

move to 0.19.1 (only this particular change)

Member:

Or rather 0.20.0? The current state of this PR changes the behaviour for certain integer columns, so I don't think it needs to go into 0.19.1.

@tworec tworec (Contributor Author) Oct 6, 2016:

@jreback what's your final preference here?

Contributor:

@jorisvandenbossche makes a good point, so 0.20.0

Contributor Author:

ok


This is opposite to the default pandas behaviour, which promotes integer
types to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
for a detailed explanation.
Contributor:

actually I disagree with this approach. I would only cast to object if the values are not representable (IOW big); otherwise follow the pandas standard and promote to float. It is much more natural from a user's point of view.

@tworec tworec (Contributor Author) Oct 6, 2016:

I was trying to get your opinion on this approach about a month ago, but there was silence, which I took as a sign that you had no objections.
Now that the code is ready and tested, I see this. That's one point.

Second point. If I understand correctly, your proposed approach would use three data types to store BQ INTEGER columns:

  • int (or int64) by default
  • float when there are nulls and all other values are less than 2**53
  • object when there are nulls and values greater than 2**53

Are you sure it is worth complicating things that can be much simpler (see my solution, and the sketch below)?
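To make the trade-off concrete, here is a small sketch (illustrative, not code from this PR) contrasting the two behaviours for a nullable INTEGER column:

    import numpy as np
    import pandas as pd

    # Default pandas behaviour: promoting an integer column with NA to float64
    # loses exactness above 2**53.
    promoted = pd.Series([2**53 + 1, None], dtype=object).astype(float)
    assert promoted[0] != 2**53 + 1   # value corrupted by the promotion

    # This PR's behaviour: keep the column as object dtype, preserving exact
    # Python ints alongside NaN.
    preserved = pd.Series([2**53 + 1, np.nan], dtype=object)
    assert preserved[0] == 2**53 + 1  # exact value retained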

Contributor:

@tworec we have 96 PRs open at the moment. I don't always comment till someone indicates things are ready.

can you see whether the column is nullable or not a priori?

Contributor Author:

ok, I understand :)

no, all columns in BigQuery queries are nullable

@@ -493,6 +497,39 @@ def test_should_read_as_service_account_with_key_contents(self):
private_key=_get_private_key_contents())
tm.assert_frame_equal(df, DataFrame({'VALID_STRING': ['PI']}))


class TestReadGBQIntegrationWithServiceAccountKeyPath(tm.TestCase):
Contributor:

just make a new class instead of changing existing code.

@tworec tworec (Contributor Author) Oct 6, 2016:

@parthea changed this class to use a private key path to authorize (before, it used a user account) but did not change the name of the class. This rename is meant to fix that inconsistency.

@tworec tworec force-pushed the read_gbq_full_long_support branch from 505792e to f32edcb on October 6, 2016 18:32

jreback commented Oct 7, 2016

I think it might be better to force the user to specify a dtype keyword to override / set behavior for particular columns (like we do in read_csv; we sort of do this with coerce_float in read_sql).

@jorisvandenbossche
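For illustration, such a keyword might look like the following; the ``dtype`` parameter shown here is hypothetical (read_gbq had no such argument at the time), and the query and project id are placeholders:

    import pandas as pd

    # Hypothetical per-column override, mirroring read_csv's dtype keyword.
    df = pd.read_gbq('SELECT id, amount FROM [my_dataset.my_table]',
                     project_id='my-project',
                     dtype={'id': 'object', 'amount': 'float64'})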


tworec commented Oct 7, 2016

I offer a solution with default behavior.
Specifying dtype is cool, but it's a new feature which I do not want to focus on right now. This PR is meant to fix things up, not to add new features.

Can we stop here?


tworec commented Oct 25, 2016

@jorisvandenbossche I've corrected the docs.

regarding float storage: my proposal is described in a comment and implemented with tests.
what's your proposal?


tworec commented Nov 4, 2016

gentlemen @jreback @jorisvandenbossche 😃
please decide if we go with the current shape or need to handle it another way


jreback commented Dec 21, 2016

@tworec sorry, this got a bit lost in the shuffle.

can you rebase and we can get this in.

@tworec tworec force-pushed the read_gbq_full_long_support branch from f641fad to aa248da on December 29, 2016 18:11

tworec commented Dec 29, 2016

@jreback ok, I've rebased

@tworec tworec force-pushed the read_gbq_full_long_support branch from aa248da to f97fcdb on December 30, 2016 13:26

jreback commented Dec 30, 2016

@parthea thoughts?

@tworec tworec force-pushed the read_gbq_full_long_support branch from f97fcdb to 5a22fca on January 3, 2017 16:34

tworec commented Jan 30, 2017

hi @jreback, can we finally merge this? Please decide whether you want this change or will drop it.


jreback commented Jan 30, 2017

I was waiting for @parthea to come back on this.


tworec commented Jan 31, 2017

oh, I c. We'll wait then.
@parthea can you please check this out? :)

@parthea parthea (Contributor) left a comment:

I added a few minor comments. I think _get_private_key_path() should not be called in TestToGBQIntegrationWithLocalUserAccountAuth.

Separately, it would be great if you could follow the steps in the contributing docs to run the integration tests from your fork on Travis-CI. You only need to do the Travis configuration described in the contributing docs once and you're all set for future PRs.

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++++++

.. versionadded:: 0.19
Contributor:

Please update the .. versionadded::

from math import pi
query = '''select * from
(select PI() * POW(10, 307) as NULLABLE_DOUBLE),
(select null as NULLABLE_DOUBLE)'''
Contributor:

SQL reserved words should be upper case for consistency with existing code. We should either use the following or change existing code if there is a preference to change the SQL formatting.

        query = '''SELECT * FROM
                    (SELECT PI() * POW(10, 307) AS Nullable_Double),
                    (SELECT NULL AS Nullable_Double)'''

Contributor Author:

done

from math import pi
query = '''select * from
(select PI() as NULLABLE_FLOAT),
(select null as NULLABLE_FLOAT)'''
Contributor:

SQL keywords should be uppercase for consistency with existing code.

Contributor Author:

Oh, much of the old code was not consistent in that sense. But I've tried to fix all the inconsistencies. Please check my changes :)

@@ -1124,7 +1228,7 @@ def setUpClass(cls):
# executing *ALL* tests described below.

_skip_if_no_project_id()
_skip_if_no_private_key_path()
_skip_local_auth_if_in_travis_env()

_setup_common()
clean_gbq_environment(_get_private_key_path())
Contributor:

Since we are testing local user account auth, can we change this to clean_gbq_environment()? We should also make the same change in tearDownClass.

Contributor Author:

you're right. done.

@tworec tworec force-pushed the read_gbq_full_long_support branch from 5a22fca to 5e476d6 on February 1, 2017 14:00
@tworec tworec (Contributor Author) left a comment:

I fixed things as you suggested, @parthea. Please review once again.

I don't fully trust Travis CI, so I need to create a separate empty project for this purpose. And I'm not on my own: my company needs to set up billing for it. It will take time.

Integer and boolean ``NA`` handling
+++++++++++++++++++++++++++++++++++

.. versionadded:: 0.19

Contributor Author:

thx, I've missed it
@tworec tworec force-pushed the read_gbq_full_long_support branch 3 times, most recently from 7d978d9 to 65e3327 on February 7, 2017 19:57

tworec commented Feb 7, 2017

@parthea please review, because every time I look at this change it needs a manual rebase (conflicts).


tworec commented Feb 8, 2017

@parthea my Travis CI setup is broken; it causes errors with pickles, e.g.
ERROR: pandas.io.tests.test_pickle.TestPickle.test_pickles('0.14.0',)

https://travis-ci.org/RTBHOUSE/pandas/jobs/199354963

Can you help?


parthea commented Feb 8, 2017

@tworec I'm sorry for the delay. The latest commit looks good.

Could you try running git push --tags on your master branch to see if that resolves the issue? Let me know if this works. I will submit a PR to update the BigQuery integration testing steps in the contributing docs to make it clear that you need to push tags to your master branch.


tworec commented Feb 8, 2017

@parthea thx
@jreback can you merge it? :)

I've done it. Output is below. Why is this needed? Tags should not affect building, right?
I guess these 141 commits are/were the problem. We'll see.

$ g co master 
Switched to branch 'master'
Your branch is ahead of 'origin/master' by 141 commits.
  (use "git push" to publish your local commits)

$ g ps --tags
To git@github.com:RTBHOUSE/pandas.git
 * [new tag]         v0.16.2 -> v0.16.2
 * [new tag]         v0.17.0 -> v0.17.0
 * [new tag]         v0.17.0rc1 -> v0.17.0rc1
 * [new tag]         v0.17.0rc2 -> v0.17.0rc2
 * [new tag]         v0.17.1 -> v0.17.1
 * [new tag]         v0.18.0 -> v0.18.0
 * [new tag]         v0.18.0rc1 -> v0.18.0rc1
 * [new tag]         v0.18.0rc2 -> v0.18.0rc2
 * [new tag]         v0.18.1 -> v0.18.1
 * [new tag]         v0.19.0 -> v0.19.0
 * [new tag]         v0.19.0rc1 -> v0.19.0rc1
 * [new tag]         v0.19.1 -> v0.19.1
 * [new tag]         v0.19.2 -> v0.19.2

@tworec tworec force-pushed the read_gbq_full_long_support branch from 65e3327 to 50d31ce on February 8, 2017 12:37
fixes:
- lost precision for longs above 2^53
- and floats above 10k
@tworec tworec force-pushed the read_gbq_full_long_support branch from 50d31ce to 788ccee on February 8, 2017 12:38

tworec commented Feb 8, 2017

I've squashed


jreback commented Feb 8, 2017

@tworec when git clones a repo it doesn't clone the tags, so they are from your master whenever it was cloned, NOT the current one. This behavior baffles me, but that's how it works :<


tworec commented Feb 8, 2017

ok, now my Travis build works well.
thx for the explanation

@jreback jreback closed this in c23b1a4 Feb 9, 2017

jreback commented Feb 9, 2017

thanks for the patience @tworec

you can also put up a PR to: https://github.com/pydata/pandas-gbq (you will need to take out the documentation changes, but otherwise it should be clean).

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
…e 10k

closes pandas-dev#14020
closes pandas-dev#14305

Author: Piotr Chromiec <piotr.chromiec@rtbhouse.com>

Closes pandas-dev#14064 from tworec/read_gbq_full_long_support and squashes the following commits:

788ccee [Piotr Chromiec] BUG: fix read_gbq lost numeric precision
Successfully merging this pull request may close these issues.

gbq.py: silently downcasting INTEGER columns to FLOAT is problematic