fail to read INT64 values on Windows #119

Closed
blose opened this issue Feb 13, 2018 · 14 comments


@blose
Contributor

blose commented Feb 13, 2018

On pandas-gbq 0.3.0, it fails to read INT64 values:
new_df = gbq.read_gbq( "SELECT 12345678901234567 as iii", project_id=project_id, dialect="standard" )

on both standard and legacy dialects. Querying a real table does the same.

Requesting query... ok.
Job ID: 82d0024d-3e31-4518-a6d5-755cf96f108c
Query running...
Query done.
Processed: 0.0 B Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
Got 1 rows.


OverflowError                             Traceback (most recent call last)
<ipython-input-…> in <module>()
7 top10_active_users_df = gbq.read_gbq(
8 "SELECT 12345678901234567 as iii",
----> 9 project_id=project_id
10 )

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\gbq.py in read_gbq(query, project_id, index_col, col_order, reauth, verbose, private_key, dialect, **kwargs)
98 private_key=private_key,
99 dialect=dialect,
--> 100 **kwargs)
101
102

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas_gbq\gbq.py in read_gbq(query, project_id, index_col, col_order, reauth, verbose, private_key, auth_local_webserver, dialect, **kwargs)
875 field['mode'] != 'repeated':
876 final_df[field['name']] = \
--> 877 final_df[field['name']].astype(type_map[field['type'].upper()])
878
879 connector.print_elapsed_seconds(

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
89 else:
90 kwargs[new_arg_name] = new_arg_value
---> 91 return func(*args, **kwargs)
92 return wrapper
93 return _deprecate_kwarg

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
3408 # else, only a single dtype is given
3409 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 3410 **kwargs)
3411 return self._constructor(new_data).__finalize__(self)
3412

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, **kwargs)
3222
3223 def astype(self, dtype, **kwargs):
-> 3224 return self.apply('astype', dtype=dtype, **kwargs)
3225
3226 def convert(self, **kwargs):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
3089
3090 kwargs['mgr'] = self
-> 3091 applied = getattr(b, f)(**kwargs)
3092 result_blocks = _extend_blocks(applied, result_blocks)
3093

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in astype(self, dtype, copy, errors, values, **kwargs)
469 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
470 return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 471 **kwargs)
472
473 def _astype(self, dtype, copy=False, errors='raise', values=None,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, raise_on_error, **kwargs)
519
520 # _astype_nansafe works fine with 1-d only
--> 521 values = astype_nansafe(values.ravel(), dtype, copy=True)
522 values = values.reshape(self.shape)
523

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy)
623 elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
624 # work around NumPy brokenness, #1987
--> 625 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
626
627 if dtype.name in ("datetime64", "timedelta64"):

pandas\_libs\lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src\util.pxd in util.set_value_at_unsafe()

OverflowError: Python int too large to convert to C long

@max-sixty
Contributor

I'm guessing you're on a 32-bit platform, in which case the integer is too big. The easiest solution is to convert to float in your query, or divide by a trillion.
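As a side note on the convert-to-float suggestion: a sketch below (plain Python, no BQ needed) shows why a float cast is lossy for 17-digit IDs, since a float64 carries only 53 bits of mantissa (roughly 15-16 significant decimal digits).

```python
# Why casting a 17-digit ID to float loses information: float64 has a
# 53-bit mantissa, so integers above 2**53 can't all be represented.
big_id = 12345678901234567        # the value from the failing query
as_float = float(big_id)

print(int(as_float))              # -> 12345678901234568 (rounded)
print(int(as_float) == big_id)    # -> False
```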

@blose
Contributor Author

blose commented Feb 14, 2018

It seems that I have a 64-bit Anaconda installation.

import platform
platform.architecture()

('64bit', 'WindowsPE')

Some Anaconda installation info:

(base) C:\Users\maxim.belov>conda info

     active environment : base
    active env location : C:\Users\maxim.belov\AppData\Local\Continuum\anaconda3
            shell level : 1
       user config file : C:\Users\maxim.belov\.condarc
 populated config files : C:\Users\maxim.belov\.condarc
          conda version : 4.4.4
    conda-build version : 3.0.27
         python version : 3.6.3.final.0
       base environment : C:\Users\maxim.belov\AppData\Local\Continuum\anaconda3  (writable)
           channel URLs : https://repo.continuum.io/pkgs/main/win-64
                          https://repo.continuum.io/pkgs/main/noarch
                          https://repo.continuum.io/pkgs/free/win-64
                          https://repo.continuum.io/pkgs/free/noarch
                          https://repo.continuum.io/pkgs/r/win-64
                          https://repo.continuum.io/pkgs/r/noarch
                          https://repo.continuum.io/pkgs/pro/win-64
                          https://repo.continuum.io/pkgs/pro/noarch
                          https://repo.continuum.io/pkgs/msys2/win-64
                          https://repo.continuum.io/pkgs/msys2/noarch
          package cache : C:\Users\maxim.belov\AppData\Local\Continuum\anaconda3\pkgs
                          C:\Users\maxim.belov\AppData\Local\conda\conda\pkgs
       envs directories : C:\Users\maxim.belov\AppData\Local\Continuum\anaconda3\envs
                          C:\Users\maxim.belov\AppData\Local\conda\conda\envs
                          C:\Users\maxim.belov\.conda\envs
               platform : win-64
             user-agent : conda/4.4.4 requests/2.18.4 CPython/3.6.3 Windows/10 Windows/10.0.16299
          administrator : False
             netrc file : None
           offline mode : False

I have 64-bit identifiers there, so float or division won't work: both lose digits. As a temporary workaround, I just convert them to strings inside the query, like:

CAST(long_long_id AS STRING)

But this seems like a rather ugly hack.
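For completeness, a minimal sketch of the client side of that workaround: once the value arrives as text, converting it with an explicitly named 'int64' dtype keeps all the digits (the column name long_long_id is from the query above; the DataFrame here just simulates what read_gbq would return).

```python
import pandas as pd

# Simulate the result of SELECT CAST(long_long_id AS STRING): the column
# arrives as text with full precision intact.
df = pd.DataFrame({"long_long_id": ["123456789012345678"]})

# Naming the dtype explicitly ('int64') avoids the platform C long that a
# plain astype(int) would use (32-bit on Windows).
df["long_long_id"] = df["long_long_id"].astype("int64")
print(df["long_long_id"].iloc[0])   # -> 123456789012345678
```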

@max-sixty
Contributor

What do you get if you run this?

In [1]: import sys

In [2]: sys.maxsize
Out[2]: 9223372036854775807

That said, I think this may be a Python & platform limitation - can you find instances of these numbers working elsewhere from BQ? If that's right, you may get more visibility on your question on SO, decoupled from BQ.

@blose
Contributor Author

blose commented Feb 14, 2018

Everything is okay with Python itself; 64-bit integers are handled as they should be:

import sys
print(sys.maxsize)

testint = 123456789012345678
testint *= 2
testint >>= 1
print(testint)

I got, as expected:

9223372036854775807
123456789012345678

can you find instances of these numbers working elsewhere from BQ?

I have no idea how to reproduce this outside of pandas-gbq. When I collect the data using the BigQuery web interface, export it as CSV, and import it with pandas, everything works perfectly.

In any case, any value greater than 2^31-1 fails with the same error:

SELECT CAST(POW(2,31)-1 as INT64) is okay
SELECT CAST(POW(2,31) as INT64) fails

It also doesn't matter whether the numbers are computed as above or retrieved from a regular table; only the magnitude of the integers matters.
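The boundary observed above lines up exactly with the signed 32-bit integer range: POW(2,31)-1 is the int32 maximum, which is the width of a C long on Windows. A quick check:

```python
import numpy as np

# The failure boundary matches the int32 range exactly: 2**31 - 1 is the
# largest value a 32-bit signed integer (a C long on Windows) can hold.
print(np.iinfo(np.int32).max)   # -> 2147483647, i.e. 2**31 - 1
print(np.iinfo(np.int64).max)   # -> 9223372036854775807
```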

@max-sixty
Contributor

max-sixty commented Feb 14, 2018

Thanks, v helpful response.

The weird thing is that mine works fine:

In [15]: new_df = gbq.read_gbq( "SELECT 1234567890123456789 as iii", project_id=project_id, dialect="standard" )
Requesting query... ok.
Job ID: 04292a42-7ac0-4bdb-a06a-c14c2083c335
Query running...
Query done.
Processed: 0.0 B Billed: 0.0 B
Standard price: $0.00 USD

Retrieving results...
Got 1 rows.

Total time taken 1.61 s.
Finished at 2018-02-14 10:43:58.

In [16]: new_df
Out[16]:
                   iii
0  1234567890123456789

...so I suspect it's a platform related issue.

Can you replicate it if you attempt to make a dataframe (without BQ) using a large int? The stack trace doesn't look like anything BQ-specific, and you'll get much more help if we can turn this into a generic pandas question.

@blose
Contributor Author

blose commented Feb 14, 2018

Hmm...

from io import StringIO
import pandas as pd
TESTDATA=StringIO("""col1
    123456789012345678
    """)
df = pd.read_csv(TESTDATA, sep=";")
print(df)
print(df.dtypes)

I get

                 col1
0  123456789012345678
col1    int64
dtype: object

Maybe this is an issue with type mapping?

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas_gbq\gbq.py in read_gbq(query, project_id, index_col, col_order, reauth, verbose, private_key, auth_local_webserver, dialect, **kwargs)
875 field['mode'] != 'repeated':
876 final_df[field['name']] = \
--> 877 final_df[field['name']].astype(type_map[field['type'].upper()])
878
879 connector.print_elapsed_seconds(

@max-sixty
Contributor

Right - how about df.astype(int)? I think that should do the same thing pandas-gbq is doing.

@blose
Contributor Author

blose commented Feb 14, 2018

Yes, for some reason the 64-bit values from BQ are treated as int32:

query_df = gbq.read_gbq(
    "SELECT CAST(POW(2,31)-1 as INT64) ",
    project_id=project_id,
    dialect="standard"
)
query_df.dtypes
f0_    int32
dtype: object

Do you have any ideas how it may happen, or, alternatively, how to force a type for a column before the query?

@max-sixty
Contributor

max-sixty commented Feb 14, 2018

OK, at least that's progress.

What did the line above return - df.astype(int) on the df from your previous comment?

@blose
Contributor Author

blose commented Feb 14, 2018

from io import StringIO

import pandas as pd

TESTDATA=StringIO("""col1
    123456789012345678
    """)

df = pd.read_csv(TESTDATA, sep=";")
print(df)
print("-----------------")
print(df.dtypes)
print("-----------------")
print(df.astype(int))

I get

                 col1
0  123456789012345678
-----------------
col1    int64
dtype: object
-----------------
         col1
0 -1506741426
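The -1506741426 printed above is not random: it is exactly the low 32 bits of 123456789012345678 reinterpreted as a signed 32-bit integer, which is what a silent int32 wraparound produces. A plain-Python check:

```python
# Reproduce the wrapped value by hand: keep the low 32 bits of the original
# number, then reinterpret them as a signed 32-bit integer.
big = 123456789012345678
low32 = big & 0xFFFFFFFF        # low 32 bits
if low32 >= 2**31:              # reinterpret as signed int32
    low32 -= 2**32
print(low32)                    # -> -1506741426, matching the output above
```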

@max-sixty
Contributor

OK - so it doesn't throw the same error, but it does break (i.e. it wraps). Am I reading that correctly?

@blose
Contributor Author

blose commented Feb 14, 2018

Yes. If I use pandas without GBQ:

  • int64 values are handled correctly
  • explicit conversion to int leads to silent integer overflow

If I use pandas with GBQ:

  • the INT64 SQL type is mapped to the int32 dtype for some reason
  • it throws an error if an overflow occurs
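A plausible explanation for both symptoms, offered here as an assumption rather than a confirmed diagnosis: astype(int) resolves Python's int to the platform C long, which is 32-bit on Windows even on 64-bit builds. The portable workaround is to name the dtype explicitly:

```python
import pandas as pd

# Naming the dtype as the string 'int64' gives the same 64-bit width on
# every platform, unlike astype(int), which follows the C long
# (32-bit on Windows, 64-bit on most Unix builds).
df = pd.DataFrame({"col1": [123456789012345678]})

safe = df.astype("int64")           # explicit width, portable result
print(safe["col1"].iloc[0])         # -> 123456789012345678
print(safe.dtypes["col1"])          # -> int64
```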

@max-sixty
Contributor

OK, good synopsis.

I would take the latter bug and push it to pandas directly (or maybe it's already out there). It sounds like something Windows-specific, given I don't get it on Mac.

I'm sorry not to be more helpful on your problem directly. Thanks for pushing this issue this far.

@tswast tswast changed the title fail to read INT64 values fail to read INT64 values on Windows Feb 23, 2018
@max-sixty
Contributor

Closed by #121
