
Do not manually loop over all rows when encoding a dataframe as JSON #96


Closed

tswast opened this issue Dec 8, 2017 · 7 comments

Labels
type: feature request (‘Nice-to-have’ improvement, new feature or different behavior or design)

Comments

tswast (Collaborator) commented Dec 8, 2017

See: #25 (comment)

Currently the code loops over each row to encode it as JSON. This could be sped up by calling to_json() on the whole dataframe instead.
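
For illustration, a minimal sketch of the difference, assuming pandas 0.19+ (where to_json() supports lines=True); the column names here are placeholders:

import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob'], 'value': [1, 2]})

# Per-row encoding: a simplified stand-in for the current loop.
ndjson_rows = '\n'.join(row.to_json() for _, row in df.iterrows())

# Whole-dataframe encoding: orient='records' with lines=True emits
# newline-delimited JSON in a single call.
ndjson_bulk = df.to_json(orient='records', lines=True)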

max-sixty (Contributor) commented:

FYI this is our internal function. It has some idiosyncrasies (e.g. how it handles indexes, and the fact that it uses the filesystem), but it works fairly well.

import logging
import os
import tempfile

from google.cloud import bigquery

# Note: get_client, wait_for_job, _bq_schema, and if_exists_map are other
# internal helpers that are not shown in this snippet.
logger = logging.getLogger(__name__)


def write_gbq(df, dataset_id, table_name, project=None,
              credentials=None, block=False, if_exists='fail', **kwargs):
    """Write a DataFrame to a Google BigQuery table.

    Parameters
    ----------
    df : DataFrame
        DataFrame to be written
    dataset_id : str
        Dataset ID to contain the table
    table_name : str
        Name of table to be written
    project : str (defaults to env var GOOGLE_CLOUD_PROJECT)
        Google BigQuery account project ID.
    credentials : GoogleCredentials
        Credentials used to authenticate the BigQuery client.
    block : boolean (optional)
        If True, wait for the load job to complete before returning.
    if_exists : {'fail', 'replace', 'append'}, default 'fail'
        'fail': If table exists, raise.
        'replace': If table exists, drop it, recreate it, and insert data.
        'append': If table exists, insert data. Create if does not exist.
    """
    client = get_client(credentials=credentials, project=project)
    dataset = client.dataset(dataset_id)
    table = dataset.table(table_name)

    # reset index if an index exists
    df = df.reset_index()
    if 'index' in df.columns:
        df = df.drop('index', axis=1)

    file_path = _write_temp_file(df)
    file_size_mb = os.stat(file_path).st_size / 1024 ** 2
    logger.info("Writing file to BQ: {} mb".format(file_size_mb))

    config = bigquery.job.LoadJobConfig()
    config.write_disposition = if_exists_map[if_exists]
    config.source_format = 'CSV'
    config.schema = _bq_schema(df)

    with open(file_path, 'rb') as source_file:
        job = client.load_table_from_file(
            file_obj=source_file,
            destination=table,
            job_config=config)

    if block:
        wait_for_job(job)

    return job


def _write_temp_file(df, filename='df.csv'):
    path = tempfile.mkdtemp()
    file_path = os.path.join(path, filename)
    df.to_csv(file_path, index=False, header=False,
              encoding='utf-8', date_format='%Y-%m-%d %H:%M')
    return file_path
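
A hypothetical usage example of the helper above (the project, dataset, and table names are placeholders):

import pandas as pd

df = pd.DataFrame({'user': ['alice', 'bob'], 'score': [1.5, 2.0]})

# Blocks until the load job finishes, replacing the table if it already exists.
job = write_gbq(df, dataset_id='analytics', table_name='scores',
                project='my-project', block=True, if_exists='replace')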

tswast (Collaborator, Author) commented Feb 10, 2018

Calling to_json() on the whole dataframe gives a single JSON object, not newline-delimited JSON. This will become much easier if/when BigQuery supports a column-oriented data format such as Parquet.

tswast (Collaborator, Author) commented Feb 10, 2018

Cool, thanks @maxim-lian

tswast (Collaborator, Author) commented Feb 10, 2018

Ah, yeah CSV might be the way to go for now.

max-sixty (Contributor) commented:

CSV is also smaller! And no nesting is possible, so there's no loss there.

But if JSON is preferred, I think you can use orient='records'.

tswast (Collaborator, Author) commented Feb 10, 2018

Looks like something like this should work to avoid having to write to a temporary file.

import six


def encode_chunk(df):
    """Return a file-like object of CSV-encoded rows.

    Args:
      df (pandas.DataFrame): A chunk of a dataframe to encode
    """
    csv_buffer = six.StringIO()
    df.to_csv(
        csv_buffer, index=False, header=False, encoding='utf-8',
        date_format='%Y-%m-%d %H:%M')

    # Convert to a BytesIO buffer so that unicode text is properly handled.
    # See: https://github.com/pydata/pandas-gbq/issues/106
    body = csv_buffer.getvalue()
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    body = body.encode('utf-8')
    return six.BytesIO(body)
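
A hedged sketch of how such a buffer could then be handed to the BigQuery client, mirroring the load_table_from_file call in the snippet above (client, table, and config are assumed to be set up the same way):

# Stream the in-memory CSV chunk straight into a load job; no temp file needed.
chunk_buffer = encode_chunk(df)
job = client.load_table_from_file(
    file_obj=chunk_buffer,
    destination=table,
    job_config=config)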

tswast added a commit to tswast/python-bigquery-pandas that referenced this issue Feb 10, 2018
tswast added a commit that referenced this issue Feb 12, 2018
…#117)

* BUG: Fix uploading of dataframes containing int64 and float64 columns

Fixes #116 and #96 by loading data in CSV chunks.

* ENH: allow chunksize=None to disable chunking in to_gbq()

Also, fixes lint errors.

* TST: update min g-c-bq lib to 0.29.0 in CI

* BUG: pass schema to load job for to_gbq

* Generate schema if needed for table creation.

* Restore _generate_bq_schema, as it is used in tests.

* Add fixes to changelog.
tswast (Collaborator, Author) commented Feb 12, 2018

Closed by #117
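
For reference, a hedged sketch of how the chunking behavior from that change is exposed through the public to_gbq() API (the table and project names are placeholders):

import pandas as pd
import pandas_gbq

df = pd.DataFrame({'num': [1, 2, 3], 'val': [0.1, 0.2, 0.3]})

# An integer chunksize uploads the dataframe in CSV-encoded chunks of that many
# rows; chunksize=None disables chunking and sends everything in one load job.
pandas_gbq.to_gbq(df, 'some_dataset.some_table', project_id='my-project',
                  chunksize=None, if_exists='replace')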

tswast closed this as completed Feb 12, 2018