Import CSV with NULL values #401

danielnaab · 2020-10-25T02:50:24Z

I'm having a problem processing CSVs that contains NULL values. For example:

import tiledb
import urllib.request

urllib.request.urlretrieve('https://covid.ourworldindata.org/data/owid-covid-data.csv', 'owid-covid-data.csv')
tiledb.from_csv('owid-covid-data.tldb', 'owid-covid-data.csv')

Produces the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

tiledb/np2buf.pyx in tiledb.libtiledb.array_to_buffer()

tiledb/np2buf.pyx in tiledb.libtiledb._varlen_cell_dtype()

TypeError: Unsupported varlen cell datatype ('<class 'float'>')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-47-38bfbe6aa19e> in <module>
      3 shutil.rmtree('owid-covid-data.tldb', ignore_errors=True)
      4 
----> 5 tiledb.from_csv('owid-covid-data.tldb', 'owid-covid-data.csv')

/private/var/tmp/_bazel_dan/cebcfff554bef27d2846af867a1d4809/execroot/datafixer/bazel-out/darwin-fastbuild/bin/notebooks/launch.runfiles/pip/pypi__tiledb/tiledb/dataframe_.py in from_csv(uri, csv_file, **kwargs)
    646 
    647         kwargs.update(tiledb_args)
--> 648         from_pandas(uri, df, **kwargs)

/private/var/tmp/_bazel_dan/cebcfff554bef27d2846af867a1d4809/execroot/datafixer/bazel-out/darwin-fastbuild/bin/notebooks/launch.runfiles/pip/pypi__tiledb/tiledb/dataframe_.py in from_pandas(uri, dataframe, **kwargs)
    429 
    430                 # TODO ensure correct col/dim ordering
--> 431                 A[tuple(coords)] = write_dict
    432 
    433             else:

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

TypeError: Failed to convert buffer for attribute: 'iso_code'

Is this a naive usage of tiledb.from_csv (ie, I'm doing it wrong), or should I expect it to handle empty values like in this sample CSV?

The text was updated successfully, but these errors were encountered:

ihnorton · 2020-10-25T19:45:50Z

Hi @danielnaab, nullable support is under active development for the next TileDB core release. Right now,tiledb.from_csv supports a fillna parameter which will apply a fill value before writing the dataframe, see here for example:

https://github.com/TileDB-Inc/TileDB-Py/blob/dev/tiledb/tests/test_pandas_dataframe.py#L506-L509

For your dataset, I tested the following call to successfully write/read a TileDB array from the CSV file:

In [5]: tiledb.from_csv('owid-covid-data.tldb', "owid-covid-data.csv", column_types={'iso_code': np.str, 'continent': np.str, 'location': np.str, 'tests_uni
   ...: ts': np.str}, parse_dates=['date'], fillna={'iso_code': '', 'continent': '', 'location': '', 'tests_units': ''})

In [6]: df_tiledb = tiledb.open_dataframe('owid-covid-data.tldb')

In [7]: df_tiledb
Out[7]:
      iso_code continent       location       date  ...  handwashing_facilities  hospital_beds_per_thousand  life_expectancy  human_development_index
0          AFG      Asia    Afghanistan 2019-12-31  ...                  37.746                         0.5            64.83                    0.498
1          AFG      Asia    Afghanistan 2020-01-01  ...                  37.746                         0.5            64.83                    0.498
2          AFG      Asia    Afghanistan 2020-01-02  ...                  37.746                         0.5            64.83                    0.498
3          AFG      Asia    Afghanistan 2020-01-03  ...                  37.746                         0.5            64.83                    0.498
4          AFG      Asia    Afghanistan 2020-01-04  ...                  37.746                         0.5            64.83                    0.498
...        ...       ...            ...        ...  ...                     ...                         ...              ...                      ...
52437                     International 2020-10-21  ...                     NaN                         NaN              NaN                      NaN
52438                     International 2020-10-22  ...                     NaN                         NaN              NaN                      NaN
52439                     International 2020-10-23  ...                     NaN                         NaN              NaN                      NaN
52440                     International 2020-10-24  ...                     NaN                         NaN              NaN                      NaN
52441                     International 2020-10-25  ...                     NaN                         NaN              NaN                      NaN

[52442 rows x 41 columns]

Compare to reading directly with pandas:

In [8]: df_pandas = pd.read_csv("owid-covid-data.csv")

In [9]: df_pandas
Out[9]:
      iso_code continent       location        date  ...  handwashing_facilities  hospital_beds_per_thousand  life_expectancy  human_development_index
0          AFG      Asia    Afghanistan  2019-12-31  ...                  37.746                         0.5            64.83                    0.498
1          AFG      Asia    Afghanistan  2020-01-01  ...                  37.746                         0.5            64.83                    0.498
2          AFG      Asia    Afghanistan  2020-01-02  ...                  37.746                         0.5            64.83                    0.498
3          AFG      Asia    Afghanistan  2020-01-03  ...                  37.746                         0.5            64.83                    0.498
4          AFG      Asia    Afghanistan  2020-01-04  ...                  37.746                         0.5            64.83                    0.498
...        ...       ...            ...         ...  ...                     ...                         ...              ...                      ...
52437      NaN       NaN  International  2020-10-21  ...                     NaN                         NaN              NaN                      NaN
52438      NaN       NaN  International  2020-10-22  ...                     NaN                         NaN              NaN                      NaN
52439      NaN       NaN  International  2020-10-23  ...                     NaN                         NaN              NaN                      NaN
52440      NaN       NaN  International  2020-10-24  ...                     NaN                         NaN              NaN                      NaN
52441      NaN       NaN  International  2020-10-25  ...                     NaN                         NaN              NaN                      NaN

[52442 rows x 41 columns]

danielnaab · 2020-10-26T15:21:42Z

Hi @ihnorton thanks for the clarification and example! The use case I'm evaluating TileDB for involves lots of unpredictable data, so that would be a very welcome feature addition. By next core release, do you mean the next minor or major release? ie, TileDB 2.2 or 3.x?

ihnorton · 2020-10-26T15:50:04Z

TileDB 2.2, which we are aiming to release by mid-November (@joe-maley is pretty far along with the project and I believe we are going to start testing the branch with Java and MariaDB this week, followed by Python).

Would you be up for a call to discuss? We really like to have several driving external users for new features, to make sure we satisfy your use-case (as well as provide high-bandwidth fixes and updates if needed when it is first released). If so, please let us know at hello@tiledb.com.

ihnorton · 2022-02-02T14:55:40Z

It is now possible to use "string" as a dtype specifier for tiledb.from_csv, which will read the specified columns into a nullable string (represented as pandas StringDtype on read-back). However, the column type must still be specified manually (requires pandas-dev/pandas#29752).

Example (using TileDB-Py 0.12.1):

In [1]: import pandas as pd, numpy as np, tiledb

In [2]: tiledb.from_csv('owid-covid-data.tldb', 'owid-covid-data.csv', column_types={'iso_code': "string", 'continent': "string", 'location': "string", 'tes
   ...: ts_units': "string"}, parse_dates=['date'], fillna={'iso_code': '', 'continent': '', 'location': '', 'tests_units': ''})

In [3]: with tiledb.open("owid-covid-data.tldb") as A:
   ...:     df = A.df[:]
   ...:

In [4]: df.dtypes
Out[4]:
iso_code                                           string
continent                                          string
location                                           string
date                                       datetime64[ns]
total_cases                                       float64
                                                ...
human_development_index                           float64
excess_mortality_cumulative_absolute              float64
excess_mortality_cumulative                       float64
excess_mortality                                  float64
excess_mortality_cumulative_per_million           float64
Length: 67, dtype: object

(please ping if needed and happen to re-open or open another issue)

ihnorton closed this as completed Feb 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Import CSV with NULL values #401

Import CSV with NULL values #401

danielnaab commented Oct 25, 2020

ihnorton commented Oct 25, 2020

danielnaab commented Oct 26, 2020

ihnorton commented Oct 26, 2020

ihnorton commented Feb 2, 2022 •

edited

Loading

Import CSV with NULL values #401

Import CSV with NULL values #401

Comments

danielnaab commented Oct 25, 2020

ihnorton commented Oct 25, 2020

danielnaab commented Oct 26, 2020

ihnorton commented Oct 26, 2020

ihnorton commented Feb 2, 2022 • edited Loading

ihnorton commented Feb 2, 2022 •

edited

Loading