Skip to content

Import CSV with NULL values #401

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
danielnaab opened this issue Oct 25, 2020 · 4 comments
Closed

Import CSV with NULL values #401

danielnaab opened this issue Oct 25, 2020 · 4 comments

Comments

@danielnaab
Copy link

I'm having a problem processing CSVs that contains NULL values. For example:

import tiledb
import urllib.request

urllib.request.urlretrieve('https://covid.ourworldindata.org/data/owid-covid-data.csv', 'owid-covid-data.csv')
tiledb.from_csv('owid-covid-data.tldb', 'owid-covid-data.csv')

Produces the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

tiledb/np2buf.pyx in tiledb.libtiledb.array_to_buffer()

tiledb/np2buf.pyx in tiledb.libtiledb._varlen_cell_dtype()

TypeError: Unsupported varlen cell datatype ('<class 'float'>')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-47-38bfbe6aa19e> in <module>
      3 shutil.rmtree('owid-covid-data.tldb', ignore_errors=True)
      4 
----> 5 tiledb.from_csv('owid-covid-data.tldb', 'owid-covid-data.csv')

/private/var/tmp/_bazel_dan/cebcfff554bef27d2846af867a1d4809/execroot/datafixer/bazel-out/darwin-fastbuild/bin/notebooks/launch.runfiles/pip/pypi__tiledb/tiledb/dataframe_.py in from_csv(uri, csv_file, **kwargs)
    646 
    647         kwargs.update(tiledb_args)
--> 648         from_pandas(uri, df, **kwargs)

/private/var/tmp/_bazel_dan/cebcfff554bef27d2846af867a1d4809/execroot/datafixer/bazel-out/darwin-fastbuild/bin/notebooks/launch.runfiles/pip/pypi__tiledb/tiledb/dataframe_.py in from_pandas(uri, dataframe, **kwargs)
    429 
    430                 # TODO ensure correct col/dim ordering
--> 431                 A[tuple(coords)] = write_dict
    432 
    433             else:

tiledb/libtiledb.pyx in tiledb.libtiledb.SparseArrayImpl.__setitem__()

tiledb/libtiledb.pyx in tiledb.libtiledb._write_array()

TypeError: Failed to convert buffer for attribute: 'iso_code'

Is this a naive usage of tiledb.from_csv (ie, I'm doing it wrong), or should I expect it to handle empty values like in this sample CSV?

@ihnorton
Copy link
Member

Hi @danielnaab, nullable support is under active development for the next TileDB core release. Right now,tiledb.from_csv supports a fillna parameter which will apply a fill value before writing the dataframe, see here for example:

https://github.com/TileDB-Inc/TileDB-Py/blob/dev/tiledb/tests/test_pandas_dataframe.py#L506-L509

For your dataset, I tested the following call to successfully write/read a TileDB array from the CSV file:

In [5]: tiledb.from_csv('owid-covid-data.tldb', "owid-covid-data.csv", column_types={'iso_code': np.str, 'continent': np.str, 'location': np.str, 'tests_uni
   ...: ts': np.str}, parse_dates=['date'], fillna={'iso_code': '', 'continent': '', 'location': '', 'tests_units': ''})

In [6]: df_tiledb = tiledb.open_dataframe('owid-covid-data.tldb')

In [7]: df_tiledb
Out[7]:
      iso_code continent       location       date  ...  handwashing_facilities  hospital_beds_per_thousand  life_expectancy  human_development_index
0          AFG      Asia    Afghanistan 2019-12-31  ...                  37.746                         0.5            64.83                    0.498
1          AFG      Asia    Afghanistan 2020-01-01  ...                  37.746                         0.5            64.83                    0.498
2          AFG      Asia    Afghanistan 2020-01-02  ...                  37.746                         0.5            64.83                    0.498
3          AFG      Asia    Afghanistan 2020-01-03  ...                  37.746                         0.5            64.83                    0.498
4          AFG      Asia    Afghanistan 2020-01-04  ...                  37.746                         0.5            64.83                    0.498
...        ...       ...            ...        ...  ...                     ...                         ...              ...                      ...
52437                     International 2020-10-21  ...                     NaN                         NaN              NaN                      NaN
52438                     International 2020-10-22  ...                     NaN                         NaN              NaN                      NaN
52439                     International 2020-10-23  ...                     NaN                         NaN              NaN                      NaN
52440                     International 2020-10-24  ...                     NaN                         NaN              NaN                      NaN
52441                     International 2020-10-25  ...                     NaN                         NaN              NaN                      NaN

[52442 rows x 41 columns]

Compare to reading directly with pandas:

In [8]: df_pandas = pd.read_csv("owid-covid-data.csv")

In [9]: df_pandas
Out[9]:
      iso_code continent       location        date  ...  handwashing_facilities  hospital_beds_per_thousand  life_expectancy  human_development_index
0          AFG      Asia    Afghanistan  2019-12-31  ...                  37.746                         0.5            64.83                    0.498
1          AFG      Asia    Afghanistan  2020-01-01  ...                  37.746                         0.5            64.83                    0.498
2          AFG      Asia    Afghanistan  2020-01-02  ...                  37.746                         0.5            64.83                    0.498
3          AFG      Asia    Afghanistan  2020-01-03  ...                  37.746                         0.5            64.83                    0.498
4          AFG      Asia    Afghanistan  2020-01-04  ...                  37.746                         0.5            64.83                    0.498
...        ...       ...            ...         ...  ...                     ...                         ...              ...                      ...
52437      NaN       NaN  International  2020-10-21  ...                     NaN                         NaN              NaN                      NaN
52438      NaN       NaN  International  2020-10-22  ...                     NaN                         NaN              NaN                      NaN
52439      NaN       NaN  International  2020-10-23  ...                     NaN                         NaN              NaN                      NaN
52440      NaN       NaN  International  2020-10-24  ...                     NaN                         NaN              NaN                      NaN
52441      NaN       NaN  International  2020-10-25  ...                     NaN                         NaN              NaN                      NaN

[52442 rows x 41 columns]

@danielnaab
Copy link
Author

Hi @ihnorton thanks for the clarification and example! The use case I'm evaluating TileDB for involves lots of unpredictable data, so that would be a very welcome feature addition. By next core release, do you mean the next minor or major release? ie, TileDB 2.2 or 3.x?

@ihnorton
Copy link
Member

TileDB 2.2, which we are aiming to release by mid-November (@joe-maley is pretty far along with the project and I believe we are going to start testing the branch with Java and MariaDB this week, followed by Python).

Would you be up for a call to discuss? We really like to have several driving external users for new features, to make sure we satisfy your use-case (as well as provide high-bandwidth fixes and updates if needed when it is first released). If so, please let us know at hello@tiledb.com.

@ihnorton
Copy link
Member

ihnorton commented Feb 2, 2022

It is now possible to use "string" as a dtype specifier for tiledb.from_csv, which will read the specified columns into a nullable string (represented as pandas StringDtype on read-back). However, the column type must still be specified manually (requires pandas-dev/pandas#29752).

Example (using TileDB-Py 0.12.1):

In [1]: import pandas as pd, numpy as np, tiledb

In [2]: tiledb.from_csv('owid-covid-data.tldb', 'owid-covid-data.csv', column_types={'iso_code': "string", 'continent': "string", 'location': "string", 'tes
   ...: ts_units': "string"}, parse_dates=['date'], fillna={'iso_code': '', 'continent': '', 'location': '', 'tests_units': ''})

In [3]: with tiledb.open("owid-covid-data.tldb") as A:
   ...:     df = A.df[:]
   ...:

In [4]: df.dtypes
Out[4]:
iso_code                                           string
continent                                          string
location                                           string
date                                       datetime64[ns]
total_cases                                       float64
                                                ...
human_development_index                           float64
excess_mortality_cumulative_absolute              float64
excess_mortality_cumulative                       float64
excess_mortality                                  float64
excess_mortality_cumulative_per_million           float64
Length: 67, dtype: object

(please ping if needed and happen to re-open or open another issue)

@ihnorton ihnorton closed this as completed Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants