Openpyxl engine for reading excel files #25092

Merged
merged 87 commits into from
Jun 28, 2019
Changes from 14 commits
Commits
e29b4c0
prepare testing reading excel files with multiple engines
tdamsma Feb 2, 2019
e0199a8
add openpyxl tests
tdamsma Feb 2, 2019
ce4eb01
implement first version of openpyxl reader
tdamsma Feb 2, 2019
b25877e
pep8 issues
tdamsma Feb 2, 2019
821fa4d
suppress openpyxl warnings
tdamsma Feb 2, 2019
4694668
add code for all edge cases that are tested for. Unfortunately got pr…
tdamsma Feb 7, 2019
712f1ef
formatting
tdamsma Feb 7, 2019
1d49a0e
Merge commit '683c7b55f5195fdf4f524239066cbf6f1301f0e7' into openpyxl…
tdamsma Feb 7, 2019
1473c0e
improve docstring
tdamsma Feb 7, 2019
6e8ffba
also test openpyxl reader for .xlsm files
tdamsma Feb 7, 2019
d57dfc1
explicitly use 64bit floats and ints
tdamsma Feb 7, 2019
e984f6b
Merge commit '6359bbc4c9ce6dd05bc8b422641cda74871cde43' into openpyxl…
tdamsma Feb 11, 2019
44f7af2
formatting
tdamsma Feb 11, 2019
98d3865
skip TestOpenpyxlReader when openpyxl is not installed
tdamsma Feb 11, 2019
d0188ba
Attempt to generalize _XlrdReader __init__ and move it to _BaseExcelR…
tdamsma Feb 12, 2019
205d52b
Merge commit 'f4568fd76e864d8aee3d23f5a81302262d6e0dcb' into openpyxl…
tdamsma Feb 20, 2019
7b550bf
register openpyxl writer engine, fix imports
tdamsma Feb 26, 2019
875de8d
import type_error explicitly
tdamsma Feb 26, 2019
12ad6d8
Merge branch 'master' into openpyxl-reader
tdamsma Mar 11, 2019
dfd6a36
Merge branch 'master' into openpyxl-reader
tdamsma Mar 19, 2019
fef7233
Merge branch 'master' into openpyxl-reader
tdamsma Apr 20, 2019
eaafd5f
get rid of some py2 compatibility legacy
tdamsma Apr 21, 2019
8d2db02
Merge branch 'master' into openpyxl-reader
tdamsma Apr 22, 2019
13e7793
fix some type chcking
tdamsma Apr 22, 2019
b053cce
linting
tdamsma Apr 22, 2019
fe4dd73
see if this works on linux
tdamsma Apr 22, 2019
64e5f2d
run isort on _openpyxl.py
tdamsma Apr 22, 2019
99b2cad
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 23, 2019
ce5ac05
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 23, 2019
c7895ea
Merge remote-tracking branch 'pandas/master' into openpyxl-reader
tdamsma Apr 27, 2019
2ca9368
refactor handling of sheet_name keyword
tdamsma Apr 27, 2019
5fb1aef
extract code to parse a single sheet to a method
tdamsma Apr 27, 2019
537dd0c
extract handling of header keywords
tdamsma Apr 27, 2019
44cddc5
extract handling of convert_float keyword to method
tdamsma Apr 27, 2019
e4c8f23
extract handling of index_col to method
tdamsma Apr 27, 2019
daff364
extract handling of usecols keyword to method
tdamsma Apr 27, 2019
1224918
remove redundant code
tdamsma Apr 27, 2019
1bfc030
Merge remote-tracking branch 'upstream/master' into excel-read-shared…
tdamsma Apr 28, 2019
747311e
Merge branch 'master' into excel-read-shared-init-to-baseclass
tdamsma Apr 28, 2019
a77a4c7
implement suggestions @WillAyd
tdamsma Apr 29, 2019
ddcaad8
Merge remote-tracking branch 'upstream/master' into excel-read-shared…
tdamsma Apr 29, 2019
757235d
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 29, 2019
cdd627f
remove _engine keyword altogether
tdamsma Apr 29, 2019
0b58109
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 29, 2019
45f21f8
Clean up __init__
tdamsma Apr 29, 2019
e97d029
Implement work around for Linux py35_compat import error
tdamsma Apr 29, 2019
1edae5e
fix regression for reading s3 files
tdamsma Apr 30, 2019
a69e104
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 30, 2019
f5f40e4
expand code highlighting the weirdness of a failing/skipped test.
tdamsma Apr 30, 2019
22e24bb
remove _engine keyword altogether
tdamsma Apr 29, 2019
903b188
fix regression for reading s3 files
tdamsma Apr 30, 2019
1b3ae99
Merge branch 'excel-read-shared-init-to-baseclass' into openpyxl-reader
tdamsma Apr 30, 2019
02e19a8
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 30, 2019
3e18f97
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Apr 30, 2019
d11956c
remove accidental commit
tdamsma May 1, 2019
61d7a3f
ditch some code
tdamsma May 1, 2019
13d41b2
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
tdamsma Jun 10, 2019
97c85f5
remove skips for openpyxl for tests that should pass
tdamsma Jun 11, 2019
614d972
Add `by_blocks=True` to failing `assert_frame_equal` tests, as per @W…
tdamsma Jun 13, 2019
d87d9c0
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
WillAyd Jun 27, 2019
7348b0c
Updated import machinery
WillAyd Jun 27, 2019
c1a1792
Cleaned up nan replacement
WillAyd Jun 27, 2019
d72ca5a
Simplified introspection
WillAyd Jun 27, 2019
0bba345
Used common renaming method
WillAyd Jun 27, 2019
8dd8bf6
Reverted some test changes
WillAyd Jun 27, 2019
eaaa680
Reset yield statement
WillAyd Jun 27, 2019
6bf5183
Better missing label handling
WillAyd Jun 27, 2019
a06bf9b
Aligned implementation with base
WillAyd Jun 27, 2019
f43e90f
Fix bool handling
WillAyd Jun 27, 2019
8fabe0a
Fixed 0 handling
WillAyd Jun 27, 2019
0ff5ce3
Aligned float handling with xlrd
WillAyd Jun 27, 2019
fb73692
xfailed overflow test
WillAyd Jun 27, 2019
17b1d73
lint and isort fixup
WillAyd Jun 27, 2019
3d248ed
Removed by_blocks
WillAyd Jun 27, 2019
c369fd8
Revert "Reverted some test changes"
tdamsma Jun 28, 2019
70b15a4
use readonly mode. Should be more performant and also this ignores Me…
tdamsma Jun 28, 2019
a3a3bca
formatting issues
tdamsma Jun 28, 2019
fcd43f0
handle datetime cells explicitly for openpyxl < 2.5.0 compatibility
tdamsma Jun 28, 2019
d9c1fa6
type fixup
WillAyd Jun 28, 2019
3c239a4
whatsnew
WillAyd Jun 28, 2019
4a25a5a
Removed np.nan from Scalar
WillAyd Jun 28, 2019
6258e59
revert test_reader changes again. Not needed anymore because of using…
tdamsma Jun 28, 2019
00f34b1
more types and whitespace cleanup
WillAyd Jun 28, 2019
a1fba90
Added config for excel reader. Not sure how to test this
tdamsma Jun 28, 2019
88ee325
whatsnew
WillAyd Jun 28, 2019
837ce26
Merge remote-tracking branch 'upstream/master' into openpyxl-reader
WillAyd Jun 28, 2019
dddc8c5
Regenerated test1 files
WillAyd Jun 28, 2019
2 changes: 1 addition & 1 deletion pandas/io/excel/_base.py
@@ -400,7 +400,7 @@ def parse(self,
             data = self.get_sheet_data(sheet, convert_float)
             usecols = _maybe_convert_usecols(usecols)

-            if sheet.nrows == 0:
+            if not data:
                 output[asheetname] = DataFrame()
                 continue

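The change from `sheet.nrows == 0` to `not data` makes the emptiness check depend only on the rows returned by `get_sheet_data`, not on an attribute of the xlrd sheet object. A minimal sketch of that idea, assuming rows arrive as a list of lists; `is_empty_sheet` is a hypothetical helper, and the `[[None]]` case mirrors the extra check the openpyxl reader performs further down in this diff:

```python
def is_empty_sheet(data):
    """Return True when a sheet yielded no usable rows.

    Hypothetical helper: `data` is assumed to be a list of rows, each row a
    list of cell values, as returned by a reader's get_sheet_data().
    """
    # An empty list means no rows at all; [[None]] is treated as a blank
    # sheet that was reported as a single empty cell.
    return not data or data == [[None]]


print(is_empty_sheet([]))                    # no rows
print(is_empty_sheet([[None]]))              # single empty cell
print(is_empty_sheet([["a", "b"], [1, 2]]))  # header plus one data row
```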
266 changes: 265 additions & 1 deletion pandas/io/excel/_openpyxl.py
@@ -1,5 +1,16 @@
-from pandas.io.excel._base import ExcelWriter
+from collections import OrderedDict
+from io import BytesIO
+
+import pandas.compat as compat
+from pandas.core.dtypes.common import is_integer, is_list_like
+from pandas.core.frame import DataFrame
+from pandas.io.common import (_is_url, _urlopen, _validate_header_arg,
+                              get_filepath_or_buffer)
+from pandas.io.excel._base import (ExcelFile, ExcelWriter, _BaseExcelReader,
+                                   _fill_mi_header, _maybe_convert_to_string,
+                                   _maybe_convert_usecols, _pop_header_name)
 from pandas.io.excel._util import _validate_freeze_panes
+from pandas.io.parsers import _validate_usecols_arg, _validate_usecols_names


 class _OpenpyxlWriter(ExcelWriter):
@@ -451,3 +462,256 @@ def write_cells(self, cells, sheet_name=None, startrow=0, startcol=0,
                             xcell = wks.cell(column=col, row=row)
                             for k, v in style_kwargs.items():
                                 setattr(xcell, k, v)
+
+
+class _OpenpyxlReader(_BaseExcelReader):
+
+    def __init__(self, filepath_or_buffer):
+        """Reader using openpyxl engine.
+
+        Parameters
+        ----------
+        filepath_or_buffer : string, path object or Workbook
+            Object to be parsed.
+        """
+        err_msg = "Install xlrd >= 1.0.0 for Excel support"
+
+        try:
+            import openpyxl
+        except ImportError:
+            raise ImportError(err_msg)
+
+        # If filepath_or_buffer is a url, want to keep the data as bytes so
+        # can't pass to get_filepath_or_buffer()
+        if _is_url(filepath_or_buffer):
+            filepath_or_buffer = BytesIO(_urlopen(filepath_or_buffer).read())
+        elif not isinstance(filepath_or_buffer,
+                            (ExcelFile, openpyxl.Workbook)):
+            filepath_or_buffer, _, _, _ = get_filepath_or_buffer(
+                filepath_or_buffer)
+
+        if isinstance(filepath_or_buffer, openpyxl.Workbook):
+            self.book = filepath_or_buffer
+        elif hasattr(filepath_or_buffer, "read"):
+            if hasattr(filepath_or_buffer, 'seek'):
+                filepath_or_buffer.seek(0)
+            self.book = openpyxl.load_workbook(
+                filepath_or_buffer, data_only=True)
+        elif isinstance(filepath_or_buffer, compat.string_types):
+            self.book = openpyxl.load_workbook(
+                filepath_or_buffer, data_only=True)
+        else:
+            raise ValueError('Must explicitly set engine if not passing in'
+                             ' buffer or path for io.')
+
+    @property
+    def sheet_names(self):
+        return self.book.sheetnames
+
+    def get_sheet_by_name(self, name):
+        return self.book[name]
+
+    def get_sheet_by_index(self, index):
+        return self.book.worksheets[index]
+
+    @staticmethod
+    def _replace_type_error_with_nan(rows):
+        nan = float('nan')
+        for row in rows:
+            yield [nan
+                   if cell.data_type == cell.TYPE_ERROR
+                   else cell.value
+                   for cell in row]
+
+    def get_sheet_data(self, sheet, convert_float):
+        data = self._replace_type_error_with_nan(sheet.rows)
+        # TODO: support using iterator
+        # TODO: don't make strings out of data
+        return list(data)
+
+    def parse(self,
Member

Similar comment as above: it would be preferable not to override this and to leave it in the base class. I've noticed the vast majority of this is simply copy / paste.

Rather indifferent, but if we go the route of cleanup in a follow-up issue, then we for sure need to consolidate this as well.

Contributor Author

The _BaseExcelReader.parse function is closely coupled to the later call of the TextParser, which makes no sense for an openpyxl-based reader, as it already outputs structured, properly parsed data. Effectively, the parse function for openpyxl does almost the reverse of the base parser. The base parser applies keywords to overcome limitations of xlrd and then converts the data to a dataframe. This parse function first makes a dataframe and then reverse-applies the many keywords that the excel reader supports, to mimic the behaviour and pass all the tests. These fundamental differences in approach make it very difficult to keep the functions generic. The same applies to the init function: it needs, for example, the specified engine to be imported, and becomes very ugly when that is made generic; see the discussion above.

Member

So this is probably related to my earlier question about handle_sheet_name, but it would be preferable if this could be made generic in the base class. Otherwise subclasses have to do a lot more work. Do you see any way to make that possible?
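The contrast discussed in this thread can be sketched without pandas: rows come out of the engine already typed, so keywords are applied to the structured rows afterwards instead of being handed to a text parser. The sketch below is illustrative only; `apply_keywords` is a hypothetical helper, and its handling of `skipfooter` and `usecols` is an assumption that covers a small subset of what `read_excel` accepts.

```python
def apply_keywords(rows, skipfooter=0, usecols=None):
    """Apply a few read_excel-style keywords to already-parsed rows.

    Hypothetical helper: `rows` is a list of lists whose first row holds
    the column names; `usecols` is an optional list of column positions.
    """
    rows = [list(row) for row in rows]  # copy so the input is untouched
    if skipfooter:
        rows = rows[:-skipfooter]
    header, body = rows[0], rows[1:]
    if usecols is not None:
        keep = sorted(usecols)
        header = [header[i] for i in keep]
        body = [[row[i] for i in keep] for row in body]
    return header, body


header, body = apply_keywords(
    [["a", "b", "c"], [1, 2.5, "x"], [3, 4.5, "y"], ["total", None, None]],
    skipfooter=1,
    usecols=[2, 0],
)
print(header)
print(body)
```

Because the cell values are never turned into text, the integer and float cells keep their types through every step.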

+              sheet_name=0,
+              header=0,
+              names=None,
+              index_col=None,
+              usecols=None,
+              squeeze=False,
+              converters=None,
+              dtype=None,
+              true_values=None,
+              false_values=None,
+              skiprows=None,
+              nrows=None,
+              na_values=None,
+              verbose=False,
+              parse_dates=False,
+              date_parser=None,
+              thousands=None,
+              comment=None,
+              skipfooter=0,
+              convert_float=True,
+              mangle_dupe_cols=True,
+              **kwds):
+
+        _validate_header_arg(header)
+
+        ret_dict = False
+
+        # Keep sheetname to maintain backwards compatibility.
+        if isinstance(sheet_name, list):
+            sheets = sheet_name
+            ret_dict = True
+        elif sheet_name is None:
+            sheets = self.sheet_names
+            ret_dict = True
+        else:
+            sheets = [sheet_name]
+
+        # handle same-type duplicates.
+        sheets = list(OrderedDict.fromkeys(sheets).keys())
+
+        output = OrderedDict()
+
+        for asheetname in sheets:
+            if verbose:
Contributor

can you make this a function instead of inlining all of this
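One way to act on this suggestion is to move the per-sheet lookup into its own function. The sketch below assumes a reader object exposing `get_sheet_by_name` and `get_sheet_by_index`, as `_OpenpyxlReader` does in this diff; `lookup_sheet` and `FakeReader` are hypothetical names used only for illustration.

```python
class FakeReader:
    """Stand-in for an Excel reader; holds sheets in insertion order."""

    def __init__(self, sheets):
        self._sheets = dict(sheets)

    def get_sheet_by_name(self, name):
        return self._sheets[name]

    def get_sheet_by_index(self, index):
        return list(self._sheets.values())[index]


def lookup_sheet(reader, sheet, verbose=False):
    """Resolve a sheet given by name (str) or by position (int)."""
    if verbose:
        print("Reading sheet {sheet}".format(sheet=sheet))
    if isinstance(sheet, str):
        return reader.get_sheet_by_name(sheet)
    # Anything that is not a string is assumed to be an integer position.
    return reader.get_sheet_by_index(sheet)


reader = FakeReader([("first", [[1, 2]]), ("second", [[3, 4]])])
print(lookup_sheet(reader, "second"))
print(lookup_sheet(reader, 0))
```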

+                print("Reading sheet {sheet}".format(sheet=asheetname))
+
+            if isinstance(asheetname, compat.string_types):
+                sheet = self.get_sheet_by_name(asheetname)
+            else:  # assume an integer if not a string
+                sheet = self.get_sheet_by_index(asheetname)
+
+            data = self.get_sheet_data(sheet, convert_float)
+            if not data or data == [[None]]:
+                output[asheetname] = DataFrame()
+                continue
+
+            usecols = _maybe_convert_usecols(usecols)
+
+            if is_list_like(header) and len(header) == 1:
+                header = header[0]
+
+            # TODO: scrutinize what is going here
+            # forward fill and pull out names for MultiIndex column
+            header_names = None
+            if header is not None and is_list_like(header):
+                header_names = []
+                control_row = [True] * len(data[0])
+
+                for row in header:
+                    if is_integer(skiprows):
+                        row += skiprows
+
+                    data[row], control_row = _fill_mi_header(data[row],
+                                                             control_row)
+
+                    if index_col is not None:
+                        header_name, _ = _pop_header_name(data[row], index_col)
+                        header_names.append(header_name)
+
+            # TODO: implement whatever this should do
+            # has_index_names = is_list_like(header) and len(header) > 1
+
+            if skiprows:
+                data = [row for i, row in enumerate(data) if i not in skiprows]
+
+            if skipfooter:
+                data = data[:-skipfooter]
+
+            column_names = [cell for i, cell in enumerate(data.pop(0))]
+
+            frame = DataFrame(data, columns=column_names)
+            if usecols:
+                _validate_usecols_arg(usecols)
+                usecols = sorted(usecols)
+                if any(isinstance(i, str) for i in usecols):
+                    _validate_usecols_names(usecols, column_names)
+                    frame = frame[usecols]
+                else:
+                    frame = frame.iloc[:, usecols]
+
+            if not converters:
+                converters = dict()
+            if not dtype:
+                dtype = dict()
+
+            # handle columns referenced by number so all references are by
+            # column name
+            handled_converters = {}
+            for k, v in converters.items():
+                if k not in frame.columns and isinstance(k, int):
+                    k = frame.columns[k]
+                handled_converters[k] = v
+            converters = handled_converters
+
+            # attempt to convert object columns to integer. Only because this
Contributor

pls make helper functions for things; this function is getting too long
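As one possible split, the type coercion in this part of the diff could live in a helper of its own. The plain-Python sketch below mirrors the int-then-float fallback on a list of values rather than on a pandas column; `coerce_column` is a hypothetical name, and the exact rules (for example rejecting booleans and non-integral floats in the integer pass) are assumptions of this sketch, not the behaviour of `astype`.

```python
def coerce_column(values):
    """Try to turn every value into int, then float; else leave as is."""

    def to_int(value):
        # Only accept values that are already integral numbers.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(value)
        if value != int(value):
            raise ValueError(value)
        return int(value)

    def to_float(value):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(value)
        return float(value)

    for convert in (to_int, to_float):
        try:
            return [convert(value) for value in values]
        except (ValueError, TypeError, OverflowError):
            continue
    # Neither conversion applied to the whole column: return a plain copy.
    return list(values)


print(coerce_column([1.0, 2.0]))
print(coerce_column([1, 2.5]))
print(coerce_column([1, "a"]))
```

Keeping the fallback order in a single loop makes it straightforward to unit-test the coercion separately from the rest of `parse`.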

+            # is implicitly done when reading and excel file with xlrd
+            # TODO: question if this should be default behaviour
+            if len(frame) > 0:
+                for column in set(frame) - set(dtype.keys()):
+                    if frame[column].dtype == object:
+                        try:
+                            frame[column] = frame[column].astype('int64')
+                        except (ValueError, TypeError):
+                            try:
+                                frame[column] = frame[column].astype('float64')
+                            except (ValueError, TypeError):
+                                continue
+                    elif (convert_float and
+                          frame[column].dtype == float and
+                          all(frame[column] % 1 == 0)):
+                        frame[column] = frame[column].astype('int64')
+                    elif not convert_float:
+                        if frame[column].dtype == int:
+                            frame[column] = frame[column].astype('float64')
+
+            if converters:
+                for k, v in converters.items():
+                    # for compatibiliy reasons
+                    if frame[k].dtype == float and convert_float:
+                        frame[k] = frame[k].fillna('')
+                    frame[k] = frame[k].apply(v)
+
+            if dtype:
+                for k, v in dtype.items():
+                    frame[k] = frame[k].astype(v)
+
+            if index_col is not None:
+                if is_list_like(index_col):
+                    if any(isinstance(i, str) for i in index_col):
+                        # TODO: see if there is already a method for this in
+                        # pandas.io.parsers
+                        frame = frame.set_index(index_col)
+                        if len(index_col) == 1:
+                            # TODO: understand why this is needed
+                            raise TypeError(
+                                "list indices must be integers.*, not str")
+                    else:
+                        frame = frame.set_index(
+                            [column_names[i] for i in index_col])
+                else:
+                    if isinstance(index_col, str):
+                        frame = frame.set_index(index_col)
+                    else:
+                        frame = frame.set_index(column_names[index_col])
+
+            output[asheetname] = frame
+            if not squeeze or isinstance(output[asheetname], DataFrame):
+                if header_names:
+                    output[asheetname].columns = output[
+                        asheetname].columns.set_names(header_names)
+                elif compat.PY2:
+                    output[asheetname].columns = _maybe_convert_to_string(
+                        output[asheetname].columns)
+
+            # name unnamed columns
+            unnamed = 0
+            for i, col_name in enumerate(frame.columns.values):
+                if col_name is None:
+                    frame.columns.values[i] = "Unnamed: {n}".format(n=unnamed)
+                    unnamed += 1
+
+        if ret_dict:
Contributor

why are you not always returning a dict?
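The question concerns the return-shape convention visible in the diff: a dict is returned when `sheet_name` is a list or `None`, and a single result otherwise. A minimal sketch of that convention; `normalize_sheet_names` and `shape_result` are hypothetical helper names, and `all_names` stands in for the reader's `sheet_names`.

```python
from collections import OrderedDict


def normalize_sheet_names(sheet_name, all_names):
    """Return (sheets, ret_dict) following the branches in the diff."""
    if isinstance(sheet_name, list):
        sheets, ret_dict = sheet_name, True
    elif sheet_name is None:
        sheets, ret_dict = list(all_names), True
    else:
        sheets, ret_dict = [sheet_name], False
    # Drop duplicates while keeping the first occurrence of each entry.
    return list(OrderedDict.fromkeys(sheets)), ret_dict


def shape_result(output, sheets, ret_dict):
    """Return the whole mapping, or just the one requested sheet."""
    return output if ret_dict else output[sheets[-1]]


sheets, ret_dict = normalize_sheet_names(["b", "a", "b"], ["a", "b"])
print(sheets, ret_dict)
sheets, ret_dict = normalize_sheet_names(0, ["a", "b"])
print(shape_result({0: "frame for sheet 0"}, sheets, ret_dict))
```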

+            return output
+        else:
+            return output[asheetname]