10.5 Excel files
The read_excel()
method can read Excel 2003 (.xls
) and
Excel 2007+ (.xlsx
) files using the xlrd
Python
module. The to_excel()
instance method is used for
saving a DataFrame
to Excel. Generally the semantics are
similar to working with csv data. See the cookbook for some
advanced strategies
10.5.1 Reading Excel Files
In the most basic use-case, read_excel
takes a path to an Excel
file, and the sheetname
indicating which sheet to parse.
# Returns a DataFrame
read_excel('path_to_file.xls', sheetname='Sheet1')
10.5.1.1 ExcelFile
class
To facilitate working with multiple sheets from the same file, the ExcelFile
class can be used to wrap the file and can be be passed into read_excel
There will be a performance benefit for reading multiple sheets as the file is
read into memory only once.
xlsx = pd.ExcelFile('path_to_file.xls)
df = pd.read_excel(xlsx, 'Sheet1')
The ExcelFile
class can also be used as a context manager.
with pd.ExcelFile('path_to_file.xls') as xls:
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
The sheet_names
property will generate
a list of the sheet names in the file.
The primary use-case for an ExcelFile
is parsing multiple sheets with
different parameters
data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile('path_to_file.xls') as xls:
data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None, na_values=['NA'])
data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1)
Note that if the same parsing parameters are used for all sheets, a list
of sheet names can simply be passed to read_excel
with no loss in performance.
# using the ExcelFile class
data = {}
with pd.ExcelFile('path_to_file.xls') as xls:
data['Sheet1'] = read_excel(xls, 'Sheet1', index_col=None, na_values=['NA'])
data['Sheet2'] = read_excel(xls, 'Sheet2', index_col=None, na_values=['NA'])
# equivalent using the read_excel function
data = read_excel('path_to_file.xls', ['Sheet1', 'Sheet2'], index_col=None, na_values=['NA'])
New in version 0.12.
ExcelFile
has been moved to the top level namespace.
New in version 0.17.
read_excel
can take an ExcelFile
object as input
10.5.1.2 Specifying Sheets
Note
The second argument is sheetname
, not to be confused with ExcelFile.sheet_names
Note
An ExcelFile’s attribute sheet_names
provides access to a list of sheets.
- The arguments
sheetname
allows specifying the sheet or sheets to read. - The default value for
sheetname
is 0, indicating to read the first sheet - Pass a string to refer to the name of a particular sheet in the workbook.
- Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0.
- Pass a list of either strings or integers, to return a dictionary of specified sheets.
- Pass a
None
to return a dictionary of all available sheets.
# Returns a DataFrame
read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])
Using the sheet index:
# Returns a DataFrame
read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA'])
Using all default values:
# Returns a DataFrame
read_excel('path_to_file.xls')
Using None to get all sheets:
# Returns a dictionary of DataFrames
read_excel('path_to_file.xls',sheetname=None)
Using a list to get multiple sheets:
# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
read_excel('path_to_file.xls',sheetname=['Sheet1',3])
New in version 0.16.
read_excel
can read more than one sheet, by setting sheetname
to either
a list of sheet names, a list of sheet positions, or None
to read all sheets.
New in version 0.13.
Sheets can be specified by sheet index or sheet name, using an integer or string, respectively.
10.5.1.3 Reading a MultiIndex
New in version 0.17.
read_excel
can read a MultiIndex
index, by passing a list of columns to index_col
and a MultiIndex
column by passing a list of rows to header
. If either the index
or columns
have serialized level names those will be read in as well by specifying
the rows/columns that make up the levels.
For example, to read in a MultiIndex
index without names:
In [1]: df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]},
...: index=pd.MultiIndex.from_product([['a','b'],['c','d']]))
...:
In [2]: df.to_excel('path_to_file.xlsx')
In [3]: df = pd.read_excel('path_to_file.xlsx', index_col=[0,1])
In [4]: df
Out[4]:
a b
a c 1 5
d 2 6
b c 3 7
d 4 8
If the index has level names, they will parsed as well, using the same parameters.
In [5]: df.index = df.index.set_names(['lvl1', 'lvl2'])
In [6]: df.to_excel('path_to_file.xlsx')
In [7]: df = pd.read_excel('path_to_file.xlsx', index_col=[0,1])
In [8]: df
Out[8]:
a b
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
If the source file has both MultiIndex
index and columns, lists specifying each
should be passed to index_col
and header
In [9]: df.columns = pd.MultiIndex.from_product([['a'],['b', 'd']], names=['c1', 'c2'])
In [10]: df.to_excel('path_to_file.xlsx')
In [11]: df = pd.read_excel('path_to_file.xlsx',
....: index_col=[0,1], header=[0,1])
....:
In [12]: df
Out[12]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
Warning
Excel files saved in version 0.16.2 or prior that had index names will still able to be read in,
but the has_index_names
argument must specified to True
.
10.5.1.4 Parsing Specific Columns
It is often the case that users will insert columns to do temporary computations in Excel and you may not want to read in those columns. read_excel takes a parse_cols keyword to allow you to specify a subset of columns to parse.
If parse_cols is an integer, then it is assumed to indicate the last column to be parsed.
read_excel('path_to_file.xls', 'Sheet1', parse_cols=2)
If parse_cols is a list of integers, then it is assumed to be the file column indices to be parsed.
read_excel('path_to_file.xls', 'Sheet1', parse_cols=[0, 2, 3])
10.5.1.5 Cell Converters
It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a column to boolean:
read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool})
This options handles missing values and treats exceptions in the converters as missing data. Transformations are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype:
cfun = lambda x: int(x) if x else -1
read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun})
10.5.2 Writing Excel Files
10.5.2.1 Writing Excel Files to Disk
To write a DataFrame object to a sheet of an Excel file, you can use the
to_excel
instance method. The arguments are largely the same as to_csv
described above, the first argument being the name of the excel file, and the
optional second argument the name of the sheet to which the DataFrame should be
written. For example:
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')
Files with a .xls
extension will be written using xlwt
and those with a
.xlsx
extension will be written using xlsxwriter
(if available) or
openpyxl
.
The DataFrame will be written in a way that tries to mimic the REPL output. One
difference from 0.12.0 is that the index_label
will be placed in the second
row instead of the first. You can get the previous behaviour by setting the
merge_cells
option in to_excel()
to False
:
df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False)
The Panel class also has a to_excel
instance method,
which writes each DataFrame in the Panel to a separate sheet.
In order to write separate DataFrames to separate sheets in a single Excel file,
one can pass an ExcelWriter
.
with ExcelWriter('path_to_file.xlsx') as writer:
df1.to_excel(writer, sheet_name='Sheet1')
df2.to_excel(writer, sheet_name='Sheet2')
Note
Wringing a little more performance out of read_excel
Internally, Excel stores all numeric data as floats. Because this can
produce unexpected behavior when reading in data, pandas defaults to trying
to convert integers to floats if it doesn’t lose information (1.0 -->
1
). You can pass convert_float=False
to disable this behavior, which
may give a slight performance improvement.
10.5.2.2 Writing Excel Files to Memory
New in version 0.17.
Pandas supports writing Excel files to buffer-like objects such as StringIO
or
BytesIO
using ExcelWriter
.
New in version 0.17.
Added support for Openpyxl >= 2.2
# Safe import for either Python 2.x or 3.x
try:
from io import BytesIO
except ImportError:
from cStringIO import StringIO as BytesIO
bio = BytesIO()
# By setting the 'engine' in the ExcelWriter constructor.
writer = ExcelWriter(bio, engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
# Save the workbook
writer.save()
# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()
Note
engine
is optional but recommended. Setting the engine determines
the version of workbook produced. Setting engine='xlrd'
will produce an
Excel 2003-format workbook (xls). Using either 'openpyxl'
or
'xlsxwriter'
will produce an Excel 2007-format workbook (xlsx). If
omitted, an Excel 2007-formatted workbook is produced.
10.5.3 Excel writer engines
New in version 0.13.
pandas
chooses an Excel writer via two methods:
- the
engine
keyword argument - the filename extension (via the default specified in config options)
By default, pandas
uses the XlsxWriter for .xlsx
and openpyxl
for .xlsm
files and xlwt for .xls
files. If you have multiple
engines installed, you can set the default engine through setting the
config options io.excel.xlsx.writer
and
io.excel.xls.writer
. pandas will fall back on openpyxl for .xlsx
files if Xlsxwriter is not available.
To specify which writer you want to use, you can pass an engine keyword
argument to to_excel
and to ExcelWriter
. The built-in engines are:
openpyxl
: This includes stable support for Openpyxl from 1.6.1. However, it is advised to use version 2.2 and higher, especially when working with styles.xlsxwriter
xlwt
# By setting the 'engine' in the DataFrame and Panel 'to_excel()' methods.
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1', engine='xlsxwriter')
# By setting the 'engine' in the ExcelWriter constructor.
writer = ExcelWriter('path_to_file.xlsx', engine='xlsxwriter')
# Or via pandas configuration.
from pandas import options
options.io.excel.xlsx.writer = 'xlsxwriter'
df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')