.. currentmodule:: pandas .. ipython:: python :suppress: import os import csv from pandas.compat import StringIO, BytesIO import pandas as pd ExcelWriter = pd.ExcelWriter import sys reload(sys) # Reload does the trick! sys.setdefaultencoding('UTF8') import numpy as np np.random.seed(123456) randn = np.random.randn np.set_printoptions(precision=4, suppress=True) import matplotlib.pyplot as plt plt.close('all') import pandas.util.testing as tm pd.options.display.max_rows=15 clipdf = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':['p','q','r']}, index=['x','y','z']) .. _io.excel: Excel files ----------- The :func:`~pandas.read_excel` method can read Excel 2003 (``.xls``) and Excel 2007+ (``.xlsx``) files using the ``xlrd`` Python module. The :meth:`~DataFrame.to_excel` instance method is used for saving a ``DataFrame`` to Excel. Generally the semantics are similar to working with :ref:`csv` data. See the :ref:`cookbook` for some advanced strategies .. _io.excel_reader: Reading Excel Files ''''''''''''''''''' In the most basic use-case, ``read_excel`` takes a path to an Excel file, and the ``sheetname`` indicating which sheet to parse. .. code-block:: python # Returns a DataFrame read_excel('path_to_file.xls', sheetname='Sheet1') .. _io.excel.excelfile_class: ``ExcelFile`` class +++++++++++++++++++ To facilitate working with multiple sheets from the same file, the ``ExcelFile`` class can be used to wrap the file and can be be passed into ``read_excel`` There will be a performance benefit for reading multiple sheets as the file is read into memory only once. .. code-block:: python xlsx = pd.ExcelFile('path_to_file.xls) df = pd.read_excel(xlsx, 'Sheet1') The ``ExcelFile`` class can also be used as a context manager. .. code-block:: python with pd.ExcelFile('path_to_file.xls') as xls: df1 = pd.read_excel(xls, 'Sheet1') df2 = pd.read_excel(xls, 'Sheet2') The ``sheet_names`` property will generate a list of the sheet names in the file. The primary use-case for an ``ExcelFile`` is parsing multiple sheets with different parameters .. code-block:: python data = {} # For when Sheet1's format differs from Sheet2 with pd.ExcelFile('path_to_file.xls') as xls: data['Sheet1'] = pd.read_excel(xls, 'Sheet1', index_col=None, na_values=['NA']) data['Sheet2'] = pd.read_excel(xls, 'Sheet2', index_col=1) Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to ``read_excel`` with no loss in performance. .. code-block:: python # using the ExcelFile class data = {} with pd.ExcelFile('path_to_file.xls') as xls: data['Sheet1'] = read_excel(xls, 'Sheet1', index_col=None, na_values=['NA']) data['Sheet2'] = read_excel(xls, 'Sheet2', index_col=None, na_values=['NA']) # equivalent using the read_excel function data = read_excel('path_to_file.xls', ['Sheet1', 'Sheet2'], index_col=None, na_values=['NA']) .. versionadded:: 0.12 ``ExcelFile`` has been moved to the top level namespace. .. versionadded:: 0.17 ``read_excel`` can take an ``ExcelFile`` object as input .. _io.excel.specifying_sheets: Specifying Sheets +++++++++++++++++ .. note :: The second argument is ``sheetname``, not to be confused with ``ExcelFile.sheet_names`` .. note :: An ExcelFile's attribute ``sheet_names`` provides access to a list of sheets. - The arguments ``sheetname`` allows specifying the sheet or sheets to read. - The default value for ``sheetname`` is 0, indicating to read the first sheet - Pass a string to refer to the name of a particular sheet in the workbook. - Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0. - Pass a list of either strings or integers, to return a dictionary of specified sheets. - Pass a ``None`` to return a dictionary of all available sheets. .. code-block:: python # Returns a DataFrame read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA']) Using the sheet index: .. code-block:: python # Returns a DataFrame read_excel('path_to_file.xls', 0, index_col=None, na_values=['NA']) Using all default values: .. code-block:: python # Returns a DataFrame read_excel('path_to_file.xls') Using None to get all sheets: .. code-block:: python # Returns a dictionary of DataFrames read_excel('path_to_file.xls',sheetname=None) Using a list to get multiple sheets: .. code-block:: python # Returns the 1st and 4th sheet, as a dictionary of DataFrames. read_excel('path_to_file.xls',sheetname=['Sheet1',3]) .. versionadded:: 0.16 ``read_excel`` can read more than one sheet, by setting ``sheetname`` to either a list of sheet names, a list of sheet positions, or ``None`` to read all sheets. .. versionadded:: 0.13 Sheets can be specified by sheet index or sheet name, using an integer or string, respectively. .. _io.excel.reading_multiindex: Reading a ``MultiIndex`` ++++++++++++++++++++++++ .. versionadded:: 0.17 ``read_excel`` can read a ``MultiIndex`` index, by passing a list of columns to ``index_col`` and a ``MultiIndex`` column by passing a list of rows to ``header``. If either the ``index`` or ``columns`` have serialized level names those will be read in as well by specifying the rows/columns that make up the levels. For example, to read in a ``MultiIndex`` index without names: .. ipython:: python df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,6,7,8]}, index=pd.MultiIndex.from_product([['a','b'],['c','d']])) df.to_excel('path_to_file.xlsx') df = pd.read_excel('path_to_file.xlsx', index_col=[0,1]) df If the index has level names, they will parsed as well, using the same parameters. .. ipython:: python df.index = df.index.set_names(['lvl1', 'lvl2']) df.to_excel('path_to_file.xlsx') df = pd.read_excel('path_to_file.xlsx', index_col=[0,1]) df If the source file has both ``MultiIndex`` index and columns, lists specifying each should be passed to ``index_col`` and ``header`` .. ipython:: python df.columns = pd.MultiIndex.from_product([['a'],['b', 'd']], names=['c1', 'c2']) df.to_excel('path_to_file.xlsx') df = pd.read_excel('path_to_file.xlsx', index_col=[0,1], header=[0,1]) df .. ipython:: python :suppress: import os os.remove('path_to_file.xlsx') .. warning:: Excel files saved in version 0.16.2 or prior that had index names will still able to be read in, but the ``has_index_names`` argument must specified to ``True``. Parsing Specific Columns ++++++++++++++++++++++++ It is often the case that users will insert columns to do temporary computations in Excel and you may not want to read in those columns. `read_excel` takes a `parse_cols` keyword to allow you to specify a subset of columns to parse. If `parse_cols` is an integer, then it is assumed to indicate the last column to be parsed. .. code-block:: python read_excel('path_to_file.xls', 'Sheet1', parse_cols=2) If `parse_cols` is a list of integers, then it is assumed to be the file column indices to be parsed. .. code-block:: python read_excel('path_to_file.xls', 'Sheet1', parse_cols=[0, 2, 3]) Cell Converters +++++++++++++++ It is possible to transform the contents of Excel cells via the `converters` option. For instance, to convert a column to boolean: .. code-block:: python read_excel('path_to_file.xls', 'Sheet1', converters={'MyBools': bool}) This options handles missing values and treats exceptions in the converters as missing data. Transformations are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype: .. code-block:: python cfun = lambda x: int(x) if x else -1 read_excel('path_to_file.xls', 'Sheet1', converters={'MyInts': cfun}) .. _io.excel_writer: Writing Excel Files ''''''''''''''''''' Writing Excel Files to Disk +++++++++++++++++++++++++++ To write a DataFrame object to a sheet of an Excel file, you can use the ``to_excel`` instance method. The arguments are largely the same as ``to_csv`` described above, the first argument being the name of the excel file, and the optional second argument the name of the sheet to which the DataFrame should be written. For example: .. code-block:: python df.to_excel('path_to_file.xlsx', sheet_name='Sheet1') Files with a ``.xls`` extension will be written using ``xlwt`` and those with a ``.xlsx`` extension will be written using ``xlsxwriter`` (if available) or ``openpyxl``. The DataFrame will be written in a way that tries to mimic the REPL output. One difference from 0.12.0 is that the ``index_label`` will be placed in the second row instead of the first. You can get the previous behaviour by setting the ``merge_cells`` option in ``to_excel()`` to ``False``: .. code-block:: python df.to_excel('path_to_file.xlsx', index_label='label', merge_cells=False) The Panel class also has a ``to_excel`` instance method, which writes each DataFrame in the Panel to a separate sheet. In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an :class:`~pandas.io.excel.ExcelWriter`. .. code-block:: python with ExcelWriter('path_to_file.xlsx') as writer: df1.to_excel(writer, sheet_name='Sheet1') df2.to_excel(writer, sheet_name='Sheet2') .. note:: Wringing a little more performance out of ``read_excel`` Internally, Excel stores all numeric data as floats. Because this can produce unexpected behavior when reading in data, pandas defaults to trying to convert integers to floats if it doesn't lose information (``1.0 --> 1``). You can pass ``convert_float=False`` to disable this behavior, which may give a slight performance improvement. .. _io.excel_writing_buffer: Writing Excel Files to Memory +++++++++++++++++++++++++++++ .. versionadded:: 0.17 Pandas supports writing Excel files to buffer-like objects such as ``StringIO`` or ``BytesIO`` using :class:`~pandas.io.excel.ExcelWriter`. .. versionadded:: 0.17 Added support for Openpyxl >= 2.2 .. code-block:: python # Safe import for either Python 2.x or 3.x try: from io import BytesIO except ImportError: from cStringIO import StringIO as BytesIO bio = BytesIO() # By setting the 'engine' in the ExcelWriter constructor. writer = ExcelWriter(bio, engine='xlsxwriter') df.to_excel(writer, sheet_name='Sheet1') # Save the workbook writer.save() # Seek to the beginning and read to copy the workbook to a variable in memory bio.seek(0) workbook = bio.read() .. note:: ``engine`` is optional but recommended. Setting the engine determines the version of workbook produced. Setting ``engine='xlrd'`` will produce an Excel 2003-format workbook (xls). Using either ``'openpyxl'`` or ``'xlsxwriter'`` will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted workbook is produced. .. _io.excel.writers: Excel writer engines '''''''''''''''''''' .. versionadded:: 0.13 ``pandas`` chooses an Excel writer via two methods: 1. the ``engine`` keyword argument 2. the filename extension (via the default specified in config options) By default, ``pandas`` uses the `XlsxWriter`_ for ``.xlsx`` and `openpyxl`_ for ``.xlsm`` files and `xlwt`_ for ``.xls`` files. If you have multiple engines installed, you can set the default engine through :ref:`setting the config options ` ``io.excel.xlsx.writer`` and ``io.excel.xls.writer``. pandas will fall back on `openpyxl`_ for ``.xlsx`` files if `Xlsxwriter`_ is not available. .. _XlsxWriter: http://xlsxwriter.readthedocs.org .. _openpyxl: http://openpyxl.readthedocs.org/ .. _xlwt: http://www.python-excel.org To specify which writer you want to use, you can pass an engine keyword argument to ``to_excel`` and to ``ExcelWriter``. The built-in engines are: - ``openpyxl``: This includes stable support for Openpyxl from 1.6.1. However, it is advised to use version 2.2 and higher, especially when working with styles. - ``xlsxwriter`` - ``xlwt`` .. code-block:: python # By setting the 'engine' in the DataFrame and Panel 'to_excel()' methods. df.to_excel('path_to_file.xlsx', sheet_name='Sheet1', engine='xlsxwriter') # By setting the 'engine' in the ExcelWriter constructor. writer = ExcelWriter('path_to_file.xlsx', engine='xlsxwriter') # Or via pandas configuration. from pandas import options options.io.excel.xlsx.writer = 'xlsxwriter' df.to_excel('path_to_file.xlsx', sheet_name='Sheet1')