10.4 HTML

10.4.1 Reading HTML Content

Warning

We highly encourage you to read the HTML parsing gotchas regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.

New in version 0.12.0.

The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into list of pandas DataFrames. Let’s look at a few examples.

Note

read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content

Read a URL with no options

In [1]: url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'

In [2]: dfs = pd.read_html(url)

In [3]: dfs
Out[3]: 
[                             Bank Name             City  ST   CERT  \
 0                          Allied Bank         Mulberry  AR     91   
 1         The Woodbury Banking Company         Woodbury  GA  11297   
 2               First CornerStone Bank  King of Prussia  PA  35312   
 3                   Trust Company Bank          Memphis  TN   9956   
 4           North Milwaukee State Bank        Milwaukee  WI  20364   
 5               Hometown National Bank         Longview  WA  35156   
 6                  The Bank of Georgia   Peachtree City  GA  35259   
 ..                                 ...              ...  ..    ...   
 540        Hamilton Bank, NAEn Espanol            Miami  FL  24382   
 541             Sinclair National Bank         Gravette  AR  34248   
 542                 Superior Bank, FSB         Hinsdale  IL  32646   
 543                Malta National Bank            Malta  OH   6629   
 544    First Alliance Bank & Trust Co.       Manchester  NH  34264   
 545  National State Bank of Metropolis       Metropolis  IL   3815   
 546                   Bank of Honolulu         Honolulu  HI  21029   
 
           ...               Updated Date Loss Share Type Agreement Terminated  \
 0         ...         September 29, 2016            none                  NaN   
 1         ...            August 30, 2016            none                  NaN   
 2         ...          September 6, 2016            none                  NaN   
 3         ...          September 6, 2016            none                  NaN   
 4         ...              June 16, 2016            none                  NaN   
 5         ...             April 13, 2016            none                  NaN   
 6         ...             April 13, 2016            none                  NaN   
 ..        ...                        ...             ...                  ...   
 540       ...         September 21, 2015            none                  NaN   
 541       ...          February 10, 2004            none                  NaN   
 542       ...            August 19, 2014            none                  NaN   
 543       ...          November 18, 2002            none                  NaN   
 544       ...          February 18, 2003            none                  NaN   
 545       ...             March 17, 2005            none                  NaN   
 546       ...             March 17, 2005            none                   No   
 
     Termination Date  
 0                NaN  
 1                NaN  
 2                NaN  
 3                NaN  
 4                NaN  
 5                NaN  
 6                NaN  
 ..               ...  
 540              NaN  
 541              NaN  
 542              NaN  
 543              NaN  
 544              NaN  
 545              NaN  
 546              NaN  
 
 [547 rows x 10 columns]]

Note

The data from the above URL changes every Monday so the resulting data above and the data below may be slightly different.

Read in the content of the file from the above URL and pass it to read_html as a string

In [4]: import os

In [5]: file_path = os.path.abspath(os.path.join('source', '_static', 'banklist.html'))
In [6]: with open(file_path, 'r') as f:
   ...:     dfs = pd.read_html(f.read())
   ...: 

In [7]: dfs
Out[7]: 
[                                    Bank Name          City  ST   CERT  \
 0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  WI  35386   
 1                        Central Arizona Bank    Scottsdale  AZ  34527   
 2                                Sunrise Bank      Valdosta  GA  58185   
 3                       Pisgah Community Bank     Asheville  NC  58701   
 4                         Douglas County Bank  Douglasville  GA  21649   
 5                                Parkway Bank        Lenoir  NC  57158   
 6                      Chipola Community Bank      Marianna  FL  58034   
 ..                                        ...           ...  ..    ...   
 499               Hamilton Bank, NAEn Espanol         Miami  FL  24382   
 500                    Sinclair National Bank      Gravette  AR  34248   
 501                        Superior Bank, FSB      Hinsdale  IL  32646   
 502                       Malta National Bank         Malta  OH   6629   
 503           First Alliance Bank & Trust Co.    Manchester  NH  34264   
 504         National State Bank of Metropolis    Metropolis  IL   3815   
 505                          Bank of Honolulu      Honolulu  HI  21029   
 
                    Acquiring Institution       Closing Date       Updated Date  
 0                  North Shore Bank, FSB       May 31, 2013       May 31, 2013  
 1                     Western State Bank       May 14, 2013       May 20, 2013  
 2                           Synovus Bank       May 10, 2013       May 21, 2013  
 3                     Capital Bank, N.A.       May 10, 2013       May 14, 2013  
 4                    Hamilton State Bank     April 26, 2013       May 16, 2013  
 5       CertusBank, National Association     April 26, 2013       May 17, 2013  
 6          First Federal Bank of Florida     April 19, 2013       May 16, 2013  
 ..                                   ...                ...                ...  
 499     Israel Discount Bank of New York   January 11, 2002       June 5, 2012  
 500                   Delta Trust & Bank  September 7, 2001  February 10, 2004  
 501                Superior Federal, FSB      July 27, 2001       June 5, 2012  
 502                    North Valley Bank        May 3, 2001  November 18, 2002  
 503  Southern New Hampshire Bank & Trust   February 2, 2001  February 18, 2003  
 504              Banterra Bank of Marion  December 14, 2000     March 17, 2005  
 505                   Bank of the Orient   October 13, 2000     March 17, 2005  
 
 [506 rows x 7 columns]]

You can even pass in an instance of StringIO if you so desire

In [8]: with open(file_path, 'r') as f:
   ...:     sio = StringIO(f.read())
   ...: 

In [9]: dfs = pd.read_html(sio)

In [10]: dfs
Out[10]: 
[                                    Bank Name          City  ST   CERT  \
 0    Banks of Wisconsin d/b/a Bank of Kenosha       Kenosha  WI  35386   
 1                        Central Arizona Bank    Scottsdale  AZ  34527   
 2                                Sunrise Bank      Valdosta  GA  58185   
 3                       Pisgah Community Bank     Asheville  NC  58701   
 4                         Douglas County Bank  Douglasville  GA  21649   
 5                                Parkway Bank        Lenoir  NC  57158   
 6                      Chipola Community Bank      Marianna  FL  58034   
 ..                                        ...           ...  ..    ...   
 499               Hamilton Bank, NAEn Espanol         Miami  FL  24382   
 500                    Sinclair National Bank      Gravette  AR  34248   
 501                        Superior Bank, FSB      Hinsdale  IL  32646   
 502                       Malta National Bank         Malta  OH   6629   
 503           First Alliance Bank & Trust Co.    Manchester  NH  34264   
 504         National State Bank of Metropolis    Metropolis  IL   3815   
 505                          Bank of Honolulu      Honolulu  HI  21029   
 
                    Acquiring Institution       Closing Date       Updated Date  
 0                  North Shore Bank, FSB       May 31, 2013       May 31, 2013  
 1                     Western State Bank       May 14, 2013       May 20, 2013  
 2                           Synovus Bank       May 10, 2013       May 21, 2013  
 3                     Capital Bank, N.A.       May 10, 2013       May 14, 2013  
 4                    Hamilton State Bank     April 26, 2013       May 16, 2013  
 5       CertusBank, National Association     April 26, 2013       May 17, 2013  
 6          First Federal Bank of Florida     April 19, 2013       May 16, 2013  
 ..                                   ...                ...                ...  
 499     Israel Discount Bank of New York   January 11, 2002       June 5, 2012  
 500                   Delta Trust & Bank  September 7, 2001  February 10, 2004  
 501                Superior Federal, FSB      July 27, 2001       June 5, 2012  
 502                    North Valley Bank        May 3, 2001  November 18, 2002  
 503  Southern New Hampshire Bank & Trust   February 2, 2001  February 18, 2003  
 504              Banterra Bank of Marion  December 14, 2000     March 17, 2005  
 505                   Bank of the Orient   October 13, 2000     March 17, 2005  
 
 [506 rows x 7 columns]]

Note

The following examples are not run by the IPython evaluator due to the fact that having so many network-accessing functions slows down the documentation build. If you spot an error or an example that doesn’t run, please do not hesitate to report it over on pandas GitHub issues page.

Read a URL and match a table that contains specific text

match = 'Metcalf Bank'
df_list = pd.read_html(url, match=match)

Specify a header row (by default <th> elements are used to form the column index); if specified, the header row is taken from the data minus the parsed header elements (<th> elements).

dfs = pd.read_html(url, header=0)

Specify an index column

dfs = pd.read_html(url, index_col=0)

Specify a number of rows to skip

dfs = pd.read_html(url, skiprows=0)

Specify a number of rows to skip using a list (xrange (Python 2 only) works as well)

dfs = pd.read_html(url, skiprows=range(2))

Specify an HTML attribute

dfs1 = pd.read_html(url, attrs={'id': 'table'})
dfs2 = pd.read_html(url, attrs={'class': 'sortable'})
print(np.array_equal(dfs1[0], dfs2[0]))  # Should be True

Specify values that should be converted to NaN

dfs = pd.read_html(url, na_values=['No Acquirer'])

New in version 0.19.

Specify whether to keep the default set of NaN values

dfs = pd.read_html(url, keep_default_na=False)

New in version 0.19.

Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.

url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC':
str})

New in version 0.19.

Use some combination of the above

dfs = pd.read_html(url, match='Metcalf Bank', index_col=0)

Read in pandas to_html output (with some loss of floating point precision)

df = pd.DataFrame(randn(2, 2))
s = df.to_html(float_format='{0:.40g}'.format)
dfin = pd.read_html(s, index_col=0)

The lxml backend will raise an error on a failed parse if that is the only parser you provide (if you only have a single parser you can provide just a string, but it is considered good practice to pass a list with one string if, for example, the function expects a sequence of strings)

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml'])

or

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor='lxml')

However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'] then the parse will most likely succeed. Note that as soon as a parse succeeds, the function will return.

dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml', 'bs4'])

10.4.2 Writing to HTML files

DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an HTML table. The function arguments are as in the method to_string described above.

Note

Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See to_html() for the full set of options.

In [11]: df = pd.DataFrame(randn(2, 2))

In [12]: df
Out[12]: 
        0       1
0  0.4691 -0.2829
1 -1.5091 -1.1356

In [13]: print(df.to_html())  # raw html
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.4691</td>
      <td>-0.2829</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-1.5091</td>
      <td>-1.1356</td>
    </tr>
  </tbody>
</table>

HTML:

The columns argument will limit the columns shown

In [14]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.4691</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-1.5091</td>
    </tr>
  </tbody>
</table>

HTML:

float_format takes a Python callable to control the precision of floating point values

In [15]: print(df.to_html(float_format='{0:.10f}'.format))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.4691122999</td>
      <td>-0.2828633443</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-1.5090585032</td>
      <td>-1.1356323710</td>
    </tr>
  </tbody>
</table>

HTML:

bold_rows will make the row labels bold by default, but you can turn that off

In [16]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>0.4691</td>
      <td>-0.2829</td>
    </tr>
    <tr>
      <td>1</td>
      <td>-1.5091</td>
      <td>-1.1356</td>
    </tr>
  </tbody>
</table>

The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes are appended to the existing 'dataframe' class.

In [17]: print(df.to_html(classes=['awesome_table_class', 'even_more_awesome_class']))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0.4691</td>
      <td>-0.2829</td>
    </tr>
    <tr>
      <th>1</th>
      <td>-1.5091</td>
      <td>-1.1356</td>
    </tr>
  </tbody>
</table>

Finally, the escape argument allows you to control whether the “<”, “>” and “&” characters escaped in the resulting HTML (by default it is True). So to get the HTML without escaped characters pass escape=False

In [18]: df = pd.DataFrame({'a': list('&<>'), 'b': randn(3)})

Escaped:

In [19]: print(df.to_html())
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>&amp;</td>
      <td>1.2121</td>
    </tr>
    <tr>
      <th>1</th>
      <td>&lt;</td>
      <td>-0.1732</td>
    </tr>
    <tr>
      <th>2</th>
      <td>&gt;</td>
      <td>0.1192</td>
    </tr>
  </tbody>
</table>

Not escaped:

In [20]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>a</th>
      <th>b</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>&</td>
      <td>1.2121</td>
    </tr>
    <tr>
      <th>1</th>
      <td><</td>
      <td>-0.1732</td>
    </tr>
    <tr>
      <th>2</th>
      <td>></td>
      <td>0.1192</td>
    </tr>
  </tbody>
</table>

Note

Some browsers may not show a difference in the rendering of the previous two HTML tables.