10.4 HTML
10.4.1 Reading HTML Content
Warning: We highly encourage you to read the HTML parsing gotchas regarding the issues surrounding the BeautifulSoup4/html5lib/lxml parsers.
New in version 0.12.0.
The top-level read_html() function can accept an HTML string, file, or URL and will parse HTML tables into a list of pandas DataFrame objects. Let’s look at a few examples.
Note: read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content.
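The list-return behaviour can be demonstrated without a network connection on a tiny inline document (a sketch with made-up table contents; it assumes an HTML parser such as lxml is installed):

```python
import pandas as pd
from io import StringIO

# Two tables in one document -> read_html returns a list of two DataFrames
html = """
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table><tr><th>y</th></tr><tr><td>2</td></tr></table>
"""

dfs = pd.read_html(StringIO(html))
print(len(dfs))      # 2
print(list(dfs[0]))  # ['x']
```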
Read a URL with no options
In [1]: url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
In [2]: dfs = pd.read_html(url)
In [3]: dfs
Out[3]:
[ Bank Name City ST CERT \
0 Allied Bank Mulberry AR 91
1 The Woodbury Banking Company Woodbury GA 11297
2 First CornerStone Bank King of Prussia PA 35312
3 Trust Company Bank Memphis TN 9956
4 North Milwaukee State Bank Milwaukee WI 20364
5 Hometown National Bank Longview WA 35156
6 The Bank of Georgia Peachtree City GA 35259
.. ... ... .. ...
540 Hamilton Bank, NAEn Espanol Miami FL 24382
541 Sinclair National Bank Gravette AR 34248
542 Superior Bank, FSB Hinsdale IL 32646
543 Malta National Bank Malta OH 6629
544 First Alliance Bank & Trust Co. Manchester NH 34264
545 National State Bank of Metropolis Metropolis IL 3815
546 Bank of Honolulu Honolulu HI 21029
... Updated Date Loss Share Type Agreement Terminated \
0 ... September 29, 2016 none NaN
1 ... August 30, 2016 none NaN
2 ... September 6, 2016 none NaN
3 ... September 6, 2016 none NaN
4 ... June 16, 2016 none NaN
5 ... April 13, 2016 none NaN
6 ... April 13, 2016 none NaN
.. ... ... ... ...
540 ... September 21, 2015 none NaN
541 ... February 10, 2004 none NaN
542 ... August 19, 2014 none NaN
543 ... November 18, 2002 none NaN
544 ... February 18, 2003 none NaN
545 ... March 17, 2005 none NaN
546 ... March 17, 2005 none No
Termination Date
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
.. ...
540 NaN
541 NaN
542 NaN
543 NaN
544 NaN
545 NaN
546 NaN
[547 rows x 10 columns]]
Note: The data from the above URL changes every Monday, so the resulting data above and the data below may be slightly different.
Read in the content of the file from the above URL and pass it to read_html as a string
In [4]: import os
In [5]: file_path = os.path.abspath(os.path.join('source', '_static', 'banklist.html'))
In [6]: with open(file_path, 'r') as f:
   ...:     dfs = pd.read_html(f.read())
   ...:
In [7]: dfs
Out[7]:
[ Bank Name City ST CERT \
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386
1 Central Arizona Bank Scottsdale AZ 34527
2 Sunrise Bank Valdosta GA 58185
3 Pisgah Community Bank Asheville NC 58701
4 Douglas County Bank Douglasville GA 21649
5 Parkway Bank Lenoir NC 57158
6 Chipola Community Bank Marianna FL 58034
.. ... ... .. ...
499 Hamilton Bank, NAEn Espanol Miami FL 24382
500 Sinclair National Bank Gravette AR 34248
501 Superior Bank, FSB Hinsdale IL 32646
502 Malta National Bank Malta OH 6629
503 First Alliance Bank & Trust Co. Manchester NH 34264
504 National State Bank of Metropolis Metropolis IL 3815
505 Bank of Honolulu Honolulu HI 21029
Acquiring Institution Closing Date Updated Date
0 North Shore Bank, FSB May 31, 2013 May 31, 2013
1 Western State Bank May 14, 2013 May 20, 2013
2 Synovus Bank May 10, 2013 May 21, 2013
3 Capital Bank, N.A. May 10, 2013 May 14, 2013
4 Hamilton State Bank April 26, 2013 May 16, 2013
5 CertusBank, National Association April 26, 2013 May 17, 2013
6 First Federal Bank of Florida April 19, 2013 May 16, 2013
.. ... ... ...
499 Israel Discount Bank of New York January 11, 2002 June 5, 2012
500 Delta Trust & Bank September 7, 2001 February 10, 2004
501 Superior Federal, FSB July 27, 2001 June 5, 2012
502 North Valley Bank May 3, 2001 November 18, 2002
503 Southern New Hampshire Bank & Trust February 2, 2001 February 18, 2003
504 Banterra Bank of Marion December 14, 2000 March 17, 2005
505 Bank of the Orient October 13, 2000 March 17, 2005
[506 rows x 7 columns]]
You can even pass in an instance of StringIO (from the standard library’s io module) if you so desire
In [8]: with open(file_path, 'r') as f:
   ...:     sio = StringIO(f.read())
   ...:
In [9]: dfs = pd.read_html(sio)
In [10]: dfs
Out[10]:
[ Bank Name City ST CERT \
0 Banks of Wisconsin d/b/a Bank of Kenosha Kenosha WI 35386
1 Central Arizona Bank Scottsdale AZ 34527
2 Sunrise Bank Valdosta GA 58185
3 Pisgah Community Bank Asheville NC 58701
4 Douglas County Bank Douglasville GA 21649
5 Parkway Bank Lenoir NC 57158
6 Chipola Community Bank Marianna FL 58034
.. ... ... .. ...
499 Hamilton Bank, NAEn Espanol Miami FL 24382
500 Sinclair National Bank Gravette AR 34248
501 Superior Bank, FSB Hinsdale IL 32646
502 Malta National Bank Malta OH 6629
503 First Alliance Bank & Trust Co. Manchester NH 34264
504 National State Bank of Metropolis Metropolis IL 3815
505 Bank of Honolulu Honolulu HI 21029
Acquiring Institution Closing Date Updated Date
0 North Shore Bank, FSB May 31, 2013 May 31, 2013
1 Western State Bank May 14, 2013 May 20, 2013
2 Synovus Bank May 10, 2013 May 21, 2013
3 Capital Bank, N.A. May 10, 2013 May 14, 2013
4 Hamilton State Bank April 26, 2013 May 16, 2013
5 CertusBank, National Association April 26, 2013 May 17, 2013
6 First Federal Bank of Florida April 19, 2013 May 16, 2013
.. ... ... ...
499 Israel Discount Bank of New York January 11, 2002 June 5, 2012
500 Delta Trust & Bank September 7, 2001 February 10, 2004
501 Superior Federal, FSB July 27, 2001 June 5, 2012
502 North Valley Bank May 3, 2001 November 18, 2002
503 Southern New Hampshire Bank & Trust February 2, 2001 February 18, 2003
504 Banterra Bank of Marion December 14, 2000 March 17, 2005
505 Bank of the Orient October 13, 2000 March 17, 2005
[506 rows x 7 columns]]
Note: The following examples are not run by the IPython evaluator because having so many network-accessing functions slows down the documentation build. If you spot an error or an example that doesn’t run, please do not hesitate to report it on the pandas GitHub issues page.
Read a URL and match a table that contains specific text
match = 'Metcalf Bank'
df_list = pd.read_html(url, match=match)
Specify a header row (by default <th> elements are used to form the column index); if specified, the header row is taken from the data minus the parsed header elements (<th> elements).
dfs = pd.read_html(url, header=0)
Specify an index column
dfs = pd.read_html(url, index_col=0)
Specify a number of rows to skip
dfs = pd.read_html(url, skiprows=0)
Specify a number of rows to skip using a list (xrange, Python 2 only, works as well)
dfs = pd.read_html(url, skiprows=range(2))
Specify an HTML attribute
dfs1 = pd.read_html(url, attrs={'id': 'table'})
dfs2 = pd.read_html(url, attrs={'class': 'sortable'})
print(np.array_equal(dfs1[0], dfs2[0])) # Should be True
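The attrs filtering can also be tried offline on a small inline document (a sketch with made-up tables; it assumes an HTML parser such as lxml is installed):

```python
import pandas as pd
from io import StringIO

# Only the second table carries class="sortable", so attrs selects just it
html = """
<table id="first"><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table class="sortable"><tr><th>a</th></tr><tr><td>2</td></tr></table>
"""

sortable = pd.read_html(StringIO(html), attrs={'class': 'sortable'})
print(len(sortable))           # 1
print(sortable[0].iloc[0, 0])  # 2
```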
Specify values that should be converted to NaN
dfs = pd.read_html(url, na_values=['No Acquirer'])
New in version 0.19.
Specify whether to keep the default set of NaN values
dfs = pd.read_html(url, keep_default_na=False)
New in version 0.19.
Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to strings.
url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Telekom Albania', header=0, converters={'MNC': str})
New in version 0.19.
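The effect of converters can be seen offline with a made-up table whose codes carry leading zeros (a sketch; it assumes an HTML parser such as lxml is installed):

```python
import pandas as pd
from io import StringIO

# A made-up MNC-style column whose values have leading zeros
html = """
<table>
  <tr><th>MNC</th><th>Operator</th></tr>
  <tr><td>01</td><td>Foo</td></tr>
  <tr><td>02</td><td>Bar</td></tr>
</table>
"""

# Default parsing casts the column to integers, dropping the zeros
default = pd.read_html(StringIO(html), header=0)[0]

# A str converter keeps the column as text, preserving the zeros
kept = pd.read_html(StringIO(html), header=0, converters={'MNC': str})[0]

print(default['MNC'].tolist())  # [1, 2]
print(kept['MNC'].tolist())     # ['01', '02']
```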
Use some combination of the above
dfs = pd.read_html(url, match='Metcalf Bank', index_col=0)
Read in pandas to_html output (with some loss of floating point precision)
df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format='{0:.40g}'.format)
dfin = pd.read_html(s, index_col=0)
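The precision loss in such a round trip can be made explicit with a single value (a sketch; it assumes an HTML parser such as lxml is installed):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame([[1.0 / 3.0]])

# Render with only four significant digits, then parse the HTML back
html = df.to_html(float_format='{0:.4g}'.format)
round_tripped = pd.read_html(StringIO(html), index_col=0)[0]

# Only the digits that survived formatting come back
print(round_tripped.iloc[0, 0])  # 0.3333
```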
The lxml backend will raise an error on a failed parse if that is the only parser you provide. (If you only have a single parser you can provide just a string, but it is considered good practice to pass a list with one string if, for example, the function expects a sequence of strings.)
dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml'])
or
dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor='lxml')
However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'], then the parse will most likely succeed. Note that as soon as a parse succeeds, the function will return.
dfs = pd.read_html(url, 'Metcalf Bank', index_col=0, flavor=['lxml', 'bs4'])
10.4.2 Writing to HTML files
DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an HTML table. The function arguments are as in the method to_string described above.
Note: Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See to_html() for the full set of options.
In [11]: df = pd.DataFrame(np.random.randn(2, 2))
In [12]: df
Out[12]:
0 1
0 0.4691 -0.2829
1 -1.5091 -1.1356
In [13]: print(df.to_html()) # raw html
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.4691</td>
<td>-0.2829</td>
</tr>
<tr>
<th>1</th>
<td>-1.5091</td>
<td>-1.1356</td>
</tr>
</tbody>
</table>
The columns argument will limit the columns shown
In [14]: print(df.to_html(columns=[0]))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.4691</td>
</tr>
<tr>
<th>1</th>
<td>-1.5091</td>
</tr>
</tbody>
</table>
float_format takes a Python callable to control the precision of floating point values
In [15]: print(df.to_html(float_format='{0:.10f}'.format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.4691122999</td>
<td>-0.2828633443</td>
</tr>
<tr>
<th>1</th>
<td>-1.5090585032</td>
<td>-1.1356323710</td>
</tr>
</tbody>
</table>
bold_rows will make the row labels bold by default, but you can turn that off
In [16]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0.4691</td>
<td>-0.2829</td>
</tr>
<tr>
<td>1</td>
<td>-1.5091</td>
<td>-1.1356</td>
</tr>
</tbody>
</table>
The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes are appended to the existing 'dataframe' class.
In [17]: print(df.to_html(classes=['awesome_table_class', 'even_more_awesome_class']))
<table border="1" class="dataframe awesome_table_class even_more_awesome_class">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0.4691</td>
<td>-0.2829</td>
</tr>
<tr>
<th>1</th>
<td>-1.5091</td>
<td>-1.1356</td>
</tr>
</tbody>
</table>
Finally, the escape argument allows you to control whether the “<”, “>” and “&” characters are escaped in the resulting HTML (by default it is True). So to get the HTML without escaped characters, pass escape=False
In [18]: df = pd.DataFrame({'a': list('&<>'), 'b': np.random.randn(3)})
Escaped:
In [19]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>1.2121</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>-0.1732</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>0.1192</td>
</tr>
</tbody>
</table>
Not escaped:
In [20]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>1.2121</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>-0.1732</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>0.1192</td>
</tr>
</tbody>
</table>
Note: Some browsers may not show a difference in the rendering of the previous two HTML tables.
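The difference between the two tables can also be checked programmatically rather than visually; a minimal sketch using only to_html (no parser required):

```python
import pandas as pd

df = pd.DataFrame({'a': ['<b>bold</b>']})

escaped = df.to_html()          # default: escape=True
raw = df.to_html(escape=False)

# Escaped output entity-encodes the markup; unescaped output passes it through
print('&lt;b&gt;bold&lt;/b&gt;' in escaped)  # True
print('<b>bold</b>' in raw)                  # True
```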