12. Troubleshooting
12.1. diagnose()
If you’re having trouble understanding what Beautiful Soup does to a document, pass the document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you’re missing a parser that Beautiful Soup could be using:
from bs4.diagnose import diagnose
data = open("bad.html").read()
diagnose(data)
# Diagnostic running on Beautiful Soup 4.2.0
# Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
# I noticed that html5lib is not installed. Installing it may help.
# Found lxml version 2.3.2.0
#
# Trying to parse your data with html.parser
# Here's what html.parser did with the document:
# ...
Just looking at the output of diagnose() may show you how to solve the problem. Even if not, you can paste the output of diagnose() when asking for help.
12.2. Errors when parsing a document
There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an HTMLParser.HTMLParseError. And there is unexpected behavior, where a Beautiful Soup parse tree looks very different from the document used to create it.
Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It’s because Beautiful Soup doesn’t include any parsing code. Instead, it relies on external parsers. If one parser isn’t working on a certain document, the best solution is to try a different parser. See Installing a parser for details and a parser comparison.
The most common parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag. These are both generated by Python’s built-in HTML parser library, and the solution is to install lxml or html5lib.
The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.
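Once an alternative parser is installed, you can name it explicitly in the BeautifulSoup constructor. A minimal sketch, using a made-up scrap of bad markup:

from bs4 import BeautifulSoup

# A made-up scrap of badly-formed markup, for illustration only.
markup = "<p>Some <b>bad<p>markup"
soup = BeautifulSoup(markup, "lxml")        # requires lxml
# soup = BeautifulSoup(markup, "html5lib")  # or html5lib, if that's what you have
print(soup.find_all("p"))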
12.3. Version mismatch problems
- SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = u'[document]'): Caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.
- ImportError: No module named HTMLParser - Caused by running the Python 2 version of Beautiful Soup under Python 3.
- ImportError: No module named html.parser - Caused by running the Python 3 version of Beautiful Soup under Python 2.
- ImportError: No module named BeautifulSoup - Caused by running Beautiful Soup 3 code on a system that doesn’t have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.
- ImportError: No module named bs4 - Caused by running Beautiful Soup 4 code on a system that doesn’t have BS4 installed.
12.4. Parsing XML
By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in “xml” as the second argument to the BeautifulSoup constructor:
soup = BeautifulSoup(markup, "xml")
You’ll need to have lxml installed.
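For example, here’s a self-contained version of that call (the markup string is a made-up stand-in for your document):

from bs4 import BeautifulSoup

markup = "<root><Child>text</Child></root>"  # hypothetical XML document
soup = BeautifulSoup(markup, "xml")          # uses lxml's XML parser
print(soup.find("Child").string)             # prints: text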
12.5. Other parser problems
- If your script works on one computer but not another, it’s probably because the two computers have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See Differences between parsers for why this matters, and fix the problem by mentioning a specific parser library in the BeautifulSoup constructor.
- Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you’ll need to parse the document as XML, as in the comparison below.
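To see the difference, here’s a small comparison on made-up markup:

from bs4 import BeautifulSoup

markup = '<TAG ATTR="value">text</TAG>'
print(BeautifulSoup(markup, "html.parser"))  # names lowercased: <tag attr="value">text</tag>
print(BeautifulSoup(markup, "xml"))          # case preserved; an XML declaration is added (requires lxml)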
12.6. Miscellaneous
- UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn’t know how to display. (See this page on the Python wiki for help.) Second, when you’re writing to a file and you pass in a Unicode character that’s not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
- KeyError: [attr] - Caused by accessing tag['attr'] when the tag in question doesn’t define the attr attribute. The most common errors are KeyError: 'href' and KeyError: 'class'. Use tag.get('attr') if you’re not sure attr is defined, just as you would with a Python dictionary.
- AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings: a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
- AttributeError: 'NoneType' object has no attribute 'foo' - This usually happens because you called find() and then tried to access the .foo attribute of the result. But in your case, find() didn’t find anything, so it returned None instead of returning a tag or a string. You need to figure out why your find() call isn’t returning anything.
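Here’s a short sketch pulling the last three items together (the markup is invented for the example):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/">link</a>', "html.parser")

tag = soup.find("a")
print(tag["href"])       # fine: this tag defines href
print(tag.get("class"))  # None, rather than KeyError: 'class'

for a in soup.find_all("a"):  # find_all() returns a ResultSet; iterate over it
    print(a["href"])

table = soup.find("table")  # no <table> in the markup, so find() returns None
print(table)                # None; test for this before asking for .foo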
12.7. Improving Performance
Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you’re paying for computer time by the hour, or if there’s any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.
That said, there are things you can do to speed up Beautiful Soup. If you’re not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.
You can speed up encoding detection significantly by installing the cchardet library.
Parsing only part of a document won’t save you much time parsing the document, but it can save a lot of memory, and it’ll make searching the document much faster.
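If memory is the concern, the SoupStrainer class tells Beautiful Soup to build tree nodes for only the parts of the document you care about. A minimal sketch (the markup is a stand-in; note that parse_only has no effect with html5lib):

from bs4 import BeautifulSoup, SoupStrainer

html_doc = '<p>intro</p><a href="/one">one</a><a href="/two">two</a>'  # stand-in markup
only_a_tags = SoupStrainer("a")  # build tree nodes only for <a> tags
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)
print(soup.find_all("a"))  # just the links; the <p> was never added to the tree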