12. Troubleshooting

12.1. diagnose()

If you’re having trouble understanding what Beautiful Soup does to a document, pass the document into the diagnose() function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you’re missing a parser that Beautiful Soup could be using:

from bs4.diagnose import diagnose

# Read the problem document, closing the file when done.
with open("bad.html") as fp:
    data = fp.read()
diagnose(data)

# Diagnostic running on Beautiful Soup 4.2.0
# Python version 2.7.3 (default, Aug  1 2012, 05:16:07)
# I noticed that html5lib is not installed. Installing it may help.
# Found lxml version 2.3.2.0
#
# Trying to parse your data with html.parser
# Here's what html.parser did with the document:
# ...

Just looking at the output of diagnose() may show you how to solve the problem. Even if not, you can paste the output of diagnose() when asking for help.

12.2. Errors when parsing a document

There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an HTMLParser.HTMLParseError. And there is unexpected behavior, where a Beautiful Soup parse tree looks very different from the document used to create it.

Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It’s because Beautiful Soup doesn’t include any parsing code. Instead, it relies on external parsers. If one parser isn’t working on a certain document, the best solution is to try a different parser. See Installing a parser for details and a parser comparison.

The most common parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag. These are both generated by Python’s built-in HTML parser library, and the solution is to install lxml or html5lib.

The most common type of unexpected behavior is that you can’t find a tag that you know is in the document. You saw it going in, but find_all() returns [] or find() returns None. This is another common problem with Python’s built-in HTML parser, which sometimes skips tags it doesn’t understand. Again, the solution is to install lxml or html5lib.
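To see whether a different parser would help, you can run the same markup through every parser you have installed and compare the results. The following is a minimal sketch, not part of Beautiful Soup itself: the helper name parse_with_each and the sample markup are made up for illustration, and a parser that isn't installed is reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed markup, used only for illustration.
markup = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree, or None if the parser is missing}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser
            # library is not installed.
            results[name] = None
    return results


results = parse_with_each(markup)
for name, tree in results.items():
    print(name, "->", tree if tree is not None else "(not installed)")
```

If the trees differ and one of them contains the tag you were looking for, name that parser explicitly in the BeautifulSoup constructor.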

12.3. Version mismatch problems

  • SyntaxError: invalid syntax (on the line ROOT_TAG_NAME = u'[document]') - Caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.
  • ImportError: No module named HTMLParser - Caused by running the Python 2 version of Beautiful Soup under Python 3.
  • ImportError: No module named html.parser - Caused by running the Python 3 version of Beautiful Soup under Python 2.
  • ImportError: No module named BeautifulSoup - Caused by running Beautiful Soup 3 code on a system that doesn’t have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.
  • ImportError: No module named bs4 - Caused by running Beautiful Soup 4 code on a system that doesn’t have BS4 installed.
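When you hit one of these import errors, it can help to check which Beautiful Soup packages the current interpreter can actually see. Here is a rough sketch using only the standard library; the function name find_beautiful_soup is hypothetical, and it only looks the packages up without importing them, so it won't trip over a version that doesn't match your Python.

```python
import importlib.util
import sys


def find_beautiful_soup():
    """Report which Beautiful Soup packages this interpreter can locate."""
    return {
        # The interpreter that is running this script.
        "python": "%d.%d" % sys.version_info[:2],
        # Beautiful Soup 4 is imported as "bs4".
        "bs4": importlib.util.find_spec("bs4") is not None,
        # Beautiful Soup 3 was imported as "BeautifulSoup".
        "BeautifulSoup": importlib.util.find_spec("BeautifulSoup") is not None,
    }


report = find_beautiful_soup()
for name, value in report.items():
    print(name, "->", value)
```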

12.4. Parsing XML

By default, Beautiful Soup parses documents as HTML. To parse a document as XML, pass in “xml” as the second argument to the BeautifulSoup constructor:

soup = BeautifulSoup(markup, "xml")

You’ll need to have lxml installed.

12.5. Other parser problems

  • If your script works on one computer but not another, it’s probably because the two computers have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See Differences between parsers for why this matters, and fix the problem by mentioning a specific parser library in the BeautifulSoup constructor.
  • Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. That is, the markup <TAG></TAG> is converted to <tag></tag>. If you want to preserve mixed-case or uppercase tags and attributes, you’ll need to parse the document as XML.
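Both points can be seen in a short sketch. It names the parser explicitly in each constructor call, so the result doesn't depend on which libraries happen to be installed, and it compares how an HTML parser and the XML parser treat uppercase names. The sample markup is made up for illustration, and the XML half assumes lxml is available; if it isn't, xml_out is left as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup with uppercase tag and attribute names.
markup = "<TAG Attr='1'></TAG>"

# Naming the parser makes the script behave the same way on every machine
# (html.parser ships with Python). HTML parsers lowercase the names.
html_out = str(BeautifulSoup(markup, "html.parser"))
print(html_out)

# The XML parser preserves case, but it needs lxml.
try:
    xml_out = str(BeautifulSoup(markup, "xml"))
except FeatureNotFound:
    xml_out = None  # lxml is not installed on this machine
print(xml_out)
```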

12.6. Miscellaneous

  • UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError) - This is not a problem with Beautiful Soup. This problem shows up in two main situations. First, when you try to print a Unicode character that your console doesn’t know how to display. (See this page on the Python wiki for help.) Second, when you’re writing to a file and you pass in a Unicode character that’s not supported by your default encoding. In this case, the simplest solution is to explicitly encode the Unicode string into UTF-8 with u.encode("utf8").
  • KeyError: [attr] - Caused by accessing tag['attr'] when the tag in question doesn’t define the attr attribute. The most common errors are KeyError: 'href' and KeyError: 'class'. Use tag.get('attr') if you’re not sure attr is defined, just as you would with a Python dictionary.
  • AttributeError: 'ResultSet' object has no attribute 'foo' - This usually happens because you expected find_all() to return a single tag or string. But find_all() returns a list of tags and strings: a ResultSet object. You need to iterate over the list and look at the .foo of each one. Or, if you really only want one result, you need to use find() instead of find_all().
  • AttributeError: 'NoneType' object has no attribute 'foo' - This usually happens because you called find() and then tried to access the .foo attribute of the result. But in your case, find() didn’t find anything, so it returned None, instead of returning a tag or a string. You need to figure out why your find() call isn’t returning anything.
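The last three errors all come from assuming a search found more, or something different, than it did. The sketch below shows one defensive way to write each access; the sample markup and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one link with an href, one without.
html = '<p><a href="/one">one</a> <a>no href</a></p>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list), so loop over it.
# tag.get() returns None instead of raising KeyError when the
# attribute is missing.
links = soup.find_all("a")
hrefs = [a.get("href") for a in links]

# find() returns a single tag, so attribute access works directly.
first = soup.find("a")
first_href = first["href"]

# find() returns None when nothing matches, so check before using it.
missing = soup.find("table")
caption = missing.get_text() if missing is not None else ""

print(hrefs, first_href, missing, repr(caption))
```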

12.7. Improving Performance

Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, if you’re paying for computer time by the hour, or if there’s any other reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml.

That said, there are things you can do to speed up Beautiful Soup. If you’re not using lxml as the underlying parser, my advice is to start. Beautiful Soup parses documents significantly faster using lxml than using html.parser or html5lib.

You can speed up encoding detection significantly by installing the cchardet library.

Parsing only part of a document won’t save you much time parsing the document, but it can save a lot of memory, and it’ll make searching the document much faster.
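As a minimal sketch of parsing only part of a document, you can pass a SoupStrainer as the parse_only argument so that only matching tags end up in the tree. The sample markup here is made up, and this assumes html.parser or lxml as the parser; as far as I know, html5lib ignores parse_only.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical document; only the <a> tags are of interest.
html = (
    '<html><body><p>intro</p><a href="/a">A</a>'
    '<div><a href="/b">B</a></div></body></html>'
)

# Keep only <a> tags while parsing; everything else is discarded.
only_links = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)

# Tags outside the strainer never make it into the tree.
print(soup.find("p"))
```

The parser still has to read the whole document, which is why this saves memory and search time rather than parse time.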