13. Beautiful Soup 3

Beautiful Soup 3 is the previous release series, and is no longer being actively developed. It’s currently packaged with all major Linux distributions:

$ apt-get install python-beautifulsoup

It’s also published through PyPi as BeautifulSoup.:

$ easy_install BeautifulSoup

$ pip install BeautifulSoup

You can also download a tarball of Beautiful Soup 3.2.0.

If you ran easy_install beautifulsoup or easy_install BeautifulSoup, but your code doesn’t work, you installed Beautiful Soup 3 by mistake. You need to run easy_install beautifulsoup4.

The documentation for Beautiful Soup 3 is archived online. If your first language is Chinese, it might be easier for you to read the Chinese translation of the Beautiful Soup 3 documentation, then read this document to find out about the changes made in Beautiful Soup 4.

13.1. Porting code to BS4

Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one simple change. All you should have to do is change the package name from BeautifulSoup to bs4. So this:

from BeautifulSoup import BeautifulSoup

becomes this:

from bs4 import BeautifulSoup
  • If you get the ImportError “No module named BeautifulSoup”, your problem is that you’re trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed.
  • If you get the ImportError “No module named bs4”, your problem is that you’re trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.

Although BS4 is mostly backwards-compatible with BS3, most of its methods have been deprecated and given new names for PEP 8 compliance. There are numerous other renames and changes, and a few of them break backwards compatibility.

Here’s what you’ll need to know to convert your BS3 code and habits to BS4:

13.1.1. You need a parser

Beautiful Soup 3 used Python’s SGMLParser, a module that was deprecated and removed in Python 3.0. Beautiful Soup 4 uses html.parser by default, but you can plug in lxml or html5lib and use that instead. See Installing a parser for a comparison.

Since html.parser is not the same parser as SGMLParser, it will treat invalid markup differently. Usually the “difference” is that html.parser crashes. In that case, you’ll need to install another parser. But sometimes html.parser just creates a different parse tree than SGMLParser would. If this happens, you may need to update your BS3 scraping code to deal with the new tree.

13.1.2. Method names

  • renderContents -> encode_contents
  • replaceWith -> replace_with
  • replaceWithChildren -> unwrap
  • findAll -> find_all
  • findAllNext -> find_all_next
  • findAllPrevious -> find_all_previous
  • findNext -> find_next
  • findNextSibling -> find_next_sibling
  • findNextSiblings -> find_next_siblings
  • findParent -> find_parent
  • findParents -> find_parents
  • findPrevious -> find_previous
  • findPreviousSibling -> find_previous_sibling
  • findPreviousSiblings -> find_previous_siblings
  • nextSibling -> next_sibling
  • previousSibling -> previous_sibling

Some arguments to the Beautiful Soup constructor were renamed for the same reasons:

  • BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
  • BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)

I renamed one method for compatibility with Python 3:

  • Tag.has_key() -> Tag.has_attr()

I renamed one attribute to use more accurate terminology:

  • Tag.isSelfClosing -> Tag.is_empty_element

I renamed three attributes to avoid using words that have special meaning to Python. Unlike the others, these changes are not backwards compatible. If you used these attributes in BS3, your code will break on BS4 until you change them.

  • UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
  • Tag.next -> Tag.next_element
  • Tag.previous -> Tag.previous_element

13.1.3. Generators

I gave the generators PEP 8-compliant names, and transformed them into properties:

  • childGenerator() -> children
  • nextGenerator() -> next_elements
  • nextSiblingGenerator() -> next_siblings
  • previousGenerator() -> previous_elements
  • previousSiblingGenerator() -> previous_siblings
  • recursiveChildGenerator() -> descendants
  • parentGenerator() -> parents

So instead of this:

for parent in tag.parentGenerator():
    ...

You can write this:

for parent in tag.parents:
    ...

(But the old code will still work.)

Some of the generators used to yield None after they were done, and then stop. That was a bug. Now the generators just stop.

There are two new generators, .strings and .stripped_strings. .strings yields NavigableString objects, and .stripped_strings yields Python strings that have had whitespace stripped.

13.1.4. XML

There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.

Beautiful Soup’s handling of empty-element XML tags has been improved. Previously when you parsed XML you had to explicitly say which tags were considered empty-element tags. The selfClosingTags argument to the constructor is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.

13.1.5. Entities

An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed. The BeautifulSoup constructor no longer recognizes the smartQuotesTo or convertEntities arguments. (Unicode, Dammit still has smart_quotes_to, but its default is now to turn smart quotes into Unicode.) The constants HTML_ENTITIES, XML_ENTITIES, and XHTML_ENTITIES have been removed, since they configure a feature (transforming some but not all entities into Unicode characters) that no longer exists.

If you want to turn Unicode characters back into HTML entities on output, rather than turning them into UTF-8 characters, you need to use an output formatter.

13.1.6. Miscellaneous

Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. (Previously, it was None.)

Multi-valued attributes like class have lists of strings as their values, not strings. This may affect the way you search by CSS class.

If you pass one of the find* methods both text and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for text. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.

The BeautifulSoup constructor no longer recognizes the markupMassage argument. It’s now the parser’s responsibility to handle markup correctly.

The rarely-used alternate parser classes like ICantBelieveItsBeautifulSoup and BeautifulSOAP have been removed. It’s now the parser’s decision how to handle ambiguous markup.

The prettify() method now returns a Unicode string, not a bytestring.