13. Beautiful Soup 3
Beautiful Soup 3 is the previous release series, and is no longer being actively developed. It’s currently packaged with all major Linux distributions:
$ apt-get install python-beautifulsoup
It’s also published through PyPI as BeautifulSoup:
$ easy_install BeautifulSoup
$ pip install BeautifulSoup
You can also download a tarball of Beautiful Soup 3.2.0.
If you ran easy_install beautifulsoup or easy_install BeautifulSoup, but your code doesn’t work, you installed Beautiful Soup 3 by mistake. You need to run easy_install beautifulsoup4.
The documentation for Beautiful Soup 3 is archived online. If your first language is Chinese, it might be easier for you to read the Chinese translation of the Beautiful Soup 3 documentation, then read this document to find out about the changes made in Beautiful Soup 4.
13.1. Porting code to BS4
Most code written against Beautiful Soup 3 will work against Beautiful Soup 4 with one simple change. All you should have to do is change the package name from BeautifulSoup to bs4. So this:
from BeautifulSoup import BeautifulSoup
becomes this:
from bs4 import BeautifulSoup
- If you get the ImportError “No module named BeautifulSoup”, your problem is that you’re trying to run Beautiful Soup 3 code, but you only have Beautiful Soup 4 installed.
- If you get the ImportError “No module named bs4”, your problem is that you’re trying to run Beautiful Soup 4 code, but you only have Beautiful Soup 3 installed.
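To check which of the two packages is actually importable before chasing an ImportError, something like the following may help. It is a minimal sketch using only the standard library; the helper name installed_soup_packages is made up for illustration.

```python
import importlib.util


def installed_soup_packages():
    """Report which Beautiful Soup top-level modules can be imported.

    "bs4" is the Beautiful Soup 4 module; "BeautifulSoup" is the
    Beautiful Soup 3 module.
    """
    return {
        name: importlib.util.find_spec(name) is not None
        for name in ("bs4", "BeautifulSoup")
    }


status = installed_soup_packages()
for name, present in status.items():
    print(f"{name}: {'installed' if present else 'missing'}")
```

If only BeautifulSoup is reported as installed, code that imports bs4 will fail, and vice versa.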
Although BS4 is mostly backwards-compatible with BS3, most of BS3’s methods have been deprecated and given new names for PEP 8 compliance. There are numerous other renames and changes, and a few of them break backwards compatibility.
Here’s what you’ll need to know to convert your BS3 code and habits to BS4:
13.1.1. You need a parser
Beautiful Soup 3 used Python’s SGMLParser, a module that was deprecated and removed in Python 3.0. Beautiful Soup 4 uses html.parser by default, but you can plug in lxml or html5lib and use that instead. See Installing a parser for a comparison.
Since html.parser is not the same parser as SGMLParser, it will treat invalid markup differently. Usually the “difference” is that html.parser crashes. In that case, you’ll need to install another parser. But sometimes html.parser just creates a different parse tree than SGMLParser would. If this happens, you may need to update your BS3 scraping code to deal with the new tree.
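One way to see how the choice of parser affects the tree is to run the same invalid markup through each parser you have installed. The sketch below assumes Beautiful Soup 4 is installed; lxml and html5lib are optional, and the helper name parse_with_each is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with several parsers and collect the output.

    A parser that is not installed maps to None instead of raising.
    """
    trees = {}
    for name in parsers:
        try:
            trees[name] = BeautifulSoup(markup, name).decode()
        except FeatureNotFound:
            trees[name] = None  # this parser library is not installed
    return trees


# Invalid markup: neither tag is closed.
trees = parse_with_each("<p>Unclosed paragraph<b>bold")
for name, tree in trees.items():
    print(name, "->", tree)
```

Naming the parser explicitly in the constructor keeps the resulting tree from depending on which parser libraries happen to be installed.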
13.1.2. Method names
- renderContents -> encode_contents
- replaceWith -> replace_with
- replaceWithChildren -> unwrap
- findAll -> find_all
- findAllNext -> find_all_next
- findAllPrevious -> find_all_previous
- findNext -> find_next
- findNextSibling -> find_next_sibling
- findNextSiblings -> find_next_siblings
- findParent -> find_parent
- findParents -> find_parents
- findPrevious -> find_previous
- findPreviousSibling -> find_previous_sibling
- findPreviousSiblings -> find_previous_siblings
- nextSibling -> next_sibling
- previousSibling -> previous_sibling
Some arguments to the Beautiful Soup constructor were renamed for the same reasons:
- BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
- BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)
I renamed one method for compatibility with Python 3:
- Tag.has_key() -> Tag.has_attr()
I renamed one attribute to use more accurate terminology:
- Tag.isSelfClosing -> Tag.is_empty_element
I renamed three attributes to avoid using words that have special meaning to Python. Unlike the others, these changes are not backwards compatible. If you used these attributes in BS3, your code will break on BS4 until you change them.
- UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
- Tag.next -> Tag.next_element
- Tag.previous -> Tag.previous_element
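As an illustration of the renames in this section, here is a short sketch that uses the BS4 names, with the BS3 spelling noted in a comment beside each call. The markup is made up for the example, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="a">one</p><p>two</p>', "html.parser")

paragraphs = soup.find_all("p")        # BS3: soup.findAll("p")
first = paragraphs[0]
second = first.next_sibling            # BS3: first.nextSibling
has_id = first.has_attr("id")          # BS3: first.has_key("id")
inner = first.encode_contents()        # BS3: first.renderContents()
after_tag = first.next_element         # BS3: first.next (this one breaks on BS4)

print(len(paragraphs), second, has_id, inner, after_tag)
```

Only the last line corresponds to a backwards-incompatible change; the other BS3 names are the deprecated spellings described above.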
13.1.3. Generators
I gave the generators PEP 8-compliant names, and transformed them into properties:
- childGenerator() -> children
- nextGenerator() -> next_elements
- nextSiblingGenerator() -> next_siblings
- previousGenerator() -> previous_elements
- previousSiblingGenerator() -> previous_siblings
- recursiveChildGenerator() -> descendants
- parentGenerator() -> parents
So instead of this:
for parent in tag.parentGenerator():
    ...
You can write this:
for parent in tag.parents:
    ...
(But the old code will still work.)
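A small runnable version of the loop above, as a sketch: the markup is invented for the example and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><b>text</b></p></body></html>", "html.parser"
)

# .parents is a property, so there are no parentheses after it.
parent_names = [parent.name for parent in soup.b.parents]
print(parent_names)
```

The last parent yielded is the BeautifulSoup object itself, whose name is "[document]".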
Some of the generators used to yield None after they were done, and then stop. That was a bug. Now the generators just stop.
There are two new generators, .strings and .stripped_strings. .strings yields NavigableString objects, and .stripped_strings yields Python strings that have had whitespace stripped.
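The difference between the two new generators can be seen on a fragment with stray whitespace. This is a sketch with invented markup, assuming html.parser.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(
    "<div>  <p> first </p>\n<p>second</p>  </div>", "html.parser"
)

# .strings yields every NavigableString, including whitespace-only ones.
raw = list(soup.div.strings)

# .stripped_strings yields plain Python strings and skips strings that
# are empty once stripped.
clean = list(soup.div.stripped_strings)

print(raw)
print(clean)
```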
13.1.4. XML
There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.
Beautiful Soup’s handling of empty-element XML tags has been improved. Previously when you parsed XML you had to explicitly say which tags were considered empty-element tags. The selfClosingTags argument to the constructor is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.
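The empty-element behaviour can be sketched as follows. Note the assumption: the “xml” parser is provided by lxml, so the helper returns None when lxml is not installed. The helper name demo_empty_element and the sample document are hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def demo_empty_element(xml="<doc><item/></doc>"):
    """Show that an empty XML tag stops being empty once it gets a child.

    Returns None if no XML parser (lxml) is available.
    """
    try:
        soup = BeautifulSoup(xml, "xml")
    except FeatureNotFound:
        return None  # lxml is not installed
    item = soup.item
    before = item.is_empty_element   # empty tag, so treated as empty-element
    item.append("text")              # give the tag a child
    after = item.is_empty_element
    return before, after, str(item)


result = demo_empty_element()
print(result)
```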
13.1.5. Entities
An incoming HTML or XML entity is always converted into the corresponding Unicode character. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed. The BeautifulSoup constructor no longer recognizes the smartQuotesTo or convertEntities arguments. (Unicode, Dammit still has smart_quotes_to, but its default is now to turn smart quotes into Unicode.) The constants HTML_ENTITIES, XML_ENTITIES, and XHTML_ENTITIES have been removed, since they configure a feature (transforming some but not all entities into Unicode characters) that no longer exists.
If you want to turn Unicode characters back into HTML entities on output, rather than turning them into UTF-8 characters, you need to use an output formatter.
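A sketch of both directions, assuming html.parser and the built-in "html" output formatter; the sample markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf&eacute; &amp; cr&egrave;me</p>", "html.parser")

# On input, entities become the corresponding Unicode characters.
text = soup.p.string

# The default ("minimal") formatter only escapes what it must.
default_output = soup.p.decode()

# The "html" formatter turns Unicode characters back into named entities
# where it can.
entity_output = soup.p.decode(formatter="html")

print(text)
print(default_output)
print(entity_output)
```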
13.1.6. Miscellaneous
Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string. (Previously, it was None.)
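For instance, in this sketch (invented markup, html.parser assumed), the outer tag’s .string is found by descending through its only child, while a tag with two children has no single string:

```python
from bs4 import BeautifulSoup

nested = BeautifulSoup("<a><b>text</b></a>", "html.parser")
outer_string = nested.a.string   # found by descending through <b>
inner_string = nested.b.string

# With more than one child there is no single string to return.
mixed = BeautifulSoup("<a><b>x</b><i>y</i></a>", "html.parser")
ambiguous = mixed.a.string

print(outer_string, inner_string, ambiguous)
```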
Multi-valued attributes like class have lists of strings as their values, not strings. This may affect the way you search by CSS class.
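A short sketch of what this looks like in practice, with invented markup and html.parser assumed. The class_ keyword argument is the spelling used here for searching by CSS class.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="x">text</p>', "html.parser")

classes = soup.p["class"]   # a list of strings, one per CSS class
tag_id = soup.p["id"]       # id is not multi-valued, so it stays a string

# Searching matches against any one of the classes.
matches = soup.find_all("p", class_="strikeout")

print(classes, tag_id, len(matches))
```

Code ported from BS3 that compares the class attribute against a single string such as "body strikeout" will need to compare against the list instead.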
If you pass one of the find* methods both text and a tag-specific argument like name, Beautiful Soup will search for tags that match your tag-specific criteria and whose Tag.string matches your value for text. It will not find the strings themselves. Previously, Beautiful Soup ignored the tag-specific arguments and looked for strings.
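The sketch below contrasts the two kinds of search. It assumes Beautiful Soup 4.4.0 or later, where this argument is spelled string (text is the older spelling of the same argument); the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>hello</b><i>hello</i>", "html.parser")

# Name plus string: returns the matching tags, not the strings.
tags = soup.find_all("b", string="hello")

# String alone: returns the strings themselves.
strings = soup.find_all(string="hello")

print(tags)
print(strings)
```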
The BeautifulSoup constructor no longer recognizes the markupMassage argument. It’s now the parser’s responsibility to handle markup correctly.
The rarely-used alternate parser classes like ICantBelieveItsBeautifulSoup and BeautifulSOAP have been removed. It’s now the parser’s decision how to handle ambiguous markup.
The prettify() method now returns a Unicode string, not a bytestring.
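A quick check of the return types, as a sketch with invented markup and html.parser assumed. Passing an encoding to prettify() is shown as one way to get a bytestring back if ported code still expects one.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>text</p>", "html.parser")

pretty = soup.prettify()          # a Unicode string (str)
encoded = soup.prettify("utf-8")  # bytes, because an encoding was given

print(type(pretty).__name__, type(encoded).__name__)
```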