4. Kinds of objects¶

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects.

4.1. `Tag`¶

A Tag object corresponds to an XML or HTML tag in the original document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

4.1.1. Name¶

Every tag has a name, accessible as .name:

tag.name
# u'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

4.1.2. Attributes¶

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

tag['class']
# u'boldest'

You can access that dictionary directly as .attrs:

tag.attrs
# {u'class': u'boldest'}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

4.1.2.1. Multi-valued attributes¶

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'

4.2. `NavigableString`¶

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with unicode():

unicode_string = unicode(tag.string)
unicode_string
# u'Extremely bold'
type(unicode_string)
# <type 'unicode'>

You can’t edit a string in place, but you can replace one string with another, using replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call unicode() on it to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.

4.3. `BeautifulSoup`¶

The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

soup.name
# u'[document]'

4.4. Comments and other special strings¶

Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

The Comment object is just a special type of NavigableString:

comment
# u'Hey, buddy. Want to buy a used parser'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>

Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>