4. Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. In practice, you only have to deal with four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

4.1. Tag

A Tag object corresponds to an XML or HTML tag in the original document:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

4.1.1. Name

Every tag has a name, accessible as .name:

tag.name
# 'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

4.1.2. Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

tag['class']
# ['boldest']  (a list, because class is a multi-valued attribute; see below)

You can access that dictionary directly as .attrs:

tag.attrs
# {'class': ['boldest']}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
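
Because attributes live in a dictionary, code that reads them usually has to allow for an attribute being absent. The following is a minimal sketch of three ways to do that; the markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
soup = BeautifulSoup('<a href="/home" id="nav-home">Home</a>', "html.parser")
link = soup.a

# has_attr() tests for an attribute without raising KeyError.
has_title = link.has_attr("title")

# get() mirrors dict.get(): it returns a fallback instead of raising.
title = link.get("title", "(no title)")

# .attrs is an ordinary dict, so the usual dict idioms apply.
attribute_names = sorted(link.attrs)

print(has_title, title, attribute_names)
```

Plain indexing (link["title"]) is still the right choice when a missing attribute should be treated as an error.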

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']

css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']
# ['body']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'
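
Since some attributes come back as lists and others as plain strings, calling code may prefer a uniform shape. Here is a small sketch using Tag.get_attribute_list(), which is available in recent Beautiful Soup 4 releases (the exact minimum version is not checked here); the markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "id" is single-valued, "class" is multi-valued.
soup = BeautifulSoup('<p id="my id" class="body"></p>', "html.parser")
p = soup.p

# get_attribute_list() always returns a list, wrapping a plain string if needed.
id_values = p.get_attribute_list("id")
class_values = p.get_attribute_list("class")

print(id_values, class_values)
```

Note that the single-valued id is wrapped as one list item and is not split on whitespace.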

When you turn a tag back into a string, multiple attribute values are consolidated:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
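
The same consolidation applies when the list is edited in place rather than reassigned. A short sketch, with an invented class name ("highlight"), of adding and removing CSS classes and then serializing the tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
p = soup.p

# The value is a Python list, so normal list operations work on it.
p["class"].append("highlight")   # "highlight" is a made-up class name
p["class"].remove("strikeout")

# Serializing joins the list items back into one space-separated value.
markup = str(p)
print(markup)
```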

If you parse a document as XML, there are no multi-valued attributes:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'
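
If you want the same single-string behavior while still parsing as HTML, newer Beautiful Soup releases accept a multi_valued_attributes argument. The sketch below assumes a version that supports this argument (4.8.0 or later, as far as I know) and uses the standard-library html.parser backend.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout"></p>'

# Default HTML behavior: class is split into a list.
default_soup = BeautifulSoup(markup, "html.parser")
as_list = default_soup.p["class"]

# Assumption: this bs4 version supports multi_valued_attributes.
# Passing None turns the splitting off, so the value stays one string.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
as_string = plain_soup.p["class"]

print(as_list, as_string)
```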

4.3. BeautifulSoup

The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

soup.name
# '[document]'
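
To make the "treat it as a Tag" point concrete, here is a brief sketch on an invented two-paragraph document. It checks that the BeautifulSoup object is a Tag instance, searches from it, and reads its .name and .attrs.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, for illustration only.
soup = BeautifulSoup("<p>One</p><p>Two</p>", "html.parser")

# BeautifulSoup is a subclass of Tag, so Tag methods are available on it.
is_tag = isinstance(soup, Tag)
texts = [p.get_text() for p in soup.find_all("p")]

# It has the special name and an empty attribute dictionary.
doc_name = soup.name
doc_attrs = soup.attrs

print(is_tag, texts, doc_name, doc_attrs)
```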

4.4. Comments and other special strings

Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

The Comment object is just a special type of NavigableString:

comment
# 'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
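
Because Comment is its own class, comments can be picked out of a tree by type. The following sketch, on made-up markup, collects every comment with find_all() and then removes them; it is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup containing two comments.
markup = "<div><!-- first --><p>Text<!-- second --></p></div>"
soup = BeautifulSoup(markup, "html.parser")

# Match only strings that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
cleaned = [str(c).strip() for c in comments]

# Every Comment is also a NavigableString.
all_navigable = all(isinstance(c, NavigableString) for c in comments)

# extract() detaches each comment from the tree.
for c in comments:
    c.extract()
without_comments = str(soup)

print(cleaned, all_navigable, without_comments)
```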

Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:

from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
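
The "subclass of NavigableString that adds something extra" relationship can be checked directly. This sketch verifies the class hierarchy and shows that a CData object compares equal to its plain text, with the CDATA wrapper appearing only when the tree is serialized. The empty <b> tag is just a container invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import (
    CData, Comment, Declaration, Doctype, NavigableString, ProcessingInstruction,
)

# All of the special string classes derive from NavigableString.
special = (Comment, CData, ProcessingInstruction, Declaration, Doctype)
all_strings = all(issubclass(cls, NavigableString) for cls in special)

# On its own, a CData object behaves like the text it holds.
cdata = CData("A CDATA block")
same_as_text = cdata == "A CDATA block"

# The extra wrapper appears only when the tree is turned back into markup.
soup = BeautifulSoup("<b></b>", "html.parser")
soup.b.append(cdata)
serialized = str(soup.b)

print(all_strings, same_as_text, serialized)
```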