4. Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
4.1. Tag
A Tag object corresponds to an XML or HTML tag in the original document:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.
4.1.1. Name
Every tag has a name, accessible as .name:
tag.name
# 'b'
If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
4.1.2. Attributes
A tag may have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:
tag['class']
# ['boldest']  (class is multi-valued, so its value is a list; see Multi-valued attributes)
You can access that dictionary directly as .attrs:
tag.attrs
# {'class': ['boldest']}
You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
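The dictionary-style access described above can be sketched end to end as follows. This is a minimal illustration, not the only way to do it: the markup and the variable names are made up for the example, and "html.parser" is passed explicitly only so the result does not depend on which optional parsers are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
soup = BeautifulSoup('<a href="/home" id="nav">Home</a>', "html.parser")
link = soup.a

# Present attribute: plain dictionary-style lookup.
href = link["href"]

# Missing attribute: .get() returns None (or a supplied default)
# instead of raising KeyError.
title = link.get("title")
title_or_default = link.get("title", "untitled")

# has_attr() tests for an attribute without reading its value.
has_id = link.has_attr("id")

# .attrs is the underlying dictionary, so its keys can be listed.
attr_names = sorted(link.attrs)
```

Using .get() is usually the safer choice when you can't be sure the attribute exists in every tag you visit.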
4.1.2.1. Multi-valued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ['body', 'strikeout']
css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ['body']
If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:
id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'
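Because some attributes come back as lists and others as plain strings, code that handles both needs a little care. Here is a small sketch of the difference. The markup is invented for the example, and it assumes a Beautiful Soup release that provides Tag.get_attribute_list, which wraps a single value in a list for you.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one multi-valued attribute, one ordinary attribute.
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

# class is multi-valued, so the value is a list of the individual names.
classes = p["class"]
is_struck = "strikeout" in classes

# id is not multi-valued, so the value stays a single string,
# embedded space and all.
raw_id = p["id"]

# get_attribute_list() always hands back a list, whichever kind of
# attribute it is asked about (assumes a release that includes it).
id_as_list = p.get_attribute_list("id")
```

Testing membership with `in` on the list avoids the false positives you would get from a substring check against the raw attribute text.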
When you turn a tag back into a string, multiple attribute values are consolidated:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
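The same consolidation happens if you modify the list in place rather than assigning a new one. A minimal sketch, with made-up markup and an explicit "html.parser" so the output doesn't depend on the installed parsers:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
soup = BeautifulSoup('<p class="body"></p>', "html.parser")
p = soup.p

# The value of a multi-valued attribute is a list, so list methods apply.
p["class"].append("highlight")

# Turning the tag back into a string joins the values with single spaces.
rendered = str(p)
```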
If you parse a document as XML, there are no multi-valued attributes:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'
4.3. BeautifulSoup
The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.
Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:
soup.name
# '[document]'
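To make "you can treat it as a Tag" concrete, here is a short sketch that calls a Tag method on the BeautifulSoup object and inspects its name and attributes. The markup and variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-paragraph document.
soup = BeautifulSoup("<p>One</p><p>Two</p>", "html.parser")

# The BeautifulSoup object is itself a kind of Tag...
is_tag = isinstance(soup, Tag)

# ...so searching works on it exactly as it would on any other tag.
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

# It has the placeholder name and an empty attribute dictionary.
doc_name = soup.name
doc_attrs = soup.attrs
```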
4.4. Comments and other special strings
Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
The Comment object is just a special type of NavigableString:
comment
# 'Hey, buddy. Want to buy a used parser?'
But when it appears as part of an HTML document, a Comment is displayed with special formatting:
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
Beautiful Soup defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Just like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CDATA block]]>
# </b>
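Since these special strings are all subclasses of NavigableString, an isinstance check is enough to tell them apart from ordinary text. The sketch below collects every comment in a document and separately keeps only the ordinary text. The markup is invented for the example, and it assumes a Beautiful Soup release where find_all accepts the string argument.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Hypothetical markup mixing comments with ordinary text.
markup = "<div><!--first note-->Visible text<!--second note--></div>"
soup = BeautifulSoup(markup, "html.parser")

# A function passed as string= is called with each string in the tree;
# keeping only Comment instances yields every comment, in document order.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# Every Comment is also a NavigableString.
all_are_strings = all(isinstance(c, NavigableString) for c in comments)

# Walk the direct children of <div> and keep only the ordinary strings.
visible = [
    str(child)
    for child in soup.div.children
    if isinstance(child, NavigableString) and not isinstance(child, Comment)
]
```

The same pattern should work for CData, ProcessingInstruction, Declaration, and Doctype by swapping in the class you care about.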