Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects.

.. _Tag:

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original document::

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in :ref:`navigating_the_tree` and :ref:`searching_the_tree`. For now, the
most important features of a tag are its name and attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

    tag.name
    # u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

    tag.name = "blockquote"
    tag
    # <blockquote class="boldest">Extremely bold</blockquote>

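Here's the renaming step above as one self-contained sketch. It assumes Python 3 and the built-in ``html.parser`` backend, neither of which the text above specifies, and the variable names ``original_name`` and ``renamed_markup`` are only there for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: the "html.parser" backend; the text above does not name a parser.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

original_name = tag.name      # the name parsed from the markup
tag.name = "blockquote"       # rename the tag in place
renamed_markup = str(tag)     # markup is regenerated using the new name

print(original_name)          # b
print(renamed_markup)         # <blockquote class="boldest">Extremely bold</blockquote>
```

Once the tag is renamed, ``soup.b`` no longer finds it, because lookups by name use the current ``.name``.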
.. _kind_of_obj attributes:

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary. (The value comes back as a list because ``class`` is a
multi-valued attribute; see :ref:`multivalue`)::

    tag['class']
    # [u'boldest']

You can access that dictionary directly as ``.attrs``::

    tag.attrs
    # {u'class': [u'boldest']}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

    tag['class']
    # KeyError: 'class'
    print(tag.get('class'))
    # None

.. _multivalue:

Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&

HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.p['class']
    # ["body", "strikeout"]

    css_soup = BeautifulSoup('<p class="body"></p>')
    css_soup.p['class']
    # ["body"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes::

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # u'body strikeout'

``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

    tag.string
    # u'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in
:ref:`navigating_the_tree` and :ref:`searching_the_tree`. You can
convert a ``NavigableString`` to a Unicode string with ``unicode()``::

    unicode_string = unicode(tag.string)
    unicode_string
    # u'Extremely bold'
    type(unicode_string)
    # <type 'unicode'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with`::

    tag.string.replace_with("No longer bold")
    tag
    # <blockquote>No longer bold</blockquote>

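The conversion and replacement steps above can be sketched as one runnable snippet. It assumes Python 3, where ``str()`` plays the role that ``unicode()`` plays in the examples above, and the built-in ``html.parser`` backend. The names ``navigable``, ``plain`` and ``replaced_markup`` are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: the "html.parser" backend; the text above does not name a parser.
soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
tag = soup.b

navigable = tag.string
is_navigable = isinstance(navigable, NavigableString)

# Under Python 3, str() is the counterpart of unicode(): it returns an
# ordinary string that no longer points back into the parse tree.
plain = str(navigable)
plain_type = type(plain)

# A string can't be edited in place, so swap in a new one instead.
tag.string.replace_with("No longer bold")
replaced_markup = str(tag)

print(plain_type)
print(replaced_markup)
```

The plain copy is the one to keep around once you're done with the tree, since it holds no reference to the ``BeautifulSoup`` object.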
``NavigableString`` supports most of the features described in
:ref:`navigating_the_tree` and :ref:`searching_the_tree`, but not all
of them. In particular, since a string can't contain anything (the way
a tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.

If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``unicode()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
:ref:`navigating_the_tree` and :ref:`searching_the_tree`.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

    soup.name
    # u'[document]'

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::

    markup = "<b><!--Hey, buddy. Want to buy a used parser--></b>"
    soup = BeautifulSoup(markup)
    comment = soup.b.string
    type(comment)
    # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

    comment
    # u'Hey, buddy. Want to buy a used parser'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

    print(soup.b.prettify())
    # <b>
    #  <!--Hey, buddy. Want to buy a used parser-->
    # </b>

Beautiful Soup defines classes for anything else that might show up in
an XML document: ``CData``, ``ProcessingInstruction``,
``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
are subclasses of ``NavigableString`` that add something extra to the
string. Here's an example that replaces the comment with a CDATA
block::

    from bs4 import CData
    cdata = CData("A CDATA block")
    comment.replace_with(cdata)

    print(soup.b.prettify())
    # <b>
    #  <![CDATA[A CDATA block]]>
    # </b>

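To check the subclass relationship described above for yourself, here's a minimal sketch. It assumes Python 3 and the built-in ``html.parser`` backend, and it uses ``str()`` rather than ``prettify()`` so the output has no added whitespace. The names ``comment_is_string`` and ``cdata_markup`` are only illustrative.

```python
from bs4 import BeautifulSoup, CData, Comment, NavigableString

# Assumption: the "html.parser" backend; the text above does not name a parser.
markup = "<b><!--Hey, buddy. Want to buy a used parser--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string

# A Comment is a NavigableString with some extra output formatting.
comment_is_comment = isinstance(comment, Comment)
comment_is_string = isinstance(comment, NavigableString)

# Replacing the comment with a CData object changes only the wrapper
# that is written around the text when the tree is turned into markup.
comment.replace_with(CData("A CDATA block"))
cdata_markup = str(soup.b)

print(comment_is_comment, comment_is_string)
print(cdata_markup)
```

Because these are all ``NavigableString`` subclasses, code that handles plain strings in the tree will usually handle them too. Test with ``isinstance`` when you need to tell them apart.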