Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
*kinds* of objects: ``Tag``, ``NavigableString``, ``BeautifulSoup``,
and ``Comment``.

.. _Tag:

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original document::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in :ref:`navigating_the_tree` and :ref:`searching_the_tree`. For now, the most
important features of a tag are its name and attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

 tag.name
 # 'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

 tag.name = "blockquote"
 tag
 # <blockquote class="boldest">Extremely bold</blockquote>

.. _kind_of_obj attributes:

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
class="boldest">`` has an attribute "class" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

 tag['class']
 # ['boldest']

The value is a list because ``class`` is a
:ref:`multi-valued attribute <multivalue>`. You can access the
attribute dictionary directly as ``.attrs``::

 tag.attrs
 # {'class': ['boldest']}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

 tag['class'] = 'verybold'
 tag['id'] = 1
 tag
 # <blockquote class="verybold" id="1">Extremely bold</blockquote>

 del tag['class']
 del tag['id']
 tag
 # <blockquote>Extremely bold</blockquote>

 tag['class']
 # KeyError: 'class'
 print(tag.get('class'))
 # None
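
Because attribute access follows the dictionary protocol, the usual
ways of avoiding a ``KeyError`` apply. Here's a minimal sketch, assuming
Python 3 and the built-in ``html.parser``; the markup and variable
names are made up for illustration::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup('<a href="/home" id="nav">Home</a>', 'html.parser')
 link = soup.a

 # .get() returns None (or a default you supply) instead of raising KeyError.
 href = link.get('href')
 title = link.get('title', 'untitled')

 # has_attr() tests for an attribute without fetching its value.
 has_id = link.has_attr('id')
 has_title = link.has_attr('title')

 print(href, title, has_id, has_title)
 # /home untitled True False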

.. _multivalue:

Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&

HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

 css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
 css_soup.p['class']
 # ['body', 'strikeout']

 css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
 css_soup.p['class']
 # ['body']

If an attribute *looks* like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

 id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
 id_soup.p['id']
 # 'my id'
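
If you'd rather get a list back whether or not the attribute is
multi-valued, recent Beautiful Soup 4 releases offer
``get_attribute_list()``. The sketch below assumes such a release
(older 4.x versions don't have the method) along with Python 3 and the
built-in ``html.parser``::

 from bs4 import BeautifulSoup

 soup = BeautifulSoup('<p id="my id" class="body strikeout"></p>', 'html.parser')

 # A single-valued attribute comes back as a plain string...
 id_value = soup.p['id']
 # ...but get_attribute_list() always hands back a list.
 id_list = soup.p.get_attribute_list('id')
 class_list = soup.p.get_attribute_list('class')

 print(id_value)    # my id
 print(id_list)     # ['my id']
 print(class_list)  # ['body', 'strikeout']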

When you turn a tag back into a string, multiple attribute values are
consolidated::

 rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
 rel_soup.a['rel']
 # ['index']
 rel_soup.a['rel'] = ['index', 'contents']
 print(rel_soup.p)
 # <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes::

 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
 xml_soup.p['class']
 # 'body strikeout'


``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

 tag.string
 # 'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in
:ref:`navigating_the_tree` and :ref:`searching_the_tree`. You can
convert a ``NavigableString`` to a Unicode string with ``str()``
(``unicode()`` on Python 2)::

 unicode_string = str(tag.string)
 unicode_string
 # 'Extremely bold'
 type(unicode_string)
 # <class 'str'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with`::

 tag.string.replace_with("No longer bold")
 tag
 # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
:ref:`navigating_the_tree` and :ref:`searching_the_tree`, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.
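
To illustrate what a string *does* support, here's a small sketch,
assuming Python 3 and the built-in ``html.parser``. It only uses
ordinary string methods and the ``.parent`` attribute::

 from bs4 import BeautifulSoup
 from bs4.element import NavigableString

 soup = BeautifulSoup('<b>Extremely bold</b>', 'html.parser')
 text = soup.b.string

 # A NavigableString is a str subclass, so ordinary string methods work.
 is_navigable = isinstance(text, NavigableString)
 is_str = isinstance(text, str)
 upper = text.upper()

 # It can also be navigated: .parent leads back to the enclosing tag.
 parent_name = text.parent.name

 print(is_navigable, is_str, upper, parent_name)
 # True True EXTREMELY BOLD b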

If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``str()`` on it (``unicode()`` on Python 2) to turn it
into a normal Python Unicode string. If you don't, your string will
carry around a reference to the entire Beautiful Soup parse tree, even
when you're done using Beautiful Soup. This is a big waste of memory.

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object itself represents the document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
:ref:`navigating_the_tree` and :ref:`searching_the_tree`.

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

 soup.name
 # '[document]'
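
Here's a short sketch of treating the ``BeautifulSoup`` object like a
tag, assuming Python 3 and the built-in ``html.parser``. The
two-paragraph markup is invented for the example::

 from bs4 import BeautifulSoup
 from bs4.element import Tag

 soup = BeautifulSoup('<p>One</p><p>Two</p>', 'html.parser')

 # BeautifulSoup is a Tag subclass, so tree-searching methods work on it.
 is_tag = isinstance(soup, Tag)
 paragraphs = [str(p.string) for p in soup.find_all('p')]

 # It carries the placeholder name and an empty attribute dictionary.
 no_attrs = soup.attrs == {}

 print(is_tag, paragraphs, soup.name, no_attrs)
 # True ['One', 'Two'] [document] True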

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The only one you'll probably ever need to worry about
is the comment::

 markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
 soup = BeautifulSoup(markup, 'html.parser')
 comment = soup.b.string
 type(comment)
 # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

 comment
 # 'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

 print(soup.b.prettify())
 # <b>
 #  <!--Hey, buddy. Want to buy a used parser?-->
 # </b>
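
Since a ``Comment`` is its own class, you can pick comments out of a
document by type. This is one possible approach, sketched under the
assumption of Python 3, the built-in ``html.parser``, and a Beautiful
Soup release that accepts the ``string`` argument to ``find_all()``;
the markup is made up::

 from bs4 import BeautifulSoup
 from bs4.element import Comment

 markup = '<p>Visible<!-- hidden note --></p><!-- trailing -->'
 soup = BeautifulSoup(markup, 'html.parser')

 # Match every string node that is specifically a Comment.
 comments = soup.find_all(string=lambda text: isinstance(text, Comment))
 cleaned = [str(c).strip() for c in comments]

 print(cleaned)
 # ['hidden note', 'trailing']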

Beautiful Soup defines classes for anything else that might show up in
an XML document: ``CData``, ``ProcessingInstruction``,
``Declaration``, and ``Doctype``. Just like ``Comment``, these classes
are subclasses of ``NavigableString`` that add something extra to the
string. Here's an example that replaces the comment with a CDATA
block::

 from bs4 import CData
 cdata = CData("A CDATA block")
 comment.replace_with(cdata)

 print(soup.b.prettify())
 # <b>
 #  <![CDATA[A CDATA block]]>
 # </b>
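
The subclass relationship can be checked directly. A minimal sketch,
assuming Python 3 and that all five classes are importable from
``bs4.element``::

 from bs4.element import (CData, Comment, Declaration, Doctype,
                          NavigableString, ProcessingInstruction)

 special_types = [Comment, CData, ProcessingInstruction, Declaration, Doctype]

 # Every special string type inherits from NavigableString...
 all_navigable = all(issubclass(cls, NavigableString) for cls in special_types)

 # ...so an instance still behaves like an ordinary string.
 cdata = CData("A CDATA block")
 print(all_navigable, cdata.upper(), len(cdata))
 # True A CDATA BLOCK 13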