.. _navigating_the_tree:
Navigating the tree
===================
Here's the "Three sisters" HTML document again::
html_doc = """
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
I'll use this as an example to show you how to move from one part of
a document to another.
Going down
----------
Tags may contain strings and other tags. These elements are the tag's
`children`. Beautiful Soup provides a lot of different attributes for
navigating and iterating over a tag's children.
Note that Beautiful Soup strings don't support any of these
attributes, because a string can't have children.
.. _navtree using tag names:
Navigating using tag names
^^^^^^^^^^^^^^^^^^^^^^^^^^
The simplest way to navigate the parse tree is to say the name of the
tag you want. If you want the tag, just say ``soup.head``::
soup.head
# The Dormouse's story
soup.title
# The Dormouse's story
You can do use this trick again and again to zoom in on a certain part
of the parse tree. This code gets the first tag beneath the tag::
soup.body.b
# The Dormouse's story
Using a tag name as an attribute will give you only the `first` tag by that
name::
soup.a
# Elsie
If you need to get `all` the tags, or anything more complicated
than the first tag with a certain name, you'll need to use one of the
methods described in :ref:`searching_the_tree`, such as `find_all()`::
soup.find_all('a')
# [Elsie,
# Lacie,
# Tillie]
``.contents`` and ``.children``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A tag's children are available in a list called ``.contents``::
head_tag = soup.head
head_tag
# The Dormouse's story
head_tag.contents
[The Dormouse's story]
title_tag = head_tag.contents[0]
title_tag
# The Dormouse's story
title_tag.contents
# [u'The Dormouse's story']
The ``BeautifulSoup`` object itself has children. In this case, the
tag is the child of the ``BeautifulSoup`` object.::
len(soup.contents)
# 1
soup.contents[0].name
# u'html'
A string does not have ``.contents``, because it can't contain
anything::
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
Instead of getting them as a list, you can iterate over a tag's
children using the ``.children`` generator::
for child in title_tag.children:
print(child)
# The Dormouse's story
``.descendants``
^^^^^^^^^^^^^^^^
The ``.contents`` and ``.children`` attributes only consider a tag's
`direct` children. For instance, the tag has a single direct
child--the tag::
head_tag.contents
# [The Dormouse's story]
But the tag itself has a child: the string "The Dormouse's
story". There's a sense in which that string is also a child of the
tag. The ``.descendants`` attribute lets you iterate over `all`
of a tag's children, recursively: its direct children, the children of
its direct children, and so on::
for child in head_tag.descendants:
print(child)
# The Dormouse's story
# The Dormouse's story
The tag has only one child, but it has two descendants: the
tag and the tag's child. The ``BeautifulSoup`` object
only has one direct child (the tag), but it has a whole lot of
descendants::
len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
.. _.string:
``.string``
^^^^^^^^^^^
If a tag has only one child, and that child is a ``NavigableString``,
the child is made available as ``.string``::
title_tag.string
# u'The Dormouse's story'
If a tag's only child is another tag, and `that` tag has a
``.string``, then the parent tag is considered to have the same
``.string`` as its child::
head_tag.contents
# [The Dormouse's story]
head_tag.string
# u'The Dormouse's story'
If a tag contains more than one thing, then it's not clear what
``.string`` should refer to, so ``.string`` is defined to be
``None``::
print(soup.html.string)
# None
.. _string-generators:
``.strings`` and ``stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If there's more than one thing inside a tag, you can still look at
just the strings. Use the ``.strings`` generator::
for string in soup.strings:
print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'
These strings tend to have a lot of extra whitespace, which you can
remove by using the ``.stripped_strings`` generator instead::
for string in soup.stripped_strings:
print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'
Here, strings consisting entirely of whitespace are ignored, and
whitespace at the beginning and end of strings is removed.
Going up
--------
Continuing the "family tree" analogy, every tag and every string has a
`parent`: the tag that contains it.
.. _.parent:
``.parent``
^^^^^^^^^^^
You can access an element's parent with the ``.parent`` attribute. In
the example "three sisters" document, the tag is the parent
of the tag::
title_tag = soup.title
title_tag
# The Dormouse's story
title_tag.parent
# The Dormouse's story
The title string itself has a parent: the tag that contains
it::
title_tag.string.parent
# The Dormouse's story
The parent of a top-level tag like is the ``BeautifulSoup`` object
itself::
html_tag = soup.html
type(html_tag.parent)
#
And the ``.parent`` of a ``BeautifulSoup`` object is defined as None::
print(soup.parent)
# None
.. _.parents:
``.parents``
^^^^^^^^^^^^
You can iterate over all of an element's parents with
``.parents``. This example uses ``.parents`` to travel from an tag
buried deep within the document, to the very top of the document::
link = soup.a
link
# Elsie
for parent in link.parents:
if parent is None:
print(parent)
else:
print(parent.name)
# p
# body
# html
# [document]
# None
Going sideways
--------------
Consider a simple document like this::
sibling_soup = BeautifulSoup("text1text2")
print(sibling_soup.prettify())
#
#
#
#
# text1
#
#
# text2
#
#
#
#
The tag and the tag are at the same level: they're both direct
children of the same tag. We call them `siblings`. When a document is
pretty-printed, siblings show up at the same indentation level. You
can also use this relationship in the code you write.
``.next_sibling`` and ``.previous_sibling``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can use ``.next_sibling`` and ``.previous_sibling`` to navigate
between page elements that are on the same level of the parse tree::
sibling_soup.b.next_sibling
# text2
sibling_soup.c.previous_sibling
# text1
The tag has a ``.next_sibling``, but no ``.previous_sibling``,
because there's nothing before the tag `on the same level of the
tree`. For the same reason, the tag has a ``.previous_sibling``
but no ``.next_sibling``::
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
The strings "text1" and "text2" are `not` siblings, because they don't
have the same parent::
sibling_soup.b.string
# u'text1'
print(sibling_soup.b.string.next_sibling)
# None
In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a
tag will usually be a string containing whitespace. Going back to the
"three sisters" document::
Elsie
Lacie
Tillie
You might think that the ``.next_sibling`` of the first tag would
be the second tag. But actually, it's a string: the comma and
newline that separate the first tag from the second::
link = soup.a
link
# Elsie
link.next_sibling
# u',\n'
The second tag is actually the ``.next_sibling`` of the comma::
link.next_sibling.next_sibling
# Lacie
.. _sibling-generators:
``.next_siblings`` and ``.previous_siblings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can iterate over a tag's siblings with ``.next_siblings`` or
``.previous_siblings``::
for sibling in soup.a.next_siblings:
print(repr(sibling))
# u',\n'
# Lacie
# u' and\n'
# Tillie
# u'; and they lived at the bottom of a well.'
# None
for sibling in soup.find(id="link3").previous_siblings:
print(repr(sibling))
# ' and\n'
# Lacie
# u',\n'
# Elsie
# u'Once upon a time there were three little sisters; and their names were\n'
# None
Going back and forth
--------------------
Take a look at the beginning of the "three sisters" document::
The Dormouse's story
The Dormouse's story
An HTML parser takes this string of characters and turns it into a
series of events: "open an tag", "open a tag", "open a
tag", "add a string", "close the tag", "open a
tag", and so on. Beautiful Soup offers tools for reconstructing the
initial parse of the document.
.. _element-generators:
``.next_element`` and ``.previous_element``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``.next_element`` attribute of a string or tag points to whatever
was parsed immediately afterwards. It might be the same as
``.next_sibling``, but it's usually drastically different.
Here's the final tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the tag.::
last_a_tag = soup.find("a", id="link3")
last_a_tag
# Tillie
last_a_tag.next_sibling
# '; and they lived at the bottom of a well.'
But the ``.next_element`` of that tag, the thing that was parsed
immediately after the tag, is `not` the rest of that sentence:
it's the word "Tillie"::
last_a_tag.next_element
# u'Tillie'
That's because in the original markup, the word "Tillie" appeared
before that semicolon. The parser encountered an tag, then the
word "Tillie", then the closing tag, then the semicolon and rest of
the sentence. The semicolon is on the same level as the tag, but the
word "Tillie" was encountered first.
The ``.previous_element`` attribute is the exact opposite of
``.next_element``. It points to whatever element was parsed
immediately before this one::
last_a_tag.previous_element
# u' and\n'
last_a_tag.previous_element.next_element
# Tillie
``.next_elements`` and ``.previous_elements``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You should get the idea by now. You can use these iterators to move
forward or backward in the document as it was parsed::
for element in last_a_tag.next_elements:
print(repr(element))
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
#
...
# u'...'
# u'\n'
# None