.. _searching_the_tree: Searching the tree ================== Beautiful Soup defines a lot of methods for searching the parse tree, but they're all very similar. I'm going to spend a lot of time explaining the two most popular methods: ``find()`` and ``find_all()``. The other methods take almost exactly the same arguments, so I'll just cover them briefly. Once again, I'll be using the "three sisters" document as an example:: html_doc = """
The Dormouse's story
Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.
...
""" from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc) By passing in a filter to an argument like ``find_all()``, you can zoom in on the parts of the document you're interested in. Kinds of filters ---------------- Before talking in detail about ``find_all()`` and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these. .. _a string: A string ^^^^^^^^ The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the tags in the document:: soup.find_all('b') # [The Dormouse's story] If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead. .. _a regular expression: A regular expression ^^^^^^^^^^^^^^^^^^^^ If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its ``match()`` method. This code finds all the tags whose names start with the letter "b"; in this case, the tag and the tag:: import re for tag in soup.find_all(re.compile("^b")): print(tag.name) # body # b This code finds all the tags whose names contain the letter 't':: for tag in soup.find_all(re.compile("t")): print(tag.name) # html # title .. _a list: A list ^^^^^^ If you pass in a list, Beautiful Soup will allow a string match against `any` item in that list. This code finds all the tags `and` all the tags:: soup.find_all(["a", "b"]) # [The Dormouse's story, # Elsie, # Lacie, # Tillie] .. _the value True: ``True`` ^^^^^^^^ The value ``True`` matches everything it can. This code finds `all` the tags in the document, but none of the text strings:: for tag in soup.find_all(True): print(tag.name) # html # head # title # body # p # b # p # a # a # a # p .. a function: A function ^^^^^^^^^^ If none of the other matches work for you, define a function that takes an element as its only argument. The function should return ``True`` if the argument matches, and ``False`` otherwise. Here's a function that returns ``True`` if a tag defines the "class" attribute but doesn't define the "id" attribute:: def has_class_but_no_id(tag): return tag.has_attr('class') and not tag.has_attr('id') Pass this function into ``find_all()`` and you'll pick up all thetags:: soup.find_all(has_class_but_no_id) # [
The Dormouse's story
, #Once upon a time there were...
, #...
] This function only picks up the tags. It doesn't pick up the
tags, because those tags define both "class" and "id". It doesn't pick
up tags like and The Dormouse's story tag with the CSS class "title"?
Let's look at the arguments to ``find_all()``.
.. _name:
The ``name`` argument
^^^^^^^^^^^^^^^^^^^^^
Pass in a value for ``name`` and you'll tell Beautiful Soup to only
consider tags with certain names. Text strings will be ignored, as
will tags whose names that don't match.
This is the simplest usage::
soup.find_all("title")
# [ The Dormouse's story Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well. tags is an
indirect parent of the string, and our search finds that as
well. There's a tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with ``find_parents()``.
You may have made the connection between ``find_parent()`` and
``find_parents()``, and the :ref:`.parent` and :ref:`.parents` attributes
mentioned earlier. The connection is very strong. These search methods
actually use ``.parents`` to iterate over all the parents, and check
each one against the provided filter to see if it matches.
``find_next_siblings()`` and ``find_next_sibling()``
----------------------------------------------------
Signature: find_next_siblings(:ref:`name ... The Dormouse's story ... tag in the document showed up, even though it's not in
the same part of the tree as the tag we started from. For these
methods, all that matters is that an element match the filter, and
show up later in the document than the starting element.
``find_all_previous()`` and ``find_previous()``
-----------------------------------------------
Signature: find_all_previous(:ref:`name Once upon a time there were three little sisters; ... The Dormouse's story tag that contains the tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
tag that contains an tag must have shown up before the
tag it contains.
CSS selectors
-------------
Beautiful Soup supports the most commonly-used `CSS selectors
... Hello Howdy, y'all Pip-pip, old fruit Bonjour mes amis Hello Howdy, y'all Pip-pip, old fruit