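.. _quick_start:

Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::

 html_doc = """
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
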
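 <p class="story">...</p>
 """

Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc, 'html.parser')

 print(soup.prettify())
 # <html>
 #  <head>
 #   <title>
 #    The Dormouse's story
 #   </title>
 #  </head>
 #  <body>
 #   <p class="title">
 #    <b>
 #     The Dormouse's story
 #    </b>
 #   </p>
 #   <p class="story">
 #    Once upon a time there were three little sisters; and their names were
 #    <a class="sister" href="http://example.com/elsie" id="link1">
 #     Elsie
 #    </a>
 #    ,
 #    <a class="sister" href="http://example.com/lacie" id="link2">
 #     Lacie
 #    </a>
 #    and
 #    <a class="sister" href="http://example.com/tillie" id="link3">
 #     Tillie
 #    </a>
 #    ; and they lived at the bottom of a well.
 #   </p>
 #   <p class="story">
 #    ...
 #   </p>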
 #  </body>
 # </html>

Here are some simple ways to navigate that data structure::

 soup.title
 # <title>The Dormouse's story</title>

 soup.title.name
 # 'title'

 soup.title.string
 # "The Dormouse's story"

 soup.title.parent.name
 # 'head'

 soup.p
 # <p class="title"><b>The Dormouse's story</b></p>

 soup.p['class']
 # ['title']

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find(id="link3")
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's <a> tags::

 for link in soup.find_all('a'):
     print(link.get('href'))
 # http://example.com/elsie
 # http://example.com/lacie
 # http://example.com/tillie

Another common task is extracting all the text from a page::

 print(soup.get_text())
 # The Dormouse's story
 #
 # The Dormouse's story
 #
 # Once upon a time there were three little sisters; and their names were
 # Elsie,
 # Lacie and
 # Tillie;
 # and they lived at the bottom of a well.
 #
 # ...

Does this look like what you need? If so, read on.
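
If you'd like something you can paste into a file and run before reading on, here's a small self-contained sketch of the URL-extraction task. The helper name ``extract_links`` is my own, made up for illustration, and is not part of Beautiful Soup. The sketch also assumes Python's built-in ``html.parser`` and uses a trimmed copy of the "three sisters" markup.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the "three sisters" markup, so this runs on its own.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""


def extract_links(markup):
    """Hypothetical helper: return (link text, href) pairs for each <a href=...> tag."""
    # Naming the parser explicitly keeps results the same from machine to machine.
    soup = BeautifulSoup(markup, "html.parser")
    # href=True skips any <a> tag that has no href attribute at all.
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


links = extract_links(html_doc)
for text, href in links:
    print(text, href)
```

Filtering with ``href=True`` is a design choice for this sketch: it means ``a["href"]`` can't raise ``KeyError`` on an anchor that has no ``href``.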