Beautiful Soup is one of the most popular Python libraries for parsing HTML and scraping data from websites. In this tutorial, we'll go through everything you need to know to start a full-fledged scraping project with Beautiful Soup.
The first step is to install Beautiful Soup. You can do this with pip:
pip install beautifulsoup4
This will grab the latest version of Beautiful Soup and its dependencies.
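Note that the 'lxml' and 'html5lib' parsers mentioned later in this tutorial are separate packages; if you plan to use either of them, install them the same way:
pip install lxml
pip install html5lib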
Now that Beautiful Soup is installed, we can start parsing HTML. To create a BeautifulSoup object, you need to pass in some HTML content and specify the parser to use.
Common parsers include 'html.parser', 'lxml', and 'html5lib'. Here's an example:
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page Title</title></head>
<body>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
Now the HTML is stored in a BeautifulSoup object called 'soup' that we can query.
BeautifulSoup represents the HTML document as a parse tree that can be traversed and searched. Common navigation methods include:
soup.title - Get the <title> element
soup.head - Get the <head> element
soup.body - Get the <body> element
soup.p - Get the first <p> element
soup.find_all('p') - Get a list of all <p> tags
We can also navigate relationships like:
soup.head.title
soup.body.p
These return Tag objects that contain the parsed element.
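Here's a quick sketch of how these lookups behave, reusing the soup object built from the sample HTML above (the comments show the expected output):
# Dotted access returns the first matching element
print(soup.title)          # <title>Page Title</title>
print(soup.p)              # <p>This is a paragraph.</p>
# find_all() returns a list of every matching tag
print(soup.find_all('p'))  # [<p>This is a paragraph.</p>, <p>This is another paragraph.</p>]
# Chained access walks down the tree one child at a time
print(soup.head.title)     # <title>Page Title</title>
print(soup.body.p)         # <p>This is a paragraph.</p>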
Now that we have Tag objects, we can extract useful data from them:
tag.name - The tag's name as a string
tag.text - The tag's text content as a plain string
tag.string - The tag's contents as a NavigableString (when the tag contains a single string)
tag['id'] - Value of the id attribute
tag.attrs - Dictionary of all attributes
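Here's a short sketch of these accessors, using a small paragraph snippet with an id attribute (the id and class values are made up purely to illustrate attribute access):
# Parse a one-line snippet and grab its <p> tag
snippet = '<p id="intro" class="lead">Hello, world.</p>'
tag = BeautifulSoup(snippet, 'html.parser').p
print(tag.name)    # p
print(tag.text)    # Hello, world.
print(tag['id'])   # intro
print(tag.attrs)   # {'id': 'intro', 'class': ['lead']}
Note that Beautiful Soup treats class as a multi-valued attribute, which is why it shows up as a list in tag.attrs.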