Beautiful Soup

Using Beautiful Soup for Web Scraping

Beautiful Soup is one of the most popular Python libraries for parsing HTML and scraping data from websites. In this tutorial, we'll cover the basics you need to start a full-fledged scraping project with Beautiful Soup: installing it, parsing HTML, navigating the parse tree, and extracting data from tags.

Installing Beautiful Soup

The first step is to install Beautiful Soup. You can do this with pip:

pip install beautifulsoup4

This installs the latest version of Beautiful Soup 4 and its dependencies. Note that the package is named beautifulsoup4 on PyPI but is imported in code as bs4.
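A quick way to confirm the installation worked is to import the package and print its version. This is only a minimal sanity check; the exact version string depends on what pip installed on your machine.

```python
# Minimal check that Beautiful Soup is importable after installation.
import bs4

# __version__ is a string; its exact value depends on your installation.
version = bs4.__version__
print("Beautiful Soup version:", version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.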

Parsing HTML with BeautifulSoup

Now that Beautiful Soup is installed, we can start parsing HTML. To create a BeautifulSoup object, you need to pass in some HTML content and specify the parser to use.

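Common parsers include 'html.parser', which ships with Python's standard library, and the third-party 'lxml' and 'html5lib', which must be installed separately (for example, pip install lxml). Here's an example using 'html.parser':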

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page Title</title></head>
<body>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

Now the HTML is stored in a BeautifulSoup object called 'soup' that we can query.

Beautiful Soup represents the HTML document as a parse tree that can be traversed and searched. Common ways to navigate it include:

  • soup.title - Get the <title> element
  • soup.head - Get the <head> element
  • soup.body - Get the <body> element
  • soup.p - Get the first <p> element
  • soup.find_all('p') - Get a list of all <p> tags
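The lookups above can be sketched against the sample document from the previous section. This is a minimal illustration, not a complete scraper; the variable names (title_tag, first_p, all_p) are chosen here for the example.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page Title</title></head>
<body>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title        # the <title> Tag
first_p = soup.p              # only the FIRST <p> Tag in the document
all_p = soup.find_all("p")    # list-like collection of every <p> Tag

print(title_tag)              # <title>Page Title</title>
print(first_p)                # <p>This is a paragraph.</p>
print(len(all_p))             # 2

# Iterate over every paragraph and print its text content.
for p in all_p:
    print(p.get_text())
```

Note the difference between the last two lookups: attribute access such as soup.p stops at the first match, while find_all collects every match.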

We can also chain these attributes to navigate from a parent element to its children:

  • soup.head.title
  • soup.body.p

Each lookup returns a Tag object representing the first matching element, or None if no such element exists.
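As a small sketch of chained navigation, the snippet below walks from the document root down to nested elements and checks what comes back. The lookup of a <table> element is a hypothetical example of a tag that is absent from this sample HTML.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><head><title>Page Title</title></head>"
    "<body><p>This is a paragraph.</p><p>This is another paragraph.</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

head_title = soup.head.title   # <title> reached through <head>
body_p = soup.body.p           # first <p> reached through <body>

print(isinstance(head_title, Tag))   # True
print(head_title.parent.name)        # head
print(body_p.parent.name)            # body

# Looking up a tag that is not present returns None instead of raising.
missing = soup.body.table
print(missing)                       # None
```

Because a missing element yields None, chaining a further attribute onto it (for example soup.body.table.tr) raises AttributeError, so it is worth checking for None when the page structure is not guaranteed.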

Extracting Data from Tags

Now that we have Tag objects, we can extract useful data from them:

  • tag.name - Tag name as a string
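  • tag.text - All text inside the tag, including nested tags, as a plain string (tag.string returns a NavigableString when the tag has a single text child)
  • tag['id'] - Value of the id attribute (raises KeyError if the attribute is missing; tag.get('id') returns None instead)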
  • tag.attrs - Dictionary of all attributes
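The properties above can be tried on a single element. The HTML fragment here, with its id and class values, is made up for illustration and is not part of the earlier sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with attributes and a nested tag.
html = '<p id="intro" class="lead">This is a <b>paragraph</b>.</p>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.p

print(tag.name)          # p
print(tag.text)          # This is a paragraph.
print(tag["id"])         # intro
print(tag.attrs)         # dictionary holding the id and class attributes

# .get() avoids a KeyError when an attribute is absent.
print(tag.get("href"))   # None

# .string is None here because <p> has more than one child node,
# but the nested <b> has a single text child.
print(tag.string)        # None
print(tag.b.string)      # paragraph
```

One detail worth knowing: class is treated as a multi-valued attribute, so tag.attrs stores it as a list (here containing 'lead') rather than a single string.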