Beautiful Soup is one of the most popular Python libraries for parsing HTML and scraping data from websites. In this tutorial, we'll go through everything you need to know to start a full-fledged scraping project with Beautiful Soup.
The first step is to install Beautiful Soup. You can do this with pip:
pip install beautifulsoup4
This will grab the latest version of Beautiful Soup and its dependencies.
Now that Beautiful Soup is installed, we can start parsing HTML. To create a BeautifulSoup object, you need to pass in some HTML content and specify the parser to use.
Common parsers include 'html.parser', 'lxml', and 'html5lib'. Here's an example:
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Page Title</title></head>
<body>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
Now the HTML is stored in a BeautifulSoup object called 'soup' that we can query.
BeautifulSoup represents the HTML document as a parse tree that can be traversed and searched. Common navigation methods include:
soup.title - Get the <title> element
soup.head - Get the <head> element
soup.body - Get the <body> element
soup.p - Get the first <p> element
soup.find_all('p') - Get a list of all <p> tags
We can also navigate relationships like:
soup.head.title
soup.body.p
These return Tag objects that contain the parsed element.
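Using the example document from above, these lookups behave like this:
soup.title          # <title>Page Title</title>
soup.title.text     # 'Page Title'
soup.p              # <p>This is a paragraph.</p> - only the first match
soup.find_all('p')  # [<p>This is a paragraph.</p>, <p>This is another paragraph.</p>]
soup.head.title     # <title>Page Title</title>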
Now that we have Tag objects, we can extract useful data from them:
tag.name - Tag name as a string
tag.text - All of the tag's text as a plain string
tag['id'] - Value of the id attribute
tag.attrs - Dictionary of all attributes
Some common tasks are getting text, attributes, or finding descendant tags.
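For example, given a tag parsed from a small snippet (the id attribute here is made up for illustration):
from bs4 import BeautifulSoup

tag = BeautifulSoup('<p id="intro">Hello</p>', 'html.parser').p
tag.name    # 'p'
tag.text    # 'Hello'
tag['id']   # 'intro'
tag.attrs   # {'id': 'intro'}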
For more complex searches, Beautiful Soup supports CSS selectors:
links = soup.select('p a')
This retrieves all <a> tags inside <p> tags. You can use class, ID, and attribute selectors as well.
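A few selector patterns (the class and id names here are hypothetical):
soup.select('.headline')        # tags with class="headline"
soup.select('#main')            # the tag with id="main"
soup.select('a[href]')          # <a> tags that have an href attribute
soup.select('div.article > p')  # <p> tags that are direct children of <div class="article">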
When scraping multiple pages, you'll want to save data for later processing:
import json

with open('data.json', 'w') as f:
    json.dump(data, f)
And load saved data:
with open('data.json') as f:
    data = json.load(f)
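If you're wondering what data might hold, here's a small end-to-end sketch that collects paragraph text into a list before saving; the structure is purely illustrative:
import json
from bs4 import BeautifulSoup

html = '<p>This is a paragraph.</p><p>This is another paragraph.</p>'
soup = BeautifulSoup(html, 'html.parser')

# Collect the text of every <p> tag into a list
data = [p.text for p in soup.find_all('p')]

with open('data.json', 'w') as f:
    json.dump(data, f)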
Here are some more advanced techniques for web scraping with Beautiful Soup:
Many sites load dynamic content via JavaScript after page load. Beautiful Soup can't execute JS, so the HTML will be incomplete.
For this, use Selenium to automate a browser:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Chrome()
browser.get('http://example.com')
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
Selenium lets the page's JavaScript run in a real browser before you grab the HTML.
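Even so, dynamic content can take a moment to appear. A common pattern is an explicit wait for a known element before reading the page source; the element ID below is hypothetical:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a known element to appear before reading the HTML
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'content'))  # hypothetical ID
)
html = browser.page_source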
To interact with forms, you'll need Selenium to control the browser:
from selenium.webdriver.common.by import By

username = browser.find_element(By.ID, 'username')
username.send_keys('myaccount')
password = browser.find_element(By.ID, 'password')
password.send_keys('mypassword')
submit = browser.find_element(By.CSS_SELECTOR, 'input[type="submit"]')
submit.click()
After logging in, you can scrape pages that require auth.
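Once logged in, one option is to hand Selenium's session cookies to Requests and continue scraping without the browser overhead. A minimal sketch, assuming the login above succeeded (the members URL is hypothetical):
import requests

session = requests.Session()
# Copy the authenticated cookies from the Selenium browser into Requests
for cookie in browser.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

response = session.get('http://example.com/members')  # hypothetical protected page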
You don't necessarily need to use Selenium to scrape dynamic websites with Beautiful Soup. Here are some alternative approaches:
Requests-HTML lets you make HTTP requests and query the response much like Beautiful Soup, but it can also execute the JavaScript on the page.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://example.com')
r.html.render()
soup = BeautifulSoup(r.html.html, 'html.parser')
Here r.html.render() executes the page's JavaScript (downloading Chromium the first time it runs), and r.html.html then holds the rendered markup to pass to BeautifulSoup.
You can pass browser-like headers with Requests to trick sites:
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get('http://example.com', headers=headers)
Some sites may still detect it's not a real browser though.
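If you're making many requests, it's worth setting the headers once on a Session, which applies them to every call and reuses the underlying connection:
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
})

response = session.get('http://example.com')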
Using the browser dev tools, identify the exact APIs/endpoints used to load dynamic content.
Make direct requests to these endpoints to bypass rendering:
import requests

url = 'http://example.com/api/data'
response = requests.get(url)
data = response.json()
This bypasses rendering entirely and gives you the data as Python dicts and lists.
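These endpoints often accept query parameters, for example for paging. The parameter names below are hypothetical - check the network tab for the real ones:
import requests

# 'page' and 'limit' are hypothetical parameter names
response = requests.get('http://example.com/api/data', params={'page': 2, 'limit': 50})
data = response.json()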
Puppeteer and Playwright let you drive a real browser engine like Chromium headlessly (no visible UI), so JavaScript executes just as it would for a normal visitor.
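Playwright has official Python bindings, so the same pattern works here too. A minimal sketch with its synchronous API, handing the rendered HTML to BeautifulSoup:
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()  # headless by default
    page = browser.new_page()
    page.goto('http://example.com')
    html = page.content()  # the HTML after JavaScript has run
    browser.close()

soup = BeautifulSoup(html, 'html.parser')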
So in summary, Selenium isn't always needed - Requests-based approaches or other browser automation tools can do the job too.
To avoid being blocked, you may need to:
- Space out your requests so you don't hammer the server
- Rotate User-Agent headers or route traffic through proxies
- Respect robots.txt and each site's terms of service
Finally, add error handling to make the scraper robust.
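For instance, here's a minimal sketch combining a polite delay with basic error handling; the URLs and timing are just placeholders:
import time
import requests
from bs4 import BeautifulSoup

urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholder URLs

for url in urls:
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raise an exception on 4xx/5xx responses
    except requests.RequestException as e:
        print(f'Failed to fetch {url}: {e}')
        continue
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... extract data from soup here ...
    time.sleep(2)  # polite delay between requests
And that covers some advanced Beautiful Soup techniques!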