Using Beautiful Soup for Web Scraping
...

Beautiful Soup is one of the most popular Python libraries for parsing HTML and scraping data from websites. In this tutorial, we'll go through everything you need to know to start a full-fledged scraping project with Beautiful Soup.

Installing Beautiful Soup
...

The first step is to install Beautiful Soup. You can do this with pip:

pip install beautifulsoup4

This will grab the latest version of Beautiful Soup and its dependencies.
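
To confirm the installation, you can import the package and print its version (a quick sanity check; the exact version number depends on when you install):

import bs4

# Print the installed Beautiful Soup version
print(bs4.__version__)

If you plan to use the faster 'lxml' parser later on, install it the same way with pip install lxml.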

Parsing HTML with BeautifulSoup
...

Now that Beautiful Soup is installed, we can start parsing HTML. To create a BeautifulSoup object, you need to pass in some HTML content and specify the parser to use.

Common parsers include 'html.parser', 'lxml', and 'html5lib'. Here's an example:

from bs4 import BeautifulSoup

html = """
<html>
<head><title>Page Title</title></head> 
<body>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

Now the HTML is stored in a BeautifulSoup object called 'soup' that we can query.

BeautifulSoup represents the HTML document as a parse tree that can be traversed and searched. Common navigation methods include:

  • soup.title - Get the <title> element
  • soup.head - Get the <head> element
  • soup.body - Get the <body> element
  • soup.p - Get the first <p> element
  • soup.find_all('p') - Get a list of all <p> tags

We can also navigate relationships like:

  • soup.head.title
  • soup.body.p

These return Tag objects that contain the parsed element.
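
Continuing with the soup object created above, here's a short sketch of these accessors in action:

# Dotted access returns the first matching Tag
print(soup.title)          # <title>Page Title</title>
print(soup.head.title)     # same element, reached through <head>
print(soup.p)              # only the first <p> tag

# find_all() returns every match as a list
for p in soup.find_all('p'):
    print(p.text)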

Extracting Data from Tags
...

Now that we have Tag objects, we can extract useful data from them:

  • tag.name - Tag name as a string
  • tag.text - Text content of the tag as a string
  • tag.string - The tag's single text child as a NavigableString (None if there are multiple children)
  • tag['id'] - Value of id attribute
  • tag.attrs - Dictionary of all attributes

Some common tasks are getting text, attributes, or finding descendant tags.
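
Using the soup object from earlier, a quick sketch (the example paragraphs have no attributes, so attrs comes back empty):

tag = soup.find('p')

print(tag.name)        # 'p'
print(tag.text)        # 'This is a paragraph.'
print(tag.attrs)       # {} (no attributes on this tag)
print(tag.get('id'))   # None; safer than tag['id'] when the attribute may be absent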

Searching with CSS Selectors
...

For more complex searches, Beautiful Soup supports CSS selectors:

links = soup.select('p a')

This retrieves all <a> tags inside <p> tags. You can use class, id and attribute selectors as well.
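
For instance (the class name, id, and href pattern below are made up for illustration):

# All elements with class="item"
items = soup.select('.item')

# The single element with id="main", or None if it doesn't exist
main = soup.select_one('#main')

# All links whose href ends in .pdf
pdf_links = soup.select('a[href$=".pdf"]')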

Saving and Loading Data
...

When scraping multiple pages, you'll want to save data for later processing:

import json

with open('data.json', 'w') as f:
    json.dump(data, f)

And load saved data:

with open('data.json') as f:
    data = json.load(f)

Here are some more advanced techniques for web scraping with Beautiful Soup:

Dynamic sites
...

Handling Rendered Content with Selenium
...

Many sites load content via JavaScript after the initial page load. Beautiful Soup can't execute JavaScript, so the HTML you fetch with a plain HTTP request will be missing that content.

For this, use Selenium to automate a browser:

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Chrome() 
browser.get('http://example.com')

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

Because Selenium drives a real browser, the JavaScript actually runs, and page_source returns the rendered HTML. In practice you often need to wait for the dynamic content to appear before grabbing it.
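
Here's a sketch using an explicit wait; the 'results' id is hypothetical and stands in for whatever element holds the dynamic content on your target page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get('http://example.com')

# Wait up to 10 seconds for the element holding the dynamic content
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'results'))
)

soup = BeautifulSoup(browser.page_source, 'html.parser')
browser.quit()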

Submitting Forms and Logging In
...

To interact with forms, you'll need Selenium to control the browser. In Selenium 4, elements are located with find_element and the By class:

from selenium.webdriver.common.by import By

username = browser.find_element(By.ID, 'username')
username.send_keys('myaccount')

password = browser.find_element(By.ID, 'password')
password.send_keys('mypassword')

submit = browser.find_element(By.CSS_SELECTOR, 'input[type="submit"]')
submit.click()

After logging in, you can navigate to and scrape pages that require authentication.
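
For example, once the login has gone through, a protected page can be handed to Beautiful Soup as before (the profile URL here is hypothetical):

browser.get('http://example.com/profile')
soup = BeautifulSoup(browser.page_source, 'html.parser')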

You don't necessarily need to use Selenium to scrape dynamic websites with Beautiful Soup. Here are some alternative approaches:

Use the Requests-HTML Library
...

Requests-HTML lets you make HTTP requests and parse the response much like Beautiful Soup, but it can also execute the JavaScript on the page.

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://example.com')
r.html.render()

soup = BeautifulSoup(r.html.html, 'html.parser')

The render() call executes the JavaScript first, and r.html.html holds the rendered markup to pass to BeautifulSoup.

Emulate Browser Behavior
...

You can pass browser-like headers with Requests to trick sites:

import requests

headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get('http://example.com', headers=headers)

Some sites may still detect it's not a real browser though.

Inspect Network Requests
...

Using the browser dev tools, identify the exact APIs/endpoints used to load dynamic content.

Make direct requests to these endpoints to bypass rendering:

import requests

url = 'http://example.com/api/data'
response = requests.get(url)
data = response.json()

Use a Headless Browser
...

Puppeteer and Playwright let you control a real headless browser such as Chromium without a visible UI, so JavaScript executes just as it would for a normal visitor.
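
A minimal sketch using Playwright's synchronous Python API (install it with pip install playwright, then run playwright install to download the browsers):

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch()       # headless by default
    page = browser.new_page()
    page.goto('http://example.com')
    html = page.content()               # HTML after the JavaScript has run
    browser.close()

soup = BeautifulSoup(html, 'html.parser')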

So in summary, Selenium isn't always needed - various Requests-based approaches can work too.

Scraping Data Behind AJAX Requests
...

Some sites load data via asynchronous requests. Inspect the network tab to find the API URL. You can directly make the request instead of scraping rendered HTML:

import requests

response = requests.get('http://example.com/api/data')
data = response.json()

This bypasses rendering and gives you data in Python dicts/lists.

Rotating Proxies and User-Agents
...

To avoid being blocked, you may need to:

  • Use rotating proxies
  • Randomize user-agents on each request
  • Add delays between requests
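
Here's a rough sketch combining these ideas with basic error handling; the user-agent strings, proxy addresses, and URLs are placeholders you'd replace with your own:

import random
import time
import requests

# Hypothetical pools to rotate through
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
]
proxies = ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080']

urls = ['http://example.com/page1', 'http://example.com/page2']

for url in urls:
    headers = {'User-Agent': random.choice(user_agents)}
    proxy = random.choice(proxies)
    try:
        response = requests.get(
            url,
            headers=headers,
            proxies={'http': proxy, 'https': proxy},
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f'Request failed for {url}: {exc}')
        continue

    # ... parse response.text with BeautifulSoup here ...

    time.sleep(random.uniform(1, 3))   # random delay between requests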

Error handling like the try/except in the sketch above keeps the scraper running when individual requests fail. And that's a tour of some advanced Beautiful Soup techniques!