Python Web Scraping

Web Scraping Basics

Web scraping is the process of extracting data from websites using automated tools. Python offers several libraries for this purpose.

import requests
from bs4 import BeautifulSoup

# Fetch webpage content
url = 'https://example.com'
response = requests.get(url)
print(response.status_code) # 200 means success

# Parse HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Extract data
title = soup.title.text
headings = [h1.text for h1 in soup.find_all('h1')]
links = [a['href'] for a in soup.find_all('a', href=True)]

print("Page title:", title)
print("First heading:", headings[0])
print("First link:", links[0])

Always check a website's robots.txt file (e.g., https://example.com/robots.txt) and Terms of Service before scraping.
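
For a quick manual check you can fetch robots.txt directly and read its rules; the Ethical Considerations section below shows a programmatic check with urllib.robotparser.

import requests

robots = requests.get('https://example.com/robots.txt')
print(robots.text) # Review the Disallow and Crawl-delay rules before scraping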

CSS Selectors

BeautifulSoup supports CSS selectors for more precise element targeting, similar to how you select elements in JavaScript.

import requests
from bs4 import BeautifulSoup

url = 'https://quotes.toscrape.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Select elements by CSS class
quotes = soup.select('.quote')
for quote in quotes:
    text = quote.select_one('.text').text
    author = quote.select_one('.author').text
    tags = [tag.text for tag in quote.select('.tag')]
    print(f"{text} - {author}")
    print("Tags:", ', '.join(tags))
    print()

# Select specific elements
next_page = soup.select_one('li.next > a')
if next_page:
    print("Next page URL:", next_page['href'])

CSS selectors provide a powerful way to navigate complex HTML structures with precision.
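
The select() method accepts most standard CSS syntax, so classes, attributes, and combinators can be mixed in a single query. The selectors below are illustrative, reuse the soup object from the example above, and assume a page structured like quotes.toscrape.com.

# Descendant combinator: the text span inside a quote block
first_text = soup.select_one('div.quote span.text')

# Attribute selector: links whose href starts with "http"
external_links = soup.select('a[href^="http"]')

# Grouping: authors and tags in a single query
authors_and_tags = soup.select('.author, .tag')

# Structural pseudo-class: the third <div> among its siblings that also matches .quote
third_quote = soup.select_one('.quote:nth-of-type(3)')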

Handling Dynamic Content

For JavaScript-rendered content, tools like Selenium or Playwright can be used to automate browsers.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Set up Selenium with Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Load a dynamic page
driver.get('https://quotes.toscrape.com/js/')

# Wait for elements to load (implicit wait)
driver.implicitly_wait(10) # seconds

# Extract data
quotes = driver.find_elements(By.CLASS_NAME, 'quote')
for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, 'text').text
    author = quote.find_element(By.CLASS_NAME, 'author').text
    print(f"{text} - {author}")

# Close the browser
driver.quit()

Browser automation tools are slower than plain HTTP requests, but they are often necessary for scraping modern JavaScript-heavy websites.
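
Playwright, mentioned above as an alternative to Selenium, follows a similar pattern. This is a minimal sketch using its synchronous API against the same demo page (requires the playwright package and a one-time "playwright install" to download browsers).

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://quotes.toscrape.com/js/')

    # Wait until the JavaScript-rendered quotes appear in the DOM
    page.wait_for_selector('.quote')

    for quote in page.query_selector_all('.quote'):
        text = quote.query_selector('.text').inner_text()
        author = quote.query_selector('.author').inner_text()
        print(f"{text} - {author}")

    browser.close()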

Scrapy Framework

Scrapy is a powerful framework for large-scale web scraping projects with built-in support for:

  • Concurrent requests
  • Data pipelines
  • Request and response middleware
  • Exporting to multiple formats

# Sample Scrapy spider (save as quotes_spider.py)
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tags': quote.css('.tag::text').getall(),
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

# Run with: scrapy runspider quotes_spider.py -o quotes.json

Scrapy is ideal for production-grade scraping with its built-in features for handling large-scale data extraction.
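
Spiders can also be launched from a regular Python script instead of the scrapy command line. This is a minimal sketch using CrawlerProcess, assuming QuotesSpider from the example above is defined in (or imported into) the same script; the throttling values are illustrative.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEEDS': {'quotes.json': {'format': 'json'}}, # Export scraped items to JSON
    'DOWNLOAD_DELAY': 1,                          # Pause between requests (illustrative value)
    'AUTOTHROTTLE_ENABLED': True,                 # Adapt request rate to server response times
})
process.crawl(QuotesSpider) # The spider class defined above
process.start()             # Blocks until crawling is finished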

Ethical Considerations

Responsible web scraping involves respecting website owners and their resources:

import requests
import time
from urllib.robotparser import RobotFileParser

# Check robots.txt
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_scrape = rp.can_fetch('*', 'https://example.com/target-page')
print("Allowed to scrape:", can_scrape)

# Polite scraping practices
headers = {
    'User-Agent': 'MyScraper/1.0 (contact@example.com)',
    'Accept-Language': 'en-US,en;q=0.9'
}

def scrape_politely(url, max_retries=3):
    # Respect crawl-delay if specified
    time.sleep(2) # Default delay between requests
    response = requests.get(url, headers=headers)
    if response.status_code == 429 and max_retries > 0: # Too Many Requests
        time.sleep(60) # Wait a minute if rate-limited
        return scrape_politely(url, max_retries - 1)
    return response

# Reuse a session so repeated requests share TCP connections (connection pooling)
session = requests.Session()
session.headers.update(headers)
response = session.get('https://example.com') # Every session.get() sends the polite headers above

Always identify your scraper, respect robots.txt, limit request rate, and consider using official APIs when available.
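
When a site offers an official API, prefer it over scraping HTML: the data arrives already structured and the load is easier for the site to manage. The endpoint below is hypothetical and only illustrates the pattern.

import requests

# Hypothetical endpoint -- check the site's developer documentation for the real one
api_url = 'https://api.example.com/v1/quotes'
response = requests.get(
    api_url,
    headers={'User-Agent': 'MyScraper/1.0 (contact@example.com)'},
    params={'page': 1},
)
response.raise_for_status() # Raise an error for 4xx/5xx responses
data = response.json() # Structured data, no HTML parsing needed
print(data)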
