Web Scraping Basics
Web scraping is the process of extracting data from websites using automated tools. Python offers several libraries for this purpose.
import requests
from bs4 import BeautifulSoup
# Fetch webpage content
url = 'https://example.com'
response = requests.get(url)
print(response.status_code) # 200 means success
# Parse HTML content
soup = BeautifulSoup(response.text, 'html.parser')
# Extract data
title = soup.title.text
headings = [h1.text for h1 in soup.find_all('h1')]
links = [a['href'] for a in soup.find_all('a', href=True)]
print("Page title:", title)
print("First heading:", headings[0])
print("First link:", links[0])
Always check a website's robots.txt file (e.g., https://example.com/robots.txt) and Terms of Service before scraping.
CSS Selectors
BeautifulSoup's select() and select_one() methods accept CSS selectors for more precise element targeting, using the same syntax as document.querySelector in JavaScript.
import requests
from bs4 import BeautifulSoup
url = 'https://quotes.toscrape.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Select elements by CSS class
quotes = soup.select('.quote')
for quote in quotes:
    text = quote.select_one('.text').text
    author = quote.select_one('.author').text
    tags = [tag.text for tag in quote.select('.tag')]
    print(f"{text} - {author}")
    print("Tags:", ', '.join(tags))
    print()
# Select specific elements
next_page = soup.select_one('li.next > a')
if next_page:
    print("Next page URL:", next_page['href'])
CSS selectors provide a powerful way to navigate complex HTML structures with precision.
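The next-page href captured above is typically relative (for example, /page/2/), so it must be resolved against the current URL before it can be fetched. A minimal sketch of walking the pagination this way, assuming the same page structure as above:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def iter_quotes(start_url):
    """Yield quote text from every page, following 'li.next > a' links until none remain."""
    url = start_url
    while url:
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for quote in soup.select('.quote'):
            yield quote.select_one('.text').text
        next_page = soup.select_one('li.next > a')
        # Resolve the (usually relative) href against the page we just fetched
        url = urljoin(url, next_page['href']) if next_page else None

for text in iter_quotes('https://quotes.toscrape.com'):
    print(text)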
Handling Dynamic Content
requests only fetches the raw HTML, so content rendered by JavaScript never appears in the response. For such pages, browser-automation tools like Selenium or Playwright drive a real browser and let you scrape the rendered DOM.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Set up Selenium with Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Load a dynamic page
driver.get('https://quotes.toscrape.com/js/')
# Wait for elements to load (implicit wait)
driver.implicitly_wait(10) # seconds
# Extract data
quotes = driver.find_elements(By.CLASS_NAME, 'quote')
for quote in quotes:
text = quote.find_element(By.CLASS_NAME, 'text').text
author = quote.find_element(By.CLASS_NAME, 'author').text
print(f"{text} - {author}")
# Close the browser
driver.quit()
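When a specific element must be present before scraping continues, an explicit wait is more precise than the implicit wait used above. A brief sketch, which would go in place of the implicitly_wait call (before driver.quit()):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block until at least one quote element appears, or raise TimeoutException after 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'quote'))
)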
Browser automation tools are slower but necessary for scraping modern JavaScript-heavy websites.
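Playwright, mentioned above as an alternative, follows a similar pattern. A minimal sketch of the same scrape, assuming Playwright is installed and its browsers have been downloaded with the playwright install command:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://quotes.toscrape.com/js/')
    page.wait_for_selector('.quote')  # wait for the JavaScript-rendered quotes
    for quote in page.query_selector_all('.quote'):
        text = quote.query_selector('.text').inner_text()
        author = quote.query_selector('.author').inner_text()
        print(f"{text} - {author}")
    browser.close()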
Scrapy Framework
Scrapy is a powerful framework for large-scale web scraping projects with built-in support for:
- Concurrent requests
- Data pipelines
- Middleware support
- Exporting to multiple formats
# Sample Scrapy spider (save as quotes_spider.py)
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tags': quote.css('.tag::text').getall(),
            }
        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
# Run with: scrapy runspider quotes_spider.py -o quotes.json
With its request scheduler, item pipelines, and export options, Scrapy is well suited to production-grade, large-scale extraction jobs.
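Polite-crawling behaviour can also be configured on the spider itself; a small sketch using standard Scrapy settings (the values shown are illustrative):
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']
    # Per-spider settings: honor robots.txt and pause between requests
    custom_settings = {
        'ROBOTSTXT_OBEY': True,
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
    }
These per-spider settings anticipate the ethical points covered next.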
Ethical Considerations
Responsible web scraping involves respecting website owners and their resources:
import requests
import time
from urllib.robotparser import RobotFileParser
# Check robots.txt
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_scrape = rp.can_fetch('*', 'https://example.com/target-page')
print("Allowed to scrape:", can_scrape)
# Polite scraping practices
headers = {
'User-Agent': 'MyScraper/1.0 (contact@example.com)',
'Accept-Language': 'en-US,en;q=0.9'
}
def scrape_politely(url):
    # Honor the crawl-delay from robots.txt if one is set; otherwise wait 2 seconds
    delay = rp.crawl_delay('*') or 2
    time.sleep(delay)
    response = requests.get(url, headers=headers)
    if response.status_code == 429:  # Too Many Requests
        time.sleep(60)  # Back off for a minute if rate-limited
        return scrape_politely(url)
    return response
# Use session for connection pooling
session = requests.Session()
session.headers.update(headers)
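Putting these pieces together, a short usage sketch (the target URL is illustrative):
target = 'https://example.com/target-page'
if rp.can_fetch('*', target):  # only fetch pages that robots.txt allows
    page = scrape_politely(target)
    print(page.status_code)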
Always identify your scraper, respect robots.txt, limit request rate, and consider using official APIs when available.
Python Web Scraping Videos
Master web scraping with these handpicked YouTube tutorials, covering the fundamentals, more advanced techniques, scraping dynamic websites, and practical scraping applications.