March 26, 2026 · 10 min read

Python Web Scraping: Extract Data From Any Website (Ethically)

A practical guide to web scraping with Python — requests, BeautifulSoup, Selenium, handling pagination, and respecting website boundaries.

python web-scraping beautifulsoup selenium tutorial

Web scraping is one of the most useful and most misused skills in programming. Useful because the internet is the largest dataset ever created and most of it isn't available through APIs. Misused because too many tutorials teach you how to extract data without teaching you when you should and shouldn't.

So let's start with the ethics and legality, then get into the code.

Before You Scrape: The Rules

  • Check for an API first. Many websites that people scrape — Twitter, Reddit, Wikipedia, government databases — have official APIs. APIs are more reliable, faster, and explicitly permitted. Scraping a site that offers an API is usually the wrong approach.
  • Read robots.txt. Every major website has a robots.txt file (e.g., example.com/robots.txt) that specifies which pages crawlers are allowed to access. Respecting robots.txt isn't legally required everywhere, but it's the ethical standard.
  • Read the Terms of Service. Some websites explicitly prohibit scraping. Violating ToS can have legal consequences, particularly for commercial use.
  • Don't overload servers. Hitting a website with hundreds of requests per second can degrade service for real users. Always add delays between requests. If you're scraping a small site, be especially careful — it has less capacity to absorb your traffic.
  • Personal and research use is generally fine. Scraping public data for personal analysis, academic research, or journalism is widely considered acceptable. Scraping for commercial gain or to replicate a service is where legal risk increases.

With that covered, let's build.

The Setup

pip install requests beautifulsoup4 lxml
requests makes HTTP requests. beautifulsoup4 parses HTML. lxml is a fast HTML/XML parser that BeautifulSoup can use as a backend.

Your First Scraper: Requests + BeautifulSoup

import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "https://news.ycombinator.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (educational scraping project)"
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # Raise exception for HTTP errors

# Parse the HTML
soup = BeautifulSoup(response.text, "lxml")

# Extract data
articles = []
for item in soup.select(".titleline > a"):
    title = item.get_text()
    link = item.get("href", "")
    articles.append({"title": title, "link": link})

# Print results
for article in articles[:10]:
    print(f"Title: {article['title']}")
    print(f"Link:  {article['link']}")
    print()

The headers dictionary includes a User-Agent string. Some websites block requests that don't include one, and it's polite to identify your scraper. For a real project, include contact information so the site owner can reach you if there's an issue.
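For repeated requests to the same site, a requests.Session is a better fit than bare requests.get: it reuses connections, keeps cookies, and sends your identifying headers on every call. A minimal sketch — the User-Agent string and contact address below are placeholders, not a convention the library enforces:

```python
import requests

# A Session reuses TCP connections and sends these headers with
# every request made through it. Replace the contact address with
# one a site owner can actually reach you at.
session = requests.Session()
session.headers.update({
    "User-Agent": "my-research-scraper/1.0 (contact: you@example.com)"
})

# session.get(url) now behaves like requests.get(url, headers=...)
```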

CSS Selectors — The Key Skill

BeautifulSoup supports CSS selectors through its select() and select_one() methods, which are usually the most concise way to find elements. If you know CSS, you already know how to navigate HTML for scraping:

# Find by tag
soup.select("h1")                    # All h1 elements
soup.select("p")                     # All paragraphs

# Find by class
soup.select(".product-card")         # Class name
soup.select("div.product-card")      # Div with class

# Find by ID
soup.select("#main-content")         # Element with id="main-content"

# Nested selectors
soup.select("div.products > .item")  # Direct children
soup.select("table tr td")           # Nested descendants

# Attribute selectors
soup.select("a[href]")              # All links with href
soup.select('a[href*="product"]')   # Links containing "product" in href
soup.select('img[src$=".jpg"]')     # Images ending in .jpg

# Multiple selectors
soup.select("h1, h2, h3")           # All h1, h2, and h3 elements

# Nth child
soup.select("tr:nth-child(even)")    # Even table rows

Finding the right selector is often the hardest part of scraping. Use your browser's DevTools: right-click an element, select "Inspect," and examine the HTML structure. Chrome's "Copy > Copy selector" feature gives you a CSS selector for any element, though generated selectors are often more specific than necessary.
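You can practice selectors without hitting a live site by parsing an inline HTML string. A small sketch (the fragment and class names here are invented for illustration; "html.parser" is the stdlib backend, so it works even without lxml installed):

```python
from bs4 import BeautifulSoup

# A tiny HTML fragment to practice selectors on.
html = """
<div class="products">
  <div class="item"><a href="/product/1">Widget</a></div>
  <div class="item"><a href="/product/2">Gadget</a></div>
</div>
<p id="note">Free shipping on orders over $50.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct-child selector: anchors inside .item divs under .products
names = [a.get_text() for a in soup.select("div.products > .item a")]

# ID selector
note = soup.select_one("#note").get_text()

# Attribute-substring selector
links = [a["href"] for a in soup.select('a[href*="product"]')]
```

Running this gives names == ["Widget", "Gadget"] and links == ["/product/1", "/product/2"] — a quick way to verify a selector before pointing it at a real page.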

Extracting Data From Real Pages

A practical example — scraping product information from a product listing page:

import requests
from bs4 import BeautifulSoup
import json
import re
import time

def scrape_product_page(url):
    """Scrape product details from a single page."""
    response = requests.get(url, headers={"User-Agent": "Educational scraper"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    products = []
    for card in soup.select(".product-card"):
        name_el = card.select_one(".product-name")
        price_el = card.select_one(".product-price")
        rating_el = card.select_one(".rating-value")
        image_el = card.select_one("img.product-image")

        product = {
            "name": name_el.get_text(strip=True) if name_el else None,
            "price": parse_price(price_el.get_text(strip=True)) if price_el else None,
            "rating": float(rating_el.get_text(strip=True)) if rating_el else None,
            "image_url": image_el.get("src") if image_el else None,
            "url": card.select_one("a").get("href") if card.select_one("a") else None,
        }
        products.append(product)

    return products

def parse_price(price_str):
    """Convert '$19.99' to 19.99"""
    match = re.search(r'[\d,]+\.?\d*', price_str.replace(",", ""))
    return float(match.group()) if match else None

Always use get_text(strip=True) to remove whitespace. Always check if elements exist before accessing them — websites change their HTML without warning, and a missing element shouldn't crash your entire scraper.
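That existence-check pattern repeats so often that it's worth wrapping in a tiny helper. A sketch — safe_text is a hypothetical name, not part of BeautifulSoup's API:

```python
from bs4 import BeautifulSoup

def safe_text(parent, selector, default=None):
    """Return stripped text of the first element matching selector,
    or default if the element is missing."""
    el = parent.select_one(selector)
    return el.get_text(strip=True) if el else default

# The selector that matches returns clean text; the one that
# doesn't returns the default instead of raising AttributeError.
soup = BeautifulSoup('<div><span class="name">  Widget  </span></div>', "html.parser")
name = safe_text(soup, ".name")      # "Widget"
missing = safe_text(soup, ".price")  # None
```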

Handling Pagination

Most scrapers need to follow pagination — either page numbers, "next" buttons, or infinite scroll:

def scrape_all_pages(base_url, max_pages=50):
    """Scrape multiple pages with rate limiting."""
    all_products = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        print(f"Scraping page {page}...")

        try:
            products = scrape_product_page(url)
        except requests.HTTPError as e:
            if e.response.status_code == 404:
                print(f"Page {page} not found — reached the end.")
                break
            raise

        if not products:
            print("No products found — reached the end.")
            break

        all_products.extend(products)

        # Rate limiting — be respectful
        time.sleep(2)  # Wait 2 seconds between requests

    return all_products

# For "next page" style pagination def scrape_with_next_link(start_url): """Follow 'Next' links until there are no more.""" all_items = [] url = start_url

while url:
print(f"Scraping: {url}")
response = requests.get(url, headers={"User-Agent": "Educational scraper"})
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")

# Extract items from current page items = extract_items(soup) all_items.extend(items) # Find the next page link next_link = soup.select_one("a.next-page") if next_link and next_link.get("href"): url = next_link["href"] if not url.startswith("http"): url = f"https://example.com{url}" # Handle relative URLs else: url = None # No more pages

time.sleep(2)

return all_items

The time.sleep(2) between requests is crucial. Without it, you're hammering the server with requests as fast as your connection allows. Two seconds is a reasonable default; for smaller sites, use longer delays.
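A fixed delay produces requests on a perfectly regular schedule, which is itself a recognizable scraper signature. Adding a little random jitter is a common refinement — a sketch, with polite_sleep as a hypothetical helper name:

```python
import random
import time

def polite_sleep(base=2.0, jitter=1.0):
    """Sleep for base seconds plus a random extra amount, so requests
    don't land at perfectly regular intervals."""
    time.sleep(base + random.uniform(0, jitter))
```

Call polite_sleep() wherever this article uses time.sleep(2); each pause then lands somewhere between 2 and 3 seconds.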

Respecting robots.txt

from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def can_scrape(url):
    """Check if scraping this URL is allowed by robots.txt."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except Exception:
        return True  # If we can't read robots.txt, proceed cautiously

    return parser.can_fetch("*", url)

# Use it before scraping
url = "https://example.com/products"
if can_scrape(url):
    products = scrape_product_page(url)
else:
    print(f"Scraping not allowed for {url}")

Selenium for JavaScript-Heavy Sites

Some websites render content with JavaScript. When you fetch the page with requests, you get the raw HTML before JavaScript runs — meaning the data you want might not be there. Selenium solves this by driving a real browser:

pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

def scrape_dynamic_page(url):
    """Scrape a JavaScript-rendered page using Selenium."""
    # Run Chrome in headless mode (no visible window)
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # Wait for content to load (up to 10 seconds)
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card"))
        )

        # Now parse with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, "lxml")
        return extract_products(soup)

    finally:
        driver.quit()  # Always close the browser

Handling infinite scroll — where new content loads as you scroll down:

def scrape_infinite_scroll(url, scroll_count=10):
    """Scroll down to load more content, then extract data."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")

        for i in range(scroll_count):
            # Scroll to bottom
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

            # Wait for new content to load
            time.sleep(3)

            # Check if we've reached the bottom
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                print(f"Reached the bottom after {i + 1} scrolls.")
                break
            last_height = new_height

        soup = BeautifulSoup(driver.page_source, "lxml")
        return extract_items(soup)

    finally:
        driver.quit()

Selenium is slower and more resource-intensive than requests. Only use it when the data you need is rendered by JavaScript. A good test: view the page source (Ctrl+U) in your browser. If the data is there, use requests. If it's not, use Selenium.
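The same view-source test can be automated: fetch the page once with requests and check whether your selector finds anything in the raw HTML. A sketch — rendered_statically is a hypothetical helper, not a library function:

```python
from bs4 import BeautifulSoup

def rendered_statically(html, selector):
    """Return True if the elements you want already exist in the raw
    HTML (requests + BeautifulSoup is enough); False suggests the
    content is rendered client-side and Selenium may be needed."""
    return BeautifulSoup(html, "html.parser").select_one(selector) is not None

# Server-rendered page: the product cards are in the raw HTML.
static_ok = rendered_statically('<div class="product-card">...</div>', ".product-card")

# JS-rendered page: the raw HTML only contains an empty app shell.
needs_js = not rendered_statically('<div id="app"></div>', ".product-card")
```

In practice you'd pass response.text from a requests call; here inline strings stand in for the two cases.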

Saving Data

import csv
import json

def save_to_csv(data, filename):
    """Save list of dicts to CSV."""
    if not data:
        return

    fieldnames = data[0].keys()
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} records to {filename}")

def save_to_json(data, filename):
    """Save data to JSON with pretty printing."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")

# Usage
products = scrape_all_pages("https://example.com/products")
save_to_csv(products, "products.csv")
save_to_json(products, "products.json")

For larger datasets, consider saving to a SQLite database:

import sqlite3

def save_to_sqlite(data, db_name, table_name):
    """Save list of dicts to SQLite."""
    if not data:
        return

    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    # Create table from first record's keys
    columns = data[0].keys()
    col_defs = ", ".join(f'"{col}" TEXT' for col in columns)
    cursor.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({col_defs})')

    # Insert data
    placeholders = ", ".join("?" for _ in columns)
    col_names = ", ".join(f'"{col}"' for col in columns)
    for record in data:
        values = [str(record.get(col, "")) for col in columns]
        cursor.execute(
            f'INSERT INTO "{table_name}" ({col_names}) VALUES ({placeholders})',
            values
        )

    conn.commit()
    conn.close()
    print(f"Saved {len(data)} records to {db_name}")

A Complete Example: Scraping Book Data

Here's a full scraper that puts all the concepts together, scraping from a practice site designed for scraping (books.toscrape.com):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
import json
import re

BASE_URL = "https://books.toscrape.com"

def scrape_book_list(page_url):
    """Extract book summaries from a listing page."""
    response = requests.get(page_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    books = []
    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price_text = article.select_one(".price_color").get_text(strip=True)
        price = float(re.search(r'[\d.]+', price_text).group())
        rating_class = article.select_one(".star-rating")["class"][1]
        rating_map = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
        rating = rating_map.get(rating_class, 0)
        # Resolve the relative link against the page it appeared on
        detail_url = urljoin(page_url, article.select_one("h3 a")["href"])

        books.append({
            "title": title,
            "price": price,
            "rating": rating,
            "detail_url": detail_url
        })

    # Check for next page
    next_btn = soup.select_one("li.next a")
    next_url = urljoin(page_url, next_btn["href"]) if next_btn else None

    return books, next_url

def scrape_all_books(max_pages=5):
    """Scrape books across multiple pages."""
    all_books = []
    url = BASE_URL

    page = 1
    while url and page <= max_pages:
        print(f"Page {page}: {url}")
        books, next_url = scrape_book_list(url)
        all_books.extend(books)
        url = next_url
        page += 1
        time.sleep(1)

    print(f"\nTotal books scraped: {len(all_books)}")
    return all_books

if __name__ == "__main__":
    books = scrape_all_books(max_pages=5)

    # Analysis
    avg_price = sum(b["price"] for b in books) / len(books)
    five_star = [b for b in books if b["rating"] == 5]
    cheapest = min(books, key=lambda b: b["price"])

    print(f"Average price: ${avg_price:.2f}")
    print(f"Five-star books: {len(five_star)}")
    print(f"Cheapest book: {cheapest['title']} (${cheapest['price']})")

    # Save
    with open("books.json", "w") as f:
        json.dump(books, f, indent=2)

Common Pitfalls

  • Getting blocked: Websites detect scrapers by request patterns (too fast, no User-Agent, no cookies). Rotate User-Agent strings, add delays, and consider using sessions that maintain cookies.
  • Fragile selectors: If you rely on deeply nested CSS selectors like div.container > div:nth-child(3) > span.text, your scraper breaks the moment the site updates its layout. Use the most stable identifiers available — IDs, data attributes, semantic class names.
  • Character encoding: Always specify encoding. Set response.encoding = 'utf-8' explicitly, or let requests detect it via response.text (which uses the encoding declared in the HTTP headers).
  • Relative URLs: Links on pages are often relative (/products/123 instead of https://example.com/products/123). Use urllib.parse.urljoin to resolve them properly.
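urljoin is from the standard library and handles every flavor of relative URL correctly, given the URL of the page the link came from. The URLs below are made-up examples:

```python
from urllib.parse import urljoin

page = "https://example.com/products/page-2.html"

absolute = urljoin(page, "https://other.com/x")  # absolute URLs pass through unchanged
rooted = urljoin(page, "/products/123")          # root-relative: resolved against the host
sibling = urljoin(page, "item-7.html")           # relative: resolved against the current directory
parent = urljoin(page, "../index.html")          # ../ climbs out of /products/
```

This replaces brittle string surgery like url.lstrip('../') or manually prepending the domain.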

Beyond BeautifulSoup

For more demanding scraping projects, consider these tools:

  • Scrapy — a full scraping framework with built-in concurrency, pipeline processing, and middleware. Overkill for simple scripts, ideal for large-scale data collection.
  • Playwright — a modern alternative to Selenium, faster and with better API design. Supports Chromium, Firefox, and WebKit.
  • httpx — an async HTTP client that can be faster than requests for concurrent scraping.

Web scraping is a skill that combines HTTP knowledge, HTML parsing, and problem-solving. The more you practice reading HTML structures and writing selectors, the faster you get. For sharpening those programming and pattern-matching skills, try the challenges on CodeUp — the logical thinking transfers directly to writing better scrapers.