 
This repository turns a full-length article into a practical, GitHub-ready guide.
You’ll scrape the r/programming subreddit using Requests and BeautifulSoup, collect post titles, and analyze which programming languages appear most often.
The tutorial targets old Reddit (https://old.reddit.com), a static HTML interface that's easier to scrape because it doesn't require JavaScript.
⚠️ Always check a website’s robots.txt and Terms of Service before scraping.
Respect rate limits, add delays, and use a unique User-Agent.
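If you want to check programmatically, Python's built-in urllib.robotparser can read a site's robots.txt. Here is a quick sketch using the User-Agent from this tutorial:

```python
from urllib.robotparser import RobotFileParser

# Download and parse old Reddit's robots.txt
robots = RobotFileParser("https://old.reddit.com/robots.txt")
robots.read()

# Ask whether our scraper's User-Agent may fetch the subreddit page
allowed = robots.can_fetch("Learning Python Web Scraping",
                           "https://old.reddit.com/r/programming/")
print("Allowed to scrape r/programming:", allowed)
```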
Web scraping means using code to:
- Fetch the HTML of a webpage, and
- Extract useful data from it.
Most sites can be scraped with:
- requests – downloads the HTML
- beautifulsoup4 – parses and navigates HTML
For pages that render data dynamically via JavaScript, you’ll need Playwright or Selenium.
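This guide sticks to requests and BeautifulSoup, but for comparison, a headless-browser fetch with Playwright might look roughly like this (a sketch only; the URL is a placeholder, and Playwright needs a one-time `playwright install` after pip-installing it):

```python
from playwright.sync_api import sync_playwright

# Sketch: render a JavaScript-heavy page in a headless browser, then grab its HTML
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/")  # placeholder URL
    html = page.content()              # HTML after JavaScript has run
    browser.close()

print(html[:200])
```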
Common use cases:
- Market & price tracking
- Research & analytics
- Trend or keyword monitoring
Python’s ecosystem is the go-to choice for scraping in 2025 because it’s simple, powerful, and well-supported.
Popular libraries include:
| Library | Purpose | 
|---|---|
| requests | Fetch HTML from websites | 
| beautifulsoup4 | Parse and navigate HTML trees | 
| scrapy | Advanced framework for large projects | 
| playwright | Headless browser automation | 
You’ll need Python 3.9+.
```bash
pip install requests beautifulsoup4
# or
pip install -r requirements.txt
```

Create a file src/scraper.py and follow the examples below.
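If you go the requirements.txt route, a minimal file might look like this (the version pins are suggestions, not taken from the original article):

```text
requests>=2.31
beautifulsoup4>=4.12
```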
```python
import requests
from bs4 import BeautifulSoup

# Fetch the subreddit's front page with a descriptive User-Agent
page = requests.get(
    "https://old.reddit.com/r/programming/",
    headers={'User-agent': 'Learning Python Web Scraping'}
)
html = page.content

# Parse the HTML and pull the text of each post title (an <a> inside <p class="title">)
soup = BeautifulSoup(html, "html.parser")
p_tags = soup.find_all("p", "title")
titles = [p.find("a").get_text() for p in p_tags]
print(titles)
```

At this point, you'll see the post titles from the first page of r/programming.
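If Reddit rate-limits or blocks the request (see the warning above), the response won't contain post titles, so it can help to check the status before parsing. A small defensive sketch, reusing the page object from the snippet above:

```python
# Stop early if Reddit didn't return a normal page (for example 429 Too Many Requests)
if page.status_code != 200:
    raise SystemExit(f"Request failed with status {page.status_code}")
```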
Old Reddit includes a “Next” button with `<span class="next-button">`.
We can loop through multiple pages safely:
```python
import requests
from bs4 import BeautifulSoup
import time

post_titles = []
next_page = "https://old.reddit.com/r/programming/"

# Scrape up to 20 pages, following the "next" button each time
for _ in range(20):
    page = requests.get(next_page,
                        headers={'User-agent': 'Sorry, learning Python!'})
    html = page.content
    soup = BeautifulSoup(html, "html.parser")

    p_tags = soup.find_all("p", "title")
    titles = [p.find("a").get_text() for p in p_tags]
    post_titles += titles

    # Stop early if there is no "next" button (we reached the last page)
    next_button = soup.find("span", "next-button")
    if next_button is None:
        break
    next_page = next_button.find("a")['href']

    # Wait a few seconds between requests to stay polite
    time.sleep(3)

print(post_titles)
```

Let’s count mentions of popular programming languages:
```python
# Languages to look for; keys are lowercase so matching is case-insensitive
language_counter = {
    "javascript": 0, "html": 0, "css": 0, "sql": 0, "python": 0, "typescript": 0,
    "java": 0, "c#": 0, "c++": 0, "php": 0, "c": 0, "powershell": 0,
    "go": 0, "rust": 0, "kotlin": 0, "dart": 0, "ruby": 0
}

# Split every title into lowercase words, then count exact matches
words = []
for title in post_titles:
    words += [word.lower() for word in title.split()]

for word in words:
    if word in language_counter:
        language_counter[word] += 1

print(language_counter)
```

Example output:

```
{'javascript': 20, 'html': 6, 'css': 10, 'sql': 0, 'python': 26, 'typescript': 1,
 'java': 10, 'c#': 5, 'c++': 10, 'php': 1, 'c': 10, 'powershell': 0,
 'go': 5, 'rust': 7, 'kotlin': 3, 'dart': 0, 'ruby': 1}
```
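To see the ranking at a glance, you can sort the counts. A short follow-up sketch using the language_counter dict from above:

```python
# Sort languages by number of mentions, most frequent first
for language, count in sorted(language_counter.items(),
                              key=lambda item: item[1], reverse=True):
    print(f"{language}: {count}")
```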
To avoid rate limits or bans, route requests through a proxy provider such as IPRoyal.
```python
# Route both HTTP and HTTPS traffic through the proxy endpoint
PROXIES = {
    "http":  "http://youruser:[email protected]:22323",
    "https": "http://youruser:[email protected]:22323",
}

page = requests.get(next_page,
                    headers={'User-agent': 'Just learning Python, sorry!'},
                    proxies=PROXIES)
```

Proxies allow rotation between IPs, making traffic look more natural and reducing blocks.
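One simple way to rotate is to pick a random endpoint from a pool on each request. A sketch follows; the hostnames are placeholders rather than real IPRoyal endpoints, and next_page comes from the pagination loop above:

```python
import random

# Placeholder endpoints; substitute the ones from your proxy provider's dashboard
PROXY_POOL = [
    "http://youruser:[email protected]:22323",
    "http://youruser:[email protected]:22323",
]

def random_proxies():
    """Pick one endpoint at random and use it for both HTTP and HTTPS."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}

page = requests.get(next_page,
                    headers={'User-agent': 'Just learning Python, sorry!'},
                    proxies=random_proxies())
```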
You now know how to:
- Fetch and parse HTML with Requests and BeautifulSoup
- Scrape multiple pages safely
- Count language mentions from Reddit titles
- Optionally add proxy support for stability
For larger or dynamic projects, explore Scrapy or Playwright.
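For a taste of the Scrapy route, here is a rough sketch of a spider that collects the same titles; the CSS selectors mirror the p.title and span.next-button elements used above, but treat it as a starting point rather than a tested implementation:

```python
import scrapy

class ProgrammingTitlesSpider(scrapy.Spider):
    name = "programming_titles"
    allowed_domains = ["old.reddit.com"]
    start_urls = ["https://old.reddit.com/r/programming/"]
    custom_settings = {
        "USER_AGENT": "Learning Python Web Scraping",
        "DOWNLOAD_DELAY": 3,  # same polite delay as the loop above
    }

    def parse(self, response):
        # Yield one item per post title (an <a> inside <p class="title">)
        for title in response.css("p.title a::text").getall():
            yield {"title": title}

        # Follow the "next" button until there are no more pages
        next_href = response.css("span.next-button a::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)
```

Run it with `scrapy runspider spider.py -o titles.json` to collect the titles as JSON.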