
Commit 86ae97c

1 parent 866437e commit 86ae97c

File tree

5 files changed: +410 −40 lines


1_web_scraper.ipynb

Lines changed: 123 additions & 18 deletions
@@ -5,15 +5,56 @@
  "id": "ed05b0b7",
  "metadata": {},
  "source": [
-  "# Web Scraper"
+  "# Part 1 Web Scraper\n",
+  "In this notebook you will learn how to build a **polite recursive web crawler** using `requests` and `BeautifulSoup4`.\n",
+  "\n",
+  "**What you will learn**\n",
+  "- How to scrape multiple related websites safely\n",
+  "- How a **recursive crawler** (BFS-style) works\n",
+  "- How to respect website rules (rate limiting, domain restriction)\n",
+  "- How to clean HTML into readable text\n",
+  "- How to save data for later use (e.g., RAG systems or chatbots)\n",
+  "\n",
+  "**Target websites** \n",
+  "- Innovation Wing: https://innowings.engg.hku.hk/ \n",
+  "- InnoAcademy: https://innoacademy.engg.hku.hk/\n",
+  "\n",
+  "By the end you will have a JSON dataset that can be used to power AI applications!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f44502fa",
+ "metadata": {},
+ "source": [
+  "## 1.1. Introduction to Web Scraping\n",
+  "\n",
+  "Web scraping is the process of automatically downloading and extracting information from websites.\n",
+  "\n",
+  "**Why do we need a crawler?** \n",
+  "Instead of scraping one page manually, a **recursive crawler** follows links and discovers new pages automatically — just like a search engine bot.\n",
+  "\n",
+  "**Important rules we follow**\n",
+  "- Only scrape pages we are allowed to (domain restriction)\n",
+  "- Be polite: wait between requests (`DELAY`)\n",
+  "- Use a realistic User-Agent so servers know it’s a student project\n",
+  "- Stop after a safe limit (`MAX_PAGES`)\n",
+  "\n",
+  "This notebook is designed for the **Chatbot Challenge** — the scraped data will become the knowledge base for an intelligent assistant."
  ]
  },
  {
  "cell_type": "markdown",
  "id": "08dfbce5",
  "metadata": {},
  "source": [
-  "### Install and import libraries"
+  "## 1.2. Install and Import Libraries\n",
+  "\n",
+  "We use two main libraries:\n",
+  "- **`requests`** → downloads raw HTML from the internet\n",
+  "- **`beautifulsoup4` (bs4)** → parses HTML and lets us extract text and links easily\n",
+  "\n",
+  "We also import helper modules for URL handling, JSON saving, and timing."
  ]
  },
  {
@@ -95,9 +136,21 @@
  "id": "093e1060",
  "metadata": {},
  "source": [
-  "### Define constraits\n",
+  "## 1.3. Define Constraints & Configuration\n",
+  "\n",
+  "Before we start crawling, we need to set clear **rules** so the crawler stays safe and focused.\n",
  "\n",
-  "We will deploy a recursive crawler, that is, it not only scrapes the page given, it also recursively clicks links in the page and scrape them."
+  "### Seed URLs\n",
+  "These are the starting points of our crawl.\n",
+  "\n",
+  "### Allowed Domains\n",
+  "We **only** allow links from the two official HKU Innovation Wing domains. \n",
+  "This prevents the crawler from accidentally wandering into the entire internet!\n",
+  "\n",
+  "### Other Safety Settings\n",
+  "- `MAX_PAGES`: stops after 150 pages (safety limit)\n",
+  "- `DELAY`: waits 1 second between requests (polite crawling)\n",
+  "- `HEADERS`: defines the browser type used"
  ]
  },
  {
@@ -172,7 +225,15 @@
  "id": "d38e6c3f",
  "metadata": {},
  "source": [
-  "### Define helper functions"
+  "## 1.4. Helper Functions\n",
+  "\n",
+  "We create three clean, reusable functions:\n",
+  "\n",
+  "1. **`is_internal_link()`** – checks whether a link belongs to our allowed domains\n",
+  "2. **`simple_clean_text()`** – removes scripts, styles, and extra whitespace (i.e. the website code that is not related to content)\n",
+  "3. **`scrape_page()`** – downloads one page and returns clean text\n",
+  "\n",
+  "**Note**: There was a duplicate `is_internal_link` definition earlier — we only need it once!"
  ]
  },
  {
@@ -210,6 +271,24 @@
  " return {\"url\": url, \"text\": f\"[ERROR: {e}]\"}"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "id": "07d31190",
+ "metadata": {},
+ "source": [
+  "## 1.5. How the Recursive Crawler Works\n",
+  "\n",
+  "We use a **queue** (Breadth-First Search) to explore pages:\n",
+  "\n",
+  "1. Start with the two seed URLs\n",
+  "2. Scrape the current page\n",
+  "3. Extract all links on that page\n",
+  "4. Add any new internal links to the queue\n",
+  "5. Repeat until we reach the maximum number of pages\n",
+  "\n",
+  "This is exactly how real web crawlers (like Googlebot) explore the web!"
+ ]
+ },
  {
  "cell_type": "markdown",
  "id": "99d64747",
@@ -220,7 +299,7 @@
  },
  {
  "cell_type": "code",
- "execution_count": 7,
+ "execution_count": null,
  "id": "4f0b296b",
  "metadata": {},
  "outputs": [
@@ -408,7 +487,6 @@
  " \n",
  " visited.add(url)\n",
  "\n",
-  " # Extract links (very basic)\n",
  " try:\n",
  " resp = requests.get(url, headers=HEADERS, timeout=10)\n",
  " soup = BeautifulSoup(resp.text, \"html.parser\")\n",
@@ -429,7 +507,11 @@
  "id": "a86cd964",
  "metadata": {},
  "source": [
-  "### Save the data to a file"
+  "## 1.6. Saving the Collected Data\n",
+  "\n",
+  "We save everything as `data.json` — a clean, machine-readable format.\n",
+  "\n",
+  "This file can later be loaded into a vector database for Retrieval-Augmented Generation (RAG) or a chatbot."
  ]
  },
  {
@@ -461,16 +543,17 @@
  "id": "2d1e4cff",
  "metadata": {},
  "source": [
-  "### Data Quality Check (very important!)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b2b053b5",
- "metadata": {},
- "source": [
-  "The scraped contents are stored in the variable documents, which is saved in data.json. <br>\n",
-  "👀Let's take a look at the data structure."
+  "## 1.7. Data Quality Check (Very Important!)\n",
+  "\n",
+  "Now let’s inspect what we actually collected.\n",
+  "\n",
+  "### Questions to ask:\n",
+  "- Is the text clean and readable?\n",
+  "- Are there any noisy elements (menus, footers, JavaScript leftovers)?\n",
+  "- Does every page have meaningful content?\n",
+  "- Are there any error messages?\n",
+  "\n",
+  "**Pro tip**: Always look at the first few and a few random pages before using the data!"
  ]
  },
  {
@@ -533,6 +616,28 @@
  "source": [
  "print(documents[0][\"text\"])"
  ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "46048f1f",
+ "metadata": {},
+ "source": [
+  "## 1.8. Conclusion & Next Steps\n",
+  "\n",
+  "**Congratulations!** 🎉 \n",
+  "You have successfully built a polite recursive web scraper and collected data about HKU Innovation Wing & Innovation Academy.\n",
+  "\n",
+  "### What you learned\n",
+  "- How to set up a controlled crawler\n",
+  "- Domain restriction and politeness rules\n",
+  "- HTML cleaning with BeautifulSoup\n",
+  "- Saving data for AI applications\n",
+  "\n",
+  "### Possible next steps\n",
+  "1. Improve text cleaning (remove non-context texts)\n",
+  "2. Use website metadata (e.g. headings) to further enhance the dataset\n",
+  "3. Add data from other sources"
+ ]
  }
  ],
  "metadata": {
