|
5 | 5 | "id": "ed05b0b7", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# Web Scraper" |
|  | 8 | + "# Part 1: Web Scraper\n", |
| 9 | + "In this notebook you will learn how to build a **polite recursive web crawler** using `requests` and `BeautifulSoup4`.\n", |
| 10 | + "\n", |
| 11 | + "**What you will learn**\n", |
| 12 | + "- How to scrape multiple related websites safely\n", |
| 13 | + "- How a **recursive crawler** (BFS-style) works\n", |
| 14 | + "- How to respect website rules (rate limiting, domain restriction)\n", |
| 15 | + "- How to clean HTML into readable text\n", |
| 16 | + "- How to save data for later use (e.g., RAG systems or chatbots)\n", |
| 17 | + "\n", |
| 18 | + "**Target websites** \n", |
| 19 | + "- Innovation Wing: https://innowings.engg.hku.hk/ \n", |
| 20 | + "- InnoAcademy: https://innoacademy.engg.hku.hk/\n", |
| 21 | + "\n", |
| 22 | + "By the end you will have a JSON dataset that can be used to power AI applications!" |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "markdown", |
| 27 | + "id": "f44502fa", |
| 28 | + "metadata": {}, |
| 29 | + "source": [ |
| 30 | + "## 1.1. Introduction to Web Scraping\n", |
| 31 | + "\n", |
| 32 | + "Web scraping is the process of automatically downloading and extracting information from websites.\n", |
| 33 | + "\n", |
| 34 | + "**Why do we need a crawler?** \n", |
| 35 | + "Instead of scraping one page manually, a **recursive crawler** follows links and discovers new pages automatically — just like a search engine bot.\n", |
| 36 | + "\n", |
| 37 | + "**Important rules we follow**\n", |
| 38 | + "- Only scrape pages we are allowed to (domain restriction)\n", |
| 39 | + "- Be polite: wait between requests (`DELAY`)\n", |
|  | 40 | + "- Send a descriptive User-Agent so servers know it’s a student project\n", |
| 41 | + "- Stop after a safe limit (`MAX_PAGES`)\n", |
| 42 | + "\n", |
| 43 | + "This notebook is designed for the **Chatbot Challenge** — the scraped data will become the knowledge base for an intelligent assistant." |
9 | 44 | ] |
10 | 45 | }, |
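The politeness rules listed above can be sketched as a small rate-limiting helper. This is an illustrative sketch, not the notebook's actual code; the `seconds_to_wait` name and signature are assumptions.

```python
DELAY = 1.0  # minimum number of seconds between two requests (polite crawling)

def seconds_to_wait(last_request_time, now, delay=DELAY):
    """Return how long to sleep so that consecutive requests
    are at least `delay` seconds apart."""
    elapsed = now - last_request_time
    return max(0.0, delay - elapsed)
```

In a crawl loop this would be used as `time.sleep(seconds_to_wait(last, time.time()))` just before each `requests.get` call.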
11 | 46 | { |
12 | 47 | "cell_type": "markdown", |
13 | 48 | "id": "08dfbce5", |
14 | 49 | "metadata": {}, |
15 | 50 | "source": [ |
16 | | - "### Install and import libraries" |
| 51 | + "## 1.2. Install and Import Libraries\n", |
| 52 | + "\n", |
| 53 | + "We use two main libraries:\n", |
| 54 | + "- **`requests`** → downloads raw HTML from the internet\n", |
| 55 | + "- **`beautifulsoup4` (bs4)** → parses HTML and lets us extract text and links easily\n", |
| 56 | + "\n", |
| 57 | + "We also import helper modules for URL handling, JSON saving, and timing." |
17 | 58 | ] |
18 | 59 | }, |
19 | 60 | { |
|
95 | 136 | "id": "093e1060", |
96 | 137 | "metadata": {}, |
97 | 138 | "source": [ |
98 | | - "### Define constraits\n", |
| 139 | + "## 1.3. Define Constraints & Configuration\n", |
| 140 | + "\n", |
| 141 | + "Before we start crawling, we need to set clear **rules** so the crawler stays safe and focused.\n", |
99 | 142 | "\n", |
100 | | - "We will deploy a recursive crawler, that is, it not only scrapes the page given, it also recursively clicks links in the page and scrape them." |
| 143 | + "### Seed URLs\n", |
| 144 | + "These are the starting points of our crawl.\n", |
| 145 | + "\n", |
| 146 | + "### Allowed Domains\n", |
| 147 | + "We **only** allow links from the two official HKU Innovation Wing domains. \n", |
| 148 | + "This prevents the crawler from accidentally wandering into the entire internet!\n", |
| 149 | + "\n", |
| 150 | + "### Other Safety Settings\n", |
| 151 | + "- `MAX_PAGES`: stops after 150 pages (safety limit)\n", |
| 152 | + "- `DELAY`: waits 1 second between requests (polite crawling)\n", |
|  | 153 | + "- `HEADERS`: the User-Agent header sent with every request" |
101 | 154 | ] |
102 | 155 | }, |
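One possible shape for this configuration cell is sketched below. The variable names (`SEED_URLS`, `ALLOWED_DOMAINS`, etc.) and the User-Agent string are assumptions; only the two domains, the 150-page limit, and the 1-second delay come from the text above.

```python
# Configuration sketch -- names are assumptions; the notebook's own cell may differ.
SEED_URLS = [
    "https://innowings.engg.hku.hk/",
    "https://innoacademy.engg.hku.hk/",
]
ALLOWED_DOMAINS = {"innowings.engg.hku.hk", "innoacademy.engg.hku.hk"}
MAX_PAGES = 150  # safety limit: stop after this many pages
DELAY = 1        # seconds to wait between requests (polite crawling)
HEADERS = {
    # Hypothetical descriptive User-Agent identifying the project
    "User-Agent": "Mozilla/5.0 (compatible; HKU-student-project)",
}
```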
103 | 156 | { |
|
172 | 225 | "id": "d38e6c3f", |
173 | 226 | "metadata": {}, |
174 | 227 | "source": [ |
175 | | - "### Define helper functions" |
| 228 | + "## 1.4. Helper Functions\n", |
| 229 | + "\n", |
| 230 | + "We create three clean, reusable functions:\n", |
| 231 | + "\n", |
| 232 | + "1. **`is_internal_link()`** – checks whether a link belongs to our allowed domains\n", |
|  | 233 | + "2. **`simple_clean_text()`** – removes scripts, styles, and extra whitespace (i.e. markup that is not part of the readable content)\n", |
| 234 | + "3. **`scrape_page()`** – downloads one page and returns clean text\n", |
| 235 | + "\n", |
| 236 | + "**Note**: There was a duplicate `is_internal_link` definition earlier — we only need it once!" |
176 | 237 | ] |
177 | 238 | }, |
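A minimal sketch of the first two helpers, assuming the `ALLOWED_DOMAINS` set from the configuration section. The notebook's real cell may differ in details; in particular, it probably resolves relative links with `urljoin` before checking them.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

ALLOWED_DOMAINS = {"innowings.engg.hku.hk", "innoacademy.engg.hku.hk"}

def is_internal_link(url):
    """True if the (absolute) URL points at one of the allowed domains."""
    return urlparse(url).netloc in ALLOWED_DOMAINS

def simple_clean_text(html):
    """Drop <script>/<style> elements and collapse whitespace into readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the tag and its contents from the tree
    return " ".join(soup.get_text(separator=" ").split())
```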
178 | 239 | { |
|
210 | 271 | " return {\"url\": url, \"text\": f\"[ERROR: {e}]\"}" |
211 | 272 | ] |
212 | 273 | }, |
| 274 | + { |
| 275 | + "cell_type": "markdown", |
| 276 | + "id": "07d31190", |
| 277 | + "metadata": {}, |
| 278 | + "source": [ |
| 279 | + "## 1.5. How the Recursive Crawler Works\n", |
| 280 | + "\n", |
| 281 | + "We use a **queue** (Breadth-First Search) to explore pages:\n", |
| 282 | + "\n", |
| 283 | + "1. Start with the two seed URLs\n", |
| 284 | + "2. Scrape the current page\n", |
| 285 | + "3. Extract all links on that page\n", |
| 286 | + "4. Add any new internal links to the queue\n", |
| 287 | + "5. Repeat until we reach the maximum number of pages\n", |
| 288 | + "\n", |
| 289 | + "This is exactly how real web crawlers (like Googlebot) explore the web!" |
| 290 | + ] |
| 291 | + }, |
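The five steps above can be sketched as a compact BFS loop. To keep the sketch self-contained (no network access), the page fetcher is passed in as a function; in the notebook it would wrap `requests.get` plus BeautifulSoup link extraction. The `crawl` name and signature are assumptions.

```python
from collections import deque

def crawl(seed_urls, fetch_page, max_pages):
    """BFS crawl. fetch_page(url) -> (text, links). Returns a list of {url, text}."""
    queue = deque(seed_urls)   # 1. start with the seed URLs
    visited = set()
    documents = []
    while queue and len(documents) < max_pages:  # 5. stop at the safety limit
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch_page(url)            # 2. scrape the current page
        documents.append({"url": url, "text": text})
        for link in links:                       # 3./4. queue unseen links
            if link not in visited:
                queue.append(link)
    return documents
```

A real run would also call `is_internal_link` on each extracted link and sleep for `DELAY` seconds between fetches.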
213 | 292 | { |
214 | 293 | "cell_type": "markdown", |
215 | 294 | "id": "99d64747", |
|
220 | 299 | }, |
221 | 300 | { |
222 | 301 | "cell_type": "code", |
223 | | - "execution_count": 7, |
| 302 | + "execution_count": null, |
224 | 303 | "id": "4f0b296b", |
225 | 304 | "metadata": {}, |
226 | 305 | "outputs": [ |
|
408 | 487 | " \n", |
409 | 488 | " visited.add(url)\n", |
410 | 489 | "\n", |
411 | | - " # Extract links (very basic)\n", |
412 | 490 | " try:\n", |
413 | 491 | " resp = requests.get(url, headers=HEADERS, timeout=10)\n", |
414 | 492 | " soup = BeautifulSoup(resp.text, \"html.parser\")\n", |
|
429 | 507 | "id": "a86cd964", |
430 | 508 | "metadata": {}, |
431 | 509 | "source": [ |
432 | | - "### Save the data to a file" |
| 510 | + "## 1.6. Saving the Collected Data\n", |
| 511 | + "\n", |
| 512 | + "We save everything as `data.json` — a clean, machine-readable format.\n", |
| 513 | + "\n", |
| 514 | + "This file can later be loaded into a vector database for Retrieval-Augmented Generation (RAG) or a chatbot." |
433 | 515 | ] |
434 | 516 | }, |
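Saving comes down to a single `json.dump` call. The record shape (`url` plus `text`) follows the scraper output shown earlier; the sample document below is made up for illustration.

```python
import json

# Example records in the shape produced by the scraper (url + cleaned text)
documents = [
    {"url": "https://innowings.engg.hku.hk/",
     "text": "Welcome to the HKU Innovation Wing ..."},
]

with open("data.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps any non-ASCII characters human-readable in the file
    json.dump(documents, f, ensure_ascii=False, indent=2)
```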
435 | 517 | { |
|
461 | 543 | "id": "2d1e4cff", |
462 | 544 | "metadata": {}, |
463 | 545 | "source": [ |
464 | | - "### Data Quality Check (very important!)" |
465 | | - ] |
466 | | - }, |
467 | | - { |
468 | | - "cell_type": "markdown", |
469 | | - "id": "b2b053b5", |
470 | | - "metadata": {}, |
471 | | - "source": [ |
472 | | - "The scraped contents are stored in the variable documents, which is saved in data.json. <br>\n", |
473 | | - "👀Let's take a look at the data structure." |
| 546 | + "## 1.7. Data Quality Check (Very Important!)\n", |
| 547 | + "\n", |
| 548 | + "Now let’s inspect what we actually collected.\n", |
| 549 | + "\n", |
| 550 | + "### Questions to ask:\n", |
| 551 | + "- Is the text clean and readable?\n", |
| 552 | + "- Are there any noisy elements (menus, footers, JavaScript leftovers)?\n", |
| 553 | + "- Does every page have meaningful content?\n", |
| 554 | + "- Are there any error messages?\n", |
| 555 | + "\n", |
| 556 | + "**Pro tip**: Always look at the first few and a few random pages before using the data!" |
474 | 557 | ] |
475 | 558 | }, |
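Part of this checklist can be automated. Below is a hypothetical `quality_report` helper; it assumes the `[ERROR: ...]` marker format that `scrape_page` returns on failure, and the 200-character threshold is an arbitrary choice.

```python
def quality_report(documents, min_chars=200):
    """Flag scrape errors and suspiciously short pages in the collected data."""
    errors = [d["url"] for d in documents
              if d["text"].startswith("[ERROR")]
    too_short = [d["url"] for d in documents
                 if len(d["text"]) < min_chars
                 and not d["text"].startswith("[ERROR")]
    return {"pages": len(documents), "errors": errors, "too_short": too_short}
```

Pages flagged as `too_short` are worth eyeballing by hand: they are often menus, redirects, or pages whose real content lives in JavaScript.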
476 | 559 | { |
|
533 | 616 | "source": [ |
534 | 617 | "print(documents[0][\"text\"])" |
535 | 618 | ] |
| 619 | + }, |
| 620 | + { |
| 621 | + "cell_type": "markdown", |
| 622 | + "id": "46048f1f", |
| 623 | + "metadata": {}, |
| 624 | + "source": [ |
| 625 | + "## 1.8. Conclusion & Next Steps\n", |
| 626 | + "\n", |
| 627 | + "**Congratulations!** 🎉 \n", |
| 628 | + "You have successfully built a polite recursive web scraper and collected data about HKU Innovation Wing & Innovation Academy.\n", |
| 629 | + "\n", |
| 630 | + "### What you learned\n", |
| 631 | + "- How to set up a controlled crawler\n", |
| 632 | + "- Domain restriction and politeness rules\n", |
| 633 | + "- HTML cleaning with BeautifulSoup\n", |
| 634 | + "- Saving data for AI applications\n", |
| 635 | + "\n", |
| 636 | + "### Possible next steps\n", |
|  | 637 | + "1. Improve text cleaning (strip menus, footers, and other non-content text)\n", |
|  | 638 | + "2. Use website metadata (e.g. headings) to further enhance the dataset\n", |
| 639 | + "3. Add data from other sources" |
| 640 | + ] |
536 | 641 | } |
537 | 642 | ], |
538 | 643 | "metadata": { |
|