|
5 | 5 | "id": "ed05b0b7", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | | - "# Web Scraper" |
|  | 8 | + "# Part 1: Web Scraper\n", |
| 9 | + "In this notebook you will learn how to build a **polite recursive web crawler** using `requests` and `BeautifulSoup4`.\n", |
| 10 | + "\n", |
| 11 | + "**What you will learn**\n", |
| 12 | + "- How to scrape multiple related websites safely\n", |
| 13 | + "- How a **recursive crawler** (BFS-style) works\n", |
| 14 | + "- How to respect website rules (rate limiting, domain restriction)\n", |
| 15 | + "- How to clean HTML into readable text\n", |
| 16 | + "- How to save data for later use (e.g., RAG systems or chatbots)\n", |
| 17 | + "\n", |
| 18 | + "**Target websites** \n", |
| 19 | + "- Innovation Wing: https://innowings.engg.hku.hk/ \n", |
| 20 | + "- InnoAcademy: https://innoacademy.engg.hku.hk/\n", |
| 21 | + "\n", |
| 22 | + "By the end you will have a JSON dataset that can be used to power AI applications!" |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "markdown", |
| 27 | + "id": "f44502fa", |
| 28 | + "metadata": {}, |
| 29 | + "source": [ |
| 30 | + "## 1.1. Introduction to Web Scraping\n", |
| 31 | + "\n", |
| 32 | + "Web scraping is the process of automatically downloading and extracting information from websites.\n", |
| 33 | + "\n", |
| 34 | + "**Why do we need a crawler?** \n", |
| 35 | + "Instead of scraping one page manually, a **recursive crawler** follows links and discovers new pages automatically — just like a search engine bot.\n", |
| 36 | + "\n", |
| 37 | + "**Important rules we follow**\n", |
| 38 | + "- Only scrape pages we are allowed to (domain restriction)\n", |
| 39 | + "- Be polite: wait between requests (`DELAY`)\n", |
|  | 40 | + "- Send a descriptive User-Agent so servers know it’s a student project\n", |
| 41 | + "- Stop after a safe limit (`MAX_PAGES`)\n", |
| 42 | + "\n", |
| 43 | + "This notebook is designed for the **Chatbot Challenge** — the scraped data will become the knowledge base for an intelligent assistant." |
9 | 44 | ] |
10 | 45 | }, |
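The politeness rules listed above can be sketched as a small rate-limiting helper. This is an illustrative sketch, not the notebook's actual code; the `seconds_to_wait` name and signature are assumptions.

```python
DELAY = 1.0  # minimum number of seconds between two requests (polite crawling)

def seconds_to_wait(last_request_time, now, delay=DELAY):
    """Return how long to sleep so that consecutive requests
    are at least `delay` seconds apart."""
    elapsed = now - last_request_time
    return max(0.0, delay - elapsed)
```

In a crawl loop this would be used as `time.sleep(seconds_to_wait(last, time.time()))` just before each `requests.get` call.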
11 | 46 | { |
12 | 47 | "cell_type": "markdown", |
13 | 48 | "id": "08dfbce5", |
14 | 49 | "metadata": {}, |
15 | 50 | "source": [ |
16 | | - "### Install and import libraries" |
| 51 | + "## 1.2. Install and Import Libraries\n", |
| 52 | + "\n", |
| 53 | + "We use two main libraries:\n", |
| 54 | + "- **`requests`** → downloads raw HTML from the internet\n", |
| 55 | + "- **`beautifulsoup4` (bs4)** → parses HTML and lets us extract text and links easily\n", |
| 56 | + "\n", |
| 57 | + "We also import helper modules for URL handling, JSON saving, and timing." |
17 | 58 | ] |
18 | 59 | }, |
19 | 60 | { |
|
95 | 136 | "id": "093e1060", |
96 | 137 | "metadata": {}, |
97 | 138 | "source": [ |
98 | | - "### Define constraits\n", |
| 139 | + "## 1.3. Define Constraints & Configuration\n", |
| 140 | + "\n", |
| 141 | + "Before we start crawling, we need to set clear **rules** so the crawler stays safe and focused.\n", |
99 | 142 | "\n", |
100 | | - "We will deploy a recursive crawler, that is, it not only scrapes the page given, it also recursively clicks links in the page and scrape them." |
| 143 | + "### Seed URLs\n", |
| 144 | + "These are the starting points of our crawl.\n", |
| 145 | + "\n", |
| 146 | + "### Allowed Domains\n", |
| 147 | + "We **only** allow links from the two official HKU Innovation Wing domains. \n", |
| 148 | + "This prevents the crawler from accidentally wandering into the entire internet!\n", |
| 149 | + "\n", |
| 150 | + "### Other Safety Settings\n", |
| 151 | + "- `MAX_PAGES`: stops after 150 pages (safety limit)\n", |
| 152 | + "- `DELAY`: waits 1 second between requests (polite crawling)\n", |
|  | 153 | + "- `HEADERS`: the User-Agent header sent with every request" |
101 | 154 | ] |
102 | 155 | }, |
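One possible shape for this configuration cell is sketched below. The variable names (`SEED_URLS`, `ALLOWED_DOMAINS`, etc.) and the User-Agent string are assumptions; only the two domains, the 150-page limit, and the 1-second delay come from the text above.

```python
# Configuration sketch -- names are assumptions; the notebook's own cell may differ.
SEED_URLS = [
    "https://innowings.engg.hku.hk/",
    "https://innoacademy.engg.hku.hk/",
]
ALLOWED_DOMAINS = {"innowings.engg.hku.hk", "innoacademy.engg.hku.hk"}
MAX_PAGES = 150  # safety limit: stop after this many pages
DELAY = 1        # seconds to wait between requests (polite crawling)
HEADERS = {
    # Hypothetical descriptive User-Agent identifying the project
    "User-Agent": "Mozilla/5.0 (compatible; HKU-student-project)",
}
```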
103 | 156 | { |
|
172 | 225 | "id": "d38e6c3f", |
173 | 226 | "metadata": {}, |
174 | 227 | "source": [ |
175 | | - "### Define helper functions" |
| 228 | + "## 1.4. Helper Functions\n", |
| 229 | + "\n", |
| 230 | + "We create three clean, reusable functions:\n", |
| 231 | + "\n", |
| 232 | + "1. **`is_internal_link()`** – checks whether a link belongs to our allowed domains\n", |
|  | 233 | + "2. **`simple_clean_text()`** – removes scripts, styles, and extra whitespace (i.e. markup that is not part of the readable content)\n", |
| 234 | + "3. **`scrape_page()`** – downloads one page and returns clean text\n", |
| 235 | + "\n", |
| 236 | + "**Note**: There was a duplicate `is_internal_link` definition earlier — we only need it once!" |
176 | 237 | ] |
177 | 238 | }, |
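A minimal sketch of the first two helpers, assuming the `ALLOWED_DOMAINS` set from the configuration section. The notebook's real cell may differ in details; in particular, it probably resolves relative links with `urljoin` before checking them.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

ALLOWED_DOMAINS = {"innowings.engg.hku.hk", "innoacademy.engg.hku.hk"}

def is_internal_link(url):
    """True if the (absolute) URL points at one of the allowed domains."""
    return urlparse(url).netloc in ALLOWED_DOMAINS

def simple_clean_text(html):
    """Drop <script>/<style> elements and collapse whitespace into readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # remove the tag and its contents from the tree
    return " ".join(soup.get_text(separator=" ").split())
```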
178 | 239 | { |
|
210 | 271 | " return {\"url\": url, \"text\": f\"[ERROR: {e}]\"}" |
211 | 272 | ] |
212 | 273 | }, |
| 274 | + { |
| 275 | + "cell_type": "markdown", |
| 276 | + "id": "07d31190", |
| 277 | + "metadata": {}, |
| 278 | + "source": [ |
| 279 | + "## 1.5. How the Recursive Crawler Works\n", |
| 280 | + "\n", |
| 281 | + "We use a **queue** (Breadth-First Search) to explore pages:\n", |
| 282 | + "\n", |
| 283 | + "1. Start with the two seed URLs\n", |
| 284 | + "2. Scrape the current page\n", |
| 285 | + "3. Extract all links on that page\n", |
| 286 | + "4. Add any new internal links to the queue\n", |
| 287 | + "5. Repeat until we reach the maximum number of pages\n", |
| 288 | + "\n", |
| 289 | + "This is exactly how real web crawlers (like Googlebot) explore the web!" |
| 290 | + ] |
| 291 | + }, |
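The five steps above can be sketched as a compact BFS loop. To keep the sketch self-contained (no network access), the page fetcher is passed in as a function; in the notebook it would wrap `requests.get` plus BeautifulSoup link extraction. The `crawl` name and signature are assumptions.

```python
from collections import deque

def crawl(seed_urls, fetch_page, max_pages):
    """BFS crawl. fetch_page(url) -> (text, links). Returns a list of {url, text}."""
    queue = deque(seed_urls)   # 1. start with the seed URLs
    visited = set()
    documents = []
    while queue and len(documents) < max_pages:  # 5. stop at the safety limit
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch_page(url)            # 2. scrape the current page
        documents.append({"url": url, "text": text})
        for link in links:                       # 3./4. queue unseen links
            if link not in visited:
                queue.append(link)
    return documents
```

A real run would also call `is_internal_link` on each extracted link and sleep for `DELAY` seconds between fetches.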
213 | 292 | { |
214 | 293 | "cell_type": "markdown", |
215 | 294 | "id": "99d64747", |
|
220 | 299 | }, |
221 | 300 | { |
222 | 301 | "cell_type": "code", |
223 | | - "execution_count": 7, |
| 302 | + "execution_count": null, |
224 | 303 | "id": "4f0b296b", |
225 | 304 | "metadata": {}, |
226 | 305 | "outputs": [ |
|
408 | 487 | " \n", |
409 | 488 | " visited.add(url)\n", |
410 | 489 | "\n", |
411 | | - " # Extract links (very basic)\n", |
412 | 490 | " try:\n", |
413 | 491 | " resp = requests.get(url, headers=HEADERS, timeout=10)\n", |
414 | 492 | " soup = BeautifulSoup(resp.text, \"html.parser\")\n", |
|
429 | 507 | "id": "a86cd964", |
430 | 508 | "metadata": {}, |
431 | 509 | "source": [ |
432 | | - "### Save the data to a file" |
| 510 | + "## 1.6. Saving the Collected Data\n", |
| 511 | + "\n", |
| 512 | + "We save everything as `data.json` — a clean, machine-readable format.\n", |
| 513 | + "\n", |
| 514 | + "This file can later be loaded into a vector database for Retrieval-Augmented Generation (RAG) or a chatbot." |
433 | 515 | ] |
434 | 516 | }, |
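Saving comes down to a single `json.dump` call. The record shape (`url` plus `text`) follows the scraper output shown earlier; the sample document below is made up for illustration.

```python
import json

# Example records in the shape produced by the scraper (url + cleaned text)
documents = [
    {"url": "https://innowings.engg.hku.hk/",
     "text": "Welcome to the HKU Innovation Wing ..."},
]

with open("data.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps any non-ASCII characters human-readable in the file
    json.dump(documents, f, ensure_ascii=False, indent=2)
```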
435 | 517 | { |
|
461 | 543 | "id": "2d1e4cff", |
462 | 544 | "metadata": {}, |
463 | 545 | "source": [ |
464 | | - "### Data Quality Check (very important!)" |
465 | | - ] |
466 | | - }, |
467 | | - { |
468 | | - "cell_type": "markdown", |
469 | | - "id": "b2b053b5", |
470 | | - "metadata": {}, |
471 | | - "source": [ |
472 | | - "The scraped contents are stored in the variable documents, which is saved in data.json. <br>\n", |
473 | | - "👀Let's take a look at the data structure." |
| 546 | + "## 1.7. Data Quality Check (Very Important!)\n", |
| 547 | + "\n", |
| 548 | + "Now let’s inspect what we actually collected.\n", |
| 549 | + "\n", |
| 550 | + "### Questions to ask:\n", |
| 551 | + "- Is the text clean and readable?\n", |
| 552 | + "- Are there any noisy elements (menus, footers, JavaScript leftovers)?\n", |
| 553 | + "- Does every page have meaningful content?\n", |
| 554 | + "- Are there any error messages?\n", |
| 555 | + "\n", |
| 556 | + "**Pro tip**: Always look at the first few and a few random pages before using the data!" |
474 | 557 | ] |
475 | 558 | }, |
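Part of this checklist can be automated. Below is a hypothetical `quality_report` helper; it assumes the `[ERROR: ...]` marker format that `scrape_page` returns on failure, and the 200-character threshold is an arbitrary choice.

```python
def quality_report(documents, min_chars=200):
    """Flag scrape errors and suspiciously short pages in the collected data."""
    errors = [d["url"] for d in documents
              if d["text"].startswith("[ERROR")]
    too_short = [d["url"] for d in documents
                 if len(d["text"]) < min_chars
                 and not d["text"].startswith("[ERROR")]
    return {"pages": len(documents), "errors": errors, "too_short": too_short}
```

Pages flagged as `too_short` are worth eyeballing by hand: they are often menus, redirects, or pages whose real content lives in JavaScript.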
476 | 559 | { |
|
533 | 616 | "source": [ |
534 | 617 | "print(documents[0][\"text\"])" |
535 | 618 | ] |
| 619 | + }, |
| 620 | + { |
| 621 | + "cell_type": "markdown", |
| 622 | + "id": "46048f1f", |
| 623 | + "metadata": {}, |
| 624 | + "source": [ |
| 625 | + "## 1.8. Conclusion & Next Steps\n", |
| 626 | + "\n", |
| 627 | + "**Congratulations!** 🎉 \n", |
| 628 | + "You have successfully built a polite recursive web scraper and collected data about HKU Innovation Wing & Innovation Academy.\n", |
| 629 | + "\n", |
| 630 | + "### What you learned\n", |
| 631 | + "- How to set up a controlled crawler\n", |
| 632 | + "- Domain restriction and politeness rules\n", |
| 633 | + "- HTML cleaning with BeautifulSoup\n", |
| 634 | + "- Saving data for AI applications\n", |
| 635 | + "\n", |
| 636 | + "### Possible next steps\n", |
|  | 637 | + "1. Improve text cleaning (strip menus, footers, and other non-content text)\n", |
|  | 638 | + "2. Use website metadata (e.g. headings) to further enhance the dataset\n", |
| 639 | + "3. Add data from other sources" |
| 640 | + ] |
536 | 641 | } |
537 | 642 | ], |
538 | 643 | "metadata": { |
|