Rework notebooks to use the static self-hosted fake job board #350

Open · wants to merge 4 commits into master
1,735 changes: 22 additions & 1,713 deletions build-a-web-scraper/01_inspect.ipynb

Large diffs are not rendered by default.

59 changes: 25 additions & 34 deletions build-a-web-scraper/02_scrape.ipynb

Large diffs are not rendered by default.

2,121 changes: 57 additions & 2,064 deletions build-a-web-scraper/03_parse.ipynb

Large diffs are not rendered by default.

34 changes: 24 additions & 10 deletions build-a-web-scraper/04_pipeline.ipynb
@@ -12,15 +12,28 @@
"- Target & Save Specific Information You Want"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ⚠️ Durabilty Warning ⚠️\n",
"\n",
"Like [mentioned in the course](https://realpython.com/lessons/challenge-of-durability/), websites frequently change. Unfortunately the job board that you'll see in the course, indeed.com, has started to block scraping of their site since the recording of the course.\n",
"\n",
"Just like in the associated written tutorial on [web scraping with beautiful soup](https://realpython.com/beautiful-soup-web-scraper-python/#scrape-the-fake-python-job-site), you can instead use [Real Python's fake jobs site](https://realpython.github.io/fake-jobs/) to practice scraping a static website.\n",
"\n",
"All the concepts discussed in the course lessons are still accurate. Translating what you see onto a different website will be a good learning opportunity where you'll have to synthesize the information and apply it practically."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Your Tasks:\n",
"\n",
"- Scrape the first 100 available search results\n",
"- Scrape all 100 available job postings\n",
"- Generalize your code to allow searching for different locations/jobs\n",
"- Pick out information about the URL, job title, and job location\n",
"- Pick out information about the apply URL, job title, and job location\n",
"- Save the results to a file"
]
},
@@ -40,8 +53,7 @@
"source": [
"### Part 1: Inspect\n",
"\n",
"- How do the URLs change when you navigate to the next results page?\n",
"- How do the URLs change when you use a different location and/or job title search?\n",
"- How do the URLs change when you navigate to a job detail?\n",
"- Which HTML elements contain the link, title, and location of each job?"
]
},
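To make the inspection tasks concrete, here's a minimal sketch (not from the original notebook) that fetches the fake jobs page and prints the first job card so you can see which elements hold the title, location, and apply link. The `card-content` class name is an assumption based on the associated written tutorial's description of the site's markup:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
response = requests.get(URL)
soup = BeautifulSoup(response.content, "html.parser")

# Print the first job card's HTML to inspect its structure.
# The "card-content" class name is taken from the written
# tutorial and assumes the site's current markup hasn't changed.
first_card = soup.find("div", class_="card-content")
print(first_card.prettify())
```

Comparing the prettified output against your browser's developer tools view is a good way to confirm which tags and classes to target.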
@@ -58,8 +70,9 @@
"source": [
"### Part 2: Scrape\n",
"\n",
"- Build the code to fetch the first 100 search results. This means you will need to automatically navigate to multiple results pages\n",
"- Write functions that allow you to specify the job title, location, and amount of results as arguments"
"- Build the code to fetch all 100 available job postings.\n",
"- Write functions that allow you to specify the job title, location, and amount of results as arguments\n",
"- Also fetch the information provided on each job details page. For this, you'll need to automatically follow URLs that you've fetched when getting the job postings."
]
},
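For the scraping tasks above, a minimal sketch, assuming the fake jobs site, where all 100 postings live on a single static page and each card's "Apply" link is an absolute URL pointing to that job's detail page:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Collect the "Apply" links; on the fake jobs site these are
# assumed to be absolute URLs to each job's detail page.
apply_urls = [
    link["href"] for link in soup.find_all("a") if link.text.strip() == "Apply"
]

# Follow each collected URL to fetch the detail pages.
detail_pages = {url: requests.get(url).content for url in apply_urls}
print(f"Fetched {len(detail_pages)} job detail pages")
```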
{
@@ -75,8 +88,9 @@
"source": [
"### Part 3: Parse\n",
"\n",
"- Sieve through your HTML soup to pick out only the job title, link, and location\n",
"- Format the results in a readable format (e.g. JSON)\n",
"- Sieve through your HTML soup to pick out only the job title, link, and location from the main page\n",
"- Sieve through the HTML of each details page to get the job description and combine it with the other information\n",
"- Format the results in a readable format (e.g. JSON, TXT, TOML, ...)\n",
"- Save the results to a file"
]
},
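For the parsing tasks above, a minimal sketch; the element names (`card-content`, `title`, `location`) and the exact "Apply" link text are assumptions based on the written tutorial:

```python
import json

import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

jobs = []
for card in soup.find_all("div", class_="card-content"):
    jobs.append(
        {
            "title": card.find("h2", class_="title").text.strip(),
            "location": card.find("p", class_="location").text.strip(),
            # The "Apply" link is assumed to sit in the card's footer,
            # which follows the card content in document order.
            "url": card.find_next("a", string="Apply")["href"],
        }
    )

# Save the results in a readable format (JSON).
with open("jobs.json", "w", encoding="utf-8") as file:
    json.dump(jobs, file, indent=2)
```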
@@ -90,7 +104,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -104,7 +118,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.0"
"version": "3.11.0"
}
},
"nbformat": 4,
500 changes: 0 additions & 500 deletions build-a-web-scraper/05_pipeline_solution.ipynb

This file was deleted.

13 changes: 0 additions & 13 deletions build-a-web-scraper/Pipfile

This file was deleted.

446 changes: 0 additions & 446 deletions build-a-web-scraper/Pipfile.lock

This file was deleted.

46 changes: 41 additions & 5 deletions build-a-web-scraper/README.md
@@ -1,16 +1,52 @@
# Code Repository for Web Scraping Course

This repository contains Jupyter Notebooks with code examples relating to the Real Python video course on Building a Web Scraper with `requests` and Beautiful Soup.
This repository contains Jupyter Notebooks with code examples relating to the Real Python video course on [Building a Web Scraper with `requests` and Beautiful Soup](https://realpython.com/courses/web-scraping-beautiful-soup/).

The notebooks 01-03 represent the **web scraping pipeline** discussed in the course:
## Setup

Create and activate a virtual environment, then install the dependencies (`requests`, `beautifulsoup4`, and `jupyter`) from the pinned requirements file:

```bash
$ python -m venv venv
$ source venv/bin/activate
# PS> venv\Scripts\activate # on Windows
(venv) $ python -m pip install -r requirements.txt
```

Once all the dependencies are installed, you can start the Jupyter notebook server:

```bash
(venv) $ jupyter notebook
```

Now you can open the notebook that you want to work on.

## Notebook Files

The notebooks 01-03 represent the **web scraping pipeline** discussed in the course:

- **Part 1: Inspect** `01_inspect.ipynb`
- **Part 2: Scrape** `02_scrape.ipynb`
- **Part 3: Parse** `03_parse.ipynb`

The notebooks 04-05 contain tasks to work on individually for each learner to keep practicing the discussed concepts and personalize the project for themselves:
The notebook 04 contains tasks to work on individually so you can keep practicing the discussed concepts and personalize the project for yourself:

- **Tasks** `04_pipeline.ipynb`
- **Solution** `05_pipeline_solution.ipynb`

Attempt to build out your individual pipeline by yourself and use the solution document only if you get stuck. All the best, and keep learning! :)
Attempt to build out your pipeline by yourself. When you're done with the suggested practice website, try repeating the process with a different website. All the best, and keep learning! :)

## ⚠️ Durability Warning ⚠️

As [mentioned in the course](https://realpython.com/lessons/challenge-of-durability/), websites frequently change. Unfortunately, the job board that you'll see in the course, indeed.com, has started to block scraping of its site since the course was recorded.

Just like in the associated written tutorial on [web scraping with Beautiful Soup](https://realpython.com/beautiful-soup-web-scraper-python/#scrape-the-fake-python-job-site), you can instead use [Real Python's fake jobs site](https://realpython.github.io/fake-jobs/) to practice scraping a static website.

All the concepts discussed in the course lessons are still accurate. Translating what you see onto a different website will be a good learning opportunity: you'll have to synthesize the information and apply it in practice.
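If you want to verify that the practice site is up before working through the notebooks, here's a quick check (a sketch, not part of the course material):

```python
import requests

response = requests.get("https://realpython.github.io/fake-jobs/")
print(response.status_code)  # A 200 means the site is reachable.
```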

## About the Author

Martin Breuss - Email: martin@realpython.com

## License

Distributed under the MIT license. See `LICENSE` in the root directory of this `materials` repo for more information.
3 changes: 3 additions & 0 deletions build-a-web-scraper/requirements.in
@@ -0,0 +1,3 @@
jupyter
requests
beautifulsoup4
332 changes: 292 additions & 40 deletions build-a-web-scraper/requirements.txt
@@ -1,51 +1,303 @@
appnope==0.1.0
attrs==19.3.0
backcall==0.1.0
#
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
# pip-compile requirements.in
#
anyio==3.6.2
# via jupyter-server
appnope==0.1.3
# via
# ipykernel
# ipython
argon2-cffi==21.3.0
# via
# jupyter-server
# nbclassic
# notebook
argon2-cffi-bindings==21.2.0
# via argon2-cffi
arrow==1.2.3
# via isoduration
asttokens==2.2.1
# via stack-data
attrs==22.2.0
# via jsonschema
backcall==0.2.0
# via ipython
beautifulsoup4==4.9.1
bleach==3.1.5
# via
# -r requirements.in
# nbconvert
bleach==5.0.1
# via nbconvert
certifi==2020.4.5.2
# via requests
cffi==1.15.1
# via argon2-cffi-bindings
chardet==3.0.4
decorator==4.4.2
defusedxml==0.6.0
entrypoints==0.3
# via requests
comm==0.1.2
# via ipykernel
debugpy==1.6.4
# via ipykernel
decorator==5.1.1
# via ipython
defusedxml==0.7.1
# via nbconvert
entrypoints==0.4
# via jupyter-client
executing==1.2.0
# via stack-data
fastjsonschema==2.16.2
# via nbformat
fqdn==1.5.1
# via jsonschema
idna==2.9
ipykernel==5.3.0
ipython==7.15.0
# via
# anyio
# jsonschema
# requests
ipykernel==6.19.4
# via
# ipywidgets
# jupyter
# jupyter-console
# nbclassic
# notebook
# qtconsole
ipython==8.7.0
# via
# ipykernel
# ipywidgets
# jupyter-console
ipython-genutils==0.2.0
jedi==0.17.0
Jinja2==2.11.2
json5==0.9.5
jsonschema==3.2.0
jupyter-client==6.1.3
jupyter-core==4.6.3
jupyterlab==2.1.4
jupyterlab-server==1.1.5
MarkupSafe==1.1.1
mistune==0.8.4
nbconvert==5.6.1
nbformat==5.0.6
notebook==6.0.3
packaging==20.4
pandocfilters==1.4.2
parso==0.7.0
# via
# nbclassic
# notebook
# qtconsole
ipywidgets==8.0.3
# via jupyter
isoduration==20.11.0
# via jsonschema
jedi==0.18.2
# via ipython
jinja2==3.1.2
# via
# jupyter-server
# nbclassic
# nbconvert
# notebook
jsonpointer==2.3
# via jsonschema
jsonschema[format-nongpl]==4.17.3
# via
# jupyter-events
# nbformat
jupyter==1.0.0
# via -r requirements.in
jupyter-client==7.4.8
# via
# ipykernel
# jupyter-console
# jupyter-server
# nbclassic
# nbclient
# notebook
# qtconsole
jupyter-console==6.4.4
# via jupyter
jupyter-core==5.1.0
# via
# jupyter-client
# jupyter-server
# nbclassic
# nbclient
# nbconvert
# nbformat
# notebook
# qtconsole
jupyter-events==0.5.0
# via jupyter-server
jupyter-server==2.0.2
# via
# nbclassic
# notebook-shim
jupyter-server-terminals==0.4.3
# via jupyter-server
jupyterlab-pygments==0.2.2
# via nbconvert
jupyterlab-widgets==3.0.4
# via ipywidgets
markupsafe==2.1.1
# via
# jinja2
# nbconvert
matplotlib-inline==0.1.6
# via
# ipykernel
# ipython
mistune==2.0.4
# via nbconvert
nbclassic==0.4.8
# via notebook
nbclient==0.7.2
# via nbconvert
nbconvert==7.2.7
# via
# jupyter
# jupyter-server
# nbclassic
# notebook
nbformat==5.7.1
# via
# jupyter-server
# nbclassic
# nbclient
# nbconvert
# notebook
nest-asyncio==1.5.6
# via
# ipykernel
# jupyter-client
# nbclassic
# notebook
notebook==6.5.2
# via jupyter
notebook-shim==0.2.2
# via nbclassic
packaging==22.0
# via
# ipykernel
# jupyter-server
# nbconvert
# qtpy
pandocfilters==1.5.0
# via nbconvert
parso==0.8.3
# via jedi
pexpect==4.8.0
# via ipython
pickleshare==0.7.5
prometheus-client==0.8.0
prompt-toolkit==3.0.5
ptyprocess==0.6.0
Pygments==2.6.1
pyparsing==2.4.7
pyrsistent==0.16.0
python-dateutil==2.8.1
pyzmq==19.0.1
# via ipython
platformdirs==2.6.0
# via jupyter-core
prometheus-client==0.15.0
# via
# jupyter-server
# nbclassic
# notebook
prompt-toolkit==3.0.36
# via
# ipython
# jupyter-console
psutil==5.9.4
# via ipykernel
ptyprocess==0.7.0
# via
# pexpect
# terminado
pure-eval==0.2.2
# via stack-data
pycparser==2.21
# via cffi
pygments==2.13.0
# via
# ipython
# jupyter-console
# nbconvert
# qtconsole
pyrsistent==0.19.2
# via jsonschema
python-dateutil==2.8.2
# via
# arrow
# jupyter-client
python-json-logger==2.0.4
# via jupyter-events
pyyaml==6.0
# via jupyter-events
pyzmq==24.0.1
# via
# ipykernel
# jupyter-client
# jupyter-server
# nbclassic
# notebook
# qtconsole
qtconsole==5.4.0
# via jupyter
qtpy==2.3.0
# via qtconsole
requests==2.23.0
Send2Trash==1.5.0
six==1.15.0
# via -r requirements.in
rfc3339-validator==0.1.4
# via jsonschema
rfc3986-validator==0.1.1
# via jsonschema
send2trash==1.8.0
# via
# jupyter-server
# nbclassic
# notebook
six==1.16.0
# via
# asttokens
# bleach
# python-dateutil
# rfc3339-validator
sniffio==1.3.0
# via anyio
soupsieve==2.0.1
terminado==0.8.3
testpath==0.4.4
tornado==6.0.4
traitlets==4.3.3
# via beautifulsoup4
stack-data==0.6.2
# via ipython
terminado==0.17.1
# via
# jupyter-server
# jupyter-server-terminals
# nbclassic
# notebook
tinycss2==1.2.1
# via nbconvert
tornado==6.2
# via
# ipykernel
# jupyter-client
# jupyter-server
# nbclassic
# notebook
# terminado
traitlets==5.8.0
# via
# comm
# ipykernel
# ipython
# ipywidgets
# jupyter-client
# jupyter-core
# jupyter-events
# jupyter-server
# matplotlib-inline
# nbclassic
# nbclient
# nbconvert
# nbformat
# notebook
# qtconsole
uri-template==1.2.0
# via jsonschema
urllib3==1.25.9
wcwidth==0.2.4
# via requests
wcwidth==0.2.5
# via prompt-toolkit
webcolors==1.12
# via jsonschema
webencodings==0.5.1
# via
# bleach
# tinycss2
websocket-client==1.4.2
# via jupyter-server
widgetsnbextension==4.0.4
# via ipywidgets