Commit 490e5bb

authored
Add files via upload
1 parent f6a1936 commit 490e5bb

4 files changed

+139
-2
lines changed

README.md

Lines changed: 123 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,123 @@
1-
# parse-html-pyquery
2-
How to Parse HTML with PyQuery
1+
# How to Parse HTML with PyQuery: Python Tutorial
2+
3+
In this article, we will learn how to write a web scraper in Python using the PyQuery library. We will first explore the basics, and then compare PyQuery with Beautiful Soup. So, let’s get started.
4+
5+
## What is PyQuery?
6+
7+
PyQuery is a Python library that allows you to manipulate and extract data from HTML and XML documents. It provides a jQuery-like syntax and API, making it easy to work with web content in Python.
8+
9+
Like jQuery, PyQuery allows you to select elements from an HTML or XML document using CSS selectors, and then manipulate or extract data from those elements. You can use PyQuery to parse and manipulate HTML and XML documents, scrape web pages, and extract data from web APIs.
10+
11+
## How to install PyQuery
12+
13+
To install PyQuery, you will need to have Python installed on your machine. If you don't have Python installed, you can download and install it from the official Python website.
14+
15+
Once you have Python installed, you can download and install the PyQuery library using pip. To do this, open a terminal or command prompt and type the following command:
16+
17+
```bash
18+
python -m pip install pyquery
19+
```
20+
21+
This will install PyQuery with all the necessary dependencies. If you get any errors, check out the official PyQuery documentation.
22+
23+
## Parsing DOM
24+
25+
Let’s write our first scraper using PyQuery. We will use the requests module to fetch the website and parse it using the PyQuery module. Let’s go ahead and import the necessary libraries:
26+
27+
```python
28+
import requests
29+
from pyquery import PyQuery as pq
30+
31+
# Now, let's fetch the website https://example.com and grab the title using PyQuery.
32+
33+
r = requests.get("https://example.com")
34+
doc = pq(r.content)
35+
print(doc("title").text())
36+
```
37+
38+
Now, if we run this code, it will print the title of the website. Notice that we use the get method of requests to grab the website content. Then we pass that content to the PyQuery class, which parses it, and store the result in the doc object. Finally, we call doc with the title tag as a CSS selector and display the title text.
39+
40+
### Extract Multiple Elements Using CSS Selector
41+
42+
Next, we will extract multiple elements using a CSS selector. We will use the https://books.toscrape.com website. PyQuery can also fetch the HTML from a URL itself, so let’s use that in the next example:
43+
44+
```python
45+
from pyquery import PyQuery as pq
46+
doc = pq(url="https://books.toscrape.com")
47+
for link in doc("h3>a"):
48+
    print(link.text, link.attrib["href"])
49+
```
50+
51+
We use a CSS selector to grab all the links that are direct children of h3 tags, and then loop over them to print the text and URL of each link. The selection holds every element that matches the selector, whether that is one element or many.
52+
Iterating over the selection yields lxml elements, and each element exposes its attributes through the attrib object, which behaves like a Python dictionary. So, we simply pass “href” as the key and it returns the URL of the element.
53+
54+
### Removing Elements
55+
56+
Sometimes we might need to remove unwanted elements from the DOM. PyQuery has a method called remove() which can be used for this purpose. Let’s say we want to get rid of all the icons from the above example. We can do it by adding a few lines of code like below:
57+
58+
```python
59+
from pyquery import PyQuery as pq
60+
doc = pq(url="https://books.toscrape.com")
61+
for icon in doc("i").items():  # .items() yields PyQuery objects, which provide remove()
62+
    icon.remove()
63+
```
64+
65+
Once we run this code, it will remove all the icons from the doc.
66+
67+
## PyQuery vs BeautifulSoup
68+
69+
Both PyQuery and Beautiful Soup are great Python libraries for working with HTML and XML documents. Each provides tools for parsing, traversing, and manipulating those documents, as well as for extracting data from web pages and APIs.
70+
71+
One key difference between PyQuery and Beautiful Soup is the syntax and API that they use. PyQuery is designed to have a syntax and API similar to jQuery, a popular JavaScript library for working with HTML and DOM elements. If you are familiar with jQuery, you should be able to pick up PyQuery quickly. Beautiful Soup, on the other hand, has a different syntax and API that is more similar to the ElementTree library in Python's standard library, with methods such as find() and find_all(). If you are familiar with ElementTree, you may find Beautiful Soup easier to use. Beautiful Soup is also tolerant of malformed markup and can clean it up, which is handy if you are trying to scrape a website with broken HTML. Beautiful Soup is more feature-rich when it comes to built-in functions. PyQuery, however, is a thin layer on top of lxml, so it is often faster than Beautiful Soup, especially when Beautiful Soup uses Python's built-in html.parser.
72+
73+
Ultimately, the choice between PyQuery and Beautiful Soup will depend on your specific needs and preferences. Either one can be a good choice for working with HTML and XML documents in Python.
74+
75+
<table>
76+
<thead>
77+
<tr>
78+
<th> </th>
79+
<th>PyQuery </th>
80+
<th>BeautifulSoup</th>
81+
</tr>
82+
</thead>
83+
<tbody>
84+
<tr>
85+
<td>Syntax and API</td>
86+
<td>jQuery-like</td>
87+
<td>ElementTree-like</td>
88+
</tr>
89+
<tr>
90+
<td>Performance</td>
91+
<td>Fast</td>
92+
<td>Good</td>
93+
</tr>
94+
<tr>
95+
<td>Support Multiple Parsers</td>
96+
<td>Yes</td>
97+
<td>Yes</td>
98+
</tr>
99+
<tr>
100+
<td>Unicode Support</td>
101+
<td>Yes</td>
102+
<td>Yes</td>
103+
</tr>
104+
<tr>
105+
<td>HTML Sanitization</td>
106+
<td>No</td>
107+
<td>Yes</td>
108+
</tr>
109+
<tr>
110+
<td>Multiple Language support</td>
111+
<td>No</td>
112+
<td>Yes</td>
113+
</tr>
114+
</tbody>
115+
</table>
116+
117+
## Conclusion
118+
119+
In conclusion, PyQuery is an easy-to-use Python library for working with HTML and XML documents. Its jQuery-like syntax and API make it easy to parse, traverse, and manipulate HTML and XML documents, and extract data.
120+
121+
While PyQuery is a powerful tool, it is not the only option available for working with HTML and XML documents in Python. Beautiful Soup is another popular library that offers a different syntax and API and is suitable for different use cases. Ultimately, the choice between PyQuery and Beautiful Soup will depend on your specific needs and preferences.
122+
123+
In this article, we have introduced PyQuery and its capabilities, provided examples of how to use it, and compared it to Beautiful Soup. We hope that this information has been useful and that you are now better equipped to work with HTML and XML documents using PyQuery in your own projects.

src/example.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
import requests
2+
from pyquery import PyQuery as pq
3+
4+
r = requests.get("https://example.com")
5+
doc = pq(r.content)
6+
print(doc("title").text())

src/extract_links.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
from pyquery import PyQuery as pq
2+
3+
doc = pq(url="https://books.toscrape.com")
4+
for link in doc("h3>a"):
5+
    print(link.text, link.attrib["href"])

src/remove_icons.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
from pyquery import PyQuery as pq
2+
3+
doc = pq(url="https://books.toscrape.com")
4+
for icon in doc("i").items():  # .items() yields PyQuery objects, which provide remove()
5+
    icon.remove()
