To crawl all the links on the target website, first define "getHtmlText", a helper function that fetches a web page and parses it; this one function handles every page request in the subsequent steps.

1. Set the website (https://community.dur.ac.uk/hubert.shum/comp42315/) as the starting URL and use getHtmlText to fetch and parse the home page.
2. Find the element with the keyword 'navigator' to obtain the link to the publication page, then use getHtmlText again to request and parse that page's content.
3. From the publication page, extract the P tags with the keyword class='TextOption' to collect the text of every topic. The HTML structure of the first topic, 'character animation', differs from the others, so its string must be cleaned by removing the non-breaking space '\xa0'.
4. For each topic page, extract the DIV with the keyword class='w3-cell-middle'; within that block, find all SPAN tags with the keyword class='TextSmallDefault' to get the author and affiliation web links.
5. Store all the results in a CSV file.

A sketch of this pipeline is shown below.
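The following is a minimal sketch of the crawl under a few assumptions: getHtmlText wraps requests plus BeautifulSoup, the navigation bar carries the class 'navigator', and the '?topic=' URL scheme for topic pages is a hypothetical placeholder. Apart from the starting URL and the CSS classes quoted above, all names are illustrative rather than taken from the repository's actual code.

```python
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://community.dur.ac.uk/hubert.shum/comp42315/"


def get_html_text(url):
    """Fetch a page and return a parsed BeautifulSoup tree.

    Every page request in the crawl goes through this one helper.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Step 1: fetch and parse the home page.
home = get_html_text(BASE_URL)

# Step 2: follow the navigation bar to the publication page. The nav
# element is assumed to carry the class 'navigator' and to contain an
# <a> whose text mentions publications.
nav = home.find(class_="navigator")
pub_href = next(
    a["href"] for a in nav.find_all("a") if "publication" in a.get_text().lower()
)
pub_page = get_html_text(urljoin(BASE_URL, pub_href))

# Step 3: collect the topic strings from <p class="TextOption"> tags,
# removing the non-breaking space (\xa0) that appears in the first one
# ('character animation').
topics = [
    p.get_text().replace("\xa0", " ").strip()
    for p in pub_page.find_all("p", class_="TextOption")
]

# Step 4: on each topic page, pull the author/affiliation links out of
# the <span class="TextSmallDefault"> tags nested inside
# <div class="w3-cell-middle">. The '?topic=' scheme below is a
# hypothetical placeholder for however the site addresses topic pages.
rows = []
for topic in topics:
    topic_page = get_html_text(urljoin(BASE_URL, "?topic=" + topic))
    for cell in topic_page.find_all("div", class_="w3-cell-middle"):
        for span in cell.find_all("span", class_="TextSmallDefault"):
            for a in span.find_all("a"):
                rows.append([topic, a.get_text(strip=True), a.get("href")])

# Step 5: store all the results in a CSV file.
with open("publications.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "link_text", "link"])
    writer.writerows(rows)
```

Using html.parser keeps the sketch dependency-free beyond requests and bs4; a real crawl would also want error handling around the navigation lookup and polite rate limiting between requests.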
About
A web scraper for data extraction and analysis