A web scraper and text parser for Project Gutenberg.
- Node.js 16.x+; use nvm to manage Node.js/npm versions.
- TypeScript `^4.9.x`

Ensure `tsc` is available globally:

```sh
npm i -g typescript
```
This project is a tool for aggregating and compiling plaintext book files from Project Gutenberg, for the explicit purpose of generating a stable dataset of plaintext to write simple programs against.

You could use the output(s) of this program as input for learning file-parsing basics in another programming language. For example, the `parse` command produces output that can be verified in another program when run against the files in `data/txt-ebooks-stripped`.
cd into the project:

```sh
cd ./path/to/gutenberg-ezd
```

Install dependencies:

```sh
npm i
```

Compile with `tsc`:

```sh
tsc
```

Optionally, open a separate terminal and run `tsc` in watch mode (recommended):

```sh
tsc -w
```
The `historical-data/` directory contains a dataset produced by a daily automated scrape of the books from 12-09-2022 to 12-28-2022.

To seed the program with this dataset, unzip `historical-data/scraped-ebooks_12-09-2022_to_12-28-2022.zip`. The archive unzips to a directory at `historical-data/scraped-ebooks`. Copy its contents to `data/scraped-ebooks`:

```sh
cp -a ./historical-data/scraped-ebooks/. ./data/scraped-ebooks/
```

This enables the `fetch`, `strip`, and `parse` scripts to be run over all of the ebooks captured during the December 2022 collection period.
Generally, the program's commands should be run sequentially (a full pass is shown after this list):

1. `scrape`
2. `fetch`
3. `strip`
4. `parse`

Keeping them separate is intentional: some commands aggregate data locally over time, and it would be time-consuming to run the subsequent commands after every pass. For example, `scrape` can be run on a daily basis to aggregate metadata for each list. `fetch` can also be run then, but it isn't necessary for capturing a given day's ebooks or rankings, since it fetches the `.txt` files from the metadata already produced by the `scrape` command. `strip` only needs to be run when `fetch` has pulled new files into `data/txt-ebooks`, or when the header/footer stripping logic changes.
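As a convenience, here is one complete pass over the pipeline, assuming the project has been compiled as described above (each command is documented individually below):

```sh
node dist/main.js scrape   # scrape today's top-list metadata
node dist/main.js fetch    # download the .txt files the metadata references
node dist/main.js strip    # strip Project Gutenberg headers/footers
node dist/main.js parse    # print rudimentary statistics
```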
```sh
node dist/main.js scrape
```

Runs puppeteer in Chrome headless mode to scrape today's top 100 / top 1000 ebooks. Fetches from the following source pages:

- top 100: https://www.gutenberg.org/browse/scores/top
- top 1000: https://www.gutenberg.org/browse/scores/top1000.php

Writes the metadata by date and list type into `data/scraped-ebooks/` as JSON.
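For orientation, the core of a scrape like this might resemble the sketch below. The `ol a` selector and the `ScrapedEbook` shape are illustrative assumptions, not the project's actual internals.

```ts
// A minimal sketch of scraping one top list with puppeteer.
// The selector and metadata shape are assumptions for illustration only.
import puppeteer from "puppeteer";

interface ScrapedEbook {
  title: string; // anchor text for the ebook entry
  href: string;  // absolute URL of the ebook's landing page
}

async function scrapeTopList(url: string): Promise<ScrapedEbook[]> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // Assumption: the rankings are rendered as anchors inside ordered lists.
    return await page.$$eval("ol a", (anchors) =>
      anchors.map((a) => ({
        title: a.textContent ?? "",
        href: (a as HTMLAnchorElement).href,
      }))
    );
  } finally {
    await browser.close();
  }
}

scrapeTopList("https://www.gutenberg.org/browse/scores/top").then((ebooks) =>
  console.log(JSON.stringify(ebooks, null, 2))
);
```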
```sh
node dist/main.js fetch
```

Attempts to download all ebooks with plaintext files into `data/txt-ebooks/`. Skips 404s; otherwise retries with backoff.
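A download loop matching that description ("skip 404s, otherwise retry with backoff") might look like the following sketch. It assumes a global `fetch` (built into Node 18+; on Node 16 a polyfill such as node-fetch would be needed), and the attempt count and delays are illustrative.

```ts
// A sketch of "skip 404s, otherwise retry with backoff".
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fetchWithBackoff(url: string, attempts = 4): Promise<string | null> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.status === 404) return null; // no plaintext file; skip for good
      if (res.ok) return await res.text();
      // Other failures (e.g. 429/5xx) fall through to the backoff below.
    } catch {
      // Network errors also fall through to the backoff.
    }
    await sleep(2 ** attempt * 1000); // 1s, 2s, 4s, ...
  }
  throw new Error(`failed to fetch ${url} after ${attempts} attempts`);
}
```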
```sh
node dist/main.js strip
```

Strips the Project Gutenberg programmatic header and footer text blocks from the fetched files and rewrites them to `data/txt-ebooks-stripped/`.

Skips any files with non-parsable headers or footers, or files in a format that isn't parsable at all, e.g. `.txt` files that represent a Project Gutenberg index rather than an ebook itself.
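Project Gutenberg plaintext files conventionally wrap the book body in `*** START OF ... ***` and `*** END OF ... ***` marker lines, so a stripping pass along those lines might look like the sketch below; the project's actual rules for what counts as parsable may be stricter.

```ts
// A sketch of locating and stripping the Project Gutenberg header/footer.
// PG plaintext files conventionally delimit the body with lines like:
//   *** START OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***
//   *** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***
const START_RE = /^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$/m;
const END_RE = /^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$/m;

function stripHeaderFooter(raw: string): string | null {
  const start = raw.match(START_RE);
  const end = raw.match(END_RE);
  if (start?.index === undefined || end?.index === undefined) {
    return null; // non-parsable header or footer: caller should skip the file
  }
  const bodyStart = start.index + start[0].length;
  if (bodyStart >= end.index) return null; // markers out of order: skip
  return raw.slice(bodyStart, end.index).trim();
}
```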
```sh
node dist/main.js parse
```

Runs a simple parsing program on the files produced by `strip` and prints rudimentary statistics.
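The statistics themselves are a natural target for the cross-language verification mentioned in the introduction. As an illustrative sketch (the counts `parse` actually reports may differ), tallying files, lines, and words over the stripped dataset could look like:

```ts
// A sketch of rudimentary statistics over data/txt-ebooks-stripped/.
// What `parse` actually reports may differ; this shows the kind of
// per-file counting the stripped dataset is meant to support.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const dir = "./data/txt-ebooks-stripped";
let files = 0;
let lines = 0;
let words = 0;

for (const name of readdirSync(dir)) {
  if (!name.endsWith(".txt")) continue;
  const text = readFileSync(join(dir, name), "utf8");
  files += 1;
  lines += text.split("\n").length;
  words += (text.match(/\S+/g) ?? []).length;
}

console.log({ files, lines, words });
```

Reimplementing the same counts in another language and checking that they agree is exactly the kind of exercise the stripped dataset is meant to enable.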