Commit 8c743bb

removing pdftotext as too flaky
1 parent 0268d21 commit 8c743bb

5 files changed: 3 additions and 54 deletions
Dockerfile

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ RUN pip install --upgrade setuptools wheel
 
 # Create the environment:
 COPY environment.yml .
+
 # Install everything at once:
 RUN mamba env create -f environment.yml
 # Do a debug or incremental env install (builds in under 3 min):

README.md

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ There is a Dockerfile for the environment, which is the easiest way to set up a
 1. Ensure you have docker, the docker extension for VS Code, and the VS Code remote extension installed
 2. In VS Code, right-click on the Dockerfile and hit "build image". Switch to the docker tab and hit "Run interactive". Switch to the VS Code Remote tab, then click "Attach in new window" on the running container.
 3. Once you are in the container via VS Code remote, you may need to install the Python extension (within the container version of VS Code).
-4. Open up the relevant folder, "app", in VS Code remote.
+4. Open up the relevant folder, "app", in VS Code remote. It's usually in `../app`.
 5. Once you are in the container, you'll need to use `conda activate codeforecon` to switch to the right conda environment to build files and run notebooks. The build command is as usual for Jupyter Book: `jupyter-book build .`.
 6. To ping back the built HTML files, use `docker cp running-container-name:app/_build/ .` from your local command line (not the docker container) within the folder where you want to move all of the build files. We suggest creating a "scratch" folder and running the command from within it, or transferring directly to your local "_build" folder. You can partially view the HTML within the docker container using a HTML preview extension but note that MathJax and internal book links won't work.

data-advanced.ipynb

Lines changed: 1 addition & 1 deletion
@@ -517,7 +517,7 @@
 "\n",
 "class User(BaseModel):\n",
 " id: int\n",
-" name = \"Katherine Johnson\"\n",
+" name: str = \"Katherine Johnson\"\n",
 " signup_ts: Optional[datetime] = None\n",
 " friends: List[int] = []"
 ]
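
Note on the hunk above: the change adds a type annotation to `name`. Model libraries driven by annotations generally discover fields through class annotations, so an attribute without an annotation may not be treated as a field at all. The sketch below uses only the standard library to show the difference. The class names `WithoutAnnotation` and `WithAnnotation` are hypothetical, and it illustrates the general Python mechanism rather than the library's own validation logic.

```python
from typing import get_type_hints


class WithoutAnnotation:
    # Plain class attribute: nothing is recorded in the class annotations.
    name = "Katherine Johnson"


class WithAnnotation:
    # Annotated attribute: the type is recorded alongside the default value.
    name: str = "Katherine Johnson"


unannotated_hints = get_type_hints(WithoutAnnotation)
annotated_hints = get_type_hints(WithAnnotation)

print(unannotated_hints)  # {}
print(annotated_hints)    # {'name': <class 'str'>}
```

Both classes expose the same default value at runtime; only the annotated one advertises `name` as a typed field to tools that read annotations.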

data-extraction.ipynb

Lines changed: 0 additions & 51 deletions
@@ -486,57 +486,6 @@
 "This gives us the table neatly loaded into a **pandas** dataframe ready for further use."
 ]
 },
-{
-"cell_type": "markdown",
-"id": "ae23f3f5",
-"metadata": {},
-"source": [
-"## Extracting data from PDFs\n",
-"\n",
-"PDFs are great. Unfortunately, some people love them so much that they think they're an appropriate way to store data rather than a convenient way to share text and/or figures. Or perhaps there's a table in a PDF that you'd legitimately like to get the info out from.\n"
-]
-},
-{
-"cell_type": "markdown",
-"id": "1084de32",
-"metadata": {},
-"source": [
-"### Extracting images and text from PDFs\n",
-"\n",
-"We'll use [**pdftotext**](https://github.com/jalan/pdftotext) to get text out of the same PDF."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "699e56e0",
-"metadata": {},
-"outputs": [],
-"source": [
-"from pathlib import Path\n",
-"\n",
-"import pdftotext\n",
-"\n",
-"# Download the pdf_with_table.pdf file from\n",
-"# https://github.com/aeturrell/coding-for-economists/blob/main/data/pdf_with_table.pdf\n",
-"# and put it in a subfolder called data before running the next line\n",
-"\n",
-"# Load the PDF\n",
-"with open(Path(\"data/pdf_with_table.pdf\"), \"rb\") as f:\n",
-" pdf = pdftotext.PDF(f)\n",
-"\n",
-"# Read all the text into one string; print a chunk of the string\n",
-"print(\"\\n\\n\".join(pdf)[:220])"
-]
-},
-{
-"cell_type": "markdown",
-"id": "ce2ab75b",
-"metadata": {},
-"source": [
-"Other options for extracting information from PDFs include [**pdfminer**](https://pdfminersix.readthedocs.io) (which can also extract images) and [borb](https://github.com/jorisschellekens/borb) (though be careful of its licence if you're using it for commercial purposes)."
-]
-},
 {
 "cell_type": "markdown",
 "id": "b76f6b2d",

environment.yml

Lines changed: 0 additions & 1 deletion
@@ -78,5 +78,4 @@ dependencies:
 - palmerpenguins
 - pyfixest>=0.17.0
 - watermark
-- pdftotext
 - ydata_profiling
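
Note on the dependency removal: once `pdftotext` is dropped from `environment.yml`, any notebook that still imports it would fail at build time. A minimal sketch of a sanity check follows, assuming notebooks use the standard JSON layout with a top-level `cells` list. The helper name `notebook_imports_module` is hypothetical and not part of this repository, and the regular expression only catches simple `import x` / `from x import ...` lines.

```python
import json
import re


def notebook_imports_module(notebook_json: str, module: str) -> bool:
    """Return True if any code cell in the notebook imports `module`."""
    notebook = json.loads(notebook_json)
    pattern = re.compile(
        rf"^\s*(?:import|from)\s+{re.escape(module)}\b", re.MULTILINE
    )
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown cells may mention the module without importing it
        source = cell.get("source", [])
        # Notebook source is stored either as a list of lines or as one string.
        text = "".join(source) if isinstance(source, list) else source
        if pattern.search(text):
            return True
    return False


# Tiny in-memory notebook standing in for a real .ipynb file (illustrative only).
example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["We'll use pdftotext to get text out."]},
        {"cell_type": "code", "source": ["from pathlib import Path\n", "import pdftotext\n"]},
    ]
})

has_import = notebook_imports_module(example, "pdftotext")
print(has_import)
```

To check a real notebook, read its contents with `Path("data-extraction.ipynb").read_text()` and pass the string to the helper.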
