removing pdftotext as too flaky

aeturrell · aeturrell · commit 8c743bb316a7 · 2024-12-22T23:53:20.000Z
diff --git a/Dockerfile b/Dockerfile
@@ -30,6 +30,7 @@ RUN pip install --upgrade setuptools wheel
 
 # Create the environment:
 COPY environment.yml .
+
 # Install everything at once:
 RUN mamba env create -f environment.yml
 # Do a debug or incremental env install (builds in under 3 min):
diff --git a/README.md b/README.md
@@ -57,7 +57,7 @@ There is a Dockerfile for the environment, which is the easiest way to set up a
 1. Ensure you have docker, the docker extension for VS Code, and the VS Code remote extension installed
 2. In VS Code, right-click on the Dockerfile and hit "build image". Switch to the docker tab and hit "Run interactive". Switch to the VS Code Remote tab, then click "Attach in new window" on the running container.
 3. Once you are in the container via VS Code remote, you may need to install the Python extension (within the container version of VS Code).
-4. Open up the relevant folder, "app", in VS Code remote.
+4. Open up the relevant folder, "app", in VS Code remote. It's usually in `../app`.
 5. Once you are in the container, you'll need to use `conda activate codeforecon` to switch to the right conda environment to build files and run notebooks. The build command is as usual for Jupyter Book: `jupyter-book build .`.
 6. To ping back the built HTML files, use `docker cp running-container-name:app/_build/ .` from your local command line (not the docker container) within the folder where you want to move all of the build files. We suggest creating a "scratch" folder and running the command from within it, or transferring directly to your local "_build" folder. You can partially view the HTML within the docker container using a HTML preview extension but note that MathJax and internal book links won't work.
 
diff --git a/data-advanced.ipynb b/data-advanced.ipynb
@@ -517,7 +517,7 @@
     "\n",
     "class User(BaseModel):\n",
     "    id: int\n",
-    "    name = \"Katherine Johnson\"\n",
+    "    name: str = \"Katherine Johnson\"\n",
     "    signup_ts: Optional[datetime] = None\n",
     "    friends: List[int] = []"
    ]
diff --git a/data-extraction.ipynb b/data-extraction.ipynb
@@ -486,57 +486,6 @@
     "This gives us the table neatly loaded into a **pandas** dataframe ready for further use."
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "ae23f3f5",
-   "metadata": {},
-   "source": [
-    "## Extracting data from PDFs\n",
-    "\n",
-    "PDFs are great. Unfortunately, some people love them so much that they think they're an appropriate way to store data rather than a convenient way to share text and/or figures. Or perhaps there's a table in a PDF that you'd legitimately like to get the info out from.\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "1084de32",
-   "metadata": {},
-   "source": [
-    "### Extracting images and text from PDFs\n",
-    "\n",
-    "We'll use [**pdftotext**](https://github.com/jalan/pdftotext) to get text out of the same PDF."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "699e56e0",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from pathlib import Path\n",
-    "\n",
-    "import pdftotext\n",
-    "\n",
-    "# Download the pdf_with_table.pdf file from\n",
-    "# https://github.com/aeturrell/coding-for-economists/blob/main/data/pdf_with_table.pdf\n",
-    "# and put it in a subfolder called data before running the next line\n",
-    "\n",
-    "# Load the PDF\n",
-    "with open(Path(\"data/pdf_with_table.pdf\"), \"rb\") as f:\n",
-    "    pdf = pdftotext.PDF(f)\n",
-    "\n",
-    "# Read all the text into one string; print a chunk of the string\n",
-    "print(\"\\n\\n\".join(pdf)[:220])"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "ce2ab75b",
-   "metadata": {},
-   "source": [
-    "Other options for extracting information from PDFs include [**pdfminer**](https://pdfminersix.readthedocs.io) (which can also extract images) and [borb](https://github.com/jorisschellekens/borb) (though be careful of its licence if you're using it for commercial purposes)."
-   ]
-  },
   {
    "cell_type": "markdown",
    "id": "b76f6b2d",
diff --git a/environment.yml b/environment.yml
@@ -78,5 +78,4 @@ dependencies:
     - palmerpenguins
     - pyfixest>=0.17.0
     - watermark
-    - pdftotext
     - ydata_profiling
