j-senn/nlrb-db

Case details were retrieved from the NLRB website using its search and "Download CSV" feature. Searches were batched into time periods containing about 50,000 cases each. However, some of the downloaded files could not be parsed by Pandas as downloaded, so the case numbers were instead extracted using the grep regex

'[[:digit:]]{2}-[[:alpha:]]{2}-[[:digit:]]{6}'

This led to a discrepancy between the number of cases the NLRB reported providing and the number of case numbers parsed.
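The extraction step can also be sketched in Python, which makes it easy to compare the raw match count with the unique count when looking into the discrepancy. This is a minimal sketch, not the repository's own code: the function name `extract_case_numbers` and the sample rows are hypothetical, and the pattern is a translation of the grep expression (two digits, two letters, six digits) rather than a verified description of every NLRB case number format.

```python
import re

# Python translation of the grep pattern: two digits, two letters, six digits.
# ASCII ranges are used so the behavior mirrors the POSIX classes in the C locale.
CASE_NUMBER_RE = re.compile(r"[0-9]{2}-[A-Za-z]{2}-[0-9]{6}")


def extract_case_numbers(text):
    """Return the unique case numbers found in text, in first-seen order."""
    seen = set()
    ordered = []
    for match in CASE_NUMBER_RE.finditer(text):
        number = match.group(0)
        if number not in seen:
            seen.add(number)
            ordered.append(number)
    return ordered


# Hypothetical sample rows; real input would be the text of a downloaded CSV.
sample = (
    '"01-CA-123456","Example Employer A"\n'
    '"02-RC-654321","Example Employer B"\n'
    '"01-CA-123456","Example Employer A"\n'
)

all_matches = CASE_NUMBER_RE.findall(sample)
unique = extract_case_numbers(sample)
print(len(all_matches), len(unique))  # 3 2
print(unique)  # ['01-CA-123456', '02-RC-654321']
```

Comparing the two counts against the total the NLRB reports for a batch is one way to see whether duplicates or non-matching rows account for the difference.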

Example command line usage:

python scrape_case_pages.py -c case_numbers.csv

File descriptions

  • cases.tar.gz: gzipped archive of the case-detail CSVs as downloaded from the NLRB website, batched by time period with ~50k rows each.
  • case_numbers_*.csv: case numbers extracted from the original CSVs for the corresponding time period.

Notes on behavior:

  • This script skips cases that already have a file in the case_htmls directory.
  • The process is still slow, taking 3-4 hours per 50,000 cases.
  • It has no command line output and only sparse logging. To check its progress, count the files in the case_htmls directory: ls case_htmls | wc -l
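The skip-and-resume behavior and the progress check described above can be illustrated in Python. This is a sketch under stated assumptions, not the logic of scrape_case_pages.py itself: it assumes each saved page is named `<case number>.html` inside case_htmls, and the helper names `pending_cases` and `progress` are hypothetical.

```python
import tempfile
from pathlib import Path


def pending_cases(case_numbers, html_dir):
    """Return the case numbers that have no saved page in html_dir yet."""
    # Assumption: one file per case, named "<case number>.html".
    done = {path.stem for path in Path(html_dir).glob("*.html")}
    return [number for number in case_numbers if number not in done]


def progress(case_numbers, html_dir):
    """Return (cases already saved, total cases)."""
    remaining = pending_cases(case_numbers, html_dir)
    return len(case_numbers) - len(remaining), len(case_numbers)


if __name__ == "__main__":
    # Demonstration against a temporary directory with one page already saved.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "01-CA-123456.html").write_text("<html></html>")
        numbers = ["01-CA-123456", "02-RC-654321"]
        print(pending_cases(numbers, tmp))  # ['02-RC-654321']
        print(progress(numbers, tmp))  # (1, 2)
```

Building the set of finished cases once, rather than checking the file system for each case, keeps the check cheap even for a batch of 50,000 case numbers.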
