This repository contains a Node.js script that fetches daily EDGAR sitemap XML files, filters for Form D filings (where `submissionType="D"` and `testOrLive="LIVE"`), and enriches each issuer with possible domains using Clearbit's Autocomplete API. Results are output to a CSV file. A separate cleaning script then de-duplicates the CSV by the `CIK` column.
- Fetches a list of EDGAR sitemap XML files.
- Parses each sitemap to locate `primary_doc.xml` for Form D filings.
- Filters out anything that is not `submissionType="D"` and `testOrLive="LIVE"` (see the sketch after this list).
- For each entity, calls Clearbit Autocomplete to find possible company names/domains.
- Writes results to `output.csv`.
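A minimal sketch of the fetch-and-filter step. The function name, the `User-Agent` string, and the exact XML layout of `primary_doc.xml` are assumptions for illustration, not necessarily what `index.js` does:

```js
const fetch = require('node-fetch');
const xml2js = require('xml2js');

// Fetch one primary_doc.xml and check whether it is a LIVE Form D filing.
async function isLiveFormD(primaryDocUrl, contactEmail) {
  const res = await fetch(primaryDocUrl, {
    headers: { 'User-Agent': `form-d-scraper ${contactEmail}` },
  });
  const doc = await xml2js.parseStringPromise(await res.text());
  // Assumed layout: Form D nests these fields under <edgarSubmission>.
  const sub = doc.edgarSubmission;
  return sub.submissionType[0] === 'D' && sub.testOrLive[0] === 'LIVE';
}
```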
- Reads `output.csv`.
- De-duplicates rows based on the `CIK` column (see the sketch after this list).
- Writes a new `output_clean.csv` with only the first occurrence of each `CIK`.
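A minimal sketch of that de-duplication, assuming the rows have already been parsed into objects with a `cik` field (the field name is illustrative):

```js
// Keep only the first row seen for each CIK; later duplicates are dropped.
function dedupeByCik(rows) {
  const seen = new Set();
  return rows.filter((row) => {
    if (seen.has(row.cik)) return false; // duplicate: drop it
    seen.add(row.cik);
    return true; // first occurrence wins
  });
}
```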
- `index.js`: Main script to scrape EDGAR sitemaps and create `output.csv`.
- `clean.js`: Utility script to remove duplicates from `output.csv` by the `CIK` column, creating `output_clean.csv`.
- `package.json` (if provided): Lists dependencies (`node-fetch`, `xml2js`, etc.) and scripts for easy running. A minimal example is sketched below.
- `README.md` (this file): Documentation for usage and setup.
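If `package.json` is not included, a minimal one might look like this (package versions are illustrative):

```json
{
  "name": "edgar-form-d-scraper",
  "scripts": {
    "start": "node index.js",
    "clean": "node clean.js"
  },
  "dependencies": {
    "dotenv": "^16.4.5",
    "node-fetch": "^2.7.0",
    "xml2js": "^0.6.2"
  }
}
```

`node-fetch` is pinned to v2 here because v3 is ESM-only and cannot be loaded with `require()`.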
- Node.js (v14+ recommended)
- npm or yarn for installing packages
- Packages used:
  - `node-fetch` for HTTP requests
  - `xml2js` for XML parsing
  - `fs` (built-in) for file system operations
  - `dotenv` for managing environment variables
Install dependencies by running:

```bash
npm install
```
Update `.env` variables as needed:

- `CONTACT_EMAIL`: The email address to include in requests.
Run the main script:

```bash
npm run start
```
Upon completion, the script creates (or overwrites) an `output.csv` file containing the combined results.
After you have `output.csv`, run the cleaning script:

```bash
npm run clean
```
This reads `output.csv` and writes `output_clean.csv` with unique rows by `CIK` (keeping the first entry for each `CIK`).
Environment variables are stored in a `.env` file. Create a `.env` file in the project root with the following variables:

```
CONTACT_EMAIL=[email protected]
```

These variables are loaded automatically by the `dotenv` package.
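A minimal sketch of how the variable is loaded and used. The header value is illustrative, but the SEC does ask automated tools to identify themselves with contact information in the `User-Agent`:

```js
// Load .env into process.env before anything reads configuration.
require('dotenv').config();

// Illustrative: include the contact email in requests to the SEC.
const secHeaders = {
  'User-Agent': `form-d-scraper ${process.env.CONTACT_EMAIL}`,
};
```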
Inside `index.js`, you'll find:

```js
const sitemapUrls = [
  'https://www.sec.gov/Archives/edgar/daily-index/2025/QTR1/sitemap.20250102.xml',
  'https://www.sec.gov/Archives/edgar/daily-index/2025/QTR1/sitemap.20250103.xml',
  // ...
];
```
Add or remove EDGAR sitemaps based on the date ranges you need. A reference to all the daily XML files is available in the EDGAR daily index (https://www.sec.gov/Archives/edgar/daily-index/).
The variable:

```js
const delayMs = 100;
```

controls how long the script waits between network requests (in milliseconds). This helps avoid 429 errors from the SEC website. The SEC's current limit is 10 requests per second, so 100 ms should be fine, but note that requests you make by browsing the site manually count against the same limit.
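A minimal sketch of that pacing, assuming a `handleUrl` function that stands in for the real fetch/parse step:

```js
const delayMs = 100; // ~10 requests/second stays under the SEC limit

// Resolve after `ms` milliseconds; awaiting this paces the loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processAll(urls) {
  for (const url of urls) {
    await handleUrl(url); // stand-in for the actual fetch/parse step
    await sleep(delayMs);
  }
}
```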
You might see something like:

```js
// const MAX_GOOD_COUNT = 3;
```

If uncommented, it limits the script to 3 successful findings and then stops (helpful for testing). If you want all possible records, leave the line commented out or set the value to `null`. A sketch of how such a cap might work is below.
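Here `enrichEntity` is a stand-in name for the per-entity lookup, not necessarily what `index.js` calls it:

```js
const MAX_GOOD_COUNT = 3; // set to null to process every record

async function run(entities) {
  let goodCount = 0;
  for (const entity of entities) {
    const hit = await enrichEntity(entity); // stand-in for the Clearbit lookup
    if (hit) goodCount += 1;
    if (MAX_GOOD_COUNT !== null && goodCount >= MAX_GOOD_COUNT) break; // stop early
  }
}
```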
- The endpoint used does not require an API key, but may throttle you if you make too many requests (see the sketch after this list).
- It returns an array of name/domain suggestions, which are stored in the `clearbitNames` and `clearbitDomains` columns.
- If the call fails or no suggestions are found, those columns remain blank.
- I have found this process doesn't return a lot of values, but I couldn't come up with a better free automated solution.
- I would also not rely on HubSpot (which acquired Clearbit) keeping this endpoint open (I'm not sure why this ungated endpoint ever existed anyhow).
- The SEC generally allows up to 10 requests per second.
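A minimal sketch of the Autocomplete call. The URL below is the public ungated endpoint; the response shape (`name`/`domain` fields) is as commonly documented:

```js
const fetch = require('node-fetch');

async function suggestCompanies(query) {
  const url =
    'https://autocomplete.clearbit.com/v1/companies/suggest?query=' +
    encodeURIComponent(query);
  const res = await fetch(url);
  if (!res.ok) return []; // throttled or failed: caller leaves columns blank
  return res.json(); // e.g. [{ name: 'Acme Inc', domain: 'acme.com', ... }]
}
```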
- `clean.js` keeps only the first row for each `CIK`. If you have multiple rows with the same `CIK`, only one remains.
- If your data can have variations in how the `CIK` is quoted or capitalized, you might need additional normalization logic.
- The current scripts do naive splitting on commas. If your CSV data contains commas in quoted fields, consider using a robust CSV parser (see the sketch below).
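One option for that is the `csv-parse` package (not currently a dependency); a sketch:

```js
const fs = require('fs');
const { parse } = require('csv-parse/sync'); // npm install csv-parse

const rows = parse(fs.readFileSync('output.csv', 'utf8'), {
  columns: true,          // rows become objects keyed by header name
  skip_empty_lines: true,
});
// `rows` can now be de-duplicated by the CIK column as before,
// with quoted fields containing commas handled correctly.
```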