This repository contains a Node.js script that fetches daily EDGAR sitemap XML files, filters for Form D filings (where `submissionType="D"` and `testOrLive="LIVE"`), and enriches each issuer with possible domains using Clearbit's Autocomplete API. Results are output to a CSV file. A separate cleaning script then de-duplicates the CSV by the `CIK` column.
- Fetches a list of EDGAR sitemap XML files.
- Parses each sitemap to locate `primary_doc.xml` for Form D filings.
- Filters out anything that is not `submissionType="D"` and `testOrLive="LIVE"` (see the sketch after this list).
- For each entity, calls Clearbit Autocomplete to find possible company names/domains.
- Writes results to `output.csv`.
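A minimal sketch of the fetch-and-filter step. The function name, the `User-Agent` string, and the exact XML layout of `primary_doc.xml` are assumptions for illustration, not necessarily what `index.js` does:

```js
const fetch = require('node-fetch');
const xml2js = require('xml2js');

// Fetch one primary_doc.xml and check whether it is a LIVE Form D filing.
async function isLiveFormD(primaryDocUrl, contactEmail) {
  const res = await fetch(primaryDocUrl, {
    headers: { 'User-Agent': `form-d-scraper ${contactEmail}` },
  });
  const doc = await xml2js.parseStringPromise(await res.text());
  // Assumed layout: Form D nests these fields under <edgarSubmission>.
  const sub = doc.edgarSubmission;
  return sub.submissionType[0] === 'D' && sub.testOrLive[0] === 'LIVE';
}
```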
- Reads `output.csv`.
- De-duplicates rows based on the `CIK` column (see the sketch after this list).
- Writes a new `output_clean.csv` with only the first occurrence of each `CIK`.
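A minimal sketch of that de-duplication, assuming the rows have already been parsed into objects with a `cik` field (the field name is illustrative):

```js
// Keep only the first row seen for each CIK; later duplicates are dropped.
function dedupeByCik(rows) {
  const seen = new Set();
  return rows.filter((row) => {
    if (seen.has(row.cik)) return false; // duplicate: drop it
    seen.add(row.cik);
    return true; // first occurrence wins
  });
}
```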
- `index.js`: Main script to scrape EDGAR sitemaps and create `output.csv`.
- `clean.js`: Utility script to remove duplicates from `output.csv` by the `CIK` column, creating `output_clean.csv`.
- `package.json` (if provided): Lists dependencies (`node-fetch`, `xml2js`, etc.) and scripts for easy running. A minimal example is sketched below.
- `README.md` (this file): Documentation for usage and setup.
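If `package.json` is not included, a minimal one might look like this (package versions are illustrative):

```json
{
  "name": "edgar-form-d-scraper",
  "scripts": {
    "start": "node index.js",
    "clean": "node clean.js"
  },
  "dependencies": {
    "dotenv": "^16.4.5",
    "node-fetch": "^2.7.0",
    "xml2js": "^0.6.2"
  }
}
```

`node-fetch` is pinned to v2 here because v3 is ESM-only and cannot be loaded with `require()`.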
- Node.js (v14+ recommended)
- npm or yarn for installing packages
- Packages used:
  - `node-fetch` for HTTP requests
  - `xml2js` for XML parsing
  - `fs` (built-in) for file system operations
  - `dotenv` for managing environment variables
Install dependencies by running:

```bash
npm install
```
Update `.env` variables as needed:

- `CONTACT_EMAIL`: The email address to include in requests.
Run the main script:

```bash
npm run start
```
Upon completion, the script creates (or overwrites) an `output.csv` file containing the combined results.
After you have `output.csv`, run the cleaning script:

```bash
npm run clean
```
This reads `output.csv` and writes `output_clean.csv` with unique rows by `CIK` (keeping the first entry for each `CIK`).
Environment variables are stored in a `.env` file. Create a `.env` file in the project root with the following variables:

```
CONTACT_EMAIL=[email protected]
```

These variables are loaded automatically by the `dotenv` package.
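A minimal sketch of how the variable is loaded and used. The header value is illustrative, but the SEC does ask automated tools to identify themselves with contact information in the `User-Agent`:

```js
// Load .env into process.env before anything reads configuration.
require('dotenv').config();

// Illustrative: include the contact email in requests to the SEC.
const secHeaders = {
  'User-Agent': `form-d-scraper ${process.env.CONTACT_EMAIL}`,
};
```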
Inside `index.js`, you'll find:

```js
const sitemapUrls = [
  'https://www.sec.gov/Archives/edgar/daily-index/2025/QTR1/sitemap.20250102.xml',
  'https://www.sec.gov/Archives/edgar/daily-index/2025/QTR1/sitemap.20250103.xml',
  // ...
];
```
Add or remove EDGAR sitemaps based on the date ranges you need. A reference to all the daily XML files is available in the EDGAR daily index (https://www.sec.gov/Archives/edgar/daily-index/).
The variable:

```js
const delayMs = 100;
```

controls how long the script waits between network requests (in milliseconds). This helps avoid 429 errors from the SEC website. The SEC's current limit is 10 requests per second, so 100 ms should be fine, but note that requests you make by browsing the site manually count against the same limit.
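A minimal sketch of that pacing, assuming a `handleUrl` function that stands in for the real fetch/parse step:

```js
const delayMs = 100; // ~10 requests/second stays under the SEC limit

// Resolve after `ms` milliseconds; awaiting this paces the loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processAll(urls) {
  for (const url of urls) {
    await handleUrl(url); // stand-in for the actual fetch/parse step
    await sleep(delayMs);
  }
}
```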
You might see something like:

```js
// const MAX_GOOD_COUNT = 3;
```

If uncommented, it limits the script to 3 successful findings and then stops (helpful for testing). If you want all possible records, leave the line commented out or set the value to `null`. A sketch of how such a cap might work is below.
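Here `enrichEntity` is a stand-in name for the per-entity lookup, not necessarily what `index.js` calls it:

```js
const MAX_GOOD_COUNT = 3; // set to null to process every record

async function run(entities) {
  let goodCount = 0;
  for (const entity of entities) {
    const hit = await enrichEntity(entity); // stand-in for the Clearbit lookup
    if (hit) goodCount += 1;
    if (MAX_GOOD_COUNT !== null && goodCount >= MAX_GOOD_COUNT) break; // stop early
  }
}
```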
- The endpoint used does not require an API key, but may throttle you if you make too many requests (see the sketch after this list).
- It returns an array of name/domain suggestions, which are stored in the `clearbitNames` and `clearbitDomains` columns.
- If the call fails or no suggestions are found, those columns remain blank.
- I have found this process doesn't return a lot of values, but I couldn't come up with a better free automated solution.
- I would also not rely on HubSpot (which acquired Clearbit) keeping this endpoint open (I'm not sure why this ungated endpoint ever existed anyhow).
- The SEC generally allows up to 10 requests per second.
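A minimal sketch of the Autocomplete call. The URL below is the public ungated endpoint; the response shape (`name`/`domain` fields) is as commonly documented:

```js
const fetch = require('node-fetch');

async function suggestCompanies(query) {
  const url =
    'https://autocomplete.clearbit.com/v1/companies/suggest?query=' +
    encodeURIComponent(query);
  const res = await fetch(url);
  if (!res.ok) return []; // throttled or failed: caller leaves columns blank
  return res.json(); // e.g. [{ name: 'Acme Inc', domain: 'acme.com', ... }]
}
```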
- `clean.js` keeps only the first row for each `CIK`. If you have multiple rows with the same `CIK`, only one remains.
- If your data can have variations in how the `CIK` is quoted or capitalized, you might need additional normalization logic.
- The current scripts do naive splitting on commas. If your CSV data contains commas in quoted fields, consider using a robust CSV parser (see the sketch below).
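One option for that is the `csv-parse` package (not currently a dependency); a sketch:

```js
const fs = require('fs');
const { parse } = require('csv-parse/sync'); // npm install csv-parse

const rows = parse(fs.readFileSync('output.csv', 'utf8'), {
  columns: true,          // rows become objects keyed by header name
  skip_empty_lines: true,
});
// `rows` can now be de-duplicated by the CIK column as before,
// with quoted fields containing commas handled correctly.
```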