Funded Companies

This repository contains a Node.js script that fetches daily EDGAR sitemap XML files, filters for Form D filings (where submissionType="D" and testOrLive="LIVE"), and enriches each issuer with possible domains using Clearbit’s Autocomplete API. Results are output to a CSV file. A separate cleaning script then de-duplicates the CSV by the CIK column.

Table of Contents

  • Overview
  • Files
  • Dependencies
  • Usage
  • Configuration
  • Notes & Caveats

Overview

index.js

  • Fetches a list of EDGAR sitemap XML files.
  • Parses each sitemap to locate primary_doc.xml for Form D filings.
  • Filters out any filing where submissionType is not "D" or testOrLive is not "LIVE".
  • For each entity, calls Clearbit Autocomplete to find possible company names/domains.
  • Writes results to output.csv.
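The filtering steps above can be sketched as follows. This is an illustrative sketch, not the repository's exact code: the helper names are hypothetical, and while index.js uses xml2js, a regex is used here to keep the sketch dependency-free.

```javascript
// Pull every <loc> entry out of a daily EDGAR sitemap and keep only the
// URLs that point at a primary_doc.xml file.
function extractPrimaryDocUrls(sitemapXml) {
  const urls = [];
  const locPattern = /<loc>([^<]+)<\/loc>/g;
  let match;
  while ((match = locPattern.exec(sitemapXml)) !== null) {
    const url = match[1].trim();
    if (url.endsWith('primary_doc.xml')) urls.push(url);
  }
  return urls;
}

// A filing is kept only when it is a live Form D submission.
function isLiveFormD(filing) {
  return filing.submissionType === 'D' && filing.testOrLive === 'LIVE';
}
```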

clean.js

  • Reads output.csv.
  • De-duplicates rows based on the CIK column.
  • Writes a new output_clean.csv with only the first occurrence of each CIK.
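The deduplication logic boils down to a first-seen-wins pass over the rows. A minimal sketch (the real clean.js also handles reading and writing the CSV files; the function name is illustrative):

```javascript
// Keep only the first row encountered for each CIK value.
function dedupeByCik(rows) {
  const seen = new Set();
  const result = [];
  for (const row of rows) {
    if (!seen.has(row.cik)) {
      seen.add(row.cik);
      result.push(row);
    }
  }
  return result;
}
```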

Files

  • index.js: Main script to scrape EDGAR sitemaps and create output.csv.
  • clean.js: Utility script to remove duplicates from output.csv by the CIK column, creating output_clean.csv.
  • package.json (if provided): Lists dependencies (node-fetch, xml2js, etc.) and scripts for easy running.
  • README.md (this file): Documentation for usage and setup.

Dependencies

  • Node.js (v14+ recommended)
  • npm or yarn for installing packages
  • Packages used:
    • node-fetch for HTTP requests
    • xml2js for XML parsing
    • fs (built-in) for file system operations
    • dotenv for managing environment variables

Install dependencies by running:

npm install

Usage

1. Fetch & Enrich

Update .env variables as needed:

  • CONTACT_EMAIL: The email address to include in requests.

Run the main script:

npm run start

Upon completion, the script creates (or overwrites) an output.csv file containing the combined results.

2. Deduplicate

After you have output.csv:

Run the cleaning script:

npm run clean

This reads output.csv and writes output_clean.csv with unique rows by CIK (keeps the first entry for each CIK).

Configuration

Environment Variables

Environment variables are stored in a .env file. Create a .env file in the project root with the following variables:

CONTACT_EMAIL=[email protected]

These variables will be automatically loaded by the dotenv package.
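A sketch of how CONTACT_EMAIL typically ends up in the outgoing requests (the exact header format in index.js may differ; `buildSecHeaders` is a hypothetical helper). The SEC asks automated clients to identify themselves via the User-Agent header, including a contact email:

```javascript
// Build the request headers used when fetching from sec.gov. In index.js
// the email would come from process.env.CONTACT_EMAIL after
// require('dotenv').config() has run.
function buildSecHeaders(contactEmail) {
  return {
    'User-Agent': `funded-companies ${contactEmail}`,
  };
}
```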

Sitemap URLs

Inside index.js, you’ll find:

const sitemapUrls = [
  'https://www.sec.gov/Archives/edgar/daily-index/2025/QTR1/sitemap.20250102.xml',
  'https://www.sec.gov/Archives/edgar/daily-index/2025/QTR1/sitemap.20250103.xml',
  // ...
];

Add or remove EDGAR sitemaps based on the date ranges you need. A listing of all the daily XML files is available under EDGAR's daily-index directory (https://www.sec.gov/Archives/edgar/daily-index/).
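Rather than listing URLs by hand, you can generate them from dates. A convenience sketch (not part of the repo; the URL pattern and quarter layout match the examples above):

```javascript
// Build the daily sitemap URL for a given UTC date.
function sitemapUrlForDate(date) {
  const year = date.getUTCFullYear();
  const qtr = Math.floor(date.getUTCMonth() / 3) + 1; // Jan-Mar = QTR1, etc.
  const ymd = date.toISOString().slice(0, 10).replace(/-/g, '');
  return `https://www.sec.gov/Archives/edgar/daily-index/${year}/QTR${qtr}/sitemap.${ymd}.xml`;
}
```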

Delay & Rate Limit

The variable:

const delayMs = 100;

controls how long the script waits between network requests (in milliseconds). This helps avoid HTTP 429 errors from the SEC website, whose current limit is 10 requests per second, so 100 ms keeps you just under it. Note that requests you make by browsing the site manually count against the same limit.
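A typical implementation of that delay is a promise-based sleep awaited between fetches (a sketch; the function name is illustrative):

```javascript
// Resolve after ms milliseconds; awaited inside the fetch loop to stay
// under the SEC's 10 requests/second ceiling.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage inside the fetch loop:
//   for (const url of sitemapUrls) {
//     const res = await fetch(url, { headers });
//     // ...process res...
//     await sleep(delayMs);
//   }
```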

Max Good Count

You might see something like:

// const MAX_GOOD_COUNT = 3;

If uncommented, it stops the script after 3 successful findings (helpful for testing). To fetch all possible records, leave the line commented out or set the value to null.

Notes & Caveats

Clearbit Autocomplete

  • The endpoint used does not require an API key but may throttle you if you make too many requests.
  • It provides an array of name/domain suggestions. We store them in clearbitNames and clearbitDomains columns.
  • If the call fails or no suggestions are found, those columns remain blank.
  • In practice this process doesn't return many matches, but I couldn't come up with a better free, automated solution.
  • I would also not rely on HubSpot (which acquired Clearbit) keeping this endpoint open (I'm not sure why this ungated endpoint ever existed anyhow).
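The enrichment step can be sketched as follows. The Autocomplete endpoint (`https://autocomplete.clearbit.com/v1/companies/suggest?query=...`) returns an array of objects with `name` and `domain` fields; here a hypothetical helper folds that array into the two CSV columns described above, with the network fetch omitted:

```javascript
// Turn a Clearbit Autocomplete response array into the clearbitNames /
// clearbitDomains CSV columns; blank strings when nothing came back.
function toClearbitColumns(suggestions) {
  if (!Array.isArray(suggestions) || suggestions.length === 0) {
    return { clearbitNames: '', clearbitDomains: '' };
  }
  return {
    clearbitNames: suggestions.map((s) => s.name).join('; '),
    clearbitDomains: suggestions.map((s) => s.domain).join('; '),
  };
}
```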

Rate Limits

  • The SEC generally allows up to 10 requests per second.

Deduplication by CIK

  • clean.js keeps only the first row for each CIK. If you have multiple rows with the same CIK, only one remains.
  • If your data can vary in how the CIK is quoted or formatted (e.g., leading zeros), you might need additional normalization logic.

CSV Parsing

  • The current scripts do naive splitting on commas. If your CSV data contains commas in quoted fields, consider using a robust CSV parser.
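To illustrate why naive splitting fails, here is a minimal quote-aware line splitter (a sketch; in practice a library such as csv-parse is the safer choice):

```javascript
// Split one CSV line into fields, honoring double-quoted fields and
// RFC 4180-style "" escapes, so commas inside quotes don't break rows.
function splitCsvLine(line) {
  const fields = [];
  let field = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"') {
        if (line[i + 1] === '"') { field += '"'; i++; } // escaped quote
        else inQuotes = false;
      } else field += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === ',') { fields.push(field); field = ''; }
    else field += ch;
  }
  fields.push(field);
  return fields;
}
```

A naive `line.split(',')` would turn `"Acme, Inc.",123` into three fields; the parser above keeps it as two.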
