Live Site: https://epsteingate.org
LLM-powered tooling for triaging the U.S. House Oversight Epstein Estate document release. This project:
- Streams the document corpus through a locally hosted, open-source model (`openai/gpt-oss-120b` running via LM Studio) to produce ranked, structured leads.
- Ships a dashboard (`viewer/`) so investigators can filter, chart, and inspect every scored document (including the full source text) offline.
The entire workflow operates on a single MacBook Pro (M3 Max, 128 GB RAM). With an average draw of 100 W, a 60-hour pass consumes ≈6 kWh ($1.50 at SoCal off-peak rates) with zero cloud/API spend.
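For reference, the arithmetic behind that estimate is shown below (a minimal sketch; the 100 W average draw and an assumed ~$0.25/kWh off-peak rate are what the figures above imply):

```python
# Back-of-envelope electricity cost for one full local ranking pass.
avg_draw_watts = 100      # average whole-machine draw stated above
run_hours = 60            # length of one full pass
rate_usd_per_kwh = 0.25   # assumed SoCal off-peak rate implied by the $1.50 figure

energy_kwh = avg_draw_watts * run_hours / 1000   # 100 W * 60 h = 6 kWh
cost_usd = energy_kwh * rate_usd_per_kwh         # 6 kWh * $0.25/kWh = $1.50

print(f"{energy_kwh:.1f} kWh ≈ ${cost_usd:.2f}")
```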
Status: the corpus contains ~25,800 documents from the House Oversight Committee release. The first 10,000 entries (rows 1–10,000) have been analyzed and scored and are available in `contrib/epstein_ranked_*.jsonl` chunk files.
- Completed: Rows 1–10,000 (10 chunk files)
- Contributors welcome: See Collaborative ranking workflow below to help process the remaining documents
Screenshots: Table View, Insights & Charts, and the Methodology Explainer.
The repository’s base dataset is the “20,000 Epstein Files” text corpus prepared by tensonaut, who OCR’d ~25,000 pages released by the U.S. House Committee on Oversight and Government Reform.
Key references:
- Hugging Face dataset: https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K
- Original release: Oversight Committee Releases Additional Epstein Estate Documents (Nov 12, 2025)
Corpus outline (summarized from the dataset card):
- 25,000 plain-text documents derived from the committee’s Google Drive distribution.
- `TEXT/` files were converted directly from native text sources; `IMAGES/` files (≈20k JPGs) were OCR’d with Tesseract.
- Filenames preserve the source path so you can cross-reference with the official release.
- No manual edits beyond text extraction/OCR; expect OCR noise, redaction markers, or broken formatting.
- Legal status: documents originate from the House release and retain any original copyright. This repo asserts no ownership and offers no legal advice—ensure your use complies with applicable law.
- Content warning: expect references to sexual abuse, exploitation, trafficking, violence, and unverified allegations.
Huge thanks to tensonaut for the foundational OCR and dataset packaging; this project simply layers ranking and analytics on top.
- Python 3.9+ with `requests`
- LM Studio (or another OpenAI-compatible gateway) serving `openai/gpt-oss-120b` locally at `http://localhost:5002/v1`
- The dataset CSV (`data/EPS_FILES_20K_NOV2026.csv`). Not included in this repo—download it from the Hugging Face link above and place it in `data/` (see `data/README.md` for instructions; a programmatic download sketch follows this list).
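If you prefer to fetch the dataset programmatically, the sketch below uses `huggingface_hub` (not a dependency of this project) to pull the dataset repo. The exact CSV filename inside the Hugging Face repo may differ from the local name the ranker expects, so check the dataset card and `data/README.md` before renaming.

```python
# Sketch: download the tensonaut dataset repo, then place the CSV under data/.
# huggingface_hub is NOT a project dependency; install it separately if you use this.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="tensonaut/EPSTEIN_FILES_20K",
    repo_type="dataset",
)
print("Downloaded to:", local_dir)
# Copy/rename the CSV from local_dir to data/EPS_FILES_20K_NOV2026.csv
# (see data/README.md for the expected layout).
```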
Install Python deps (only `requests` is needed):

```bash
python -m pip install -r requirements.txt   # or just: python -m pip install requests
cp ranker_config.example.toml ranker_config.toml   # optional defaults
```
```bash
python gpt_ranker.py \
  --input data/EPS_FILES_20K_NOV2026.csv \
  --output data/epstein_ranked.csv \
  --json-output data/epstein_ranked.jsonl \
  --endpoint http://localhost:5002/v1 \
  --model openai/gpt-oss-120b \
  --resume \
  --sleep 0.5 \
  --config ranker_config.toml \
  --reasoning-effort low
```

By default, the ranker writes 1,000-row chunks to `contrib/` and updates `data/chunks.json`. Set `--chunk-size 0` if you really want a single CSV/JSONL output (not recommended for sharing).
Notable flags:
- `--prompt-file`: specify a custom system prompt file (defaults to `prompts/default_system_prompt.txt`). See `prompts/README.md` for details on creating custom prompts.
- `--system-prompt`: provide an inline system prompt string (overrides `--prompt-file`).
- `--resume`: skips rows already present in the JSONL/checkpoint so you can stop/restart long runs.
- `--checkpoint data/.epstein_checkpoint`: stores processed filenames to guard against duplication.
- `--reasoning-effort low/high`: trade accuracy for speed if your model exposes the reasoning control knob.
- `--include-action-items`: opt in if you want the LLM to list action items (off by default for brevity).
- `--timeout`: HTTP request timeout in seconds (default: 600 = 10 minutes). Increase for very large documents (100K+ tokens).
- `--max-rows N`: smoke-test on a small subset.
- `--list-models`: query your endpoint for available model IDs.
- `--rebuild-manifest`: scan `contrib/` for chunk files and rebuild `data/chunks.json` (useful if the manifest gets out of sync).
- `--start-row`, `--end-row`: process only a slice of the dataset (ideal for collaborative chunking).
- `--chunk-size`, `--chunk-dir`, `--chunk-manifest`: control chunk splitting, where chunk files live, and where the manifest is written.
- `--overwrite-output`: explicitly allow truncating existing files (default is to refuse unless `--resume` or unique paths are used).
- `--power-watts`, `--electric-rate`, `--run-hours`: plug in your local power draw/cost to estimate total electricity usage (also configurable via the TOML file).
Outputs:
- `contrib/epstein_ranked_<start>_<end>.jsonl` – Default chunked outputs (one file per 1,000 rows) that contain the headline, score, metadata, and original text for each document (a quick inspection sketch follows this list).
- `data/chunks.json` – Manifest listing all chunk files (the viewer uses this to load data).
- `data/epstein_ranked.csv` / `.jsonl` – Only produced if you disable chunking via `--chunk-size 0`.
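To spot-check a finished chunk from Python, here is a minimal sketch. It assumes the JSON field names described in the prompt section (e.g. `importance_score`, `headline`); verify them against your own output, and note the chunk filename is just an example.

```python
# Sketch: read one chunk file and list its highest-scoring documents.
# Field names (importance_score, headline) follow the JSON schema described
# in the prompt section; confirm them against your actual JSONL records.
import json
from pathlib import Path

chunk_path = Path("contrib/epstein_ranked_00001_01000.jsonl")  # example chunk name

records = [
    json.loads(line)
    for line in chunk_path.read_text(encoding="utf-8").splitlines()
    if line.strip()
]
records.sort(key=lambda r: r.get("importance_score", 0), reverse=True)

for rec in records[:10]:
    print(rec.get("importance_score"), "-", rec.get("headline"))
```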
The ranker automatically updates `data/chunks.json` after each chunk is completed. This manifest tells the viewer which chunk files exist and what row ranges they cover.
If the manifest gets out of sync (e.g., due to interrupted runs or manual file moves), you can rebuild it:
```bash
python gpt_ranker.py --rebuild-manifest
```

This scans `contrib/` for all chunk files and regenerates the manifest automatically.
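If you only want to preview what a rebuild would pick up, without touching `data/chunks.json` (whose exact schema is owned by `gpt_ranker.py`), here is a rough sketch of the filename scan:

```python
# Sketch: list chunk files in contrib/ and the row ranges encoded in their names.
# This only inspects filenames; it does not write data/chunks.json.
import re
from pathlib import Path

pattern = re.compile(r"epstein_ranked_(\d+)_(\d+)\.jsonl$")

for path in sorted(Path("contrib").glob("epstein_ranked_*.jsonl")):
    match = pattern.search(path.name)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        print(f"{path.name}: rows {start}-{end}")
```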
The ranker uses a system prompt to instruct the model on how to analyze and score documents. You can customize this prompt to fit your specific needs:
- Create your own prompt file in the `prompts/` directory (e.g., `prompts/my_custom_prompt.txt`).
- Run the ranker with `--prompt-file`:

```bash
python gpt_ranker.py --prompt-file prompts/my_custom_prompt.txt --config ranker_config.toml
```

Or set it in your config file:

```toml
prompt_file = "prompts/my_custom_prompt.txt"
```

See `prompts/README.md` for detailed guidance on creating custom prompts, and check out `prompts/example_strict_scoring.txt` for an example of a stricter scoring methodology.
The ranker loads prompts in this order of priority:
1. `--system-prompt` (inline string argument)
2. `--prompt-file` or `prompt_file` in config
3. `prompts/default_system_prompt.txt` (if it exists)
4. Hardcoded default prompt (fallback)
The prompt source is automatically included in the output metadata for each document, so you can always see which prompt was used for analysis.
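Conceptually, the resolution order looks like the sketch below (illustrative only; the actual logic and the full hardcoded fallback prompt live in `gpt_ranker.py`):

```python
# Sketch of the prompt-resolution order described above (illustrative only;
# see gpt_ranker.py for the real implementation and its full fallback prompt).
from pathlib import Path

def resolve_system_prompt(system_prompt=None, prompt_file=None):
    if system_prompt:                                # 1. inline --system-prompt wins
        return system_prompt, "inline"
    if prompt_file and Path(prompt_file).exists():   # 2. --prompt-file / prompt_file in config
        return Path(prompt_file).read_text(), prompt_file
    default = Path("prompts/default_system_prompt.txt")
    if default.exists():                             # 3. repo default prompt file
        return default.read_text(), str(default)
    return "You analyze primary documents ...", "hardcoded"  # 4. built-in fallback (truncated)
```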
The default prompt instructs the model with the following criteria (excerpt):
You analyze primary documents related to court and investigative filings.
Focus on whether the passage offers potential leads—even if unverified—that connect influential actors ... to controversial actions, financial flows, or possible misconduct.
Score each passage on:
1. Investigative usefulness
2. Controversy / sensitivity
3. Novelty
4. Power linkage
Assign an importance_score from 0 (no meaningful lead) to 100 (blockbuster lead linking powerful actors to fresh controversy). Use the scale consistently:
• 0–10 : noise, duplicates, previously published facts, or gossip with no actors.
• 10–30 : low-value context; speculative or weak leads lacking specifics.
• 30–50 : moderate leads with partial details or missing novelty.
• 50–70 : strong leads with actionable info or notable controversy.
• 70–85 : high-impact, new revelations tying powerful actors to clear misconduct.
• 85–100: blockbuster revelations demanding immediate follow-up.
Reserve 70+ for claims that, if true, would represent major revelations or next-step investigations.
Return strict JSON with fields: headline, importance_score, reason, key_insights, tags, power_mentions, agency_involvement, lead_types.
Rows ≥70 typically surface multi-factor leads (named actors + money trail + novelty). Anything below ~30 is often speculation or previously reported context.
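For orientation, a single scoring call against the local endpoint looks roughly like the sketch below. This is a simplified stand-in for what `gpt_ranker.py` does, not its actual code: retries, checkpointing, chunk bookkeeping, and CSV handling are omitted, and the request parameters shown are assumptions.

```python
# Sketch: score one document against the local OpenAI-compatible endpoint.
# Simplified stand-in for gpt_ranker.py -- no retries, checkpointing, or chunking.
import json
from pathlib import Path
import requests

ENDPOINT = "http://localhost:5002/v1/chat/completions"
MODEL = "openai/gpt-oss-120b"
SYSTEM_PROMPT = Path("prompts/default_system_prompt.txt").read_text(encoding="utf-8")

def score_document(text: str) -> dict:
    resp = requests.post(
        ENDPOINT,
        json={
            "model": MODEL,
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": text},
            ],
            "temperature": 0,  # assumption; the ranker's own sampling settings may differ
        },
        timeout=600,  # matches the ranker's default --timeout
    )
    resp.raise_for_status()
    content = resp.json()["choices"][0]["message"]["content"]
    return json.loads(content)  # expects the strict-JSON fields listed above

# Example:
# result = score_document("…document text…")
# print(result["importance_score"], result["headline"])
```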
Serve the dashboard to explore results, filter, and inspect the full source text of each document:
```bash
./viewer.sh 9000
# or manually:
cd viewer && python -m http.server 9000
```

Open http://localhost:9000. Features:
- Automatically loads any chunk listed in `data/chunks.json` (falls back to `data/epstein_ranked.jsonl` if no chunks exist).
- AG Grid table sorted by importance score (click a row to expand the detail drawer and read the entire document text).
- Filters for score threshold, lead types, power mentions, ad hoc search, and row limits.
- Charts showing lead-type distribution, score histogram, top power mentions, and top agencies.
- Methodology accordion describing the scoring criteria, prompt, and compute footprint.
`viewer/app.js` reads `data/chunks.json` by default, so remember to commit updated manifests + chunk files. If no manifest exists, it automatically scans for files named `contrib/epstein_ranked_*.jsonl` before falling back to `data/epstein_ranked.jsonl`.
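The same loading order is easy to mirror outside the browser if you want to analyze results in a script. A sketch of equivalent logic (not the viewer's actual JavaScript; the manifest schema is owned by `gpt_ranker.py`, so adjust the parsing if your `chunks.json` nests the file list differently):

```python
# Sketch: mirror the viewer's data-loading order in Python
# (manifest -> contrib/ chunk files -> single-JSONL fallback).
import json
from pathlib import Path

def find_result_files():
    manifest = Path("data/chunks.json")
    if manifest.exists():
        entries = json.loads(manifest.read_text(encoding="utf-8"))
        # Assumed schema: a list whose items are chunk file paths or dicts with a
        # "file" key. The real schema is defined by gpt_ranker.py -- adjust as needed.
        return [Path(e["file"]) if isinstance(e, dict) else Path(e) for e in entries]
    chunks = sorted(Path("contrib").glob("epstein_ranked_*.jsonl"))
    if chunks:
        return chunks
    return [Path("data/epstein_ranked.jsonl")]

print(find_result_files())
```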
Want to help process more of the corpus? Fork the repo, claim a range of rows, and submit your results:
- Pick a chunk – e.g., rows `00001–01000`, `01001–02000`, etc. Use whatever increments work. Announce the chunk (issue/Discord) so others don’t duplicate effort.
- Run the ranker on that slice using the new range flags:

  ```bash
  python gpt_ranker.py \
    --config ranker_config.toml \
    --start-row 1001 \
    --end-row 2000 \
    --chunk-dir contrib \
    --chunk-manifest data/chunks.json \
    --known-json data/epstein_ranked.jsonl \
    --resume
  ```

  This only processes documents in that range, emits `contrib/epstein_ranked_<range>.jsonl`, and updates the manifest. `--known-json` makes the script aware of previously merged results (so duplicates are skipped automatically). Combine with `--resume` if you need to pause and continue later.
- Export your outputs – each run writes the chunk JSONL straight into `contrib/`. Keep the naming pattern `contrib/epstein_ranked_<start>_<end>.jsonl`.
- Submit a PR with your chunk (the JSONL + updated `data/chunks.json`). We’ll merge the contributions into the global dataset and credit collaborators in the README.
Guidelines:
- Do not commit the original 100 MB source CSV; each contributor should download it separately.
- Keep the JSONL chunks intact (no reformatting) so we can merge them programmatically.
- If you discover inconsistencies or interesting leads, open an issue to coordinate follow-up analysis.
- Pull the latest `data/chunks.json` (and any merged JSONL files) before starting; pass the merged JSON via `--known-json` so you never duplicate work (see the duplicate-check sketch below).
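Before opening a PR, you can also spot-check that your chunk doesn’t overlap already-merged work. A sketch, assuming each record carries a source-filename field (inspect your JSONL for the exact key and adjust `FILENAME_KEY`; the chunk path shown is just an example):

```python
# Sketch: check a new chunk against merged results for duplicate documents.
# Assumes each record carries a source-filename field; inspect your JSONL for
# the exact key name and adjust FILENAME_KEY accordingly.
import json
from pathlib import Path

FILENAME_KEY = "filename"  # assumed key; verify against your chunk records

def filenames(path: Path) -> set:
    names = set()
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                names.add(json.loads(line).get(FILENAME_KEY))
    return names

merged = filenames(Path("data/epstein_ranked.jsonl"))
mine = filenames(Path("contrib/epstein_ranked_01001_02000.jsonl"))  # example chunk path
overlap = merged & mine
print(f"{len(overlap)} duplicate document(s)" if overlap else "No duplicates found")
```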
- The corpus contains sensitive content (sexual abuse, trafficking, violence, unverified allegations). Use with care.
- Documents are part of the public record but may still be subject to copyright/privacy restrictions; verify before sharing or redistributing.
- Recommended use cases: investigative triage, exploratory data analysis, RAG/IR experiments, or academic review.
- This project does not assert any claims about the veracity of individual documents—scores merely prioritize leads for deeper human review.
- tensonaut for compiling the OCR corpus and publishing it to Hugging Face.
- U.S. House Committee on Oversight and Government Reform for releasing the source documents.
- The LM Studio community & r/LocalLLaMA for pushing local LLM workflows forward.
Released under the MIT License. Please retain attribution to this project, the tensonaut dataset, and the U.S. House Oversight Committee release when building derivative tools or analyses.


