Ultimate Privacy Research Scraper – Converts peer‑reviewed privacy & anonymity papers into clean, reproducible, auditable datasets ready for analysis, dashboards, and code.
Live Actor: https://console.apify.com/oblate_wildcat/privacy-stack
GitHub: https://github.com/BikramBiswas786/privacy-stack
Privacy Stack is an Apify Actor that builds a large‑scale, high‑quality research corpus of real arXiv papers in security & privacy.
It scrapes and normalizes 5000 unique arXiv cs.CR papers across 4 critical categories, making it easy to explore, filter, and build on the latest privacy research without touching the arXiv UI.
You get:
- A clean JSON/CSV dataset you can drop into analysis pipelines
- Strong deduplication guarantees
- A stable schema designed for LLMs, dashboards, and downstream tools
Each run targets exactly 1250 papers per category, for a total of 5000 unique cs.CR papers:
🌐 Internet Privacy
- Tor, mix networks, I2P, VPNs, onion routing
- Traffic analysis attacks & defenses
- Website fingerprinting, metadata‑hiding systems
🔐 Crypto Privacy
- Zero‑knowledge proofs (zk‑SNARKs, zk‑STARKs)
- FHE, MPC, Bulletproofs, Pedersen commitments
- Privacy coins (Zcash, Monero), mixer protocols, CoinJoin
📊 Data Privacy
- Differential privacy (local & global)
- Federated learning, secure aggregation
- Synthetic data, re‑identification resistance, anonymization
⚛️ Post‑Quantum / PQ Security
- Kyber, Dilithium, SPHINCS+, Falcon
- Lattice‑based crypto, hash‑based signatures
- PQ‑safe anonymous communication & key exchange
Each paper is tagged with a primary category plus the full arXiv category string, so you can slice the dataset however you want.

✅ 100% real arXiv papers
Directly scraped from arxiv.org (cs.CR) – no synthetic titles, no hallucinations, no fake IDs.
✅ 5000 UNIQUE papers
Global deduplication by arXiv ID, plus per‑category deduplication, so the same paper is never counted twice within a category.
✅ Balanced categories
1250 papers for each of the 4 categories → balanced training/test sets for ML and fair comparisons between research areas.
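Because every category holds the same number of papers, a balanced train/test split is straightforward. A minimal sketch in plain Python (the helper name and split ratio are illustrative, not part of the Actor):

```python
import random

def stratified_split(papers, test_frac=0.2, seed=42):
    """Split a list of paper dicts into train/test sets,
    keeping each short_category balanced (illustrative helper)."""
    rng = random.Random(seed)
    by_cat = {}
    for p in papers:
        by_cat.setdefault(p["short_category"], []).append(p)
    train, test = [], []
    for cat_papers in by_cat.values():
        rng.shuffle(cat_papers)
        cut = int(len(cat_papers) * (1 - test_frac))
        train.extend(cat_papers[:cut])
        test.extend(cat_papers[cut:])
    return train, test
```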
✅ Production‑grade dataset schema
Designed for:
- LLM context building
- dashboards (Grafana/Metabase/Superset)
- offline analytics (Python/pandas, DuckDB, BigQuery)
✅ Zero manual setup on Apify
No `requirements.txt` needed – runs on Apify's managed Python runtime.
✅ Repeatable & auditable
Same input → same structure, easy to diff across runs as new papers appear on arXiv.
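Because the structure is stable across runs, diffing two exports reduces to a set difference on arXiv IDs. A minimal sketch, assuming two item lists loaded from JSON (the helper name is illustrative):

```python
def new_papers(old_items, new_items):
    """Return papers present in the new run but not the old one,
    keyed by arxiv_id (sketch for diffing two exports)."""
    old_ids = {p["arxiv_id"] for p in old_items}
    return [p for p in new_items if p["arxiv_id"] not in old_ids]
```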
Each paper in the dataset has a consistent JSON structure:

```json
{
  "id": 1,
  "title": "Device-Independent Anonymous Communication",
  "arxiv_id": "2512.21047",
  "full_category": "cs.CR (Internet Privacy)",
  "short_category": "internet_privacy",
  "authors": ["John Doe", "Jane Smith"],
  "url": "https://arxiv.org/abs/2512.21047",
  "pdf_url": "https://arxiv.org/pdf/2512.21047.pdf",
  "is_real_arxiv": true,
  "published": "2025-12-21",
  "updated": "2025-12-23",
  "abstract": "We propose a device-independent protocol for anonymous communication...",
  "source_run_id": "RUN_ID_FOR_AUDIT"
}
```

When you run Privacy Stack in Apify Console, the Output tab is split into multiple views (using the dataset schema):
- 📚 All Papers (5000) – full corpus merged
- 🌐 Internet Privacy (1250) – Tor, mixnets, I2P, traffic analysis
- 🔐 Crypto Privacy (1250) – ZK, FHE, MPC, crypto protocols
- 📊 Data Privacy (1250) – DP, FL, anonymization, re‑identification
- ⚛️ Post‑Quantum (1250) – Kyber, Dilithium, PQ anonymous systems
- 📋 Live Logs – scrape progress, dedup stats, category counts
Each view is sortable & filterable directly in the Apify Console, and also accessible as CSV/JSON via API.
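For downstream tooling it can help to validate each record against the schema shown above. A minimal sketch using a standard-library dataclass (field names come from the documented schema; the `parse_paper` helper is hypothetical):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Paper:
    id: int
    title: str
    arxiv_id: str
    full_category: str
    short_category: str
    authors: List[str]
    url: str
    pdf_url: str
    is_real_arxiv: bool
    published: Optional[str] = None
    updated: Optional[str] = None
    abstract: Optional[str] = None
    source_run_id: Optional[str] = None

def parse_paper(raw: dict) -> Paper:
    # Raises TypeError if a required field is missing or unexpected
    return Paper(**raw)
```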
Example record:

```json
{
  "id": 42,
  "title": "Traffic Analysis Resistant Mix Networks for the Modern Internet",
  "arxiv_id": "2507.12345",
  "full_category": "cs.CR (Cryptography and Security)",
  "short_category": "internet_privacy",
  "authors": ["Alice Anon", "Bob Mixnet"],
  "url": "https://arxiv.org/abs/2507.12345",
  "pdf_url": "https://arxiv.org/pdf/2507.12345.pdf",
  "is_real_arxiv": true
}
```

Under the hood, each run follows these steps:
- Input: categories + maximum papers per category (defaults to 1250 × 4).
- Fetch arXiv feeds / search results for each category (cs.CR plus keywords / sub‑tags).
- Normalize results into the unified schema: parse titles, authors, IDs, URLs, dates, and category strings.
- Deduplicate: global deduplication by arxiv_id, then ensure each category's slice contains only unique entries.
- Store into an Apify Dataset with multiple views (all + per‑category).
The Actor is designed to be idempotent in terms of structure, but you will naturally see newer papers when you re‑run it over time.
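The deduplication step described above can be sketched in plain Python, assuming a list of normalized paper dicts (the function is illustrative, not the Actor's actual code):

```python
def dedupe(papers, per_category_cap=1250):
    """Keep the first occurrence of each arxiv_id globally,
    and at most per_category_cap papers per short_category."""
    seen_ids = set()
    per_cat_counts = {}
    out = []
    for p in papers:
        aid = p["arxiv_id"]
        cat = p["short_category"]
        if aid in seen_ids:
            continue  # global duplicate
        if per_cat_counts.get(cat, 0) >= per_category_cap:
            continue  # category slice is already full
        seen_ids.add(aid)
        per_cat_counts[cat] = per_cat_counts.get(cat, 0) + 1
        out.append(p)
    return out
```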
To run it in the Apify Console:

- Open the Privacy Stack Actor: https://console.apify.com/oblate_wildcat/privacy-stack
- Set input (optional):
  - maxPapersPerCategory (default 1250)
  - category toggles (if you only want 1–2 categories)
- Click Run.
- When it finishes, open the Output tab and browse All Papers, or switch to a specific category view.
- Export as:
  - JSON (items?clean=true)
  - CSV (items?format=csv)
  - HTML table (for quick browsing)
To run locally with the Apify CLI:

```shell
apify run privacy-stack-research-scraper
```

This will:
- run the Actor locally
- store the dataset in ./storage/datasets/default/
- let you inspect OUTPUT.json or the CSV export in that folder
Example: load the dataset into pandas via the API:

```python
import requests
import pandas as pd

DATASET_URL = "https://api.apify.com/v2/datasets/<DATASET_ID>/items?clean=true"

res = requests.get(DATASET_URL)
res.raise_for_status()
items = res.json()

df = pd.DataFrame(items)

# Example: show recent ZK papers
zk_df = df[df['title'].str.contains("zero-knowledge", case=False, na=False)]
print(zk_df[['title', 'arxiv_id', 'url']].head())

# Example: count papers per short_category
print(df['short_category'].value_counts())
```
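For the full 5000‑item corpus, the Apify dataset items endpoint also accepts offset and limit query parameters, so a paginated fetch can be sketched as follows (URL construction only; check the current Apify API docs for exact limits):

```python
def page_urls(dataset_id, total=5000, limit=1000):
    """Build paginated item URLs for an Apify dataset
    (offset/limit are standard dataset-items query params)."""
    base = f"https://api.apify.com/v2/datasets/{dataset_id}/items"
    return [
        f"{base}?clean=true&offset={offset}&limit={limit}"
        for offset in range(0, total, limit)
    ]
```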
Use cases:

- Literature review for PhD / MSc / paper writing – quickly get 5000+ relevant cs.CR papers organized by topical area.
- Benchmark building – curate evaluation sets for LLMs, anonymization tools, or privacy frameworks.
- Trend analysis – see how research volume changes over time in areas like ZK proofs or post‑quantum crypto.
- Dataset for downstream models – use title + abstract as input for topic modeling, embeddings, or semantic search.
- Meta‑research – study the evolution of anonymity, privacy‑preserving ML, and PQ crypto.
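The trend‑analysis use case only needs the published date and short_category fields. A minimal pandas sketch on a hypothetical slice of the dataset:

```python
import pandas as pd

# Hypothetical slice of the dataset: trend analysis just needs
# the `published` and `short_category` columns.
df = pd.DataFrame([
    {"published": "2024-03-01", "short_category": "crypto_privacy"},
    {"published": "2024-07-15", "short_category": "crypto_privacy"},
    {"published": "2025-01-10", "short_category": "post_quantum"},
    {"published": "2025-02-02", "short_category": "post_quantum"},
    {"published": "2025-05-20", "short_category": "post_quantum"},
])

# Count papers per year and category
df["year"] = pd.to_datetime(df["published"]).dt.year
trend = df.groupby(["year", "short_category"]).size().unstack(fill_value=0)
print(trend)
```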
- Real papers only – every record must correspond to a real arXiv entry.
- Transparent scraping – URLs always point back to arxiv.org.
- No guessing / hallucinating metadata – if arXiv does not provide it, it is not faked.
- Reproducibility – input + time window → deterministically shaped dataset schema.
Typical input fields (simplified):
```json
{
  "maxPapersPerCategory": 1250,
  "includeInternetPrivacy": true,
  "includeCryptoPrivacy": true,
  "includeDataPrivacy": true,
  "includePostQuantum": true
}
```

You can extend this in the future (e.g., year range, specific arXiv query strings, exclusion filters).
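This input can also be passed to the Actor programmatically. The sketch below builds the run input with a plain helper (illustrative, not part of the Actor) and shows, commented out, how it would be handed to the official apify-client package:

```python
def build_run_input(max_per_category=1250, **category_toggles):
    """Assemble the Actor's run input from the fields above.
    Omitted toggles default to True (illustrative helper)."""
    run_input = {"maxPapersPerCategory": max_per_category}
    for key in ("includeInternetPrivacy", "includeCryptoPrivacy",
                "includeDataPrivacy", "includePostQuantum"):
        run_input[key] = category_toggles.get(key, True)
    return run_input

# Sketch of triggering a run (requires `pip install apify-client`
# and a valid API token):
# from apify_client import ApifyClient
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("oblate_wildcat/privacy-stack").call(
#     run_input=build_run_input(includePostQuantum=False))
```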
Bikram Biswas (@BikramBiswas786)
- Quantum & privacy tooling developer
- Creator of Anon Lab (interactive privacy paper explorer)
- Active on Apify building research‑grade Actors for security, privacy, and data aggregation.
Apify profile: https://apify.com/bikrambiswas
If Privacy Stack helps in your work, you can cite it as:
```bibtex
@software{biswas2025privacystack,
  author = {Biswas, Bikram},
  title  = {Privacy Stack: 5000 Real arXiv Privacy Papers for Researchers},
  year   = {2025},
  url    = {https://apify.com/oblate_wildcat/privacy-stack}
}
```

- Use this dataset responsibly.
- All papers belong to their respective authors and arXiv.
- This Actor only organizes metadata and links; it does not strip or redistribute paywalled content.
Privacy Stack turns scattered security & privacy literature into a single, structured research surface you can actually build on.
Run it, export it, and plug it straight into your research pipeline.