Automated pipeline for discovering, parsing, and visualising community-sourced interview experiences from LeetCode discussions.
This project powers a full ingestion-to-visualisation flow for public LeetCode discussion posts tagged as interview. The Java collector walks the GraphQL API, calls OpenAI to structure the content into a canonical Interview schema, validates and deduplicates the payloads, then persists them into a sorted JSON dataset. A Vite + Vue dashboard renders aggregated insights, charts, and searchable tables from that dataset.
Key capabilities:
- Reliable incremental backfill that walks LeetCode discussions until a configurable start date.
- Post-level filtering (e.g. discard posts with positive `UPVOTE` reactions) and validation to keep the dataset clean.
- Automated ID enrichment and date normalisation to simplify downstream analytics.
- Front-end visualisations built with Nuxt UI components, ECharts, and typed table sorting.
├── build.gradle.kts # Gradle build for the Java ingestion pipeline
├── src/main/java # Fetcher, models, storage, and OpenAI integration
├── src/main/web # Vite + Vue 3 frontend for exploring the dataset
│ ├── public/interviews.json # Published dataset consumed by the UI
│ └── public/leetcode-interview-questions.svg # Brand assets
└── README.md # You are here
- Fetch discussions – `com.prastavna.leetcode.services.Leetcode` queries the LeetCode GraphQL API in pages, respecting rate limits and the configured `LEETCODE_FETCH_START_DATE`.
- Filter & dedupe – Posts are filtered on their `UPVOTE` reaction counts to remove low-signal entries. Existing interviews are updated in place using composite keys (`leetcodeId`, `id + date`).
- OpenAI parsing – `Openai.getJsonCompletion` sends the discussion title and content to an OpenAI structured output model that fills the `Interview` POJO.
- Validation – `InterviewValidator` enforces required fields (company, rounds with questions, YoE bounds). Invalid payloads are logged and skipped.
- Persistence – `JsonStorage` appends or replaces entries, then sorts the array by `date` descending so the JSON data stays chronologically ordered.
- Visualisation – The Vue dashboard consumes the generated dataset (drop it into `src/main/web/public/interviews.json`) to drive charts and tables.
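The dedupe and persistence steps above can be sketched roughly as follows. This is a minimal illustration, not the project's `JsonStorage` code: `InterviewEntry` and `DatasetMergeSketch` are hypothetical names, and the sketch keys on `leetcodeId` only, whereas the pipeline is described as using composite keys.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the project's Interview model.
record InterviewEntry(String leetcodeId, String company, LocalDate date) {}

final class DatasetMergeSketch {
    // Replaces existing entries that share a key with an incoming entry,
    // appends the rest, then sorts the result newest-first.
    static List<InterviewEntry> merge(List<InterviewEntry> existing,
                                      List<InterviewEntry> incoming) {
        Map<String, InterviewEntry> byKey = new LinkedHashMap<>();
        for (InterviewEntry e : existing) {
            byKey.put(e.leetcodeId(), e);
        }
        for (InterviewEntry e : incoming) {
            byKey.put(e.leetcodeId(), e); // overwrite = update in place
        }
        List<InterviewEntry> merged = new ArrayList<>(byKey.values());
        merged.sort(Comparator.comparing(InterviewEntry::date).reversed());
        return merged;
    }
}
```

Calling `DatasetMergeSketch.merge(existing, incoming)` returns a new list and leaves both inputs untouched, which keeps the write to disk a single, final step.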
ℹ️ LeetCode paging ceiling: The GraphQL API currently stops returning results after `skip >= 3000`. When backfilling, watch the logs emitted in `App` to see when the crawler reaches that boundary.
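To make the two stop conditions concrete, here is a small sketch of a paging loop that ends at either the start date or the paging ceiling. It is illustrative only: `PagingSketch`, `Post`, and the `fetchPage` function are hypothetical stand-ins for the real GraphQL call, and the sketch assumes posts arrive newest-first.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class PagingSketch {
    // Ceiling taken from the note above; the API may change it.
    static final int PAGING_CEILING = 3000;

    // Hypothetical minimal shape of a discussion post.
    record Post(int topicId, LocalDate created) {}

    // fetchPage receives (skip, pageSize) and returns one page of posts.
    // Assumption: pages are ordered newest-first.
    static List<Post> crawl(BiFunction<Integer, Integer, List<Post>> fetchPage,
                            int pageSize, LocalDate startDate) {
        List<Post> kept = new ArrayList<>();
        for (int skip = 0; skip < PAGING_CEILING; skip += pageSize) {
            List<Post> page = fetchPage.apply(skip, pageSize);
            if (page.isEmpty()) {
                break; // nothing left to read
            }
            for (Post post : page) {
                if (post.created().isBefore(startDate)) {
                    return kept; // reached the backfill cut-off
                }
                kept.add(post);
            }
        }
        return kept; // either ran out of pages or hit the ceiling
    }
}
```

With a page size of 50, the loop issues at most 60 requests before the ceiling ends the crawl.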
- Java 21+
- Gradle wrapper (bundled)
- OpenAI API key with Structured Output access
- Bun 1.1+ (recommended) or Node.js 18+ for the frontend
Create a .env file based on the sample:
cp .env.sample .env
Fill in the variables:
| Variable | Description |
|---|---|
| `LEETCODE_FETCH_START_DATE` | Earliest date to retain (e.g. `2025-01-01`). |
| `LEETCODE_PAGE_SIZE` | Page size for GraphQL paging. `50` works well. |
| `LEETCODE_LAG_DAYS` | Skip extremely recent posts that may still change. |
| `OPENAI_BASE_URL` | Optional override for the OpenAI endpoint. |
| `OPENAI_API_KEY` | Required for parsing discussions into interviews. |
| `INTERVIEWS_JSON_PATH` | Output location for the dataset (`interviews.json` by default). |
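As an illustration of how these variables might be turned into typed settings, the sketch below reads them from a plain map. `CollectorConfig` is a hypothetical name, not a class from this repository. The fallback values for the start date, page size, and lag days are assumptions chosen for the example; only the `interviews.json` default comes from the table.

```java
import java.time.LocalDate;
import java.util.Map;

// Hypothetical typed view over the environment variables listed above.
record CollectorConfig(LocalDate fetchStartDate, int pageSize,
                       int lagDays, String interviewsJsonPath) {

    // Builds a config from a key/value map such as System.getenv().
    static CollectorConfig from(Map<String, String> env) {
        return new CollectorConfig(
            // Assumed fallback: the example date used in the table.
            LocalDate.parse(env.getOrDefault("LEETCODE_FETCH_START_DATE", "2025-01-01")),
            // Assumed fallback: the page size the table recommends.
            Integer.parseInt(env.getOrDefault("LEETCODE_PAGE_SIZE", "50")),
            // Assumed fallback: no lag.
            Integer.parseInt(env.getOrDefault("LEETCODE_LAG_DAYS", "0")),
            // Documented default output location.
            env.getOrDefault("INTERVIEWS_JSON_PATH", "interviews.json"));
    }
}
```

`CollectorConfig.from(System.getenv())` would pick up exported variables. Note that a `.env` file is not part of the process environment on its own, so loading it is a separate step that this sketch does not cover.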
./gradlew run
The CLI will page through LeetCode, print diagnostics for each stop/skip condition, invoke OpenAI, and write the resulting interviews into the configured JSON file. Logs such as `Breaking: created=...` or `Skipping topicId=...` help track paging progress.
cd src/main/web
bun install # or npm install / pnpm install
bun run dev # or npm run dev
Ensure `public/interviews.json` contains the dataset produced by the ingestion step (copy or symlink if needed). Then open the printed Vite URL to explore charts, filters, and tables.
cd src/main/web
bun run build # or npm run build
Static assets will be emitted to `dist/`, ready for deployment.
The automatic-data-refresh workflow mirrors the manual ingestion flow and runs every day at 02:00 UTC (plus workflow_dispatch for on-demand runs). It:
- Checks out the repository and sets up Java 21 + Gradle caching.
- Executes `./gradlew --no-daemon run`, which regenerates `src/main/web/public/interviews.json`.
- Commits the refreshed dataset and opens an automated PR (`data-refresh` branch) whenever the JSON changes.
Configure the following secrets/variables in the repository or organization settings before enabling the workflow:
| Name | Type | Required | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | Secret | ✅ | Key used by the OpenAI client to structure discussions. |
| `OPENAI_BASE_URL` | Secret | Optional | Override for custom OpenAI-compatible endpoints. |
| `PAT_TOKEN` | Secret | ✅ | Classic PAT with `repo` scope so the workflow can open PRs. |
| `LEETCODE_*` | Variable | Optional | Override API URL, paging, lag days, or fetch start date. |
| `OPENAI_CONCURRENCY` | Variable | Optional | Caps concurrent OpenAI requests (defaults to 4). |
| `INTERVIEWS_JSON_PATH` | Variable | Optional | Alternate output location for the dataset. |
Once these values are present, the workflow will keep the published dataset up to date without manual intervention.
- Backfill window: Adjust `LEETCODE_FETCH_START_DATE` to match your historical cut-off. The crawler will only stop once it reaches that date or the LeetCode paging ceiling.
- Lag strategy: `LEETCODE_LAG_DAYS` controls how many days to wait before ingesting extremely fresh posts.
- Concurrency: `OPENAI_CONCURRENCY` (optional `.env` entry) throttles the number of parallel OpenAI requests (defaults to min(cores, 4)).
- Storage path: Override `INTERVIEWS_JSON_PATH` if you want to version the dataset separately from the repo.
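The concurrency default described above, min(cores, 4), could be resolved and applied along these lines. This is a sketch under assumptions rather than the project's implementation: `ConcurrencySketch`, `resolveConcurrency`, and `runAll` are hypothetical names, and a fixed thread pool is just one way to cap parallel requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrencySketch {
    // Uses the configured value when present, otherwise min(cores, 4).
    // The result is clamped to at least 1 so a pool can always be created.
    static int resolveConcurrency(String configured, int cores) {
        if (configured != null && !configured.isBlank()) {
            return Math.max(1, Integer.parseInt(configured.trim()));
        }
        return Math.max(1, Math.min(cores, 4));
    }

    // Runs all tasks with at most `concurrency` of them in flight at once
    // and returns the results in submission order.
    static <T> List<T> runAll(List<Callable<T>> tasks, int concurrency) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<T> results = new ArrayList<>();
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller might pass `System.getenv("OPENAI_CONCURRENCY")` and `Runtime.getRuntime().availableProcessors()` to `resolveConcurrency`, then hand the result to `runAll` together with one task per discussion.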
- The LeetCode GraphQL endpoint stops returning pages past `skip ≈ 3000`, so manual offsets are required beyond that.
- Some posts lack structured details (company, rounds). The validator intentionally discards low-signal data; relax the rules in `InterviewValidator` if you prefer partial records.
- OpenAI parsing quality depends on prompt adherence. Monitor skipped IDs and adjust the prompt or fallback handling as needed.
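For readers who want to relax the validation rules, the sketch below shows the general shape such checks could take. It does not reproduce `InterviewValidator`: the `Candidate` and `Round` records, the `MAX_YOE` bound, and the reading of "rounds with questions" as "at least one round has a question" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ValidatorSketch {
    // Hypothetical, trimmed-down shapes of a parsed interview.
    record Round(List<String> questions) {}
    record Candidate(String company, List<Round> rounds, Integer yoe) {}

    // Assumed upper bound for years of experience.
    static final int MAX_YOE = 50;

    // Returns a list of human-readable problems; empty means valid.
    static List<String> violations(Candidate c) {
        List<String> problems = new ArrayList<>();
        if (c.company() == null || c.company().isBlank()) {
            problems.add("company is missing");
        }
        boolean hasQuestions = c.rounds() != null && c.rounds().stream()
            .anyMatch(r -> r.questions() != null && !r.questions().isEmpty());
        if (!hasQuestions) {
            problems.add("no round contains a question");
        }
        // yoe is optional here; only out-of-range values are rejected.
        if (c.yoe() != null && (c.yoe() < 0 || c.yoe() > MAX_YOE)) {
            problems.add("yoe is out of bounds");
        }
        return problems;
    }
}
```

Returning the list of problems instead of a boolean makes it easy to log why a payload was skipped, and dropping one check from `violations` is enough to accept partial records.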
- Fork & clone the repository.
- Create a feature branch (`git checkout -b feature/my-change`).
- Run `./gradlew run` and `bun run build` (or `npm run build`) to ensure everything still works.
- Open a PR describing the motivation, approach, and screenshots/logs if relevant.
Issues and feature proposals are welcome—please include reproduction details or sample discussion links when reporting ingestion problems.
- LeetCode for providing open discussion data.
- OpenAI for structured completions that make parsing practical.
- Nuxt UI and ECharts for the frontend building blocks.
Built by Prastavna with ♥︎