feat(validation): add CSV validation pipeline with CI integration #1052
Add `scripts/validate-prompts-csv.ts` with 9 checks: header validation, required fields, content quality, dataset drift, exact/near-duplicate titles, near-duplicate prompts (two-stage bucket strategy with sub-bucketing and length filters), and spam heuristics. Outputs both a console report and `.validation/validation-report.json`. Integrated into CI between the lint and test steps with a `--json` flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensures generated Prisma types are available before lint, validation, and tests. While `postinstall` runs `prisma generate` implicitly, an explicit step makes it visible and resilient to `postinstall` changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📝 Walkthrough

This pull request introduces a CSV validation system for prompts. A new TypeScript script performs comprehensive validation of prompts.csv, checking headers, content quality, duplicates, and dataset metrics. The validation is integrated into the CI pipeline and generates persistent JSON reports stored in a `.validation/` directory.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as CI Workflow
    participant Script as Validation Script
    participant CSV as prompts.csv
    participant FS as File System
    participant Console as Output
    CI->>Script: npm run validate:csv --json
    Script->>CSV: Load and parse CSV
    Script->>Script: Validate header structure
    Script->>Script: Validate each row (columns, types)
    Script->>Script: Run spam heuristics check
    Script->>Script: Check for exact duplicates
    Script->>Script: Check for near-duplicates
    Script->>Script: Analyze dataset drift
    Script->>FS: Write .validation/validation-report.json
    Script->>Console: Print JSON summary (--json flag)
    Script->>CI: Exit with code 0 or 1
```
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (5)
scripts/validate-prompts-csv.ts (4)
553-565: Consider capturing the error for more informative diagnostics.

The bare `catch` discards the underlying error. Including the error message would help diagnose file access issues (e.g., permissions vs. file not found).

Include error details:
```diff
   try {
     content = await loadCsvFile(csvPath);
-  } catch {
+  } catch (err) {
+    const errorMsg = err instanceof Error ? err.message : String(err);
     report.issues.push({
       severity: "error",
       row: null,
       check: "file_read",
-      message: `Cannot read ${csvPath}`,
+      message: `Cannot read ${csvPath}: ${errorMsg}`,
     });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-prompts-csv.ts` around lines 553 - 565, The current bare catch swallowing errors around loadCsvFile prevents useful diagnostics; change the try/catch to catch the error (e.g., catch (err)) and include its message/details in the report and/or logged output: update the report.issues entry created in the catch of loadCsvFile to include the error string (err.message or String(err)) in the message field (e.g., `Cannot read ${csvPath}: ${String(err)}`), keep setting report.errors and call printReport(report, jsonOutput) and process.exit(1) as before; this uses the existing loadCsvFile, report, printReport, and process.exit symbols so behavior is identical except now the underlying error is surfaced.
31-37: Unused interface `CsvRow`.

The `CsvRow` interface is defined but never used in the code. The script works directly with `string[]` arrays from the CSV parser.

Remove unused interface:
```diff
-interface CsvRow {
-  act: string;
-  prompt: string;
-  for_devs: string;
-  type: string;
-  contributor: string;
-}
-
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-prompts-csv.ts` around lines 31 - 37, Remove the unused CsvRow interface; delete the "interface CsvRow { act: string; prompt: string; for_devs: string; type: string; contributor: string; }" declaration and any references to it, since the CSV parser produces string[] rows in this script. If you prefer to keep typed rows instead, replace usages of string[] with CsvRow and type the CSV parse result accordingly (e.g., cast/transform parser output to CsvRow[]), but otherwise simply remove the unused CsvRow symbol to eliminate dead code.
380-386: Silent gap in duplicate detection for bucket sizes 51-200.

Buckets with 51-200 entries are skipped entirely: too large for direct pairing but below the sub-bucketing threshold. This creates a detection gap where near-duplicates could be missed if they happen to land in a medium-sized bucket.
Consider either:

- Lowering `SUB_BUCKET_THRESHOLD` to match `MAX_BUCKET_SIZE + 1` (51) so all large buckets get sub-bucketed
- Adding a log/warning when buckets in this range are skipped, for observability
Option 1: Eliminate the gap

```diff
-const SUB_BUCKET_THRESHOLD = 200; // Split buckets larger than this into sub-buckets
+const SUB_BUCKET_THRESHOLD = MAX_BUCKET_SIZE; // Split all buckets exceeding MAX_BUCKET_SIZE
```

Then adjust the condition:

```diff
-    if (indices.length <= SUB_BUCKET_THRESHOLD) {
-      // Between MAX_BUCKET_SIZE and SUB_BUCKET_THRESHOLD: skip entirely
-      // (too large for direct pairing, too small to warrant sub-bucketing)
-      continue;
-    }
+    // Sub-bucket all large buckets
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-prompts-csv.ts` around lines 380 - 386, The code currently skips buckets where indices.length is between MAX_BUCKET_SIZE+1 and SUB_BUCKET_THRESHOLD creating a silent detection gap; fix by either setting SUB_BUCKET_THRESHOLD = MAX_BUCKET_SIZE + 1 so all buckets > MAX_BUCKET_SIZE are sub-bucketed, or change the conditional from checking SUB_BUCKET_THRESHOLD to compare against MAX_BUCKET_SIZE (e.g., only continue when indices.length <= MAX_BUCKET_SIZE), and add a processLogger.warn or console.warn when you intentionally skip a medium-sized bucket (referencing indices, SUB_BUCKET_THRESHOLD, MAX_BUCKET_SIZE and the continue) for observability.
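To make the gap concrete, here is a minimal sketch of the bucket-size dispatch the comment describes. The constant names mirror those quoted in the review; the values and the dispatch function itself are illustrative assumptions, not the PR's actual code:

```typescript
// Illustrative only: constants as quoted in the review comment.
const MAX_BUCKET_SIZE = 50;        // direct O(n^2) pairing is cheap up to here
const SUB_BUCKET_THRESHOLD = 200;  // split buckets larger than this into sub-buckets

type BucketAction = "compare-all-pairs" | "skip" | "sub-bucket";

// Hypothetical dispatch showing where the silent 51-200 gap comes from.
function classifyBucket(size: number): BucketAction {
  if (size <= MAX_BUCKET_SIZE) return "compare-all-pairs";
  if (size <= SUB_BUCKET_THRESHOLD) return "skip"; // the detection gap
  return "sub-bucket";
}
```

Setting `SUB_BUCKET_THRESHOLD` to `MAX_BUCKET_SIZE` (the review's Option 1) makes the `"skip"` branch unreachable, closing the gap.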
598-610: Check numbering in comments is inconsistent.

"Check 7" appears twice (lines 598 and 601). Based on the PR description, the correct numbering should be:
- Check 7: dataset drift
- Check 8: duplicate titles
- Check 9: near-duplicate prompts/titles
Fix comment numbering
```diff
   // Check 7: dataset drift (drop >10% = error, spike >30% = warning)
   report.issues.push(...checkDatasetDrift(records.length));

-  // Check 7: exact title duplicates
+  // Check 8: exact title duplicates
   report.issues.push(...findExactTitleDuplicates(prompts));

-  // Check 8: near-duplicate prompts (bucket strategy, similarity >= 0.90)
+  // Check 9: near-duplicate prompts (bucket strategy, similarity >= 0.90)
   const { issues: dupIssues, pairs: dupPairs } = findNearDuplicatePrompts(prompts);
   report.issues.push(...dupIssues);
   report.nearDuplicatePrompts = dupPairs;

-  // Check 9: near-duplicate titles
+  // (Also part of Check 8/9: near-duplicate titles)
   report.issues.push(...findNearDuplicateTitles(prompts));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate-prompts-csv.ts` around lines 598 - 610, The inline check-numbering comments are inconsistent; update the comments around the calls to checkDatasetDrift, findExactTitleDuplicates, findNearDuplicatePrompts, and findNearDuplicateTitles so they read: "Check 7: dataset drift" before checkDatasetDrift(records.length), "Check 8: duplicate titles" before findExactTitleDuplicates(prompts), and "Check 9: near-duplicate prompts/titles" before the near-duplicate checks (findNearDuplicatePrompts and findNearDuplicateTitles) to match the PR description.

.github/workflows/ci.yml (1)
31-32: Minor redundancy: `prisma generate` already runs via `postinstall`.

The `postinstall` script in `package.json` (line 20) already executes `prisma generate` during `npm ci`. This explicit step is harmless but redundant. Consider removing it or adding a comment explaining why it's kept (e.g., for visibility in CI logs).
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 31 - 32, The "Generate Prisma schema" CI step runs "npx prisma generate" but this is redundant because package.json's "postinstall" already runs prisma generate during npm ci; either remove the step in .github/workflows/ci.yml (the step with name "Generate Prisma schema" and command "npx prisma generate") or keep it but add a short inline comment in the workflow explaining it's retained only for CI log visibility, so readers know it's intentional rather than accidental duplication with the "postinstall" script in package.json.
ℹ️ Review info
Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6a0904c7-fc1e-48d0-a09d-f0bf8dbcc56d
⛔ Files ignored due to path filters (1)
`package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (4)
- .github/workflows/ci.yml
- .gitignore
- package.json
- scripts/validate-prompts-csv.ts
```typescript
if (typeVal && !VALID_TYPES.has(typeVal)) {
  issues.push({
    severity: "error",
    row: rowNumber,
    check: "value_constraint",
    message: `Invalid type value "${type.trim()}", expected one of: ${[...VALID_TYPES].join(", ")}`,
  });
}
```
Empty `type` field silently defaults to "TEXT" without validation warning.

When `type` is empty, the validation at line 138 is skipped (falsy check), and line 167 defaults to `"TEXT"`. This behavior is inconsistent with other required field checks. If `type` is truly optional, this is fine; if it should be required, add validation.
Option: Validate empty type as an error (if required)
```diff
   // Value constraints
-  if (forDevsVal && !VALID_FOR_DEVS.has(forDevsVal)) {
+  if (!forDevsVal) {
+    issues.push({
+      severity: "error",
+      row: rowNumber,
+      check: "required_field",
+      message: "Empty for_devs value",
+    });
+  } else if (!VALID_FOR_DEVS.has(forDevsVal)) {
```

Similarly for `type` if it should be required.
Also applies to: 167-167
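A self-contained sketch of the stricter required-field handling this comment suggests. The issue shape and set contents follow the snippets above; the wrapper function and its signature are assumptions for illustration, not the PR's actual code:

```typescript
interface Issue {
  severity: "error" | "warning";
  row: number | null;
  check: string;
  message: string;
}

const VALID_FOR_DEVS = new Set(["TRUE", "FALSE"]);
const VALID_TYPES = new Set(["TEXT", "JSON", "YAML", "IMAGE", "VIDEO", "AUDIO", "STRUCTURED"]);

// Hypothetical helper: validate one row's enum fields, treating empty values
// as explicit errors instead of silently falling through to a default.
function checkEnumFields(rowNumber: number, forDevsVal: string, typeVal: string): Issue[] {
  const issues: Issue[] = [];
  if (!forDevsVal) {
    issues.push({ severity: "error", row: rowNumber, check: "required_field", message: "Empty for_devs value" });
  } else if (!VALID_FOR_DEVS.has(forDevsVal)) {
    issues.push({ severity: "error", row: rowNumber, check: "value_constraint", message: `Invalid for_devs value "${forDevsVal}"` });
  }
  if (!typeVal) {
    issues.push({ severity: "error", row: rowNumber, check: "required_field", message: "Empty type value" });
  } else if (!VALID_TYPES.has(typeVal)) {
    issues.push({ severity: "error", row: rowNumber, check: "value_constraint", message: `Invalid type value "${typeVal}"` });
  }
  return issues;
}
```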
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/validate-prompts-csv.ts` around lines 138 - 145, The current logic
skips validation when typeVal is empty and later silently defaults to "TEXT";
add an explicit check for an empty or missing type and push an issues entry
(severity "error" or "warning" per spec) using the same structure as the
existing issues.push (include row: rowNumber, check: "required_field" or
"value_constraint", and a clear message that type is missing), rather than
allowing the later default to hide the problem; keep VALID_TYPES validation for
non-empty values and ensure the defaulting to "TEXT" only happens after
validation or only when the field is explicitly allowed to be optional.
Wow this looks useful! I'll review it shortly.
Summary
- `scripts/validate-prompts-csv.ts` — a comprehensive CSV validation script with 9 checks covering structure, content quality, spam heuristics, dataset drift, and duplicate detection
- `--json` output
- `csv-parse` devDependency for RFC 4180-compliant CSV parsing
- `.validation/` added to `.gitignore` for local report artifacts

Validation checks
- Header: `act,prompt,for_devs,type,contributor`
- `act`, `prompt`, and `contributor` must be non-empty
- `for_devs` must be TRUE/FALSE; `type` must be TEXT/JSON/YAML/IMAGE/VIDEO/AUDIO/STRUCTURED

CI pipeline order
Near-duplicate detection (designed for 100k+ prompts)
- `calculateSimilarity()` (Jaccard 60% + trigram 40%) with word-count guard

Current results (1391 prompts)
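The `calculateSimilarity()` blend mentioned above could be sketched as follows. The 60/40 weights and the word-count guard come from this summary; the tokenization, the 0.5 guard ratio, and the helper names are assumptions, not the PR's actual implementation:

```typescript
// Jaccard similarity over two sets of strings.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 1 : inter / union;
}

// Character trigrams of a string ("hello" -> "hel", "ell", "llo").
function trigrams(s: string): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i <= s.length - 3; i++) grams.add(s.slice(i, i + 3));
  return grams;
}

// Hypothetical sketch of a Jaccard (60%) + trigram (40%) similarity blend.
function calculateSimilarity(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wordsB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  // Word-count guard: texts of very different lengths cannot be near-duplicates,
  // so skip the expensive trigram comparison entirely (0.5 ratio is an assumption).
  if (Math.min(wordsA.size, wordsB.size) < 0.5 * Math.max(wordsA.size, wordsB.size)) {
    return 0;
  }
  const wordScore = jaccard(wordsA, wordsB);
  const triScore = jaccard(trigrams(a.toLowerCase()), trigrams(b.toLowerCase()));
  return 0.6 * wordScore + 0.4 * triScore;
}
```

The guard matters at 100k+ scale: it rejects obviously mismatched pairs with a cheap size comparison before any set construction over trigrams pays off.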
Files changed
- `scripts/validate-prompts-csv.ts` (new — 695 lines)
- `package.json` (added `validate:csv` script, `csv-parse` devDependency)
- `package-lock.json` (updated lockfile)
- `.github/workflows/ci.yml` (added Prisma generate + validation steps)
- `.gitignore` (added `.validation/`)

Test plan
- `npm run validate:csv` prints report with 0 errors
- `npm run validate:csv -- --json` outputs valid JSON
- `.validation/validation-report.json` is generated correctly
- Invalid `prompts.csv` header → script exits with code 1

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores