
feat(validation): add CSV validation pipeline with CI integration#1052

Open
1rashiid wants to merge 2 commits into f:main from 1rashiid:feat/csv-validation-pipeline

Conversation

@1rashiid

@1rashiid 1rashiid commented Mar 4, 2026

Summary

  • Add scripts/validate-prompts-csv.ts — a comprehensive CSV validation script with 9 checks covering structure, content quality, spam heuristics, dataset drift, and duplicate detection
  • Add explicit Prisma schema generation step in CI to ensure generated types are available before lint, validation, and tests
  • Integrate validation into CI between lint and test steps with --json output
  • Add csv-parse devDependency for RFC 4180-compliant CSV parsing
  • Add .validation/ to .gitignore for local report artifacts

Validation checks

| # | Check | Severity | Description |
|---|-------|----------|-------------|
| 1 | Header validation | error | Verifies the 5 expected columns: `act`, `prompt`, `for_devs`, `type`, `contributor` |
| 2 | Column count | error | Every row must have exactly 5 columns |
| 3 | Required fields | error | `act`, `prompt`, and `contributor` must be non-empty |
| 4 | Value constraints | error | `for_devs` must be TRUE/FALSE; `type` must be TEXT/JSON/YAML/IMAGE/VIDEO/AUDIO/STRUCTURED |
| 5 | Content quality | warning | Flags prompts shorter than 50 characters |
| 6 | Spam heuristics | warning | Detects >3 URLs, repeated punctuation, word repetition >30x, content >20K chars |
| 7 | Dataset drift | error/warning | >10% row count drop = error; >30% increase = warning (baseline: 1391) |
| 8 | Duplicate titles | warning | Exact case-insensitive title matches |
| 9 | Near-duplicate prompts | warning | Two-stage bucket strategy with similarity ≥ 0.90 threshold |
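
Check 6's heuristics can be sketched roughly as follows. This is an illustrative sketch only: the thresholds come from the table above, but the function name `spamIssues`, the check labels, and the exact regexes are assumptions, not the script's actual implementation.

```typescript
// Hypothetical sketch of the spam heuristics (check 6). Thresholds match the
// table above; names and regexes are illustrative.
function spamIssues(prompt: string): string[] {
  const issues: string[] = [];

  // More than 3 URLs
  const urls = prompt.match(/https?:\/\/\S+/g) ?? [];
  if (urls.length > 3) issues.push("too_many_urls");

  // Runs of repeated punctuation (e.g. "!!!!!")
  if (/([!?.,])\1{4,}/.test(prompt)) issues.push("repeated_punctuation");

  // Any single word repeated more than 30 times
  const counts = new Map<string, number>();
  for (const w of prompt.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts.set(w, (counts.get(w) ?? 0) + 1);
  }
  if ([...counts.values()].some((c) => c > 30)) issues.push("word_repetition");

  // Content longer than 20K characters
  if (prompt.length > 20_000) issues.push("excessive_length");

  return issues;
}
```

Each check pushes a label rather than failing fast, which matches the reported behavior of spam findings surfacing as warnings instead of hard errors.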

CI pipeline order

  1. Checkout → 2. Setup Node.js 24 → 3. Install dependencies → 4. Generate Prisma schema → 5. Lint → 6. Validate prompts.csv → 7. Tests
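
In GitHub Actions terms, the ordering above could look like the following. This is an illustrative sketch; the step names and action versions are assumptions, not the exact contents of `.github/workflows/ci.yml`.

```yaml
# Hypothetical sketch of the CI ordering described above.
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 24
      - run: npm ci
      - name: Generate Prisma client
        run: npx prisma generate
      - run: npm run lint
      - name: Validate prompts.csv
        run: npm run validate:csv -- --json
      - run: npm test
```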

Near-duplicate detection (designed for 100k+ prompts)

  • Stage 1 (O(n)): Normalize content (capped at 5000 chars), bucket by fingerprint and first-3-words, sub-bucket oversized buckets (>200), character-length pre-filter
  • Stage 2 (candidates only): calculateSimilarity() (Jaccard 60% + trigram 40%) with word-count guard
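
A minimal sketch of the two-stage approach, assuming illustrative names (`normalize`, `bucketByPrefix`, `calculateSimilarity`) rather than the script's actual identifiers:

```typescript
// Sketch only: weights (60/40), the 0.90 threshold, and the 5000-char cap are
// from the description above; everything else is an assumption.
const SIMILARITY_THRESHOLD = 0.9;

function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim().slice(0, 5000);
}

// Stage 1 (O(n)): bucket rows by their first three normalized words so that
// Stage 2 only compares candidate pairs within a bucket.
function bucketByPrefix(prompts: string[]): Map<string, number[]> {
  const buckets = new Map<string, number[]>();
  prompts.forEach((p, i) => {
    const key = normalize(p).split(" ").slice(0, 3).join(" ");
    buckets.set(key, [...(buckets.get(key) ?? []), i]);
  });
  return buckets;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  return inter / (a.size + b.size - inter);
}

function trigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + 3 <= s.length; i++) out.add(s.slice(i, i + 3));
  return out;
}

// Stage 2: weighted blend of word-set Jaccard (60%) and trigram Jaccard (40%).
function calculateSimilarity(a: string, b: string): number {
  const na = normalize(a);
  const nb = normalize(b);
  const words = jaccard(new Set(na.split(" ")), new Set(nb.split(" ")));
  const tri = jaccard(trigrams(na), trigrams(nb));
  return 0.6 * words + 0.4 * tri;
}
```

Bucketing keeps the pairwise similarity work bounded: only rows sharing a prefix key are ever compared, which is what makes the approach viable at 100k+ rows.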

Current results (1391 prompts)

| Check | Errors | Warnings |
|---|---|---|
| Content quality | 0 | 2 |
| Spam heuristic | 0 | 41 |
| Duplicate title | 0 | 16 |
| Near-duplicate prompt | 0 | 18 |
| **Total** | **0** | **77** |
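
Check 7's drift thresholds against the 1391-row baseline can be expressed as a small helper. This is a hedged sketch: the baseline and thresholds come from the description above, while `checkDrift` and the return shape are hypothetical.

```typescript
// Hypothetical sketch of check 7 (dataset drift); thresholds per the PR text.
const BASELINE_ROW_COUNT = 1391;

type Drift = { severity: "error" | "warning" | null; change: number };

function checkDrift(rowCount: number, baseline = BASELINE_ROW_COUNT): Drift {
  const change = (rowCount - baseline) / baseline;
  if (change < -0.1) return { severity: "error", change }; // >10% drop
  if (change > 0.3) return { severity: "warning", change }; // >30% spike
  return { severity: null, change };
}
```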

Files changed

  • scripts/validate-prompts-csv.ts (new — 695 lines)
  • package.json (added validate:csv script, csv-parse devDependency)
  • package-lock.json (updated lockfile)
  • .github/workflows/ci.yml (added Prisma generate + validation steps)
  • .gitignore (added .validation/)

Test plan

  • npm run validate:csv prints report with 0 errors
  • npm run validate:csv -- --json outputs valid JSON
  • .validation/validation-report.json is generated correctly
  • CI pipeline passes: install → prisma generate → lint → validate → test
  • Corrupt prompts.csv header → script exits with code 1

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Introduced automated CSV validation for prompts with duplicate detection, content quality assessment, spam pattern identification, header validation, and dataset drift monitoring to ensure data quality and consistency across the system.
  • Chores

    • Enhanced CI pipeline to run automated validation checks on prompts during every build.
    • Updated project ignore rules to exclude validation report artifacts.

seyyed rashid khazeiynasab and others added 2 commits March 3, 2026 20:20
Add scripts/validate-prompts-csv.ts with 9 checks: header validation,
required fields, content quality, dataset drift, exact/near-duplicate
titles, near-duplicate prompts (two-stage bucket strategy with
sub-bucketing and length filters), and spam heuristics. Outputs both
console report and .validation/validation-report.json. Integrated into
CI between lint and test steps with --json flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ensures generated Prisma types are available before lint, validation,
and tests. While postinstall runs prisma generate implicitly, an
explicit step makes it visible and resilient to postinstall changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a CSV validation system for prompts. A new TypeScript script performs comprehensive validation of prompts.csv, checking headers, content quality, duplicates, and dataset metrics. The validation is integrated into the CI pipeline and generates persistent JSON reports stored in a .validation/ directory.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **CI/Workflow Configuration**<br>`.github/workflows/ci.yml` | Adds two new CI steps: schema generation via `npx prisma generate` after dependency installation, and CSV validation via `npm run validate:csv -- --json` after linting. |
| **Gitignore Configuration**<br>`.gitignore` | Adds the `.validation/` directory to the ignore list for validation report artifacts. |
| **Package Configuration**<br>`package.json` | Adds a `validate:csv` npm script invoking `scripts/validate-prompts-csv.ts`; adds the `csv-parse` package as both dependency and devDependency (^6.1.0). |
| **Validation Script**<br>`scripts/validate-prompts-csv.ts` | New comprehensive CSV validator with header/structure checks, content quality warnings, spam detection heuristics, exact and near-duplicate detection, dataset drift analysis, and JSON report generation. |

Sequence Diagram(s)

sequenceDiagram
    participant CI as CI Workflow
    participant Script as Validation Script
    participant CSV as prompts.csv
    participant FS as File System
    participant Console as Output

    CI->>Script: npm run validate:csv --json
    Script->>CSV: Load and parse CSV
    Script->>Script: Validate header structure
    Script->>Script: Validate each row (columns, types)
    Script->>Script: Run spam heuristics check
    Script->>Script: Check for exact duplicates
    Script->>Script: Check for near-duplicates
    Script->>Script: Analyze dataset drift
    Script->>FS: Write .validation/validation-report.json
    Script->>Console: Print JSON summary (--json flag)
    Script->>CI: Exit with code 0 or 1

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 hop-hop-hop, validating CSV with care,
Headers, duplicates, spam—let's check with flair!
Rows of prompts dance through my script so keen,
Near-duplicates caught in my bucketing machine,
Reports now written, the pipeline runs clean! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 64.29%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a CSV validation pipeline with CI integration. It directly reflects the primary modifications across the codebase. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
scripts/validate-prompts-csv.ts (4)

553-565: Consider capturing the error for more informative diagnostics.

The bare catch discards the underlying error. Including the error message would help diagnose file access issues (e.g., permissions vs. file not found).

Include error details
   try {
     content = await loadCsvFile(csvPath);
-  } catch {
+  } catch (err) {
+    const errorMsg = err instanceof Error ? err.message : String(err);
     report.issues.push({
       severity: "error",
       row: null,
       check: "file_read",
-      message: `Cannot read ${csvPath}`,
+      message: `Cannot read ${csvPath}: ${errorMsg}`,
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 553 - 565, The current bare
catch swallowing errors around loadCsvFile prevents useful diagnostics; change
the try/catch to catch the error (e.g., catch (err)) and include its
message/details in the report and/or logged output: update the report.issues
entry created in the catch of loadCsvFile to include the error string
(err.message or String(err)) in the message field (e.g., `Cannot read
${csvPath}: ${String(err)}`), keep setting report.errors and call
printReport(report, jsonOutput) and process.exit(1) as before; this uses the
existing loadCsvFile, report, printReport, and process.exit symbols so behavior
is identical except now the underlying error is surfaced.

31-37: Unused interface CsvRow.

The CsvRow interface is defined but never used in the code. The script works directly with string[] arrays from the CSV parser.

Remove unused interface
-interface CsvRow {
-  act: string;
-  prompt: string;
-  for_devs: string;
-  type: string;
-  contributor: string;
-}
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 31 - 37, Remove the unused
CsvRow interface; delete the "interface CsvRow { act: string; prompt: string;
for_devs: string; type: string; contributor: string; }" declaration and any
references to it, since the CSV parser produces string[] rows in this script. If
you prefer to keep typed rows instead, replace usages of string[] with CsvRow
and type the CSV parse result accordingly (e.g., cast/transform parser output to
CsvRow[]), but otherwise simply remove the unused CsvRow symbol to eliminate
dead code.

380-386: Silent gap in duplicate detection for bucket sizes 51-200.

Buckets with 51-200 entries are skipped entirely—too large for direct pairing but below the sub-bucketing threshold. This creates a detection gap where near-duplicates could be missed if they happen to land in a medium-sized bucket.

Consider either:

  1. Lowering SUB_BUCKET_THRESHOLD to match MAX_BUCKET_SIZE + 1 (51) so all large buckets get sub-bucketed
  2. Adding a log/warning when buckets in this range are skipped for observability
Option 1: Eliminate the gap
-const SUB_BUCKET_THRESHOLD = 200; // Split buckets larger than this into sub-buckets
+const SUB_BUCKET_THRESHOLD = MAX_BUCKET_SIZE; // Split all buckets exceeding MAX_BUCKET_SIZE

Then adjust the condition:

-      if (indices.length <= SUB_BUCKET_THRESHOLD) {
-        // Between MAX_BUCKET_SIZE and SUB_BUCKET_THRESHOLD: skip entirely
-        // (too large for direct pairing, too small to warrant sub-bucketing)
-        continue;
-      }
+      // Sub-bucket all large buckets
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 380 - 386, The code currently
skips buckets where indices.length is between MAX_BUCKET_SIZE+1 and
SUB_BUCKET_THRESHOLD creating a silent detection gap; fix by either setting
SUB_BUCKET_THRESHOLD = MAX_BUCKET_SIZE + 1 so all buckets > MAX_BUCKET_SIZE are
sub-bucketed, or change the conditional from checking SUB_BUCKET_THRESHOLD to
compare against MAX_BUCKET_SIZE (e.g., only continue when indices.length <=
MAX_BUCKET_SIZE), and add a processLogger.warn or console.warn when you
intentionally skip a medium-sized bucket (referencing indices,
SUB_BUCKET_THRESHOLD, MAX_BUCKET_SIZE and the continue) for observability.

598-610: Check numbering in comments is inconsistent.

"Check 7" appears twice (lines 598 and 601). Based on the PR description, the correct numbering should be:

  • Check 7: dataset drift
  • Check 8: duplicate titles
  • Check 9: near-duplicate prompts/titles
Fix comment numbering
   // Check 7: dataset drift (drop >10% = error, spike >30% = warning)
   report.issues.push(...checkDatasetDrift(records.length));

-  // Check 7: exact title duplicates
+  // Check 8: exact title duplicates
   report.issues.push(...findExactTitleDuplicates(prompts));

-  // Check 8: near-duplicate prompts (bucket strategy, similarity >= 0.90)
+  // Check 9: near-duplicate prompts (bucket strategy, similarity >= 0.90)
   const { issues: dupIssues, pairs: dupPairs } = findNearDuplicatePrompts(prompts);
   report.issues.push(...dupIssues);
   report.nearDuplicatePrompts = dupPairs;

-  // Check 9: near-duplicate titles
+  // (Also part of Check 8/9: near-duplicate titles)
   report.issues.push(...findNearDuplicateTitles(prompts));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 598 - 610, The inline
check-numbering comments are inconsistent; update the comments around the calls
to checkDatasetDrift, findExactTitleDuplicates, findNearDuplicatePrompts, and
findNearDuplicateTitles so they read: "Check 7: dataset drift" before
checkDatasetDrift(records.length), "Check 8: duplicate titles" before
findExactTitleDuplicates(prompts), and "Check 9: near-duplicate prompts/titles"
before the near-duplicate checks (findNearDuplicatePrompts and
findNearDuplicateTitles) to match the PR description.
.github/workflows/ci.yml (1)

31-32: Minor redundancy: prisma generate already runs via postinstall.

The postinstall script in package.json (line 20) already executes prisma generate during npm ci. This explicit step is harmless but redundant. Consider removing it or adding a comment explaining why it's kept (e.g., for visibility in CI logs).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 31 - 32, The "Generate Prisma schema"
CI step runs "npx prisma generate" but this is redundant because package.json's
"postinstall" already runs prisma generate during npm ci; either remove the step
in .github/workflows/ci.yml (the step with name "Generate Prisma schema" and
command "npx prisma generate") or keep it but add a short inline comment in the
workflow explaining it's retained only for CI log visibility, so readers know
it's intentional rather than accidental duplication with the "postinstall"
script in package.json.

ℹ️ Review info
Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6a0904c7-fc1e-48d0-a09d-f0bf8dbcc56d

📥 Commits

Reviewing files that changed from the base of the PR and between ec34d2e and 85a6021.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (4)
  • .github/workflows/ci.yml
  • .gitignore
  • package.json
  • scripts/validate-prompts-csv.ts

Comment on lines +138 to +145
if (typeVal && !VALID_TYPES.has(typeVal)) {
  issues.push({
    severity: "error",
    row: rowNumber,
    check: "value_constraint",
    message: `Invalid type value "${type.trim()}", expected one of: ${[...VALID_TYPES].join(", ")}`,
  });
}

⚠️ Potential issue | 🟡 Minor

Empty type field silently defaults to "TEXT" without validation warning.

When type is empty, the validation at line 138 is skipped (falsy check), and line 167 defaults to "TEXT". This behavior is inconsistent with other required field checks. If type is truly optional, this is fine; if it should be required, add validation.

Option: Validate empty type as an error (if required)
   // Value constraints
-  if (forDevsVal && !VALID_FOR_DEVS.has(forDevsVal)) {
+  if (!forDevsVal) {
+    issues.push({
+      severity: "error",
+      row: rowNumber,
+      check: "required_field",
+      message: "Empty for_devs value",
+    });
+  } else if (!VALID_FOR_DEVS.has(forDevsVal)) {

Similarly for type if it should be required.

Also applies to: 167-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 138 - 145, The current logic
skips validation when typeVal is empty and later silently defaults to "TEXT";
add an explicit check for an empty or missing type and push an issues entry
(severity "error" or "warning" per spec) using the same structure as the
existing issues.push (include row: rowNumber, check: "required_field" or
"value_constraint", and a clear message that type is missing), rather than
allowing the later default to hide the problem; keep VALID_TYPES validation for
non-empty values and ensure the defaulting to "TEXT" only happens after
validation or only when the field is explicitly allowed to be optional.

@f
Owner

f commented Mar 5, 2026

Wow this looks useful! I'll review it shortly.
