
feat(validation): add CSV validation pipeline with CI integration#1052

Open
1rashiid wants to merge 2 commits into f:main from 1rashiid:feat/csv-validation-pipeline

Conversation

@1rashiid

@1rashiid 1rashiid commented Mar 4, 2026

Summary

  • Add scripts/validate-prompts-csv.ts — a comprehensive CSV validation script with 9 checks covering structure, content quality, spam heuristics, dataset drift, and duplicate detection
  • Add explicit Prisma schema generation step in CI to ensure generated types are available before lint, validation, and tests
  • Integrate validation into CI between lint and test steps with --json output
  • Add csv-parse devDependency for RFC 4180-compliant CSV parsing
  • Add .validation/ to .gitignore for local report artifacts

Validation checks

| # | Check | Severity | Description |
|---|-------|----------|-------------|
| 1 | Header validation | error | Verifies the 5 expected columns: `act`, `prompt`, `for_devs`, `type`, `contributor` |
| 2 | Column count | error | Every row must have exactly 5 columns |
| 3 | Required fields | error | `act`, `prompt`, and `contributor` must be non-empty |
| 4 | Value constraints | error | `for_devs` must be TRUE/FALSE; `type` must be TEXT/JSON/YAML/IMAGE/VIDEO/AUDIO/STRUCTURED |
| 5 | Content quality | warning | Flags prompts shorter than 50 characters |
| 6 | Spam heuristics | warning | Detects >3 URLs, repeated punctuation, word repetition >30x, content >20K chars |
| 7 | Dataset drift | error/warning | >10% row count drop = error; >30% increase = warning (baseline: 1391) |
| 8 | Duplicate titles | warning | Exact case-insensitive title matches |
| 9 | Near-duplicate prompts | warning | Two-stage bucket strategy with similarity ≥ 0.90 threshold |
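
Check 6's heuristics can be sketched roughly as follows. This is an illustrative sketch only: the thresholds come from the table above, but the function name `spamIssues`, the check labels, and the exact regexes are assumptions, not the script's actual implementation.

```typescript
// Hypothetical sketch of the spam heuristics (check 6). Thresholds match the
// table above; names and regexes are illustrative.
function spamIssues(prompt: string): string[] {
  const issues: string[] = [];

  // More than 3 URLs
  const urls = prompt.match(/https?:\/\/\S+/g) ?? [];
  if (urls.length > 3) issues.push("too_many_urls");

  // Runs of repeated punctuation (e.g. "!!!!!")
  if (/([!?.,])\1{4,}/.test(prompt)) issues.push("repeated_punctuation");

  // Any single word repeated more than 30 times
  const counts = new Map<string, number>();
  for (const w of prompt.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts.set(w, (counts.get(w) ?? 0) + 1);
  }
  if ([...counts.values()].some((c) => c > 30)) issues.push("word_repetition");

  // Content longer than 20K characters
  if (prompt.length > 20_000) issues.push("excessive_length");

  return issues;
}
```

Each check pushes a label rather than failing fast, which matches the reported behavior of spam findings surfacing as warnings instead of hard errors.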

CI pipeline order

  1. Checkout → 2. Setup Node.js 24 → 3. Install dependencies → 4. Generate Prisma schema → 5. Lint → 6. Validate prompts.csv → 7. Tests
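
In GitHub Actions terms, the ordering above could look like the following. This is an illustrative sketch; the step names and action versions are assumptions, not the exact contents of `.github/workflows/ci.yml`.

```yaml
# Hypothetical sketch of the CI ordering described above.
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 24
      - run: npm ci
      - name: Generate Prisma client
        run: npx prisma generate
      - run: npm run lint
      - name: Validate prompts.csv
        run: npm run validate:csv -- --json
      - run: npm test
```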

Near-duplicate detection (designed for 100k+ prompts)

  • Stage 1 (O(n)): Normalize content (capped at 5000 chars), bucket by fingerprint and first-3-words, sub-bucket oversized buckets (>200), character-length pre-filter
  • Stage 2 (candidates only): calculateSimilarity() (Jaccard 60% + trigram 40%) with word-count guard
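
A minimal sketch of the two-stage approach, assuming illustrative names (`normalize`, `bucketByPrefix`, `calculateSimilarity`) rather than the script's actual identifiers:

```typescript
// Sketch only: weights (60/40), the 0.90 threshold, and the 5000-char cap are
// from the description above; everything else is an assumption.
const SIMILARITY_THRESHOLD = 0.9;

function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim().slice(0, 5000);
}

// Stage 1 (O(n)): bucket rows by their first three normalized words so that
// Stage 2 only compares candidate pairs within a bucket.
function bucketByPrefix(prompts: string[]): Map<string, number[]> {
  const buckets = new Map<string, number[]>();
  prompts.forEach((p, i) => {
    const key = normalize(p).split(" ").slice(0, 3).join(" ");
    buckets.set(key, [...(buckets.get(key) ?? []), i]);
  });
  return buckets;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  return inter / (a.size + b.size - inter);
}

function trigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + 3 <= s.length; i++) out.add(s.slice(i, i + 3));
  return out;
}

// Stage 2: weighted blend of word-set Jaccard (60%) and trigram Jaccard (40%).
function calculateSimilarity(a: string, b: string): number {
  const na = normalize(a);
  const nb = normalize(b);
  const words = jaccard(new Set(na.split(" ")), new Set(nb.split(" ")));
  const tri = jaccard(trigrams(na), trigrams(nb));
  return 0.6 * words + 0.4 * tri;
}
```

Bucketing keeps the pairwise similarity work bounded: only rows sharing a prefix key are ever compared, which is what makes the approach viable at 100k+ rows.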

Current results (1391 prompts)

| Check | Errors | Warnings |
|---|---|---|
| Content quality | 0 | 2 |
| Spam heuristic | 0 | 41 |
| Duplicate title | 0 | 16 |
| Near-duplicate prompt | 0 | 18 |
| **Total** | **0** | **77** |
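
Check 7's drift thresholds against the 1391-row baseline can be expressed as a small helper. This is a hedged sketch: the baseline and thresholds come from the description above, while `checkDrift` and the return shape are hypothetical.

```typescript
// Hypothetical sketch of check 7 (dataset drift); thresholds per the PR text.
const BASELINE_ROW_COUNT = 1391;

type Drift = { severity: "error" | "warning" | null; change: number };

function checkDrift(rowCount: number, baseline = BASELINE_ROW_COUNT): Drift {
  const change = (rowCount - baseline) / baseline;
  if (change < -0.1) return { severity: "error", change }; // >10% drop
  if (change > 0.3) return { severity: "warning", change }; // >30% spike
  return { severity: null, change };
}
```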

Files changed

  • scripts/validate-prompts-csv.ts (new — 695 lines)
  • package.json (added validate:csv script, csv-parse devDependency)
  • package-lock.json (updated lockfile)
  • .github/workflows/ci.yml (added Prisma generate + validation steps)
  • .gitignore (added .validation/)

Test plan

  • npm run validate:csv prints report with 0 errors
  • npm run validate:csv -- --json outputs valid JSON
  • .validation/validation-report.json is generated correctly
  • CI pipeline passes: install → prisma generate → lint → validate → test
  • Corrupt prompts.csv header → script exits with code 1

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Introduced automated CSV validation for prompts with duplicate detection, content quality assessment, spam pattern identification, header validation, and dataset drift monitoring to ensure data quality and consistency across the system.
  • Chores

    • Enhanced CI pipeline to run automated validation checks on prompts during every build.
    • Updated project ignore rules to exclude validation report artifacts.

seyyed rashid khazeiynasab and others added 2 commits March 3, 2026 20:20
Add scripts/validate-prompts-csv.ts with 9 checks: header validation,
required fields, content quality, dataset drift, exact/near-duplicate
titles, near-duplicate prompts (two-stage bucket strategy with
sub-bucketing and length filters), and spam heuristics. Outputs both
console report and .validation/validation-report.json. Integrated into
CI between lint and test steps with --json flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ensures generated Prisma types are available before lint, validation,
and tests. While postinstall runs prisma generate implicitly, an
explicit step makes it visible and resilient to postinstall changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a CSV validation system for prompts. A new TypeScript script performs comprehensive validation of prompts.csv, checking headers, content quality, duplicates, and dataset metrics. The validation is integrated into the CI pipeline and generates persistent JSON reports stored in a .validation/ directory.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **CI/Workflow Configuration**<br>`.github/workflows/ci.yml` | Adds two new CI steps: schema generation via `npx prisma generate` after dependency installation, and CSV validation via `npm run validate:csv -- --json` after linting. |
| **Gitignore Configuration**<br>`.gitignore` | Adds the `.validation/` directory to the ignore list for validation report artifacts. |
| **Package Configuration**<br>`package.json` | Adds a `validate:csv` npm script invoking `scripts/validate-prompts-csv.ts`; adds the `csv-parse` package as both dependency and devDependency (^6.1.0). |
| **Validation Script**<br>`scripts/validate-prompts-csv.ts` | New comprehensive CSV validator with header/structure checks, content quality warnings, spam detection heuristics, exact and near-duplicate detection, dataset drift analysis, and JSON report generation. |

Sequence Diagram(s)

sequenceDiagram
    participant CI as CI Workflow
    participant Script as Validation Script
    participant CSV as prompts.csv
    participant FS as File System
    participant Console as Output

    CI->>Script: npm run validate:csv --json
    Script->>CSV: Load and parse CSV
    Script->>Script: Validate header structure
    Script->>Script: Validate each row (columns, types)
    Script->>Script: Run spam heuristics check
    Script->>Script: Check for exact duplicates
    Script->>Script: Check for near-duplicates
    Script->>Script: Analyze dataset drift
    Script->>FS: Write .validation/validation-report.json
    Script->>Console: Print JSON summary (--json flag)
    Script->>CI: Exit with code 0 or 1

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 hop-hop-hop, validating CSV with care,
Headers, duplicates, spam—let's check with flair!
Rows of prompts dance through my script so keen,
Near-duplicates caught in my bucketing machine,
Reports now written, the pipeline runs clean! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 64.29%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a CSV validation pipeline with CI integration. It directly reflects the primary modifications across the codebase. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
scripts/validate-prompts-csv.ts (4)

553-565: Consider capturing the error for more informative diagnostics.

The bare catch discards the underlying error. Including the error message would help diagnose file access issues (e.g., permissions vs. file not found).

Include error details
   try {
     content = await loadCsvFile(csvPath);
-  } catch {
+  } catch (err) {
+    const errorMsg = err instanceof Error ? err.message : String(err);
     report.issues.push({
       severity: "error",
       row: null,
       check: "file_read",
-      message: `Cannot read ${csvPath}`,
+      message: `Cannot read ${csvPath}: ${errorMsg}`,
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 553 - 565, The current bare
catch swallowing errors around loadCsvFile prevents useful diagnostics; change
the try/catch to catch the error (e.g., catch (err)) and include its
message/details in the report and/or logged output: update the report.issues
entry created in the catch of loadCsvFile to include the error string
(err.message or String(err)) in the message field (e.g., `Cannot read
${csvPath}: ${String(err)}`), keep setting report.errors and call
printReport(report, jsonOutput) and process.exit(1) as before; this uses the
existing loadCsvFile, report, printReport, and process.exit symbols so behavior
is identical except now the underlying error is surfaced.

31-37: Unused interface CsvRow.

The CsvRow interface is defined but never used in the code. The script works directly with string[] arrays from the CSV parser.

Remove unused interface
-interface CsvRow {
-  act: string;
-  prompt: string;
-  for_devs: string;
-  type: string;
-  contributor: string;
-}
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 31 - 37, Remove the unused
CsvRow interface; delete the "interface CsvRow { act: string; prompt: string;
for_devs: string; type: string; contributor: string; }" declaration and any
references to it, since the CSV parser produces string[] rows in this script. If
you prefer to keep typed rows instead, replace usages of string[] with CsvRow
and type the CSV parse result accordingly (e.g., cast/transform parser output to
CsvRow[]), but otherwise simply remove the unused CsvRow symbol to eliminate
dead code.

380-386: Silent gap in duplicate detection for bucket sizes 51-200.

Buckets with 51-200 entries are skipped entirely—too large for direct pairing but below the sub-bucketing threshold. This creates a detection gap where near-duplicates could be missed if they happen to land in a medium-sized bucket.

Consider either:

  1. Lowering SUB_BUCKET_THRESHOLD to match MAX_BUCKET_SIZE + 1 (51) so all large buckets get sub-bucketed
  2. Adding a log/warning when buckets in this range are skipped for observability
Option 1: Eliminate the gap
-const SUB_BUCKET_THRESHOLD = 200; // Split buckets larger than this into sub-buckets
+const SUB_BUCKET_THRESHOLD = MAX_BUCKET_SIZE; // Split all buckets exceeding MAX_BUCKET_SIZE

Then adjust the condition:

-      if (indices.length <= SUB_BUCKET_THRESHOLD) {
-        // Between MAX_BUCKET_SIZE and SUB_BUCKET_THRESHOLD: skip entirely
-        // (too large for direct pairing, too small to warrant sub-bucketing)
-        continue;
-      }
+      // Sub-bucket all large buckets
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 380 - 386, The code currently
skips buckets where indices.length is between MAX_BUCKET_SIZE+1 and
SUB_BUCKET_THRESHOLD creating a silent detection gap; fix by either setting
SUB_BUCKET_THRESHOLD = MAX_BUCKET_SIZE + 1 so all buckets > MAX_BUCKET_SIZE are
sub-bucketed, or change the conditional from checking SUB_BUCKET_THRESHOLD to
compare against MAX_BUCKET_SIZE (e.g., only continue when indices.length <=
MAX_BUCKET_SIZE), and add a processLogger.warn or console.warn when you
intentionally skip a medium-sized bucket (referencing indices,
SUB_BUCKET_THRESHOLD, MAX_BUCKET_SIZE and the continue) for observability.

598-610: Check numbering in comments is inconsistent.

"Check 7" appears twice (lines 598 and 601). Based on the PR description, the correct numbering should be:

  • Check 7: dataset drift
  • Check 8: duplicate titles
  • Check 9: near-duplicate prompts/titles
Fix comment numbering
   // Check 7: dataset drift (drop >10% = error, spike >30% = warning)
   report.issues.push(...checkDatasetDrift(records.length));

-  // Check 7: exact title duplicates
+  // Check 8: exact title duplicates
   report.issues.push(...findExactTitleDuplicates(prompts));

-  // Check 8: near-duplicate prompts (bucket strategy, similarity >= 0.90)
+  // Check 9: near-duplicate prompts (bucket strategy, similarity >= 0.90)
   const { issues: dupIssues, pairs: dupPairs } = findNearDuplicatePrompts(prompts);
   report.issues.push(...dupIssues);
   report.nearDuplicatePrompts = dupPairs;

-  // Check 9: near-duplicate titles
+  // (Also part of Check 8/9: near-duplicate titles)
   report.issues.push(...findNearDuplicateTitles(prompts));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 598 - 610, The inline
check-numbering comments are inconsistent; update the comments around the calls
to checkDatasetDrift, findExactTitleDuplicates, findNearDuplicatePrompts, and
findNearDuplicateTitles so they read: "Check 7: dataset drift" before
checkDatasetDrift(records.length), "Check 8: duplicate titles" before
findExactTitleDuplicates(prompts), and "Check 9: near-duplicate prompts/titles"
before the near-duplicate checks (findNearDuplicatePrompts and
findNearDuplicateTitles) to match the PR description.
.github/workflows/ci.yml (1)

31-32: Minor redundancy: prisma generate already runs via postinstall.

The postinstall script in package.json (line 20) already executes prisma generate during npm ci. This explicit step is harmless but redundant. Consider removing it or adding a comment explaining why it's kept (e.g., for visibility in CI logs).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 31 - 32, The "Generate Prisma schema"
CI step runs "npx prisma generate" but this is redundant because package.json's
"postinstall" already runs prisma generate during npm ci; either remove the step
in .github/workflows/ci.yml (the step with name "Generate Prisma schema" and
command "npx prisma generate") or keep it but add a short inline comment in the
workflow explaining it's retained only for CI log visibility, so readers know
it's intentional rather than accidental duplication with the "postinstall"
script in package.json.

ℹ️ Review info
Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6a0904c7-fc1e-48d0-a09d-f0bf8dbcc56d

📥 Commits

Reviewing files that changed from the base of the PR and between ec34d2e and 85a6021.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (4)
  • .github/workflows/ci.yml
  • .gitignore
  • package.json
  • scripts/validate-prompts-csv.ts

Comment on lines +138 to +145
if (typeVal && !VALID_TYPES.has(typeVal)) {
  issues.push({
    severity: "error",
    row: rowNumber,
    check: "value_constraint",
    message: `Invalid type value "${type.trim()}", expected one of: ${[...VALID_TYPES].join(", ")}`,
  });
}

⚠️ Potential issue | 🟡 Minor

Empty type field silently defaults to "TEXT" without validation warning.

When type is empty, the validation at line 138 is skipped (falsy check), and line 167 defaults to "TEXT". This behavior is inconsistent with other required field checks. If type is truly optional, this is fine; if it should be required, add validation.

Option: Validate empty type as an error (if required)
   // Value constraints
-  if (forDevsVal && !VALID_FOR_DEVS.has(forDevsVal)) {
+  if (!forDevsVal) {
+    issues.push({
+      severity: "error",
+      row: rowNumber,
+      check: "required_field",
+      message: "Empty for_devs value",
+    });
+  } else if (!VALID_FOR_DEVS.has(forDevsVal)) {

Similarly for type if it should be required.

Also applies to: 167-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/validate-prompts-csv.ts` around lines 138 - 145, The current logic
skips validation when typeVal is empty and later silently defaults to "TEXT";
add an explicit check for an empty or missing type and push an issues entry
(severity "error" or "warning" per spec) using the same structure as the
existing issues.push (include row: rowNumber, check: "required_field" or
"value_constraint", and a clear message that type is missing), rather than
allowing the later default to hide the problem; keep VALID_TYPES validation for
non-empty values and ensure the defaulting to "TEXT" only happens after
validation or only when the field is explicitly allowed to be optional.

@f
Owner

f commented Mar 5, 2026

Wow this looks useful! I'll review it shortly.
