perf: reduce DB bandwidth ~1.5 TB/day#572

Open
sethconvex wants to merge 7 commits into openclaw:main from
sethconvex:perf/reduce-db-bandwidth-v2

Conversation

@sethconvex
Contributor

Summary

Production DB reads are ~2 TB/day. This PR reduces bandwidth by ~75% (~1.5 TB/day) through indexed queries, denormalization, and query restructuring.

  • Phase 1: Add by_moderation compound index and rewrite 8 cron query functions from .filter().collect() full table scans to .withIndex() (~6 GB/day saved)
  • Phase 2: Denormalize isSuspicious onto skills table with 6 compound indexes so listPublicPageV2 filters at the index level instead of paginating the entire table (~1 TB/day saved)
  • Phase 3: Add latestVersionSummary denormalization to avoid reading full ~6.4 KB skillVersions docs on list pages (~500 GB/day saved)
  • Phase 4: Split trending leaderboard query to one day at a time (avoids 32K doc limit). Reduce global stats recount from hourly to daily (~400 MB/day saved)
  • Phase 5: Add "Convex Query & Bandwidth Rules" section to AGENTS.md
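The bandwidth win in Phase 1 comes from where the filtering happens. The following is a hypothetical in-memory model (the table shape and field values are illustrative, not the real schema): a `.filter().collect()` reads every document before filtering, so bandwidth scales with table size, while a compound index reads only the rows under the matching key.

```typescript
// Illustrative model of the Phase 1 rewrite. Field names mirror the PR's
// `by_moderation` index, but the data and shapes here are assumptions.
interface Skill {
  moderationStatus: string
  moderationReason?: string
}

const table: Skill[] = [
  { moderationStatus: 'active', moderationReason: 'scanner.vt.pending' },
  { moderationStatus: 'active' },
  { moderationStatus: 'removed' },
  { moderationStatus: 'active', moderationReason: 'scanner.vt.pending' },
]

// Old path: scan the whole table, filter in memory. docsRead = table size.
function fullScan(): { rows: Skill[]; docsRead: number } {
  const rows = table.filter(
    (s) => s.moderationStatus === 'active' && s.moderationReason === 'scanner.vt.pending',
  )
  return { rows, docsRead: table.length }
}

// New path: a compound index keyed on (moderationStatus, moderationReason)
// lets the query touch only matching rows. docsRead = result size.
const byModeration = new Map<string, Skill[]>()
for (const s of table) {
  const key = `${s.moderationStatus}|${s.moderationReason ?? ''}`
  byModeration.set(key, [...(byModeration.get(key) ?? []), s])
}

function indexed(): { rows: Skill[]; docsRead: number } {
  const rows = byModeration.get('active|scanner.vt.pending') ?? []
  return { rows, docsRead: rows.length }
}
```

Both paths return the same rows; only the number of documents read (and thus billed bandwidth) differs.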

Post-deploy steps

Run backfills after deploying:

bunx convex run maintenance:backfillIsSuspiciousInternal
bunx convex run maintenance:backfillLatestVersionSummaryInternal

No client changes required — all new schema fields are v.optional() and query return shapes are unchanged.

Test plan

  • bunx convex typecheck — no new type errors (26 pre-existing in unrelated files)
  • bun run test — all 604 tests pass
  • Deploy to dev, run backfills
  • Verify no bytesReadLimit/documentsReadLimit errors in Convex insights
  • Browse site on dev — list pages, filters, trending all work
  • Deploy to prod, run backfills, monitor bandwidth for 24h

🤖 Generated with Claude Code

…and query rewrites

Phase 1: Add `by_moderation` compound index and rewrite 8 cron query
functions to use `.withIndex()` instead of `.filter().collect()` full
table scans (~6 GB/day saved).

Phase 2: Denormalize `isSuspicious` onto skills table with 6 compound
indexes so `listPublicPageV2` can filter at the index level instead of
paginating the entire table (~1 TB/day saved). Includes backfill
mutation and write-path updates across all moderation mutations.

Phase 3: Add `latestVersionSummary` denormalization to avoid reading
full ~6.4 KB `skillVersions` docs on list pages (~500 GB/day saved).

Phase 4: Split trending leaderboard query to one day at a time to stay
under 32K doc limit. Reduce global stats recount from hourly to daily
since delta tracking handles real-time accuracy (~400 MB/day saved).

Phase 5: Add "Convex Query & Bandwidth Rules" section to AGENTS.md.

Backfill commands (run after deploy):
  bunx convex run maintenance:backfillIsSuspiciousInternal
  bunx convex run maintenance:backfillLatestVersionSummaryInternal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel
Contributor

vercel bot commented Mar 1, 2026

@sethconvex is attempting to deploy a commit to the Amantus Machina Team on Vercel.

A member of the Team first needs to authorize it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53f901a40e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

convex/skills.ts Outdated
// Graceful backfill handling: during backfill, isSuspicious is undefined for
// un-patched docs, which won't match eq('isSuspicious', false). If the indexed
// query returns an unexpectedly empty first page, retry with the old path + JS filter.
if (useNonsuspiciousIndex && filteredPage.length === 0 && !result.isDone && !initialCursor) {


P1 Badge Remove impossible guard from non-suspicious fallback

The backfill fallback for nonSuspiciousOnly is gated by !result.isDone, but when the new eq('isSuspicious', false) index has no rows (the common pre-backfill state), Convex returns page=[] with isDone=true. That means this branch never runs and listPublicPageV2 returns an empty page even though many clean skills still exist with isSuspicious unset, which can blank first-page non-suspicious views until the backfill completes.
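The guard logic Codex flags can be modeled with a small predicate (names are illustrative, not the real code): when an index has no matching rows, Convex returns an empty page with `isDone: true`, so a fallback gated on `!isDone` can never fire in the pre-backfill state.

```typescript
// Minimal model of the P1 finding. PageResult mimics the shape of a
// Convex pagination result; the guard functions are hypothetical.
interface PageResult {
  page: unknown[]
  isDone: boolean
}

// Original guard: requires !isDone, which is false when the index is empty.
function fallbackWithIsDoneGuard(r: PageResult, initialCursor: string | null): boolean {
  return r.page.length === 0 && !r.isDone && initialCursor === null
}

// Fixed guard: drop the !isDone check so the empty-index case falls back.
function fallbackWithoutGuard(r: PageResult, initialCursor: string | null): boolean {
  return r.page.length === 0 && initialCursor === null
}

// Pre-backfill state: the new index exists but holds no rows yet.
const emptyIndexPage: PageResult = { page: [], isDone: true }
```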


Contributor Author

Fixed in e352123 and aefaf1c — removed the !result.isDone guard and made the fallback fire on any page (not just the first) so cursors stay on the same index path.

Comment on lines +580 to +582
const latestVersion = hasSummary
? toPublicSkillListVersionFromSummary(skill.latestVersionSummary!, skill.latestVersionId)
: toPublicSkillListVersion(latestVersionDoc)


P2 Badge Keep version summary and latestVersionId consistent

This path now prefers skill.latestVersionSummary over loading skill.latestVersionId, but other mutations can retarget latestVersionId without rewriting the summary (for example updateTags in convex/skills.ts updates latestVersionId only). After a retag of latest, list responses can therefore return _id for one version while version/changelog still come from the previous summary.


Contributor Author

Fixed in e352123 — updateTags now writes latestVersionSummary when the latest tag is repointed to a different version.

convex/skills.ts Outdated
.withIndex('by_moderation', (q) =>
q.eq('moderationStatus', 'active').eq('moderationReason', 'scanner.vt.pending'),
)
.take(limit),


P2 Badge Avoid pre-limiting stale-reason candidates before VT check

Each moderation-reason bucket is truncated with .take(limit) before verifying whether the referenced version has vtAnalysis, so the query can return fewer than limit even when additional stale rows exist beyond that first slice. Because syncModerationReasons treats a short page as completion, this can cause the sync job to stop early and leave stale moderation reasons unsynced.
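The underfill Codex describes is easy to reproduce with a toy model (illustrative, not the repo's code): `take(limit)` before a post-filter can return fewer than `limit` survivors even when more exist further down the index, and over-fetching leaves headroom for filtered-out rows.

```typescript
// Model of the P2 finding. `passesVtCheck[i]` stands in for "the version
// referenced by candidate i still lacks vtAnalysis" — purely illustrative.
function staleCandidates(passesVtCheck: boolean[], limit: number, overFetch: number): number {
  // .take(limit * overFetch), then the post-filter
  const slice = passesVtCheck.slice(0, limit * overFetch)
  return slice.filter(Boolean).length
}

// 7 index rows, 4 of which would survive the VT check.
const rows = [false, true, false, true, true, false, true]
```

With `limit = 3` and no over-fetch, only 1 survivor comes back and the caller misreads the short page as completion; with `overFetch = 2` the page fills.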


Contributor Author

Fixed in e352123 — each bucket now takes limit * 2 to account for rows filtered out by the VT check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps bot commented Mar 1, 2026

Greptile Summary

This PR delivers a multi-phase bandwidth reduction (~1.5 TB/day) across the Convex backend by converting full-table scans to indexed queries, denormalizing two computed fields (isSuspicious, latestVersionSummary) onto the skills table, splitting the trending leaderboard query to avoid the 32K-doc limit, and reducing the global stats cron to daily.

Key changes:

  • Schema: adds isSuspicious (boolean) and latestVersionSummary (nested object) optional fields plus seven new compound indexes on skills.
  • listPublicPageV2: uses new by_nonsuspicious_* indexes when nonSuspiciousOnly is set; includes a graceful fallback to the old full-scan path during backfill.
  • 8 cron query functions: rewritten from .filter().collect() to .withIndex() using the new by_moderation compound index.
  • buildPublicSkillEntries: skips the full ~6 KB skillVersions doc read when latestVersionSummary is present on the skill.
  • Leaderboard: splits 7-day range .collect() into one equality query per day to stay under the 32K doc limit.
  • Backfills: two cursor-paginated self-scheduling mutations (backfillIsSuspiciousInternal, backfillLatestVersionSummaryInternal) to populate existing rows.
  • All isSuspicious mutation sites appear correctly updated to call computeIsSuspicious when moderationFlags or moderationReason is written.

Confidence Score: 4/5

  • Safe to merge with the recommended post-deploy backfills; no client-breaking changes and the backfill fallback guards the public list page during the transition window.
  • All new schema fields are optional, queries are correctly rewritten with proper index definitions, and mutation sites that write moderationFlags/moderationReason have been updated to keep isSuspicious in sync. The only non-critical concerns are a minor _creationTime approximation in toPublicSkillListVersionFromSummary, serial (vs parallel) daily leaderboard queries, and the backfill skip-if-present guard. No logic errors or data-loss risks identified.
  • convex/skills.ts — the toPublicSkillListVersionFromSummary function uses summary.createdAt for _creationTime; and the pagination fallback for the nonsuspicious index only activates on the first page, so clients paginating during backfill may see unexpectedly empty continuation pages.

Last reviewed commit: 417012d

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

if (!latestVersionId) return null
return {
_id: latestVersionId,
_creationTime: summary.createdAt,

_creationTime sourced from createdAt, not the real Convex field

_creationTime is Convex's internal insertion timestamp (set automatically by the DB), while createdAt is the application-set field stored in latestVersionSummary. In practice these are nearly identical since both are set to now during the same transaction, but they are semantically distinct fields — toPublicSkillListVersion correctly reads version._creationTime, whereas this path substitutes summary.createdAt.

Suggested change
_creationTime: summary.createdAt,
_creationTime: summary.createdAt, // approximates _creationTime (see: version.createdAt ≈ version._creationTime)

No behavioral impact expected, but worth a comment to make the approximation explicit and prevent future confusion if the two ever diverge.


Contributor Author

Fixed in e352123 — added clarifying comment noting the approximation.

Comment on lines 34 to 45
for (let day = startDay; day <= endDay; day++) {
const rows = await ctx.db
.query('skillDailyStats')
.withIndex('by_day', (q) => q.eq('day', day))
.collect()
for (const row of rows) {
const current = totals.get(row.skillId) ?? { installs: 0, downloads: 0 }
current.installs += row.installs
current.downloads += row.downloads
totals.set(row.skillId, current)
}
}

Sequential daily queries could be parallelized

The 7 daily queries are issued serially with await inside the for loop, adding 7 round-trips instead of 1. Since there are no cross-day dependencies, these can be parallelized with Promise.all for better latency in the cron job:

Suggested change
for (let day = startDay; day <= endDay; day++) {
const rows = await ctx.db
.query('skillDailyStats')
.withIndex('by_day', (q) => q.eq('day', day))
.collect()
for (const row of rows) {
const current = totals.get(row.skillId) ?? { installs: 0, downloads: 0 }
current.installs += row.installs
current.downloads += row.downloads
totals.set(row.skillId, current)
}
}
const dayKeys = Array.from({ length: endDay - startDay + 1 }, (_, i) => startDay + i)
const perDayRows = await Promise.all(
dayKeys.map((day) =>
ctx.db
.query('skillDailyStats')
.withIndex('by_day', (q) => q.eq('day', day))
.collect(),
),
)
for (const rows of perDayRows) {
for (const row of rows) {
const current = totals.get(row.skillId) ?? { installs: 0, downloads: 0 }
current.installs += row.installs
current.downloads += row.downloads
totals.set(row.skillId, current)
}
}

This is low-priority for a cron job, but it's an easy win.


Contributor Author

Fixed in e352123 — parallelized with Promise.all.


let patched = 0
for (const skill of page) {
if (skill.latestVersionSummary) continue

Backfill skips skills with existing latestVersionSummary

The guard if (skill.latestVersionSummary) continue means any skill that already has a summary (e.g. one published after this PR is deployed but before the backfill completes) will be silently skipped and never re-evaluated by the backfill. This is the intended behavior for an initial one-shot populate, but it means if a summary was somehow written with wrong data, re-running the backfill won't fix it. Worth noting in the JSDoc so future operators know to clear latestVersionSummary first if a re-backfill is ever needed.
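The two backfill strategies can be contrasted with a sketch (the field shapes are assumptions — in particular, it presumes the summary records which version it was built from): the skip-if-present guard never repairs a stale summary, while a reconcile check re-derives whenever the summary no longer matches the skill's current latestVersionId.

```typescript
// Hypothetical row shape; the real latestVersionSummary fields may differ.
interface SkillRow {
  latestVersionId: string
  latestVersionSummary?: { versionId: string }
}

// Original guard: `if (skill.latestVersionSummary) continue`
function needsPatchSkipStyle(s: SkillRow): boolean {
  return s.latestVersionSummary === undefined
}

// Reconcile: patch when the summary is missing OR built from a version
// other than the current latestVersionId.
function needsPatchReconcile(s: SkillRow): boolean {
  return s.latestVersionSummary?.versionId !== s.latestVersionId
}

// A skill whose latest tag was repointed after the summary was written.
const stale: SkillRow = { latestVersionId: 'v2', latestVersionSummary: { versionId: 'v1' } }
```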


Contributor Author

Fixed in e352123 — backfill now reconciles against the current latestVersionId instead of skipping. Updated JSDoc accordingly.

- Replace ArkType `[inferred]` type alias for ClawdisSkillMetadata with
  an explicit interface so TS can see envVars/dependencies/author/links
- Extract listBySkillHandler from comments.ts so tests can call it
  directly without accessing private _handler property
- Rebuild packages/schema dist output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b904504176



let patched = 0
for (const skill of page) {
if (skill.latestVersionSummary) continue


P2 Badge Reconcile existing summaries during latest-version backfill

backfillLatestVersionSummaryInternal skips any row that already has latestVersionSummary, so it cannot repair skills where latestVersionId was later repointed (for example by tag updates) and the denormalized summary is now stale. In that scenario list responses can keep returning _id for one version and summary fields from another indefinitely because this backfill will never patch those rows; the migration should validate/rewrite summary data against the current latestVersionId instead of unconditionally continuing.


Contributor Author

Fixed in e352123 — same fix as above, backfill always re-evaluates against current latestVersionId.

sethconvex and others added 2 commits March 1, 2026 11:22
The explicit ClawdisSkillMetadata interface (needed because ArkType's
[inferred] doesn't resolve all fields) now has a keyof-based type guard
that triggers a compile error if the interface keys drift from the schema.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix P1: remove !result.isDone guard from listPublicPageV2 backfill
  fallback so it fires when the nonsuspicious index is empty (isDone=true)
- Fix updateTags to update latestVersionSummary when repointing latest tag
- Parallelize leaderboard daily queries with Promise.all
- Over-fetch stale-reason candidates (2x limit) before VT filtering
- Reconcile existing latestVersionSummary in backfill instead of skipping
- Add _creationTime approximation comment
- Rebuild schema dist to include author field on ClawdisSkillMetadata

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3521239b6


convex/skills.ts Outdated
// query returns an unexpectedly empty first page, retry with the old path + JS filter.
// Note: when the index has no matching rows, Convex returns isDone=true, so we
// intentionally omit the !result.isDone guard here to ensure the fallback fires.
if (useNonsuspiciousIndex && filteredPage.length === 0 && !initialCursor) {


P2 Badge Keep pagination cursor on the same index path

When nonSuspiciousOnly hits the first-page fallback (!initialCursor), this branch switches to fallbackPaginate and returns a cursor generated from SORT_INDEXES. On the next page request, runPaginate goes back to NONSUSPICIOUS_SORT_INDEXES, so that cursor is incompatible, paginateWithStaleCursorRecovery resets to page 1 (runPaginate(null)), and users can no longer paginate past the first page during backfill windows where isSuspicious is still mostly unset.
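The cursor incompatibility can be modeled with a toy resume function (names and cursor shape are illustrative): a pagination cursor is only valid for the index path that produced it, so if page 1 is served from the fallback path but page 2 decodes the cursor against the non-suspicious path, stale-cursor recovery resets to the first page.

```typescript
// Toy model of the P2 cursor-path mismatch; the real cursor encoding differs.
type IndexPath = 'SORT_INDEXES' | 'NONSUSPICIOUS_SORT_INDEXES'

interface Cursor {
  path: IndexPath
  offset: number
}

// Stale-cursor recovery: an incompatible cursor restarts pagination at 0.
function resumeOffset(cursor: Cursor, activePath: IndexPath): number {
  return cursor.path === activePath ? cursor.offset : 0
}

// A cursor minted by the fallback (SORT_INDEXES) path on page 1.
const fromFallback: Cursor = { path: 'SORT_INDEXES', offset: 50 }
```

Decoding that cursor against the non-suspicious path lands back on page 1, which is why the fix keeps every page on the same index path.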


Contributor Author

Fixed in aefaf1c — fallback now fires on any page and reuses the client cursor via paginateWithStaleCursorRecovery, keeping cursors on the same SORT_INDEXES path across paginations.

The nonsuspicious index fallback now fires on any page (not just the
first) and reuses the client's cursor via stale-cursor recovery. This
prevents pagination from breaking when a SORT_INDEXES cursor is sent
back to the NONSUSPICIOUS_SORT_INDEXES path on page 2+.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aefaf1c798


// un-patched docs, which won't match eq('isSuspicious', false). Fall back to
// SORT_INDEXES + JS filter. This fires on ANY page (not just the first) so that
// cursors stay on the same index path across client paginations.
if (useNonsuspiciousIndex && filteredPage.length === 0) {


P1 Badge Fall back when non-suspicious index page is underfilled

In listPublicPageV2, the backfill fallback only runs when filteredPage.length === 0, but during isSuspicious backfill many non-suspicious rows still have isSuspicious unset and are therefore absent from the eq('isSuspicious', false) index. If a page contains even a few indexed rows, this branch is skipped, so the query can return a partial page and even isDone: true while additional eligible rows still exist outside that index slice; clients will stop paginating and silently miss valid results until backfill completes.


Contributor Author

Acknowledged — this is a known transient edge case during the backfill window. The partial pages contain correct data (no false results), just fewer items than expected. The backfill self-schedules and completes in minutes, after which the index is fully populated and pages are full-sized. We considered always falling back when underfilled, but that would negate the index benefit for any page that happens to have fewer results than numItems (e.g. the last page). The brief underfill during backfill is an acceptable trade-off vs. adding complexity for a one-time migration.
