fix(sync): deadlock on incremental sync with > 10 modified files #132

Open

sunnnybala wants to merge 1 commit into garrytan:master from sunnnybala:fix/sync-nested-transaction-deadlock

Conversation

@sunnnybala

Summary

Symptom: gbrain sync (incremental) hangs indefinitely when the diff touches more than 10 syncable files. --full always works. Reproduced with a 15-file commit: unpatched sync hangs forever, patched sync completes in 3.4 seconds on the same diff.

Cause: src/commands/sync.ts wraps the add/modify loop in engine.transaction() when the diff touches more than 10 files (the useTransaction branch). Each file in the loop then calls importFromContent, which opens another engine.transaction() on the same PGLite instance. PGLite transactions are not reentrant — the inner call queues on the same _runExclusiveTransaction mutex the outer is holding, producing a classic recursive-mutex deadlock. The main thread parks in ep_poll, the workers park in futex_wait_queue, and there is zero CPU advancement.

Fix: drop the outer transaction wrap. Each file's inner transaction is already atomic; per-file atomicity is also the right granularity (one file's failure should not roll back the others' successful imports).

Diff: 21 insertions, 21 deletions; no behavior change for diffs of 10 or fewer files, which always took the non-transaction branch.


Details

The deadlock chain

  1. Incremental sync enters the useTransaction branch (taken when added + modified exceeds 10 files) at sync.ts:225:
    if (useTransaction) {
      await engine.transaction(async () => { await processAddsModifies(); });
    }
  2. engine.transaction in pglite-engine.ts delegates to PGLite:
    async transaction<T>(fn) {
      return this.db.transaction(async (tx) => {
        const txEngine = Object.create(this) as PGLiteEngine;
        Object.defineProperty(txEngine, 'db', { get: () => tx });
        return fn(txEngine);
      });
    }
  3. Closure pitfall: the sync.ts callback is async () => { await processAddsModifies(); } — it does not take txEngine as a parameter. processAddsModifies is defined in the outer scope of performSync and closes over the original engine variable, not the txEngine passed into the callback. So every inner importFile(engine, …) call uses the original PGLiteEngine, whose this.db is the raw PGlite instance (not the outer tx object).
  4. Inside importFromContent (import-file.ts:95):
    await engine.transaction(async (tx) => {
      // putPage, upsertChunks, tag reconciliation, createVersion
    });
    This calls this.db.transaction(...) on the original PGlite instance — the same instance whose exclusive-transaction mutex is currently held by the outer transaction.
  5. PGLite's transaction() (from @electric-sql/pglite chunk-HDIMFN25.js):
    async transaction(s) {
      return await this._checkReady(),
        await this._runExclusiveTransaction(async () => {
          await /*BEGIN*/ call(this, "BEGIN");
          // ... await user callback ...
          // COMMIT / ROLLBACK
        })
    }
    _runExclusiveTransaction is a mutex. The inner call queues on the same mutex the outer is still holding. The outer can't release — it's awaiting the user callback, which is awaiting the inner transaction, which is queued forever.
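The non-reentrant queueing in step 5 can be reproduced in isolation. The sketch below is illustrative TypeScript: AsyncMutex, transaction, and demo are stand-ins modeled on the behavior described above, not PGLite's actual code. A nested transaction call queues behind the mutex its own caller is holding, so the inner promise never settles:

```typescript
// Minimal non-reentrant async mutex in the spirit of _runExclusiveTransaction
// (simplified stand-in, not PGLite's real implementation).
class AsyncMutex {
  private tail: Promise<void> = Promise.resolve();
  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn);           // start fn once the previous holder finishes
    this.tail = run.then(() => {}, () => {}); // the next caller queues behind this run
    return run;
  }
}

const mutex = new AsyncMutex();

// Stand-in for PGLite's transaction(): BEGIN/COMMIT elided, only the mutex matters.
function transaction<T>(fn: () => Promise<T>): Promise<T> {
  return mutex.runExclusive(fn);
}

async function demo(): Promise<string> {
  const nested = transaction(async () =>
    // The outer transaction still holds the mutex; this inner call
    // queues behind it forever — the deadlock from step 5.
    transaction(async () => 'inner done'),
  );
  return Promise.race([
    nested,
    new Promise<string>((resolve) => setTimeout(() => resolve('deadlocked'), 200)),
  ]);
}

demo().then((result) => console.log(result)); // prints "deadlocked", never "inner done"
```

The race resolves with "deadlocked" because the inner call is queued behind a holder that is itself awaiting the inner call.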

Reproduction

# From a clean gbrain-brain repo
cd /your/gbrain-brain
for f in $(ls people/*.md | grep -v README | head -11); do
  printf "\n<!-- repro -->\n" >> "$f"
done
git commit -am "repro: 11 non-README .md files"

# Unpatched: hangs forever. ep_poll on main thread, futex_wait on workers.
# Zero CPU advancement after entering transaction branch.
gbrain sync --no-pull --no-embed

# Workaround that already exists: --full uses performFullSync which
# goes through runImport() (a different code path without the outer wrap).
gbrain sync --full --no-pull --no-embed

Evidence captured live:

  • Process state: /proc/<pid>/task/*/wchan → main ep_poll, two worker threads futex_wait_queue
  • CPU ticks frozen: utime 205 stime 47 unchanged across three 2-second samples
  • After applying this patch: same 16-file diff completes in real 0m3.4s with status synced, all pages affected, chunks created

Why this has flown under the radar

  • Only triggers when a single git diff contains more than 10 files that pass isSyncable() (excludes README.md, index.md, schema.md, log.md, hidden dirs, non-.md/.mdx)
  • Most day-to-day commits are well under 10 files and take the else branch, bypassing the wrap entirely
  • gbrain import (bulk ingest) and first-sync go through performFullSync → runImport, a different code path with no outer wrap
  • sync --full works as an apparent workaround, leading users to blame "PGLite is slow" rather than report a bug
  • The hang is completely silent — no log, no error, no CPU. Looks indistinguishable from a very slow sync
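For reference, a predicate implementing the exclusions listed above could look like the following (a hypothetical sketch based only on the rules named in this PR; gbrain's actual isSyncable() may differ in details):

```typescript
// Hypothetical sketch of an isSyncable()-style filter: only non-hidden
// .md/.mdx files, excluding README.md, index.md, schema.md, and log.md.
function isSyncable(path: string): boolean {
  const parts = path.split('/');
  if (parts.some((p) => p.startsWith('.'))) return false; // hidden dirs/files
  const base = parts[parts.length - 1];
  if (!/\.(md|mdx)$/.test(base)) return false;            // non-.md/.mdx
  const excluded = new Set(['README.md', 'index.md', 'schema.md', 'log.md']);
  return !excluded.has(base);
}

console.log(isSyncable('people/alice.md'));  // true
console.log(isSyncable('people/README.md')); // false — excluded name
console.log(isSyncable('notes/.obsidian/x.md')); // false — hidden dir
```

Only files passing a filter like this count toward the > 10 threshold, which is why many large commits still failed to trigger the bug.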

Where it bites in production

  • Large initial backfills (importing months of daily notes, calendar events, email digests in one commit)
  • Citation-fixer / maintenance runs that touch many people/company pages at once — e.g. a dream-cycle job that patches 12+ pages for missing [Source: ...] citations
  • Repo migrations / mass rewrites where a single commit renames or reformats many files
  • Any agent that batches brain updates before a single git commit + gbrain sync

Why removing the outer wrap is safe

  • importFromContent is already atomic per file (its inner transaction commits the pages row, tags, chunks, and page_versions together)
  • The outer wrap did not add any atomicity guarantee beyond "all-or-nothing across files", which is actually worse semantics: a single malformed file would roll back every successfully imported file in the same sync run
  • Per-file atomicity means partial sync progress survives a transient failure mid-run, and the next sync picks up where the failed one left off
  • performFullSync path has always done per-file inner transactions with no outer wrap — this fix simply makes the incremental path match

The diff

-  // Process adds and modifies
-  const useTransaction = (filtered.added.length + filtered.modified.length) > 10;
-  const processAddsModifies = async () => {
-    for (const path of [...filtered.added, ...filtered.modified]) {
-      const filePath = join(repoPath, path);
-      if (!existsSync(filePath)) continue;
-      try {
-        const result = await importFile(engine, filePath, path, { noEmbed });
-        if (result.status === 'imported') {
-          chunksCreated += result.chunks;
-          pagesAffected.push(result.slug);
-        }
-      } catch (e: unknown) {
-        const msg = e instanceof Error ? e.message : String(e);
-        console.error(`  Warning: skipped ${path}: ${msg}`);
-      }
-    }
-  };
-
-  if (useTransaction) {
-    await engine.transaction(async () => { await processAddsModifies(); });
-  } else {
-    await processAddsModifies();
-  }
+  // Process adds and modifies.
+  //
+  // NOTE: do NOT wrap this loop in engine.transaction(). importFromContent
+  // already opens its own inner transaction per file, and PGLite transactions
+  // are not reentrant — they acquire the same _runExclusiveTransaction mutex,
+  // so a nested call from inside a user callback queues forever on the mutex
+  // the outer transaction is still holding. Result: incremental sync hangs in
+  // ep_poll whenever the diff crosses the old > 10 threshold that used to
+  // trigger the outer wrap. Per-file atomicity is also the right granularity:
+  // one file's failure should not roll back the others' successful imports.
+  for (const path of [...filtered.added, ...filtered.modified]) {
+    const filePath = join(repoPath, path);
+    if (!existsSync(filePath)) continue;
+    try {
+      const result = await importFile(engine, filePath, path, { noEmbed });
+      if (result.status === 'imported') {
+        chunksCreated += result.chunks;
+        pagesAffected.push(result.slug);
+      }
+    } catch (e: unknown) {
+      const msg = e instanceof Error ? e.message : String(e);
+      console.error(`  Warning: skipped ${path}: ${msg}`);
+    }
+  }

sync.ts wraps the add/modify loop in engine.transaction(), and each
importFromContent inside opens another one. PGLite's
_runExclusiveTransaction is a non-reentrant mutex — the second call
queues on the mutex the first is holding, and the process hangs forever
in ep_poll. Reproduced with a 15-file commit: unpatched hangs, patched
runs in 3.4s. Fix drops the outer wrap; per-file atomicity is correct
anyway (one file's failure should not roll back the others).