
[pull] main from danny-avila:main #104

Merged
pull[bot] merged 4 commits into innFactory:main from danny-avila:main
Apr 9, 2026

Conversation


pull[bot] commented Apr 9, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

github-actions Bot and others added 4 commits April 9, 2026 07:45
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
fix: Merge Custom Endpoints by Name Instead of Replacing Entire Array (#12586)

* fix: Merge Custom Endpoints by Name Instead of Replacing Entire Array

The DB base config's `endpoints.custom` array was wholesale-replacing
the YAML-derived array, causing YAML endpoint additions to be silently
lost after the first admin panel save. Add path-aware array merging
to `deepMerge` so keyed arrays (matched by `name`) are merged item-by-item
instead of replaced.

* fix: Harden mergeArrayByKey — deduplicate, sanitize, and prevent mutation

- Remove redundant sourceOrder array; iterate Map.keys() instead to
  prevent duplicate entries when source contains repeated names.
- Sanitize override-only appended items through deepMerge to enforce
  UNSAFE_KEYS prototype-pollution protection on all code paths.
- Shallow-copy unmatched base items to prevent mutation leak-back.
- Add post-OVERRIDE_KEY_MAP remapping note to ARRAY_MERGE_KEYS JSDoc.
- Add tests: duplicate source names, base mutation safety, multi-priority
  sequential merging of the same custom endpoint.
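The keyed merge described in the two commits above can be sketched as follows. `mergeArrayByKey` is named in the commit messages, but the signature and internals here are assumptions, and the UNSAFE_KEYS prototype-pollution sanitization (delegated to `deepMerge` in the real code) is omitted for brevity:

```typescript
type Keyed = Record<string, unknown>;

/**
 * Merge `source` items into `base`, matching by `key` (e.g. "name").
 * - Matched items are merged field-by-field, source winning per field.
 * - Unmatched base items are shallow-copied so callers cannot mutate base.
 * - Duplicate keys in `source` collapse last-write-wins.
 * - Keyless source items are skipped; keyless base items are preserved.
 */
function mergeArrayByKey(base: Keyed[], source: Keyed[], key: string): Keyed[] {
  // Deduplicate source by key; iterating the Map later avoids repeats.
  const byKey = new Map<unknown, Keyed>();
  for (const item of source) {
    const k = item[key];
    if (k === undefined) continue; // keyless source item: dropped
    const prev = byKey.get(k);
    byKey.set(k, prev ? { ...prev, ...item } : { ...item }); // last write wins
  }

  const result: Keyed[] = [];
  const consumed = new Set<unknown>();
  for (const item of base) {
    const k = item[key];
    const override = k !== undefined ? byKey.get(k) : undefined;
    if (override) {
      result.push({ ...item, ...override }); // item-by-item merge, source wins
      consumed.add(k);
    } else {
      result.push({ ...item }); // shallow copy: no mutation leak-back
    }
  }
  // Append override-only items in source order.
  for (const [k, item] of byKey) {
    if (!consumed.has(k)) result.push(item);
  }
  return result;
}
```

With this shape, a DB-saved `endpoints.custom` array no longer wholesale-replaces the YAML-derived one: endpoints that exist only in YAML survive the merge.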

* fix: Add inline comments and test for keyless source items

- Document keyless item drop behavior in mergeArrayByKey with inline
  comment and matching test case.
- Add last-write-wins comment to deduplication test assertion.
- Clarify path semantics in mergeArrayByKey target-iteration comment.

* fix: Relocate keyless-item comment and test target-side preserve

- Move keyless-item comment to the if-guard where the skip happens and
  clarify that target-side keyless items are preserved, not dropped.
- Add test verifying base items without a name field are kept in output.
fix: atomize Redis event sequence counters for multi-replica deployments (#12578)

* fix: atomize Redis event sequence counters for multi-replica deployments

Replace in-memory sequenceCounters Map with shared atomic Redis counter
(INCR via Lua script) so all replicas share an authoritative sequence
source. Make syncReorderBuffer async to read current sequence from Redis
instead of defaulting to 0 on non-publishing replicas.

Closes #12575
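The shared counter idea can be sketched against a minimal in-memory stand-in with the same INCR contract (the commits initially wrap INCR in a Lua script and later switch to plain INCR; the interface and names below are illustrative, and the real Redis calls are async where this stand-in is synchronous):

```typescript
interface CounterStore {
  incr(key: string): number; // Redis INCR contract: atomic, returns new value
  get(key: string): number | null;
}

function createMemoryStore(): CounterStore {
  const data = new Map<string, number>();
  return {
    incr(key) {
      const next = (data.get(key) ?? 0) + 1;
      data.set(key, next);
      return next;
    },
    get: (key) => data.get(key) ?? null,
  };
}

/** Next 0-based sequence for a stream, shared by every replica. */
function getNextSequence(store: CounterStore, streamId: string): number {
  // INCR yields a 1-based count; published sequences start at 0.
  return store.incr(`stream:seq:${streamId}`) - 1;
}
```

Because INCR is atomic, two replicas emitting concurrently can never hand out the same sequence number, and a late-joining replica can read the key instead of defaulting to 0.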

* fix: harden emitChunk error handling and syncReorderBuffer race safety

Wrap getNextSequence inside emitChunk's try/catch so a transient Redis
failure is logged-and-swallowed, preserving the non-fatal chunk emission
contract. In syncReorderBuffer, replace pending.clear() with selective
discard of stale entries (seq < currentSeq) followed by
flushPendingMessages, preventing loss of chunks that arrived during the
async GET window.

* fix: address review findings — harden error paths, validate types, DRY tests

- Wrap syncReorderBuffer in try/catch in GenerationJobManager.subscribe()
  so a Redis GET failure degrades gracefully instead of crashing SSE reconnect.
- Validate eval return type in getNextSequence; throw on unexpected type
  instead of silently computing NaN/-1 and dropping all messages.
- Add NaN guard to parseInt in syncReorderBuffer for corrupted Redis values.
- Consolidate two streams loops in destroy() into a single pass.
- Extract createMockPublisher to shared helpers/publisher.ts (DRY).
- Add tests for eval failure and unexpected eval return type in emitChunk.
- Document performance tradeoff (2x RTT per emit) and TTL rationale.

* test: add cross-replica integration tests for sequence desync fix (#12575)

Three real-Redis tests that reproduce the exact multi-replica failure:

- Late subscriber on a different replica syncs to the shared counter
  and receives chunks immediately (no 500ms force-flush).
- Multiple subscribe/unsubscribe cycles across replicas maintain
  correct sequence alignment on every reconnect.
- Shared counter key is cleaned up when the stream is destroyed.

* fix: close syncReorderBuffer race, stop destroy() from nuking shared keys

- Replace selective prune with Math.min(currentSeq, minPendingSeq) so
  chunks that arrive in pending during the async GET window are preserved
  instead of incorrectly pruned. Since unsubscribe() already clears
  pending, any entries at sync time are live messages from the current
  subscription and must not be discarded.
- Remove DEL from destroy() — the sequence key is shared across replicas
  and a shutting-down instance must not nuke it for active publishers.
  Keys expire naturally via their 1-hour TTL; cleanup() handles
  single-stream lifecycle teardown.
- Add race-condition test: pauses GET, injects a message into pending
  during the window, resolves GET, asserts the chunk is delivered.
- Add emitDone/emitError eval failure tests to cover the asymmetric
  error contract (chunk swallows, done/error propagates).
- Add explicit MockPublisher return type to helpers/publisher.ts.
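The `Math.min(currentSeq, minPendingSeq)` frontier rule from this commit can be sketched as follows (illustrative names; an explicit loop is used instead of `Math.min(...spread)` to avoid the spread allocation):

```typescript
/**
 * Choose the delivery frontier after an async GET of the shared counter:
 * the minimum of the counter value and the lowest pending sequence, so a
 * live chunk that arrived during the GET window is delivered, not pruned.
 */
function computeNextSeq(currentSeq: number, pending: Map<number, unknown>): number {
  let minPending = Infinity;
  for (const seq of pending.keys()) {
    if (seq < minPending) minPending = seq; // explicit loop, no heap alloc
  }
  return minPending === Infinity ? currentSeq : Math.min(currentSeq, minPending);
}
```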

* fix: context-aware syncReorderBuffer to prevent duplicate delivery, silence test logs

The Math.min(currentSeq, minPending) fix from the prior commit correctly
handles the cross-replica race (chunk arriving during async GET), but
causes duplicate delivery in same-replica mode: onAbort subscribes to
the Redis channel during createJob, so chunks published before
subscribe() may arrive via pub/sub AND earlyEventBuffer. The Math.min
logic then flushes the pub/sub copy as a "live" message.

Fix: add a clearPending parameter to syncReorderBuffer. The manager
passes true when earlyEventBuffer was replayed (same-replica: pending
entries are duplicates → clear them) and false/undefined when it was
not (cross-replica: pending entries are live → preserve via Math.min).

- Silence winston logger in stream test files via logger.silent = true,
  following the existing pattern in data-schemas/prompt.spec.ts.
- Remove sequence key DEL from destroy() — shared across replicas,
  a shutting-down instance must not nuke the counter for active
  publishers. Keys expire via TTL; cleanup() handles stream teardown.

* fix: share publisher client in cross-replica tests for Redis Cluster compat

The cross-replica tests created publisherB via (ioredisClient as Redis)
.duplicate(), which produces a plain Redis connection to a single node.
In Redis Cluster, GET/EVAL on this client can't follow MOVED redirects,
so syncReorderBuffer reads null → nextSeq=0 → chunks are buffered.

Fix: share ioredisClient as the publisher for both replicas. It's the
correct Cluster-aware client and handles slot routing automatically.
Only subscriber connections need to be separate (pub/sub requirement).

* fix: increase cross-replica test timeouts and use polling for CI cluster

Replace fixed 300ms delivery waits with polling (50ms intervals, 2s max)
to handle Redis Cluster's cross-node pub/sub broadcast latency in CI.
Increase subscription activation wait from 100ms to 500ms to match the
pattern used by existing same-instance tests.
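The polling pattern described here can be sketched as a small helper (an assumed shape, not necessarily the repo's actual test utility):

```typescript
/** Poll `condition` at a fixed interval until it holds or the deadline passes. */
async function waitFor(
  condition: () => boolean,
  { intervalMs = 50, timeoutMs = 2000 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() > deadline) {
      throw new Error("waitFor: condition not met before timeout");
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Unlike a fixed `setTimeout(300)` wait, this returns as soon as the chunk lands but tolerates slow cross-node pub/sub broadcast in CI up to the 2s cap.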

* chore: silence winston logs in remaining stream test files

Add logger.silent = true to GenerationJobManager, RedisJobStore, and
collectedUsage test files, matching the pattern already applied to
RedisEventTransport and reconnect-reorder-desync tests.

* fix: remove flaky cluster pub/sub tests, fix log suppression for resetModules

Remove two cross-replica transport-level integration tests that are
inherently non-deterministic in Redis Cluster: cluster pub/sub fan-out
is async across nodes, so pre-subscribe PUBLISHes can arrive at the
subscriber's node after SUBSCRIBE takes effect, causing random +/- 1
message counts. The core logic is already covered by deterministic unit
tests (mock publisher) and GenerationJobManager integration tests
(end-to-end with earlyEventBuffer). The cleanup test is retained.

Switch log suppression from logger.silent (which doesn't survive
jest.resetModules) to jest.spyOn(console, 'log').mockImplementation()
for files that use resetModules (GenerationJobManager, RedisJobStore,
collectedUsage). The logger.silent approach remains for files that
don't use resetModules (RedisEventTransport, reconnect-reorder-desync).

* fix: replace Lua EVAL+TTL with plain INCR, preserve live chunks during same-replica sync

P1: syncReorderBuffer with clearPending=true was unconditionally clearing
pending, dropping NEW chunks (seq >= currentSeq) from ongoing generation
that arrived via pub/sub during the async GET. Fix: selectively prune
only entries with seq < currentSeq (duplicates of earlyEventBuffer) and
flush remaining live entries.

P2: The 1-hour TTL on sequence keys could expire mid-stream during long
quiet periods (e.g., slow tool calls), restarting the counter from zero
and causing subscribers to silently drop all subsequent messages. Fix:
remove the Lua EVAL+EXPIRE script entirely and use plain Redis INCR —
no TTL. Keys are cleaned up explicitly by cleanup()/resetSequence()
when streams end. Orphaned keys from crashed processes are a few bytes
each, negligible compared to the production risk of TTL expiry.

- Update mock publisher helper: eval → incr
- Remove "unexpected eval type" tests (not applicable to incr)
- Update remaining eval error tests to reference incr

* fix: re-arm flush timeout after syncReorderBuffer when gaps remain

syncReorderBuffer clears flushTimeout before processing, but if pending
still has gaps after flushPendingMessages, no timeout is re-armed.
Without new messages arriving to trigger scheduleFlushTimeout, those
buffered entries sit indefinitely — stalling the stream after reconnect.
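The gap check that decides whether to re-arm the force-flush timer might look like this (illustrative; the actual code also manages the timer handle itself):

```typescript
/**
 * After a sync pass, a timer is needed exactly when something is buffered
 * but the next expected sequence has not arrived: only a new message or a
 * force-flush can unstick such a gap, so leaving no timer stalls the stream.
 */
function needsFlushTimer(nextSeq: number, pending: Map<number, unknown>): boolean {
  return pending.size > 0 && !pending.has(nextSeq);
}
```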

* chore: address final review — fix stale docs, rename clearPending, tighten tests

Fix all 10 findings from final review pass:

F1: Fix stale TTL comment in destroy() — no TTL exists after EVAL removal
F2: Fix stale EVAL reference in emitChunk JSDoc → INCR + PUBLISH
F3: Fix syncReorderBuffer JSDoc — describes old broken behavior, not
    the current selective prune
F4: Rename clearPending → pruneStaleEntries for clarity at call sites
F5: Add publish-not-called assertion to INCR failure test
F6: Replace Math.min(...spread) with explicit loop to avoid heap alloc
F7: Remove resetSequence from IEventTransport interface (no external
    callers; implementation detail only)
F8: Consolidate cleanup/resetSequence overlap — cleanup() owns the DEL,
    resetReorderBuffer() (now private) handles buffer state only
F9: Fix import order in reconnect test (value before type imports)
F10: Export MockPublisher interface from helpers/publisher.ts

* chore: fix stale resetSequence references and hadBufferReplay JSDoc

Two comments referenced resetSequence() which was removed in the F7/F8
refactor (now private resetReorderBuffer(), which doesn't DEL the key).
Only cleanup() deletes the Redis key. Also update hadBufferReplay JSDoc
to say "prune stale entries" instead of the pre-refactor "clear pending".

* chore: grammar

* fix: use earlyReplayCount as prune cutoff instead of Redis counter

The boolean pruneStaleEntries flag used currentSeq (from Redis GET) as
the prune threshold, but INCR can advance the counter past a live
chunk's seq during the GET window. Example: earlyEventBuffer held seqs
0-4, generation emits seq 5 during GET, INCR advances counter to 6,
GET returns 6, prune condition 5 < 6 deletes the live chunk.

Fix: pass the earlyEventBuffer replay count (5) as the prune cutoff.
Entries with seq < earlyReplayCount are true duplicates; entries at or
above are live regardless of what the Redis counter reads. After
pruning, the unified Math.min(currentSeq, minPending) logic handles
both same-replica and cross-replica paths correctly.

Add a targeted test that exercises this exact race: pauses GET, emits
seq 5 during the window, resolves GET with counter=6, asserts seq 5
is preserved (would have been dropped with the old boolean approach).
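The prune rule from this commit can be sketched directly from its worked example (seqs 0-4 replayed, seq 5 emitted during the GET window, counter already at 6). Names are illustrative:

```typescript
/**
 * Drop only entries below the early-buffer replay count: those are true
 * duplicates of what was already replayed. Anything at or above the count
 * is live, regardless of how far the Redis counter has advanced.
 */
function pruneReplayedDuplicates(
  pending: Map<number, string>,
  earlyReplayCount: number,
): void {
  for (const seq of [...pending.keys()]) {
    if (seq < earlyReplayCount) pending.delete(seq); // duplicate of replay
  }
}
```

Using the counter (6) as the cutoff would have deleted the live seq-5 chunk; using the replay count (5) keeps it.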

* fix: pass earlyReplayCount for skipBufferReplay path to prune pub/sub duplicates

When skipBufferReplay is true (resume scenario), earlyEventBuffer events
are delivered via the resume sync payload, not replayed directly. But
earlyReplayCount was left at 0, so syncReorderBuffer treated all pending
entries as live — meaning delayed pub/sub copies of those buffered
events could be delivered again, duplicating content.

Fix: capture earlyEventBuffer.length before the skip/replay branch so
the count is always passed to syncReorderBuffer regardless of delivery
method. Seqs 0..earlyReplayCount-1 are pruned as duplicates whether
they were replayed or delivered via sync payload.

* fix: keep nextSeq monotonic in syncReorderBuffer after async GET

handleOrderedChunk can deliver in-order messages and advance nextSeq
during the async GET window. If those messages leave pending empty,
the unconditional nextSeq = currentSeq could regress nextSeq below
its already-advanced value, reopening a delivered gap and causing
subsequent messages to be buffered until force-flush.

Fix: wrap both nextSeq assignments with Math.max(nextSeq, ...) so
syncReorderBuffer never moves the delivery frontier backward.

* fix: cap nextSeq at earlyReplayCount, add 24h safety TTL, add post-GET race test

Finding 1 (CRITICAL): When earlyReplayCount > 0 and pending is empty,
nextSeq was set to currentSeq — but INCR can advance the counter past
a live chunk whose pub/sub hasn't arrived yet. Cap at earlyReplayCount
instead (what was actually delivered), so in-flight chunks are not
skipped. Adds test for the message-arrives-AFTER-GET-resolves scenario.

Finding 3: Add a 24-hour safety-net TTL set once on first INCR only
(val === 1), never refreshed. This caps orphan lifetime from crashed
processes without risking mid-stream counter resets.

Finding 6: Replace setTimeout(100) in cleanup test with polling loop.

Finding 9: Fix variadic DEL mock + add expire mock for the new TTL.

Revert P2 fix (earlyReplayCount for skipBufferReplay path) — when
skipBufferReplay is true, the resume sync payload delivers everything
up to currentSeq, so syncReorderBuffer should trust the Redis counter
as the frontier, not the buffer length.
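The once-only safety-net TTL from Finding 3 can be sketched like this (the store interface is an assumption for illustration; the real calls are async Redis commands):

```typescript
/**
 * INCR the shared counter; arm the 24h orphan-cleanup TTL only on the very
 * first increment (val === 1) and never refresh it, so a long-lived stream
 * cannot have its counter expire mid-generation.
 */
function getNextSequenceWithTtl(
  store: { incr(key: string): number; expire(key: string, seconds: number): void },
  key: string,
): number {
  const val = store.incr(key);
  if (val === 1) {
    store.expire(key, 24 * 60 * 60); // first INCR on a fresh key only
  }
  return val - 1; // sequences are 0-based
}
```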

* chore: log expire failures consistently with other fire-and-forget errors
pull[bot] locked and limited conversation to collaborators Apr 9, 2026
pull[bot] added the ⤵️ pull label Apr 9, 2026
pull[bot] merged commit 7ef0339 into innFactory:main Apr 9, 2026
