Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,320 @@
# HANDOFF — Multi-Orchestrator Runtime · Learnings & Carry-Forward

**Date**: 2026-05-27
**Scope**: Post-Phase-B retrospective after #146, #148 (omadia) and
`omadia-byte5-plugins#3` shipped.
**Audience**: Next session picking up the platform's per-Agent routing
work — what we built, what surprised us, what we'd build differently.

---

## What shipped (live on Fly today)

| PR | Repo | Content |
|---|---|---|
| #146 | byte5ai/omadia | Multi-orchestrator runtime US1–US9 + Phase A chat routing + Phase B operator UX |
| #148 | byte5ai/omadia | A+B channel-directory capability + `/operator/channels` dashboard |
| #3 | byte5ai/omadia-byte5-plugins | `@omadia/channel-teams` 0.9.0 — per-conversation routing, mention-only, directory contribution |

Live on `https://odoo-bot-middleware.fly.dev` + `https://odoo-bot-harness.fly.dev`.

---

## Top 10 Learnings

### L1 · Per-Agent orchestrators built before kernel DomainTools exist → empty tool surface

The biggest single bug found post-Phase-B. The orchestrator plugin's
`activate()` runs as part of `toolPluginRuntime.activateAllInstalled()`,
which is **earlier** in boot than where the kernel assembles its
`DomainTool[]` set (sub-agent query tools like `query_odoo_accounting`,
`query_confluence`). Every per-Agent `Orchestrator` built by the
multi-orchestrator registry started with `domainTools: []`. Result: chat
against the fallback Agent could not reach **any** sub-agent — the
fallback was effectively brain-dead.

**Fix landed**: post-`dynamicAgentRuntime.activateAllInstalled()` the
kernel now walks `registry.list()` and `registerDomainTool` on each
per-Agent orchestrator. `OrchestratorRegistry.setOnAgentBuilt(...)`
re-runs the hydration on every future `add` / `rebuild` action.

**Lesson for next time**: a runtime-level component built per Agent
needs **either** all deps resolved at construct-time (impossible here
because of capability ordering), **or** a post-build hook for the
kernel to inject deferred deps. We chose the hook. Future per-Agent
deps (e.g. per-Agent tool permissions in the in-flight refactor)
should plug into the same `onAgentBuilt` callback rather than adding
a second one.

### L2 · `agent_plugins.config` is persisted but not yet runtime-applied

The Phase B dnd editor + per-plugin `setup_fields` drawer happily writes
per-Agent config to the `agent_plugins.config` JSONB column. The
operator sees the form, the DB row updates, the operator-channels page
shows the binding. **But `toolPluginRuntime.activateAllInstalled()`
still activates each plugin ONCE with the GLOBAL store-config** from
`installedRegistry`. Per-Agent runtime config is queued but not
delivered.

We explicitly enforced the **fallback Agent contract** (fallback's
config is force-emptied on save, the UI hides the drawer) so the
behaviour is consistent: today every Agent uses the store-config. The
moment per-(Agent × plugin) `PluginContext` lands, the fallback contract
stays (operator opted into "fallback = store-default by design") and
named Agents start honouring their saved configs.

**Lesson**: don't let the UI imply behaviour the runtime doesn't yet
implement. We surfaced the disconnect via the fallback-card notice and
documented it as carry-forward.

### L3 · Microsoft Teams' `conversation.id` semantics are non-obvious

| Operator's mental model | Teams reality |
|---|---|
| "I open another Teams tab" | Same conv-id (UI tabs do not segment conversations) |
| "I add the bot to another channel" | Same conv-id if all messages are replies to the same root post |
| "@-mention the bot in a different chat" | New conv-id only if the chat is a different group constellation OR a different Team-channel-root |
| "Two group chats with different people" | Two different conv-ids, both `19:<id>@thread.skype` |
| "Personal chat with bot" | One `a:<aad>` conv-id, stable |

We hit this hard during testing. The operator @-mentioned the bot from
what they believed was a different chat three times — same conv-id
each time. The fix wasn't code, it was understanding the platform.

**Documented in the operator-channels dashboard tooltip**: each entry
labels its `conversationType` (`channel` / `personal` / `groupChat`)
so the operator can correlate.

### L4 · Node ESM module cache breaks plugin hot-reload of sub-imports

The `DynamicChannelPluginResolver` busts plugin.js's import URL with
`?v=<token>` on `invalidate()`, so `plugin.js` reloads cleanly. BUT
plugin.js's `import { TeamsBot } from './teamsBot.js'` resolves to the
**same** absolute file URL on every load, and Node's module cache
returns the **first** `teamsBot.js` it ever loaded — even if a newer
version sits on disk at the same path.

We saw this concretely: 0.9.0 was uploaded + activate-logs ran fresh,
but the bot's `handleMessage` was still 0.8.0's implementation (no
mention-filter, no `inbound-meta` diagnostic). Cache bust on the
entry-point doesn't propagate to relative imports.

**Workaround in production**: `fly machine restart` after each plugin
upload. Wipes the whole Node cache, fresh import on next activate.

**Proper fix (deferred)**: extract uploads to a version-suffixed
directory (`<id>/<version>/`) which the code *already does*, but
combine it with **also** keeping the in-memory `cache` and `bustTokens`
maps invalidated transitively (delete on uninstall + on re-upload
re-activate). The current invalidate() handles channel-resolver
correctly; the bug is elsewhere — likely in the way the runtime keeps
a *bot-instance reference* alive across activations even after
deactivate() was called. Worth a focused debugging session.

### L5 · `replaceAgentPlugins` re-upserts stale rows if the UI echoes them back

Operator clicks Save with N plugins selected → PUT body contains all N
plugin ids → the route deletes rows missing from body, upserts each
listed row. If the UI includes "orphan" rows (plugin ids in
`agent_plugins` that no longer exist in the installed-plugin catalog),
those keep getting re-upserted on every save. "STALE" entries pile up.

**Fix landed**: dnd editor drops orphans from the save payload by
default + per-orphan "Keep" checkbox + bulk "Remove all" button. The
backend stayed unchanged — fix is correct at the UI level because
"explicit operator choice to keep" is the only safe signal.

**Lesson**: list-replace semantics in REST are correct, but the UI must
not echo every row back blindly. Filter what's clearly stale.

### L6 · `agent.updated_at` doesn't bump on plugin/binding writes

Initial Phase B used `key={agent.updated_at}` on `PluginsDnd` and
`BindingsEditor` to remount after server writes. `agents` table's
`updated_at` only changes on `UPDATE agents`, NOT on writes to
`agent_plugins` or `channel_bindings`. So saving plugins looked like a
no-op — local state stayed stale, operator thought save broke.

**Fix landed**: payload-hash key (`pluginsRevisionKey`,
`bindingsRevisionKey`) that incorporates the actual content of the
related tables. Remount whenever the editable data changes regardless
of which DB table changed.

**Lesson**: trust the actual data being edited, not a meta-timestamp
on a parent row.

### L7 · `validateSnapshot` rejects the entire snapshot when one plugin is "not installed"

`isInstalled === false` for any plugin throws `ConfigValidationError`,
which aborts the whole `registry.reload()`. If B1's first-boot seed
attached every catalog plugin (including some not in
`installedRegistry`), the registry would never finish loading. Hard to
diagnose because the failure is mid-snapshot and previous registry
state remains.

**Fix landed**: B1's `attachAllPlugins` only attaches `installed.list()`
entries (not catalog entries). B3d "Reset fallback" filters
`status !== 'errored'` so inactive-but-installed plugins stay attached.

**Lesson**: snapshot validators should ideally be **per-row** with
isolation (skip the offending row, log it, continue) instead of
all-or-nothing. Worth a small refactor on a future PR.

### L8 · Auto-merge via GraphQL is rate-limit-fragile; REST merge is the workhorse

GraphQL got rate-limited 5000/hr several times during the day's PR
churn — `gh pr merge --auto --squash --delete-branch` failed because it
goes through GraphQL. REST `PUT /repos/{}/pulls/{n}/merge` always
worked. Tracking via REST check-runs API also worked when GraphQL was
empty.

**Operational lesson**: when shipping >1 PR per hour, use REST endpoints
directly:
```bash
gh api -X PUT repos/owner/repo/pulls/N/merge -F merge_method=squash
gh api -X DELETE repos/owner/repo/git/refs/heads/<branch>
```
Auto-merge is convenience, not a hard requirement.

### L9 · `usePathname` not `window.location.pathname + popstate`

The stream-toast component watched route changes via
`window.location.pathname` + `popstate`. Next.js App Router does NOT
fire `popstate` on client-side `router.push` / `<Link>` clicks. So
after Chat → Memory the toast component still thought it was on /
and the toast never appeared.

**Fix**: `usePathname()` from `next/navigation` re-renders on every
client-side route change.

**Lesson**: any client component watching the route MUST use the
framework's hook, not the browser API. Burned a deploy on this one.

### L10 · Header z-index needs an explicit stacking context

Nav-cluster dropdown menus rendered behind the `/store` hero. The
dropdown had `z-50`, but only within the header's stacking context —
and the header had no `position` or `z-index`, so a sibling with
`position:relative` further down the DOM painted on top.

**Fix**: `relative z-50` on `<header>` lifts the whole stacking context
above main content.

**Lesson**: dropdowns that escape their parent box need a managed
z-axis from the page root, not just from the immediate parent.

---

## What's deferred (carry-forward)

1. **Per-(Agent × plugin) `PluginContext`** — the actual unlock for L2.
The runtime still activates plugins globally with store-config.
Per-Agent isolation needs a new lifecycle where each Agent owns its
own PluginContext instance with per-Agent config + per-Agent secrets
scope. Big refactor; tracked separately.

2. **Plugin module-cache fix (L4)** — the workaround (machine restart)
is operationally painful. Fix candidates:
- Use `import()` with a fresh URL containing the version suffix in
the file URL itself, not as query string — Node's cache then
misses for sibling imports too.
- Move each plugin to a Worker thread (heavy, isolates everything).
- Implement a custom ESM loader that namespaces by plugin version.

3. **Conversation-observer persistence** — currently in-memory. After
middleware restart `/operator/channels` shows only the bot-level
catch-all until each conversation receives a new message. Two
options: vault-JSON (cheap, lossy on multi-instance Teams), or
Postgres table (durable, joinable with `channel_bindings`).

4. **Azure Bot Service display name** — Teams app sideload manifest
was bumped to v1.3.0 with `name: omadia-agent`, BUT the @-mention
chip name in Teams comes from the Azure-side bot's display name,
which still reads `virtual-bitch`. Manual rename in Azure portal
needed; no code change available.

5. **byte5 channel webhook handlers via `channelResolver@1`** — Teams
is now wired (Plugin 0.9.0). Telegram (`@omadia/channel-telegram`)
still consumes the legacy `chatAgent@1`. Symmetrical change in the
Telegram plugin needed before Telegram conversations can be routed
per-Agent.

6. **Snapshot validation per-row isolation** (L7) — convert
`validateSnapshot`'s all-or-nothing throw into per-row skip+log so
one bad plugin can't take down the whole registry boot.

7. **Operator-side agent-config-per-plugin runtime semantics** — once
L2 lands, decide UX for "this Agent uses Odoo prod, that one uses
Odoo staging". The dashboard already persists; behaviour wakes up
when the runtime catches up.

8. **T043 structured-logging audit, T044 Notion docs sync, T045
two-Agent boot smoke, T046 quickstart e2e** — original Phase 12
tasks, untouched.

---

## Operational recipes (post-mortem-ready)

### Fresh deploy of an updated channel plugin

```bash
# 1. Build + ZIP
cd ~/sources/omadia-byte5-plugins/packages/channel-teams
npm run build
cd ../.. && node scripts/package-all.mjs channel-teams

# 2. Upload via Operator UI: /operator/agents → Plugins → Store →
# channel-teams → Update with zips/channel-teams-X.Y.Z.zip

# 3. Force fresh module cache (REQUIRED — see L4)
fly machine restart <id> -a odoo-bot-middleware

# 4. Verify boot log
fly logs -a odoo-bot-middleware --no-tail | grep -E "channel-key directory|Teams per-turn"
```

### Diagnose "operator/channels shows nothing"

```bash
# 1. Is the directory registered?
fly logs -a odoo-bot-middleware --no-tail | grep "channelDirectoryRegistry: registered"

# 2. Is the bot receiving messages?
fly logs -a odoo-bot-middleware --no-tail | grep "inbound conv="

# 3. Live endpoint check (auth'd browser)
GET https://odoo-bot-harness.fly.dev/bot-api/v1/operator/channels
```

### Diagnose "different chat, same conv-id"

L3 is the most common cause — operator believes two chats are different
but Teams treats them as one conversation. Confirm with:

```bash
fly logs -a odoo-bot-middleware --no-tail \
| grep "inbound conv=" | awk -F'conv=' '{print $2}' | awk '{print $1}' | sort -u
```

If the output has only one line across multiple operator-side tests,
the chats are the same conv from Teams' perspective. Solution: open a
genuinely different Teams chat (different team-channel, different
group-chat constellation) — not just a different tab/sidebar entry.

---

## One-liner to resume

```bash
cd ~/sources/odoo-bot-multi-orchestrator
git log --oneline -8
cat specs/002-chat-routes-via-registry/HANDOFF-2026-05-27-multi-orchestrator-learnings.md
# Pick from carry-forward 1–8 above; each entry has enough context to
# stand alone.
```

The platform is stable for one-bot operation with per-conversation
routing. The next operator-visible win is L2 (per-Agent configs
actually flowing into plugin activations) — that's also the unblock
for "Agent-X talks to Odoo prod, Agent-Y to Odoo staging" use cases.
Loading