Memory

Osaurus has a persistent, on-device memory system that learns from your conversations and surfaces relevant context only when it actually helps. Memory runs in the background, stores very little, and injects at most ~800 tokens (or none at all) per turn, instead of v1's firehose-style context stuffing.

The mental model is simple: a smart secretary that knows what you've discussed and surfaces only what matters right now — not a tape recorder.

Not the same as per-agent DB. Memory is global across all your chats — it's identity, pinned facts, and episodic recall derived from natural conversation. For an agent that maintains its own private structured store (custom tables, soft-deletes, saved views), see Agent DB & Self-Scheduling. The two are orthogonal: an agent can use neither, either, or both.


Getting Started

  1. Open the Management window (⌘ Shift M) → Memory
  2. Memory is enabled by default — toggle it off in the Memory settings if you prefer stateless conversations
  3. The core model for distillation defaults to foundation (Apple's on-device Language Model on macOS 26+) — change it in Settings → General if you'd rather use a remote model like anthropic/claude-haiku-4-5
  4. Start chatting — sessions are distilled in the background once they end

No manual tagging, saving, or annotation is required.


Three Layers + Transcript

┌──────────────────────────────────────────────────────────────────┐
│                      Memory System (v2)                          │
├──────────────────────────────────────────────────────────────────┤
│  Identity                                                        │
│  Stable user facts: explicit overrides + auto-derived narrative. │
├──────────────────────────────────────────────────────────────────┤
│  Pinned Facts                                                    │
│  Salience-scored facts promoted from session distillations.      │
│  Decayed and evicted by the consolidator.                        │
├──────────────────────────────────────────────────────────────────┤
│  Episodes                                                        │
│  Per-session digests: summary, topics, decisions, entities.      │
│  Replaces the v1 separate "working memory" and "summaries"       │
│  tables.                                                         │
├──────────────────────────────────────────────────────────────────┤
│  Transcript                                                      │
│  Raw conversation turns kept for fallback retrieval only.        │
│  Never default-injected into context.                            │
└──────────────────────────────────────────────────────────────────┘

Identity

A single row containing two fields:

  • Overrides — explicit user-authored facts ("My name is Terence", "Always reply in English"). Always surfaced in context. Authored via the Memory → Your Overrides UI.
  • Content — auto-derived narrative ("User builds Swift apps for macOS, prefers Postgres, lives in PT timezone."). Regenerated by the consolidator from accumulated identity-grade signals.

Pinned Facts

The promotable pool. Each fact carries:

  • content — the fact itself, in plain text
  • salience — score in [0, 1]. Decayed weekly (exp(-Δdays / 30)). Evicted below the floor (default 0.2) once idle for 30+ days.
  • sourceCount — number of episodes that mention it
  • useCount / lastUsed — bumped every time the planner surfaces the fact in context
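
The decay and eviction rules above can be sketched in a few lines. This is an illustration only: the `PinnedFact` shape, the field `days_since_last_used`, and the helper names are hypothetical, and treating Δdays as "days since the last decay pass" is an assumption. Only the formula, the 0.2 floor, and the 30-day idle window come from the text.

```python
import math
from dataclasses import dataclass

DECAY_TIME_CONSTANT_DAYS = 30.0   # from exp(-Δdays / 30)
SALIENCE_FLOOR = 0.2              # documented default floor
IDLE_DAYS_BEFORE_EVICTION = 30    # "idle for 30+ days"


@dataclass
class PinnedFact:
    """Hypothetical in-memory shape; the real row layout is not shown here."""
    content: str
    salience: float
    days_since_last_used: float


def decayed_salience(salience: float, delta_days: float) -> float:
    """Apply salience *= exp(-delta_days / 30)."""
    return salience * math.exp(-delta_days / DECAY_TIME_CONSTANT_DAYS)


def should_evict(fact: PinnedFact, floor: float = SALIENCE_FLOOR) -> bool:
    """Evict only when the fact is both below the floor and idle long enough."""
    return (
        fact.salience < floor
        and fact.days_since_last_used >= IDLE_DAYS_BEFORE_EVICTION
    )
```

With a 30-day time constant, a fact that is never reinforced keeps about 37% of its salience after 30 days (a factor of 1/e), so a fact starting at 0.5 would drop below the default floor in roughly four weeks.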

Episodes

One per session. Created by MemoryService.distillSession once the writer's debounce expires (default 60s of inactivity) or when the user navigates away. Each episode contains:

  • summary — one to three sentences
  • topics, entities — comma-separated lists
  • decisions, actionItems — newline-separated bullet items
  • salience — model-assigned score, decayed by the consolidator

Transcript

Raw user/assistant turns. Never injected into the default context block. Used only when the user asks for literal recall ("what exactly did I say...") or as a fallback search via the transcript scope of search_memory.

Memory vs. SOUL.md (sandbox)

Memory and SOUL.md are separate surfaces by design — do not cross-pollinate them.

| | Memory | SOUL.md |
|---|---|---|
| Author | Distilled from conversations by Osaurus | The agent itself, via sandbox_write_file (whole-file write or in-place edit) |
| Scope | Session facts, episodes, user identity | Stable preferences and patterns the agent learned about working with you |
| Update cadence | Background after each session | Whenever the agent observes a durable pattern |
| Where it lands | Prepended to the latest user message (volatile) | Static section in the system prompt (KV-cacheable) |
| Available in | Every chat | Sandbox mode only |

If a fact belongs to a session ("we decided to use Postgres for the demo"), memory owns it. If a fact is a preference the agent should keep applying ("user prefers Postgres for new projects"), the agent's SOUL.md owns it. See SANDBOX.md for the full SOUL.md contract.


Write Path: Deferred and Debounced

[user + assistant turn]
         │
         ▼
[buffered as pending_signal]   ◄── per-turn cost: one SQL insert. No LLM.
         │
         ▼
   debounce 60s
         │  (or session-change / nav-away → flush immediately)
         ▼
[novelty gate: combined chars >= 80?]
         │
         ▼
[ONE LLM call: distill the whole session]
         │
         ▼
{episode + entities + pinned candidates + identity delta}
         │
         ├──► insert Episode (atomically marks signals processed)
         ├──► insert PinnedFact for each candidate that passes Jaccard dedup
         └──► append identity overrides for any new identity-grade facts

The hot path (bufferTurn) is a single SQL insert and a debounce arm. No LLM call ever runs synchronously with chat. Distillation is one LLM call per session (often covering 10+ turns), not one per turn.

When the core model isn't configured, signals stay pending — recoverOrphanedSignals (called at startup) and syncNow (called from the Memory UI) drain them once a model is available.
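
The buffering, debounce, and novelty-gate steps above might look roughly like the following. This is a minimal sketch, not the Osaurus implementation: `SessionBuffer`, its methods, and the `distill` callback are hypothetical names, time is passed in explicitly to keep the example deterministic, and dropping sessions that fail the novelty gate is an assumption the text does not confirm.

```python
from dataclasses import dataclass, field

DEBOUNCE_SECONDS = 60     # documented default inactivity window
NOVELTY_MIN_CHARS = 80    # novelty gate threshold from the diagram


@dataclass
class SessionBuffer:
    """Hypothetical stand-in for the pending_signal buffer of one session."""
    pending: list = field(default_factory=list)   # (user, assistant) pairs
    last_turn_at: float = 0.0

    def buffer_turn(self, user: str, assistant: str, now: float) -> None:
        # Hot path: record the turn and re-arm the debounce. No LLM call.
        self.pending.append((user, assistant))
        self.last_turn_at = now

    def ready_to_distill(self, now: float, session_ended: bool = False) -> bool:
        if not self.pending:
            return False
        debounced = now - self.last_turn_at >= DEBOUNCE_SECONDS
        return session_ended or debounced

    def flush(self, distill):
        # Take everything buffered so far and reset the buffer.
        turns, self.pending = self.pending, []
        total_chars = sum(len(u) + len(a) for u, a in turns)
        if total_chars < NOVELTY_MIN_CHARS:
            # Assumption: too little content, so skip the LLM call entirely.
            return None
        # One distillation call for the whole session, not one per turn.
        return distill(turns)
```

The point of the shape is that `buffer_turn` stays cheap no matter how long the session runs, while the expensive work is deferred to a single `flush`.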


Read Path: Gated and Single-Section

[incoming user message]
         │
         ▼
[Relevance Gate: heuristic + optional LLM fallback]
   ├── pronouns / "we discussed" / "remember when"  → episode
   ├── "what's my name" / "who am I"                 → identity
   ├── entity-name hit in graph                      → pinned
   ├── "do you remember my preference"               → pinned
   ├── "exact words" / "verbatim"                    → transcript
   └── nothing fired                                 → none (skip memory)
         │
         ▼
[Memory Planner: fetches the chosen section under the budget]
         │
         ▼
[Compact memory block ≤ memoryBudgetTokens (default 800)]
   + always-on Identity Overrides (tiny)
         │
         ▼
[Prepend to the latest user message]

The cache layer holds the assembled block for 10s per (agent, query) pair so retries don't re-run the gate.
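
To make the heuristic gate concrete, here is a small sketch of a rule table that maps a message to at most one section. The patterns, their priority order, and the `pick_section` helper are all hypothetical; only the trigger phrases and section names come from the diagram. The pronoun rule, the optional LLM fallback, and the 10s cache are left out.

```python
import re

# Ordered (pattern, section) rules; the first match wins.
# The ordering is an assumption made for this sketch.
RULES = [
    (re.compile(r"\b(exact words|verbatim)\b", re.I), "transcript"),
    (re.compile(r"\b(what'?s my name|who am i)\b", re.I), "identity"),
    (re.compile(r"\bmy preference", re.I), "pinned"),
    (re.compile(r"\b(we discussed|remember when)\b", re.I), "episode"),
]


def pick_section(message: str, known_entities=frozenset()) -> str:
    """Return the single memory section to fetch, or "none" to skip memory."""
    for pattern, section in RULES:
        if pattern.search(message):
            return section
    # Entity-name hit: any known entity mentioned in the message.
    lowered = message.lower()
    if any(name.lower() in lowered for name in known_entities):
        return "pinned"
    return "none"
```

For example, `pick_section("Write a haiku about rain")` returns `"none"`, which corresponds to the zero-token case where no memory block is built at all.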


Consolidation: Background Maintenance

MemoryConsolidator runs on a low-priority background task every consolidationIntervalHours (default 24h) and on the explicit Run Consolidation Now button in the Memory UI. Each pass:

  1. Decay — salience *= exp(-Δdays / 30) for both pinned facts and episodes.
  2. Merge — collapse near-duplicate episodes (Jaccard ≥ 0.9 over summary+topics) within the same agent. Keeps the older episode, deletes the newer near-dup.
  3. Promote — boost salience on pinned facts whose content overlaps with ≥ 3 recent episodes.
  4. Evict — delete pinned facts below salienceFloor that have been idle for 30+ days.
  5. Prune — drop episodes and transcript turns older than episodeRetentionDays.
  6. Purge — trim old processing_log rows.

Consolidation never runs on the request path, so chat latency is unaffected.
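
The merge step can be illustrated with a plain Jaccard comparison over the words in each episode's summary and topics. This is a sketch under stated assumptions: the `Episode` shape, whitespace tokenization, and the function names are hypothetical, and the real consolidator may tokenize differently. The 0.9 threshold and the keep-the-older rule come from the list above.

```python
from dataclasses import dataclass

MERGE_THRESHOLD = 0.9  # Jaccard >= 0.9 counts as a near-duplicate


@dataclass
class Episode:
    """Hypothetical shape holding only the fields the merge step needs."""
    id: int
    created_at: float
    summary: str
    topics: str  # comma-separated, as described for episodes


def token_set(ep: Episode) -> set:
    text = f"{ep.summary} {ep.topics.replace(',', ' ')}"
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def merge_near_duplicates(episodes: list) -> list:
    """Walk episodes oldest-first and drop any newer near-duplicate."""
    kept = []
    for ep in sorted(episodes, key=lambda e: e.created_at):
        tokens = token_set(ep)
        if any(jaccard(tokens, token_set(k)) >= MERGE_THRESHOLD for k in kept):
            continue  # newer near-dup: the older episode stays
        kept.append(ep)
    return kept
```

Processing oldest-first means the surviving episode is always the earliest one in a cluster of near-duplicates, matching the "keeps the older episode" rule.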


Configuration Reference

The full configuration lives in ~/.osaurus/config/memory.json and is editable from the Memory tab in the Management window.

| Setting | Default | Range | Description |
|---|---|---|---|
| enabled | true | true/false | Master toggle |
| embeddingBackend | mlx | mlx / none | Embedding backend. none falls back to SQLite text matching. |
| embeddingModel | nomic-embed-text-v1.5 | | Model used by VecturaKit |
| extractionMode | sessionEnd | sessionEnd / manual | When the writer runs distillation |
| relevanceGateMode | heuristic | off / heuristic / llm | How the read path decides whether to inject memory |
| memoryBudgetTokens | 800 | 100 – 4,000 | Single overall budget for the dynamic section |
| summaryDebounceSeconds | 60 | 10 – 3,600 | Inactivity period before distillation |
| consolidationIntervalHours | 24 | 1 – 168 | How often the consolidator runs |
| salienceFloor | 0.2 | 0.0 – 1.0 | Pinned facts below this and idle 30+ days are evicted |
| episodeRetentionDays | 365 | 0 – 3,650 | How long episodes/transcript are kept (0 = forever) |

That's the entire surface. v1's 18 knobs (mmrLambda, mmrFetchMultiplier, verification*Threshold, per-section budgets, recall topK, profile regen thresholds, max entries per agent, …) are all gone.
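
If you edit memory.json by hand, a quick sanity check against the table above can catch out-of-range values. The sketch below assumes the JSON keys match the setting names exactly, which this document does not confirm; `DEFAULTS`, `CHOICES`, `NUMERIC_RANGES`, and `validate` are hypothetical helpers, not part of Osaurus.

```python
# Defaults and ranges copied from the configuration table.
DEFAULTS = {
    "enabled": True,
    "embeddingBackend": "mlx",
    "embeddingModel": "nomic-embed-text-v1.5",
    "extractionMode": "sessionEnd",
    "relevanceGateMode": "heuristic",
    "memoryBudgetTokens": 800,
    "summaryDebounceSeconds": 60,
    "consolidationIntervalHours": 24,
    "salienceFloor": 0.2,
    "episodeRetentionDays": 365,
}

CHOICES = {
    "enabled": (True, False),
    "embeddingBackend": ("mlx", "none"),
    "extractionMode": ("sessionEnd", "manual"),
    "relevanceGateMode": ("off", "heuristic", "llm"),
}

NUMERIC_RANGES = {
    "memoryBudgetTokens": (100, 4000),
    "summaryDebounceSeconds": (10, 3600),
    "consolidationIntervalHours": (1, 168),
    "salienceFloor": (0.0, 1.0),
    "episodeRetentionDays": (0, 3650),
}


def validate(config: dict) -> list:
    """Return a list of human-readable problems; empty means the config looks fine."""
    problems = []
    for key, value in config.items():
        if key not in DEFAULTS:
            problems.append(f"{key}: unknown setting")
        elif key in CHOICES:
            if value not in CHOICES[key]:
                problems.append(f"{key}: expected one of {CHOICES[key]}")
        elif key in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[key]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not low <= value <= high:
                problems.append(f"{key}: expected a number in [{low}, {high}]")
    return problems
```

A removed v1 knob such as mmrLambda would be reported as an unknown setting, which is a convenient way to spot stale entries after upgrading.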


Storage

All memory data is stored in a local SQLite database with WAL mode. Since 0.17.7 the database is encrypted at rest with SQLCipher using a key kept in your macOS Keychain — the same key chat history, methods, and tool indexes use.

Location: ~/.osaurus/memory/memory.sqlite (SQLCipher)

Configuration: ~/.osaurus/config/memory.json (plaintext)

Vector index: ~/.osaurus/memory/vectura/<agentId>/ — partitioned per agent so one agent's vectors never collide with another's. The vector files themselves are not yet encrypted (see STORAGE.md → Limitations); they are rebuilt from the encrypted SQLite source on first read after migration.

The schema is versioned. The v1 → v2 migration (migrateToV5) carries forward your identity, episodes (renamed from conversation_summaries), and transcript (renamed from conversation_chunks). The noisy v1 working-memory entries, profile events, verification audit log, agent activity, embeddings cache, and graph tables are all dropped — pinned_facts rebuilds organically from new conversations.

The v6 migration adds three FTS5 mirror tables (fts_pinned, fts_episodes, fts_transcript) backed by triggers on the source tables. Tokenizer is unicode61 remove_diacritics 2, which folds accents and case so non-English queries don't miss obvious matches. Existing rows are backfilled in a single INSERT … SELECT pass, so the migration is one short transaction even for large memory databases.
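
The mirror-table pattern described above can be reproduced with Python's built-in sqlite3 module, provided the bundled SQLite was compiled with FTS5. The sketch is illustrative only: the single-column `episodes` table, the trigger names, and the use of an external-content FTS5 table are assumptions, not the actual Osaurus schema. Only the mirror table name, the tokenizer string, and the INSERT … SELECT backfill come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified source table with one pre-existing row (hypothetical schema).
CREATE TABLE episodes (id INTEGER PRIMARY KEY, summary TEXT NOT NULL);
INSERT INTO episodes (summary) VALUES ('Café meeting about Postgres');

-- FTS5 mirror using the tokenizer named in the text.
CREATE VIRTUAL TABLE fts_episodes USING fts5(
    summary,
    content='episodes',
    content_rowid='id',
    tokenize='unicode61 remove_diacritics 2'
);

-- Triggers keep the mirror in sync with the source table.
CREATE TRIGGER episodes_ai AFTER INSERT ON episodes BEGIN
    INSERT INTO fts_episodes(rowid, summary) VALUES (new.id, new.summary);
END;
CREATE TRIGGER episodes_ad AFTER DELETE ON episodes BEGIN
    INSERT INTO fts_episodes(fts_episodes, rowid, summary)
    VALUES ('delete', old.id, old.summary);
END;

-- Backfill existing rows in a single INSERT ... SELECT pass.
INSERT INTO fts_episodes(rowid, summary) SELECT id, summary FROM episodes;
""")

# A row inserted after the migration is indexed by the trigger.
conn.execute("INSERT INTO episodes (summary) VALUES ('Résumé review session')")


def search(query: str) -> list:
    """Return matching episode ids for an FTS5 MATCH expression."""
    rows = conn.execute(
        "SELECT rowid FROM fts_episodes WHERE fts_episodes MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]
```

Because the tokenizer folds case and diacritics, `search("cafe")` finds the backfilled "Café" row and `search("resume")` finds the trigger-indexed "Résumé" row, while `search("postg*")` shows the prefix syntax.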


Search and Retrieval

When VecturaKit is available, search uses hybrid BM25 + vector matching with MMR reranking. When it's not (e.g. the embedding model isn't downloaded yet), search falls back to FTS5 MATCH queries against the per-table mirror tables introduced in the v6 migration — same tokenization (Unicode-folded) and the same prefix/phrase syntax SQLite documents for FTS5. SQL LIKE is only used as a final fallback when an FTS query can't be sanitized into a valid match expression (e.g. all-punctuation input).

The MMR reranker uses 4-character word shingles for cheap content overlap — much faster than the v1 Jaccard-over-tokenized-strings approach.
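
As a rough picture of shingle-based MMR, the sketch below greedily picks results that are relevant but not redundant with what has already been chosen. Several things here are assumptions: "4-character shingles" is read as overlapping 4-character substrings, overlap is measured with Jaccard similarity over shingle sets, and the trade-off weight `lam=0.7` is an illustrative value, not a documented one. The function names are hypothetical.

```python
def shingles(text: str, k: int = 4) -> set:
    """Set of overlapping k-character substrings of the normalized text."""
    s = " ".join(text.lower().split())
    if len(s) <= k:
        return {s} if s else set()
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def overlap(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mmr(candidates: list, limit: int, lam: float = 0.7) -> list:
    """Greedy MMR over (text, relevance) pairs; returns the selected texts."""
    remaining = [(text, rel, shingles(text)) for text, rel in candidates]
    selected = []
    while remaining and len(selected) < limit:
        def score(item):
            _, rel, sh = item
            # Penalize similarity to the closest already-selected result.
            redundancy = max((overlap(sh, s[2]) for s in selected), default=0.0)
            return lam * rel - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _, _ in selected]
```

Given two almost identical facts and one unrelated fact, the second pick goes to the unrelated fact even though its raw relevance is lower, which is the behavior MMR is meant to provide.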


API Integration

Runtime Context Injection

Memory context is injected only by Osaurus-composed agent surfaces: app chat windows, POST /agents/{id}/run, and plugin host inference. Those paths run the relevance gate against the user's message, pick at most one memory section, and prepend it to the latest user message.

Strict POST /chat/completions requests do not inject Osaurus memory, agent prompts, skills, or tools. X-Osaurus-Agent-Id may associate persisted HTTP history with an agent/session, but it is not a memory injection switch.

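from openai import OpenAI

# Point the OpenAI SDK at the local Osaurus server.
client = OpenAI(
    base_url="http://127.0.0.1:1337/v1",
    api_key="osaurus",
)

# Strict /chat/completions request: the model sees only these messages.
# No Osaurus memory, agent prompt, skills, or tools are injected.
response = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "What did we talk about last week?"}],
)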

Memory Ingestion — POST /memory/ingest

Bulk-ingest conversation turns. Useful for seeding memory from existing chat logs, migrating from another system, or running benchmarks. Ingestion always flushes distillation immediately at the end of the batch — you don't have to wait for the debounce.

curl http://127.0.0.1:1337/memory/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "my-agent",
    "conversation_id": "session-1",
    "turns": [
      {"user": "Hi, my name is Alice", "assistant": "Hello Alice!"},
      {"user": "I work at Acme Corp", "assistant": "Got it."}
    ]
  }'

| Parameter | Type | Description |
|---|---|---|
| agent_id | string | Identifier for the agent whose memory is being populated |
| conversation_id | string | Identifier for the conversation session |
| turns | array | Array of turn objects, each with user and assistant fields |
| session_date | string | Optional ISO 8601 date for the whole batch |
| skip_extraction | bool | When true, only insert transcript rows; skip distillation |
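
The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl example and the parameter table; `build_ingest_request` and `ingest` are hypothetical helper names, and because the response format is not described here, the body is returned as raw text rather than parsed.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1337"  # same host and port as the curl example


def build_ingest_request(agent_id, conversation_id, turns,
                         session_date=None, skip_extraction=False):
    """Build a POST /memory/ingest request from (user, assistant) pairs."""
    payload = {
        "agent_id": agent_id,
        "conversation_id": conversation_id,
        "turns": [{"user": u, "assistant": a} for u, a in turns],
    }
    if session_date is not None:
        payload["session_date"] = session_date  # ISO 8601 date for the batch
    if skip_extraction:
        payload["skip_extraction"] = True       # transcript rows only
    return urllib.request.Request(
        f"{BASE_URL}/memory/ingest",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest(*args, **kwargs):
    """Send the request; returns (HTTP status, raw response body)."""
    request = build_ingest_request(*args, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `ingest("my-agent", "session-1", [("Hi, my name is Alice", "Hello Alice!")])` against a running server sends the first turn of the curl example; splitting request construction from sending makes the payload easy to inspect before anything goes over the wire.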

List Agents — GET /agents

Returns all configured agents with their pinned-fact counts. Use this to discover valid agent IDs.

See the API Guide for additional examples.


Tool: search_memory

Agents can search their own memory via the built-in search_memory(scope, query) tool. Three scopes:

| Scope | What it searches |
|---|---|
| pinned | High-salience facts |
| episodes | Per-session digests |
| transcript | Raw conversation excerpts |

The v1 scopes working, summaries, all, and graph are gone — working was subsumed by pinned, summaries was renamed to episodes, and all / graph are no longer exposed. The relevance gate already picks the right slice for context injection; the tool exists for explicit recall the agent decides it needs.

Opt-in per agent. The search_memory tool is gated by Agent.settings.searchMemoryEnabled (default off), surfaced as Memory Recall under Configure → Features. It is decoupled from the Memory toggle (disableMemory): Memory controls passive context injection and transcript recording, while Memory Recall controls only mid-session active lookups. An agent can read injected memory without exposing the recall tool, or vice versa.


Managing Memory

Memory View

Open the Management window (⌘ Shift M) → Memory to see:

  • Your identity (auto-derived content + manual overrides)
  • Pinned facts with salience bars and use counts
  • Episodes for the default agent
  • Per-agent counts
  • Processing statistics (total calls, success rate, average duration)
  • Database size
  • Run Consolidation Now button

Adding User Overrides

Identity overrides are explicit facts that always appear in context.

  1. Go to Memory → Your Overrides
  2. Click Add
  3. Enter a fact (e.g. "I prefer tabs over spaces" or "My company uses a monorepo")

Clearing Memory

The Memory view includes a danger zone for clearing all memory data. This removes identity, pinned facts, episodes, and transcript. The action is irreversible.

Sync / Run Consolidation

  • Sync Now drains pending signals immediately, distilling any sessions that haven't yet flushed.
  • Run Consolidation Now kicks off a one-shot pass of the consolidator (decay, merge, promote, evict, prune, purge).

Migration from v1

The v5 schema migration is automatic on first launch after an upgrade. It runs as pure SQL — no LLM calls — and:

  • Carries forward user_profile.content → identity.content
  • Carries forward user_edits → identity.overrides
  • Carries forward conversation_summaries → episodes (default salience = 0.5)
  • Carries forward conversation_chunks → transcript
  • Drops memory_entries (the noisy 7-type extractor output)
  • Drops profile_events, memory_events, agent_activity, embeddings
  • Drops the graph tables (entities, relationships)

pinned_facts starts empty and accrues organically as new sessions are distilled. The Vectura vector index is wiped and rebuilt lazily on first read.

Alongside the schema migration, the storage encryption migration runs once on first launch of 0.17.7+ and re-keys memory.sqlite (and every other Osaurus database) into SQLCipher. It's automatic and shows a brief overlay; details, key-rotation, and plaintext-export instructions live in STORAGE.md.