LLMs suffer from stylistic inertia in long roleplay sessions. Once a tone, pacing, or prose style is established over several turns, the model tends to perpetuate it regardless of narrative shifts. A lighthearted conversation that turns tragic will often retain the cadence and vocabulary of the earlier tone because the weight of prior context anchors the model's generation.
Static system prompts cannot solve this. The system prompt is written once and does not adapt to evolving scenes.
An agentic middleware layer sits between the user and the model. It intercepts each user message, runs a short analytical pass to "read the room," then dynamically assembles prompt directives that shape the model's writing before the actual roleplay generation happens.
The user never sees the agentic layer. The writer model doesn't know it's being directed. The result is a roleplay session that naturally adapts its style, tone, and pacing as the narrative evolves.
- Clear direction for Writer: Grounding the story + actively steering the writing style = better output
- Customizability: Customizable prompt injection that's automatically used by Director model
- Anti-slop: Get rid of overused words, phrases, and patterns often seen in LLM outputs
- Anti-repetition: Detect various types of repetition from outputs and surgically fix them
- Length Guard: Actively or passively protect from length degradation as context grows
- Super-regenerate: Normal regens may give samey outputs, ask for a different take
- Magic Rewrite: Rewrite the target message in a user-defined direction
- Compress History: Summarize chat context and move it to a new conversation
- Mobile-compatibility: UI for mobile devices
- Integrated TTS: Easy Text-to-speech that supports multiple providers
- Character Browser: Fetch character cards from various sites on the Internet
The system uses a three-pass architecture, with the agent and writer optionally being the same or different models:
- Director Pass - Tool-calling phase where the LLM selects moods, plot direction, and potentially rewrites user prompts
- Writer Pass - Story generation phase where the LLM writes the actual roleplay response
- Editor Pass - A ReAct loop - Self-audit for slop and length optimization phase. This is surgical, errors will be programmatically detected, the model only needs to write replacement for targeted sentences
In most local setups, the user doesn't have enough resource to load more than one model at a time. Single-Model Mode addresses this by using the same model for both writing and agentic tasks. KV cache is respected by design so prompt reprocessing is avoided.
For the best experience, use Dual-Model Mode. Some harnesses are dropped in this mode so the models should perform better.
For optimal KV cache reuse, the following will remain consistent across passes:
- The system prompt (character card, instructions, etc.) is identical across all passes
- Built once and reused forever
- Includes character description, scenario, example dialogue, and additional instructions
- The conversation history (previous messages) is identical across all passes
- Maintains exact same message content, attachments, and ordering
- The same tool definitions must be sent in each LLM call for kv cache reuse
- Inconsistent tool schemas break KV cache alignment
- Prioritize small models - if a feature fails half of the time on Gemma-4-26B4A, it probably doesn't belong here
- Only use agentic functionalities when absolutely needed - we will not have useless tools like
dice_roll - Algorithm-first - if something can be done with an algorithm, don't use LLMs. Avoid making LLMs eyeball for errors
- Keep agentic scope small to reduce hallucination, avoid giving agents too much freedom of choice
- Speed: Multiple passes will obviously have a longer time to final response
- Cost: Neligible cost increase, which comes naturally with multiple passes, somewhat alleviated by KV cache reuse strategy
- A model with solid tool/function calling capabilities (recommended: Gemma 4)
- OpenAI-compatible LLM inference backend API that supports prompt-caching
- Python 3.9+
Full documentation is at https://orbfrontend.github.io/Orb/
Check this out before opening a PR: https://github.com/OrbFrontend/Orb/blob/main/CONTRIBUTING.md
Ideas, help requests, and questions go here: https://github.com/OrbFrontend/Orb/discussions
