Skip to content

[BUG] Guardrail input redaction not persisted to AgentCoreMemorySessionManager — unredacted offensive content replayed on subsequent turns #2119

@mikeys

Description

@mikeys

Checks

  • I have updated to the latest minor and patch version of Strands
  • I have checked the documentation and this is not expected behavior
  • I have searched ./issues and there are no duplicates of my issue

Strands Version

1.28.0

Python Version

3.12.8

Operating System

macOS 14.3

Installation Method

pip

Steps to Reproduce

  1. Configure a Strands agent with Bedrock guardrails enabled (guardrail_redact_input=True, which is the default) and AgentCoreMemorySessionManager for session persistence.

  2. Start a conversation with a legitimate message (e.g., "Suggest a metadata schema for our library").

  3. Send a message that triggers the guardrail (e.g., sexually explicit content). The guardrail correctly blocks this and the agent returns the blockedInputMessaging text.

  4. Send a completely legitimate follow-up message (e.g., "I'd like to have a catalog of books").

  5. The guardrail blocks this legitimate message too, even though it contains no policy-violating content.

  6. The conversation is now permanently stuck — every subsequent message is blocked by the guardrail.

Minimal reproduction:

from strands import Agent
from strands.models.bedrock import BedrockModel
from bedrock_agentcore.memory.integrations.strands import (
    AgentCoreMemorySessionManager,
    AgentCoreMemoryConfig,
)

model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-20250514-v1:0",
    guardrail_id="YOUR_GUARDRAIL_ID",
    guardrail_version="DRAFT",
    # guardrail_redact_input=True  # this is the default
)

session_manager = AgentCoreMemorySessionManager(
    agentcore_memory_config=AgentCoreMemoryConfig(
        memory_id="YOUR_MEMORY_ID",
        actor_id="test_user",
        session_id="test_session",
    ),
)

agent = Agent(model=model, session_manager=session_manager)

# Turn 1: legitimate message — works fine
agent("Hello, help me organize my assets")

# Turn 2: offensive message — correctly blocked by guardrail
agent("inappropriate content here")

# Turn 3: legitimate message — INCORRECTLY blocked
# because the original unredacted offensive content from Turn 2
# was persisted to AgentCore Memory and is replayed into context
agent("Actually, I'd like to organize my photos by date")

Expected Behavior

When a guardrail blocks a message and guardrail_redact_input=True (the default):

  1. The offensive user message should be redacted both in-memory and in the persistent session store.
  2. On subsequent turns, the conversation history loaded from AgentCore Memory should contain only the redacted version ("[User input redacted.]"), not the original offensive text.
  3. Follow-up legitimate messages should not be blocked by the guardrail.

Actual Behavior

The redaction only happens in-memory but is never persisted to AgentCore Memory. Here's the chain of events:

  1. When a guardrail intervenes, Agent._event_stream_handler calls:

    self.messages[-1]["content"] = self._redact_user_content(...)
    if self._session_manager:
        self._session_manager.redact_latest_message(self.messages[-1], self)
  2. RepositorySessionManager.redact_latest_message (line 81-93 in repository_session_manager.py) sets latest_agent_message.redact_message = redact_message and calls self.session_repository.update_message(...).

  3. The problem: AgentCoreMemorySessionManager.update_message (line 490-511 in the bedrock-agentcore package) is effectively a no-op — it only logs at DEBUG level:

    def update_message(self, session_id, agent_id, session_message, **kwargs):
        # ...validation...
        logger.debug(
            "Message update requested for message: %s (AgentCore Memory doesn't support updates)",
            {session_message.message_id},
        )
  4. On the next turn, RepositorySessionManager.initialize loads messages from AgentCore Memory via list_messagesevents_to_messages. Since the redaction was never persisted, the original unredacted offensive content is loaded back into agent.messages.

  5. Even with guardrail_latest_message=True, the full conversation history (including the unredacted offensive text) is sent to Bedrock as context.

  6. Without guardrail_latest_message=True (the default), the guardrail evaluates ALL messages, and the unredacted offensive message causes every subsequent turn to be blocked.

Additional Context

  • The update_message no-op in AgentCoreMemorySessionManager is documented with the comment "AgentCore Memory doesn't support updates", which is technically true — AgentCore Memory is an append-only event store. However, the SDK's redaction mechanism relies on update_message actually persisting the change.
  • The SessionMessage.to_message() method correctly returns redact_message if set, but since update_message is a no-op, the redact_message field is never stored, and when messages are reloaded from AgentCore Memory they have no redact_message set.
  • Issue [BUG] guardrail_redact_input override ltm_msg instead of the last user message #1639 describes a related but different problem (wrong message being redacted when LTM context is present). This issue is about the redaction never being persisted regardless of which message is targeted.
  • We confirmed this behavior with diagnostic hooks that log the full conversation history sent to the model — the original offensive text is visible in history on subsequent turns.

Possible Solution

Since AgentCore Memory is append-only and doesn't support in-place updates, the redaction strategy needs to change. Some options:

  1. Deferred persistence: Buffer the user message in the session manager and only persist it after the model responds. If a guardrail intervenes and redacts the message, persist the redacted version instead. We implemented this as a workaround (a GuardrailSafeSessionManager wrapper) and confirmed it resolves the issue.

  2. Pre-flight guardrail check: Before appending the user message to the session, call the Bedrock ApplyGuardrail API to check the content. If it violates the policy, persist only the redacted version.

Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions