
[PERFORMANCE]: Optimize llm-guard plugin #1958

@araujof

Description


The LLM Guard plugin (plugins/external/llmguard/llmguardplugin) applies ML-based guardrails using the LLMGuard library to scan prompts for security threats, injections, and other risks. This analysis identifies blocking I/O operations, CPU-intensive computations in the async path, and inefficient algorithms that significantly impact request latency.

Critical Performance Issues

The LLM Guard plugin has severe performance bottlenecks primarily caused by:

  1. Synchronous Redis operations blocking on every cache access
  2. Blocking ML inference in LLMGuard scanner library
  3. Sequential scanner execution instead of parallel
  4. CPU-intensive operations (pickle, Levenshtein) in async path

The plugin executes ML inference and I/O synchronously in the request path, which is fundamentally incompatible with an async/await architecture.

Without these fixes, the plugin will:

  • Block the event loop on every request
  • Prevent concurrent request processing
  • Create severe latency spikes (500ms+)
  • Limit throughput to sequential processing

1. Blocking Redis Operations in Hot Path

Location: cache.py:62, 67, 83, 98

Severity: CRITICAL

Issue: All Redis operations use the synchronous redis.Redis client, blocking the async event loop on every cache operation:

self.cache = redis.Redis(host=redis_host, port=redis_port)  # Sync client

def update_cache(self, key: int = None, value: tuple = None) -> tuple[bool]:
    serialized_obj = pickle.dumps(value)  # Blocking serialization
    success_set = self.cache.set(key, serialized_obj)  # Blocking network I/O
    success_expiry = self.cache.expire(key, self.cache_ttl)  # Blocking network I/O
    return success_set, success_expiry

def retrieve_cache(self, key: int = None) -> tuple:
    value = self.cache.get(key)  # Blocking network I/O
    if value:
        retrieved_obj = pickle.loads(value)  # Blocking deserialization
        return retrieved_obj

Impact:

  • Every cache operation blocks the event loop (2 operations per update: set + expire)
  • Network latency to Redis directly adds to request latency
  • Under load, creates severe bottleneck as all requests serialize on Redis operations
  • Used in both pre-hook (lines 177, 225) and post-hook paths

Recommendation:

  • Use redis.asyncio.Redis for async Redis operations
  • Implement connection pooling with redis.asyncio.ConnectionPool
  • Use pipelining to batch set + expire operations into single round-trip
  • Consider installing the optional hiredis parser (redis[hiredis]) for faster protocol parsing; the standalone aioredis package has been merged into redis-py as redis.asyncio
  • Make cache operations optional/configurable for low-latency scenarios

Example Fix:

import pickle

import redis.asyncio as aioredis

class CacheTTLDict:
    def __init__(self, redis_host: str, redis_port: int, ttl: int = 0):
        self.cache_ttl = ttl
        self.cache = aioredis.from_url(f"redis://{redis_host}:{redis_port}")

    async def update_cache(self, key: int, value: tuple) -> tuple[bool, bool]:
        serialized_obj = pickle.dumps(value)  # Still sync, but fast for small vaults
        # Pipeline commands are buffered client-side; execute() sends one round-trip
        pipe = self.cache.pipeline()
        pipe.set(key, serialized_obj)
        pipe.expire(key, self.cache_ttl)
        results = await pipe.execute()
        return results[0], results[1]

2. Blocking Pickle Serialization/Deserialization

Location: cache.py:60, 85

Severity: HIGH

Issue: pickle.dumps() and pickle.loads() are synchronous CPU-intensive operations in the async path:

serialized_obj = pickle.dumps(value)  # Blocking CPU work
retrieved_obj = pickle.loads(value)   # Blocking CPU work

Impact:

  • Serialization blocks event loop proportional to vault tuple size
  • Large vaults (many anonymized entities) cause significant blocking
  • No alternative fast path for small objects

Recommendation:

  • Use asyncio.to_thread() for CPU-intensive pickle operations on large objects
  • Consider faster serialization formats (msgpack, orjson) for structured data
  • Implement size threshold: small objects serialize inline, large objects offload to thread
  • Cache serialized representations if vault doesn't change
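
A minimal sketch of the size-threshold approach; the helper names (dump_vault, load_vault) and the cut-off values are illustrative assumptions that should be tuned from profiling:

import asyncio
import pickle

_INLINE_DUMP_ITEMS = 50         # small vaults: pickling inline is cheaper than a thread hop
_INLINE_LOAD_BYTES = 32 * 1024  # small blobs: unpickling inline is cheaper than a thread hop

async def dump_vault(value: tuple) -> bytes:
    if len(value) <= _INLINE_DUMP_ITEMS:
        return pickle.dumps(value)                        # fast path on the event loop
    return await asyncio.to_thread(pickle.dumps, value)   # offload large vaults

async def load_vault(blob: bytes) -> tuple:
    if len(blob) <= _INLINE_LOAD_BYTES:
        return pickle.loads(blob)
    return await asyncio.to_thread(pickle.loads, blob)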

3. LLMGuard Scanner Calls Block Event Loop

Location: llmguard.py:234, 259, 264, 285, 305

Severity: CRITICAL

Issue: All LLMGuard library scanner calls are synchronous and potentially CPU-intensive (ML model inference):

# Input filters - synchronous ML inference
for scanner in self.scanners["input"]["filters"]:
    sanitized_prompt, is_valid, risk_score = scanner.scan(input_prompt)  # BLOCKING

# Input sanitizers - synchronous transformation
result = scan_prompt(self.scanners["input"]["sanitizers"], input_prompt)  # BLOCKING

# Vault leak detection - multiple synchronous operations
sanitized_output_de, _, _ = scanner.scan(result[0], input_prompt)  # BLOCKING
input_anonymize_score = word_wise_levenshtein_distance(input_prompt, result[0])  # BLOCKING
input_deanonymize_score = word_wise_levenshtein_distance(result[0], sanitized_output_de)  # BLOCKING

# Output operations
for scanner in self.scanners["output"]["filters"]:
    sanitized_prompt, is_valid, risk_score = scanner.scan(original_input, model_response)  # BLOCKING

Impact:

  • ML model inference can take 10-500ms per scanner
  • Multiple scanners compound the latency (N scanners = N × inference time)
  • Blocks entire event loop preventing other requests from processing
  • CPU utilization spikes block other async tasks
  • No parallelization of independent scanners

Recommendation:

  • Run scanner operations in thread pool using asyncio.to_thread(scanner.scan, ...)
  • Execute independent scanners in parallel using asyncio.gather()
  • Consider batching multiple inputs to scanners for better GPU utilization
  • Add timeout protection to prevent runaway scanner operations
  • Implement result caching for identical inputs (hash-based lookup)
  • Profile which scanners are most expensive and prioritize optimization
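
A sketch of the thread-pool and timeout recommendations; the helper name and the 2-second default are assumptions, not values from the plugin:

import asyncio

async def _scan_with_timeout(self, scanner, prompt: str, timeout: float = 2.0):
    # Run the synchronous ML scan in a worker thread so the event loop stays free,
    # and stop waiting after `timeout` seconds. Note: wait_for does not kill the
    # worker thread; it only unblocks the caller.
    try:
        return await asyncio.wait_for(asyncio.to_thread(scanner.scan, prompt), timeout)
    except asyncio.TimeoutError:
        # Fail closed here; whether to fail open or closed is a policy decision
        return prompt, False, 1.0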

4. Expensive Levenshtein Distance Calculation

Location: policy.py:67-95, used in llmguard.py:265-266

Severity: MEDIUM-HIGH

Issue: Word-wise Levenshtein distance has O(n×m) complexity and runs synchronously in hot path:

def word_wise_levenshtein_distance(sentence1, sentence2):
    words1 = sentence1.split()
    words2 = sentence2.split()
    n, m = len(words1), len(words2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # O(n×m) memory
    for i in range(n + 1):
        dp[i][0] = i  # base case: i deletions against an empty second sentence
    for j in range(m + 1):
        dp[0][j] = j  # base case: j insertions into an empty first sentence

    for i in range(1, n + 1):
        for j in range(1, m + 1):  # O(n×m) computation
            if words1[i - 1] == words2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1
    return dp[n][m]

Called twice per input when vault leak detection is enabled (lines 265-266).

Impact:

  • For 100-word prompts: 100×100 = 10,000 operations × 2 calls = 20,000 operations
  • Blocks event loop during computation
  • Creates large temporary 2D arrays
  • Called on every anonymized input with vault leak detection enabled

Recommendation:

  • Run in thread pool using asyncio.to_thread() for long prompts
  • Use optimized C-extension libraries like python-Levenshtein or rapidfuzz
  • Implement quick-reject heuristics (length difference threshold)
  • Cache results for identical prompt pairs
  • Consider approximate string matching for acceptable accuracy
  • Only enable for critical security scenarios

Performance Comparison:

# Current: Pure Python O(n×m)
distance = word_wise_levenshtein_distance(s1, s2)  # ~10ms for 100 words

# Optimized: rapidfuzz C extension
from rapidfuzz.distance import Levenshtein
distance = Levenshtein.distance(s1.split(), s2.split())  # ~0.1ms for 100 words

5. Policy Evaluation Using eval()

Location: policy.py:62

Severity: MEDIUM (Security: HIGH)

Issue: Uses eval() to evaluate policy expressions on every scan:

return eval(compile(tree, "<string>", "eval"), {}, policy_variables)

Impact:

  • AST parsing and compilation on every evaluation
  • eval() overhead even with sanitized expressions
  • Executed on every filter result (input and output)

Security Concern: While AST validation helps, eval() is inherently risky

Recommendation:

  • Pre-compile policy expressions during initialization and cache compiled code
  • Use a dedicated expression evaluator (e.g., simpleeval library)
  • For common policies (AND/OR of all filters), use fast path without eval
  • Consider declarative policy format that compiles to Python functions

Example Optimization:

import ast

class GuardrailPolicy:
    def __init__(self):
        self._compiled_cache = {}

    def evaluate(self, policy: str, scan_result: dict):
        if policy not in self._compiled_cache:
            tree = ast.parse(policy, mode="eval")
            # ... validation ...
            self._compiled_cache[policy] = compile(tree, "<string>", "eval")

        policy_variables = {key: value["is_valid"] for key, value in scan_result.items()}
        return eval(self._compiled_cache[policy], {}, policy_variables)

6. Sequential Scanner Execution (No Parallelization)

Location: llmguard.py:233-241, 283-291

Severity: HIGH

Issue: Scanners execute sequentially even though they're independent:

result = {}
for scanner in self.scanners["input"]["filters"]:  # Sequential execution
    sanitized_prompt, is_valid, risk_score = scanner.scan(input_prompt)
    scanner_name = type(scanner).__name__
    result[scanner_name] = {...}

Impact:

  • Total latency = sum of all scanner latencies
  • 5 scanners × 50ms each = 250ms total (could be 50ms if parallel)
  • Wastes CPU cores and GPU resources
  • No benefit from async execution

Recommendation:

  • Execute independent scanners in parallel using asyncio.gather()
  • Run each scanner in thread pool concurrently
  • Aggregate results after all complete
  • Handle failures gracefully (some scanners may fail)

Example Fix:

import asyncio
import logging

logger = logging.getLogger(__name__)

async def _apply_input_filters(self, input_prompt):
    async def scan_async(scanner):
        # Run the synchronous ML scan in a worker thread so the loop stays free
        return await asyncio.to_thread(scanner.scan, input_prompt)

    tasks = [scan_async(scanner) for scanner in self.scanners["input"]["filters"]]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    result = {}
    for scanner, scan_result in zip(self.scanners["input"]["filters"], results):
        if isinstance(scan_result, Exception):
            logger.error("Scanner %s failed: %s", type(scanner).__name__, scan_result)
            continue
        sanitized_prompt, is_valid, risk_score = scan_result
        result[type(scanner).__name__] = {"is_valid": is_valid, "risk_score": risk_score}
    return result

7. Inefficient Context Updates with Nested Loops

Location: plugin.py:75-113

Severity: MEDIUM

Issue: Complex nested conditional logic and loops for context updates:

def update_context(context):
    plugin_name = self.__class__.__name__
    if plugin_name not in context.state[self.guardrails_context_key]:
        context.state[self.guardrails_context_key][plugin_name] = {}
    if key not in context.state[self.guardrails_context_key][plugin_name]:
        context.state[self.guardrails_context_key][plugin_name][key] = value
    else:
        if isinstance(value, dict):
            for k, v in value.items():  # Nested loop 1
                if k not in context.state[self.guardrails_context_key][plugin_name][key]:
                    context.state[self.guardrails_context_key][plugin_name][key][k] = v
                else:
                    if isinstance(v, dict):
                        for k_sub, v_sub in v.items():  # Nested loop 2
                            context.state[self.guardrails_context_key][plugin_name][key][k][k_sub] = v_sub

Impact:

  • Multiple dictionary lookups on hot path
  • Nested loops for complex values
  • Called multiple times per request (lines 133, 145, 164, 181, 231, 244)
  • Creates/updates context even when set_guardrails_context=False (logic checked after update)

Recommendation:

  • Check self.lgconfig.set_guardrails_context early and return
  • Use setdefault() to reduce lookups
  • Consider flattened key structure instead of nested dicts
  • Cache plugin_name and guardrails_context_key access
  • Only update context when actually needed
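
A possible shape for the simplified update, assuming the same context layout; the helper name is hypothetical and the merge semantics for nested results may need adjusting to match current behaviour:

def _update_guardrails_context(self, context, key, value) -> None:
    # Early exit: skip all dictionary work when the context feature is disabled
    if not self.lgconfig.set_guardrails_context:
        return
    plugin_state = (
        context.state.setdefault(self.guardrails_context_key, {})
                     .setdefault(self.__class__.__name__, {})
    )
    if isinstance(value, dict):
        # Shallow merge replaces the hand-rolled nested loops
        plugin_state.setdefault(key, {}).update(value)
    else:
        plugin_state.setdefault(key, value)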

8. Vault Recreation Overhead

Location: llmguard.py:49-69, 254-258

Severity: MEDIUM

Issue: Checks vault expiry on every sanitizer call and recreates vault:

def _create_new_vault_on_expiry(self, vault) -> bool:
    logger.info(f"Vault creation time {vault.creation_time}")  # Unnecessary logging
    logger.info(f"Vault ttl {self.vault_ttl}")
    if datetime.now() - vault.creation_time > timedelta(seconds=self.vault_ttl):  # On every call
        del vault  # Manual deletion
        logger.info("Vault successfully deleted after expiry")
        self._update_input_sanitizers()  # Reinitializes scanners
        return True
    return False

# Called on every sanitizer application:
vault, _, _ = self._retreive_vault()
vault_update_status = self._create_new_vault_on_expiry(vault)  # Every time!

Impact:

  • datetime.now() and timedelta operations on every call
  • Unnecessary vault retrieval when TTL=0 (disabled)
  • Scanner reinitialization overhead when vault expires
  • Excessive logging in hot path

Recommendation:

  • Check TTL once during init; skip expiry check if TTL=0
  • Cache expiry timestamp instead of recalculating each time
  • Use Redis TTL for vault expiry instead of application logic
  • Reduce logging verbosity (debug level, not info)
  • Don't reinitialize scanners, just update vault reference
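
A sketch of a cheaper expiry check, assuming the vault object exposes creation_time and that _update_input_sanitizers() remains the refresh hook; the cached _vault_deadline attribute is an assumption:

import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

def _vault_expired(self, vault) -> bool:
    # TTL of 0 means expiry is disabled: skip all date arithmetic entirely
    if self.vault_ttl <= 0:
        return False
    # Compute the deadline once per vault instead of a timedelta on every call
    if getattr(self, "_vault_deadline", None) is None:
        self._vault_deadline = vault.creation_time + timedelta(seconds=self.vault_ttl)
    if datetime.now() < self._vault_deadline:
        return False
    logger.debug("Vault expired; refreshing input sanitizers")
    self._vault_deadline = None
    self._update_input_sanitizers()  # existing refresh path in the plugin
    return True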

9. Repeated Scanner Initialization

Location: llmguard.py:162-220

Severity: MEDIUM

Issue: Scanner initialization in __init_scanners() uses get_scanner_by_name() which may be expensive:

for filter_name in policy_filter_names:
    self.scanners["input"]["filters"].append(
        input_scanners.get_scanner_by_name(filter_name, self.lgconfig.input.filters[filter_name])
    )  # May load ML models, download resources

Impact:

  • Happens during plugin initialization (not hot path)
  • ML model loading can take seconds
  • Blocks plugin startup
  • No lazy loading or async initialization

Recommendation:

  • Make initialize() async and await scanner initialization
  • Load scanners lazily on first use
  • Warm up scanners concurrently during startup
  • Cache loaded models across plugin instances if possible
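
A sketch of concurrent warm-up during an async initialize(), assuming the same input_scanners helper and lgconfig layout that __init_scanners() already uses:

import asyncio

async def initialize(self) -> None:
    # Load every configured filter in parallel worker threads; model loading
    # and downloads stay synchronous inside each thread.
    filter_names = list(self.lgconfig.input.filters)

    def load(name):
        # Same helper the plugin already calls in __init_scanners()
        return input_scanners.get_scanner_by_name(name, self.lgconfig.input.filters[name])

    loaded = await asyncio.gather(*(asyncio.to_thread(load, n) for n in filter_names))
    self.scanners["input"]["filters"] = list(loaded)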

Moderate Performance Issues

10. Excessive Logging in Hot Path

Location: Throughout plugin.py and llmguard.py

Severity: LOW-MEDIUM

Issue: Info-level logging on every operation:

logger.info(f"Processing payload {payload}")  # Line 125
logger.info(f"Applying input guardrail filters on {payload.args[key]}")  # Line 138
logger.info(f"Result of input guardrail filters: {result}")  # Line 141
logger.info(f"Result of policy decision: {decision}")  # Line 143
# ... many more

Impact:

  • String formatting overhead even if logging is disabled
  • I/O operations if logging to file
  • Potential sensitive data leakage in logs

Recommendation:

  • Use debug level for detailed operational logs
  • Use lazy evaluation: logger.debug("Result: %s", result) instead of f-strings
  • Guard expensive logging with if logger.isEnabledFor(logging.DEBUG):
  • Avoid logging payload/result contents in production
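
For example (the variables stand in for the payload and scan result handled in the hooks):

import logging

logger = logging.getLogger(__name__)
payload = {"args": {"input": "example prompt"}}    # stand-in for the hook payload
result = {"PromptInjection": {"is_valid": True}}   # stand-in for a scan result

# Lazy formatting: the message string is only built if DEBUG is actually enabled
logger.debug("Result of input guardrail filters: %s", result)

# Guard genuinely expensive log statements explicitly
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Processing payload %s", payload)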

11. No Result Caching for Repeated Inputs

Location: All scanner methods in llmguard.py

Severity: MEDIUM

Issue: Same prompts scanned repeatedly without caching:

result = self.llmguard_instance._apply_input_filters(payload.args[key])
# No cache lookup before expensive scanning

Impact:

  • Identical prompts scanned multiple times
  • Wastes CPU/GPU resources
  • Increases latency unnecessarily
  • Common in batch processing or similar queries

Recommendation:

  • Implement LRU cache for scan results keyed by (prompt_hash, scanner_config)
  • Use TTL-based invalidation (e.g., 5 minutes)
  • Consider cache size limits (e.g., 10,000 entries)
  • Make caching configurable

Example:

from functools import lru_cache
import hashlib

def _cache_key(self, prompt: str, scanner_type: str) -> str:
    return f"{scanner_type}:{hashlib.sha256(prompt.encode()).hexdigest()}"

@lru_cache(maxsize=10000)
def _apply_input_filters_cached(self, prompt: str) -> dict:
    # lru_cache gives a size-bounded cache; TTL-based invalidation would need a
    # purpose-built cache (e.g. cachetools.TTLCache) keyed on _cache_key()
    return self._apply_input_filters(prompt)

12. Repeated Context State Checks

Location: plugin.py:206-219

Severity: LOW-MEDIUM

Issue: Multiple redundant checks for context keys:

if self.guardrails_context_key in context.state:
    original_prompt = context.state[self.guardrails_context_key]["original_prompt"] if "original_prompt" in context.state[self.guardrails_context_key] else ""
    vault_id = context.state[self.guardrails_context_key]["vault_cache_id"] if "vault_cache_id" in context.state[self.guardrails_context_key] else None
else:
    context.state[self.guardrails_context_key] = {}
if self.guardrails_context_key in context.global_context.state:  # Duplicate check
    # Same logic repeated for global context

Impact:

  • Multiple dictionary lookups for same keys
  • Repeated checks in both local and global context
  • Code duplication

Recommendation:

  • Use .get() with defaults: context.state.get(key, {}).get("original_prompt", "")
  • Cache context dictionary reference
  • Extract common logic to helper method
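
A small helper along these lines would remove the duplication (the method name is hypothetical):

def _read_guardrails_state(self, state: dict) -> tuple:
    # One lookup for the guardrails dict, with defaults for missing keys
    guardrails = state.setdefault(self.guardrails_context_key, {})
    return guardrails.get("original_prompt", ""), guardrails.get("vault_cache_id")

# Same helper serves both the request context and the global context
original_prompt, vault_id = self._read_guardrails_state(context.state)
_, global_vault_id = self._read_guardrails_state(context.global_context.state)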

13. Inefficient Vault Retrieval

Location: llmguard.py:83-104

Severity: LOW

Issue: Loops through all sanitizers to find vault:

length = len(self.scanners["input"]["sanitizers"])
for i in range(length):  # Linear search
    scanner_name = type(self.scanners["input"]["sanitizers"][i]).__name__
    if scanner_name in sanitizer_names:
        # ... access vault

Impact:

  • Linear search through scanners on every call
  • Repeated type introspection: type(scanner).__name__
  • Variable i used outside loop (line 104) - potential bug

Recommendation:

  • Build scanner name → scanner index mapping during initialization
  • Cache vault reference instead of retrieving each time
  • Fix potential bug where i may be undefined if loop doesn't execute
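
A sketch of the name-to-scanner mapping built once at initialization (helper names are assumptions):

def _build_sanitizer_index(self) -> None:
    # Computed once, right after __init_scanners() populates the scanner lists
    self._sanitizers_by_name = {
        type(scanner).__name__: scanner
        for scanner in self.scanners["input"]["sanitizers"]
    }

def _get_sanitizer(self, name: str):
    # O(1) lookup replaces the per-call linear search and type introspection
    return self._sanitizers_by_name.get(name)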

14. Exception Handling Without Specificity

Location: llmguard.py:102, 120, 138, 168, 187, 197, 209

Severity: LOW

Issue: Broad exception catching that swallows errors:

except Exception as e:
    logger.error(f"Error retrieving scanner {scanner_name}: {e}")

Impact:

  • Hides bugs and makes debugging difficult
  • Continues execution with partial failures
  • No error propagation to caller

Recommendation:

  • Catch specific exceptions (ValueError, KeyError, etc.)
  • Propagate critical errors to caller
  • Use try-except only for expected error conditions
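
Using the cache layer as an illustration, catch only the failure modes that are actually expected and let everything else propagate:

import logging
import pickle
import redis

logger = logging.getLogger(__name__)

def retrieve_cache(self, key):
    try:
        value = self.cache.get(key)
    except redis.RedisError as e:
        # Expected operational failure: Redis unreachable or timing out
        logger.error("Redis lookup failed for key %s: %s", key, e)
        raise
    if value is None:
        return None
    return pickle.loads(value)  # an unpickling error here is a bug worth surfacing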

Minor Performance Considerations

15. String Operations in Hot Path

Location: plugin.py:149, 168, 248

Issue: String formatting for error messages created unconditionally:

description="{threat} detected in the prompt".format(threat=list(decision[2].keys())[0])

Impact: Minimal, but creates temporary strings and lists

Recommendation: Pre-format common error messages

16. Type Introspection for Scanner Names

Location: llmguard.py:235, 286

Issue: type(scanner).__name__ called for every scanner

Impact: Small overhead, but could be cached

Recommendation: Store scanner name as attribute during initialization

17. Manual Deletion with del

Location: llmguard.py:64, 116

Issue: Explicit del doesn't guarantee immediate memory release

Recommendation: Rely on Python's garbage collection; remove del statements

Architectural Recommendations

1. Async-First Design

Convert entire plugin to async/await pattern:

  • Redis operations → async
  • Scanner calls → run in thread pool
  • Pickle operations → offload for large objects
  • Context updates → async if needed

2. Scanner Execution Pipeline

Implement efficient scanner pipeline:

  1. Cache lookup (fast path)
  2. Parallel independent scanner execution
  3. Result aggregation
  4. Policy evaluation (pre-compiled)
  5. Cache storage
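
A compact sketch of that pipeline, combining the fixes above; the _scan_cache attribute, the policy object, and lgconfig.input.policy are illustrative names rather than the plugin's current API:

import asyncio
import hashlib

async def scan_input(self, prompt: str):
    # 1. Cache lookup: identical prompts skip the scanners entirely
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    if cache_key in self._scan_cache:
        return self._scan_cache[cache_key]

    # 2. Parallel execution of independent filters in worker threads
    filters = self.scanners["input"]["filters"]
    raw = await asyncio.gather(
        *(asyncio.to_thread(scanner.scan, prompt) for scanner in filters),
        return_exceptions=True,
    )

    # 3. Result aggregation, skipping scanners that raised
    result = {}
    for scanner, outcome in zip(filters, raw):
        if isinstance(outcome, Exception):
            continue
        _, is_valid, risk_score = outcome
        result[type(scanner).__name__] = {"is_valid": is_valid, "risk_score": risk_score}

    # 4. Policy evaluation against a pre-compiled expression
    decision = self.policy.evaluate(self.lgconfig.input.policy, result)

    # 5. Cache storage for subsequent identical prompts
    self._scan_cache[cache_key] = (decision, result)
    return decision, result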

3. Lazy Initialization

  • Load scanners on first use, not during __init__
  • Defer vault creation until needed
  • Initialize Redis connection pool asynchronously

4. Observability & Monitoring

  • Add metrics for scanner execution time
  • Track cache hit rates
  • Monitor vault expiry and recreation
  • Profile ML inference latency
  • Add timeout alerts

Performance Testing Recommendations

Load Testing Scenarios

  1. Baseline: Single scanner, simple prompts
  2. Multiple Scanners: 5+ scanners, typical prompts
  3. Large Prompts: 1000+ word prompts with vault leak detection
  4. Cache Behavior: Repeated vs unique prompts
  5. Vault Expiry: Performance during vault recreation

Metrics to Track

  • End-to-end latency: Pre-hook and post-hook separately
  • Scanner latency: Per-scanner breakdown
  • Redis latency: Get/set/pipeline operations
  • Cache hit rate: For result caching
  • CPU utilization: During ML inference
  • Memory usage: Vault size, scanner model memory

Profiling Tools

  • cProfile: Identify expensive functions
  • py-spy: Low-overhead async profiling
  • memory_profiler: Track memory allocations
  • Redis SLOWLOG: Identify slow cache operations
  • asyncio debug mode: Detect blocking operations

Implementation Priority

Phase 1 - Critical

  1. Async Redis operations - Single biggest bottleneck
  2. Parallelize scanner execution - 3-5x speedup potential
  3. Offload ML inference to threads - Prevent event loop blocking
  4. Fix pickle blocking - Use async for large objects

Phase 2 - High Impact

  1. Optimize Levenshtein calculation (use C extension)
  2. Implement scan result caching
  3. Pre-compile policy expressions
  4. Reduce context update overhead
  5. Optimize vault expiry checking

Phase 3 - Incremental (Future)

  1. Lazy scanner initialization
  2. Reduce logging overhead
  3. Optimize vault retrieval
  4. Improve error handling specificity
  5. Add comprehensive metrics

Related Files

  • plugins/external/llmguard/llmguardplugin/plugin.py - Main plugin implementation
  • plugins/external/llmguard/llmguardplugin/llmguard.py - LLMGuard wrapper and scanner logic
  • plugins/external/llmguard/llmguardplugin/cache.py - Redis caching layer
  • plugins/external/llmguard/llmguardplugin/policy.py - Policy evaluation and utilities
  • plugins/external/llmguard/llmguardplugin/schema.py - Configuration schema

Labels

bug, performance, plugins
