
[PERFORMANCE]: Optimize llm-guard plugin #1958

@araujof

Description


The LLM Guard plugin (plugins/external/llmguard/llmguardplugin) applies ML-based guardrails using the LLMGuard library to scan prompts for security threats, injections, and other risks. This analysis identifies blocking I/O operations, CPU-intensive computations in the async path, and inefficient algorithms that significantly impact request latency.

Critical Performance Issues

The LLM Guard plugin has severe performance bottlenecks primarily caused by:

  1. Synchronous Redis operations blocking on every cache access
  2. Blocking ML inference in LLMGuard scanner library
  3. Sequential scanner execution instead of parallel
  4. CPU-intensive operations (pickle, Levenshtein) in async path

The plugin executes ML inference and I/O synchronously in the request path, which is fundamentally incompatible with an async/await architecture.

Without these fixes, the plugin will:

  • Block the event loop on every request
  • Prevent concurrent request processing
  • Create severe latency spikes (500ms+)
  • Limit throughput to sequential processing

1. Blocking Redis Operations in Hot Path

Location: cache.py:62, 67, 83, 98

Severity: CRITICAL

Issue: All Redis operations use the synchronous redis.Redis client, blocking the async event loop on every cache operation:

self.cache = redis.Redis(host=redis_host, port=redis_port)  # Sync client

def update_cache(self, key: int = None, value: tuple = None) -> tuple[bool]:
    serialized_obj = pickle.dumps(value)  # Blocking serialization
    success_set = self.cache.set(key, serialized_obj)  # Blocking network I/O
    success_expiry = self.cache.expire(key, self.cache_ttl)  # Blocking network I/O
    return success_set, success_expiry

def retrieve_cache(self, key: int = None) -> tuple:
    value = self.cache.get(key)  # Blocking network I/O
    if value:
        retrieved_obj = pickle.loads(value)  # Blocking deserialization
        return retrieved_obj

Impact:

  • Every cache operation blocks the event loop (2 operations per update: set + expire)
  • Network latency to Redis directly adds to request latency
  • Under load, creates severe bottleneck as all requests serialize on Redis operations
  • Used in both pre-hook (lines 177, 225) and post-hook paths

Recommendation:

  • Use redis.asyncio.Redis for async Redis operations
  • Implement connection pooling with redis.asyncio.ConnectionPool
  • Use pipelining to batch set + expire operations into single round-trip
  • Consider installing the optional hiredis parser (redis[hiredis]) for faster protocol parsing; the standalone aioredis package has been merged into redis-py as redis.asyncio
  • Make cache operations optional/configurable for low-latency scenarios

Example Fix:

import pickle

import redis.asyncio as aioredis

class CacheTTLDict:
    def __init__(self, redis_host: str, redis_port: int, ttl: int = 0):
        self.cache_ttl = ttl
        self.cache = aioredis.from_url(f"redis://{redis_host}:{redis_port}")

    async def update_cache(self, key: int, value: tuple) -> tuple[bool, bool]:
        serialized_obj = pickle.dumps(value)  # Still sync, but fast for small vaults
        # Pipeline commands are buffered client-side; execute() sends one round-trip
        pipe = self.cache.pipeline()
        pipe.set(key, serialized_obj)
        pipe.expire(key, self.cache_ttl)
        results = await pipe.execute()
        return results[0], results[1]

2. Blocking Pickle Serialization/Deserialization

Location: cache.py:60, 85

Severity: HIGH

Issue: pickle.dumps() and pickle.loads() are synchronous CPU-intensive operations in the async path:

serialized_obj = pickle.dumps(value)  # Blocking CPU work
retrieved_obj = pickle.loads(value)   # Blocking CPU work

Impact:

  • Serialization blocks event loop proportional to vault tuple size
  • Large vaults (many anonymized entities) cause significant blocking
  • No alternative fast path for small objects

Recommendation:

  • Use asyncio.to_thread() for CPU-intensive pickle operations on large objects
  • Consider faster serialization formats (msgpack, orjson) for structured data
  • Implement size threshold: small objects serialize inline, large objects offload to thread
  • Cache serialized representations if vault doesn't change
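
A minimal sketch of the size-threshold approach; the helper names (dump_vault, load_vault) and the cut-off values are illustrative assumptions that should be tuned from profiling:

import asyncio
import pickle

_INLINE_DUMP_ITEMS = 50         # small vaults: pickling inline is cheaper than a thread hop
_INLINE_LOAD_BYTES = 32 * 1024  # small blobs: unpickling inline is cheaper than a thread hop

async def dump_vault(value: tuple) -> bytes:
    if len(value) <= _INLINE_DUMP_ITEMS:
        return pickle.dumps(value)                        # fast path on the event loop
    return await asyncio.to_thread(pickle.dumps, value)   # offload large vaults

async def load_vault(blob: bytes) -> tuple:
    if len(blob) <= _INLINE_LOAD_BYTES:
        return pickle.loads(blob)
    return await asyncio.to_thread(pickle.loads, blob)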

3. LLMGuard Scanner Calls Block Event Loop

Location: llmguard.py:234, 259, 264, 285, 305

Severity: CRITICAL

Issue: All LLMGuard library scanner calls are synchronous and potentially CPU-intensive (ML model inference):

# Input filters - synchronous ML inference
for scanner in self.scanners["input"]["filters"]:
    sanitized_prompt, is_valid, risk_score = scanner.scan(input_prompt)  # BLOCKING

# Input sanitizers - synchronous transformation
result = scan_prompt(self.scanners["input"]["sanitizers"], input_prompt)  # BLOCKING

# Vault leak detection - multiple synchronous operations
sanitized_output_de, _, _ = scanner.scan(result[0], input_prompt)  # BLOCKING
input_anonymize_score = word_wise_levenshtein_distance(input_prompt, result[0])  # BLOCKING
input_deanonymize_score = word_wise_levenshtein_distance(result[0], sanitized_output_de)  # BLOCKING

# Output operations
for scanner in self.scanners["output"]["filters"]:
    sanitized_prompt, is_valid, risk_score = scanner.scan(original_input, model_response)  # BLOCKING

Impact:

  • ML model inference can take 10-500ms per scanner
  • Multiple scanners compound the latency (N scanners = N × inference time)
  • Blocks entire event loop preventing other requests from processing
  • CPU utilization spikes block other async tasks
  • No parallelization of independent scanners

Recommendation:

  • Run scanner operations in thread pool using asyncio.to_thread(scanner.scan, ...)
  • Execute independent scanners in parallel using asyncio.gather()
  • Consider batching multiple inputs to scanners for better GPU utilization
  • Add timeout protection to prevent runaway scanner operations
  • Implement result caching for identical inputs (hash-based lookup)
  • Profile which scanners are most expensive and prioritize optimization
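
A sketch of the thread-pool and timeout recommendations; the helper name and the 2-second default are assumptions, not values from the plugin:

import asyncio

async def _scan_with_timeout(self, scanner, prompt: str, timeout: float = 2.0):
    # Run the synchronous ML scan in a worker thread so the event loop stays free,
    # and stop waiting after `timeout` seconds. Note: wait_for does not kill the
    # worker thread; it only unblocks the caller.
    try:
        return await asyncio.wait_for(asyncio.to_thread(scanner.scan, prompt), timeout)
    except asyncio.TimeoutError:
        # Fail closed here; whether to fail open or closed is a policy decision
        return prompt, False, 1.0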

4. Expensive Levenshtein Distance Calculation

Location: policy.py:67-95, used in llmguard.py:265-266

Severity: MEDIUM-HIGH

Issue: Word-wise Levenshtein distance has O(n×m) complexity and runs synchronously in hot path:

def word_wise_levenshtein_distance(sentence1, sentence2):
    words1 = sentence1.split()
    words2 = sentence2.split()
    n, m = len(words1), len(words2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # O(n×m) memory
    for i in range(n + 1):
        dp[i][0] = i  # base case: i deletions against an empty second sentence
    for j in range(m + 1):
        dp[0][j] = j  # base case: j insertions into an empty first sentence

    for i in range(1, n + 1):
        for j in range(1, m + 1):  # O(n×m) computation
            if words1[i - 1] == words2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1
    return dp[n][m]

Called twice per input when vault leak detection is enabled (lines 265-266).

Impact:

  • For 100-word prompts: 100×100 = 10,000 operations × 2 calls = 20,000 operations
  • Blocks event loop during computation
  • Creates large temporary 2D arrays
  • Called on every anonymized input with vault leak detection enabled

Recommendation:

  • Run in thread pool using asyncio.to_thread() for long prompts
  • Use optimized C-extension libraries like python-Levenshtein or rapidfuzz
  • Implement quick-reject heuristics (length difference threshold)
  • Cache results for identical prompt pairs
  • Consider approximate string matching for acceptable accuracy
  • Only enable for critical security scenarios

Performance Comparison:

# Current: Pure Python O(n×m)
distance = word_wise_levenshtein_distance(s1, s2)  # ~10ms for 100 words

# Optimized: rapidfuzz C extension
from rapidfuzz.distance import Levenshtein
distance = Levenshtein.distance(s1.split(), s2.split())  # ~0.1ms for 100 words

5. Policy Evaluation Using eval()

Location: policy.py:62

Severity: MEDIUM (Security: HIGH)

Issue: Uses eval() to evaluate policy expressions on every scan:

return eval(compile(tree, "<string>", "eval"), {}, policy_variables)

Impact:

  • AST parsing and compilation on every evaluation
  • eval() overhead even with sanitized expressions
  • Executed on every filter result (input and output)

Security Concern: While AST validation helps, eval() is inherently risky

Recommendation:

  • Pre-compile policy expressions during initialization and cache compiled code
  • Use a dedicated expression evaluator (e.g., simpleeval library)
  • For common policies (AND/OR of all filters), use fast path without eval
  • Consider declarative policy format that compiles to Python functions

Example Optimization:

import ast

class GuardrailPolicy:
    def __init__(self):
        self._compiled_cache = {}

    def evaluate(self, policy: str, scan_result: dict):
        if policy not in self._compiled_cache:
            tree = ast.parse(policy, mode="eval")
            # ... validation ...
            self._compiled_cache[policy] = compile(tree, "<string>", "eval")

        policy_variables = {key: value["is_valid"] for key, value in scan_result.items()}
        return eval(self._compiled_cache[policy], {}, policy_variables)

6. Sequential Scanner Execution (No Parallelization)

Location: llmguard.py:233-241, 283-291

Severity: HIGH

Issue: Scanners execute sequentially even though they're independent:

result = {}
for scanner in self.scanners["input"]["filters"]:  # Sequential execution
    sanitized_prompt, is_valid, risk_score = scanner.scan(input_prompt)
    scanner_name = type(scanner).__name__
    result[scanner_name] = {...}

Impact:

  • Total latency = sum of all scanner latencies
  • 5 scanners × 50ms each = 250ms total (could be 50ms if parallel)
  • Wastes CPU cores and GPU resources
  • No benefit from async execution

Recommendation:

  • Execute independent scanners in parallel using asyncio.gather()
  • Run each scanner in thread pool concurrently
  • Aggregate results after all complete
  • Handle failures gracefully (some scanners may fail)

Example Fix:

import asyncio
import logging

logger = logging.getLogger(__name__)

async def _apply_input_filters(self, input_prompt):
    async def scan_async(scanner):
        # Run the synchronous ML scan in a worker thread so the loop stays free
        return await asyncio.to_thread(scanner.scan, input_prompt)

    tasks = [scan_async(scanner) for scanner in self.scanners["input"]["filters"]]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    result = {}
    for scanner, scan_result in zip(self.scanners["input"]["filters"], results):
        if isinstance(scan_result, Exception):
            logger.error("Scanner %s failed: %s", type(scanner).__name__, scan_result)
            continue
        sanitized_prompt, is_valid, risk_score = scan_result
        result[type(scanner).__name__] = {"is_valid": is_valid, "risk_score": risk_score}
    return result

7. Inefficient Context Updates with Nested Loops

Location: plugin.py:75-113

Severity: MEDIUM

Issue: Complex nested conditional logic and loops for context updates:

def update_context(context):
    plugin_name = self.__class__.__name__
    if plugin_name not in context.state[self.guardrails_context_key]:
        context.state[self.guardrails_context_key][plugin_name] = {}
    if key not in context.state[self.guardrails_context_key][plugin_name]:
        context.state[self.guardrails_context_key][plugin_name][key] = value
    else:
        if isinstance(value, dict):
            for k, v in value.items():  # Nested loop 1
                if k not in context.state[self.guardrails_context_key][plugin_name][key]:
                    context.state[self.guardrails_context_key][plugin_name][key][k] = v
                else:
                    if isinstance(v, dict):
                        for k_sub, v_sub in v.items():  # Nested loop 2
                            context.state[self.guardrails_context_key][plugin_name][key][k][k_sub] = v_sub

Impact:

  • Multiple dictionary lookups on hot path
  • Nested loops for complex values
  • Called multiple times per request (lines 133, 145, 164, 181, 231, 244)
  • Creates/updates context even when set_guardrails_context=False (logic checked after update)

Recommendation:

  • Check self.lgconfig.set_guardrails_context early and return
  • Use setdefault() to reduce lookups
  • Consider flattened key structure instead of nested dicts
  • Cache plugin_name and guardrails_context_key access
  • Only update context when actually needed
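
A possible shape for the simplified update, assuming the same context layout; the helper name is hypothetical and the merge semantics for nested results may need adjusting to match current behaviour:

def _update_guardrails_context(self, context, key, value) -> None:
    # Early exit: skip all dictionary work when the context feature is disabled
    if not self.lgconfig.set_guardrails_context:
        return
    plugin_state = (
        context.state.setdefault(self.guardrails_context_key, {})
                     .setdefault(self.__class__.__name__, {})
    )
    if isinstance(value, dict):
        # Shallow merge replaces the hand-rolled nested loops
        plugin_state.setdefault(key, {}).update(value)
    else:
        plugin_state.setdefault(key, value)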

8. Vault Recreation Overhead

Location: llmguard.py:49-69, 254-258

Severity: MEDIUM

Issue: Checks vault expiry on every sanitizer call and recreates vault:

def _create_new_vault_on_expiry(self, vault) -> bool:
    logger.info(f"Vault creation time {vault.creation_time}")  # Unnecessary logging
    logger.info(f"Vault ttl {self.vault_ttl}")
    if datetime.now() - vault.creation_time > timedelta(seconds=self.vault_ttl):  # On every call
        del vault  # Manual deletion
        logger.info("Vault successfully deleted after expiry")
        self._update_input_sanitizers()  # Reinitializes scanners
        return True
    return False

# Called on every sanitizer application:
vault, _, _ = self._retreive_vault()
vault_update_status = self._create_new_vault_on_expiry(vault)  # Every time!

Impact:

  • datetime.now() and timedelta operations on every call
  • Unnecessary vault retrieval when TTL=0 (disabled)
  • Scanner reinitialization overhead when vault expires
  • Excessive logging in hot path

Recommendation:

  • Check TTL once during init; skip expiry check if TTL=0
  • Cache expiry timestamp instead of recalculating each time
  • Use Redis TTL for vault expiry instead of application logic
  • Reduce logging verbosity (debug level, not info)
  • Don't reinitialize scanners, just update vault reference
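
A sketch of a cheaper expiry check, assuming the vault object exposes creation_time and that _update_input_sanitizers() remains the refresh hook; the cached _vault_deadline attribute is an assumption:

import logging
from datetime import datetime, timedelta

logger = logging.getLogger(__name__)

def _vault_expired(self, vault) -> bool:
    # TTL of 0 means expiry is disabled: skip all date arithmetic entirely
    if self.vault_ttl <= 0:
        return False
    # Compute the deadline once per vault instead of a timedelta on every call
    if getattr(self, "_vault_deadline", None) is None:
        self._vault_deadline = vault.creation_time + timedelta(seconds=self.vault_ttl)
    if datetime.now() < self._vault_deadline:
        return False
    logger.debug("Vault expired; refreshing input sanitizers")
    self._vault_deadline = None
    self._update_input_sanitizers()  # existing refresh path in the plugin
    return True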

9. Repeated Scanner Initialization

Location: llmguard.py:162-220

Severity: MEDIUM

Issue: Scanner initialization in __init_scanners() uses get_scanner_by_name() which may be expensive:

for filter_name in policy_filter_names:
    self.scanners["input"]["filters"].append(
        input_scanners.get_scanner_by_name(filter_name, self.lgconfig.input.filters[filter_name])
    )  # May load ML models, download resources

Impact:

  • Happens during plugin initialization (not hot path)
  • ML model loading can take seconds
  • Blocks plugin startup
  • No lazy loading or async initialization

Recommendation:

  • Make initialize() async and await scanner initialization
  • Load scanners lazily on first use
  • Warm up scanners concurrently during startup
  • Cache loaded models across plugin instances if possible
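
A sketch of concurrent warm-up during an async initialize(), assuming the same input_scanners helper and lgconfig layout that __init_scanners() already uses:

import asyncio

async def initialize(self) -> None:
    # Load every configured filter in parallel worker threads; model loading
    # and downloads stay synchronous inside each thread.
    filter_names = list(self.lgconfig.input.filters)

    def load(name):
        # Same helper the plugin already calls in __init_scanners()
        return input_scanners.get_scanner_by_name(name, self.lgconfig.input.filters[name])

    loaded = await asyncio.gather(*(asyncio.to_thread(load, n) for n in filter_names))
    self.scanners["input"]["filters"] = list(loaded)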

Moderate Performance Issues

10. Excessive Logging in Hot Path

Location: Throughout plugin.py and llmguard.py

Severity: LOW-MEDIUM

Issue: Info-level logging on every operation:

logger.info(f"Processing payload {payload}")  # Line 125
logger.info(f"Applying input guardrail filters on {payload.args[key]}")  # Line 138
logger.info(f"Result of input guardrail filters: {result}")  # Line 141
logger.info(f"Result of policy decision: {decision}")  # Line 143
# ... many more

Impact:

  • String formatting overhead even if logging is disabled
  • I/O operations if logging to file
  • Potential sensitive data leakage in logs

Recommendation:

  • Use debug level for detailed operational logs
  • Use lazy evaluation: logger.debug("Result: %s", result) instead of f-strings
  • Guard expensive logging with if logger.isEnabledFor(logging.DEBUG):
  • Avoid logging payload/result contents in production
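
For example (the variables stand in for the payload and scan result handled in the hooks):

import logging

logger = logging.getLogger(__name__)
payload = {"args": {"input": "example prompt"}}    # stand-in for the hook payload
result = {"PromptInjection": {"is_valid": True}}   # stand-in for a scan result

# Lazy formatting: the message string is only built if DEBUG is actually enabled
logger.debug("Result of input guardrail filters: %s", result)

# Guard genuinely expensive log statements explicitly
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Processing payload %s", payload)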

11. No Result Caching for Repeated Inputs

Location: All scanner methods in llmguard.py

Severity: MEDIUM

Issue: Same prompts scanned repeatedly without caching:

result = self.llmguard_instance._apply_input_filters(payload.args[key])
# No cache lookup before expensive scanning

Impact:

  • Identical prompts scanned multiple times
  • Wastes CPU/GPU resources
  • Increases latency unnecessarily
  • Common in batch processing or similar queries

Recommendation:

  • Implement LRU cache for scan results keyed by (prompt_hash, scanner_config)
  • Use TTL-based invalidation (e.g., 5 minutes)
  • Consider cache size limits (e.g., 10,000 entries)
  • Make caching configurable

Example:

from functools import lru_cache
import hashlib

def _cache_key(self, prompt: str, scanner_type: str) -> str:
    return f"{scanner_type}:{hashlib.sha256(prompt.encode()).hexdigest()}"

@lru_cache(maxsize=10000)
def _apply_input_filters_cached(self, prompt: str) -> dict:
    # lru_cache gives a size-bounded cache; TTL-based invalidation would need a
    # purpose-built cache (e.g. cachetools.TTLCache) keyed on _cache_key()
    return self._apply_input_filters(prompt)

12. Repeated Context State Checks

Location: plugin.py:206-219

Severity: LOW-MEDIUM

Issue: Multiple redundant checks for context keys:

if self.guardrails_context_key in context.state:
    original_prompt = context.state[self.guardrails_context_key]["original_prompt"] if "original_prompt" in context.state[self.guardrails_context_key] else ""
    vault_id = context.state[self.guardrails_context_key]["vault_cache_id"] if "vault_cache_id" in context.state[self.guardrails_context_key] else None
else:
    context.state[self.guardrails_context_key] = {}
if self.guardrails_context_key in context.global_context.state:  # Duplicate check
    # Same logic repeated for global context

Impact:

  • Multiple dictionary lookups for same keys
  • Repeated checks in both local and global context
  • Code duplication

Recommendation:

  • Use .get() with defaults: context.state.get(key, {}).get("original_prompt", "")
  • Cache context dictionary reference
  • Extract common logic to helper method
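
A small helper along these lines would remove the duplication (the method name is hypothetical):

def _read_guardrails_state(self, state: dict) -> tuple:
    # One lookup for the guardrails dict, with defaults for missing keys
    guardrails = state.setdefault(self.guardrails_context_key, {})
    return guardrails.get("original_prompt", ""), guardrails.get("vault_cache_id")

# Same helper serves both the request context and the global context
original_prompt, vault_id = self._read_guardrails_state(context.state)
_, global_vault_id = self._read_guardrails_state(context.global_context.state)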

13. Inefficient Vault Retrieval

Location: llmguard.py:83-104

Severity: LOW

Issue: Loops through all sanitizers to find vault:

length = len(self.scanners["input"]["sanitizers"])
for i in range(length):  # Linear search
    scanner_name = type(self.scanners["input"]["sanitizers"][i]).__name__
    if scanner_name in sanitizer_names:
        # ... access vault

Impact:

  • Linear search through scanners on every call
  • Repeated type introspection: type(scanner).__name__
  • Variable i used outside loop (line 104) - potential bug

Recommendation:

  • Build scanner name → scanner index mapping during initialization
  • Cache vault reference instead of retrieving each time
  • Fix potential bug where i may be undefined if loop doesn't execute
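
A sketch of the name-to-scanner mapping built once at initialization (helper names are assumptions):

def _build_sanitizer_index(self) -> None:
    # Computed once, right after __init_scanners() populates the scanner lists
    self._sanitizers_by_name = {
        type(scanner).__name__: scanner
        for scanner in self.scanners["input"]["sanitizers"]
    }

def _get_sanitizer(self, name: str):
    # O(1) lookup replaces the per-call linear search and type introspection
    return self._sanitizers_by_name.get(name)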

14. Exception Handling Without Specificity

Location: llmguard.py:102, 120, 138, 168, 187, 197, 209

Severity: LOW

Issue: Broad exception catching that swallows errors:

except Exception as e:
    logger.error(f"Error retrieving scanner {scanner_name}: {e}")

Impact:

  • Hides bugs and makes debugging difficult
  • Continues execution with partial failures
  • No error propagation to caller

Recommendation:

  • Catch specific exceptions (ValueError, KeyError, etc.)
  • Propagate critical errors to caller
  • Use try-except only for expected error conditions
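
Using the cache layer as an illustration, catch only the failure modes that are actually expected and let everything else propagate:

import logging
import pickle
import redis

logger = logging.getLogger(__name__)

def retrieve_cache(self, key):
    try:
        value = self.cache.get(key)
    except redis.RedisError as e:
        # Expected operational failure: Redis unreachable or timing out
        logger.error("Redis lookup failed for key %s: %s", key, e)
        raise
    if value is None:
        return None
    return pickle.loads(value)  # an unpickling error here is a bug worth surfacing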

Minor Performance Considerations

15. String Operations in Hot Path

Location: plugin.py:149, 168, 248

Issue: String formatting for error messages created unconditionally:

description="{threat} detected in the prompt".format(threat=list(decision[2].keys())[0])

Impact: Minimal, but creates temporary strings and lists

Recommendation: Pre-format common error messages

16. Type Introspection for Scanner Names

Location: llmguard.py:235, 286

Issue: type(scanner).__name__ called for every scanner

Impact: Small overhead, but could be cached

Recommendation: Store scanner name as attribute during initialization

17. Manual Deletion with del

Location: llmguard.py:64, 116

Issue: Explicit del doesn't guarantee immediate memory release

Recommendation: Rely on Python's garbage collection; remove del statements

Architectural Recommendations

1. Async-First Design

Convert entire plugin to async/await pattern:

  • Redis operations → async
  • Scanner calls → run in thread pool
  • Pickle operations → offload for large objects
  • Context updates → async if needed

2. Scanner Execution Pipeline

Implement efficient scanner pipeline:

  1. Cache lookup (fast path)
  2. Parallel independent scanner execution
  3. Result aggregation
  4. Policy evaluation (pre-compiled)
  5. Cache storage
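
A compact sketch of that pipeline, combining the fixes above; the _scan_cache attribute, the policy object, and lgconfig.input.policy are illustrative names rather than the plugin's current API:

import asyncio
import hashlib

async def scan_input(self, prompt: str):
    # 1. Cache lookup: identical prompts skip the scanners entirely
    cache_key = hashlib.sha256(prompt.encode()).hexdigest()
    if cache_key in self._scan_cache:
        return self._scan_cache[cache_key]

    # 2. Parallel execution of independent filters in worker threads
    filters = self.scanners["input"]["filters"]
    raw = await asyncio.gather(
        *(asyncio.to_thread(scanner.scan, prompt) for scanner in filters),
        return_exceptions=True,
    )

    # 3. Result aggregation, skipping scanners that raised
    result = {}
    for scanner, outcome in zip(filters, raw):
        if isinstance(outcome, Exception):
            continue
        _, is_valid, risk_score = outcome
        result[type(scanner).__name__] = {"is_valid": is_valid, "risk_score": risk_score}

    # 4. Policy evaluation against a pre-compiled expression
    decision = self.policy.evaluate(self.lgconfig.input.policy, result)

    # 5. Cache storage for subsequent identical prompts
    self._scan_cache[cache_key] = (decision, result)
    return decision, result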

3. Lazy Initialization

  • Load scanners on first use, not during __init__
  • Defer vault creation until needed
  • Initialize Redis connection pool asynchronously

4. Observability & Monitoring

  • Add metrics for scanner execution time
  • Track cache hit rates
  • Monitor vault expiry and recreation
  • Profile ML inference latency
  • Add timeout alerts

Performance Testing Recommendations

Load Testing Scenarios

  1. Baseline: Single scanner, simple prompts
  2. Multiple Scanners: 5+ scanners, typical prompts
  3. Large Prompts: 1000+ word prompts with vault leak detection
  4. Cache Behavior: Repeated vs unique prompts
  5. Vault Expiry: Performance during vault recreation

Metrics to Track

  • End-to-end latency: Pre-hook and post-hook separately
  • Scanner latency: Per-scanner breakdown
  • Redis latency: Get/set/pipeline operations
  • Cache hit rate: For result caching
  • CPU utilization: During ML inference
  • Memory usage: Vault size, scanner model memory

Profiling Tools

  • cProfile: Identify expensive functions
  • py-spy: Low-overhead async profiling
  • memory_profiler: Track memory allocations
  • Redis SLOWLOG: Identify slow cache operations
  • asyncio debug mode: Detect blocking operations

Implementation Priority

Phase 1 - Critical

  1. Async Redis operations - Single biggest bottleneck
  2. Parallelize scanner execution - 3-5x speedup potential
  3. Offload ML inference to threads - Prevent event loop blocking
  4. Fix pickle blocking - Use async for large objects

Phase 2 - High Impact

  1. Optimize Levenshtein calculation (use C extension)
  2. Implement scan result caching
  3. Pre-compile policy expressions
  4. Reduce context update overhead
  5. Optimize vault expiry checking

Phase 3 - Incremental (Future)

  1. Lazy scanner initialization
  2. Reduce logging overhead
  3. Optimize vault retrieval
  4. Improve error handling specificity
  5. Add comprehensive metrics

Related Files

  • plugins/external/llmguard/llmguardplugin/plugin.py - Main plugin implementation
  • plugins/external/llmguard/llmguardplugin/llmguard.py - LLMGuard wrapper and scanner logic
  • plugins/external/llmguard/llmguardplugin/cache.py - Redis caching layer
  • plugins/external/llmguard/llmguardplugin/policy.py - Policy evaluation and utilities
  • plugins/external/llmguard/llmguardplugin/schema.py - Configuration schema

Labels

bug, performance, plugins
