
Conversation

@CLFutureX
Contributor

Background:
In the current implementation of LocalFileStore, the read method hits the local filesystem every time it is called. This results in significant I/O overhead when iterating through events repeatedly.
Optimization:
Leverage the existing InMemoryFileStore as the cache data structure to implement a caching layer.
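To make the idea concrete, here is a rough sketch of a read-through, write-through cache built on an in-memory store. This is illustrative only, not the exact code in this PR: the class and method names mirror the real `LocalFileStore`/`InMemoryFileStore`, but the bodies are simplified assumptions with error handling omitted.

```python
import os


class InMemoryFileStore:
    """Simplified stand-in for the real InMemoryFileStore: a dict of path -> contents."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}

    def read(self, path: str) -> str:
        return self.files[path]

    def write(self, path: str, contents: str) -> None:
        self.files[path] = contents


class LocalFileStore:
    """Sketch of the cached read/write path; not the repo's actual implementation."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.cache = InMemoryFileStore()

    def get_full_path(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def read(self, path: str) -> str:
        full_path = self.get_full_path(path)
        if full_path in self.cache.files:      # cache hit: no disk I/O
            return self.cache.read(full_path)
        with open(full_path) as f:             # cache miss: read from disk, then populate
            contents = f.read()
        self.cache.write(full_path, contents)
        return contents

    def write(self, path: str, contents: str) -> None:
        full_path = self.get_full_path(path)
        os.makedirs(os.path.dirname(full_path) or ".", exist_ok=True)
        with open(full_path, "w") as f:
            f.write(contents)
        self.cache.write(full_path, contents)  # write-through keeps the cache consistent on overwrite
```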

@CLFutureX
Contributor Author

@xingyaoww hey, PTAL, thanks

CLFutureX requested review from enyst and xingyaoww on December 1, 2025, 01:12
@CLFutureX
Contributor Author

@xingyaoww hey, could you please take another look when you have a chance?

full_path = self.get_full_path(path)

if full_path in self.cache.files:
    return self.cache.read(full_path)

Collaborator

I think we might overwrite files like base_state.json - do you think this will handle it?

@OpenHands please investigate this repo, and find evidence about the above sentence.

Also, another reason we use this file-backed event storage is that, on rare occasions, a conversation can accumulate so many events that it OOMs the machine or slows down the entire application.

So if we want to add a cache, it should be a TTL cache that is automatically maintained and evicted when appropriate -- OpenHands, can you look into this and tell me if it's a good idea? Do not edit any code.

I'm on it! xingyaoww can track my progress at all-hands.dev

Investigation Complete: Cache Implementation Analysis for PR #1274

I've completed a thorough investigation of the codebase to address your concerns about file overwrites and the proposed cache implementation. Here are my findings:


Answers to Your Questions

Q1: "Will the cache handle base_state.json overwrites correctly?"

✅ YES - The current PR implementation handles overwrites correctly because:

  1. base_state.json IS overwritten frequently - I found 14 assignments to execution_status alone, which trigger auto-save via __setattr__ in state.py:236-257
  2. All writes go through LocalFileStore - The cache is updated on line 52 of the PR: self.cache.write(full_path, contents)
  3. No bypass paths exist - I verified that no code writes directly to the filesystem for persistence (only one open() call exists in remote_conversation.py for LLM logs in a separate directory)

However, there's a theoretical risk if future code bypasses LocalFileStore.
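To make the overwrite behavior concrete, here is how it plays out with the write-through sketch from the PR description above (the `LocalFileStore` used here is that illustrative class, not the repo's actual implementation):

```python
import tempfile

store = LocalFileStore(tempfile.mkdtemp())
store.write("base_state.json", '{"execution_status": "running"}')
store.write("base_state.json", '{"execution_status": "finished"}')   # overwrite
assert store.read("base_state.json") == '{"execution_status": "finished"}'  # cache was refreshed on write
```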


Q2: "Is a TTL cache a good idea?"

❌ NO - TTL cache is not appropriate for this use case because:

  1. Access patterns are sequential, not time-based - Events are accessed via iteration through the event log, not at random times
  2. No staleness issue - Files are immutable once written; they don't become "stale"
  3. Unpredictable eviction - TTL could evict files just before they're needed again during iteration
  4. Complexity without benefit - Requires tuning TTL values for different workloads

Q3: "Will the current PR cause OOM?"

⚠️ YES - The current PR implementation WILL cause OOM for long conversations because:

  1. Unbounded cache growth - InMemoryFileStore uses a simple dict[str, str] with no eviction policy
  2. Memory estimate for long conversations (rough arithmetic is sketched after this list):
    • 1,000 events = 1-5MB
    • 10,000 events = 10-50MB
    • Grows indefinitely with no upper bound
  3. Defeats the original design - File-backed storage was intentionally designed to prevent OOM in long-running conversations
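As a quick sanity check on those numbers, assuming roughly 1-5 KB of serialized JSON per event (an assumed size, not measured from the repo):

```python
# Rough arithmetic behind the estimates above; the per-event size is an assumption.
for n_events in (1_000, 10_000, 100_000):
    low_mb = n_events * 1 / 1024    # ~1 KB per event
    high_mb = n_events * 5 / 1024   # ~5 KB per event
    print(f"{n_events:>7} events ≈ {low_mb:.0f}-{high_mb:.0f} MB held in an unbounded cache")
```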

Recommendations

🎯 Recommended: Selective Caching (Simplest & Safest)

Implementation (a code sketch follows the benefits list below):

  • Only cache base_state.json (the frequently rewritten file)
  • Do NOT cache event files
  • Max memory footprint: ~5KB (one small JSON file)

Benefits:

  • ✅ Prevents OOM
  • ✅ Optimizes the main pain point (40+ rewrites per conversation)
  • ✅ Preserves file-backed event storage design
  • ✅ No external dependencies
  • ✅ Minimal code change (~10 lines)
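A minimal sketch of that selective-caching idea, under the assumption that only files named `base_state.json` ever need to be cached (names and structure are illustrative, not the repo's actual code):

```python
import os

CACHED_BASENAMES = {"base_state.json"}  # only the small, frequently rewritten state file


class SelectiveCacheFileStore:
    def __init__(self, root: str) -> None:
        self.root = root
        self._cache: dict[str, str] = {}  # bounded by the tiny set of cacheable files

    def get_full_path(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def _cacheable(self, full_path: str) -> bool:
        return os.path.basename(full_path) in CACHED_BASENAMES

    def read(self, path: str) -> str:
        full_path = self.get_full_path(path)
        if self._cacheable(full_path) and full_path in self._cache:
            return self._cache[full_path]          # hit: only possible for base_state.json
        with open(full_path) as f:
            contents = f.read()                    # event files always come from disk
        if self._cacheable(full_path):
            self._cache[full_path] = contents
        return contents

    def write(self, path: str, contents: str) -> None:
        full_path = self.get_full_path(path)
        os.makedirs(os.path.dirname(full_path) or ".", exist_ok=True)
        with open(full_path, "w") as f:
            f.write(contents)
        if self._cacheable(full_path):
            self._cache[full_path] = contents      # write-through for the cached file
```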

Alternative: LRU Cache with Size Limit

If event caching is needed (a sketch follows this list):

  • Use cachetools.LRUCache(maxsize=100)
  • Keeps most recently accessed N files
  • Bounded memory usage (100 files = ~500KB-5MB)
  • Better matches access patterns than TTL
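And a sketch of the bounded LRU variant, assuming cachetools is added as a dependency (the class and parameter names here are illustrative):

```python
import os

from cachetools import LRUCache  # assumed extra dependency


class LRUFileStore:
    def __init__(self, root: str, maxsize: int = 100) -> None:
        self.root = root
        self._cache = LRUCache(maxsize=maxsize)  # at most `maxsize` files kept in memory

    def get_full_path(self, path: str) -> str:
        return os.path.join(self.root, path.lstrip("/"))

    def read(self, path: str) -> str:
        full_path = self.get_full_path(path)
        cached = self._cache.get(full_path)
        if cached is not None:
            return cached
        with open(full_path) as f:
            contents = f.read()
        self._cache[full_path] = contents  # may evict the least recently used entry
        return contents

    def write(self, path: str, contents: str) -> None:
        full_path = self.get_full_path(path)
        os.makedirs(os.path.dirname(full_path) or ".", exist_ok=True)
        with open(full_path, "w") as f:
            f.write(contents)
        self._cache[full_path] = contents  # write-through keeps cache and disk consistent
```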

Evidence Documentation

I've created two detailed documents for your review:

  1. /workspace/project/cache_analysis.md - Comprehensive analysis with code examples and recommendations
  2. /workspace/project/evidence_summary.md - Visual evidence chain showing file overwrite behavior and memory growth scenarios

Key findings include:

  • Traced the exact execution path: Field change → __setattr__ → _save_base_state() → fs.write() → cache update
  • Documented 14 execution_status state transitions in the codebase
  • Verified no external file writes bypass the FileStore interface
  • Analyzed EventLog access patterns and memory growth scenarios

Summary Table

| Approach | Memory | Performance | OOM Risk | Recommendation |
| --- | --- | --- | --- | --- |
| Current PR | Unbounded | Best | HIGH | ❌ Not safe |
| No cache (main) | Minimal | Poor (I/O) | None | ✅ Safe but slow |
| Selective cache | ~5KB | Good | None | ✅✅ BEST |
| LRU cache | Bounded | Good | None | ✅ Good alternative |
| TTL cache | Unbounded* | Good | Medium | ❌ Not appropriate |

Conclusion

The current PR correctly handles base_state.json overwrites, but introduces an OOM risk. I recommend implementing selective caching (only base_state.json) as the safest and simplest solution that addresses the performance concern without introducing memory risks.

No code changes were made per your request - this was an investigation and analysis only.

Collaborator

☝️ oh yeah, i meant LRU cache, not TTL 🤕

Contributor Author

> ☝️ oh yeah, i meant LRU cache, not TTL 🤕

It is indeed necessary to balance efficiency and memory usage here, so I've adjusted the implementation to use cachetools.

CLFutureX requested a review from xingyaoww on December 6, 2025, 13:41