feat: migrate cache from JSON to SQLite (#46) #58
Conversation
Migrates the per-project JSON cache to a single SQLite database at `~/.claude/cache.db`. This provides a foundation for a future tags/bookmarks feature.

Changes:

- SQLite schema with 8 tables (projects, sessions, cached_files, etc.) + indexes
- Thread-safe connections using `threading.local()` with WAL mode
- Lazy migration: auto-migrates existing JSON caches on first access
- Preserves the existing `CacheManager` public API (no breaking changes)

Test updates:

- Added `temp_sqlite_db` fixture for test isolation
- Added 6 new tests: schema verification, JSON migration, thread safety
- All 346 tests pass
Walkthrough

Replaces JSON-based caching with a SQLite-backed system, introducing a complete database schema, migration logic, thread-safe connection management, and a new exception hierarchy. Cache operations now persist to and query from SQLite instead of filesystem JSON files.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed
Actionable comments posted: 2
🧹 Nitpick comments (10)
claude_code_log/cache.py (5)
239-260: Thread-local connection lifecycle is per-thread only; document / guard for non-test usage

`_connection` and `close_all_connections` correctly provide a thread-local `sqlite3.Connection` per thread, but `close_all_connections` only closes the connection for the calling thread's `threading.local` state. Connections created in worker threads are left to GC, which is fine for tests but can surprise in long-lived processes. Two follow-ups to consider:

- Clarify in the `close_all_connections` docstring that it only affects the current thread.
- For robustness, add a comment near `set_db_path` noting it should be called before any connections are created, or pair it with a per-thread cleanup strategy if used in production code.

No immediate bug given current usage (tests set the DB path before the first `CacheManager` instance), but worth tightening expectations.
261-299: Database initialization guard is sound; consider handling partial schema_version states

`_ensure_database`'s use of `_init_lock` and `_db_initialized` looks correct for single-process, multi-threaded use and avoids re-running schema creation. If you later introduce real migrations (`CURRENT_SCHEMA_VERSION > 1`), you may want to:

- Explicitly wrap any DDL/DML upgrade steps in a transaction before writing a new `schema_version` row.
- Fail fast with a clear `CacheDatabaseError` when encountering a higher `schema_version` than supported by this code, instead of silently treating it as "current".

Nothing to fix now with `CURRENT_SCHEMA_VERSION = 1`, just a note for future schema evolution.
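A possible shape for the suggested fail-fast check (the `CacheDatabaseError` name comes from the PR's exception hierarchy; the query and `schema_version` table layout are assumptions for the sketch):

```python
import sqlite3

CURRENT_SCHEMA_VERSION = 1


class CacheDatabaseError(RuntimeError):
    """Raised when the on-disk schema cannot be handled by this code."""


def check_schema_version(conn: sqlite3.Connection) -> int:
    """Fail fast if the database was written by a newer release."""
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    found = row[0] if row and row[0] is not None else 0
    if found > CURRENT_SCHEMA_VERSION:
        raise CacheDatabaseError(
            f"cache.db schema v{found} is newer than supported v{CURRENT_SCHEMA_VERSION}"
        )
    return found
```

Raising instead of silently proceeding protects users who downgrade the library after a future schema bump.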
568-603: Date filtering semantics correct but can be simplified and extended

The `load_cached_entries_filtered` implementation:

- Correctly uses `dateparser.parse` and `parse_timestamp`, and normalizes message timestamps to naive datetimes for comparison.
- Treats "today", "yesterday", and "X days ago" as day ranges, which satisfies the natural-language requirements.

Two small improvements:

- The `to_date` branch sets end-of-day in both the `if` and `else` clauses; you can collapse this into a single `to_dt = to_dt.replace(...)` to reduce noise.
- If you want more intuitive ranges for phrases like `"last week"` or `"last month"`, you might want to normalize `from_dt` to start-of-day similarly when `dateparser` returns a midnight time.

Not blockers, but would make the logic clearer and more consistent.
528-547: Consider deterministic ordering when loading cached entries

`load_cached_entries` currently selects from `cached_entries` without an `ORDER BY`, and then flattens all rows:

```python
cursor = self._connection.execute(
    "SELECT entries_json FROM cached_entries WHERE cached_file_id = ?",
    (cached_file_id,),
)
```

SQLite does not guarantee row order without `ORDER BY`, so the resulting entry sequence can differ from the original chronological ordering that was implied by timestamp keys. If downstream consumers assume chronological order (rather than re-sorting by timestamp), you may see subtle behavior changes vs the legacy JSON cache. To make ordering explicit:

```diff
- cursor = self._connection.execute(
-     "SELECT entries_json FROM cached_entries WHERE cached_file_id = ?",
-     (cached_file_id,),
- )
+ cursor = self._connection.execute(
+     "SELECT entries_json FROM cached_entries "
+     "WHERE cached_file_id = ? "
+     "ORDER BY timestamp_key",
+     (cached_file_id,),
+ )
```

Similar ordering could be added to `load_cached_entries_filtered` as well.
988-1007: `clear_cache` behavior matches intent; consider reusing `_remove_json_cache()`

`clear_cache` correctly:

- Deletes the project row (cascading to related tables).
- Resets `_project_id` so a new project record is created on next use.
- Cleans up any legacy JSON cache directory.

Given you already have `_remove_json_cache()` with error handling and logging, you could slightly simplify and centralize behavior by delegating the final block:

```diff
- # Also clean up any legacy JSON cache if it exists
- if self.cache_dir.exists():
-     shutil.rmtree(self.cache_dir)
+ # Also clean up any legacy JSON cache if it exists
+ self._remove_json_cache()
```

This keeps JSON cache deletion logic in a single place.
test/conftest.py (1)

11-31: SQLite test fixture design is solid; thread cleanup relies on GC

The `temp_sqlite_db` autouse fixture cleanly:

- Points `CacheManager` at a per-test DB file under `tmp_path`.
- Ensures the schema is (re)initialized by resetting `_db_initialized` after each test.
- Calls `close_all_connections()` to drop the main-thread connection between tests.

Given `CacheManager` uses `threading.local` for connections, note that `close_all_connections()` only closes the current thread's connection; any connections created inside worker threads (e.g. in thread-safety tests) are left to Python's GC. That's fine for this suite, but if you ever see DB files held open unexpectedly, this is where you'd need stronger cleanup. Otherwise, the fixture is well-structured.
test/test_integration_realistic.py (2)

234-253: Good addition verifying caches are actually created in SQLite

The added block in `test_clear_cache_with_projects_dir` that walks projects and asserts at least one has `cached_files` in SQLite via `CacheManager(...).get_cached_project_data()` is a nice end-to-end sanity check that `--all-projects` really populates the new backend, not just HTML. If you want consistency with other tests, you could pass `get_library_version()` instead of a hard-coded `"1.0.0"`, but it's not required here since you only assert that some cache exists.

419-433: SQLite-backed cache verification for real projects looks correct

In `TestCacheWithRealData.test_cache_creation_all_projects`, the new `CacheManager(project_dir, "1.0.0")` + `get_cached_project_data()` assertions provide a solid integration check that processing real projects populates the SQLite cache (not just HTML files). Using `cached_data.version` / `.sessions` presence as smoke tests is appropriate; deeper structure checks are already covered elsewhere.

test/test_cache.py (1)
42-47: Fixture still patches `get_library_version` though `CacheManager` doesn't use it

The `cache_manager` fixture wraps `CacheManager(...)` in:

```python
with patch("claude_code_log.cache.get_library_version", return_value=mock_version):
    return CacheManager(temp_project_dir, mock_version)
```

Since `CacheManager.__init__` only uses the explicitly passed `library_version` argument and never calls `get_library_version`, this patch is effectively a no-op now. It's harmless but could be removed (or relocated into tests that exercise code paths which actually call `get_library_version`, such as CLI integration tests) to reduce confusion.
test/test_cache_integration.py (1)

415-436: Version-upgrade integration scenario exercises cache reuse path

`test_cache_version_upgrade_scenario` now:

- Seeds a project with aggregates under an "old" version via `CacheManager(project_dir, "1.0.0")`.
- Runs `convert_jsonl_to_html` under a patched `get_library_version` returning `"2.0.0"`.

Even though `_is_cache_version_compatible` currently treats all versions as compatible, this acts as a regression test that a version bump doesn't break conversion with an existing SQLite cache present.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `claude_code_log/cache.py` (10 hunks)
- `test/conftest.py` (1 hunks)
- `test/test_cache.py` (8 hunks)
- `test/test_cache_integration.py` (9 hunks)
- `test/test_integration_realistic.py` (4 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
test/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Organize tests into categories with pytest markers to avoid async event loop conflicts: unit tests (no mark), TUI tests (@pytest.mark.tui), browser tests (@pytest.mark.browser), and snapshot tests
Files:
- test/conftest.py
- test/test_integration_realistic.py
- test/test_cache_integration.py
- test/test_cache.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use ruff for code formatting and linting, with ruff check --fix for automatic fixes
Use pyright and mypy for type checking in Python code
Target Python 3.10+ with support for modern Python features and type hints
Files:
- test/conftest.py
- test/test_integration_realistic.py
- test/test_cache_integration.py
- claude_code_log/cache.py
- test/test_cache.py
claude_code_log/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use dateparser for natural language date parsing to support date range filtering with expressions like 'today', 'yesterday', 'last week', and relative dates
Files:
claude_code_log/cache.py
🧬 Code graph analysis (2)
test/conftest.py (1)
- claude_code_log/cache.py (2): `set_db_path` (234-237), `close_all_connections` (255-259)
test/test_cache_integration.py (3)
- test/conftest.py (1): `temp_sqlite_db` (12-30)
- test/test_cache.py (1): `cache_manager` (43-46)
- claude_code_log/cache.py (3): `CacheManager` (200-1093), `get_cached_project_data` (899-986), `update_project_aggregates` (811-854)
🔇 Additional comments (13)
claude_code_log/cache.py (1)
1055-1093: `get_cache_stats` is robust; consider exposing more metrics if needed

`get_cache_stats` safely handles missing project rows and returns a simple, dictionary-based view of stats. This is a good fit for reporting/CLI. If you later need richer introspection (e.g. per-project DB size, average messages per session), consider:

- Either extending this method with optional fields, or
- Adding a separate, more detailed stats method to avoid overloading the basic API.

No changes required now; the current implementation is clean and defensive.
test/test_integration_realistic.py (1)
465-478: Version stored in SQLite is verified appropriately

The `test_cache_version_stored` changes correctly assert that:

- A cache exists after processing.
- The `version` field in `ProjectCache` is populated (and, implicitly, sourced from the DB).

This aligns with the new schema's `projects.library_version` column and provides a simple regression guard for version persistence.

test/test_cache.py (6)
96-104: Initialization expectations align with SQLite-backed implementation

`test_initialization`'s new assertion that `_project_id` is non-None ensures a project row is always present after constructing `CacheManager`, which matches the SQLite design. This is a useful guard against regressions where projects might not get created correctly.
128-151: Direct schema inspection test is useful and well-structured

`test_timestamp_based_cache_structure` querying `cached_entries.timestamp_key` directly is a good low-level check that:

- Entries are bucketed by exact timestamps.
- Summary entries are stored under `_no_timestamp`.

This complements higher-level load tests and gives early signal if the storage format changes.
246-265: `clear_cache` behavior is thoroughly validated

The updated `test_clear_cache` correctly checks that:

- After saving entries, `is_file_cached` is `True`.
- Calling `clear_cache()` resets `_project_id` to `None`.
- A new `CacheManager` instance for the same project sees no cached files.

This closely matches the new SQLite semantics where deleting the project row cascades and a subsequent initialization creates a fresh project record.
568-603: Corrupted entry handling test matches `load_cached_entries` behavior

`test_corrupted_cache_data` manually inserts a row into `cached_files` and a bogus `entries_json` value into `cached_entries`, then asserts that `load_cached_entries` returns `None`. This matches the production behavior where JSON decoding errors are caught, a warning is printed, and `None` is returned. The test provides good regression coverage for that error path.
639-671: SQLite schema creation tests cover core tables and indexes well

`TestSQLiteSchema`'s two tests verify:

- All expected tables (`schema_version`, `projects`, `working_directories`, `cached_files`, `file_sessions`, `sessions`, `cached_entries`, `tags`) are present.
- All key indexes (esp. on `cached_entries`, `sessions`, `cached_files`, `working_directories`) exist.

This is excellent defensive coverage for future migrations or refactors that might accidentally drop or rename objects.
827-881: Thread-safety tests provide strong coverage of concurrent writes

`TestThreadSafety.test_concurrent_cache_writes` and `test_thread_local_connection_isolation` do a good job of:

- Stressing concurrent `update_session_cache` calls across multiple threads, asserting no exceptions and that all sessions are persisted.
- Verifying that each thread receives a distinct SQLite connection object (via different `id(manager._connection)` values).

These tests nicely validate the design choice around `threading.local` and WAL mode.
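A self-contained sketch of the kind of concurrency check these tests perform, using bare `sqlite3` rather than the PR's `CacheManager` (function name and table are invented for the example):

```python
import sqlite3
import threading


def run_concurrency_check(db_path: str, n_threads: int = 8) -> tuple[int, int]:
    """Insert from n_threads threads; return (row count, distinct connection ids)."""
    # Set up schema once, in WAL mode, before the threads start.
    init = sqlite3.connect(db_path)
    init.execute("PRAGMA journal_mode=WAL")
    init.execute("CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY)")
    init.commit()
    init.close()

    conns: list[sqlite3.Connection] = []
    lock = threading.Lock()

    def worker(i: int) -> None:
        conn = sqlite3.connect(db_path, timeout=10)  # one connection per thread
        with lock:
            conns.append(conn)  # keep objects alive so id() values stay distinct
        conn.execute("INSERT INTO sessions VALUES (?)", (f"session-{i}",))
        conn.commit()
        conn.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    check = sqlite3.connect(db_path)
    (count,) = check.execute("SELECT COUNT(*) FROM sessions").fetchone()
    check.close()
    return count, len({id(c) for c in conns})
```

WAL mode plus a busy timeout lets the concurrent writers queue up instead of failing with `SQLITE_BUSY`.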
89-113: CLI--no-cachebehavior test correctly updated for SQLite backendThe updated
test_cli_no_cache_flagnow:
- Verifies that a normal CLI run creates SQLite-backed cache via
CacheManager.get_cached_project_data().- Acknowledges that
--no-cacheskips using the cache during that run, but does not prevent the underlying database file from existing.This matches the new design and avoids brittle assertions about file presence.
114-140: `--clear-cache` integration test now validates SQLite state, not filesystem

`test_cli_clear_cache_flag`'s new logic is good:

- After an initial run, it checks that `get_cached_project_data()` is non-None.
- After invoking `--clear-cache`, a new `CacheManager` instance sees a project with zero `cached_files`.

That's a clear, end-to-end verification that the CLI is correctly delegating to `CacheManager.clear_cache()` and that the SQLite-backed cache is reset.
141-173: All-projects caching assertion correctly targets SQLite-backed cache

The updated `test_cli_all_projects_caching` now asserts, for each synthetic project, that:

- A `CacheManager` instance can reconstruct `ProjectCache`.
- Each project has at least one cached file in `cached_files`.

This is a good regression test that `--all-projects` exercises the new backend across multiple projects.
195-208: Converter integration test verifies cache population without over-specifying internals

`test_convert_jsonl_to_html_with_cache` now checks:

- First run with `use_cache=True` creates HTML.
- The SQLite cache exists and contains at least one `cached_files` entry for the project.

The second run simply asserts successful conversion, which is sufficient; detailed "cache hit" behavior is better left to unit tests. This strikes a good balance for integration scope.
225-254: Project hierarchy processing test correctly checks per-project caches

`test_process_projects_hierarchy_with_cache` was updated to:

- Run `process_projects_hierarchy(..., use_cache=True)`.
- Verify for each project that a `CacheManager` can reconstruct a non-None `ProjectCache`.

This ensures the multi-project processing path wires up the new cache backend for every project, not just the HTML outputs.
```python
# Migrate cached entries from separate JSON file
entry_file = self.cache_dir / f"{Path(file_name).stem}.json"
if entry_file.exists():
    try:
        with open(entry_file, "r", encoding="utf-8") as f:
            entries_by_timestamp = json.load(f)
        for timestamp_key, entries in entries_by_timestamp.items():
            self._connection.execute(
                """
                INSERT INTO cached_entries (cached_file_id, timestamp_key, entries_json)
                VALUES (?, ?, ?)
                """,
                (cached_file_id, timestamp_key, json.dumps(entries)),
            )
    except Exception as e:
        print(
            f"Warning: Failed to migrate entries from {entry_file}: {e}"
        )
```
Per-file JSON cache migration likely misses legacy `*.jsonl.json` files

In `_migrate_json_cache_if_needed`, the per-file cache is looked up as:

```python
entry_file = self.cache_dir / f"{Path(file_name).stem}.json"
```

For a JSON index entry like `"test.jsonl"`, this resolves to `cache/test.json`, but your new test data (`test_json_to_sqlite_migration`) writes the legacy per-file cache as `cache/test.jsonl.json`. As a result, existing `*.jsonl.json` files will be silently skipped during migration and their timestamp-keyed entries never populate `cached_entries`, even though `cached_files` metadata is migrated.

That can leave SQLite with file metadata but no actual entries for previously cached projects, and `load_cached_entries` will return an empty list rather than falling back to regenerating from the JSONL file.

Consider supporting both legacy naming patterns so old caches are fully migrated:

```diff
-        # Migrate cached entries from separate JSON file
-        entry_file = self.cache_dir / f"{Path(file_name).stem}.json"
-        if entry_file.exists():
+        # Migrate cached entries from separate JSON file
+        legacy_entry_files = [
+            # Legacy pattern: "<file_name>.json" -> e.g. "test.jsonl.json"
+            self.cache_dir / f"{file_name}.json",
+            # Alternative pattern: "<stem>.json" -> e.g. "test.json"
+            self.cache_dir / f"{Path(file_name).stem}.json",
+        ]
+        entry_file = next(
+            (p for p in legacy_entry_files if p.exists()),
+            None,
+        )
+
+        if entry_file is not None:
             try:
                 with open(entry_file, "r", encoding="utf-8") as f:
                     entries_by_timestamp = json.load(f)
```

This keeps migration robust across both historical and any future per-file cache naming schemes.
```python
def test_json_to_sqlite_migration(self, temp_project_dir, temp_sqlite_db):
    """Test migration from legacy JSON cache to SQLite."""
    # 1. Create legacy JSON cache structure
    cache_dir = temp_project_dir / "cache"
    cache_dir.mkdir()

    # Create index.json with project data
    index_data = {
        "version": "1.0.0",
        "cache_created": "2023-01-01T10:00:00Z",
        "last_updated": "2023-01-01T11:00:00Z",
        "total_message_count": 50,
        "total_input_tokens": 500,
        "total_output_tokens": 1000,
        "total_cache_creation_tokens": 25,
        "total_cache_read_tokens": 10,
        "earliest_timestamp": "2023-01-01T10:00:00Z",
        "latest_timestamp": "2023-01-01T11:00:00Z",
        "sessions": {
            "session1": {
                "session_id": "session1",
                "summary": "Test session",
                "first_timestamp": "2023-01-01T10:00:00Z",
                "last_timestamp": "2023-01-01T11:00:00Z",
                "message_count": 5,
                "first_user_message": "Hello",
                "total_input_tokens": 100,
                "total_output_tokens": 200,
            }
        },
        "cached_files": {
            "test.jsonl": {
                "file_path": str(temp_project_dir / "test.jsonl"),
                "source_mtime": 1672574400.0,
                "cached_mtime": 1672574500.0,
                "message_count": 5,
                "session_ids": ["session1"],
            }
        },
        "working_directories": ["/test/dir"],
    }
    (cache_dir / "index.json").write_text(json.dumps(index_data), encoding="utf-8")

    # Create per-file cache
    file_cache_data = {
        "2023-01-01T10:00:00Z": [
            {
                "type": "user",
                "uuid": "user1",
                "timestamp": "2023-01-01T10:00:00Z",
                "sessionId": "session1",
                "version": "1.0.0",
                "parentUuid": None,
                "isSidechain": False,
                "userType": "user",
                "cwd": "/test",
                "message": {"role": "user", "content": "Hello"},
            }
        ]
    }
    (cache_dir / "test.jsonl.json").write_text(
        json.dumps(file_cache_data), encoding="utf-8"
    )

    # 2. Initialize CacheManager (triggers migration)
    cache_manager = CacheManager(temp_project_dir, "1.0.0")

    # 3. Verify data in SQLite matches original JSON
    cached_data = cache_manager.get_cached_project_data()
    assert cached_data is not None
    assert cached_data.total_message_count == 50
    assert cached_data.total_input_tokens == 500
    assert "session1" in cached_data.sessions
    assert cached_data.sessions["session1"].summary == "Test session"

    # 4. Verify JSON directory deleted
    assert not cache_dir.exists(), (
        "JSON cache directory should be deleted after migration"
    )
```
JSON→SQLite migration tests are good but assume specific per-file naming
The JSON migration tests are valuable:

- `test_json_to_sqlite_migration` validates that project aggregates, sessions, cached_files, and working_directories are migrated and that the legacy JSON cache directory is removed.
- `test_migration_skips_if_already_in_sqlite` ensures existing SQLite data is not overwritten by stale JSON, while still deleting the JSON cache.

One subtle but important point: `test_json_to_sqlite_migration` writes the per-file cache as:

```python
(cache_dir / "test.jsonl.json").write_text(...)
```

This assumes legacy per-file caches were named `"<jsonl_filename>.json"`, e.g. `test.jsonl.json`. As noted in the cache module review, `_migrate_json_cache_if_needed` currently only looks for `Path(file_name).stem + ".json"` (`cache/test.json`), so these test fixtures won't actually be picked up by migration, and `cached_entries` will remain empty.

Once you adjust the migration code to consider both `f"{file_name}.json"` and `f"{Path(file_name).stem}.json"`, these tests will more accurately reflect real-world legacy naming and ensure entries are migrated, not just metadata.
🤖 Prompt for AI Agents
In test/test_cache.py around lines 704-783, the test writes a per-file legacy
cache as "test.jsonl.json" but the migration implementation only looks for cache
files named "{file_name}.json" (e.g. "test.jsonl.json" vs "test.json"), so
entries are not discovered; update the migration logic in
_migrate_json_cache_if_needed to check both f"{file_name}.json" and
f"{Path(file_name).stem}.json" (i.e., consider both the full filename + ".json"
and the stem + ".json") when locating per-file JSON caches, load whichever
exists, and proceed with migration; ensure both patterns are searched in the
same order and that deletion of the JSON cache directory still occurs after
successful migration.
Oh wow, incredible timing, I implemented the same thing yesterday night and just pushed to: #59 They also seem to be pretty similar from a high level, but I've added a bit more tests and replaced a few more bits that depend on cache. Please have a look, let's figure out together what would be the best approach!
Yours looks better imo. For example, I didn't include a separate migration (just added it into the cache file). Closing this one. Looking forward to yours merging, then I would be happy to work on tags / toc / etc. from #46. Think a web server / app would be great too.
A follow-up to the conversation in #46 - as a first step, migrating the JSON cache to SQLite.
What this PR does
- Migrates the per-project JSON cache to a single SQLite database at `~/.claude/cache.db`
- Preserves the existing `CacheManager` public API (no breaking changes)

Changes

- SQLite schema with 8 tables (projects, sessions, cached_files, etc.) + indexes
- Thread-safe connections using `threading.local()` with WAL mode
- Lazy migration: auto-migrates existing JSON caches on first access

Migration Strategy
When `CacheManager` is initialized for a project, any existing JSON cache is detected and migrated automatically on first access. Users don't need to do anything—migration is automatic and transparent.
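As a rough sketch of this lazy-migration idea (the real logic lives inside `CacheManager`; the function name, table layout, and which fields are copied are all hypothetical here):

```python
import json
import shutil
import sqlite3
from pathlib import Path


def migrate_if_needed(project_dir: Path, conn: sqlite3.Connection) -> bool:
    """Migrate a legacy JSON cache into SQLite on first access; return True if migrated."""
    cache_dir = project_dir / "cache"
    index_file = cache_dir / "index.json"
    if not index_file.exists():
        return False  # nothing to migrate

    index = json.loads(index_file.read_text(encoding="utf-8"))
    conn.execute(
        "INSERT INTO projects (path, library_version) VALUES (?, ?)",
        (str(project_dir), index.get("version", "")),
    )
    conn.commit()
    # Legacy cache is removed only after a successful migration,
    # making a second call a cheap no-op.
    shutil.rmtree(cache_dir)
    return True
```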
Test plan
Open Questions
- The cache database lives at `~/.claude/cache.db`. Should it be `~/.claude/projects/cache.db` instead?
- Should there be a `CLAUDE_CODE_LOG_SQLITE_CACHE=0` env var for rollback?