feat(indexing): enhance search, chunking and file watching #1
Major improvements to indexing and search functionality:
- Add scoring explanations and custom weights
- Improve document chunking with better overlap handling
- Enhance file watching reliability
- Add debug features and logging
- Improve test coverage and error handling

Co-authored-by: Bob <[email protected]>
❌ Changes requested. Reviewed everything up to a1843c3 in 2 minutes and 18 seconds
More details
- Looked at 1501 lines of code in 13 files
- Skipped 1 files when reviewing.
- Skipped posting 0 drafted comments based on config settings.
Workflow ID: wflow_N9D9GylvlYY6OX61
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
Major improvements to indexing and file watching systems:
- Refactor search functionality in indexer
  - Add dedicated methods for result grouping and scoring
  - Improve code organization and readability
  - Better handling of chunk grouping
- Enhance file watcher reliability
  - Add robust error handling and retries
  - Improve file processing logic
  - Better logging for debugging
- Improve test stability
  - Use unique collection names per test
  - Add better assertions and timeouts
  - More reliable test cleanup

Co-authored-by: Bob <[email protected]>
❌ Changes requested. Incremental review on 95b6585 in 48 seconds
More details
- Looked at 808 lines of code in 5 files
- Skipped 0 files when reviewing.
- Skipped posting 3 drafted comments based on config settings.
1. gptme_rag/indexing/watcher.py:68
- Draft comment:
Using `time.sleep` to wait for file writes can be unreliable. Consider implementing a more robust method to ensure the file is fully written, such as checking file size stability over a short period.
- Reason this comment was not posted:
Confidence changes required: 50%
The `_queue_update` method in `watcher.py` directly calls `time.sleep` to wait for the file to be fully written. This could be improved by using a more robust method to ensure the file is ready, such as checking file size stability over a short period.
2. gptme_rag/indexing/watcher.py:218
- Draft comment:
Consider making the initial wait time for exponential backoff configurable to allow flexibility in different environments.
- Reason this comment was not posted:
Confidence changes required: 30%
In `watcher.py`, the `_update_index_with_retries` method uses exponential backoff for retries, but the initial wait time is hardcoded. It might be beneficial to make this configurable for flexibility in different environments.
3. tests/test_indexing.py:78
- Draft comment:
Ensure that the collection is reset before each test that modifies the index to maintain a clean state. This is done here but should be consistent across all relevant tests.
- Reason this comment was not posted:
Confidence changes required: 20%
In `test_indexing.py`, the `test_indexer_add_documents` function includes a reset of the collection before adding documents. This is a good practice to ensure a clean state, but it should be consistently applied across all tests that modify the index.
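For draft comment 1 above, a size-stability check might look roughly like the sketch below. It is not taken from `watcher.py`: the function name `wait_until_stable` and its parameters (`checks`, `interval`, `timeout`) are hypothetical, and the approach assumes a writer that grows the file steadily rather than pausing mid-write.

```python
import os
import time


def wait_until_stable(path, checks=3, interval=0.1, timeout=5.0):
    """Return True once the file size matched on `checks` consecutive comparisons.

    Hypothetical helper, not part of gptme-rag. Returns False on timeout.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            # The file may not exist yet or may be mid-rename; treat as unstable.
            size = -1
        if size >= 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0
        last_size = size
        time.sleep(interval)
    return False
```

A caller could run this before queuing an update and skip (or retry later) when it returns False. Size stability is a heuristic: a writer that stalls for longer than `checks * interval` would still be treated as finished.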
Workflow ID: wflow_rKmO5ZLJTHWJmxRk
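For draft comment 2 above, one way to make the initial backoff wait configurable is to pass it as a parameter. The sketch below is illustrative only: `retry_with_backoff`, `initial_wait`, `factor`, and the injectable `sleep` argument are assumed names, not the actual signature of `_update_index_with_retries`.

```python
import time


def retry_with_backoff(func, max_attempts=3, initial_wait=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call `func` until it succeeds, waiting longer after each failure.

    Hypothetical helper: the wait starts at `initial_wait` seconds and is
    multiplied by `factor` after every failed attempt. The last failure is
    re-raised to the caller.
    """
    wait = initial_wait
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(wait)
            wait *= factor
```

Accepting `sleep` as an argument is a design choice that lets tests record the waits instead of actually sleeping.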
Welcome to Codecov 🎉 Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️
- Add shared indexer fixture with automatic cleanup
- Add cleanup_chroma fixture to reset ChromaDB between tests
- Refactor test files to use shared fixtures
- Remove duplicated setup code
- Use tmp_path fixture instead of custom temp_dir

Co-authored-by: Bob <[email protected]>
👍 Looks good to me! Incremental review on d79047d in 29 seconds
More details
- Looked at 573 lines of code in 4 files
- Skipped 0 files when reviewing.
- Skipped posting 8 drafted comments based on config settings.
1. tests/test_chunking.py:54
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
2. tests/test_chunking.py:56
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
3. tests/test_chunking.py:58
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
4. tests/test_chunking.py:61
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
5. tests/test_chunking.py:63
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
6. tests/test_chunking.py:66
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
7. tests/test_chunking.py:70
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
8. tests/test_chunking.py:72
- Draft comment:
Using `print` statements for debugging in tests is not recommended. Consider using logging or assertions to verify test conditions.
- Reason this comment was not posted:
Confidence changes required: 50%
The use of `print` statements for debugging in tests is not a best practice. It is better to use logging or assertions to verify test conditions.
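The suggestion repeated in the draft comments above (logging and assertions instead of `print`) could be applied along the lines of this sketch. The chunker `chunk_words` and the test `test_chunks_overlap` are hypothetical stand-ins, not the contents of `tests/test_chunking.py`; the sketch only shows debug output going through `logging` while an assertion with a descriptive message does the actual checking.

```python
import logging

logger = logging.getLogger(__name__)


def chunk_words(words, size, overlap):
    """Hypothetical chunker: fixed-size windows sharing `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        chunks.append(chunk)
        # Debug detail goes to the logger rather than stdout.
        logger.debug("chunk %d: %d items", len(chunks) - 1, len(chunk))
        if start + size >= len(words):
            break
    return chunks


def test_chunks_overlap():
    chunks = chunk_words(list(range(10)), size=4, overlap=1)
    for prev, nxt in zip(chunks, chunks[1:]):
        # The assertion message replaces what a debug print would have shown.
        assert prev[-1:] == nxt[:1], f"expected 1 shared item, got {prev} / {nxt}"
```

When run under pytest, the debug lines can be surfaced on demand with `--log-level=DEBUG` instead of being printed on every run.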
Workflow ID: wflow_Aj1IjtJ2rW0dg7N6
Replace debug print statements with proper logging in indexer. Clean up test files by removing debug prints and improving assertions. Add more descriptive error messages in tests. Co-authored-by: Bob <[email protected]>
👍 Looks good to me! Incremental review on 77fbd13 in 25 seconds
More details
- Looked at 198 lines of code in 4 files
- Skipped 0 files when reviewing.
- Skipped posting 2 drafted comments based on config settings.
1. tests/test_chunking.py:108
- Draft comment:
Consider replacing print statements with logging for consistency and better control over output. This applies to other test files as well.
- Reason this comment was not posted:
Confidence changes required: 50%
The PR description mentions removing print statements and adding logging. The test files still contain print statements, which should be replaced with logging for consistency and better control over output.
2. tests/test_document_processor.py:95
- Draft comment:
Consider replacing print statements with logging for consistency and better control over output. This applies to other test files as well.
- Reason this comment was not posted:
Confidence changes required: 50%
The PR description mentions removing print statements and adding logging. The test files still contain print statements, which should be replaced with logging for consistency and better control over output.
Workflow ID: wflow_OW32mc6LKAHWMenl
👍 Looks good to me! Incremental review on 7f59ae2 in 20 seconds
More details
- Looked at 65 lines of code in 1 files
- Skipped 0 files when reviewing.
- Skipped posting 0 drafted comments based on config settings.
Workflow ID: wflow_Zn6mz4NUZ3Qm2xMM
- Add progress bar using tqdm for document indexing
- Refactor document collection and processing for better efficiency
- Improve error handling and logging
- Add JSON support for scoring weights
- Split indexing into collection and processing phases

Co-authored-by: Bob <[email protected]>
👍 Looks good to me! Incremental review on a4a63d0 in 1 minute and 1 seconds
More details
- Looked at 430 lines of code in 3 files
- Skipped 1 files when reviewing.
- Skipped posting 3 drafted comments based on config settings.
1. gptme_rag/cli.py:138
- Draft comment:
Redundant import of `json`. It's already imported at the top of the file. Remove this import to clean up the code.
- Reason this comment was not posted:
Confidence changes required: 50%
The import of `json` in `cli.py` is redundant as it is already imported at the top of the file. This should be removed to clean up the code.
2. gptme_rag/indexing/indexer.py:178
- Draft comment:
Converting the generator `add_documents_progress` to a list is unnecessary and can be removed to improve performance.
- Reason this comment was not posted:
Confidence changes required: 50%
The `add_documents` method in `indexer.py` calls `add_documents_progress` and converts its result to a list, which is unnecessary since `add_documents_progress` is a generator. This conversion should be removed to improve performance.
3. gptme_rag/indexing/indexer.py:859
- Draft comment:
The condition should be `>` instead of `>=` when checking if the file limit is reached. This ensures the warning is only logged when the limit is actually exceeded.
- Reason this comment was not posted:
Confidence changes required: 50%
In `indexer.py`, the `_get_valid_files` method checks if the number of valid files is greater than or equal to the file limit, but the warning message suggests the limit is already reached. The condition should be `>` instead of `>=`. This applies to the `_get_valid_files` method.
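A note on draft comment 2 above: a Python generator does no work until something iterates it, so simply dropping the `list(...)` call could leave the documents unprocessed. The sketch below shows one way to exhaust a generator without building a list. The bodies of `add_documents_progress` and `add_documents` here are hypothetical stand-ins with an invented `store` argument, not the code in `indexer.py`.

```python
from collections import deque


def add_documents_progress(docs, store):
    """Hypothetical generator: stores each doc and yields a running count."""
    for count, doc in enumerate(docs, start=1):
        store.append(doc)
        yield count


def add_documents(docs, store):
    """Run the generator to completion, discarding the progress values."""
    # A deque with maxlen=0 consumes the iterator without retaining items.
    deque(add_documents_progress(docs, store), maxlen=0)
```

A plain `for _ in add_documents_progress(...): pass` loop does the same thing and may read more clearly. In either form the saving over `list(...)` is only the memory for the yielded values.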
Workflow ID: wflow_Ai5mijsS4NPLuRsw
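The difference between `>` and `>=` raised in draft comment 3 above can be shown with a small sketch. The function `limit_status` and its return labels are hypothetical and only illustrate the boundary case, not the logic of `_get_valid_files`.

```python
def limit_status(n_valid: int, file_limit: int) -> str:
    """Classify a file count against a limit (hypothetical helper)."""
    if n_valid > file_limit:
        return "exceeded"
    if n_valid == file_limit:
        # `>=` would fire here too; `>` would not.
        return "reached"
    return "below"
```

Which operator is appropriate depends on whether the warning is meant to say "limit reached" or "limit exceeded", which the comment leaves to the author.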
Major improvements to indexing and search functionality:
Important
Enhances indexing and search with scoring explanations, improved chunking, reliable file watching, and better test coverage.
- […] `Indexer.search()`.
- […] `document_processor.py`.
- […] `watcher.py`.
- […] `indexer.py` and `watcher.py`.
- […] `cli.py` to support new search options like scoring explanations and custom weights.
- […] `test_chunking.py`, `test_indexing.py`, and `test_watcher.py`.
- […] `typecheck` command to `Makefile`.

This description was created by […] for a4a63d0. It will automatically update as commits are pushed.