feat: Add CSV splitter with side-by-side BFS detection #8795
base: main
hey @alex-stoica thanks for the quick work on this! A few questions for you:
@component
class CSVSplitter:
I'd advocate for changing the name of the component to `CSVDocumentSplitter`, to be in line with the naming of our other splitters and to emphasize that it works on Haystack Documents.
Pull Request Test Coverage Report for Build 13091333319
💛 - Coveralls
@sjrl the BFS approach in the For example, in my tests
BFS is more accurate than a simple split threshold because it detects connected non-empty cells in both the vertical and horizontal directions. However, if you know you don't have side tables (no "CSV3" as in the first photo) you can leave
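To make the connected-cells idea concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the function name `find_table_blocks` and the choice to return bounding boxes are assumptions made for illustration only. It treats the CSV as a grid, runs a BFS from every unvisited non-empty cell over its vertical and horizontal neighbours, and reports one bounding box per connected group.

```python
import csv
import io
from collections import deque


def find_table_blocks(csv_text: str) -> list[tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes of connected non-empty cells.

    Hypothetical helper for illustration; not part of the PR or of Haystack.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = max((len(r) for r in rows), default=0)
    # Pad ragged rows so the grid is rectangular.
    grid = [r + [""] * (width - len(r)) for r in rows]

    seen: set[tuple[int, int]] = set()
    blocks = []
    for r in range(len(grid)):
        for c in range(width):
            if (r, c) in seen or not grid[r][c].strip():
                continue
            # BFS from this cell over 4-connected non-empty neighbours.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < len(grid) and 0 <= nx < width
                    if inside and (ny, nx) not in seen and grid[ny][nx].strip():
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            blocks.append((top, left, bottom, right))
    return blocks


# Two tables placed side by side, separated by one empty column.
side_by_side = "a,b,,c,d\n1,2,,3,4\n"
print(find_table_blocks(side_by_side))
```

Because the search follows neighbours in both directions, a side table is found even when no fully empty row separates it from the table next to it, which is the case a row-only threshold cannot handle.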
hey @alex-stoica I've given this a bit more thought and wanted to suggest a slightly different approach to this task, which is to use recursive splitting by threshold.
I'm still working on sample code for trying out this approach, but in the meantime I wanted to get your opinion on it.
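One possible reading of "recursive splitting by threshold" is sketched below in plain Python. This is an assumption about the idea, not the sample code mentioned above and not the implementation in any Haystack PR; the names `parse`, `split_rows`, `transpose`, and `recursive_split` are hypothetical. The grid is split wherever at least `threshold` consecutive empty rows occur, each piece is then split the same way along columns, and the two axes alternate until neither produces a further split.

```python
import csv
import io

Grid = list[list[str]]


def parse(csv_text: str) -> Grid:
    """Read CSV text into a rectangular grid, padding short rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


def split_rows(grid: Grid, threshold: int) -> list[Grid]:
    """Split on runs of >= threshold empty rows; drop leading/trailing empties."""
    chunks, current, pending = [], [], []
    for row in grid:
        if any(cell.strip() for cell in row):
            if current and len(pending) >= threshold:
                chunks.append(current)
                current = []
            elif current:
                current.extend(pending)  # gap too small: keep it in the block
            pending = []
            current.append(row)
        else:
            pending.append(row)
    if current:
        chunks.append(current)
    return chunks


def recursive_split(grid: Grid, row_threshold: int = 1, col_threshold: int = 1,
                    by_rows: bool = True, stalled: bool = False) -> list[Grid]:
    """Alternate row and column splits until neither axis splits further."""
    if by_rows:
        chunks = split_rows(grid, row_threshold)
    else:
        chunks = [transpose(c) for c in split_rows(transpose(grid), col_threshold)]
    if not chunks:
        return []  # the grid held only empty cells
    if len(chunks) == 1:
        if stalled:  # the other axis did not split either: this is one table
            return chunks
        return recursive_split(chunks[0], row_threshold, col_threshold,
                               not by_rows, stalled=True)
    result = []
    for chunk in chunks:
        result.extend(recursive_split(chunk, row_threshold, col_threshold,
                                      not by_rows, stalled=False))
    return result


tables = recursive_split(parse("a,b,,c,d\n1,2,,3,4\n"))
print(tables)
```

Unlike a connected-components search, this variant only ever cuts along full empty rows or columns, so every resulting block is rectangular by construction; the trade-off is that tables that overlap in both row and column ranges without a clean gap are not separated.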
Hey @sjrl first of all, thanks for your valuable feedback! You pointed out some subtle issues that I missed. Regarding your proposal, the connected-components approach via BFS doesn’t really account for the fact that tables are rectangular. Hence, my guess is that an alternative leveraging this information would be more performant. I’m not sure whether your idea would be optimal, but my bet is that it might be more readable. Strictly from a performance perspective, my intuition suggests that a diagonal-based traversal might be faster in some cases:

|     | A   | B   | C   | D   | E   | F   |
|-----|-----|-----|-----|-----|-----|-----|
| 1   | x1  | x2  | -   | -   | -   | -   |
| 2   | -   | x4  | -   | -   | -   | -   |
| 3   | -   | -   | -   | -   | -   | -   |
| 4   | -   | -   | -   | y1  | y2  | y3  |
| 5   | -   | -   | -   | -   | y5  | y6  |
| 6   | -   | -   | -   | y7  | y8  | -   |

It might be faster to jump directly from x1 to x4 rather than iterating row by row. In scenarios where x4 is empty, the algorithm would need a fallback to check x2 or x3. However, such an approach would be quite a pain to implement.

TL;DR:
I’ve moved the files into
hey @alex-stoica after some internal discussion we think going with the recursive row/column approach makes the most sense for now, and then seeing whether it covers most use cases. If not, we can look at bringing in your BFS approach in the future to see if it covers other scenarios. I've opened a Draft PR with the new approach here: #8815. It would be amazing if you could test it to see if it works as expected on your tables. I'll work on bringing over your tests to that PR as well.
Related Issues

Proposed Changes
- `CSVSplitter`: a component that can separate a single CSV into multiple smaller “table” blocks (each block is returned as a `Document`).
- `split_threshold`: if you have purely vertical tables separated by empty rows, the splitter works in a straightforward way; just select how many empty rows are required to separate the blocks.
- `detect_side_tables=True`: this uses BFS to find horizontally adjacent blocks of non-empty cells (i.e., multiple tables in the same row).
- `test_csv_splitter.py` for both vertical-only and side-by-side scenarios.

For reference, in CSVs with side-by-side tables (`CSV3`), `detect_side_tables` should be `True`, and BFS is used to detect the connected components. For a CSV with tables separated only vertically, `detect_side_tables` should be `False` and the component would run much faster.

How did you test it?
- `test_csv_splitter.py` with different CSV layouts

Notes for the reviewer
- The component is in the `converters` category, but it might be more suitable in `preprocessors`.
- `split_index_meta_key` removal (?): `split_index_meta_key = "csv_split_index"` is a little bit useless. It is just a value in `meta` that tags each newly generated `Document` with its positional index after splitting. `CSVSplitter` already has many params, so we can get rid of it if you decide so.

Checklist
I have read the contributors guidelines and the code of conduct.
I have updated the related issue with new insights and changes.
I added unit tests and updated docstrings.
I've used one of the conventional commit types for my PR title, for example feat: CSV splitter with side-by-side.
I documented my code (docstrings & comments).
I ran pre-commit hooks and fixed any issues.