
feat: Add CSV splitter with side-by-side BFS detection #8795

Open · wants to merge 5 commits into main

Conversation

alex-stoica

Related Issues

Proposed Changes

  • Introduce a CSVSplitter that can separate a single CSV into multiple smaller “table” blocks (each block will be returned as a Document)
  • Row-based splitting is controlled by split_threshold: for purely vertical tables separated by empty rows, the splitter works in a straightforward way; just select how many consecutive empty rows are required to separate two blocks
  • Optional side-by-side detection can be enabled via detect_side_tables=True. This uses BFS to find horizontally adjacent blocks of non-empty cells (i.e., multiple tables in the same row).
  • By default, BFS is disabled, preserving a classic “vertical only” splitting approach based on empty rows.
  • Added tests in test_csv_splitter.py for both vertical-only and side-by-side scenarios
[image: example CSV layouts, including CSV3 with side-by-side tables]

For reference, in CSVs with side-by-side tables (CSV3), detect_side_tables should be True, so that BFS is used to detect the connected components. For a CSV with tables separated only vertically, detect_side_tables should be False, and the component runs much faster.
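
For illustration, a hypothetical usage sketch; the import path, run signature, and defaults are assumptions based on the parameters above and the usual Haystack 2.x component pattern, and may differ from the final code:

```python
from haystack import Document
# Hypothetical import path: this PR currently places the component under converters.
from haystack.components.converters.csv import CSVSplitter

splitter = CSVSplitter(
    split_threshold=2,        # consecutive empty rows required to separate vertical blocks
    detect_side_tables=True,  # enable BFS detection of side-by-side tables
)

with open("multi_table.csv") as f:
    csv_doc = Document(content=f.read())

result = splitter.run(documents=[csv_doc])

# Each detected table comes back as its own Document, tagged with its position.
for doc in result["documents"]:
    print(doc.meta.get("csv_split_index"), repr(doc.content[:40]))
```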

How did you test it?

  • Created multiple tests in test_csv_splitter.py with different CSV layouts
  • Verified locally by running:
    hatch run test:unit
    pytest test_csv_splitter.py
    

Notes for the reviewer

  • Performance: BFS over large CSVs can be costly, but it can be disabled (if no side tables are present) or optimized.
  • Unintended merging: a single bridging cell merges two blocks, i.e. side-by-side tables must be completely separated by blank cells, or they will be merged with neighboring cells (see the example after this list)
  • Component category: I've decided to place the component in the converters category, but it might be more suitable in preprocessors
  • split_index_meta_key removal (?): "csv_split_index" is of limited use; it is just a meta value that tags each newly generated Document with its positional index after splitting. CSVSplitter already has many parameters, so we can drop it if you decide so
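
For example, in the grid below a single bridging cell z at C2 connects the x table and the y table, so BFS would return them as one block:

|     | A   | B   | C   | D   | E   |
|-----|-----|-----|-----|-----|-----|
|  1  | x1  | x2  | -   | y1  | y2  |
|  2  | x3  | x4  | z   | y3  | y4  |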

Checklist
I have read the contributors guidelines and the code of conduct.
I have updated the related issue with new insights and changes.
I added unit tests and updated docstrings.
I've used one of the conventional commit types for my PR title, for example feat: CSV splitter with side-by-side.
I documented my code (docstrings & comments).
I ran pre-commit hooks and fixed any issues.

@alex-stoica alex-stoica requested a review from a team as a code owner February 1, 2025 18:32
@alex-stoica alex-stoica requested review from vblagoje and removed request for a team February 1, 2025 18:32
@CLAassistant

CLAassistant commented Feb 1, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Feb 1, 2025
@sjrl
Contributor

sjrl commented Feb 3, 2025

hey @alex-stoica thanks for the quick work on this! A few questions for you:

Performance: BFS over large CSVs can be costly, but it can be disabled (if no side tables are present) or optimized.

  • How expensive is BFS exactly? And for reasonably large CSV tables how long would it take to run your code on it? E.g. are we talking minutes, seconds, microseconds, etc.

  • Also, would you say BFS is more accurate than the simple split threshold? Technically, side-by-side tables could also be split naively by a number of empty columns. So I'm wondering: if BFS is toggled on, should it be used to detect both vertically and horizontally split tables?

Component category: I've decided to place the component in the converters category, but it might be more suitable in preprocessors

  • Let's put this into the preprocessors folder as you say, and in a new file. It better fits where our other splitters live, and it better follows our preprocessor patterns, which take a list of Documents as input and output a list of Documents.

split_index_meta_key removal (?): "csv_split_index" is of limited use; it is just a meta value that tags each newly generated Document with its positional index after splitting. CSVSplitter already has many parameters, so we can drop it if you decide so

  • I actually think having this is a good idea, since we also do this in our DocumentSplitter. What I might suggest, though, is that we don't make this value user-configurable and instead use the same meta key we use in DocumentSplitter, which is split_id. See here



@component
class CSVSplitter:
Contributor

I'd advocate for changing the name of the component to CSVDocumentSplitter, in line with the naming of our other splitters, to emphasize that it works on Haystack Documents.

@coveralls
Collaborator

Pull Request Test Coverage Report for Build 13091333319

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 21 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.09%) to 91.273%

Files with Coverage Reduction:

| File                         | New Missed Lines | %      |
|------------------------------|------------------|--------|
| components/converters/csv.py | 21               | 88.46% |

Totals:
  Change from base Build 13076295715: -0.09%
  Covered Lines: 8995
  Relevant Lines: 9855

💛 - Coveralls

@alex-stoica
Author

alex-stoica commented Feb 3, 2025

@sjrl the BFS approach in the CSVSplitter (I'll rename it to CSVDocumentSplitter, thanks for the suggestion) has a worst-case time complexity of O(N), where N is the total number of cells (rows × columns).

For example, in my tests:

  • A large side-by-side case (an overall grid of 60,002 × 1,000, where the data regions form 4 tables of 30,000 × 100 each): ~27.5 s
  • A large standard case (the same overall grid, but 2 tables of 30,000 × 200 each): ~10.7 s
  • For regular cases (e.g., 200 × 50), both usually take <1 s

BFS is more accurate than a simple split threshold because it detects connected non-empty cells in both vertical and horizontal directions.

However, if you know you don't have side tables (no "CSV3" from the first image), you can leave detect_side_tables=False for faster processing.
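
For context, here's a minimal standalone sketch of the BFS connected-components idea (not the exact code in this PR): the CSV is treated as a boolean grid of non-empty cells, and each component's bounding box becomes one block:

```python
from collections import deque

def connected_blocks(grid: list[list[bool]]) -> list[tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) bounding boxes of connected
    non-empty regions, visiting each cell at most once: O(rows * cols)."""
    rows, cols = len(grid), (len(grid[0]) if grid else 0)
    seen = [[False] * cols for _ in range(rows)]
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # BFS from this seed cell, expanding the bounding box as we go.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            blocks.append((top, left, bottom, right))
    return blocks
```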

@sjrl
Contributor

sjrl commented Feb 4, 2025

hey @alex-stoica I've given this a bit more thought, and I wanted to propose a slightly different approach to accomplishing this task, which is to use recursive splitting by threshold:

  1. Splits by empty rows first
    a. Detects empty rows and separates tables stacked vertically.
  2. Splits by empty columns next
    a. Detects empty columns and separates tables side-by-side.
  3. Recursively checks again for empty rows
    a. Sometimes, splitting columns creates new empty rows, so we split again if needed.
  4. Repeats until no more empty rows/columns exist

I'm still working on sample code to share to try out this approach, but while I do that I did want to get your opinion on this.
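
Roughly, the recursion I have in mind looks like the sketch below (assuming the CSV is read with header=None so that empty cells are NaN; this is not the final sample code):

```python
import pandas as pd

def _groups(keep: list[bool]) -> list[slice]:
    """Group consecutive True positions of a boolean sequence into slices."""
    groups, start = [], None
    for i, k in enumerate(keep):
        if k and start is None:
            start = i
        elif not k and start is not None:
            groups.append(slice(start, i))
            start = None
    if start is not None:
        groups.append(slice(start, len(keep)))
    return groups

def recursive_split(df: pd.DataFrame) -> list[pd.DataFrame]:
    """Split on fully empty rows, then fully empty columns, and recurse
    until no empty separators remain (steps 1-4 above)."""
    row_groups = _groups((~df.isna().all(axis=1)).tolist())
    col_groups = _groups((~df.isna().all(axis=0)).tolist())
    if not row_groups or not col_groups:
        return []  # the frame is entirely empty
    if len(row_groups) == 1 and len(col_groups) == 1:
        return [df.iloc[row_groups[0], col_groups[0]]]  # trim empty margins, done
    tables = []
    for rs in row_groups:
        for cs in col_groups:
            sub = df.iloc[rs, cs]
            if not sub.isna().all().all():  # splitting may expose new empty rows/cols
                tables.extend(recursive_split(sub))
    return tables

# Usage: tables = recursive_split(pd.read_csv("multi_table.csv", header=None))
```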

@alex-stoica
Author

alex-stoica commented Feb 4, 2025

Hey @sjrl first of all, thanks for your valuable feedback! You pointed out some subtle issues that I missed.

Regarding your proposal, the connected-components approach via BFS doesn’t really account for the fact that tables are rectangular. Hence, my guess is that an alternative leveraging this information would be more performant. I’m not sure whether your idea would be optimal, but my bet is that it might be more readable.

Strictly from a performance perspective, my intuition suggests that a diagonal-based traversal might be faster in some cases:

|     | A   | B   | C   | D   | E   | F   |
|-----|-----|-----|-----|-----|-----|-----|
|  1  | x1  | x2  | -   | -   | -   | -   |
|  2  | -   | x4  | -   | -   | -   | -   |
|  3  | -   | -   | -   | -   | -   | -   |
|  4  | -   | -   | -   | y1  | y2  | y3  |
|  5  | -   | -   | -   | -   | y5  | y6  |
|  6  | -   | -   | -   | y7  | y8  | -   |

It might be faster to jump directly from x1 to x4 rather than iterating row by row. In scenarios where x4 is empty, the algorithm would need a fallback to check x2 or x3. However, such an approach would be quite a pain to implement.
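
For concreteness, a toy sketch of that diagonal jump with the fallback probes (a heuristic that assumes tables are mostly filled rectangles; not something implemented in this PR):

```python
def diagonal_extent(grid: list[list[bool]], r0: int, c0: int) -> tuple[int, int]:
    """Walk the diagonal from a table's top-left non-empty cell (r0, c0)
    to approximate its bottom-right corner in O(rows + cols) probes."""
    rows, cols = len(grid), len(grid[0])
    r, c = r0, c0
    while True:
        if r + 1 < rows and c + 1 < cols and grid[r + 1][c + 1]:
            r, c = r + 1, c + 1  # diagonal jump: x1 -> x4 -> ...
        elif r + 1 < rows and grid[r + 1][c]:
            r += 1               # fallback: diagonal cell is empty, extend down
        elif c + 1 < cols and grid[r][c + 1]:
            c += 1               # fallback: extend right
        else:
            return r, c
```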

TL;DR:

  • At first glance, I consider your idea more readable and (maybe) a little bit more performant than the current approach, so it’s surely worth trying.
  • Strictly for performance purposes, I think some BFS on steroids via diagonal traversal might work better, but that would be harder to implement

@alex-stoica
Author

I’ve moved the files into preprocessors as discussed. I’m considering whether to continue with BFS-based splitting or switch to your recursive row/column approach, which I find more intuitive and readable the more I think about it. If we keep the BFS approach, I can incorporate robust delimiter detection (using either pandas.read_csv or csv.Sniffer(), as you mentioned) and then wrap up this PR. If we pivot to your approach, I’m happy to help. Let me know your preference, and I’ll adjust accordingly.

@sjrl
Contributor

sjrl commented Feb 5, 2025

hey @alex-stoica after some internal discussion, we think going with the recursive row/column approach makes the most sense for now; we'll see if it covers most use cases. If not, we can look into bringing in your BFS approach later to cover other scenarios.

I've opened a Draft PR with the new approach here: #8815

It would be amazing if you could test it to see if it works as expected on your tables. I'll work on bringing over your tests to that PR as well.

Labels: topic:tests, type:documentation
Successfully merging this pull request may close these issues: Create a CSV Document splitter
4 participants