
feat: Add CSV splitter with side-by-side BFS detection #8795

Open · wants to merge 5 commits into main

Conversation

alex-stoica

Related Issues

Proposed Changes

  • Introduce a CSVSplitter that can separate a single CSV into multiple smaller “table” blocks (each block will be returned as a Document)
  • Row-based splitting is controlled by split_threshold: for purely vertical tables separated by empty rows, the splitter works in a straightforward way; just select how many consecutive empty rows are required to separate two blocks
  • Optional side-by-side detection can be enabled via detect_side_tables=True. This uses BFS to find horizontally adjacent blocks of non-empty cells (i.e., multiple tables in the same row).
  • By default, BFS is disabled, preserving a classic “vertical only” splitting approach based on empty rows.
  • Added tests in test_csv_splitter.py for both vertical-only and side-by-side scenarios
[image: example CSV layouts, including CSV3 with side-by-side tables]

For reference, in CSVs with side-by-side tables (CSV3), detect_side_tables should be True, so that BFS is used to detect the connected components. For a CSV with tables separated only vertically, detect_side_tables should be False, and the component runs much faster.
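
For illustration, a hypothetical usage sketch; the import path, run signature, and defaults are assumptions based on the parameters above and the usual Haystack 2.x component pattern, and may differ from the final code:

```python
from haystack import Document
# Hypothetical import path: this PR currently places the component under converters.
from haystack.components.converters.csv import CSVSplitter

splitter = CSVSplitter(
    split_threshold=2,        # consecutive empty rows required to separate vertical blocks
    detect_side_tables=True,  # enable BFS detection of side-by-side tables
)

with open("multi_table.csv") as f:
    csv_doc = Document(content=f.read())

result = splitter.run(documents=[csv_doc])

# Each detected table comes back as its own Document, tagged with its position.
for doc in result["documents"]:
    print(doc.meta.get("csv_split_index"), repr(doc.content[:40]))
```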

How did you test it?

  • Created multiple tests in test_csv_splitter.py with different CSV layouts
  • Verified locally by running:
    hatch run test:unit
    pytest test_csv_splitter.py
    

Notes for the reviewer

  • Performance: BFS over large CSVs can be costly, but it can be disabled (if no side tables are present) or optimized.
  • Unintended merging: a single bridging cell merges two blocks, i.e. side-by-side tables must be completely separated by blank cells, or they will be merged with neighboring cells (see the example after this list)
  • Component category: I've decided to place the component in the converters category, but it might be more suitable in preprocessors
  • split_index_meta_key removal (?): "csv_split_index" is of limited use; it is just a meta value that tags each newly generated Document with its positional index after splitting. CSVSplitter already has many parameters, so we can drop it if you decide so
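
For example, in the grid below a single bridging cell z at C2 connects the x table and the y table, so BFS would return them as one block:

|     | A   | B   | C   | D   | E   |
|-----|-----|-----|-----|-----|-----|
|  1  | x1  | x2  | -   | y1  | y2  |
|  2  | x3  | x4  | z   | y3  | y4  |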

Checklist
I have read the contributors guidelines and the code of conduct.
I have updated the related issue with new insights and changes.
I added unit tests and updated docstrings.
I've used one of the conventional commit types for my PR title, for example feat: CSV splitter with side-by-side.
I documented my code (docstrings & comments).
I ran pre-commit hooks and fixed any issues.

@alex-stoica alex-stoica requested a review from a team as a code owner February 1, 2025 18:32
@alex-stoica alex-stoica requested review from vblagoje and removed request for a team February 1, 2025 18:32
@CLAassistant

CLAassistant commented Feb 1, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Feb 1, 2025
@sjrl
Contributor

sjrl commented Feb 3, 2025

hey @alex-stoica thanks for the quick work on this! A few questions for you:

Performance: BFS over large CSVs can be costly, but it can be disabled (if no side tables are present) or optimized.

  • How expensive is BFS exactly? And for reasonably large CSV tables how long would it take to run your code on it? E.g. are we talking minutes, seconds, microseconds, etc.

  • Also, would you say BFS is more accurate than the simple split threshold? Technically, side-by-side tables could also be split naively by a number of empty columns. So I'm wondering: if BFS is toggled on, should it be used to detect both vertically and horizontally split tables?

Component category: I've decided to place the component in the converters category, but it might be more suitable in preprocessors

  • Let's put this into the preprocessors folder as you say, and in a new file. It better fits where our other splitters live, and it better follows our preprocessor patterns, which take a list of Documents as input and output a list of Documents.

split_index_meta_key removal (?): "csv_split_index" is of limited use; it is just a meta value that tags each newly generated Document with its positional index after splitting. CSVSplitter already has many parameters, so we can drop it if you decide so

  • I actually think having this is a good idea, since we also do this in our DocumentSplitter. What I might suggest, though, is that we don't make this value user-configurable and instead use the same meta key we use in DocumentSplitter, which is split_id. See here



@component
class CSVSplitter:
Contributor

I'd advocate for changing the name of the component to CSVDocumentSplitter, in line with the naming of our other splitters, to emphasize that it works on Haystack Documents.

@coveralls
Collaborator

Pull Request Test Coverage Report for Build 13091333319

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 21 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.09%) to 91.273%

Files with Coverage Reduction:

| File                         | New Missed Lines | %      |
|------------------------------|------------------|--------|
| components/converters/csv.py | 21               | 88.46% |

Totals:
  Change from base Build 13076295715: -0.09%
  Covered Lines: 8995
  Relevant Lines: 9855

💛 - Coveralls

@alex-stoica
Author

alex-stoica commented Feb 3, 2025

@sjrl the BFS approach in the CSVSplitter (I'll rename it to CSVDocumentSplitter, thanks for the suggestion) has a worst-case time complexity of O(N), where N is the total number of cells (rows × columns).

For example, in my tests:

  • A large side-by-side case (an overall grid of 60,002 × 1,000, where the data regions form 4 tables of 30,000 × 100 each): ~27.5 s
  • A large standard case (the same overall grid, but 2 tables of 30,000 × 200 each): ~10.7 s
  • For regular cases (e.g., 200 × 50), both usually take <1 s

BFS is more accurate than a simple split threshold because it detects connected non-empty cells in both vertical and horizontal directions.

However, if you know you don't have side tables (no "CSV3" from the first image), you can leave detect_side_tables=False for faster processing.
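
For context, here's a minimal standalone sketch of the BFS connected-components idea (not the exact code in this PR): the CSV is treated as a boolean grid of non-empty cells, and each component's bounding box becomes one block:

```python
from collections import deque

def connected_blocks(grid: list[list[bool]]) -> list[tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) bounding boxes of connected
    non-empty regions, visiting each cell at most once: O(rows * cols)."""
    rows, cols = len(grid), (len(grid[0]) if grid else 0)
    seen = [[False] * cols for _ in range(rows)]
    blocks = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # BFS from this seed cell, expanding the bounding box as we go.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            blocks.append((top, left, bottom, right))
    return blocks
```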

@sjrl
Contributor

sjrl commented Feb 4, 2025

hey @alex-stoica I've given this a bit more thought, and I wanted to propose a slightly different approach to accomplishing this task, which is to use recursive splitting by threshold:

  1. Splits by empty rows first
    a. Detects empty rows and separates tables stacked vertically.
  2. Splits by empty columns next
    a. Detects empty columns and separates tables side-by-side.
  3. Recursively checks again for empty rows
    a. Sometimes, splitting columns creates new empty rows, so we split again if needed.
  4. Repeats until no more empty rows/columns exist

I'm still working on sample code to share to try out this approach, but while I do that I did want to get your opinion on this.
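
Roughly, the recursion I have in mind looks like the sketch below (assuming the CSV is read with header=None so that empty cells are NaN; this is not the final sample code):

```python
import pandas as pd

def _groups(keep: list[bool]) -> list[slice]:
    """Group consecutive True positions of a boolean sequence into slices."""
    groups, start = [], None
    for i, k in enumerate(keep):
        if k and start is None:
            start = i
        elif not k and start is not None:
            groups.append(slice(start, i))
            start = None
    if start is not None:
        groups.append(slice(start, len(keep)))
    return groups

def recursive_split(df: pd.DataFrame) -> list[pd.DataFrame]:
    """Split on fully empty rows, then fully empty columns, and recurse
    until no empty separators remain (steps 1-4 above)."""
    row_groups = _groups((~df.isna().all(axis=1)).tolist())
    col_groups = _groups((~df.isna().all(axis=0)).tolist())
    if not row_groups or not col_groups:
        return []  # the frame is entirely empty
    if len(row_groups) == 1 and len(col_groups) == 1:
        return [df.iloc[row_groups[0], col_groups[0]]]  # trim empty margins, done
    tables = []
    for rs in row_groups:
        for cs in col_groups:
            sub = df.iloc[rs, cs]
            if not sub.isna().all().all():  # splitting may expose new empty rows/cols
                tables.extend(recursive_split(sub))
    return tables

# Usage: tables = recursive_split(pd.read_csv("multi_table.csv", header=None))
```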

@alex-stoica
Author

alex-stoica commented Feb 4, 2025

Hey @sjrl first of all, thanks for your valuable feedback! You pointed out some subtle issues that I missed.

Regarding your proposal, the connected-components approach via BFS doesn’t really account for the fact that tables are rectangular. Hence, my guess is that an alternative leveraging this information would be more performant. I’m not sure whether your idea would be optimal, but my bet is that it might be more readable.

Strictly from a performance perspective, my intuition suggests that a diagonal-based traversal might be faster in some cases:

|     | A   | B   | C   | D   | E   | F   |
|-----|-----|-----|-----|-----|-----|-----|
|  1  | x1  | x2  | -   | -   | -   | -   |
|  2  | -   | x4  | -   | -   | -   | -   |
|  3  | -   | -   | -   | -   | -   | -   |
|  4  | -   | -   | -   | y1  | y2  | y3  |
|  5  | -   | -   | -   | -   | y5  | y6  |
|  6  | -   | -   | -   | y7  | y8  | -   |

It might be faster to jump directly from x1 to x4 rather than iterating row by row. In scenarios where x4 is empty, the algorithm would need a fallback to check x2 or x3. However, such an approach would be quite a pain to implement.
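
For concreteness, a toy sketch of that diagonal jump with the fallback probes (a heuristic that assumes tables are mostly filled rectangles; not something implemented in this PR):

```python
def diagonal_extent(grid: list[list[bool]], r0: int, c0: int) -> tuple[int, int]:
    """Walk the diagonal from a table's top-left non-empty cell (r0, c0)
    to approximate its bottom-right corner in O(rows + cols) probes."""
    rows, cols = len(grid), len(grid[0])
    r, c = r0, c0
    while True:
        if r + 1 < rows and c + 1 < cols and grid[r + 1][c + 1]:
            r, c = r + 1, c + 1  # diagonal jump: x1 -> x4 -> ...
        elif r + 1 < rows and grid[r + 1][c]:
            r += 1               # fallback: diagonal cell is empty, extend down
        elif c + 1 < cols and grid[r][c + 1]:
            c += 1               # fallback: extend right
        else:
            return r, c
```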

TL;DR:

  • At first glance, I consider your idea more readable and (maybe) a little bit more performant than the current approach, so it’s surely worth trying.
  • Strictly for performance purposes, I think some BFS on steroids via diagonal traversal might work better, but that would be harder to implement

@alex-stoica
Author

I’ve moved the files into preprocessors as discussed. I’m considering whether to continue with BFS-based splitting or switch to your recursive row/column approach, which I find more intuitive and readable the more I think about it. If we keep the BFS approach, I can incorporate robust delimiter detection (using either pandas.read_csv or csv.Sniffer(), as you mentioned) and then wrap up this PR. If we pivot to your approach, I’m happy to help. Let me know your preference, and I’ll adjust accordingly.

@sjrl
Contributor

sjrl commented Feb 5, 2025

hey @alex-stoica after some internal discussion, we think going with the recursive row/column approach makes the most sense for now; we'll see if it covers most use cases. If not, we can look into bringing in your BFS approach later to cover other scenarios.

I've opened a Draft PR with the new approach here: #8815

It would be amazing if you could test it to see if it works as expected on your tables. I'll work on bringing over your tests to that PR as well.

Labels: topic:tests, type:documentation
Successfully merging this pull request may close these issues: Create a CSV Document splitter
4 participants