feat: support TSV/other delimiters and fix Markdown table collisions #2061

Open
trippinganymess wants to merge 5 commits into microsoft:main from trippinganymess:TSV-and-other-delimiter-support

Conversation


trippinganymess commented Jun 3, 2026

Problem Statement

The current CSV converter lacks support for alternative delimiter formats (like .tsv, .psv, .ssv) and fails to safely escape structural characters (like pipes and newlines) within cell data, leading to corrupted Markdown table rendering.

Proposed Solution

  1. Dynamic Delimiter Resolution
  • Lets users explicitly override the delimiter via kwargs.
  • If no delimiter is provided, it uses csv.Sniffer() to infer the delimiter from the file content (handling edge cases where the content doesn't match the file extension).
  • If the sniffer fails (csv.Error), it falls back to a default delimiter based on the file's extension or MIME type.
  • Adds text/tsv even though it is not in the official IANA list, because many legacy systems still use it.
  2. Iterative Cell Sanitization
  • Routes all cell data through a new sanitize_cell() helper during string construction to preserve Markdown table integrity.
  • Escapes stray pipes (| becomes \|) to prevent column layout collisions.
  • Flattens newlines (\n becomes a space) and removes carriage returns (\r becomes "") so that each row stays on a single line.
  • Sanitizes rows one at a time, since loading the whole content (all rows at once) could cause a heap memory spike.
  3. Minor Chores
  • Fixed a small text duplication in the documentation of _base_converter.py.
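
For reviewers, the resolution order described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name resolve_delimiter, its parameters, and both lookup tables are hypothetical, and reading .ssv as semicolon-separated is an assumption.

```python
import csv
from typing import Optional

# Hypothetical lookup tables; the mappings used in the PR may differ.
# Assumption: ".ssv" is treated as semicolon-separated.
EXTENSION_DELIMITERS = {".csv": ",", ".tsv": "\t", ".psv": "|", ".ssv": ";"}
MIMETYPE_DELIMITERS = {
    "text/csv": ",",
    "text/tab-separated-values": "\t",
    "text/tsv": "\t",  # not IANA-registered, but seen in legacy systems
}


def resolve_delimiter(
    sample: str,
    extension: Optional[str] = None,
    mimetype: Optional[str] = None,
    delimiter: Optional[str] = None,
) -> str:
    """Pick a delimiter: explicit override, then sniffing, then fallback."""
    # 1. An explicit delimiter from the caller always wins.
    if delimiter is not None:
        return delimiter

    # 2. Try to infer the delimiter from the content itself.
    try:
        return csv.Sniffer().sniff(sample, delimiters=",\t|;").delimiter
    except csv.Error:
        pass  # Sniffing was inconclusive; fall through to the lookups.

    # 3. Fall back to the file extension, then the MIME type.
    if extension and extension.lower() in EXTENSION_DELIMITERS:
        return EXTENSION_DELIMITERS[extension.lower()]
    if mimetype and mimetype.lower() in MIMETYPE_DELIMITERS:
        return MIMETYPE_DELIMITERS[mimetype.lower()]

    # 4. Last resort: plain comma.
    return ","
```

Restricting the sniffer to a small candidate set (the delimiters argument) is one way to reduce false positives on letters or digits that happen to repeat evenly across lines.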

resolves #2019 and #2022

…ssv, .psv)

Refactored the CSV converter to extend its capabilities to convert .tsv, .psv, and .ssv files into Markdown. The implementation now lets the user override the delimiter used in the file. If the delimiter is not specified, a sniffer function is used to determine it. If that fails with csv.Error, we fall back to the file extension and MIME type to determine the delimiter.
…e, carriage return character

This is applied iteratively during row construction to prevent Markdown layout collisions without spiking heap memory.
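
A minimal sketch of what row-by-row sanitization could look like, under the assumption that rows arrive from a csv.reader. Only the name sanitize_cell() comes from this PR's description; its body here is an illustration of the stated rules, and rows_to_markdown is a hypothetical helper.

```python
import csv
import io
from typing import Iterable, Iterator, Sequence


def sanitize_cell(cell: str) -> str:
    """Make a cell value safe to place inside a Markdown table row."""
    return (
        cell.replace("|", "\\|")  # escape pipes so they don't split columns
        .replace("\r", "")        # drop carriage returns
        .replace("\n", " ")       # flatten newlines to keep the row on one line
    )


def rows_to_markdown(rows: Iterable[Sequence[str]]) -> Iterator[str]:
    """Yield Markdown table lines one at a time instead of building a list."""
    iterator = iter(rows)
    header = next(iterator, None)
    if header is None:
        return  # empty input: no table
    yield "| " + " | ".join(sanitize_cell(c) for c in header) + " |"
    yield "| " + " | ".join("---" for _ in header) + " |"
    for row in iterator:
        yield "| " + " | ".join(sanitize_cell(c) for c in row) + " |"


# Usage: csv.reader is itself lazy, so only one row is held at a time.
reader = csv.reader(io.StringIO('name,note\nx,"a|b\nc"\n'))
table = "\n".join(rows_to_markdown(reader))
```

Because both csv.reader and the generator are lazy, peak memory depends on the largest row rather than on the file size; the final join in the usage line is the only place the full output is materialized.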
@trippinganymess
Author

@microsoft-github-policy-service agree

@trippinganymess
Author

I know the sniffer() function is not robust and can produce false positives or false negatives, but it seemed better than the default option. To address the robustness problem, maybe we could use dialect voting, which could provide greater reliability.

Here is what I am thinking:

  1. Get a dialect prediction from a sample at the beginning of the file.
  2. Get a dialect prediction from a sample at the end of the file.
  3. If both dialects match, use that prediction as the delimiter.
  4. If they don't match, pick a third sample from the file and take a majority vote across all three samples.

If that seems like a better approach, please let me know and I will implement it.
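
The four steps above could be sketched as follows. This is only an illustration of the idea, not an implementation in this PR: vote_delimiter, _sniff, and the sample size are hypothetical, the third sample is assumed to come from the middle of the file, and the text is assumed to be already decoded and in memory.

```python
import csv
from collections import Counter
from typing import Optional


def _sniff(sample: str) -> Optional[str]:
    """Return the sniffed delimiter, or None if the sniffer gives up."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=",\t|;").delimiter
    except csv.Error:
        return None


def vote_delimiter(text: str, sample_size: int = 4096) -> Optional[str]:
    """Sniff the head and tail; break a disagreement with a middle sample."""
    head = _sniff(text[:sample_size])
    tail = _sniff(text[-sample_size:])

    # Steps 1-3: if head and tail agree, accept the prediction.
    if head is not None and head == tail:
        return head

    # Step 4: take a third sample (assumed: middle of the file) and vote.
    # Note: tail and middle samples may start mid-line, which can skew
    # the sniffer; aligning samples to line boundaries would be safer.
    mid_start = max(0, len(text) // 2 - sample_size // 2)
    middle = _sniff(text[mid_start:mid_start + sample_size])

    votes = Counter(v for v in (head, tail, middle) if v is not None)
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    # Require at least two of three samples to agree.
    return winner if count >= 2 else None
```

A None result would mean no majority was reached, in which case the existing extension/MIME fallback could still apply.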

Development

Successfully merging this pull request may close these issues.

[BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)