feat: Support reading CSV files with inconsistent column counts #17553
Conversation
Looks like there are some CI checks to address as well; I'd suggest running the standard cargo clippy, fmt, and test commands locally before pushing so we don't need to wait for the GitHub CI checks to catch these.
// Verify we can actually read the data
let results = df.collect().await?;

// Calculate total rows across all batches
let total_rows: usize = results.iter().map(|batch| batch.num_rows()).sum();
assert_eq!(total_rows, 6, "Should have 6 total rows across all batches");
I feel we should assert the actual row contents as well.
// Handle files with different numbers of columns by extending the schema
if fields.len() > column_type_possibilities.len() {
    // New columns found - extend our tracking structures
    for field in fields.iter().skip(column_type_possibilities.len()) {
        column_names.push(field.name().clone());
        let mut possibilities = HashSet::new();
        if records_read > 0 {
            possibilities.insert(field.data_type().clone());
        }
        column_type_possibilities.push(possibilities);
    }
}
What happens if file A has columns t1, t3 but file B has columns t1, t2, t3?
Do we only allow files whose columns are a subset of the other files' columns, in the exact same order? I.e., we don't support union by name?
The current implementation performs positional union (not union by name). Files must have columns in the same order, with later files potentially adding new columns at the end. This is consistent with the CSV format, which has no inherent column naming; column names come from headers and are positional. Union by name would require a different approach and is not implemented in this PR.
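To illustrate the t1/t2/t3 scenario above, here is a rough standalone sketch of what positional union implies; union_schemas_positional is a hypothetical helper written only for this example (it mirrors the inference code quoted above, but is not code from the PR):

use arrow_schema::{DataType, Field, Schema};

// Hypothetical helper: overlapping positions keep the first schema's names,
// and any extra trailing columns from a longer schema are appended (nullable).
fn union_schemas_positional(first: &Schema, second: &Schema) -> Schema {
    let mut fields: Vec<Field> =
        first.fields().iter().map(|f| f.as_ref().clone()).collect();
    fields.extend(
        second
            .fields()
            .iter()
            .skip(first.fields().len())
            .map(|f| f.as_ref().clone().with_nullable(true)),
    );
    Schema::new(fields)
}

fn main() {
    // File A has columns t1, t3; file B has columns t1, t2, t3.
    let file_a = Schema::new(vec![
        Field::new("t1", DataType::Int64, true),
        Field::new("t3", DataType::Utf8, true),
    ]);
    let file_b = Schema::new(vec![
        Field::new("t1", DataType::Int64, true),
        Field::new("t2", DataType::Int64, true),
        Field::new("t3", DataType::Utf8, true),
    ]);
    // Assuming file A is inferred first, the union is built by position:
    // t1, t3, then B's third column appended, so the names become t1, t3, t3.
    // B's second column (t2 values) is read under the column named t3, which is
    // why this is positional union rather than union by name.
    println!("{:#?}", union_schemas_positional(&file_a, &file_b));
}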
Enable DataFusion to read directories containing CSV files with different numbers of columns by implementing schema union during inference.

Changes:
- Modified CSV schema inference to create union schema from all files
- Extended infer_schema_from_stream to handle varying column counts
- Added tests for schema building logic and integration scenarios

Requires CsvReadOptions::new().truncated_rows(true) to handle files with fewer columns than the inferred schema.

Fixes apache#17516
Addresses all review feedback from PR apache#17553 to improve the CSV schema union implementation that allows reading CSV files with different column counts.

Changes based on review:
- Moved unit tests from separate tests.rs to the bottom of file_format.rs
- Updated documentation wording from "now supports" to "can handle"
- Removed all println statements from the integration test
- Added comprehensive assertions for actual row content verification
- Simplified HashSet initialization using HashSet::from([...]) syntax
- Updated truncated_rows config documentation to reflect its expanded purpose
- Removed unnecessary min() calculation in the column processing loop
- Fixed clippy warnings by using enumerate() instead of a range loop

Technical improvements:
- Tests now verify null patterns correctly across the union schema
- Cleaner iteration logic without redundant bounds checking
- Better documentation explaining union schema behavior

The feature continues to work as designed:
- Creates a union schema from all CSV files in a directory
- Files with fewer columns have nulls for missing fields
- Requires explicit opt-in via truncated_rows(true)
- Maintains full backward compatibility
Force-pushed from 81b2dd9 to a0ad856
pub(crate) fn build_schema_helper(
    names: Vec<String>,
    types: &[HashSet<DataType>],
) -> Schema {
Suggested change:

fn build_schema_helper(names: Vec<String>, types: &[HashSet<DataType>]) -> Schema {
// Verify the actual content of the data
// Since we don't know the exact order of rows, just verify the overall structure
I suggest using the assert_snapshot!() macro to check the contents; see a reference in datafusion/datafusion/core/tests/sql/select.rs, lines 22 to 51 in c1ca3c4:
#[tokio::test]
async fn test_list_query_parameters() -> Result<()> {
    let tmp_dir = TempDir::new()?;
    let partition_count = 4;
    let ctx = create_ctx_with_partition(&tmp_dir, partition_count).await?;

    let results = ctx
        .sql("SELECT * FROM test WHERE c1 = $1")
        .await?
        .with_param_values(vec![ScalarValue::from(3i32)])?
        .collect()
        .await?;
    assert_snapshot!(batches_to_sort_string(&results), @r"
    +----+----+-------+
    | c1 | c2 | c3    |
    +----+----+-------+
    | 3  | 1  | false |
    | 3  | 10 | true  |
    | 3  | 2  | true  |
    | 3  | 3  | false |
    | 3  | 4  | true  |
    | 3  | 5  | false |
    | 3  | 6  | true  |
    | 3  | 7  | false |
    | 3  | 8  | true  |
    | 3  | 9  | false |
    +----+----+-------+
    ");
    Ok(())
}
/// Whether to allow CSV files with varying numbers of columns.
/// By default this is set to false and will error if the CSV rows have different lengths.
/// When set to true:
/// - Allows reading multiple CSV files with different column counts
/// - Creates a union schema during inference containing all columns found across files
/// - Files with fewer columns will have missing columns filled with null values
Suggested change:

/// Whether to allow truncated rows when parsing, both within a single file and across files.
///
/// When set to false (default), reading a single CSV file which has rows of different lengths will
/// error; if reading multiple CSV files with different number of columns, it will also fail.
///
/// When set to true, reading a single CSV file with rows of different lengths will pad the truncated
/// rows with null values for the missing columns; if reading multiple CSV files with different number
/// of columns, it creates a union schema containing all columns found across the files, and will
/// pad any files missing columns with null values for their rows.
Just to make it more obvious this config has a dual purpose.
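For the single-file half of that dual purpose, a minimal sketch of the opt-in as described above (the file path and sample data are made up for illustration):

use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical file where some rows have fewer fields than the header:
    //   a,b,c
    //   1,2,3
    //   4,5        <- truncated row
    let df = ctx
        .read_csv("truncated.csv", CsvReadOptions::new().truncated_rows(true))
        .await?;
    // Missing trailing fields in short rows are read as NULL.
    df.show().await?;
    Ok(())
}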
column_type_possibilities.iter_mut().zip(&fields).for_each(
    |(possibilities, field)| {
        possibilities.insert(field.data_type().clone());
    },
);

// Update type possibilities for columns that exist in this file
// Only process fields that exist in both the current file and our tracking structures
for (field_idx, field) in fields.iter().enumerate() {
    if field_idx < column_type_possibilities.len() {
        column_type_possibilities[field_idx]
            .insert(field.data_type().clone());
    }
}
It might be less confusing if we keep the original code:
column_type_possibilities.iter_mut().zip(&fields).for_each(
|(possibilities, field)| {
possibilities.insert(field.data_type().clone());
},
);
But move it before the "Handle files with different numbers of columns by extending the schema" check (since zip will terminate on the shortest iterator), so we first extend the possibilities for existing columns, then afterwards add the new columns.
Another way could be to do both in a single iteration; something like:

for idx, field in enumerate(fields):
    if idx < len(possibilities):
        add_possibility
    else:
        add_new_field
Thoughts?
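A rough Rust reading of that single-iteration idea, using the variable names from the PR; merge_fields_positional is a hypothetical free function written only for this sketch, not code from the diff:

use std::collections::HashSet;
use arrow_schema::{DataType, Field};

// Hypothetical helper mirroring the loop body inside schema inference.
fn merge_fields_positional(
    fields: &[Field],
    column_names: &mut Vec<String>,
    column_type_possibilities: &mut Vec<HashSet<DataType>>,
    records_read: usize,
) {
    for (idx, field) in fields.iter().enumerate() {
        if idx < column_type_possibilities.len() {
            // Existing column: record another observed type for this position.
            column_type_possibilities[idx].insert(field.data_type().clone());
        } else {
            // New trailing column: start tracking its name and possible types.
            column_names.push(field.name().clone());
            let mut possibilities = HashSet::new();
            if records_read > 0 {
                possibilities.insert(field.data_type().clone());
            }
            column_type_possibilities.push(possibilities);
        }
    }
}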
if fields.len() != column_type_possibilities.len() {
    return exec_err!(
        "Encountered unequal lengths between records on CSV file whilst inferring schema. \
         Expected {} fields, found {} fields at record {}",
        column_type_possibilities.len(),
        fields.len(),
        record_number + 1
    );
Would it be beneficial to keep this error message for when the truncated_rows config is false? What does the error message look like right now if we read CSV files with differing column counts and this new behaviour is not enabled?
Summary
Enables DataFusion to read directories containing CSV files with different numbers of columns by implementing schema union during inference.

Previously, attempting to read multiple CSV files with different column counts would fail with:

Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20

This was particularly problematic for evolving datasets where newer files include additional columns (e.g., railway services data where newer files added platform information).
Changes
- Modified infer_schema_from_stream to create a union schema from all files instead of rejecting files with different column counts
- Requires explicit opt-in via truncated_rows(true)
- Added tests for the schema building logic and an integration test with real CSV scenarios
Usage
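A minimal sketch of the opt-in for the multi-file case, based on the description above (the directory path and column counts are illustrative):

use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Directory containing CSV files with different column counts,
    // e.g. older files with 17 columns and newer ones with 20.
    let df = ctx
        .read_csv(
            "data/railway_services/",
            CsvReadOptions::new().truncated_rows(true),
        )
        .await?;
    // The inferred schema is the union of all columns found across the files;
    // files with fewer columns yield NULLs for the missing ones.
    df.show_limit(5).await?;
    Ok(())
}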
Test Results
CSV files
Example
Before this fix:
After this fix:
Closes #17516