Skip to content

Can't read a directory of CSV files / CSV schema evolution: incorrect number of fields for line 1, expected 17 got 20 #17516

@alamb

Description

@alamb

Describe the bug

I was trying to see how fast the Datafusion CSV parser was by using the example from https://duckdb.org/2025/09/08/duckdb-on-the-framework-laptop-13 but DataFusion refused to load it for a few reasons

To Reproduce

wget https://blobs.duckdb.org/nl-railway/railway-services-80-months.zip
unzip railway-services-80-months.zip

Then run

datafusion-cli
andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ datafusion-cli
DataFusion CLI v49.0.2
> select * from 'services';
Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20

Expected behavior

I expect the directory to be treated as a single table, correctly

Note that selecting from individual files work fine:

> select * from 'services/services-2025-01.csv' limit 10;
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name | Stop:Arrival time   | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time       | Stop:Departure delay | Stop:Departure cancelled |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| 15122556       | 2025-01-07   | Intercity    | NS              | 3022                 | false                        | false                    | 0                     | 136517277   | AMRN              | Alkmaar Noord     | 2025-01-07T08:52:00 | 1                  | false                  | 2025-01-07T09:52:00+01:00 | 1                    | false                    |
...
| 15122557       | 2025-01-07   | Sprinter     | NS              | 6929                 | false                        | false                    | 0                     | 136517286   | ZTM               | Zoetermeer        | 2025-01-07T06:55:00 | 0                  | false                  | 2025-01-07T07:55:00+01:00 | 0                    | false                    |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
10 row(s) fetched.
Elapsed 0.019 seconds.

Also, if you remove the few most recent

rm services/services-2025-04.csv
rm services/services-2025-05.csv
rm services/services-2025-06.csv
rm services/services-2025-07.csv
rm services/services-2025-08.csv

It works fine:

> select * from 'services' limit 10;
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name        | Stop:Arrival time         | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time       | Stop:Departure delay | Stop:Departure cancelled |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| 15566803       | 2025-03-15   | Intercity    | NS              | 3970                 | false                        | false                    | 0                     | 140690195   | UT                | Utrecht Centraal         | 2025-03-15T20:17:00+01:00 | 0                  | false                  | 2025-03-15T20:19:00+01:00 | 0                    | false                    |
...
| 15566803       | 2025-03-15   | Intercity    | NS              | 3970                 | false                        | false                    | 0                     | 140690204   | BKF               | Bovenkarspel Flora       | 2025-03-15T21:49:00+01:00 | 0                  | false                  | 2025-03-15T21:49:00+01:00 | 0                    | false                    |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
10 row(s) fetched.
Elapsed 0.015 seconds.
> copy 'services' to 'services.parquet';
+-----------+
| count     |
+-----------+
| 135465619 |
+-----------+
1 row(s) fetched.
Elapsed 9.661 seconds.

Additional context

I suspect that the issue is that 'services/services-2025-07.csv' and others has 20 columns while the other files have 17 columns and for some reason the CSV format is not adapting the schema the same way as the parquet can

> describe 'services/services-2024.csv';
+------------------------------+-----------+-------------+
| column_name                  | data_type | is_nullable |
+------------------------------+-----------+-------------+
| Service:RDT-ID               | Int64     | YES         |
| Service:Date                 | Date32    | YES         |
| Service:Type                 | Utf8      | YES         |
| Service:Company              | Utf8      | YES         |
| Service:Train number         | Int64     | YES         |
| Service:Completely cancelled | Boolean   | YES         |
| Service:Partly cancelled     | Boolean   | YES         |
| Service:Maximum delay        | Int64     | YES         |
| Stop:RDT-ID                  | Int64     | YES         |
| Stop:Station code            | Utf8      | YES         |
| Stop:Station name            | Utf8      | YES         |
| Stop:Arrival time            | Utf8      | YES         |
| Stop:Arrival delay           | Utf8      | YES         |
| Stop:Arrival cancelled       | Utf8      | YES         |
| Stop:Departure time          | Utf8      | YES         |
| Stop:Departure delay         | Utf8      | YES         |
| Stop:Departure cancelled     | Utf8      | YES         |
+------------------------------+-----------+-------------+
17 row(s) fetched.
Elapsed 0.008 seconds.

> describe 'services/services-2025-07.csv';
+------------------------------+-----------+-------------+
| column_name                  | data_type | is_nullable |
+------------------------------+-----------+-------------+
| Service:RDT-ID               | Int64     | YES         |
| Service:Date                 | Date32    | YES         |
| Service:Type                 | Utf8      | YES         |
| Service:Company              | Utf8      | YES         |
| Service:Train number         | Int64     | YES         |
| Service:Completely cancelled | Boolean   | YES         |
| Service:Partly cancelled     | Boolean   | YES         |
| Service:Maximum delay        | Int64     | YES         |
| Stop:RDT-ID                  | Int64     | YES         |
| Stop:Station code            | Utf8      | YES         |
| Stop:Station name            | Utf8      | YES         |
| Stop:Arrival time            | Utf8      | YES         |
| Stop:Arrival delay           | Utf8      | YES         |
| Stop:Arrival cancelled       | Utf8      | YES         |
| Stop:Departure time          | Utf8      | YES         |
| Stop:Departure delay         | Utf8      | YES         |
| Stop:Departure cancelled     | Utf8      | YES         |
| Stop:Platform change         | Boolean   | YES         | <-- New columns!
| Stop:Planned platform        | Utf8      | YES         |
| Stop:Actual platform         | Utf8      | YES         |
+------------------------------+-----------+-------------+
20 row(s) fetched.
Elapsed 0.005 seconds.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions