-
Notifications
You must be signed in to change notification settings - Fork 2k
Open
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
I was trying to see how fast the Datafusion CSV parser was by using the example from https://duckdb.org/2025/09/08/duckdb-on-the-framework-laptop-13 but DataFusion refused to load it for a few reasons
To Reproduce
wget https://blobs.duckdb.org/nl-railway/railway-services-80-months.zip
unzip railway-services-80-months.zipThen run
datafusion-cliandrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ datafusion-cli
DataFusion CLI v49.0.2
> select * from 'services';
Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20Expected behavior
I expect the directory to be treated as a single table, correctly
Note that selecting from individual files work fine:
> select * from 'services/services-2025-01.csv' limit 10;
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name | Stop:Arrival time | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time | Stop:Departure delay | Stop:Departure cancelled |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| 15122556 | 2025-01-07 | Intercity | NS | 3022 | false | false | 0 | 136517277 | AMRN | Alkmaar Noord | 2025-01-07T08:52:00 | 1 | false | 2025-01-07T09:52:00+01:00 | 1 | false |
...
| 15122557 | 2025-01-07 | Sprinter | NS | 6929 | false | false | 0 | 136517286 | ZTM | Zoetermeer | 2025-01-07T06:55:00 | 0 | false | 2025-01-07T07:55:00+01:00 | 0 | false |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+-------------------+---------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
10 row(s) fetched.
Elapsed 0.019 seconds.Also, if you remove the few most recent
rm services/services-2025-04.csv
rm services/services-2025-05.csv
rm services/services-2025-06.csv
rm services/services-2025-07.csv
rm services/services-2025-08.csvIt works fine:
> select * from 'services' limit 10;
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| Service:RDT-ID | Service:Date | Service:Type | Service:Company | Service:Train number | Service:Completely cancelled | Service:Partly cancelled | Service:Maximum delay | Stop:RDT-ID | Stop:Station code | Stop:Station name | Stop:Arrival time | Stop:Arrival delay | Stop:Arrival cancelled | Stop:Departure time | Stop:Departure delay | Stop:Departure cancelled |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
| 15566803 | 2025-03-15 | Intercity | NS | 3970 | false | false | 0 | 140690195 | UT | Utrecht Centraal | 2025-03-15T20:17:00+01:00 | 0 | false | 2025-03-15T20:19:00+01:00 | 0 | false |
...
| 15566803 | 2025-03-15 | Intercity | NS | 3970 | false | false | 0 | 140690204 | BKF | Bovenkarspel Flora | 2025-03-15T21:49:00+01:00 | 0 | false | 2025-03-15T21:49:00+01:00 | 0 | false |
+----------------+--------------+--------------+-----------------+----------------------+------------------------------+--------------------------+-----------------------+-------------+-------------------+--------------------------+---------------------------+--------------------+------------------------+---------------------------+----------------------+--------------------------+
10 row(s) fetched.
Elapsed 0.015 seconds.> copy 'services' to 'services.parquet';
+-----------+
| count |
+-----------+
| 135465619 |
+-----------+
1 row(s) fetched.
Elapsed 9.661 seconds.Additional context
I suspect that the issue is that 'services/services-2025-07.csv' and others has 20 columns while the other files have 17 columns and for some reason the CSV format is not adapting the schema the same way as the parquet can
> describe 'services/services-2024.csv';
+------------------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+------------------------------+-----------+-------------+
| Service:RDT-ID | Int64 | YES |
| Service:Date | Date32 | YES |
| Service:Type | Utf8 | YES |
| Service:Company | Utf8 | YES |
| Service:Train number | Int64 | YES |
| Service:Completely cancelled | Boolean | YES |
| Service:Partly cancelled | Boolean | YES |
| Service:Maximum delay | Int64 | YES |
| Stop:RDT-ID | Int64 | YES |
| Stop:Station code | Utf8 | YES |
| Stop:Station name | Utf8 | YES |
| Stop:Arrival time | Utf8 | YES |
| Stop:Arrival delay | Utf8 | YES |
| Stop:Arrival cancelled | Utf8 | YES |
| Stop:Departure time | Utf8 | YES |
| Stop:Departure delay | Utf8 | YES |
| Stop:Departure cancelled | Utf8 | YES |
+------------------------------+-----------+-------------+
17 row(s) fetched.
Elapsed 0.008 seconds.
> describe 'services/services-2025-07.csv';
+------------------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+------------------------------+-----------+-------------+
| Service:RDT-ID | Int64 | YES |
| Service:Date | Date32 | YES |
| Service:Type | Utf8 | YES |
| Service:Company | Utf8 | YES |
| Service:Train number | Int64 | YES |
| Service:Completely cancelled | Boolean | YES |
| Service:Partly cancelled | Boolean | YES |
| Service:Maximum delay | Int64 | YES |
| Stop:RDT-ID | Int64 | YES |
| Stop:Station code | Utf8 | YES |
| Stop:Station name | Utf8 | YES |
| Stop:Arrival time | Utf8 | YES |
| Stop:Arrival delay | Utf8 | YES |
| Stop:Arrival cancelled | Utf8 | YES |
| Stop:Departure time | Utf8 | YES |
| Stop:Departure delay | Utf8 | YES |
| Stop:Departure cancelled | Utf8 | YES |
| Stop:Platform change | Boolean | YES | <-- New columns!
| Stop:Planned platform | Utf8 | YES |
| Stop:Actual platform | Utf8 | YES |
+------------------------------+-----------+-------------+
20 row(s) fetched.
Elapsed 0.005 seconds.Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working