Can't read a directory of parquet files: 'Stop:Arrival time' because the from data_type = Timestamp(Second, None) does not equal Utf8 #17517

@alamb

Describe the bug

I was playing around with the DataFusion CSV parser using the example from https://duckdb.org/2025/09/08/duckdb-on-the-framework-laptop-13. After converting the CSV files to parquet, DataFusion refused to read the resulting directory.

To Reproduce

Get the data

wget https://blobs.duckdb.org/nl-railway/railway-services-80-months.zip
unzip railway-services-80-months.zip

Then run

mkdir services-parquet
datafusion-cli

Convert each file to parquet:

COPY 'services/services-2019.csv' TO 'services-parquet/services-2019.parquet';
COPY 'services/services-2020.csv' TO 'services-parquet/services-2020.parquet';
COPY 'services/services-2021.csv' TO 'services-parquet/services-2021.parquet';
COPY 'services/services-2022.csv' TO 'services-parquet/services-2022.parquet';
COPY 'services/services-2023.csv' TO 'services-parquet/services-2023.parquet';
COPY 'services/services-2024.csv' TO 'services-parquet/services-2024.parquet';
COPY 'services/services-2025-01.csv' TO 'services-parquet/services-2025-01.parquet';
COPY 'services/services-2025-02.csv' TO 'services-parquet/services-2025-02.parquet';
COPY 'services/services-2025-03.csv' TO 'services-parquet/services-2025-03.parquet';
COPY 'services/services-2025-04.csv' TO 'services-parquet/services-2025-04.parquet';
COPY 'services/services-2025-05.csv' TO 'services-parquet/services-2025-05.parquet';
COPY 'services/services-2025-06.csv' TO 'services-parquet/services-2025-06.parquet';
COPY 'services/services-2025-07.csv' TO 'services-parquet/services-2025-07.parquet';
COPY 'services/services-2025-08.csv' TO 'services-parquet/services-2025-08.parquet';
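
The COPY statements above can also be generated rather than typed by hand, so that each output name is derived from its input name. A minimal sketch, assuming a POSIX shell; `gen_copy_statements` is a hypothetical helper name, not part of DataFusion:

```shell
# Hypothetical helper: print one COPY statement per CSV file in a directory.
# Usage: gen_copy_statements <source-dir> <destination-dir>
gen_copy_statements() {
  src_dir=$1
  dst_dir=$2
  for csv in "$src_dir"/*.csv; do
    [ -e "$csv" ] || continue          # the glob matched nothing
    base=$(basename "$csv" .csv)       # e.g. services-2019
    printf "COPY '%s' TO '%s/%s.parquet';\n" "$csv" "$dst_dir" "$base"
  done
}
```

For example, `gen_copy_statements services services-parquet > convert.sql` writes the statements to a file (`convert.sql` is an arbitrary name), which can then be run with `datafusion-cli -f convert.sql`.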

And then run

DataFusion CLI v49.0.2
> select * from 'services-parquet' limit 10;
Arrow error: Schema error: Fail to merge schema field 'Stop:Arrival time' because the from data_type = Timestamp(Second, None) does not equal Utf8

Expected behavior

I expect to be able to read the data correctly

Additional context

One problem is that the type of the 'Stop:Arrival time' column has been inferred differently in different files. Sometimes it is a timestamp and sometimes a string:

> describe 'services-parquet/services-2020.parquet';
+------------------------------+-----------+-------------+
| column_name                  | data_type | is_nullable |
+------------------------------+-----------+-------------+
| Service:RDT-ID               | Int64     | YES         |
| Service:Date                 | Date32    | YES         |
| Service:Type                 | Utf8View  | YES         |
| Service:Company              | Utf8View  | YES         |
| Service:Train number         | Int64     | YES         |
| Service:Completely cancelled | Boolean   | YES         |
| Service:Partly cancelled     | Boolean   | YES         |
| Service:Maximum delay        | Int64     | YES         |
| Stop:RDT-ID                  | Int64     | YES         |
| Stop:Station code            | Utf8View  | YES         |
| Stop:Station name            | Utf8View  | YES         |
| Stop:Arrival time            | Utf8View  | YES         |
| Stop:Arrival delay           | Utf8View  | YES         |
| Stop:Arrival cancelled       | Utf8View  | YES         |
| Stop:Departure time          | Utf8View  | YES         |
| Stop:Departure delay         | Utf8View  | YES         |
| Stop:Departure cancelled     | Utf8View  | YES         |
+------------------------------+-----------+-------------+
17 row(s) fetched.
Elapsed 0.009 seconds.

> describe 'services-parquet/services-2021.parquet';
+------------------------------+-------------------------+-------------+
| column_name                  | data_type               | is_nullable |
+------------------------------+-------------------------+-------------+
| Service:RDT-ID               | Int64                   | YES         |
| Service:Date                 | Date32                  | YES         |
| Service:Type                 | Utf8View                | YES         |
| Service:Company              | Utf8View                | YES         |
| Service:Train number         | Int64                   | YES         |
| Service:Completely cancelled | Boolean                 | YES         |
| Service:Partly cancelled     | Boolean                 | YES         |
| Service:Maximum delay        | Int64                   | YES         |
| Stop:RDT-ID                  | Int64                   | YES         |
| Stop:Station code            | Utf8View                | YES         |
| Stop:Station name            | Utf8View                | YES         |
| Stop:Arrival time            | Timestamp(Second, None) | YES         |  <--- Note this field type is different
| Stop:Arrival delay           | Int64                   | YES         |
| Stop:Arrival cancelled       | Boolean                 | YES         |
| Stop:Departure time          | Utf8View                | YES         |
| Stop:Departure delay         | Utf8View                | YES         |
| Stop:Departure cancelled     | Utf8View                | YES         |
+------------------------------+-------------------------+-------------+
17 row(s) fetched.
Elapsed 0.008 seconds.
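
To see which of the converted files disagree, the same describe query can be run over every file in the directory. A rough sketch, assuming a POSIX shell and that `datafusion-cli` is on the PATH; `column_types` is a hypothetical helper name and the `DFCLI` variable is an assumed override hook for the binary, neither is part of DataFusion:

```shell
# Hypothetical helper: print the DESCRIBE row of one column for each parquet file.
# Usage: column_types <parquet-dir> <column-name>
# DFCLI (assumption) may name an alternative CLI binary; defaults to datafusion-cli.
column_types() {
  dir=$1
  column=$2
  for f in "$dir"/*.parquet; do
    [ -e "$f" ] || continue            # the glob matched nothing
    printf '%s: ' "$f"
    # -c runs the given command and exits; keep only the row for the requested column
    "${DFCLI:-datafusion-cli}" -c "describe '$f';" \
      | grep -F "| $column " || echo "(column not found)"
  done
}
```

Calling `column_types services-parquet 'Stop:Arrival time'` should print one row per file, which makes it easy to spot where the inferred type switches between a string and a timestamp.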
