Skip to content

refactor: separate get_result_type from coerce_type#6221

Merged
jackwener merged 4 commits intoapache:mainfrom
jackwener:seperate
May 5, 2023
Merged

refactor: separate get_result_type from coerce_type#6221
jackwener merged 4 commits intoapache:mainfrom
jackwener:seperate

Conversation

@jackwener
Copy link
Copy Markdown
Member

@jackwener jackwener commented May 4, 2023

Which issue does this PR close?

related to #5927

Rationale for this change

coerce_type current mix many motivation in it.

  • get_result_type
  • get_common_type
  • check type convert

I prepare to split it into get_result_type and get_common_type

about check type convertion #6223

What changes are included in this PR?

separate get_result_type from coerce_type.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates labels May 4, 2023
Comment on lines -2056 to -2099
let message = format!(
"Expression {case:?} expected to error due to impossible coercion"
);
assert!(logical_plan.is_err(), "{}", message);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type checking is done in the Type coercion analyzer rule rather than directly in the build

@jackwener jackwener changed the title feat: separate get_result_type and coerce_type feat: separate get_result_type from coerce_type May 4, 2023
@jackwener
Copy link
Copy Markdown
Member Author

Next, we should separate getCommonType from coerce_type

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label May 4, 2023
@jackwener jackwener marked this pull request as ready for review May 4, 2023 08:25
@jackwener jackwener changed the title feat: separate get_result_type from coerce_type refactor: separate get_result_type from coerce_type May 4, 2023
Comment on lines +118 to +124
| Operator::NotEq
| Operator::Lt
| Operator::Gt
| Operator::GtEq
| Operator::LtEq
| Operator::IsDistinctFrom
Copy link
Copy Markdown
Member Author

@jackwener jackwener May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original, when we call get_data_type, it will return common type instead of result type (Boolean) in coerce_type. So original it's wrong.

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great step forward -- thank you @jackwener

I think this also closes #3419

Perhaps we can update the comments here https://github.com/apache/arrow-datafusion/blob/4297547df6dc297d692ca82566cfdf135d4730b5/datafusion/expr/src/type_coercion/binary.rs#L111-L121 and update it to say that 'coerce_types' is returns the type the arguments should be coerced to

I am happy to do so in a follow on PR

Thank you for taking this first step to clean up the coercion logic ❤️

|| is_interval(lhs_type)
|| is_interval(rhs_type) =>
{
temporal_add_sub_coercion(lhs_type, rhs_type, op)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually I think it would make the code easier to understand and maintain if we can disentangle coerce_types from get_result_types. To start in this PR, calling into coerce_types is fine.

Eventually I think it would keep the logic much simpler if get_result_type is only called on expressions that have their inputs properly coerced.

@@ -77,7 +77,7 @@ impl PhysicalExpr for DateTimeIntervalExpr {
}

fn data_type(&self, input_schema: &Schema) -> Result<DataType> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also love to remove DateTimeIntervalExpr out of its own thing and into binary.rs with the other operators, but for now 👍

@@ -42,7 +42,7 @@ pub fn binary_operator_data_type(
let result_type = if !any_decimal(lhs_type, rhs_type) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I als find binary_oprator_data_type to be confusing and redundant with get_result_type but we can fix that in a follow on PR perhaps

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion, I will do in following PR

@jackwener
Copy link
Copy Markdown
Member Author

Thanks @alamb review carefully.

@jackwener
Copy link
Copy Markdown
Member Author

jackwener commented May 5, 2023

I prepare to merge this PR because there are some following PR.

Anyone who wants to review this PR can continue, and they they can also review these content in the following PR

@jackwener jackwener merged commit 439863f into apache:main May 5, 2023
@jackwener jackwener deleted the seperate branch May 5, 2023 03:11
@tustvold
Copy link
Copy Markdown
Contributor

One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion. This has non-trivial implications for decimal arithmetic, in particular for division, as type coercion is not idempotent. I'm still thinking about how to handle this better, but thought I would mention it in case you've already given this some thought

@jackwener
Copy link
Copy Markdown
Member Author

One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion.

Agree with it.

It may be related with #6261

@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 27, 2023

One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion.

In my opinion, get_result_type should always be called after type coercion and if the function doesn't support the current input types, it should error.

Basically I think once the "Analyzer" pass has run on the expr tree the rest of the code shouldn't have to guess / coerce types anymore -- the types would always line up or the query would be rejected.

I think this is the standard approach in other databases (and, eg. the type checking / semantic pass in compilers)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants