InList: fix bug for comparing with Null in the list using the set optimization by liukun4515 · Pull Request #2809 · apache/datafusion

liukun4515 · 2022-06-29T02:50:36Z

Which issue does this PR close?

In PR #2156, @Ted-Jiang added a set-based optimization for InList.
But that implementation does not consider the NULL case for IN or NOT IN expressions.

In SQL systems such as MySQL or Spark, NULL is not equal to NULL, and NULL cannot be compared with a value of any other data type.
If a value is compared with NULL, the result must be NULL.

select 1 not in (2,NULL);
NULL 
select 2 in (1,NULL)
NULL
select NULL in (1,NULL)
NULL
select NULL not in (1,NULL)
NULL
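
The four queries above follow SQL three-valued logic. As a rough sketch (not the code in this PR), the rule can be written over `Option` values, with `None` standing in for NULL. The function name `sql_in_list` and the `i64` element type are assumptions made for the example.

```rust
/// SQL three-valued `IN` / `NOT IN`, where `None` stands for NULL.
/// Hypothetical helper for illustration; it is not part of DataFusion.
fn sql_in_list(value: Option<i64>, list: &[Option<i64>], negated: bool) -> Option<bool> {
    // A NULL on the left-hand side always yields NULL.
    let v = value?;
    let found = list.iter().any(|item| *item == Some(v));
    let list_has_null = list.iter().any(|item| item.is_none());
    let in_result = if found {
        Some(true)
    } else if list_has_null {
        // No match, but the NULL in the list could have been a match: unknown.
        None
    } else {
        Some(false)
    };
    // NOT IN flips a known result and leaves NULL as NULL.
    in_result.map(|b| b != negated)
}

fn main() {
    let list = [Some(1), None];
    println!("2 in (1,NULL)        -> {:?}", sql_in_list(Some(2), &list, false));
    println!("NULL in (1,NULL)     -> {:?}", sql_in_list(None, &list, false));
    println!("NULL not in (1,NULL) -> {:?}", sql_in_list(None, &list, true));
    println!("1 not in (2,NULL)    -> {:?}", sql_in_list(Some(1), &[Some(2), None], true));
}
```

Here `None` in the output corresponds to the NULL results listed above.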

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

liukun4515 · 2022-06-29T02:53:37Z

datafusion/physical-expr/src/expressions/in_list.rs

            })
            .collect::<Vec<_>>();
-
+        // TODO do we need to replace this below logic by `in_list_primitive`？


Do we need to replace the logic below with the macro collection_contains_check?

I reviewed the logic below and it looked correct to me -- are you asking about trying to refactor to reduce duplication?

> I reviewed the logic below and it looked correct to me -- are you asking about trying to refactor to reduce duplication?

Yes, I want to replace the logic in in_list_primitive with the macro collection_contains_check.
I will replace the duplicated code in in_list_primitive.

liukun4515 · 2022-06-29T02:54:44Z

datafusion/physical-expr/src/expressions/in_list.rs

-                    .collect::<BooleanArray>(),
-            )));
+                    .map(|vop| {
+                        match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) {


f32 does not implement the Eq trait, so we can't use set::<f32>.contains.

Instead, we just use set::<ScalarValue::Float32/Float64>.

Elsewhere in DataFusion (including in ScalarValue) we use ordered_float to compare floating point numbers

It might be possible to use set::<OrderedFloat<f32>>, which would be more space efficient (fewer bytes than ScalarValue) as well as faster (as the comparison doesn't have to dispatch on the type each time)

https://github.com/apache/arrow-datafusion/blob/88b88d4360054a85982987aa07b3f3afd2db7d70/datafusion/common/src/scalar.rs#L33

@alamb I will optimize this with follow-up pr with issue #2831
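
To make the float discussion concrete, here is a minimal sketch of why a plain `HashSet<f32>` is not available and how a thin wrapper gets around it. `F32Key` is a hypothetical stand-in written only with the standard library; it is not `OrderedFloat` from the `ordered_float` crate, and its normalization rules are an assumption made for this example.

```rust
use std::collections::HashSet;

/// Hypothetical hashable key for an `f32`, stored as its bit pattern.
/// A stand-in for a wrapper type such as `OrderedFloat<f32>`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct F32Key(u32);

impl F32Key {
    fn new(v: f32) -> Self {
        // Map -0.0 to +0.0 and every NaN to one bit pattern, so that
        // values we want to treat as equal share a single key.
        let canonical = if v.is_nan() {
            f32::NAN
        } else if v == 0.0 {
            0.0
        } else {
            v
        };
        F32Key(canonical.to_bits())
    }
}

fn main() {
    // `HashSet<f32>` is rejected by the compiler because f32 implements
    // neither `Eq` nor `Hash`; the wrapper provides both.
    let set: HashSet<F32Key> = [0.0_f32, 3.0, 4.0]
        .iter()
        .map(|v| F32Key::new(*v))
        .collect();
    println!("contains 3.0: {}", set.contains(&F32Key::new(3.0)));
    println!("contains 1.5: {}", set.contains(&F32Key::new(1.5)));
}
```

Each key is four bytes, which illustrates the space argument made above against storing a full ScalarValue per list element.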

codecov-commenter · 2022-06-29T03:26:32Z

Codecov Report

Merging #2809 (f2730a6) into master (b47ab7c) will increase coverage by 0.09%.
The diff coverage is 85.79%.

@@            Coverage Diff             @@
##           master    #2809      +/-   ##
==========================================
+ Coverage   85.18%   85.27%   +0.09%     
==========================================
  Files         275      275              
  Lines       48564    48675     +111     
==========================================
+ Hits        41367    41508     +141     
+ Misses       7197     7167      -30

Impacted Files	Coverage Δ
...atafusion/physical-expr/src/expressions/in_list.rs	`81.83% <85.79%> (+13.63%)`	⬆️
datafusion/expr/src/logical_plan/plan.rs	`74.31% <0.00%> (-0.20%)`	⬇️
datafusion/expr/src/window_frame.rs	`93.27% <0.00%> (+0.84%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

viirya · 2022-06-29T18:58:47Z

I don't clearly understand what this is intended to fix. The issue it closes is "support decimal in (NULL)", but from a quick look, the change seems to be more than that. There is a brief description, but I cannot work it out from that either.

Could you rephrase the issue this is going to fix?

liukun4515 · 2022-06-30T03:00:06Z

> I don't clearly understand what this is intended to fix. The issue it closes is "support decimal in (NULL)", but from a quick look, the change seems to be more than that. There is a brief description, but I cannot work it out from that either.
>
> Could you rephrase the issue this is going to fix?

Sorry for the confusing description.

I have described the issue in #2817 and in the description of this pull request.

PTAL @viirya

alamb

Thank you @liukun4515 -- I reviewed the tests carefully and I think this looks good.

The code "feels" a little more complicated than needed but I think it is doing what it needs to and the test coverage is good.

I wonder if it is worth just 1 test in sql_integration somewhere that ensures this implementation is hooked up correctly rather than just relying on tests in in_list.rs

I think it would be good if @Ted-Jiang was also able to review this PR prior to merging it, but it isn't necessary.

alamb · 2022-07-02T08:56:31Z

datafusion/physical-expr/src/expressions/in_list.rs

+            lit(ScalarValue::Boolean(Some(true))),
+            lit(ScalarValue::Boolean(None)),
+        ];
+        for _ in 1..(OPTIMIZER_INSET_THRESHOLD + 1) {


I don't understand the choice of bounds of 1 ... OPTIMIZER_INSET_THRESHOLD + 1 -- why not 0..OPTIMIZER_INSET_THRESHOLD? Not that this way is wrong, I just don't understand it

change to 0..OPTIMIZER_INSET_THRESHOLD
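
As a quick check of the point above, both ranges run the loop the same number of times, so the padded list ends up the same length either way. This is only a sketch: the threshold value and the `pad_list` helper are placeholders chosen for illustration, not taken from in_list.rs.

```rust
use std::ops::Range;

// Placeholder threshold for illustration only.
const OPTIMIZER_INSET_THRESHOLD: usize = 10;

/// Appends one literal per loop iteration, mirroring the shape of the test loop.
fn pad_list(range: Range<usize>) -> Vec<Option<bool>> {
    let mut list = vec![Some(true), None];
    for _ in range {
        list.push(Some(true));
    }
    list
}

fn main() {
    let one_based = pad_list(1..(OPTIMIZER_INSET_THRESHOLD + 1));
    let zero_based = pad_list(0..OPTIMIZER_INSET_THRESHOLD);
    // Same number of iterations, so the same list length.
    assert_eq!(one_based.len(), zero_based.len());
    println!("{} literals either way", zero_based.len());
}
```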

alamb · 2022-07-02T08:57:59Z

datafusion/physical-expr/src/expressions/in_list.rs

+        let col_a = col("a", &schema)?;
+        let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?;
+
+        // expression: "a in (0,3,4....)"


Suggested change

-        // expression: "a in (0,3,4....)"
+        // expression: "a in (0,Null,3,4....)"

alamb · 2022-07-02T09:00:24Z

datafusion/physical-expr/src/expressions/in_list.rs

+            batch,
+            list.clone(),
+            &false,
+            vec![Some(true), None, None],


alamb · 2022-07-02T09:01:08Z

datafusion/physical-expr/src/expressions/in_list.rs

+        let col_a = col("a", &schema)?;
+        let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?;
+
+        // expression: "a in (0.0,3.0,4.0 ....)"


Suggested change

-        // expression: "a in (0.0,3.0,4.0 ....)"
+        // expression: "a in (0.0,Null,3.0,4.0 ....)"

alamb · 2022-07-02T09:03:30Z

datafusion/physical-expr/src/expressions/in_list.rs


        Ok(())
    }
+


I really like the test coverage

alamb · 2022-07-02T09:07:13Z

datafusion/physical-expr/src/expressions/in_list.rs

-                    .collect::<BooleanArray>(),
-            )));
+                    .map(|vop| {
+                        match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) {


Elsewhere in DataFusion (including in ScalarValue) we use ordered_float to compare floating point numbers

It might be possible to use set::<OrderedFloat<f32>>, which would be more space efficient (fewer bytes than ScalarValue) as well as faster (as the comparison doesn't have to dispatch on the type each time)

https://github.com/apache/arrow-datafusion/blob/88b88d4360054a85982987aa07b3f3afd2db7d70/datafusion/common/src/scalar.rs#L33

alamb · 2022-07-02T09:15:38Z

datafusion/physical-expr/src/expressions/in_list.rs

            })
            .collect::<Vec<_>>();
-
+        // TODO do we need to replace this below logic by `in_list_primitive`？


I reviewed the logic below and it looked correct to me -- are you asking about trying to refactor to reduce duplication?

liukun4515 · 2022-07-03T03:08:47Z

> Thank you @liukun4515 -- I reviewed the tests carefully and I think this looks good.
>
> The code "feels" a little more complicated than needed but I think it is doing what it needs to and the test coverage is good.
>
> I wonder if it is worth just 1 test in sql_integration somewhere that ensures this implementation is hooked up correctly rather than just relying on tests in in_list.rs
>
> I think it would be good if @Ted-Jiang was also able to review this PR prior to merging it, but it isn't necessary.

Issue #2832 tracks it.

viirya · 2022-07-03T07:04:14Z

datafusion/physical-expr/src/expressions/in_list.rs

+    collection_contains_check!(array, native_set, negated, contains_null)
+}
+
+fn set_contains_utf<OffsetSize: OffsetSizeTrait>(


Suggested change

-fn set_contains_utf<OffsetSize: OffsetSizeTrait>(
+fn make_set_contains_utf8<OffsetSize: OffsetSizeTrait>(

viirya · 2022-07-03T07:05:37Z

datafusion/physical-expr/src/expressions/in_list.rs

                DataType::Boolean => {
                    let array = array.as_any().downcast_ref::<BooleanArray>().unwrap();
-                    set_contains_with_negated!(array, set, self.negated)
+                    Ok(set_contains_for_primitive!(


I'm wondering why set_contains_for_primitive is a macro, but set_contains_utf and make_set_contains_decimal are methods?

set_contains_utf and make_set_contains_decimal each apply to one specific data type, which is not compatible with the other primitive cases.
set_contains_for_primitive applies to many similar cases that share the same logic.
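
The trade-off described here can be sketched roughly as follows: a macro stamps out one body for several element types, while a type that needs different handling gets its own function. The names `set_contains!` and `set_contains_str` are hypothetical, plain `Vec<Option<T>>` stands in for Arrow arrays, and NULL values inside the list are ignored to keep the example short.

```rust
use std::collections::HashSet;

// Hypothetical macro: the same body is reused for every element type
// that can be looked up in the set by value.
macro_rules! set_contains {
    ($VALUES:expr, $SET:expr, $NEGATED:expr) => {
        $VALUES
            .iter()
            .map(|v| v.map(|x| $SET.contains(&x) != $NEGATED))
            .collect::<Vec<Option<bool>>>()
    };
}

// Hypothetical dedicated function: strings are looked up as `&str`
// in a set of owned `String`s, so they get their own code path.
fn set_contains_str(
    values: &[Option<&str>],
    set: &HashSet<String>,
    negated: bool,
) -> Vec<Option<bool>> {
    values
        .iter()
        .map(|v| v.map(|s| set.contains(s) != negated))
        .collect()
}

fn main() {
    let ints = vec![Some(1), None, Some(5)];
    let int_set: HashSet<i32> = [1, 2, 3].iter().copied().collect();
    println!("{:?}", set_contains!(ints, int_set, false));

    let bools = vec![Some(true), Some(false)];
    let bool_set: HashSet<bool> = [true].iter().copied().collect();
    println!("{:?}", set_contains!(bools, bool_set, true));

    let names = [Some("a"), None, Some("z")];
    let name_set: HashSet<String> = ["a", "b"].iter().map(|s| s.to_string()).collect();
    println!("{:?}", set_contains_str(&names, &name_set, false));
}
```

A generic function with trait bounds could play the same role as the macro.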

viirya

Looks good to me. The logic handling null looks correct to me. I have a few comments about method naming and macro/method.

viirya · 2022-07-03T17:29:29Z

datafusion/physical-expr/src/expressions/in_list.rs

-            return Ok(ColumnarValue::Array(Arc::new(
+macro_rules! collection_contains_check {
+    ($ARRAY:expr, $VALUES:expr, $NEGATED:expr, $CONTAINS_NULL:expr) => {{
+        let bool_array = if $NEGATED {


The two $NEGATED cases look very similar. Is it possible to combine them to save code?

e.g.

    if $CONTAINS_NULL {
        $ARRAY
            .iter()
            .map(|vop| match vop.map(|v| {
                if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
            }) {
                Some(true) if $NEGATED => None,
                Some(false) if !$NEGATED => None,
                x => x,
            })
            .collect::<BooleanArray>()
    } else {
        $ARRAY
            .iter()
            .map(|vop| vop.map(|v| {
                if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
            }))
            .collect::<BooleanArray>()
    }

Thanks @viirya, these two branches can be merged.

@viirya @alamb
With the merged version, every loop iteration checks $NEGATED and branches on it.
Can this still be vectorized?
I am not sure about this, so I will keep the current implementation for now.
We can keep discussing it until we reach a conclusion.

#2833 tracks this.

We have definitely seen cases where

    if $NEGATED {
        // loop
    } else {
        // loop
    }

was faster than

    loop {
        if $NEGATED {
            // ..
        } else {
            // ..
        }
    }

But I think the only way to find out if that applies in this case would be with a benchmark
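
The two shapes under discussion can be sketched side by side. This is a simplified illustration rather than the macro from the PR: plain slices and `Vec<Option<bool>>` stand in for Arrow arrays, the element type is fixed to `i32`, and the function names are hypothetical. It only shows that both shapes produce the same results; it says nothing about which is faster, which would need a benchmark.

```rust
use std::collections::HashSet;

/// Branch hoisted out of the loop: `negated` is tested once,
/// and each arm runs its own loop.
fn check_hoisted(
    array: &[Option<i32>],
    values: &HashSet<i32>,
    negated: bool,
    contains_null: bool,
) -> Vec<Option<bool>> {
    if negated {
        array
            .iter()
            .map(|vop| match vop.map(|v| !values.contains(&v)) {
                // NOT IN: "not found" is unknown when the list holds a NULL.
                Some(true) if contains_null => None,
                x => x,
            })
            .collect()
    } else {
        array
            .iter()
            .map(|vop| match vop.map(|v| values.contains(&v)) {
                // IN: "not found" is unknown when the list holds a NULL.
                Some(false) if contains_null => None,
                x => x,
            })
            .collect()
    }
}

/// Branches merged: a single loop that consults `negated` for every element.
fn check_merged(
    array: &[Option<i32>],
    values: &HashSet<i32>,
    negated: bool,
    contains_null: bool,
) -> Vec<Option<bool>> {
    array
        .iter()
        .map(|vop| match vop.map(|v| values.contains(&v) != negated) {
            // `r == negated` holds exactly when the value was not found.
            Some(r) if contains_null && r == negated => None,
            x => x,
        })
        .collect()
}

fn main() {
    let array = [Some(0), Some(2), None];
    let values: HashSet<i32> = [0, 3, 4].iter().copied().collect();
    for &negated in &[false, true] {
        for &contains_null in &[false, true] {
            let a = check_hoisted(&array, &values, negated, contains_null);
            let b = check_merged(&array, &values, negated, contains_null);
            assert_eq!(a, b);
            println!("negated={negated} contains_null={contains_null} -> {a:?}");
        }
    }
}
```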

inlist: remove check path for UTF8::(None) for NULL value

5cc4207

github-actions bot added the physical-expr Changes to the physical-expr crates label Jun 29, 2022

liukun4515 mentioned this pull request Jun 29, 2022

support data type coerced and decimal in INLIST expr #2755

Closed

7 tasks

liukun4515 changed the title ~~InList: fix bug for compare with Null in the list using the set optimization~~ InList: fix bug for comparing with Null in the list using the set optimization Jun 29, 2022

liukun4515 requested review from alamb, andygrove and viirya June 29, 2022 04:17

liukun4515 force-pushed the inlist_set_bug branch from d7541f3 to 56c79ca Compare June 30, 2022 03:15

alamb approved these changes Jul 2, 2022

alamb mentioned this pull request Jul 2, 2022

Clean up InList code to use functions rather than macros #2826

Closed

This was referenced Jul 3, 2022

Optimization InList: compare the float data type using OrderedFloat<T> #2831

Closed

InList: add more integration test for Inlist #2832

Open

liukun4515 added 3 commits July 3, 2022 11:15

fix bug: inlist set for null case

ea55d9f

Merge remote-tracking branch 'upstream/master' into inlist_set_bug

b4a258e

address comments

384277b

liukun4515 force-pushed the inlist_set_bug branch from 56c79ca to 384277b Compare July 3, 2022 03:39

viirya reviewed Jul 3, 2022

viirya approved these changes Jul 3, 2022

Merge remote-tracking branch 'upstream/master' into inlist_set_bug

f2730a6

viirya reviewed Jul 3, 2022

liukun4515 mentioned this pull request Jul 4, 2022

InList: merge check branch #2833

Closed

liukun4515 merged commit 57f47ab into apache:master Jul 4, 2022

liukun4515 deleted the inlist_set_bug branch July 4, 2022 13:21

jonmmease mentioned this pull request Jul 29, 2022

Update to DataFusion 10 vega/vegafusion#148

Merged
