InList: fix bug for comparing with Null in the list using the set optimization#2809
InList: fix bug for comparing with Null in the list using the set optimization#2809liukun4515 merged 5 commits intoapache:masterfrom
Conversation
| }) | ||
| .collect::<Vec<_>>(); | ||
|
|
||
| // TODO do we need to replace this below logic by `in_list_primitive`? |
There was a problem hiding this comment.
Do we need to replace below logic by the macro collection_contains_check?
There was a problem hiding this comment.
I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?
There was a problem hiding this comment.
I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?
Yes, I want to replace the logic in the in_list_primitive by the macro collection_contains_check.
I will replace the duplicated code in the in_list_primitive.
| .collect::<BooleanArray>(), | ||
| ))); | ||
| .map(|vop| { | ||
| match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) { |
There was a problem hiding this comment.
The f32 is not implemented the eq trait, we can't use the set::<f32>.contains
There was a problem hiding this comment.
We just use the set::<Scalarvalue::Float32/Float64>
There was a problem hiding this comment.
Elsewhere in DataFusion (including in ScalarValue) we use ordered_float to compare floating point numbers
It might be possible to use set::<OrderedFloat<f32>>, which would be more space efficient (fewer bytes than ScalarValue) as well as faster (as the comparison doens't have to dispatch on the type each time)
Codecov Report
@@ Coverage Diff @@
## master #2809 +/- ##
==========================================
+ Coverage 85.18% 85.27% +0.09%
==========================================
Files 275 275
Lines 48564 48675 +111
==========================================
+ Hits 41367 41508 +141
+ Misses 7197 7167 -30
Continue to review full report at Codecov.
|
|
I don't get it clearly what this is targeted to fix. The issue it closes is "support decimal in (NULL)", but by a quick look, seems the change is more than that. Although there is a few description, but I cannot get it from it too. Could you rephrase the issue this is going to fix? |
sorry for the confused description. I have described the issue in #2817 and the description of this pull request. PTAL @viirya |
d7541f3 to
56c79ca
Compare
alamb
left a comment
There was a problem hiding this comment.
Thank you @liukun4515 -- I reviewed the tests carefully and I think this looks good.
The code "feels" a little more complicated than needed but I think it is doing what it needs to and the test coverage is good.
I wonder if it is worth just 1 test in sql_integration somewhere that ensures this implementation is hooked up correctly rather than just relying on tests in in_list.rs
I think it would be good if @Ted-Jiang was also able to review this PR prior to merging it, but it isn't necessary.
| lit(ScalarValue::Boolean(Some(true))), | ||
| lit(ScalarValue::Boolean(None)), | ||
| ]; | ||
| for _ in 1..(OPTIMIZER_INSET_THRESHOLD + 1) { |
There was a problem hiding this comment.
I don't understand the choice of bounds of 1 ... OPTIMIZER_INSET_THRESHOLD + 1 -- why not 0..OPTIMIZER_INSET_THRESHOLD? Not that this way is wrong, I just don't understand it
There was a problem hiding this comment.
change to 0..OPTIMIZER_INSET_THRESHOLD
| let col_a = col("a", &schema)?; | ||
| let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?; | ||
|
|
||
| // expression: "a in (0,3,4....)" |
There was a problem hiding this comment.
| // expression: "a in (0,3,4....)" | |
| // expression: "a in (0,Null,3,4....)" |
| batch, | ||
| list.clone(), | ||
| &false, | ||
| vec![Some(true), None, None], |
| let col_a = col("a", &schema)?; | ||
| let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(a)])?; | ||
|
|
||
| // expression: "a in (0.0,3.0,4.0 ....)" |
There was a problem hiding this comment.
| // expression: "a in (0.0,3.0,4.0 ....)" | |
| // expression: "a in (0.0,Null,3.0,4.0 ....)" |
|
|
||
| Ok(()) | ||
| } | ||
|
|
There was a problem hiding this comment.
I really like the test coverage
| .collect::<BooleanArray>(), | ||
| ))); | ||
| .map(|vop| { | ||
| match vop.map(|v| !$SET_VALUES.contains(&v.try_into().unwrap())) { |
There was a problem hiding this comment.
Elsewhere in DataFusion (including in ScalarValue) we use ordered_float to compare floating point numbers
It might be possible to use set::<OrderedFloat<f32>>, which would be more space efficient (fewer bytes than ScalarValue) as well as faster (as the comparison doens't have to dispatch on the type each time)
| }) | ||
| .collect::<Vec<_>>(); | ||
|
|
||
| // TODO do we need to replace this below logic by `in_list_primitive`? |
There was a problem hiding this comment.
I reviewed the logic below and it looked correct to me -- are you asking about the trying to refactor to reduce duplication?
#2832 issue to track it |
56c79ca to
384277b
Compare
| collection_contains_check!(array, native_set, negated, contains_null) | ||
| } | ||
|
|
||
| fn set_contains_utf<OffsetSize: OffsetSizeTrait>( |
There was a problem hiding this comment.
| fn set_contains_utf<OffsetSize: OffsetSizeTrait>( | |
| fn make_set_contains_utf8<OffsetSize: OffsetSizeTrait>( |
| DataType::Boolean => { | ||
| let array = array.as_any().downcast_ref::<BooleanArray>().unwrap(); | ||
| set_contains_with_negated!(array, set, self.negated) | ||
| Ok(set_contains_for_primitive!( |
There was a problem hiding this comment.
I'm wondering why set_contains_for_primitive is macro, but set_contains_utf and make_set_contains_decimal are methods?
There was a problem hiding this comment.
set_contains_utf and make_set_contains_decimal are just apply to a specified data type case which are not compatible with other primitive case.
set_contains_for_primitive will apply to many similar case and same logic
viirya
left a comment
There was a problem hiding this comment.
Looks good to me. The logic handling null looks correct to me. I have a few comments about method naming and macro/method.
| return Ok(ColumnarValue::Array(Arc::new( | ||
| macro_rules! collection_contains_check { | ||
| ($ARRAY:expr, $VALUES:expr, $NEGATED:expr, $CONTAINS_NULL:expr) => {{ | ||
| let bool_array = if $NEGATED { |
There was a problem hiding this comment.
Two $NEGATED cases look very similar, is it possible to combine them to save code?
e.g.
if $CONTAINS_NULL {
$ARRAY
.iter()
.map(|vop| match vop.map(|v| {
if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
}) {
Some(true) if $NEGATED => None,
Some(false) if !$NEGATED => None,
x => x,
})
.collect::<BooleanArray>()
} else {
$ARRAY
.iter()
.map(|vop| vop.map(|v| {
if $NEGATED { !$VALUES.contains(&v) } else { $VALUES.contains(&v) }
})
.collect::<BooleanArray>()
}There was a problem hiding this comment.
Thanks @viirya
these two branches can be merged.
There was a problem hiding this comment.
We have definitely seen cases where
if $NEGATED {
// loop
} else {
// loop
}Was faster than
loop {
if $NEGATED {
// ..
} else {
// ..
}
}But I think the only way to find out if that applies in this case would be with a benchmark
Which issue does this PR close?
close #2817
In this pr: #2156, @Ted-Jiang add the set to optimize the Inlist.
But the implementation does not consider the
NULLcase forINorNOT INexpr.In the SQL system like Mysql or spark, NULL is not equal NULL and NULL can't be compared with other data type.
If
Acompare with NULL, the result must be NULL.Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?