Optimize the evaluation of `IN` for large lists using InSet by Ted-Jiang · Pull Request #2156 · apache/datafusion

Ted-Jiang · 2022-04-04T14:07:08Z

Which issue does this PR close?

Closes #2093.

Rationale for this change

@yjshen Thanks for your insight! ❤️
This PR optimizes the In_List clause when all filter values of the IN clause are static.
By default the list values are stored in a Vec, so a contains check has O(n) time complexity. In some situations a Set can be used instead, which gives O(1) lookups.
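
The difference described above can be illustrated with a minimal sketch using plain `i64` values. This is not the PR's code: the function names `contains_in_vec`, `contains_in_set` and `count_matches` are made up for illustration, and the real implementation works on Arrow arrays and ScalarValue rather than slices of integers.

```rust
use std::collections::HashSet;

/// Linear scan: up to n comparisons per probed value.
fn contains_in_vec(list: &[i64], value: i64) -> bool {
    list.iter().any(|v| *v == value)
}

/// Hash lookup: O(1) on average per probed value.
fn contains_in_set(set: &HashSet<i64>, value: i64) -> bool {
    set.contains(&value)
}

/// Count how many column values appear in the IN list, once per strategy.
/// The set is built a single time and reused for every row.
fn count_matches(column: &[i64], list: &[i64]) -> (usize, usize) {
    let set: HashSet<i64> = list.iter().copied().collect();
    let by_vec = column.iter().filter(|v| contains_in_vec(list, **v)).count();
    let by_set = column.iter().filter(|v| contains_in_set(&set, **v)).count();
    (by_vec, by_set)
}
```

Both strategies return the same count; only the cost per row differs, which is why the gap grows with the length of the list.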

Test SQL:

select count(*) from orders where o_orderkey in (2785313,
2785314,
2785315,
2785316,
... (1000 elements)
2786311);

Master branch:

2786309,
2786310,
2786311);
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 255             |
+-----------------+
1 row in set. Query took 4.713 seconds.

This PR:

2786309,
2786310,
2786311);
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 255             |
+-----------------+
1 row in set. Query took 0.566 seconds.

What changes are included in this PR?

Are there any user-facing changes?

Ted-Jiang · 2022-04-04T14:08:11Z

datafusion/physical-expr/src/expressions/in_list.rs

-            expr,
-            list,
-            negated,
+        if list.len() > OPTIMIZER_INSET_THRESHOLD && check_all_static_filter_expr(&list) {


Spark's default for this threshold is 400.

Because switch codegen is not supported here, I changed it to 10, as in Spark 2.x.
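
The guard in the diff above (list longer than the threshold and every entry static) could be sketched roughly as follows. The `ListExpr` enum, `all_static` and `maybe_build_set` are hypothetical stand-ins invented for this example; only the shape of the condition is taken from the diff, and the constant uses the value 10 mentioned in this comment.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an expression appearing in the IN list.
#[derive(Debug, Clone, PartialEq)]
enum ListExpr {
    Literal(i64),
    Column(String),
}

/// Lists longer than this are candidates for the set (value from the comment above).
const INSET_THRESHOLD: usize = 10;

/// True when every list entry is a literal, so the set can be built once up front.
fn all_static(list: &[ListExpr]) -> bool {
    list.iter().all(|e| matches!(e, ListExpr::Literal(_)))
}

/// Build a set only for long, fully static lists; otherwise return None
/// and leave evaluation to the element-by-element path.
fn maybe_build_set(list: &[ListExpr]) -> Option<HashSet<i64>> {
    if list.len() > INSET_THRESHOLD && all_static(list) {
        let set = list
            .iter()
            .filter_map(|e| match e {
                ListExpr::Literal(v) => Some(*v),
                ListExpr::Column(_) => None,
            })
            .collect();
        Some(set)
    } else {
        None
    }
}
```

A list containing a column reference cannot be turned into a set ahead of time, because its value changes per row, so the sketch falls back to `None` in that case.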

Dandandan · 2022-04-04T15:53:02Z

datafusion/physical-expr/src/expressions/in_list.rs

+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,


Just a note: because of using ScalarValue we are a bit slower than if we could use basic types, like HashSet<u32>, etc. ~~The same applies to the existing implementation based on Vec~~. I think that could be a couple of times faster.

But that is something for a future PR.

@Dandandan Thanks for the information ❤️. Is there any specific reason why using ScalarValue is slower?

I think it's because ScalarValue is an enum whose variants wrap the value in an Option, so it adds overhead in both computation and memory footprint compared to a HashSet of native values.

Filed #2165

In addition to higher memory usage and dispatching overhead, there are two extra sources of overhead:

1. Having to convert all values from array items to ScalarValue.
2. Hashing a ScalarValue is slower than hashing a native type.
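
A rough sketch of these overheads, assuming a much-reduced imitation of the enum: `MiniScalar` below is hypothetical and is not DataFusion's ScalarValue, which has many more variants. It only shows the wrap-then-hash step per row and the larger element size compared with a native `u32`.

```rust
use std::collections::HashSet;
use std::mem::size_of;

/// Hypothetical, much-reduced imitation of a ScalarValue-style enum:
/// each variant wraps an Option so that NULL can be represented.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum MiniScalar {
    UInt32(Option<u32>),
    Int64(Option<i64>),
    Utf8(Option<String>),
}

/// Probe path that first wraps every native value in the enum,
/// then hashes the enum (discriminant, Option tag and payload).
fn count_hits_scalar(column: &[u32], set: &HashSet<MiniScalar>) -> usize {
    column
        .iter()
        .filter(|v| set.contains(&MiniScalar::UInt32(Some(**v))))
        .count()
}

/// Probe path that hashes the native value directly.
fn count_hits_native(column: &[u32], set: &HashSet<u32>) -> usize {
    column.iter().filter(|v| set.contains(*v)).count()
}

/// Size in bytes of one set element for each representation.
fn element_sizes() -> (usize, usize) {
    (size_of::<MiniScalar>(), size_of::<u32>())
}
```

Both probe functions return the same answer; the enum path simply does more work per row and stores wider elements.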

alamb

Thank you @Ted-Jiang -- This is an (important) classic optimization.

I think the code is looking pretty good. The only thing I think this PR needs is some test(s) that exercise the new path.

alamb · 2022-04-05T20:03:21Z

datafusion/physical-expr/src/expressions/in_list.rs

+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,


Filed #2165

alamb · 2022-04-05T20:08:51Z

datafusion/physical-expr/src/expressions/in_list.rs

+        if let Some(in_set) = &self.set {
+            let array = match value {
+                ColumnarValue::Array(array) => array,
+                ColumnarValue::Scalar(scalar) => scalar.to_array(),


This is unfortunate -- turning a scalar into an array just to convert it back to a scalar if using InList

I wonder if we can pass the columnar_value to set_contains_with_negated and only do the conversion when using the Vec (not the Set)?

(This probably doesn't really make any sort of performance difference for any case we care about, I just noticed it and thought I would mention it.)

Thanks, @alamb! I agree it doesn't make any performance difference; matching ColumnarValue::Scalar is rare.
The same pattern also appears in https://github.com/apache/arrow-datafusion/blob/72a1194b9817df5ec7d87df6f5c3e45ed0e1ecd9/datafusion/physical-expr/src/expressions/in_list.rs#L517-L520.
Maybe we can file an issue to improve it.
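
The suggestion above (pass the columnar value through and probe the set directly for a scalar) might look something like this sketch. `Columnar` is a hypothetical stand-in for ColumnarValue holding plain `i64` data, the signature of `set_contains_with_negated` is invented for the example, and NULL handling is left out entirely.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for ColumnarValue: a whole column or a single value.
enum Columnar {
    Array(Vec<i64>),
    Scalar(i64),
}

/// Evaluate `value [NOT] IN set`, probing the set directly for a scalar
/// input rather than expanding it into a one-element array first.
fn set_contains_with_negated(value: &Columnar, set: &HashSet<i64>, negated: bool) -> Vec<bool> {
    match value {
        // One probe per row; `!= negated` flips the result for NOT IN.
        Columnar::Array(values) => values.iter().map(|v| set.contains(v) != negated).collect(),
        // A single probe, with no intermediate array.
        Columnar::Scalar(v) => vec![set.contains(v) != negated],
    }
}
```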

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Dandandan · 2022-04-07T07:12:16Z

datafusion/physical-expr/src/expressions/in_list.rs

+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;


I don't think we should follow Spark, but should use some benchmarking to find a good heuristic for this threshold (where it's faster to use a hash set).

The optimal value might be quite a bit higher at this moment.

Agreed! I will post a benchmark result. By the way, without subquery support implemented, is there an easy way to do this?

X: time cost. Y: number of filter values.

Blue: List; orange: Set.

I ran a benchmark locally using select count(*) from orders where o_orderkey in (x1, x2, ..., xn).
The Set line has a fixed gradient, while the List time cost increases with the number of parameters.
The two lines intersect between 10 and 20 (the same as Spark's setting of 10).
So I decided to set OPTIMIZER_INSET_THRESHOLD = 10, in line with Spark.
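
A crossover of this kind can also be explored outside the query engine with a small timing sketch like the one below. This is not the benchmark used in this thread: `time_in_list` is a made-up helper, it probes plain `u64` values rather than running SQL, and timings from `Instant` are noisy, so results should be treated as indicative only.

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

/// Time one membership probe per column value against a list of `n` values,
/// once with a linear scan and once with a hash set.
/// Returns (matches, list_time, set_time).
fn time_in_list(column: &[u64], n: u64) -> (usize, Duration, Duration) {
    let list: Vec<u64> = (0..n).collect();
    let set: HashSet<u64> = list.iter().copied().collect();

    let start = Instant::now();
    let list_hits = column.iter().filter(|v| list.contains(*v)).count();
    let list_time = start.elapsed();

    let start = Instant::now();
    let set_hits = column.iter().filter(|v| set.contains(*v)).count();
    let set_time = start.elapsed();

    assert_eq!(list_hits, set_hits); // both strategies must agree
    (set_hits, list_time, set_time)
}
```

Calling `time_in_list` for a range of `n` (for example 5, 10, 20, 50) in a release build and comparing the two durations gives a rough idea of where the lines cross on a given machine and data type.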

Cool analysis!
I am wondering if other data types make it a bit different, such as strings / utf8 arrays? I expect the conversion to be a bit slower there, because of the extra conversion/allocations needed.

I tried casting with 'u32.to_string' and reached the same conclusion. In my opinion, using strings may make hashing more expensive.

Cool, you are right! Thanks, @Dandandan! For utf8 columns the threshold is near 50. Maybe I will set it per type 👍
btw: where do the array -> ScalarValue copies / allocations happen in the code? 😂

Amazing, thanks for the comparison. I would also be OK with changing it to some value like 30. That seems good enough for both numeric and utf8 data.
At some point we could further optimize the implementation (#2165); after that we can adjust it to a (lower) value.

> where do the array -> ScalarValue copies / allocations happen in the code?

FYI, (String) allocations happen here, when converting to a ScalarValue. https://github.com/apache/arrow-datafusion/pull/2156/files#diff-ff8086fafbfe5021e5f7d51d96aaae2cf65f779ac3fae5fc182f87e956bb0550R214
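
To make the allocation cost concrete, here is a small sketch contrasting a probe that builds an owned `String` per row with one that borrows the `&str`. It is only an illustration with standard-library types: the function names are hypothetical and nothing here reflects how DataFusion's arrays or ScalarValue are actually laid out.

```rust
use std::collections::HashSet;

/// Probe by first building an owned String per row, imitating a
/// conversion into an owned scalar before the lookup.
fn count_hits_owned(column: &[&str], set: &HashSet<String>) -> usize {
    column
        .iter()
        .filter(|s| {
            let owned: String = (**s).to_owned(); // one String allocation per row
            set.contains(&owned)
        })
        .count()
}

/// Probe with the borrowed &str. HashSet<String> accepts a &str key
/// because String implements Borrow<str>, so no allocation is needed.
fn count_hits_borrowed(column: &[&str], set: &HashSet<String>) -> usize {
    column.iter().filter(|s| set.contains(**s)).count()
}
```

The two functions return the same count; the owned variant pays for a copy of every string it probes, which is consistent with the higher threshold observed for utf8 columns.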

❤️ @Dandandan Thanks a lot for your info !

yjshen · 2022-04-08T06:16:09Z

datafusion/physical-expr/src/expressions/in_list.rs

+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen by the benchmark at
+/// https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845198369
+/// TODO: add switch codeGen in In_List


Is this line of doc still valid?

Yes, I changed it to the discussion link in this PR.

alamb · 2022-04-08T18:36:40Z

Thanks everyone who contributed code and review to this PR 🎉

my-vegetable-has-exploded · 2023-12-03T14:52:49Z

datafusion/physical-expr/src/expressions/in_list.rs

+                write!(f, "{} NOT IN ({:?})", self.expr, self.list)
+            }
+        } else if self.set.is_some() {
+            write!(f, "Use {} IN (SET) ({:?})", self.expr, self.list)


Sorry to bother you, is the `Use` in "Use {} IN (SET) ({:?})" a typo? @Ted-Jiang

Ted-Jiang added 3 commits March 30, 2022 21:36

commit 1

8b38a54

Add an InSet as an optimized version for IN_LIST

26957e2

fix clippy

4772ee4

github-actions bot added the datafusion label Apr 4, 2022

Ted-Jiang commented Apr 4, 2022

Ted-Jiang added 2 commits April 4, 2022 22:48

fix ut

5306c32

fix fmt

e6399b1

Dandandan reviewed Apr 4, 2022

fix clippy

e705fc3

alamb mentioned this pull request Apr 5, 2022

Optimize InList implementation with native types rather than ScalarValue #2165

Closed

alamb reviewed Apr 5, 2022

Ted-Jiang and others added 2 commits April 7, 2022 12:57

make clear in explain

72a1194

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

add UT and change threshold

14fdc73

Dandandan reviewed Apr 7, 2022

Ted-Jiang closed this Apr 7, 2022

Ted-Jiang reopened this Apr 7, 2022

Ted-Jiang added 2 commits April 7, 2022 18:14

fix clippy

29404e8

change OPTIMIZER_INSET_THRESHOLD

9e0a58a

Dandandan approved these changes Apr 8, 2022

yjshen reviewed Apr 8, 2022

alamb changed the title ~~Add an InSet as an optimized version for IN_LIST~~ Optimize the evaluation of IN for large lists using InSet Apr 8, 2022

alamb merged commit dec9adc into apache:master Apr 8, 2022

This was referenced Jun 30, 2022

IN/NOT IN List: NULL is not equal to NULL #2817

Closed

InList: fix bug for comparing with Null in the list using the set optimization #2809

Merged

Not evaluate the set expr in the InList for the optimization #2820

Closed

my-vegetable-has-exploded reviewed Dec 3, 2023
