Skip to content

Optimize the evaluation of IN for large lists using InSet#2156

Merged
alamb merged 10 commits intoapache:masterfrom
Ted-Jiang:ISSUE_2093
Apr 8, 2022
Merged

Optimize the evaluation of IN for large lists using InSet#2156
alamb merged 10 commits intoapache:masterfrom
Ted-Jiang:ISSUE_2093

Conversation

@Ted-Jiang
Copy link
Member

@Ted-Jiang Ted-Jiang commented Apr 4, 2022

Which issue does this PR close?

Closes #2093.

Rationale for this change

@yjshen Thanks for your insight! ❤️
Optimized of In_List clause, when all filter values of In clause are static.
​​Default list values use Vec to store, it ​ has time complexity O(n) to check contains, In some situation use Set it has complexity O(1)

test sql:

select count(*) from orders where o_orderkey in (2785313,
2785314,
2785315,
2785316,
''' (1000 elements)
2786311);

Master branch:

2786309,
2786310,
2786311);
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 255             |
+-----------------+
1 row in set. Query took 4.713 seconds.

This pr:

2786309,
2786310,
2786311);
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 255             |
+-----------------+
1 row in set. Query took 0.566 seconds.

What changes are included in this PR?

Are there any user-facing changes?

expr,
list,
negated,
if list.len() > OPTIMIZER_INSET_THRESHOLD && check_all_static_filter_expr(&list) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to Spark, default set 400.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to not support switch codeGen change to 10, like spark 2.x

/// InSet
#[derive(Debug)]
pub struct InSet {
set: HashSet<ScalarValue>,
Copy link
Contributor

@Dandandan Dandandan Apr 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: because of using ScalarValue we are a bit slower than if we could use basic types, like HashSet<u32>, etc. The same apllies to the existing implementation based on Vec. I think that could be a couple times faster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this makes something for a future PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan Thanks for your information❤️, Is there any specific reason why using ScalarValue is slower?

Copy link
Member

@yjshen yjshen Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's because ScalarValue is an enum of an option wrapper of value. So it would be overheads for both computation and memory footprint compared to HashSet of native data values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #2165

Copy link
Contributor

@Dandandan Dandandan Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to higher memory usage and dispatching overhead there are two extra sources of overhead

  • Having to convert all values from array items to ScalarValue
  • Hashing a Scalarvalue is slower than hashing a native type.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Ted-Jiang -- This is a (important) classic optimization.

I think the code is looking pretty good. The only thing I think this PR needs is some test(s) that exercise the new path.

/// InSet
#[derive(Debug)]
pub struct InSet {
set: HashSet<ScalarValue>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #2165

if let Some(in_set) = &self.set {
let array = match value {
ColumnarValue::Array(array) => array,
ColumnarValue::Scalar(scalar) => scalar.to_array(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unfortunate -- turning a scalar into an array just to convert it back to a scalar if using InList

I wonder if we can pass the columnar_value to set_contains_with_negated and only do the conversion when using the Vec (not the Set)?

This probably doesn't really make any sort of performance difference for any case we care about, I just noticed it and thought I would mention it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! @alamb i agree it doesn't make any performance, it's rare to match ColumnarValue::Scalar
it also appears in https://github.com/apache/arrow-datafusion/blob/72a1194b9817df5ec7d87df6f5c3e45ed0e1ecd9/datafusion/physical-expr/src/expressions/in_list.rs#L517-L520.
Maybe we can file an issue to improve it.

Ted-Jiang and others added 2 commits April 7, 2022 12:57
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
/// Value chosen to be consistent with Spark
/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
/// TODO: add switch codeGen in In_List
static OPTIMIZER_INSET_THRESHOLD: usize = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should follow Spark, but should use some benchmarking to find a good heuristic for this threshold (where it's faster to use a hash set).

The optimal value might be quite a bit higher at this moment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree! i will post a bench result. By the way, without subquery implement, Is there some way easy to do?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

X: cost time. Y: filter value numbers

Blue: is use List, orange: is use Set.

I did a benchMark in my local, use select count(*) from orders where o_orderkey in (x1, x2, ..., xn)
Obviously, Set has a fixed gradient, List cost time increases with the parameter number.
The intersection of the two lines is located is between 10~20 (same as Spark set 10).
So, i decided set OPTIMIZER_INSET_THRESHOLD = 10 align with spark.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool analysis!
I am wondering if other data types make it a bit different, such as strings / utf8 arrays? I expect the conversion to be a bit slower there, because of the extra conversion/allocations needed.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have try to use cast 'u32.to_string' get same conclusion. In my opinion, use string may cause more cost in hash.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Cool, you are right! @Dandandan Thanks! utf8 columns threshold near 50. Maybe i will Set it by type👍

btw: where array -> ScalarValue copies / allocates happened in code😂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks for the comparison. I would also be OK with changing it to some value like 30. Seems for both numeric as for utf8 data a good enough.
At some point we could further optimize the implementation (#2165) after this we can adjust to a (lower) value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@Dandandan Dandandan Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where array -> ScalarValue copies / allocates happened in code

FYI, (String) allocations happen here, when converting to a ScalarValue. https://github.com/apache/arrow-datafusion/pull/2156/files#diff-ff8086fafbfe5021e5f7d51d96aaae2cf65f779ac3fae5fc182f87e956bb0550R214

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ @Dandandan Thanks a lot for your info !

@Ted-Jiang Ted-Jiang closed this Apr 7, 2022
@Ted-Jiang Ted-Jiang reopened this Apr 7, 2022
/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
/// Value chosen by the benchmark at
/// https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845198369
/// TODO: add switch codeGen in In_List
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this line of doc still valid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes change to discuss link in this pr.

@alamb alamb changed the title Add an InSet as an optimized version for IN_LIST Optimize the evaluation of IN for large lists using InSet Apr 8, 2022
@alamb
Copy link
Contributor

alamb commented Apr 8, 2022

Thanks everyone who contributed code and review to this PR 🎉

write!(f, "{} NOT IN ({:?})", self.expr, self.list)
}
} else if self.set.is_some() {
write!(f, "Use {} IN (SET) ({:?})", self.expr, self.list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to bother you, is use in the "Use {} IN (SET) ({:?})" a typo? @Ted-Jiang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an InSet function as an optimized version for IN

5 participants