Merged
42 changes: 19 additions & 23 deletions datafusion/functions/src/string/concat.rs
@@ -88,37 +88,33 @@ impl ScalarUDFImpl for ConcatFunc {
&self.signature
}

/// Match the return type to the input types to avoid unnecessary casts. On
/// mixed inputs, prefer Utf8View; prefer LargeUtf8 over Utf8 to avoid
/// potential overflow on LargeUtf8 input.
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;
let mut dt = &Utf8;
arg_types.iter().for_each(|data_type| {
if data_type == &Utf8View {
dt = data_type;
}
if data_type == &LargeUtf8 && dt != &Utf8View {
dt = data_type;
}
});

Ok(dt.to_owned())
if arg_types.contains(&Utf8View) {
Ok(Utf8View)
} else if arg_types.contains(&LargeUtf8) {
Contributor:

I had a thought about this. I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32) https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.LargeUtf8

Contributor (Author):

Interesting point. I believe the typical precedence today is Utf8View > LargeUtf8 > Utf8, partly on the grounds that "StringArray to StringViewArray is cheap but not vice versa". I can see arguments for both sides; if we want to reconsider this, seems like a distinct issue?

> /// Coercion rules for string view types (Utf8/LargeUtf8/Utf8View):
> /// If at least one argument is a string view, we coerce to string view
> /// based on the observation that StringArray to StringViewArray is cheap but not vice versa.
> ///
> /// Between Utf8 and LargeUtf8, we coerce to LargeUtf8.

Contributor:

Likely it should be another issue as it likely occurs in a few places. I am fairly certain I am correct on the proper type ordering here but in the wild I doubt it would be encountered much - just how many columns would have > 2 billion bytes?

Contributor:

Also, this is really only pertinent in areas that pick a return type based on multiple columns. For the typical case where the udf is operating on a single column the existing logic should be fine - such as in btrim

Contributor:

> I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32)

I think the only type of data that can't be stored in a Utf8View that a LargeUtf8 can handle is individual strings that are longer than 2GB

Otherwise, data from a LargeUtf8 will work just fine in Utf8View (the view will have multiple buffers rather than one large one)

Contributor:

> I think LargeUtf8 should take precedence over Utf8View because you cannot necessarily fit data from a LargeUtf8 column into Utf8View (i64 vs i32)
>
> I think the only type of data that can't be stored in a Utf8View that a LargeUtf8 can handle is individual strings that are longer than 2GB
>
> Otherwise, data from a LargeUtf8 will work just fine in Utf8View (the view will have multiple buffers rather than one large one)

Indeed, that was the point I was trying to get across. It's a rare ... but possible. Though honestly I expect DF would fail somewhere else pretty quickly if a column with data that big was ever encountered.
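The size argument in this thread can be made concrete. Per the comments above, a Utf8View view records an individual value's length in a 32-bit field, while LargeUtf8 uses i64 offsets, so only a single string longer than about 2 GB distinguishes the two. A minimal sketch with stand-in helper functions (illustrative only, not arrow-rs APIs):

```rust
/// Stand-in: LargeUtf8 tracks value boundaries with i64 offsets, so a
/// single value can span (almost) the full i64 range.
fn fits_large_utf8(len: u64) -> bool {
    len <= i64::MAX as u64
}

/// Stand-in: a Utf8View view stores each value's length in a 32-bit
/// field, so one value is capped near i32::MAX bytes (~2 GB), even
/// though the array as a whole can reference many data buffers.
fn fits_utf8_view(len: u64) -> bool {
    len <= i32::MAX as u64
}

fn main() {
    // A single 3 GiB string: representable as LargeUtf8, not as a view.
    let three_gib: u64 = 3 * 1024 * 1024 * 1024;
    assert!(fits_large_utf8(three_gib));
    assert!(!fits_utf8_view(three_gib));

    // Typical strings fit in both, which is why this edge case is rare.
    assert!(fits_utf8_view(1024) && fits_large_utf8(1024));
}
```

This only models per-value limits; total column size is not the constraint, since a view array can spread its data across multiple buffers.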

Ok(LargeUtf8)
} else {
Ok(Utf8)
}
}

/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
/// concat('abcde', 2, NULL, 22) = 'abcde222'
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
let ScalarFunctionArgs { args, .. } = args;

let mut return_datatype = DataType::Utf8;
args.iter().for_each(|col| {
if col.data_type() == DataType::Utf8View {
return_datatype = col.data_type();
}
if col.data_type() == DataType::LargeUtf8
&& return_datatype != DataType::Utf8View
{
return_datatype = col.data_type();
}
});
let return_datatype = if args.iter().any(|c| c.data_type() == DataType::Utf8View)
{
DataType::Utf8View
} else if args.iter().any(|c| c.data_type() == DataType::LargeUtf8) {
DataType::LargeUtf8
} else {
DataType::Utf8
};

let array_len = args.iter().find_map(|x| match x {
ColumnarValue::Array(array) => Some(array.len()),
@@ -247,7 +243,7 @@ impl ScalarUDFImpl for ConcatFunc {
builder.append_offset();
}

let string_array = builder.finish();
let string_array = builder.finish(None);
Ok(ColumnarValue::Array(Arc::new(string_array)))
}
DataType::LargeUtf8 => {
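The precedence this diff settles on (Utf8View over LargeUtf8 over Utf8) can be restated as a small standalone function. A sketch using a stand-in enum rather than arrow's `DataType`, mirroring the simplified `return_type` body from the diff:

```rust
// Stand-in for arrow's DataType, narrowed to the string variants the PR touches.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Any Utf8View argument wins; otherwise any LargeUtf8 argument wins;
/// otherwise fall back to Utf8. This matches the `contains`-based logic
/// that replaced the old mutable-accumulator loop.
fn concat_return_type(arg_types: &[StrType]) -> StrType {
    if arg_types.contains(&StrType::Utf8View) {
        StrType::Utf8View
    } else if arg_types.contains(&StrType::LargeUtf8) {
        StrType::LargeUtf8
    } else {
        StrType::Utf8
    }
}

fn main() {
    use StrType::*;
    assert_eq!(concat_return_type(&[Utf8, Utf8View, LargeUtf8]), Utf8View);
    assert_eq!(concat_return_type(&[Utf8, LargeUtf8]), LargeUtf8);
    assert_eq!(concat_return_type(&[Utf8, Utf8]), Utf8);
    assert_eq!(concat_return_type(&[]), Utf8);
}
```

Note that order-independence is the point of the rewrite: the old loop and the new `contains` checks agree on every input, but the new form states the precedence directly.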