Conversation
|
This PR solves for:
Decisions taken:
What is not in the scope of this PR?
|
|
@Fokko This PR is ready for review |
| ) | ||
|
|
||
|
|
||
| def get_sort_indices_arrow_table(arrow_table: pa.Table, sort_seq: List[Tuple[str, PyArrowSortOptions]]) -> List[int]: |
There was a problem hiding this comment.
Just wanted to clarify on the separate implementation for sort_indices other than the one provided by pyarrow.
This is because pyarrow sort_indices or Sort Options only supports one order for null placement across keys.
More details here:
- https://arrow.apache.org/docs/python/generated/pyarrow.compute.sort_indices.html
- https://arrow.apache.org/docs/python/generated/pyarrow.compute.SortOptions.html
While, the iceberg spec doesn't discriminate of having different null ordering across keys: https://iceberg.apache.org/spec/#sort-orders
This function specifically helps to implement the above functionality by sorting across keys and utilizing the stable nature of the sort_indices algo from pyarrow.
We can raise another issue to improve the performance of this function.
In future, if pyarrow sort_indices does support different null ordering across, we can mark this function as obsolete and keep the implementation clean in the iceberg table append and overwrite methods.
Fixes: #271