unpinning dask #1006
Conversation
Codecov Report

@@ Coverage Diff @@
##             main    #1006      +/-   ##
==========================================
+ Coverage   92.08%   92.11%   +0.02%
==========================================
  Files          48       49       +1
  Lines        7446     7537      +91
==========================================
+ Hits         6857     6943      +86
- Misses        589      594       +5
ilan-gold
left a comment
Just small comments; it was tough to comment on the usage, so I tried to focus on the new code. Happy to re-review!
# We have to do this because as_known() does not preserve the order anymore in latest dask versions.
# TODO: discuss whether we can always expect the index from before to be monotonically increasing,
# because then we don't have to check order.
if index:
    data[VALUES_COLUMN] = data[VALUES_COLUMN].cat.set_categories(data.index, ordered=True)
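For context, the effect of re-imposing an explicit, ordered category order from the index can be sketched in plain pandas. This is a minimal illustration on made-up data, not the project's code:

```python
import pandas as pd

# pandas stores categories sorted by default, so the original row order
# ('b', 'a', 'c') is lost from the categories. set_categories with
# ordered=True re-imposes an explicit ordering taken from the index.
s = pd.Series(pd.Categorical(["b", "a", "c"]), index=["b", "a", "c"])
print(list(s.cat.categories))  # ['a', 'b', 'c'] (sorted, order lost)

reordered = s.cat.set_categories(s.index, ordered=True)
print(list(reordered.cat.categories))  # ['b', 'a', 'c']
print(reordered.cat.ordered)  # True
```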
Do you think that we could report this to dask? Maybe it is an unintended change. Or was it more that the order was never guaranteed in the first place?
I will discuss it during their community meeting. I would have to dive a bit deeper into the exact cause, but they don't seem to define set_categories themselves, so it appears to come from the pandas DataFrame; however, the pandas DataFrame only operates per partition. I am not certain about that, and I did not want to spend too much time on it for now, since they can point me in the right direction much more quickly.
@LucaMarconato I think everything is in order here now. Do you see remaining blockers?
Thanks @melonora, I'll review the changes.
Action point for a new PR is to ensure that the indices of dask dataframes are monotonically increasing across multiple partitions. Right now this may not be the case; as a result, partitions may overlap, which creates problems when evaluating the Dask graph. However, this will be a separate PR.
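The invariant described above can be sketched with plain pandas: if the concatenated index across partitions is monotonically increasing, partition boundaries cannot overlap. A minimal illustration on invented data, not the project's code:

```python
import pandas as pd

# Two "partitions" whose combined index is monotonically increasing:
# their index ranges cannot overlap.
partitions = [
    pd.DataFrame({"x": [1, 2]}, index=[0, 1]),
    pd.DataFrame({"x": [3, 4]}, index=[2, 3]),
]
full_index = pd.concat(partitions).index
print(full_index.is_monotonic_increasing)  # True

# Here the first partition's index (0, 5) spills past the second's (2, 3),
# so the combined index is no longer monotonic.
overlapping = [
    pd.DataFrame({"x": [1, 2]}, index=[0, 5]),
    pd.DataFrame({"x": [3, 4]}, index=[2, 3]),
]
print(pd.concat(overlapping).index.is_monotonic_increasing)  # False
```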
This PR unpins dask by adding an `.attrs` accessor for both dask Series and DataFrame. It functions essentially as a drop-in replacement for the `.attrs` attribute in previous Dask versions. This PR is a prerequisite for unlocking the zarr v3 sharding API. It will also unlock spatialdata + rapids-singlecell in the same environment.
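As an aside, the accessor pattern itself can be sketched with pandas' documented extension API. This illustrates the mechanism only, not the PR's dask implementation; the `meta` accessor name is hypothetical:

```python
import pandas as pd
from pandas.api.extensions import register_dataframe_accessor

# Registering an accessor attaches a namespace object to every DataFrame;
# here it simply delegates to the DataFrame's own attrs dict.
@register_dataframe_accessor("meta")
class MetaAccessor:
    def __init__(self, df):
        self._df = df

    @property
    def attrs(self):
        # Delegate to the underlying pandas attrs dict.
        return self._df.attrs

df = pd.DataFrame({"x": [1, 2]})
df.attrs["source"] = "example"
print(df.meta.attrs)  # {'source': 'example'}
```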
Note regarding supported Dask versions: 2025.2.0 onwards dropped all Dask legacy support. Furthermore, after extensive testing, I saw no way around also dropping support for version 2025.1.0, as a couple of our operations cause dependency issues in the Dask graphs. With 2025.2.0 onwards, users get an error message that we can resolve, whereas previously the computation would hang.
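The resulting constraint (2025.1.0 excluded, 2025.2.0 onwards supported) can be sketched as a simple version gate. A pure-Python illustration under assumed version strings, not the project's actual pin:

```python
# Hypothetical helper, not project code: compare calendar-version strings
# as integer tuples against the minimum supported dask release.
MIN_DASK = (2025, 2, 0)

def supported(version: str) -> bool:
    return tuple(int(p) for p in version.split(".")) >= MIN_DASK

print(supported("2025.1.0"))   # False
print(supported("2025.2.0"))   # True
print(supported("2025.10.0"))  # True
```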
For now, some minor computations are done in memory due to this mixed graph dependency problem. I can reconstruct the proper graph so that computation is completely lazy again, but that is work for a follow-up PR.
Lastly, for writing, `optimize_graph` has been set to `False`. Not doing this causes 'permission denied' errors on Windows, particularly when performing atomic writes, where partial files must be renamed once writing is completed. In general, there were some differences between the various OSs, so I enabled testing for both the lowest Dask version we support and the latest, and I now also include Windows in the CI.

The next follow-up will be to open a PR to Dask that enables sharding support there. Once that is completed, we can add sharding support in SpatialData.
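The atomic-write step that trips over on Windows can be sketched generically. This is an illustration of the write-then-rename pattern, not the project's writer; `atomic_write` is a hypothetical helper:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # Write to a temp file in the target's directory, then rename it into
    # place so readers never observe a partially written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace is atomic on POSIX; on Windows it raises a permission
        # error if another handle still holds the target or temp file open,
        # which is why overlapping tasks touching the same file are a problem.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "out.bin")
    atomic_write(target, b"payload")
    print(open(target, "rb").read())  # b'payload'
```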