Skip to content

Conversation

@melonora
Copy link
Collaborator

@melonora melonora commented Oct 27, 2025

This PR unpins dask by adding a .attrs accessor for both dask Series and DataFrame. It pretty much functions as a drop-in replacement for the .attrs attribute in previous Dask versions. This PR is a prerequisite for unlocking zarr v3 sharding api.
This will also unlock spatialdata+rapids-singlecell in the same environment.

Note regarding Dask versions supported: 2025.2.0 onwards dropped all Dask legacy support. Furthermore, after extensive testing, I saw no other way than not to support version 2025.1.0 as we have a couple of operations that cause dependency issues in the Dask graphs. Using 2025.2.0 onwards, this will provide users with an error message that we can resolve, whereas previously, the computation would hang.

For now, some minor computations would be done in memory due to this mixed graph dependency problem. I can reconstruct the proper graph so computation can be completely lazy again, but this would work for a follow-up PR.
Lastly, for writing, optimize_graph has been set to False. Not doing this causes 'permission denied' errors on Windows, particularly when performing atomic writes, where partial files must be renamed once writing is completed. In general, there were some differences between the various OSs, so I enabled testing for the lowest version of Dask we support and the latest version. I also now include Windows in the CI.

The next follow-up work will be to open a PR to Dask that enables sharding support there. Once that is completed, we can create the sharding support in SpatialData.

@codecov
Copy link

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 92.37288% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.11%. Comparing base (79cf8c9) to head (72121d3).
⚠️ Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
src/spatialdata/models/_accessor.py 90.00% 7 Missing ⚠️
src/spatialdata/_core/query/spatial_query.py 88.88% 1 Missing ⚠️
src/spatialdata/_io/_utils.py 95.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1006      +/-   ##
==========================================
+ Coverage   92.08%   92.11%   +0.02%     
==========================================
  Files          48       49       +1     
  Lines        7446     7537      +91     
==========================================
+ Hits         6857     6943      +86     
- Misses        589      594       +5     
Files with missing lines Coverage Δ
src/spatialdata/__init__.py 100.00% <100.00%> (+3.57%) ⬆️
src/spatialdata/_core/_deepcopy.py 98.38% <100.00%> (+0.02%) ⬆️
src/spatialdata/_core/operations/rasterize.py 90.16% <100.00%> (+0.13%) ⬆️
src/spatialdata/_core/operations/transform.py 91.22% <100.00%> (+0.03%) ⬆️
src/spatialdata/_core/spatialdata.py 91.93% <100.00%> (-0.01%) ⬇️
src/spatialdata/_io/io_raster.py 93.93% <100.00%> (ø)
src/spatialdata/datasets.py 100.00% <100.00%> (ø)
src/spatialdata/models/models.py 88.57% <ø> (-0.05%) ⬇️
src/spatialdata/transformations/_utils.py 94.77% <ø> (-0.04%) ⬇️
src/spatialdata/_core/query/spatial_query.py 95.46% <88.88%> (-0.19%) ⬇️
... and 2 more

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@melonora melonora mentioned this pull request Oct 30, 2025
@melonora melonora requested a review from ilan-gold October 30, 2025 09:11
Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small comments, tough to comment on the usage so I tried to focus on the new code. Happy to rereview!

Comment on lines +671 to +675
# We have to do this because as_known() does not preserve the order anymore in latest dask versions
# TODO discuss whether we can always expect the index from before to be monotonically increasing, because
# then we don't have to check order.
if index:
data[VALUES_COLUMN] = data[VALUES_COLUMN].cat.set_categories(data.index, ordered=True)
Copy link
Member

@LucaMarconato LucaMarconato Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that we could report this to dask? Maybe it is an unintended change. Or was it more that the order was never guaranteed in the first place?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will discuss it during their community meeting. I would have to dive a bit deeper into the exact cause, but they themselves don't seem to define set_categories so to me it seems like it comes from pandas dataframe but then the pandas dataframe only works per partition, but I am not certain about that. I did not want to spend too much time on it for now though as they can point me in the right direction much quicker.

@melonora
Copy link
Collaborator Author

@LucaMarconato think everything is in order here now. Do you see remaining blockers?

@LucaMarconato
Copy link
Member

Thanks @melonora, I'll review the changes.

@melonora
Copy link
Collaborator Author

Action point for new PR is to ensure that the indices for dask dataframes are monotonically increasing across multiple partitions. Right now, it may not be the case, and as a result partitions may overlap, which creates problems when evaluating the Dask graph. However, this will be a separate PR.

@melonora melonora merged commit 53b9438 into scverse:main Nov 22, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants