Merged
64 commits
dd38ad4
add attrs accesor
melonora Oct 23, 2025
4feb491
change deprecated Index access
melonora Oct 23, 2025
1c042ea
add accessor to init
melonora Oct 23, 2025
b733de2
remove query planning
melonora Oct 23, 2025
e53e215
additional changes to accessor
melonora Oct 25, 2025
3ae1d29
divisions is not settable anymore
melonora Oct 25, 2025
19684fb
add fixes
melonora Oct 27, 2025
51b733e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 27, 2025
88fe003
fix rasterize points
melonora Oct 28, 2025
d8b2cc4
fix rasterize points
melonora Oct 28, 2025
239f693
copy partitioned attrs
melonora Oct 28, 2025
afad6bd
fix mypy
melonora Oct 29, 2025
e017ca7
fix last mypy error
melonora Oct 29, 2025
d11655a
Apply suggestion from @melonora
melonora Oct 30, 2025
8253eb8
Apply suggestion from @melonora
melonora Oct 30, 2025
078469a
Apply suggestion from @melonora
melonora Oct 30, 2025
65839b4
deduplicate
melonora Oct 30, 2025
1a7bfbf
deduplicate
melonora Oct 30, 2025
a7a6018
Merge branch 'main' into dataframe_accessor
LucaMarconato Oct 31, 2025
f1fb487
.attrs is now always an accessor, never a dict
LucaMarconato Nov 2, 2025
d7d0b4d
simplify wrapper logic:
LucaMarconato Nov 2, 2025
66a6095
revert after loc/iloc indexer
melonora Nov 2, 2025
2c81509
clean-up, simplify accessor logic
LucaMarconato Nov 2, 2025
47670da
remove asserts
melonora Nov 2, 2025
8b18989
Merge branch 'dataframe_accessor' of https://github.com/melonora/spat…
LucaMarconato Nov 2, 2025
e3c8bc8
remove asserts
melonora Nov 2, 2025
868a5a2
remove asserts
melonora Nov 2, 2025
d43753c
simplify accessor logic by reducing number of classes
LucaMarconato Nov 2, 2025
f7e0caa
Merge branch 'dataframe_accessor' of https://github.com/melonora/spat…
LucaMarconato Nov 2, 2025
e0ab1d8
rename wrap_with_attrs
LucaMarconato Nov 2, 2025
4257183
remove comment
melonora Nov 2, 2025
b765da2
remove comment
melonora Nov 2, 2025
e49a580
wrapping methods for dd.Series
LucaMarconato Nov 2, 2025
48b38b9
Merge branch 'dataframe_accessor' of https://github.com/melonora/spat…
LucaMarconato Nov 2, 2025
ed6b457
add dask tests for accessor
LucaMarconato Nov 2, 2025
9763016
fix index.compute() attrs missing
LucaMarconato Nov 2, 2025
0415c89
change fix .attrs on index
melonora Nov 2, 2025
1541265
change fix .attrs on index
melonora Nov 2, 2025
dc40fde
wrap dd.Series.loc
LucaMarconato Nov 2, 2025
b5206aa
Merge branch 'dataframe_accessor' of https://github.com/melonora/spat…
LucaMarconato Nov 2, 2025
3770407
remove old code, add comments
melonora Nov 2, 2025
e2905e5
remove old code, add comments
melonora Nov 2, 2025
f46e225
move accesor code
melonora Nov 2, 2025
f76672e
change git workflow
melonora Nov 2, 2025
f68d55d
some fixes
melonora Nov 2, 2025
9f26549
remove old test code
melonora Nov 2, 2025
39dc8fb
test dask among os
melonora Nov 3, 2025
1a83439
fix
melonora Nov 3, 2025
18fdb70
fix
melonora Nov 3, 2025
d43cac1
fix
melonora Nov 3, 2025
a78c680
revert changes
melonora Nov 3, 2025
8d5251b
fix
melonora Nov 3, 2025
b9a228a
adjust
melonora Nov 3, 2025
7efabfe
adjust dask pin
melonora Nov 3, 2025
3ed65bd
adjust dask pin
melonora Nov 3, 2025
1824296
fix dask backing files and windows permissions
melonora Nov 3, 2025
42c2452
fix dask mixed graph problem
melonora Nov 3, 2025
93b48be
temporary fix indexing
melonora Nov 3, 2025
1813c84
fix rasterize
melonora Nov 4, 2025
50374bb
adjust github workflow
melonora Nov 4, 2025
fafede5
move 3.13 to include
melonora Nov 4, 2025
a06302d
make more concise
melonora Nov 4, 2025
990891a
Apply suggestion from @melonora
melonora Nov 4, 2025
72121d3
fix str representation
melonora Nov 22, 2025
34 changes: 20 additions & 14 deletions .github/workflows/test.yaml
@@ -4,7 +4,7 @@ on:
push:
branches: [main]
tags:
- "v*" # Push events to matching v*, i.e. v1.0, v20.15.10
- "v*"
pull_request:
branches: "*"

@@ -13,26 +13,24 @@ jobs:
runs-on: ${{ matrix.os }}
defaults:
run:
shell: bash -e {0} # -e to fail on error
shell: bash -e {0}

strategy:
fail-fast: false
matrix:
python: ["3.11", "3.13"]
os: [ubuntu-latest]
include:
- os: macos-latest
python: "3.11"
- os: macos-latest
python: "3.12"
pip-flags: "--pre"
name: "Python 3.12 (pre-release)"
- os: windows-latest
python: "3.11"

- {os: windows-latest, python: "3.11", dask-version: "2025.2.0", name: "Dask 2025.2.0"}
- {os: windows-latest, python: "3.11", dask-version: "latest", name: "Dask latest"}
- {os: ubuntu-latest, python: "3.11", dask-version: "2025.2.0", name: "Dask 2025.2.0"}
- {os: ubuntu-latest, python: "3.11", dask-version: "latest", name: "Dask latest"}
- {os: ubuntu-latest, python: "3.13", dask-version: "latest", name: "Dask latest"}
- {os: macos-latest, python: "3.11", dask-version: "2025.2.0", name: "Dask 2025.2.0"}
- {os: macos-latest, python: "3.11", dask-version: "latest", name: "Dask latest"}
- {os: macos-latest, python: "3.12", pip-flags: "--pre", name: "Python 3.12 (pre-release)"}
env:
OS: ${{ matrix.os }}
PYTHON: ${{ matrix.python }}
DASK_VERSION: ${{ matrix.dask-version }}

steps:
- uses: actions/checkout@v2
@@ -42,7 +40,15 @@ jobs:
version: "latest"
python-version: ${{ matrix.python }}
- name: Install dependencies
run: "uv sync --extra test"
run: |
uv sync --extra test
if [[ -n "${DASK_VERSION}" ]]; then
if [[ "${DASK_VERSION}" == "latest" ]]; then
uv pip install --upgrade dask
else
uv pip install dask==${DASK_VERSION}
fi
fi
- name: Test
env:
MPLBACKEND: agg
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -12,6 +12,7 @@ repos:
rev: v3.5.3
hooks:
- id: prettier
exclude: ^.github/workflows/test.yaml
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.15.0
hooks:
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -25,7 +25,7 @@ dependencies = [
"anndata>=0.9.1",
"click",
"dask-image",
"dask>=2024.10.0,<=2024.11.2",
"dask>=2025.2.0",
"datashader",
"fsspec[s3,http]",
"geopandas>=0.14",
17 changes: 2 additions & 15 deletions src/spatialdata/__init__.py
@@ -1,20 +1,7 @@
import dask

dask.config.set({"dataframe.query-planning": False})
import dask.dataframe as dd

# Setting `dataframe.query-planning` to False is effective only if run before `dask.dataframe` is initialized. In
# the case in which the user had initilized `dask.dataframe` before, we would have DASK_EXPER_ENABLED set to `True`.
# Here we check that this does not happen.
if hasattr(dd, "DASK_EXPR_ENABLED") and dd.DASK_EXPR_ENABLED:
raise RuntimeError(
"Unsupported backend: dask-expr has been detected as the backend of dask.dataframe. Please "
"use:\nimport dask\ndask.config.set({'dataframe.query-planning': False})\nbefore importing "
"dask.dataframe to disable dask-expr. The support is being worked on, for more information please see"
"https://github.com/scverse/spatialdata/pull/570"
)
from importlib.metadata import version

import spatialdata.models._accessor # noqa: F401

__version__ = version("spatialdata")

__all__ = [
7 changes: 5 additions & 2 deletions src/spatialdata/_core/_deepcopy.py
@@ -94,9 +94,12 @@ def _(gdf: GeoDataFrame) -> GeoDataFrame:
@deepcopy.register(DaskDataFrame)
def _(df: DaskDataFrame) -> DaskDataFrame:
# bug: the parser may change the order of the columns
new_ddf = PointsModel.parse(df.compute().copy(deep=True))
compute_df = df.compute().copy(deep=True)
new_ddf = PointsModel.parse(compute_df)
# the problem is not .copy(deep=True), but the parser, which discards some metadata https://github.com/scverse/spatialdata/issues/503#issuecomment-2015275322
new_ddf.attrs = _deepcopy(df.attrs)
# We need to use the compute_df here as with deepcopy, df._attrs does not exist anymore.
# print(type(new_ddf.attrs))
new_ddf.attrs.update(_deepcopy(compute_df.attrs))
return new_ddf


10 changes: 9 additions & 1 deletion src/spatialdata/_core/operations/rasterize.py
@@ -653,20 +653,28 @@ def rasterize_shapes_points(

table_name = table_name if table_name is not None else "table"

index = False
if value_key is not None:
kwargs = {"sdata": sdata, "element_name": element_name} if element_name is not None else {"element": data}
data[VALUES_COLUMN] = get_values(value_key, table_name=table_name, **kwargs).iloc[:, 0] # type: ignore[arg-type, union-attr]
elif isinstance(data, GeoDataFrame) or isinstance(data, DaskDataFrame) and return_regions_as_labels is True:
value_key = VALUES_COLUMN
data[VALUES_COLUMN] = data.index.astype("category")
index = True
else:
value_key = VALUES_COLUMN
data[VALUES_COLUMN] = 1

label_index_to_category = None
if VALUES_COLUMN in data and data[VALUES_COLUMN].dtype == "category":
if isinstance(data, DaskDataFrame):
data[VALUES_COLUMN] = data[VALUES_COLUMN].cat.as_known()
# We have to do this because as_known() does not preserve the order anymore in latest dask versions
# TODO discuss whether we can always expect the index from before to be monotonically increasing, because
# then we don't have to check order.
if index:
data[VALUES_COLUMN] = data[VALUES_COLUMN].cat.set_categories(data.index, ordered=True)
Comment on lines +671 to +675
LucaMarconato (Member), Oct 31, 2025:

Do you think that we could report this to dask? Maybe it is an unintended change. Or was it more that the order was never guaranteed in the first place?

melonora (Collaborator, PR author):

I will discuss it during their community meeting. I would have to dive a bit deeper into the exact cause, but dask itself does not seem to define set_categories, so it appears to be delegated to the pandas DataFrame, which only operates per partition; I am not certain about that, though. I did not want to spend too much time on it for now, as the dask maintainers can point me in the right direction much quicker.

else:
data[VALUES_COLUMN] = data[VALUES_COLUMN].cat.as_known()
label_index_to_category = dict(enumerate(data[VALUES_COLUMN].cat.categories, start=1))

if return_single_channel is None:
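For reference, a minimal runnable sketch of the workaround discussed in the thread above: the categorical column is derived from the index, as_known() materializes the categories, and set_categories re-imposes the original index order. The toy frame and the "values" column name are illustrative stand-ins for the points element and VALUES_COLUMN.

```python
import dask.dataframe as dd
import pandas as pd

# Toy stand-in for a points element whose index should define the category order.
pdf = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]}, index=[3, 1, 2, 0])
ddf = dd.from_pandas(pdf, npartitions=2, sort=False)

# Derive a categorical column from the index, as rasterize_shapes_points does.
ddf["values"] = ddf.index.astype("category")

# as_known() materializes the categories, but recent dask releases may not keep
# the original order, so it is re-imposed explicitly from the index.
ddf["values"] = ddf["values"].cat.as_known()
ddf["values"] = ddf["values"].cat.set_categories(pdf.index, ordered=True)

print(list(ddf["values"].compute().cat.categories))  # expected: [3, 1, 2, 0]
```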
21 changes: 12 additions & 9 deletions src/spatialdata/_core/operations/transform.py
@@ -3,11 +3,12 @@
import itertools
import warnings
from functools import singledispatch
from typing import TYPE_CHECKING, Any
from typing import TYPE_CHECKING, Any, cast

import dask.array as da
import dask_image.ndinterp
import numpy as np
import pandas as pd
from dask.array.core import Array as DaskArray
from dask.dataframe import DataFrame as DaskDataFrame
from geopandas import GeoDataFrame
@@ -432,18 +433,20 @@ def _(
xtransformed = transformation._transform_coordinates(xdata)
transformed = data.drop(columns=list(axes)).copy()
# dummy transformation that will be replaced by _adjust_transformation()
transformed.attrs[TRANSFORM_KEY] = {DEFAULT_COORDINATE_SYSTEM: Identity()}
# TODO: the following line, used in place of the line before, leads to an incorrect aggregation result. Look into
# this! Reported here: ...
# transformed.attrs = {TRANSFORM_KEY: {DEFAULT_COORDINATE_SYSTEM: Identity()}}
assert isinstance(transformed, DaskDataFrame)
default_cs = {DEFAULT_COORDINATE_SYSTEM: Identity()}
transformed.attrs[TRANSFORM_KEY] = default_cs

for ax in axes:
indices = xtransformed["dim"] == ax
new_ax = xtransformed[:, indices]
transformed[ax] = new_ax.data.flatten()
# TODO: discuss with dask team
# This is not nice, but otherwise there is a problem with the joint graph of new_ax and transformed, causing
# a getattr missing dependency of dependent from_dask_array.
new_col = pd.Series(new_ax.data.flatten().compute(), index=transformed.index)
transformed[ax] = new_col

old_transformations = cast(dict[str, Any], get_transformation(data, get_all=True))

old_transformations = get_transformation(data, get_all=True)
assert isinstance(old_transformations, dict)
_set_transformation_for_transformed_elements(
transformed,
old_transformations,
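A minimal sketch of the workaround in the hunk above, using toy stand-ins for transformed and new_ax: the dask array is computed into a pandas Series before being assigned as a column, so the dask-array graph and the dask-dataframe graph are never joined.

```python
import dask.array as da
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"z": [0.0, 1.0, 2.0, 3.0]})
transformed = dd.from_pandas(pdf, npartitions=2)  # stand-in for the transformed points
new_ax = da.arange(4, chunks=2, dtype=float)      # stand-in for the transformed coordinates

# Assigning `new_ax` directly would mix the array and dataframe task graphs, which
# triggered the missing-dependency error mentioned in the comment; computing first avoids it.
transformed["x"] = pd.Series(new_ax.compute(), index=pdf.index)

print(transformed.compute())
```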
16 changes: 13 additions & 3 deletions src/spatialdata/_core/query/spatial_query.py
@@ -672,14 +672,24 @@ def _(
max_coordinate=max_coordinate_intrinsic,
)

# assert that the number of bounding boxes is correct
assert len(in_intrinsic_bounding_box) == len(min_coordinate)
if not (len_df := len(in_intrinsic_bounding_box)) == (len_bb := len(min_coordinate)):
raise ValueError(f"Number of dataframes `{len_df}` is not equal to the number of bounding boxes `{len_bb}`.")
points_in_intrinsic_bounding_box: list[DaskDataFrame | None] = []
points_pd = points.compute()
attrs = points.attrs.copy()
for mask in in_intrinsic_bounding_box:
if mask.sum() == 0:
points_in_intrinsic_bounding_box.append(None)
else:
points_in_intrinsic_bounding_box.append(points.loc[mask])
# TODO there is a problem when mixing dask dataframe graph with dask array graph. Need to compute for now.
# we can't compute either mask or points as when we calculate either one of them
# test_query_points_multiple_partitions will fail as the mask will be used to index each partition.
# However, if we compute and then create the dask array again we get the mixed dask graph problem.
mask_np = mask.compute()
filtered_pd = points_pd[mask_np]
points_filtered = dd.from_pandas(filtered_pd, npartitions=points.npartitions)
points_filtered.attrs.update(attrs)
points_in_intrinsic_bounding_box.append(points_filtered)
if len(points_in_intrinsic_bounding_box) == 0:
return None

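A rough sketch of the mask-materialization pattern used above, with toy stand-ins for the points element and a bounding-box mask coming from a dask array. The real code additionally re-attaches the element metadata through the .attrs accessor introduced in this PR, which is omitted here.

```python
import dask.array as da
import dask.dataframe as dd
import pandas as pd

points = dd.from_pandas(pd.DataFrame({"x": [0.0, 1.5, 2.5, 4.0]}), npartitions=2)
mask = da.from_array([True, False, True, True], chunks=2)  # stand-in for in_intrinsic_bounding_box

# Indexing the dask dataframe with the dask-array mask would mix the two graph types,
# so both are materialized and the filtered frame is wrapped back into a dask dataframe.
points_pd = points.compute()
points_filtered = dd.from_pandas(points_pd[mask.compute()], npartitions=points.npartitions)

print(points_filtered.compute())
```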
7 changes: 2 additions & 5 deletions src/spatialdata/_core/spatialdata.py
@@ -13,8 +13,7 @@
import zarr
from anndata import AnnData
from dask.dataframe import DataFrame as DaskDataFrame
from dask.dataframe import read_parquet
from dask.delayed import Delayed
from dask.dataframe import Scalar, read_parquet
from geopandas import GeoDataFrame
from shapely import MultiPolygon, Polygon
from xarray import DataArray, DataTree
@@ -1985,9 +1984,7 @@ def h(s: str) -> str:
else:
shape_str = (
"("
+ ", ".join(
[(str(dim) if not isinstance(dim, Delayed) else "<Delayed>") for dim in v.shape]
)
+ ", ".join([(str(dim) if not isinstance(dim, Scalar) else "<Delayed>") for dim in v.shape])
+ ")"
)
descr += f"{h(attr + 'level1.1')}{k!r}: {descr_class} with shape: {shape_str} {dim_string}"
44 changes: 32 additions & 12 deletions src/spatialdata/_io/_utils.py
@@ -14,6 +14,7 @@

import zarr
from anndata import AnnData
from dask._task_spec import Task
from dask.array import Array as DaskArray
from dask.dataframe import DataFrame as DaskDataFrame
from geopandas import GeoDataFrame
@@ -301,6 +302,19 @@ def _get_backing_files(element: DaskArray | DaskDataFrame) -> list[str]:
return files


def _find_piece_dict(obj: dict[str, tuple[str | None]] | Task) -> dict[str, tuple[str | None | None]] | None:
"""Recursively search for dict containing the key 'piece' in Dask task specs containing the parquet file path."""
if isinstance(obj, dict):
if "piece" in obj:
return obj
elif hasattr(obj, "args"): # Handles dask._task_spec.* objects like Task and List
for v in obj.args:
result = _find_piece_dict(v)
if result is not None:
return result
return None


def _search_for_backing_files_recursively(subgraph: Any, files: list[str]) -> None:
# see the types allowed for the dask graph here: https://docs.dask.org/en/stable/spec.html

@@ -327,25 +341,31 @@ def _search_for_backing_files_recursively(subgraph: Any, files: list[str]) -> None:
path = getattr(v.store, "path", None) if getattr(v.store, "path", None) else v.store.root
files.append(str(UPath(path).resolve()))
elif name.startswith("read-parquet") or name.startswith("read_parquet"):
if hasattr(v, "creation_info"):
# https://github.com/dask/dask/blob/ff2488aec44d641696e0b7aa41ed9e995c710705/dask/dataframe/io/parquet/core.py#L625
t = v.creation_info["args"]
if not isinstance(t, tuple) or len(t) != 1:
raise ValueError(
f"Unable to parse the parquet file from the dask subgraph {subgraph}. Please "
f"report this bug."
)
parquet_file = t[0]
files.append(str(UPath(parquet_file).resolve()))
elif isinstance(v, tuple) and len(v) > 1 and isinstance(v[1], dict) and "piece" in v[1]:
# Here v is a read_parquet task with arguments and the only value is a dictionary.
if "piece" in v.args[0]:
# https://github.com/dask/dask/blob/ff2488aec44d641696e0b7aa41ed9e995c710705/dask/dataframe/io/parquet/core.py#L870
parquet_file, check0, check1 = v[1]["piece"]
parquet_file, check0, check1 = v.args[0]["piece"]
if not parquet_file.endswith(".parquet") or check0 is not None or check1 is not None:
raise ValueError(
f"Unable to parse the parquet file from the dask subgraph {subgraph}. Please "
f"report this bug."
)
files.append(os.path.realpath(parquet_file))
else:
# This occurs when for example points and images are mixed, the main task still starts with
# read_parquet, but the execution happens through a subgraph which we iterate over to get the
# actual read_parquet task.
for task in v.args[0].values():
# Recursively go through tasks, this is required because differences between dask versions.
piece_dict = _find_piece_dict(task)
if isinstance(piece_dict, dict) and "piece" in piece_dict:
parquet_file, check0, check1 = piece_dict["piece"] # type: ignore[misc]
if not parquet_file.endswith(".parquet") or check0 is not None or check1 is not None:
raise ValueError(
f"Unable to parse the parquet file from the dask subgraph {subgraph}. Please "
f"report this bug."
)
files.append(os.path.realpath(parquet_file))


def _backed_elements_contained_in_path(path: Path, object: SpatialData | SpatialElement | AnnData) -> list[bool]:
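A small usage sketch for the _find_piece_dict helper added above. A real graph value would be a dask._task_spec.Task; here duck-typed stand-ins with an args attribute are enough, since that is all the helper inspects.

```python
from types import SimpleNamespace

from spatialdata._io._utils import _find_piece_dict  # private helper added in this PR

# Nested stand-in mimicking a read-parquet task wrapping a subgraph.
piece = {"piece": ("/tmp/points.parquet", None, None)}
inner_task = SimpleNamespace(args=(piece,))
outer_task = SimpleNamespace(args=("unrelated-arg", inner_task))

found = _find_piece_dict(outer_task)
assert found is not None and found["piece"][0].endswith(".parquet")
```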
4 changes: 3 additions & 1 deletion src/spatialdata/_io/io_raster.py
@@ -347,7 +347,9 @@ def _write_raster_datatree(
compute=False,
)
# Compute all pyramid levels at once to allow Dask to optimize the computational graph.
da.compute(*dask_delayed)
# Optimize_graph is set to False for now as this causes permission denied errors when during atomic writes
# os.replace is called. These can also be alleviated by using 'single-threaded' scheduler.
da.compute(*dask_delayed, optimize_graph=False)

trans_group = group["labels"][element_name] if raster_type == "labels" else group
overwrite_coordinate_transformations_raster(
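For reference, a minimal illustration of the compute pattern in the hunk above: all delayed writes are computed in a single call with graph optimization disabled, which per the comment sidesteps the Windows permission errors around os.replace during atomic writes. The delayed objects here are trivial stand-ins for the per-scale zarr writes.

```python
import dask
import dask.array as da

# Trivial stand-ins for the delayed zarr writes collected by _write_raster_datatree.
dask_delayed = [dask.delayed(sum)([i, i + 1]) for i in range(3)]

# Single compute over all delayed objects, with graph optimization turned off.
results = da.compute(*dask_delayed, optimize_graph=False)
print(results)  # (1, 3, 5)
```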
2 changes: 1 addition & 1 deletion src/spatialdata/datasets.py
@@ -365,7 +365,7 @@ def blobs_annotating_element(name: BlobsTypes) -> SpatialData:
instance_id = get_element_instances(sdata[name]).tolist()
else:
index = sdata[name].index
instance_id = index.compute().tolist() if isinstance(index, dask.dataframe.core.Index) else index.tolist()
instance_id = index.compute().tolist() if isinstance(index, dask.dataframe.Index) else index.tolist()
n = len(instance_id)
new_table = AnnData(shape=(n, 0), obs={"region": pd.Categorical([name] * n), "instance_id": instance_id})
new_table = TableModel.parse(new_table, region=name, region_key="region", instance_key="instance_id")
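A quick check of the updated isinstance test above: the diff swaps dask.dataframe.core.Index for the public dask.dataframe.Index, which is the supported spelling with the dask versions pinned in this PR.

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=1)

index = ddf.index
# Same pattern as blobs_annotating_element, checked against the public Index class.
instance_id = index.compute().tolist() if isinstance(index, dd.Index) else index.tolist()
print(instance_id)  # [0, 1, 2]
```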