Conversation

@calebrob6 (Member) commented Aug 24, 2021

This adds a new tutorial notebook with an example implementation of the MOSAIKS method from "A generalizable and accessible approach to machine learning with global satellite imagery". It performs unsupervised feature extraction on Sentinel-2 (S2) imagery, then trains/tests a ridge regression model that predicts population density from the extracted image features.
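For readers new to the method, the core of MOSAIKS is small enough to sketch: featurize each image with fixed random convolutional filters, then fit a ridge regression on those features. This is a minimal illustration of the idea, not the notebook's code; all names and shapes here are made up:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.linear_model import RidgeCV


def random_conv_features(images, num_filters=512, kernel_size=3, seed=0):
    """Featurize a batch of images (N, C, H, W) with fixed random conv filters."""
    torch.manual_seed(seed)
    channels = images.shape[1]
    weights = torch.randn(num_filters, channels, kernel_size, kernel_size)
    with torch.no_grad():
        responses = F.relu(F.conv2d(images, weights))
        # Average each filter's response map down to one number per image
        feats = F.adaptive_avg_pool2d(responses, 1).squeeze(-1).squeeze(-1)
    return feats.numpy()


# x: (N, C, H, W) image patches; y: (N,) population density labels (fake data here)
x = torch.rand(100, 4, 64, 64)
y = np.random.rand(100)
features = random_conv_features(x)
model = RidgeCV(alphas=np.logspace(-4, 4, 9)).fit(features, y)
```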

Some notes:

nbviewer link: https://nbviewer.jupyter.org/github/microsoft/PlanetaryComputerExamples/blob/295470443651d9d7005341707df80a59e04df699/tutorials/mosaiks.ipynb

@estherrolf

@TomAugspurger

Very neat! A few random thoughts on the content as I run through it in the background:

  1. We set a few of your "GDAL best practices" environment variables automatically on the hub (https://planetarycomputer.microsoft.com/docs/overview/environment/#environment-variables). I wonder if we should set the others. (It might still be OK to include them in the notebook for people running it somewhere other than the Hub; a sketch of these settings appears after this list.)
  2. We'll have census 2020 data (including boundaries and populations) on the PC soon-ish. Might be worth revisiting once that's available to see if it's an appropriate substitute for the population density file.
  3. The pattern of querying the STAC endpoint under "Extract features from the imagery around each point" seems generally useful. I think pystac-client would benefit from something like this. I'll think about it a bit and write up an issue.
  4. I'm interested in how the interaction between "List of STAC Items" and PyTorch dataset (/ dataloader) can be improved.
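For reference on item 1, the settings in question typically look something like this when set in a notebook. The exact variables and values below are an assumption based on common cloud-optimized GeoTIFF guidance, not the Hub's actual configuration:

```python
import os

# Common "GDAL best practices" for reading COGs over HTTP
# (values are illustrative; the Hub may already set some of these)
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"  # skip directory listings
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"  # fewer HTTP requests
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"                   # retry transient failures
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"                 # seconds between retries
os.environ["VSI_CACHE"] = "TRUE"                          # cache remote reads
os.environ["VSI_CACHE_SIZE"] = str(100 * 1024 * 1024)     # 100 MB cache
```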

Some details on number 3 above: in particular, I wonder how repeated queries to the STAC API perform relative to a single query with a very complex geometry. The flow would be something like the following (a rough code sketch appears after the list):

  1. Union all the lat / lon points into a single MultiPoint geometry
  2. Submit a single query with the (complex) intersects geometry
  3. Fetch all the Items
  4. Find which Item covers which point(s)
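A rough sketch of that flow, assuming the standard Planetary Computer STAC endpoint and a modest number of points (how well the API handles a very complex intersects geometry is exactly the open question):

```python
import geopandas
import pystac_client
import shapely.geometry


def single_query(points: geopandas.GeoDataFrame):
    # 1. Union all the points into one MultiPoint geometry
    multipoint = shapely.geometry.mapping(points.unary_union)

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1"
    )
    # 2. Submit a single query with the (complex) intersects geometry
    search = catalog.search(collections=["sentinel-2-l2a"], intersects=multipoint)
    # 3. Fetch all the Items
    items = list(search.get_items())

    # 4. Find which Item covers which point(s)
    shapes = [shapely.geometry.shape(item.geometry) for item in items]
    covers = {
        item.id: [p for p in points.geometry if shape.covers(p)]
        for item, shape in zip(items, shapes)
    }
    return items, covers
```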

If you're OK with it, I might push some cleanup / simplifications to this branch. We'll want something that runs in a handful of minutes, perhaps by limiting the number of points a bit.

@calebrob6 (Member Author)

> If you're OK with it, I might push some cleanup / simplifications to this branch. We'll want something that runs in a handful of minutes, perhaps by limiting the number of points a bit.

Definitely change whatever you want!

I just tested decreasing the number of points by a factor of 10: this has little effect on the number of STAC API calls you have to make (still ~1000), which take 5-10 minutes, but it (obviously) speeds up actually running the model on the imagery (~1 minute). The end results of the model are worse, but the message is still the same.

> We set a few of your "GDAL best practices" environment variables automatically ...

I'll run the featurization step with and without these environment variables set and report back.

> We'll have census 2020 data (including boundaries and populations) on the PC soon-ish...

Cool! We could definitely use this. The current values come from this dataset (which would also be really useful to have).

> The pattern of querying the STAC endpoint

Any speed-ups here would be awesome. Another way to frame what I'm trying to do is, "how can I get a single layer of S2 scenes that cover a spatial extent?" I don't care about mosaicking them or anything; I just need the least cloudy ones.
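One way to phrase that goal as code is a greedy pass: sort candidate scenes by cloud cover and keep a scene whenever it covers some not-yet-covered part of the extent. A sketch, assuming the items are already in a GeoDataFrame with an "eo:cloud_cover" column (as built later in this thread):

```python
import geopandas


def least_cloudy_cover(items: geopandas.GeoDataFrame, extent):
    """Greedily pick the least-cloudy scenes until `extent` is covered.

    `items` has a geometry column plus an "eo:cloud_cover" column;
    `extent` is a shapely geometry.
    """
    remaining = extent
    selected = []
    for idx, row in items.sort_values("eo:cloud_cover").iterrows():
        if remaining.is_empty:
            break
        if row.geometry.intersects(remaining):
            selected.append(idx)
            remaining = remaining.difference(row.geometry)
    return items.loc[selected]
```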

> I'm interested in how the interaction between "List of STAC Items" and PyTorch dataset (/ dataloader) can be improved.

In torchgeo our GeoDataset takes a big list of GeoTIFFs as input, builds an internal spatial index, and then lets you sample imagery by bounding box (+ time range). If you could efficiently find the subset of S2 scenes that you wanted to sample from up front, then you could wrap most of the logic into a single Dataset that follows the same idea.
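For context, sampling from a torchgeo GeoDataset looks roughly like this; exact class names and signatures vary by release, so treat it as a sketch rather than a drop-in:

```python
from torch.utils.data import DataLoader
from torchgeo.datasets import Sentinel2, stack_samples
from torchgeo.samplers import RandomGeoSampler

# Build a spatial index over local Sentinel-2 GeoTIFFs, then sample
# fixed-size patches by bounding box (+ time range).
ds = Sentinel2(root="data/sentinel2")  # hypothetical local directory
sampler = RandomGeoSampler(ds, size=256, length=100)
loader = DataLoader(ds, sampler=sampler, collate_fn=stack_samples)
for batch in loader:
    images = batch["image"]  # (batch, channels, 256, 256)
```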

@calebrob6 (Member Author)

> We set a few of your "GDAL best practices" environment variables automatically ...

Looks like loading time is similar with or without the additional environment variables.

@TomAugspurger

https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb has some analysis / thoughts on speeding up the STAC queries. With some effort I was able to get all the STAC items in ~2-3 minutes (haven't associated them with the points yet though).

The basic idea is described at https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb#Option-2:-Parallelize...-carefully., but the tl;dr is to

  1. Partition the data and parallelize the requests (fetching ~4 result sets in parallel; we could parallelize more since we're I/O bound rather than CPU bound, but we'd need to watch how the endpoint / database handles that)
  2. Carefully sort the data prior to partitioning, making sure nearby points land in the same partition to avoid re-fetching an item many times (since a single Item covers many points). A rough sketch of this sorting step appears below.
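A rough illustration of step 2, using a coarse grid key as the sort order (the gist may do something smarter, e.g. a space-filling curve); the helper name is made up:

```python
import numpy as np


def spatial_partitions(points, n_partitions=16, cell_size=1.0):
    """Group nearby points so each partition hits few distinct S2 tiles.

    Sort by a coarse grid cell (lon, lat floored to `cell_size` degrees),
    then split into contiguous chunks.
    """
    key = (
        (points.geometry.x // cell_size).astype(int).astype(str)
        + "_"
        + (points.geometry.y // cell_size).astype(int).astype(str)
    )
    ordered = points.loc[key.sort_values().index]
    return np.array_split(ordered, n_partitions)
```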

The final table of results is at https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb#Results.

I'm not 100% sure what the next steps are, but for now I think we should adopt that partitioning code in this example and then look to move it into a library / API in the future.

@calebrob6 (Member Author)

Hey @TomAugspurger, I've implemented the RCF model used here in torchgeo (torchgeo/torchgeo#176). Is this notebook still on the radar for inclusion in the tutorials? (Wondering if I should refactor it to use the torchgeo model implementation.)

@TomAugspurger

Still on my radar, thanks. I'll push an update some time this week.

@TomAugspurger commented Oct 15, 2021

@calebrob6 I updated the notebook to use a variant of the strategy in #70 (comment). The core idea is in

```python
import geopandas
import pystac_client
import shapely.geometry


def query(points):
    """
    Find a STAC item for each point in the `points` DataFrame.

    Parameters
    ----------
    points : geopandas.GeoDataFrame
        A GeoDataFrame of point geometries.

    Returns
    -------
    geopandas.GeoDataFrame
        A new GeoDataFrame with a `stac_item` column containing the STAC
        item that covers each point.
    """
    intersects = shapely.geometry.mapping(points.unary_union.convex_hull)

    # The time frame in which we search for non-cloudy imagery
    search_start = "2018-01-01"
    search_end = "2019-12-31"

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1"
    )
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        intersects=intersects,
        datetime=[search_start, search_end],
        query={"eo:cloud_cover": {"lt": 10}},
        limit=500,
    )
    ic = search.get_all_items_as_dict()

    features = ic["features"]
    features_d = {item["id"]: item for item in features}

    data = {
        "eo:cloud_cover": [],
        "geometry": [],
    }
    index = []

    for item in features:
        data["eo:cloud_cover"].append(item["properties"]["eo:cloud_cover"])
        data["geometry"].append(shapely.geometry.shape(item["geometry"]))
        index.append(item["id"])

    # Least-cloudy items first, so each point gets its best available scene
    items = geopandas.GeoDataFrame(data, index=index, geometry="geometry").sort_values(
        "eo:cloud_cover"
    )
    point_list = points.geometry.tolist()

    point_items = []
    for point in point_list:
        covered_by = items[items.covers(point)]
        if len(covered_by):
            point_items.append(features_d[covered_by.index[0]])
        else:
            # No scenes matched our conditions for this point (all too cloudy)
            point_items.append(None)

    return points.assign(stac_item=point_items)
```

The basic idea is to spatially partition the data and then feed each partition to `query` in parallel. An earlier version only did the STAC item searching in parallel; now this also associates each point with a single STAC item (in parallel). That runs in ~75s on the 70,000 points.
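For completeness, the driver around `query` can be as simple as a thread pool over spatial partitions. A sketch, reusing the hypothetical `spatial_partitions` helper from earlier in the thread (the notebook may use Dask instead):

```python
import concurrent.futures

import geopandas
import pandas as pd


def parallel_query(points: geopandas.GeoDataFrame, n_partitions=16, n_workers=4):
    # Spatially partition so nearby points share STAC items, then run
    # `query` on each partition in parallel. The work is I/O bound, so
    # threads suffice; keep n_workers modest to be kind to the endpoint.
    partitions = spatial_partitions(points, n_partitions=n_partitions)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(query, partitions))
    return pd.concat(results)
```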

@calebrob6 (Member Author)

Awesome!! Makes a ton of sense, and it gives an almost linear speed-up compared to the single-threaded approach I was using previously.

So we don't have a torchgeo release yet, but I've implemented the RCF model defined in cell 4 -- https://torchgeo.readthedocs.io/en/latest/api/models.html#torchgeo.models.RCF. Do you want me to open a PR to swap it in once we have an actual release, vs. adding it now with a pip install git+https://github.com/microsoft/torchgeo.git?
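For reference, swapping in the torchgeo implementation would look roughly like this (per the linked docs; defaults and exact output shapes may differ by release):

```python
import torch
from torchgeo.models import RCF

# 1000 fixed random convolutional features over 4-band imagery
model = RCF(in_channels=4, features=1000, kernel_size=3)
with torch.no_grad():
    x = torch.rand(8, 4, 64, 64)  # a batch of (C, H, W) patches
    feats = model(x)              # -> (8, 1000) feature matrix
```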

@calebrob6 (Member Author)

Another sensible refactoring step would be to convert the returned STAC items into a STAC collection with the population labels, then make a STACDataset (in torchgeo) to use instead of the CustomDataset. This doesn't exist yet and would take a bit of effort.
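A sketch of the first half of that idea with pystac; the label property name and the `population` column are made up, and a real version would presumably follow the STAC label extension:

```python
import pystac


def to_labeled_item_collection(points):
    """Bundle each point's STAC item dict with its population label.

    `points` is the GeoDataFrame returned by `query`, with a `stac_item`
    column and a (hypothetical) `population` column.
    """
    items = []
    for _, row in points.dropna(subset=["stac_item"]).iterrows():
        item = pystac.Item.from_dict(row["stac_item"])
        item.properties["label:population_density"] = row["population"]
        items.append(item)
    return pystac.ItemCollection(items)
```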

@TomAugspurger

I talked with Caleb in Teams, and he plans to submit a PR updating this notebook once a version of torchgeo with this model is released.

Thanks for submitting this PR @calebrob6! It's great to see higher-level applications of STAC. I'm hopeful we can improve the user experience around large queries like this, but in the meantime it's nice to have a concrete example.

@TomAugspurger TomAugspurger merged commit 1137874 into main Oct 18, 2021
@TomAugspurger TomAugspurger deleted the tutorial/mosaiks branch October 18, 2021 13:39