Conversation

@calebrob6 (Member) commented Aug 24, 2021

This adds a new tutorial notebook with an example implementation of the MOSAIKS method from "A generalizable and accessible approach to machine learning with global satellite imagery". It performs unsupervised feature extraction on Sentinel-2 (S2) imagery, then trains/tests a ridge regression model that predicts population density from the extracted image features.
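For readers new to the method, the core of MOSAIKS is small enough to sketch: featurize each image with fixed random convolutional filters, then fit a ridge regression on those features. This is a minimal illustration of the idea, not the notebook's code; all names and shapes here are made up:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.linear_model import RidgeCV


def random_conv_features(images, num_filters=512, kernel_size=3, seed=0):
    """Featurize a batch of images (N, C, H, W) with fixed random conv filters."""
    torch.manual_seed(seed)
    channels = images.shape[1]
    weights = torch.randn(num_filters, channels, kernel_size, kernel_size)
    with torch.no_grad():
        responses = F.relu(F.conv2d(images, weights))
        # Average each filter's response map down to one number per image
        feats = F.adaptive_avg_pool2d(responses, 1).squeeze(-1).squeeze(-1)
    return feats.numpy()


# x: (N, C, H, W) image patches; y: (N,) population density labels (fake data here)
x = torch.rand(100, 4, 64, 64)
y = np.random.rand(100)
features = random_conv_features(x)
model = RidgeCV(alphas=np.logspace(-4, 4, 9)).fit(features, y)
```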

Some notes:

nbviewer link: https://nbviewer.jupyter.org/github/microsoft/PlanetaryComputerExamples/blob/295470443651d9d7005341707df80a59e04df699/tutorials/mosaiks.ipynb

@estherrolf

@TomAugspurger

Very neat! A few random thoughts on the content as I run through it in the background:

  1. We set a few of your "GDAL best practices" environment variables automatically on the hub (https://planetarycomputer.microsoft.com/docs/overview/environment/#environment-variables). I wonder if we should set the others. (It might still be OK to include them in the notebook for people running it somewhere other than the Hub; a sketch of these settings appears after this list.)
  2. We'll have census 2020 data (including boundaries and populations) on the PC soon-ish. Might be worth revisiting once that's available to see if it's an appropriate substitute for the population density file.
  3. The pattern of querying the STAC endpoint under "Extract features from the imagery around each point" seems generally useful. I think pystac-client would benefit from something like this. I'll think about it a bit and write up an issue.
  4. I'm interested in how the interaction between "List of STAC Items" and PyTorch dataset (/ dataloader) can be improved.
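For reference on item 1, the settings in question typically look something like this when set in a notebook. The exact variables and values below are an assumption based on common cloud-optimized GeoTIFF guidance, not the Hub's actual configuration:

```python
import os

# Common "GDAL best practices" for reading COGs over HTTP
# (values are illustrative; the Hub may already set some of these)
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"  # skip directory listings
os.environ["GDAL_HTTP_MERGE_CONSECUTIVE_RANGES"] = "YES"  # fewer HTTP requests
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"                   # retry transient failures
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"                 # seconds between retries
os.environ["VSI_CACHE"] = "TRUE"                          # cache remote reads
os.environ["VSI_CACHE_SIZE"] = str(100 * 1024 * 1024)     # 100 MB cache
```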

Some details on number 3 above: in particular, I wonder how repeated queries to the STAC API perform relative to a single query with a very complex geometry. The flow would be something like the following (a rough code sketch appears after the list):

  1. Union all the lat / lon points into a single MultiPoint geometry
  2. Submit a single query with the (complex) intersects geometry
  3. Fetch all the Items
  4. Find which Item covers which point(s)
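A rough sketch of that flow, assuming the standard Planetary Computer STAC endpoint and a modest number of points (how well the API handles a very complex intersects geometry is exactly the open question):

```python
import geopandas
import pystac_client
import shapely.geometry


def single_query(points: geopandas.GeoDataFrame):
    # 1. Union all the points into one MultiPoint geometry
    multipoint = shapely.geometry.mapping(points.unary_union)

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1"
    )
    # 2. Submit a single query with the (complex) intersects geometry
    search = catalog.search(collections=["sentinel-2-l2a"], intersects=multipoint)
    # 3. Fetch all the Items
    items = list(search.get_items())

    # 4. Find which Item covers which point(s)
    shapes = [shapely.geometry.shape(item.geometry) for item in items]
    covers = {
        item.id: [p for p in points.geometry if shape.covers(p)]
        for item, shape in zip(items, shapes)
    }
    return items, covers
```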

If you're OK with it, I might push some cleanup / simplifications to this branch. We'll want something that runs in a handful of minutes, perhaps by limiting the number of points a bit.

@calebrob6 (Member Author)

> If you're OK with it, I might push some cleanup / simplifications to this branch. We'll want something that runs in a handful of minutes, perhaps by limiting the number of points a bit.

Definitely change whatever you want!

I just tested decreasing the number of points by a factor of 10: this has little effect on the number of STAC API calls you have to make (still ~1000), which take 5-10 minutes, but it (obviously) speeds up actually running the model on the imagery (~1 minute). The end results of the model are worse, but the message is still the same.

> We set a few of your "GDAL best practices" environment variables automatically ...

I'll run the featurization step with and without these environment variables set and report back.

> We'll have census 2020 data (including boundaries and populations) on the PC soon-ish...

Cool! We could definitely use this. The current values come from this dataset (which would also be really useful to have).

> The pattern of querying the STAC endpoint

Any speed-ups here would be awesome. Another way to frame what I'm trying to do is, "how can I get a single layer of S2 scenes that cover a spatial extent?" I don't care about mosaicking them or anything; I just need the least cloudy ones.
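One way to phrase that goal as code is a greedy pass: sort candidate scenes by cloud cover and keep a scene whenever it covers some not-yet-covered part of the extent. A sketch, assuming the items are already in a GeoDataFrame with an "eo:cloud_cover" column (as built later in this thread):

```python
import geopandas


def least_cloudy_cover(items: geopandas.GeoDataFrame, extent):
    """Greedily pick the least-cloudy scenes until `extent` is covered.

    `items` has a geometry column plus an "eo:cloud_cover" column;
    `extent` is a shapely geometry.
    """
    remaining = extent
    selected = []
    for idx, row in items.sort_values("eo:cloud_cover").iterrows():
        if remaining.is_empty:
            break
        if row.geometry.intersects(remaining):
            selected.append(idx)
            remaining = remaining.difference(row.geometry)
    return items.loc[selected]
```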

> I'm interested in how the interaction between "List of STAC Items" and PyTorch dataset (/ dataloader) can be improved.

In torchgeo our GeoDataset takes a big list of GeoTIFFs as input, builds an internal spatial index, and then lets you sample imagery by bounding box (+ time range). If you could efficiently find the subset of S2 scenes that you wanted to sample from up front, then you could wrap most of the logic into a single Dataset that follows the same idea.
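For context, sampling from a torchgeo GeoDataset looks roughly like this; exact class names and signatures vary by release, so treat it as a sketch rather than a drop-in:

```python
from torch.utils.data import DataLoader
from torchgeo.datasets import Sentinel2, stack_samples
from torchgeo.samplers import RandomGeoSampler

# Build a spatial index over local Sentinel-2 GeoTIFFs, then sample
# fixed-size patches by bounding box (+ time range).
ds = Sentinel2(root="data/sentinel2")  # hypothetical local directory
sampler = RandomGeoSampler(ds, size=256, length=100)
loader = DataLoader(ds, sampler=sampler, collate_fn=stack_samples)
for batch in loader:
    images = batch["image"]  # (batch, channels, 256, 256)
```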

@calebrob6 (Member Author)

> We set a few of your "GDAL best practices" environment variables automatically ...

Looks like loading time is similar with or without the additional environment variables.

@TomAugspurger

https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb has some analysis / thoughts on speeding up the STAC queries. With some effort I was able to get all the STAC items in ~2-3 minutes (haven't associated them with the points yet though).

The basic idea is described at https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb#Option-2:-Parallelize...-carefully., but the tl;dr is to

  1. Partition the data and parallelize the requests (fetching ~4 result sets in parallel; we could parallelize more since we're I/O bound rather than CPU bound, but we'd need to watch how the endpoint / database handles that)
  2. Carefully sort the data prior to partitioning, making sure nearby points land in the same partition to avoid re-fetching an item many times (since a single Item covers many points). A rough sketch of this sorting step appears below.
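A rough illustration of step 2, using a coarse grid key as the sort order (the gist may do something smarter, e.g. a space-filling curve); the helper name is made up:

```python
import numpy as np


def spatial_partitions(points, n_partitions=16, cell_size=1.0):
    """Group nearby points so each partition hits few distinct S2 tiles.

    Sort by a coarse grid cell (lon, lat floored to `cell_size` degrees),
    then split into contiguous chunks.
    """
    key = (
        (points.geometry.x // cell_size).astype(int).astype(str)
        + "_"
        + (points.geometry.y // cell_size).astype(int).astype(str)
    )
    ordered = points.loc[key.sort_values().index]
    return np.array_split(ordered, n_partitions)
```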

The final table of results is at https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb#Results.

I'm not 100% sure what the next steps are, but for now I think we should adopt that partitioning code in this example and then look to move it into a library / API in the future.

@calebrob6 (Member Author)

Hey @TomAugspurger, I've implemented the RCF model used here in torchgeo (torchgeo/torchgeo#176). Is this notebook still on the radar for inclusion in the tutorials? (Wondering if I should refactor it to use the torchgeo model implementation.)

@TomAugspurger

Still on my radar, thanks. I'll push an update some time this week.

@TomAugspurger commented Oct 15, 2021

@calebrob6 I updated the notebook to use a variant of the strategy in #70 (comment). The core idea is in

```python
import geopandas
import pystac_client
import shapely.geometry


def query(points):
    """
    Find a STAC item for each point in the `points` DataFrame.

    Parameters
    ----------
    points : geopandas.GeoDataFrame
        A GeoDataFrame of point geometries.

    Returns
    -------
    geopandas.GeoDataFrame
        A new GeoDataFrame with a `stac_item` column containing the STAC
        item that covers each point.
    """
    intersects = shapely.geometry.mapping(points.unary_union.convex_hull)

    # The time frame in which we search for non-cloudy imagery
    search_start = "2018-01-01"
    search_end = "2019-12-31"

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1"
    )
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        intersects=intersects,
        datetime=[search_start, search_end],
        query={"eo:cloud_cover": {"lt": 10}},
        limit=500,
    )
    ic = search.get_all_items_as_dict()

    features = ic["features"]
    features_d = {item["id"]: item for item in features}

    data = {
        "eo:cloud_cover": [],
        "geometry": [],
    }
    index = []

    for item in features:
        data["eo:cloud_cover"].append(item["properties"]["eo:cloud_cover"])
        data["geometry"].append(shapely.geometry.shape(item["geometry"]))
        index.append(item["id"])

    # Least-cloudy items first, so each point gets its best available scene
    items = geopandas.GeoDataFrame(data, index=index, geometry="geometry").sort_values(
        "eo:cloud_cover"
    )
    point_list = points.geometry.tolist()

    point_items = []
    for point in point_list:
        covered_by = items[items.covers(point)]
        if len(covered_by):
            point_items.append(features_d[covered_by.index[0]])
        else:
            # No scenes matched our conditions for this point (all too cloudy)
            point_items.append(None)

    return points.assign(stac_item=point_items)
```

The basic idea is to spatially partition the data and then feed each partition to `query` in parallel. An earlier version only did the STAC item searching in parallel; now this also associates each point with a single STAC item (in parallel). That runs in ~75s on the 70,000 points.
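For completeness, the driver around `query` can be as simple as a thread pool over spatial partitions. A sketch, reusing the hypothetical `spatial_partitions` helper from earlier in the thread (the notebook may use Dask instead):

```python
import concurrent.futures

import geopandas
import pandas as pd


def parallel_query(points: geopandas.GeoDataFrame, n_partitions=16, n_workers=4):
    # Spatially partition so nearby points share STAC items, then run
    # `query` on each partition in parallel. The work is I/O bound, so
    # threads suffice; keep n_workers modest to be kind to the endpoint.
    partitions = spatial_partitions(points, n_partitions=n_partitions)
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(query, partitions))
    return pd.concat(results)
```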

@calebrob6 (Member Author)

Awesome!! Makes a ton of sense, and it gives an almost linear speed-up compared to the single-threaded approach I was using previously.

So we don't have a torchgeo release yet, but I've implemented the RCF model defined in cell 4 -- https://torchgeo.readthedocs.io/en/latest/api/models.html#torchgeo.models.RCF. Do you want me to open a PR to swap it in once we have an actual release, vs. adding it now with a pip install git+https://github.com/microsoft/torchgeo.git?
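For reference, swapping in the torchgeo implementation would look roughly like this (per the linked docs; defaults and exact output shapes may differ by release):

```python
import torch
from torchgeo.models import RCF

# 1000 fixed random convolutional features over 4-band imagery
model = RCF(in_channels=4, features=1000, kernel_size=3)
with torch.no_grad():
    x = torch.rand(8, 4, 64, 64)  # a batch of (C, H, W) patches
    feats = model(x)              # -> (8, 1000) feature matrix
```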

@calebrob6 (Member Author)

Another sensible refactoring step would be to convert the returned STAC items into a STAC collection with the population labels, then make a STACDataset (in torchgeo) to use instead of the CustomDataset. This doesn't exist yet and would take a bit of effort.
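A sketch of the first half of that idea with pystac; the label property name and the `population` column are made up, and a real version would presumably follow the STAC label extension:

```python
import pystac


def to_labeled_item_collection(points):
    """Bundle each point's STAC item dict with its population label.

    `points` is the GeoDataFrame returned by `query`, with a `stac_item`
    column and a (hypothetical) `population` column.
    """
    items = []
    for _, row in points.dropna(subset=["stac_item"]).iterrows():
        item = pystac.Item.from_dict(row["stac_item"])
        item.properties["label:population_density"] = row["population"]
        items.append(item)
    return pystac.ItemCollection(items)
```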

@TomAugspurger

I talked with Caleb in Teams, and he plans to submit a PR updating this notebook once a version of torchgeo with this model is released.

Thanks for submitting this PR @calebrob6! It's great to see higher-level applications of STAC. I'm hopeful we can improve the user experience around large queries like this, but in the meantime it's nice to have a concrete example.

@TomAugspurger TomAugspurger merged commit 1137874 into main Oct 18, 2021
@TomAugspurger TomAugspurger deleted the tutorial/mosaiks branch October 18, 2021 13:39