Adding MOSAIKS tutorial notebook #70
Conversation
Very neat! A few random thoughts on the content as I run through it in the background:
Some details on number 3 above: In particular, I wonder how repeated queries to the STAC API perform relative to a single query with a very complex polygon (a sketch of both styles follows this comment). The flow would be something like
If you're OK with it, I might push some cleanup / simplifications to this branch. We'll want something that runs in a handful of minutes, perhaps by limiting the number of points a bit.
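For illustration, here is a rough sketch (not code from the notebook) of the two query styles being compared above, assuming `points` is a geopandas.GeoDataFrame of query locations and using the Planetary Computer STAC endpoint:

```python
import pystac_client
import shapely.geometry

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1"
)

# Style 1: one small STAC search per point (many requests).
per_point_items = [
    catalog.search(
        collections=["sentinel-2-l2a"],
        intersects=shapely.geometry.mapping(point),
        datetime="2018-01-01/2019-12-31",
    ).get_all_items()
    for point in points.geometry
]

# Style 2: a single search whose `intersects` geometry covers every point at once.
all_items = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects=shapely.geometry.mapping(points.unary_union),
    datetime="2018-01-01/2019-12-31",
).get_all_items()
```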
Definitely change whatever you want! I just tested decreasing the number of points by a factor of 10: this has little effect on the number of STAC API calls you have to make (still ~1000), which takes 5-10 minutes, but (obviously) speeds up the time taken to actually run the model on the imagery (takes ~1 minute). The end results of the model are worse but the message is still the same.
I'll run the featurization step w/ and w/out these environment variables set and report back
Cool! We could definitely use this. The current values come from this dataset (which would also be really useful to have).
Any speed-ups here would be awesome. Another way to look at what I'm trying to do is: "how can I get a single layer of S2 scenes that cover a spatial extent?" Here I don't care about mosaicking them or anything, I just need the least cloudy ones.
In torchgeo our GeoDataset takes a big list of GeoTIFFs as input, builds an internal spatial index, and then lets you sample imagery by bounding box (+ time range). If you could efficiently find the subset of S2 scenes that you wanted to sample from up front, then you could wrap most of the logic into a single Dataset that follows the same idea.
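A minimal sketch of that sampling pattern. The class names and call signatures below are assumptions about torchgeo's documented API (they vary between releases), not code from this thread:

```python
import torchgeo.datasets

# Build a GeoDataset over a local directory of Sentinel-2 GeoTIFFs; torchgeo
# indexes the files spatially (and temporally) when the dataset is constructed.
dataset = torchgeo.datasets.Sentinel2(root="data/sentinel2", bands=["B04", "B03", "B02"])

# Sample imagery by bounding box plus time range. Coordinates are in the
# dataset's CRS and times are UNIX timestamps.
bbox = torchgeo.datasets.BoundingBox(
    minx=500_000, maxx=510_000,
    miny=4_190_000, maxy=4_200_000,
    mint=1_514_764_800, maxt=1_577_836_800,
)
sample = dataset[bbox]
image = sample["image"]  # torch.Tensor for the requested window
```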
Looks like loading time is similar w/ or w/out the additional env variables
https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb has some analysis / thoughts on speeding up the STAC queries. With some effort I was able to get all the STAC items in ~2-3 minutes (haven't associated them with the points yet though). The basic idea is described at https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb#Option-2:-Parallelize...-carefully., but the tl;dr is to
The final table of results is at https://nbviewer.jupyter.org/gist/TomAugspurger/ceadc4b2f8b7e4263ff172ee1ea76dbb#Results. I'm not 100% sure what the next steps are, but for now I think we adopt that partitioning code in the example here and then look to put it in a library / API in the future.
Hey @TomAugspurger, I've implemented the RCF model used here in torchgeo (torchgeo/torchgeo#176). Is this notebook still something on the radar to include in the tutorials? (wondering if I should refactor it to use the torchgeo model implementation)
Still on my radar, thanks. I'll push an update some time this week.
@calebrob6 I updated the notebook to use a variant of the strategy in #70 (comment). The core idea is in `query`:

```python
import geopandas
import pystac_client
import shapely.geometry


def query(points):
    """
    Find a STAC item for each point in the `points` DataFrame.

    Parameters
    ----------
    points : geopandas.GeoDataFrame
        A GeoDataFrame

    Returns
    -------
    geopandas.GeoDataFrame
        A new geopandas.GeoDataFrame with a `stac_item` column containing the
        STAC item that covers each point.
    """
    intersects = shapely.geometry.mapping(points.unary_union.convex_hull)

    # The time frame in which we search for non-cloudy imagery
    search_start = "2018-01-01"
    search_end = "2019-12-31"

    catalog = pystac_client.Client.open(
        "https://planetarycomputer.microsoft.com/api/stac/v1"
    )
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        intersects=intersects,
        datetime=[search_start, search_end],
        query={"eo:cloud_cover": {"lt": 10}},
        limit=500,
    )
    ic = search.get_all_items_as_dict()
    features = ic["features"]
    features_d = {item["id"]: item for item in features}

    data = {
        "eo:cloud_cover": [],
        "geometry": [],
    }
    index = []
    for item in features:
        data["eo:cloud_cover"].append(item["properties"]["eo:cloud_cover"])
        data["geometry"].append(shapely.geometry.shape(item["geometry"]))
        index.append(item["id"])

    # Sort candidate scenes from least to most cloudy so the first match wins.
    items = geopandas.GeoDataFrame(data, index=index, geometry="geometry").sort_values(
        "eo:cloud_cover"
    )

    point_list = points.geometry.tolist()
    point_items = []
    for point in point_list:
        covered_by = items[items.covers(point)]
        if len(covered_by):
            point_items.append(features_d[covered_by.index[0]])
        else:
            # There weren't any scenes matching our conditions for this point (too cloudy)
            point_items.append(None)

    return points.assign(stac_item=point_items)
```

The basic idea is to spatially partition the data and then feed each partition to `query`.
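One way the "feed each partition" step could look; this is a simplified sketch rather than the notebook's exact code, and the grid size, worker count, and `run_partitioned` name are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def run_partitioned(points, cell_size=1.0, workers=8):
    """Group points into coarse lon/lat cells and run `query` on each group in parallel."""
    # Points in the same cell are close together, so each partition's convex
    # hull (and therefore each STAC search geometry) stays small.
    cells = (
        np.floor(points.geometry.x / cell_size).astype(int).astype(str)
        + "_"
        + np.floor(points.geometry.y / cell_size).astype(int).astype(str)
    )
    partitions = [group for _, group in points.groupby(cells)]

    # Each partition becomes one call to `query`; the STAC requests are I/O
    # bound, so they can overlap in threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(query, partitions))

    return pd.concat(results)
```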
Awesome!! Makes a ton of sense and gives almost a linear speed-up compared to the single-threaded way I was doing previously. So we don't have a torchgeo release yet, but I've implemented the RCF model defined in cell 4 -- https://torchgeo.readthedocs.io/en/latest/api/models.html#torchgeo.models.RCF. Do you want me to open a PR to clean this up once we have an actual release vs. adding it now with a
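A small usage sketch for the linked torchgeo model; the constructor arguments and shapes here are assumptions based on its documentation, not the notebook's cell 4:

```python
import torch
from torchgeo.models import RCF

# Random convolutional features: fixed, untrained filters whose pooled responses
# give one feature vector per image patch for the downstream ridge regression.
model = RCF(in_channels=3, features=512, kernel_size=3)

patches = torch.randn(8, 3, 256, 256)  # a batch of 3-band image patches
with torch.no_grad():
    features = model(patches)          # roughly (batch, features) output
```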
Another sensible refactoring step would be to convert the returned STAC items into a STAC collection with the population labels, then make a
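A rough sketch of that first step (attaching labels to the returned items and writing them back out), using a pystac ItemCollection as a lightweight stand-in for a full Collection; `point_stac_item_dicts`, `population_labels`, the `population` property name, and the output path are hypothetical placeholders, not from the notebook:

```python
import json

import pystac

items = []
for item_dict, label in zip(point_stac_item_dicts, population_labels):
    item = pystac.Item.from_dict(item_dict)
    item.properties["population"] = float(label)  # attach the label to the STAC item
    items.append(item)

# Bundle the labeled items so they can be saved and re-read as one object.
item_collection = pystac.ItemCollection(items)
with open("mosaiks-items.json", "w") as f:
    json.dump(item_collection.to_dict(), f)
```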
I talked with Caleb in Teams, and he plans to submit a PR updating this notebook once a version of torchgeo with this model is released. Thanks for submitting this PR @calebrob6! It's great to see higher-level applications of STAC. I'm hopeful we can improve the user experience around large queries like this, but in the meantime it's nice to have a concrete example.
This adds a new tutorial notebook with an example implementation of the MOSAIKS method from "A generalizable and accessible approach to machine learning with global satellite imagery". It performs unsupervised feature extraction on S2 imagery, then trains and tests a ridge regression model that predicts population density from the extracted image features.
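As a schematic of that train/test step (not the notebook's code), assuming `features` is the array of extracted image features and `population` holds the matching labels:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Hold out a test split of points, then fit a cross-validated ridge regression
# over a log-spaced range of regularization strengths.
X_train, X_test, y_train, y_test = train_test_split(
    features, population, test_size=0.2, random_state=0
)
model = RidgeCV(alphas=np.logspace(-4, 4, 9))
model.fit(X_train, y_train)

print("R^2 on held-out points:", model.score(X_test, y_test))
```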
Some notes:
nbviewer link: https://nbviewer.jupyter.org/github/microsoft/PlanetaryComputerExamples/blob/295470443651d9d7005341707df80a59e04df699/tutorials/mosaiks.ipynb
@estherrolf