Is "batched" PreChippedGeoSampler supported? #1922
-
I'm attempting to use the chipped version of LandCoverAI (i.e., whatever ends up in the output directory after chipping) with PreChippedGeoSampler in a custom datamodule, and I am initializing the sampler like so:

    def setup(self, stage: str) -> None:
        …
        if stage in ["fit"]:
            self.train_sampler = PreChippedGeoSampler(self.train_dataset, shuffle=True)
        …

When I try to train a model, the dev run completes successfully, and I can even get ca. 30 steps into the first epoch before I get a sample collation error.
To debug, I ran datamodule.setup("fit") manually and iterated over the dataloader directly:

    from torch.utils.data import DataLoader
    from torchgeo.datasets import stack_samples

    dataloader = DataLoader(
        datamodule.train_dataset,
        batch_size=8,
        sampler=datamodule.train_sampler,
        collate_fn=stack_samples,
    )
    for _ in dataloader:
        continue

This throws the same error as above, but if I rerun it, it appears to work fine. Additionally, I have noticed that the error returns if I set up the datamodule again. I'm not 100% certain this is a bug on TorchGeo's end, because pretty much all my code is custom, but I'd be grateful if somebody could replicate it.
-
By the way, here's a suggestion that just came to mind: since pre-chipped datasets come in this "standardized" form that we can exploit to sample them simply by file index, why are we relying on a spatial index at all? I get that they are GeoDatasets and come with a CRS, so the logical thing is to put them into a spatial index, but this is kind of irrelevant in the case of chip sampling IMO. Reading 10k+ chips into memory to construct this index takes time, and so does the actual indexing; this can significantly increase training setup and epoch durations, potentially forcing reduced batch sizes, etc. I think a "hybrid" dataset that behaves like a Geo- or NonGeoDataset on demand (e.g., via a use_index parameter in its initializer and type-dispatched getters) would be a nice addition; a rough sketch follows.
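A minimal sketch of that hybrid idea, under loose assumptions: HybridChipDataset, use_index, and the bounding-box handling are all hypothetical, not anything that exists in TorchGeo.

    import glob
    import os
    from typing import Any

    import rasterio
    import torch

    class HybridChipDataset:
        """Hypothetical hybrid: positional (NonGeo-style) access by integer,
        or bounding-box (Geo-style) access when built with use_index=True."""

        def __init__(self, root: str, use_index: bool = True) -> None:
            self.files = sorted(glob.glob(os.path.join(root, "*.tif")))
            self.index: list[Any] | None = None
            if use_index:
                # The expensive step the suggestion wants to make optional:
                # open every chip once just to record its bounds.
                self.index = []
                for path in self.files:
                    with rasterio.open(path) as src:
                        self.index.append(src.bounds)

        def __getitem__(self, query: Any) -> dict[str, Any]:
            # Type-dispatched getter: an int means file-index access; anything
            # else is treated as a (minx, miny, maxx, maxy) bounding box.
            if isinstance(query, int):
                return self._load(self.files[query])
            if self.index is None:
                raise ValueError("built with use_index=False; use int indices")
            # Linear scan for readability; a real version would use an R-tree.
            for path, b in zip(self.files, self.index):
                if not (query[2] < b.left or query[0] > b.right
                        or query[3] < b.bottom or query[1] > b.top):
                    return self._load(path)
            raise IndexError("query does not intersect any chip")

        def __len__(self) -> int:
            return len(self.files)

        def _load(self, path: str) -> dict[str, Any]:
            with rasterio.open(path) as src:
                return {"image": torch.from_numpy(src.read()).float()}

The point is only the dispatch: integer queries skip the spatial index entirely, so a plain sequential sampler works, while bounding-box queries keep GeoDataset semantics.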
-
Yes, iff all chips are the same size in the same CRS. If not, you'll need to random/center crop or resize all images to be the same size in order to collate them.
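For instance, a minimal sketch of a collate function that center-crops every sample to the smallest height/width in the batch before stacking; crop_collate is a hypothetical helper (stack_samples is TorchGeo's), and it assumes samples carry "image" and optionally "mask" tensors:

    from collections.abc import Iterable
    from typing import Any

    from torchgeo.datasets import stack_samples

    def crop_collate(samples: Iterable[dict[Any, Any]]) -> dict[Any, Any]:
        """Center-crop each sample to the smallest spatial size in the batch,
        then stack with TorchGeo's stack_samples."""
        samples = [dict(s) for s in samples]  # shallow copies
        h = min(s["image"].shape[-2] for s in samples)
        w = min(s["image"].shape[-1] for s in samples)
        for s in samples:
            for key in ("image", "mask"):
                if key in s:
                    top = (s[key].shape[-2] - h) // 2
                    left = (s[key].shape[-1] - w) // 2
                    s[key] = s[key][..., top : top + h, left : left + w]
        return stack_samples(samples)

Passing collate_fn=crop_collate to the DataLoader in place of stack_samples would then tolerate chips of varying size.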
You don't have to. That's why there is a NonGeoDataset version of LandCover.ai. The benefit of a GeoDataset is solely that you can combine it with any other GeoDataset. For example, you could create a dataset that includes LandCover.ai imagery and masks and combine it with Sentinel-2 imagery or a DEM to see how that impacts performance (see the sketch at the end of this reply). I don't personally see any other advantage of using GeoDataset instead of NonGeoDataset. If you're only interested in benchmarking, NonGeoDataset is better.

A hybrid approach that combined Geo and NonGeo would be interesting, although confusing (you would have to explain that this dataset sometimes requires a GeoSampler and sometimes doesn't) and a lot of code in a single class. But it would remove the need for duplicated classes for datasets like LandCover.ai, which would be nice. I haven't thought much in this direction yet. #409 is also a relevant discussion, and would negate most of the benefits of #1353.
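A sketch of the combination mentioned above, with placeholder paths; argument names (root vs. paths) vary between TorchGeo versions:

    from torch.utils.data import DataLoader
    from torchgeo.datasets import LandCoverAIGeo, Sentinel2, stack_samples
    from torchgeo.samplers import RandomGeoSampler

    # Two GeoDatasets; "&" builds an IntersectionDataset that only yields
    # samples where both datasets overlap, warped to a shared CRS/resolution.
    landcoverai = LandCoverAIGeo(root="data/landcoverai")
    sentinel2 = Sentinel2(paths="data/sentinel2")
    dataset = landcoverai & sentinel2

    # Sample random 256x256-pixel patches from the intersection.
    sampler = RandomGeoSampler(dataset, size=256, length=1000)
    dataloader = DataLoader(
        dataset, batch_size=8, sampler=sampler, collate_fn=stack_samples
    )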
-
Reading your reply, I just remembered that the LandCoverAI imagery is of varying resolution, and I haven't accounted for that. I'll check it out again later and mark your input as the answer assuming everything works fine (I guess it should; that was a pretty big oversight on my part). I can't really comment on the reprojection stuff in #409, because I'm not familiar with the warped files and VRTs that TorchGeo apparently uses. My use case is that I have a dataset that I have pre-chipped into patches for easier annotation, but I'm still experimenting with the patch size for training. In my mind, patch sizes smaller than the chip size should be handled by … I'm currently looking to implement this.
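One way this might look, purely as an assumption about the eventual implementation and not what the poster ended up doing: since the pre-chipped data is still a GeoDataset, RandomGeoSampler with a size smaller than the chips already yields sub-chip patches.

    from torch.utils.data import DataLoader
    from torchgeo.datasets import LandCoverAIGeo, stack_samples
    from torchgeo.samplers import RandomGeoSampler

    # Placeholder dataset; substitute any GeoDataset over the pre-chipped files.
    dataset = LandCoverAIGeo(root="data/landcoverai")

    # size is in pixels by default and is deliberately smaller than a chip.
    # RandomGeoSampler first picks a random chip from the index, then a random
    # window inside it, so patches never straddle chip boundaries; length sets
    # the number of samples drawn per epoch.
    sampler = RandomGeoSampler(dataset, size=128, length=2000)
    dataloader = DataLoader(
        dataset, batch_size=8, sampler=sampler, collate_fn=stack_samples
    )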