aggregate performance (#421)
Here is the result from the Profiler output:

[profiler output not preserved in this export]

Thanks, that's what I thought ;-)
See 8b52416 (this is still in a branch). This implements this:

```r
library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(stars)
# Loading required package: abind
buffers = read_sf("buffers.gpkg")
buffers = buffers[1, ] # use only one polygon
rasters = list.files("LC08_L1TP_190024_20200418_20200822_02_T1/",
                     pattern = "\\.TIF$", full.names = TRUE)

# from memory (RAM usage peak was over 50 GB)
ras = read_stars(rasters, along = 3, proxy = FALSE)
system.time(s <- st_extract(ras, buffers, FUN = mean))
#    user  system elapsed
#   0.056   0.000   0.056
# Warning message:
# In c.stars(list(LC08_L1TP_190024_20200418_20200822_02_T1_B1.TIF = c(9232.81542056075, :
#   along argument ignored; maybe you wanted to use st_redimension?

# from file
ras = read_stars(rasters, along = 3, proxy = TRUE)
# should there be a warning if 1 polygon was used?
system.time(a <- aggregate(ras, buffers, FUN = mean))
#    user  system elapsed
#   0.087   0.001   0.087
# Warning message:
# In c.stars(list(attr = c(9232.81542056075, 24792.0934579439, 23200.8761682243, :
#   along argument ignored; maybe you wanted to use st_redimension?
```
Thanks, I tried to test this on all polygons, but I get an error. This is probably a regression:

```r
library(sf)
library(stars) # v. 0.5.3
buffers = read_sf("data/vector/buffers.gpkg")
rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                     pattern = "\\.TIF$", full.names = TRUE)

ras = read_stars(rasters, along = 3, proxy = FALSE)
system.time(s <- st_extract(ras, buffers, FUN = mean))
#> Error in st_crop.stars(x, i, crop = crop, ...) :
#>   NA values in bounding box of y
#> Timing stopped at: 0.147 0.001 0.147

ras = read_stars(rasters, along = 3, proxy = TRUE)
system.time(a <- aggregate(ras, buffers, FUN = mean))
#> Error in st_crop.stars(x, i, crop = crop, ...) :
#>   NA values in bounding box of y
#> Timing stopped at: 25.06 9.011 34.08
```
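As an aside, an all-NA bounding box can also arise from empty geometries in the vector layer, so a quick sanity check on the input with sf (a sketch; not necessarily the cause here) is:

```r
library(sf)
# empty geometries have an all-NA bounding box, which st_crop() rejects
g = st_sfc(st_polygon(), st_point(c(0, 0)))
st_is_empty(g)  # TRUE FALSE
st_bbox(g[1])   # xmin, ymin, xmax, ymax are all NA
```

If `any(st_is_empty(buffers))` is `TRUE`, dropping those features before extracting avoids this class of error.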
I now merged the branch.
I didn't write it precisely. I meant: after the merge, I ran the tests again for all 1940 polygons and it's OK now. I noticed an over 8-fold reduction in processing time between aggregate() and st_extract() on in-memory data:

```r
library(sf)
library(stars)
buffers = read_sf("data/vector/buffers.gpkg") # 1940 polygons
rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                     pattern = "\\.TIF$", full.names = TRUE)
ras_mem = read_stars(rasters, along = 3, proxy = FALSE)
ras_proxy = read_stars(rasters, along = 3, proxy = TRUE)

# from memory (aggregate)
system.time(a <- aggregate(ras_mem, buffers, FUN = mean))
#>     user   system  elapsed
#> 1114.140  131.276 1246.046

# from memory (st_extract)
system.time(b <- st_extract(ras_mem, buffers, FUN = mean))
#>    user  system elapsed
#> 146.594   0.553 147.231

# from file (aggregate)
system.time(c <- aggregate(ras_proxy, buffers, FUN = mean))
#>    user  system elapsed
#> 265.855   2.739 268.719
```

I also include timings from other packages for comparison:

[timings table not preserved in this export]
Here is another interesting thing. It seems that the fastest way is to convert the stars object to a data frame and then aggregate by rasterized polygon IDs:

```r
library(sf)
library(stars)
buffers = read_sf("data/vector/buffers.gpkg") # 1940 polygons
rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                     pattern = "\\.TIF$", full.names = TRUE)
ras = read_stars(rasters, along = 3, proxy = FALSE)
ras_names = c("B1", "B10", "B11", "B2", "B3", "B4", "B5", "B6", "B7", "B9")
ras = st_set_dimensions(ras, 3, values = ras_names, names = "band")
names(ras) = "reflectance"

start = Sys.time()
df = data.frame(ras)
df = unstack(df, form = reflectance ~ band)
dest = st_as_stars(st_bbox(ras), dx = 30, values = NA_integer_)
buffers_ras = data.frame(st_rasterize(buffers, dest))
buffers_ras = as.integer(buffers_ras$ID)
agg = aggregate(df, list(ID = buffers_ras), mean)
Sys.time() - start
#> Time difference of 1.747108 mins
```
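The trick above generalizes: rasterize the polygons to integer IDs once, then group every pixel by its ID in a single pass, instead of cropping the raster once per polygon. A minimal sketch of that idea on synthetic data (base R only; all names are illustrative):

```r
set.seed(1)
vals  = matrix(runif(100), 10, 10)                     # stand-in for one raster band
zones = matrix(sample(1:3, 100, replace = TRUE), 10, 10) # stand-in for rasterized polygon IDs

# one pass over all pixels, one mean per zone ID
agg = aggregate(as.vector(vals), by = list(ID = as.vector(zones)), FUN = mean)
agg
```

The cost is proportional to the number of pixels regardless of how many polygons there are, which is why it scales well for 1940 buffers.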
Would it be possible to improve the performance of the aggregate() function, in particular the processing of data that is loaded into memory (it seems fine in the proxy case)? For the example below, processing a 10-layer raster (7771 x 7871 pixels) fully loaded into memory takes ~14 min for only one polygon. For proxy = TRUE it takes ~0.2 s. At the end, I included timings from other packages for reference.
Data (851 MB): https://drive.google.com/uc?id=1lzglfQJqlQh9OWT_-czc5L0hQ1AhoR8M&export=download
Buffers: https://github.com/kadyb/raster-benchmark/blob/main/data/vector/buffers.gpkg
Related issue: #153
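Since the thread compares timings from other packages, one commonly used fast path for polygon zonal means is exactextractr over a terra raster. A sketch, assuming those packages are installed and reusing the file paths from the examples above:

```r
library(terra)
library(exactextractr)
library(sf)
buffers = read_sf("data/vector/buffers.gpkg")
rasters = list.files("data/LC08_L1TP_190024_20200418_20200822_02_T1/",
                     pattern = "\\.TIF$", full.names = TRUE)
r = rast(rasters)                         # all bands as one SpatRaster
means = exact_extract(r, buffers, "mean") # one mean per band per polygon
```

exact_extract computes pixel coverage fractions per polygon, so it avoids the repeated per-polygon crop that dominates the in-memory aggregate() timings here.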