What is the bug?
When reading multiple rasters several times, the memory stays cached even after the script ends.
I was initially working with a large VRT and noticed that memory usage grew substantially while reading relatively small tiles (1024 px). I managed to reproduce the problem without the VRT by reading the same tile content twice: the first time, memory goes up and then comes back down once I close the dataset; the second time, the data stays in cache and I don't know why.
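One knob worth checking when investigating this is GDAL's raster block cache: GDAL keeps decoded blocks in an in-process cache whose budget defaults to a fraction of available RAM, and blocks are only released on eviction or process exit. A minimal sketch of inspecting and capping that budget (the 256 MB figure is an arbitrary example, not a recommendation):

    from osgeo import gdal

    # Report the current block cache budget in bytes
    print(gdal.GetCacheMax())

    # Cap the block cache at 256 MB for this process (example value)
    gdal.SetCacheMax(256 * 1024 * 1024)
    print(gdal.GetCacheMax())  # now 268435456

Lowering the cap should bound how much the second pass can retain, which helps distinguish expected block caching from an actual leak.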
Steps to reproduce the issue
I've made a simplified script of my use case which, for a list of TIFF files, reads them, performs some operations, and closes them.
import os

import numpy as np
from osgeo import gdal
from tqdm import tqdm


def colorimetric_global_raster_check_tile_gdal(raster_path: str) -> float:
    """Check the percentage of padded pixels in a raster using GDAL.

    Args:
        raster_path (str): Input raster path to analyze

    Returns:
        float: Percentage of padded pixels in the image
    """
    # Open the raster using GDAL
    raster = gdal.Open(raster_path)
    if raster is None:
        raise ValueError(f"Unable to open the raster {raster_path}.")

    amount_pixels_raster = raster.RasterXSize * raster.RasterYSize
    amount_of_white_padded_pixels = 0
    amount_of_black_padded_pixels = 0
    block_size = 1024

    for x in range(0, raster.RasterXSize, block_size):
        for y in range(0, raster.RasterYSize, block_size):
            gray_tile = raster.ReadAsArray(
                x,
                y,
                min(block_size, raster.RasterXSize - x),
                min(block_size, raster.RasterYSize - y),
            )
            # If the VRT/raster has multiple bands, average them
            # to create a grayscale image
            if len(gray_tile.shape) == 3:
                gray_tile = gray_tile.mean(axis=0)
            # Collect amount of white/black pixels
            amount_of_white_padded_pixels += np.sum(gray_tile > 245)
            amount_of_black_padded_pixels += np.sum(gray_tile < 10)
            gray_tile = None

    padded_percentage = (
        max(amount_of_black_padded_pixels, amount_of_white_padded_pixels)
        / amount_pixels_raster
    )
    raster.Close()
    return padded_percentage


if __name__ == "__main__":
    pcrs_list = [
        os.path.join("/data/rasters_dummy", tile)
        for tile in os.listdir("/data/rasters_dummy")
    ]
    res = []
    # FIRST RUN (EVERYTHING FINE)
    for path in tqdm(pcrs_list):
        res.append(colorimetric_global_raster_check_tile_gdal(raster_path=path))
    # SECOND RUN (EVERYTHING IS CACHED)
    for path in tqdm(pcrs_list):
        res.append(colorimetric_global_raster_check_tile_gdal(raster_path=path))
As I said, I was initially working with a fairly large VRT linking approximately 1000 tiles (4000 px × 4000 px) of ~45 MB each.
I cannot share the data I'm working with, but the issue can be reproduced with dummy data generated by the following script:
import os

import numpy as np
from osgeo import gdal, osr
from tqdm import tqdm  # For progress tracking

# Directory to save the rasters
output_dir = '/data/rasters_dummy'
os.makedirs(output_dir, exist_ok=True)

# Constants
width, height = 4000, 4000  # Dimensions of the rasters (full image size)
block_size = 1024           # Tile block size
num_rasters = 997           # Number of rasters to generate
num_bands = 3               # Number of bands
crs_epsg = 2154             # Coordinate Reference System (EPSG:2154)

# Affine transform parameters (dummy values, adjust as needed)
geotransform = [0, 1, 0, 0, 0, -1]  # Equivalent to rasterio's from_origin(0, 0, 1, 1)

# Generate rasters
for i in tqdm(range(num_rasters), desc="Generating rasters"):
    raster_filename = os.path.join(output_dir, f'raster_{i+1}.tif')

    # Create the driver for GeoTIFF format
    driver = gdal.GetDriverByName('GTiff')

    # Create the raster dataset with the specified width, height, and number of bands
    dataset = driver.Create(
        raster_filename,
        width,
        height,
        num_bands,
        gdal.GDT_Byte,
        options=['TILED=YES', 'BLOCKXSIZE=1024', 'BLOCKYSIZE=1024', 'INTERLEAVE=PIXEL'],
    )

    # Set CRS (Coordinate Reference System)
    srs = osr.SpatialReference()
    srs.ImportFromEPSG(crs_epsg)
    dataset.SetProjection(srs.ExportToWkt())

    # Set the affine transformation (geotransform)
    dataset.SetGeoTransform(geotransform)

    # Create a 3D array filled with zeros (for 3 bands)
    data = np.zeros((num_bands, height, width), dtype='uint8')

    # Write each band
    for band_idx in range(num_bands):
        band = dataset.GetRasterBand(band_idx + 1)
        band.WriteArray(data[band_idx, :, :])
        # Set NoData value
        band.SetNoDataValue(0)

    # Close the dataset (flushes to disk)
    dataset.FlushCache()
    dataset = None

print(f"{num_rasters} rasters created in '{output_dir}' directory.")
Versions and provenance
I'm running my code on an Amazon instance through a Docker container.
Here is the Dockerfile used to build the image:
FROM continuumio/miniconda3:4.12.0
RUN apt-get --allow-releaseinfo-change update -y && apt-get install -y \
cmake \
build-essential \
pip
RUN apt-get install -y libgl1
RUN conda install -c conda-forge gdal==3.9.2
COPY . /src
WORKDIR /src
RUN pip install -e .
Here are the dependencies I'm using:
"geopandas==0.14.3",
"pandarallel==1.6.5",
"numpy==1.26.4",
"pandas==2.2.0",
"tqdm==4.66.5",
Additional context
Here is a graphical visualisation of the memory usage on my machine. The first bump, at ~14:52, is the first run; right after, at 14:53:30, the second run keeps everything in cache, and even once it finishes nothing is released. To manually free the memory I have to re-write the rasters.
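To put numbers on that graph from inside the process, the growth can be sampled with the standard library alone. This is a generic measurement sketch, not part of my actual script; `ru_maxrss` is the peak resident set size, which the kernel reports in kilobytes on Linux:

    import resource

    def peak_rss_mb() -> float:
        # Peak resident set size of this process, in MB
        # (ru_maxrss is in KB on Linux, in bytes on macOS)
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

    before = peak_rss_mb()
    # ... the first and second pass over the raster list would go here ...
    after = peak_rss_mb()
    print(f"peak RSS grew by {after - before:.1f} MB between measurements")

Since `ru_maxrss` is a high-water mark it never decreases, so it captures how much each pass added, not how much was later released.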