
Caching issue when reading multiple rasters #10853

Open
MTh3399 opened this issue Sep 20, 2024 · 0 comments

What is the bug?

When reading multiple rasters several times in a row, everything stays in memory, even after the script ends.

Initially, while working with a large VRT, I noticed that my memory usage was growing a lot even though I was reading relatively small tiles (1024 px). I tried to reproduce the problem without the VRT, and I managed to do so by reading the same tile content twice. The first time, memory goes up and then back down once I close the dataset. The second time, the data stays cached in memory and I don't know why.
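To quantify this, I watch the resident set size of the process between runs. This is a minimal, Linux-only helper of my own (it parses VmRSS from /proc/self/status; the name rss_kib is mine, not a GDAL API):

```python
def rss_kib() -> int:
    """Return the current resident set size of this process, in KiB (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # The line looks like: "VmRSS:     123456 kB"
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found in /proc/self/status")


if __name__ == "__main__":
    print(rss_kib())
```

Calling rss_kib() before and after each loop makes the difference between the two runs visible without an external monitoring tool.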

Steps to reproduce the issue

I've made a simplified script of my use case which, for a list of TIFF files, reads them, performs some operations, and closes them.

import os
import numpy as np
from osgeo import gdal
from tqdm import tqdm


def colorimetric_global_raster_check_tile_gdal(raster_path: str) -> float:
    """
    Function which checks for a given raster the percentage of padded pixels in the image using GDAL

    Args:
        raster_path (str): Input raster path to analyze

    Returns:
        float: Percentage of padded pixels in the image
    """

    # Open the raster using GDAL
    raster = gdal.Open(raster_path)
    if raster is None:
        raise ValueError(f"Unable to open the raster {raster_path}.")

    amount_pixels_raster = raster.RasterXSize * raster.RasterYSize
    amount_of_white_padded_pixels = 0
    amount_of_black_padded_pixels = 0

    block_size = 1024

    for x in range(0, raster.RasterXSize, block_size):
        for y in range(0, raster.RasterYSize, block_size):

            gray_tile = raster.ReadAsArray(
                x,
                y,
                min(block_size, raster.RasterXSize - x),
                min(block_size, raster.RasterYSize - y),
            )

            # If the VRT/raster has multiple bands, average them to create a grayscale image
            if len(gray_tile.shape) == 3:
                gray_tile = gray_tile.mean(axis=0)

            # Collect amount of white/black pixels
            amount_of_white_padded_pixels += np.sum(
                gray_tile > 245
            )
            amount_of_black_padded_pixels += np.sum(
                gray_tile < 10
            )

            gray_tile = None

    padded_percentage = (
        max(amount_of_black_padded_pixels, amount_of_white_padded_pixels)
        / amount_pixels_raster
    )

    raster.Close()

    return padded_percentage


if __name__ == "__main__":

    pcrs_list = [
        os.path.join("/data/rasters_dummy", tile) for tile in os.listdir("/data/rasters_dummy")
    ]

    res = []
    
    # FIRST RUN (EVERYTHING FINE)
    for path in tqdm(pcrs_list):
        res.append(colorimetric_global_raster_check_tile_gdal(raster_path=path))
    
    # SECOND RUN (EVERYTHING IS CACHED)
    for path in tqdm(pcrs_list):
        res.append(colorimetric_global_raster_check_tile_gdal(raster_path=path))

As I said, I was initially working with a fairly large VRT linking approximately 1000 tiles (4000 px × 4000 px) of ~45 MB each.
I cannot share the data I'm working with, but the issue can be reproduced with dummy data generated by the following script:

import os
import numpy as np
from osgeo import gdal, osr
from tqdm import tqdm  # For progress tracking

# Directory to save the rasters
output_dir = '/data/rasters_dummy'
os.makedirs(output_dir, exist_ok=True)

# Constants
width, height = 4000, 4000  # Dimensions of the rasters (full image size)
block_size = 1024  # Tile block size
num_rasters = 997  # Number of rasters to generate
num_bands = 3  # Number of bands
crs_epsg = 2154  # Coordinate Reference System (EPSG:2154)

# Affine transform parameters (dummy values, adjust as needed)
geotransform = [0, 1, 0, 0, 0, -1]  # Equivalent to rasterio's from_origin(0, 0, 1, 1)

# Generate rasters
for i in tqdm(range(num_rasters), desc="Generating rasters"):
    raster_filename = os.path.join(output_dir, f'raster_{i+1}.tif')

    # Create the driver for GeoTIFF format
    driver = gdal.GetDriverByName('GTiff')

    # Create the raster dataset with the specified width, height, and number of bands
    dataset = driver.Create(
        raster_filename,
        width,
        height,
        num_bands,
        gdal.GDT_Byte,
        options=['TILED=YES', 'BLOCKXSIZE=1024', 'BLOCKYSIZE=1024', 'INTERLEAVE=PIXEL'],
    )

    # Set CRS (Coordinate Reference System)
    srs = osr.SpatialReference()
    srs.ImportFromEPSG(crs_epsg)
    dataset.SetProjection(srs.ExportToWkt())

    # Set the affine transformation (geotransform)
    dataset.SetGeoTransform(geotransform)

    # Create a 3D array filled with zeros (for 3 bands)
    data = np.zeros((num_bands, height, width), dtype='uint8')

    # Write each band
    for band_idx in range(num_bands):
        band = dataset.GetRasterBand(band_idx + 1)
        band.WriteArray(data[band_idx, :, :])

        # Set NoData value
        band.SetNoDataValue(0)

    # Close the dataset (flushes to disk)
    dataset.FlushCache()
    dataset = None

print(f"{num_rasters} rasters created in '{output_dir}' directory.")

Versions and provenance

I'm running my code on an Amazon instance inside a Docker container.
Here is the Dockerfile used to build the image:

FROM continuumio/miniconda3:4.12.0

RUN apt-get --allow-releaseinfo-change update -y && apt-get install -y \
    cmake \
    build-essential \
    pip
RUN apt-get install -y libgl1

RUN conda install -c conda-forge gdal==3.9.2

COPY . /src
WORKDIR /src

RUN pip install -e .

Here are the dependencies I'm using:
"geopandas==0.14.3",
"pandarallel==1.6.5",
"numpy==1.26.4",
"pandas==2.2.0",
"tqdm==4.66.5",

Additional context

Here is a graphical visualisation of the memory usage on my machine. The first bump at ~14:52 is the first run; right after, at 14:53:30, comes the second run, which keeps everything in cache, and even after it finishes nothing is released. To manually free the memory I have to rewrite the rasters.

[image: memory usage over time, showing the two runs]
