Refactor reading TIFF metadata #82

kylebarron · 2025-03-27T16:21:36Z

This takes heavy inspiration from the Parquet crate. See ParquetMetaDataReader, AsyncFileReader, MetadataFetch.

Change list

Add TiffMetadataReader and ImageFileDirectoryReader metadata readers
- These work on a generic F: MetadataFetch, which only requires accessing a byte range from the remote source.
- Enables users to avoid reading all IFDs. If the user knows they only need the full resolution data from the first IFD, then they only need to parse that first IFD. Or if all you need is the geospatial metadata/projection information (in a GeoTIFF case) then I think you can just parse the first IFD.
Remove get_metadata_bytes from AsyncFileReader trait. AsyncFileReader is used only for reading actual image data.
Rename get_image_bytes to get_bytes and get_image_byte_ranges to get_byte_ranges.
Use get_bytes as the default implementation for MetadataFetch, used as the default get_metadata
Moves PrefetchReader to PrefetchMetadataFetch, as it only concerns reading metadata.
Remove ImageFileDirectories type, replacing it with just a Vec<ImageFileDirectory>.
Change default Python prefetch to 32kb from 16kb

cc @feefladder this gives much deeper access into how we load the metadata.

Closes #29 (even though you have to manually page with read_next_ifd)

feefladder · 2025-03-27T17:43:41Z

oof, that's a lot of changes... I glossed over them, but so far it's a bit much to fully understand. I was thinking about #29 a bit as well (but didn't write anything down), so here come those thoughts :)

Add TiffMetadataReader and ImageFileDirectoryReader metadata readers

These work on a generic F: MetadataFetch, which only requires accessing a byte range from the remote source.

Enables users to avoid reading all IFDs. If the user knows they only need the full resolution data from the first IFD, then they only need to parse that first IFD. Or if all you need is the geospatial metadata/projection information (in a GeoTIFF case) then I think you can just parse the first IFD.

I think if we want to read only geo information (in a COG), we'd want to skip the tile info of the first ifd, because:

geo info is contained within the prefetch range
tile information of the first ifd is the biggest metadata fetch
- all sub-resolution ifd TileOffsets+TileByteCounts combined are less than TileOffsets0/2+4*TileOffsets0.isrqt() aka ~half the TileOffsets0
Read tag data in its entirety from AsyncFileReader. #81+Aggresive caching in PrefetchReader #74 still works on later ifds (e.g. fetching overviews 5-end)

What I was thinking of:

for the first ifd, filter_ifd(keep_tags: &[Tag]), which reads geo data if pub const GEO_TAGS = [Tag::ModelTiepointTag, Tag::GeoAsciiParamsTag,Tag::GeoDoubleParamsTag] is supplied e.g. metadata_reader.filter_ifd(&GEO_TAGS)
for later ifds: skip_ifd(), which just skips to the next ifd based on the tag_count * tag_size (not really necessary, but skips reading all tags)
add a field ifd_location to IFD
- allows for further initializing an ifd that only has e.g. geo data read
- allows for cycle detection in reading all ifds
why re-create the reader_with_offset if read_tag_value can just also not mistreat the cursor (Read tag data in its entirety from AsyncFileReader. #81) (will add comment)

All other structural changes look mainly organizational to me. It's just a bit intermingled with re-organizing and making an async iterator, so hard for me to judge. I'll be a bit sad about having to reorganize #81 again though :')

I like:

decoupling the metadata stuff.

In short: looks good, will add a comment.

src/metadata/reader.rs

feefladder · 2025-03-27T17:48:36Z

src/metadata/reader.rs

+    };
+
+    let value_byte_length = count.checked_mul(tag_size).unwrap();
+


this is approximately where a decoupled cursor would be nice, since at this point we can know if the value fits in the offset field

kylebarron · 2025-03-27T17:50:58Z

As an example for I think what you want, refer to INSERT YOUR LOGIC HERE

async fn custom_metadata_read<F: MetadataFetch>(fetch: F) {
    let tiff_reader = TiffMetadataReader::try_open(&fetch).await.unwrap();
    let mut next_ifd_offset = tiff_reader.next_ifd_offset;
    while let Some(ifd_offset) = next_ifd_offset {
        let ifd_reader = ImageFileDirectoryReader::open(
            &fetch,
            ifd_offset,
            tiff_reader.bigtiff,
            tiff_reader.endianness,
        )
        .await
        .unwrap();

        let mut tags = HashMap::with_capacity(ifd_reader.tag_count as _);
        for tag_idx in 0..ifd_reader.tag_count {
            let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
            tags.insert(tag, value);

            // INSERT HERE
            if matches!(tag, Tag::TileOffsets) {
                // Logic for additional pre-buffering
            }
        }

        next_ifd_offset = ifd_reader.finish(&fetch).await.unwrap();
    }
    todo!()
}

This way you can totally customize the physical behavior of requests based on any tag value.

python/src/reader.rs

feefladder · 2025-03-27T18:00:02Z

I really don't understand atm how custom_metadata_fetch could easily be used with e.g. ReqwestReader

feefladder · 2025-03-27T18:03:12Z

and I like short names, think IFD is well enough understood/ImageFileDirectory not really illuminating so would prefer IFDReader over ImageFileDirectoryReader

kylebarron · 2025-03-27T18:03:34Z

I really don't understand atm how custom_metadata_fetch could easily be used with e.g. ReqwestReader

pub struct MyReader {
    // inner_reqwest_client: ...,
    // buffer: ...
}

impl MyReader {
    async fn manually_pre_buffer(&self, range: Range<u64>) {
        todo!()
    }
}

impl MetadataFetch for MyReader {
    fn fetch(
        &self,
        range: std::ops::Range<u64>,
    ) -> futures::future::BoxFuture<'_, AsyncTiffResult<Bytes>> {
        // Cache this if needed
        todo!()
    }
}

// #[tokio::test]
async fn read_by_tags(fetch: MyReader) {
    let tiff_reader = TiffMetadataReader::try_open(&fetch).await.unwrap();
    let mut next_ifd_offset = tiff_reader.next_ifd_offset;
    while let Some(ifd_offset) = next_ifd_offset {
        let ifd_reader = ImageFileDirectoryReader::open(
            &fetch,
            ifd_offset,
            tiff_reader.bigtiff,
            tiff_reader.endianness,
        )
        .await
        .unwrap();

        let mut tags = HashMap::with_capacity(ifd_reader.tag_count as _);
        for tag_idx in 0..ifd_reader.tag_count {
            let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
            tags.insert(tag, value);

            // INSERT HERE
            if matches!(tag, Tag::TileOffsets) {
                // Logic for additional pre-buffering
                fetch.manually_pre_buffer(0..100000);
            }
        }

        next_ifd_offset = ifd_reader.finish(&fetch).await.unwrap();
    }
    todo!()
}

feefladder · 2025-03-27T18:16:33Z

 for tag_idx in 0..ifd_reader.tag_count {
            let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
            tags.insert(tag, value);

            // INSERT HERE
            if matches!(tag, Tag::TileOffsets) {
                // Logic for additional pre-buffering
                fetch.manually_pre_buffer(0..100000);
            }
        }

this won't work, since read_tag already reads the tag, and subsequent tags are much smaller than the first encountered TileOffsets, so would need to split get_tag_name_and_count with read_tag_value for sensible buffering to happen.

I don't think that's a good idea though

feefladder · 2025-03-27T18:46:09Z

I just wanted something like:

struct MyMetaReader<'a> {
  reader: Arc<dyn AsyncFileReader>
  prefetch: Bytes,
  tile_info_buffer: Mutex<(Range<u64>,Bytes)> // or OnceLock?
}

impl async_tiff::Fetch for MyMetaReader {
  fn get_bytes(range: Range<u64>) {
    // manually pre-fetch and check buffer, solely based on the range size and whether it fits in the prefetch range
    if range.end < self.buffer.len() {
      let usize_range = range.start as usize..range.end as usize;
      async {Ok(self.buffer.slice(usize_range))}
    } else {
      let range_len = range.end - range.start;
      let estimate = (2*range_len + 4*range_len.isqrt()).max(self.buffer.len());
      let fetch_range = range.start..range.start + estimate;
      async move {
        let res = self.reader.get_bytes(fetch_range).await?;
        self.tile_info_buffer = (fetch_range, res);
        res.slice(0..range_len);
      }.boxed()
    }
  }
}

async fn main() {
  let meta_reader = MetadataReader::try_open(href) // gives an async iterator
  let ifds = Vec::new();
  let first_ifd = meta_reader.filter_ifd(GEO_TAGS).await?; // get geo data
  // calculate how many overviews there are based on first ifd information and geo stuff
  let skip_ifds = todo!()
  for _ in 0..skip_ifds {
    ifds.push(meta_reader.skip_ifd().await?); // skip ifds
  }
  ifds.push(meta_reader.read_ifd().await?); // get the overview we want
}

but now its also stateless or something... Or are there like 2 levels to the api I don't quite get? Do we have an iterator over ifds or over tags? or both?

geospatial-jeff · 2025-03-27T18:52:40Z

Is the additional complexity proposed in this PR worth it?

S3 best practices recommend a minimum read size of 8-16MB, which is already substantially larger than most COG headers in the wild. With that in mind, does it really make sense to ever not fetch and decode the entire header, even if the caller only needs 1 IFD?

@feefladder I'm curious how big are your COG headers?

feefladder · 2025-03-27T18:55:39Z

around like 5-8 MB, but the estimate is really good, over-estimating by at most like 3 kB, and the difference between only a 16kB prefetch and a 5MB prefetch is quite noticeable.

But that's about #81, not this PR actually

I don't really understand what is all going on here

kylebarron · 2025-03-27T18:57:45Z

this won't work, since read_tag already reads the tag

Sure, so instead of calling read_tag, you can vendor these three lines of code to read the tag yourself.

let tag_offset =
    self.ifd_start_offset + self.tag_count_byte_size + (self.ifd_entry_byte_size * tag_idx);
let mut cursor = MetadataCursor::new_with_offset(fetch, endianness, tag_offset);
let tag_name = Tag::from_u16_exhaustive(cursor.read_u16().await?);

feefladder · 2025-03-27T19:01:44Z

I think the only noticeable header skip benefit would be in skipping the tag offsets+byte_counts of the first (few) ifds.

kylebarron · 2025-03-27T19:13:32Z

All other structural changes look mainly organizational to me. It's just a bit intermingled with re-organizing and making an async iterator, so hard for me to judge. I'll be a bit sad about having to reorganize #81 again though :')

I don't think I want to merge #81. As it is the code is as close to a direct port of the upstream tiff crate as possible, which should help with maintainability.

All feedback I've gotten from @vincentsarago and @geospatial-jeff, two COG experts, a prefetch with an exponential buffer should be enough for our needs and should be a reasonable default.

So I don't want to refactor the code here to integrate complex caching or buffering native to the TIFF reader. Instead, I want to export the necessary core APIs from async_tiff for you to be able to handle the particular needs of your fetching in your own code.

feefladder · 2025-03-27T19:39:58Z

I think this PR adds a lot of complexity while not really solving the problem of special-casing TileOffsets and TileByteCounts.

If you think 4 match blocks that change cursor state to anywhere in the tiff is more maintainable than 2 match blocks that set the cursor at the end of the tag offset field, sure don't merge #81.

I'm sorry, I don't know what happened. I haven't been nice to you and I think the current state of things is not better than it could have been.

kylebarron · 2025-03-27T19:57:22Z

If you think 4 match blocks that change cursor state to anywhere in the tiff is more maintainable than 2 match blocks that set the cursor at the end of the tag offset field, sure don't merge #81.

Yes, I think it's more maintainable because it's exactly the same as the upstream code. I don't want to have to think about the specifics of tag parsing. The code as it is just works, and even reviewing your changes is really hard because I would have to research the tag structure and validate that it's correct.

You seem to have an extremely specific use case that I and no one I know share. As it stands this entire PR exists to create some low-level APIs that you could use to customize reading in your own code.

I might still go ahead with this PR because it makes it easier to support externalized and cached COG metadata, which @geospatial-jeff was interested in exploring.

Use file APIs directly

kylebarron added 4 commits March 27, 2025 11:41

Refactor reading TIFF metadata

2a0c3be

Fix Python bindings

a1d990b

Fix clippy

d7c3076

Don't use stateful cursor when reading tags

d3d19a6

kylebarron mentioned this pull request Mar 27, 2025

Read tag data in its entirety from AsyncFileReader. #81

Closed

Refactor ImageFileDirectoryReader to support reading individual tags

47b095e

feefladder reviewed Mar 27, 2025

View reviewed changes

src/metadata/reader.rs Outdated Show resolved Hide resolved

feefladder reviewed Mar 27, 2025

View reviewed changes

Expose next_ifd_offset

650cca1

feefladder reviewed Mar 27, 2025

View reviewed changes

python/src/reader.rs Outdated Show resolved Hide resolved

Don't make ImageFileDirectoryReader stateful

60dbc82

Remove get_metadata from AsyncFileReader

8fca88e

Use file APIs directly

kylebarron mentioned this pull request Mar 27, 2025

Exponential overflow header buffer #83

Open

kylebarron added 3 commits March 27, 2025 18:04

Remove get_metadata impl from reader

9029486

Rename to PrefetchBuffer

c0825b1

Fix doctest

04e871b

kylebarron merged commit 9f8314c into main Mar 28, 2025
6 checks passed

kylebarron deleted the kyle/refactor-reading-metadata branch March 28, 2025 18:12

feefladder mentioned this pull request Jun 12, 2025

added predictors #86

Merged

		};

		let value_byte_length = count.checked_mul(tag_size).unwrap();

Refactor reading TIFF metadata #82

Refactor reading TIFF metadata #82

Conversation

kylebarron commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Change list

Uh oh!

feefladder commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

feefladder Mar 27, 2025

Choose a reason for hiding this comment

Uh oh!

kylebarron commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

feefladder commented Mar 27, 2025

Uh oh!

feefladder commented Mar 27, 2025

Uh oh!

kylebarron commented Mar 27, 2025

Uh oh!

feefladder commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

feefladder commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geospatial-jeff commented Mar 27, 2025

Uh oh!

feefladder commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kylebarron commented Mar 27, 2025

Uh oh!

feefladder commented Mar 27, 2025

Uh oh!

kylebarron commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

feefladder commented Mar 27, 2025

Uh oh!

kylebarron commented Mar 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kylebarron commented Mar 27, 2025 •

edited

Loading

feefladder commented Mar 27, 2025 •

edited

Loading

kylebarron commented Mar 27, 2025 •

edited

Loading

feefladder commented Mar 27, 2025 •

edited

Loading

feefladder commented Mar 27, 2025 •

edited

Loading

feefladder commented Mar 27, 2025 •

edited

Loading

kylebarron commented Mar 27, 2025 •

edited

Loading

kylebarron commented Mar 27, 2025 •

edited

Loading