Skip to content

Refactor reading TIFF metadata #82

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 28, 2025
Merged

Conversation

kylebarron
Copy link
Member

@kylebarron kylebarron commented Mar 27, 2025

This takes heavy inspiration from the Parquet crate. See ParquetMetaDataReader, AsyncFileReader, MetadataFetch.

Change list

  • Add TiffMetadataReader and ImageFileDirectoryReader metadata readers
    • These work on a generic F: MetadataFetch, which only requires accessing a byte range from the remote source.
    • Enables users to avoid reading all IFDs. If the user knows they only need the full resolution data from the first IFD, then they only need to parse that first IFD. Or if all you need is the geospatial metadata/projection information (in a GeoTIFF case) then I think you can just parse the first IFD.
  • Remove get_metadata_bytes from AsyncFileReader trait. AsyncFileReader is used only for reading actual image data.
  • Rename get_image_bytes to get_bytes and get_image_byte_ranges to get_byte_ranges.
  • Use get_bytes as the default implementation for MetadataFetch, used as the default get_metadata
  • Moves PrefetchReader to PrefetchMetadataFetch, as it only concerns reading metadata.
  • Remove ImageFileDirectories type, replacing it with just a Vec<ImageFileDirectory>.
  • Change default Python prefetch to 32kb from 16kb

cc @feefladder this gives much deeper access into how we load the metadata.

Closes #29 (even though you have to manually page with read_next_ifd)

@feefladder
Copy link
Contributor

feefladder commented Mar 27, 2025

oof, that's a lot of changes... I glossed over them, but so far it's a bit much to fully understand. I was thinking about #29 a bit as well (but didn't write anything down), so here come those thoughts :)

  • Add TiffMetadataReader and ImageFileDirectoryReader metadata readers

    • These work on a generic F: MetadataFetch, which only requires accessing a byte range from the remote source.
    • Enables users to avoid reading all IFDs. If the user knows they only need the full resolution data from the first IFD, then they only need to parse that first IFD. Or if all you need is the geospatial metadata/projection information (in a GeoTIFF case) then I think you can just parse the first IFD.

I think if we want to read only geo information (in a COG), we'd want to skip the tile info of the first ifd, because:

What I was thinking of:

  • for the first ifd, filter_ifd(keep_tags: &[Tag]), which reads geo data if pub const GEO_TAGS = [Tag::ModelTiepointTag, Tag::GeoAsciiParamsTag,Tag::GeoDoubleParamsTag] is supplied e.g. metadata_reader.filter_ifd(&GEO_TAGS)

  • for later ifds: skip_ifd(), which just skips to the next ifd based on the tag_count * tag_size (not really necessary, but skips reading all tags)

  • add a field ifd_location to IFD

    • allows for further initializing an ifd that only has e.g. geo data read
    • allows for cycle detection in reading all ifds
  • why re-create the reader_with_offset if read_tag_value can just also not mistreat the cursor (Read tag data in its entirety from AsyncFileReader.  #81) (will add comment)

All other structural changes look mainly organizational to me. It's just a bit intermingled with re-organizing and making an async iterator, so hard for me to judge. I'll be a bit sad about having to reorganize #81 again though :')

I like:

  • decoupling the metadata stuff.

In short: looks good, will add a comment.

};

let value_byte_length = count.checked_mul(tag_size).unwrap();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is approximately where a decoupled cursor would be nice, since at this point we can know if the value fits in the offset field

@kylebarron
Copy link
Member Author

kylebarron commented Mar 27, 2025

As an example for I think what you want, refer to INSERT YOUR LOGIC HERE

async fn custom_metadata_read<F: MetadataFetch>(fetch: F) {
    let tiff_reader = TiffMetadataReader::try_open(&fetch).await.unwrap();
    let mut next_ifd_offset = tiff_reader.next_ifd_offset;
    while let Some(ifd_offset) = next_ifd_offset {
        let ifd_reader = ImageFileDirectoryReader::open(
            &fetch,
            ifd_offset,
            tiff_reader.bigtiff,
            tiff_reader.endianness,
        )
        .await
        .unwrap();

        let mut tags = HashMap::with_capacity(ifd_reader.tag_count as _);
        for tag_idx in 0..ifd_reader.tag_count {
            let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
            tags.insert(tag, value);

            // INSERT HERE
            if matches!(tag, Tag::TileOffsets) {
                // Logic for additional pre-buffering
            }
        }

        next_ifd_offset = ifd_reader.finish(&fetch).await.unwrap();
    }
    todo!()
}

This way you can totally customize the physical behavior of requests based on any tag value.

@feefladder
Copy link
Contributor

I really don't understand atm how custom_metadata_fetch could easily be used with e.g. ReqwestReader

@feefladder
Copy link
Contributor

and I like short names, think IFD is well enough understood/ImageFileDirectory not really illuminating so would prefer IFDReader over ImageFileDirectoryReader

@kylebarron
Copy link
Member Author

I really don't understand atm how custom_metadata_fetch could easily be used with e.g. ReqwestReader

pub struct MyReader {
    // inner_reqwest_client: ...,
    // buffer: ...
}

impl MyReader {
    async fn manually_pre_buffer(&self, range: Range<u64>) {
        todo!()
    }
}

impl MetadataFetch for MyReader {
    fn fetch(
        &self,
        range: std::ops::Range<u64>,
    ) -> futures::future::BoxFuture<'_, AsyncTiffResult<Bytes>> {
        // Cache this if needed
        todo!()
    }
}

// #[tokio::test]
async fn read_by_tags(fetch: MyReader) {
    let tiff_reader = TiffMetadataReader::try_open(&fetch).await.unwrap();
    let mut next_ifd_offset = tiff_reader.next_ifd_offset;
    while let Some(ifd_offset) = next_ifd_offset {
        let ifd_reader = ImageFileDirectoryReader::open(
            &fetch,
            ifd_offset,
            tiff_reader.bigtiff,
            tiff_reader.endianness,
        )
        .await
        .unwrap();

        let mut tags = HashMap::with_capacity(ifd_reader.tag_count as _);
        for tag_idx in 0..ifd_reader.tag_count {
            let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
            tags.insert(tag, value);

            // INSERT HERE
            if matches!(tag, Tag::TileOffsets) {
                // Logic for additional pre-buffering
                fetch.manually_pre_buffer(0..100000);
            }
        }

        next_ifd_offset = ifd_reader.finish(&fetch).await.unwrap();
    }
    todo!()
}

@feefladder
Copy link
Contributor

feefladder commented Mar 27, 2025

 for tag_idx in 0..ifd_reader.tag_count {
            let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
            tags.insert(tag, value);

            // INSERT HERE
            if matches!(tag, Tag::TileOffsets) {
                // Logic for additional pre-buffering
                fetch.manually_pre_buffer(0..100000);
            }
        }

this won't work, since read_tag already reads the tag, and subsequent tags are much smaller than the first encountered TileOffsets, so would need to split get_tag_name_and_count with read_tag_value for sensible buffering to happen.

I don't think that's a good idea though

@feefladder
Copy link
Contributor

feefladder commented Mar 27, 2025

I just wanted something like:

struct MyMetaReader<'a> {
  reader: Arc<dyn AsyncFileReader>
  prefetch: Bytes,
  tile_info_buffer: Mutex<(Range<u64>,Bytes)> // or OnceLock?
}

impl async_tiff::Fetch for MyMetaReader {
  fn get_bytes(range: Range<u64>) {
    // manually pre-fetch and check buffer, solely based on the range size and whether it fits in the prefetch range
    if range.end < self.buffer.len() {
      let usize_range = range.start as usize..range.end as usize;
      async {Ok(self.buffer.slice(usize_range))}
    } else {
      let range_len = range.end - range.start;
      let estimate = (2*range_len + 4*range_len.isqrt()).max(self.buffer.len());
      let fetch_range = range.start..range.start + estimate;
      async move {
        let res = self.reader.get_bytes(fetch_range).await?;
        self.tile_info_buffer = (fetch_range, res);
        res.slice(0..range_len);
      }.boxed()
    }
  }
}

async fn main() {
  let meta_reader = MetadataReader::try_open(href) // gives an async iterator
  let ifds = Vec::new();
  let first_ifd = meta_reader.filter_ifd(GEO_TAGS).await?; // get geo data
  // calculate how many overviews there are based on first ifd information and geo stuff
  let skip_ifds = todo!()
  for _ in 0..skip_ifds {
    ifds.push(meta_reader.skip_ifd().await?); // skip ifds
  }
  ifds.push(meta_reader.read_ifd().await?); // get the overview we want
}

but now its also stateless or something... Or are there like 2 levels to the api I don't quite get? Do we have an iterator over ifds or over tags? or both?

@geospatial-jeff
Copy link

Is the additional complexity proposed in this PR worth it?

S3 best practices recommend a minimum read size of 8-16MB, which is already substantially larger than most COG headers in the wild. With that in mind, does it really make sense to ever not fetch and decode the entire header, even if the caller only needs 1 IFD?

@feefladder I'm curious how big are your COG headers?

@feefladder
Copy link
Contributor

feefladder commented Mar 27, 2025

around like 5-8 MB, but the estimate is really good, over-estimating by at most like 3 kB, and the difference between only a 16kB prefetch and a 5MB prefetch is quite noticeable.

But that's about #81, not this PR actually

I don't really understand what is all going on here

@kylebarron
Copy link
Member Author

this won't work, since read_tag already reads the tag

Sure, so instead of calling read_tag, you can vendor these three lines of code to read the tag yourself.

let tag_offset =
    self.ifd_start_offset + self.tag_count_byte_size + (self.ifd_entry_byte_size * tag_idx);
let mut cursor = MetadataCursor::new_with_offset(fetch, endianness, tag_offset);
let tag_name = Tag::from_u16_exhaustive(cursor.read_u16().await?);

@feefladder
Copy link
Contributor

I think the only noticeable header skip benefit would be in skipping the tag offsets+byte_counts of the first (few) ifds.

@kylebarron
Copy link
Member Author

kylebarron commented Mar 27, 2025

All other structural changes look mainly organizational to me. It's just a bit intermingled with re-organizing and making an async iterator, so hard for me to judge. I'll be a bit sad about having to reorganize #81 again though :')

I don't think I want to merge #81. As it is the code is as close to a direct port of the upstream tiff crate as possible, which should help with maintainability.

All feedback I've gotten from @vincentsarago and @geospatial-jeff, two COG experts, a prefetch with an exponential buffer should be enough for our needs and should be a reasonable default.

So I don't want to refactor the code here to integrate complex caching or buffering native to the TIFF reader. Instead, I want to export the necessary core APIs from async_tiff for you to be able to handle the particular needs of your fetching in your own code.

@feefladder
Copy link
Contributor

I think this PR adds a lot of complexity while not really solving the problem of special-casing TileOffsets and TileByteCounts.

If you think 4 match blocks that change cursor state to anywhere in the tiff is more maintainable than 2 match blocks that set the cursor at the end of the tag offset field, sure don't merge #81.

I'm sorry, I don't know what happened. I haven't been nice to you and I think the current state of things is not better than it could have been.

@kylebarron
Copy link
Member Author

kylebarron commented Mar 27, 2025

If you think 4 match blocks that change cursor state to anywhere in the tiff is more maintainable than 2 match blocks that set the cursor at the end of the tag offset field, sure don't merge #81.

Yes, I think it's more maintainable because it's exactly the same as the upstream code. I don't want to have to think about the specifics of tag parsing. The code as it is just works, and even reviewing your changes is really hard because I would have to research the tag structure and validate that it's correct.

You seem to have an extremely specific use case that I and no one I know share. As it stands this entire PR exists to create some low-level APIs that you could use to customize reading in your own code.

I might still go ahead with this PR because it makes it easier to support externalized and cached COG metadata, which @geospatial-jeff was interested in exploring.

@kylebarron kylebarron merged commit 9f8314c into main Mar 28, 2025
6 checks passed
@kylebarron kylebarron deleted the kyle/refactor-reading-metadata branch March 28, 2025 18:12
@feefladder feefladder mentioned this pull request Jun 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make IFD reading an async iterator
3 participants