-
Notifications
You must be signed in to change notification settings - Fork 2
Refactor reading TIFF metadata #82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
oof, that's a lot of changes... I glossed over them, but so far it's a bit much to fully understand. I was thinking about #29 a bit as well (but didn't write anything down), so here come those thoughts :)
I think if we want to read only geo information (in a COG), we'd want to skip the tile info of the first ifd, because:
What I was thinking of:
All other structural changes look mainly organizational to me. It's just a bit intermingled with re-organizing and making an async iterator, so hard for me to judge. I'll be a bit sad about having to reorganize #81 again though :') I like:
In short: looks good, will add a comment. |
}; | ||
|
||
let value_byte_length = count.checked_mul(tag_size).unwrap(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is approximately where a decoupled cursor would be nice, since at this point we can know if the value fits in the offset field
As an example for I think what you want, refer to async fn custom_metadata_read<F: MetadataFetch>(fetch: F) {
let tiff_reader = TiffMetadataReader::try_open(&fetch).await.unwrap();
let mut next_ifd_offset = tiff_reader.next_ifd_offset;
while let Some(ifd_offset) = next_ifd_offset {
let ifd_reader = ImageFileDirectoryReader::open(
&fetch,
ifd_offset,
tiff_reader.bigtiff,
tiff_reader.endianness,
)
.await
.unwrap();
let mut tags = HashMap::with_capacity(ifd_reader.tag_count as _);
for tag_idx in 0..ifd_reader.tag_count {
let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
tags.insert(tag, value);
// INSERT HERE
if matches!(tag, Tag::TileOffsets) {
// Logic for additional pre-buffering
}
}
next_ifd_offset = ifd_reader.finish(&fetch).await.unwrap();
}
todo!()
} This way you can totally customize the physical behavior of requests based on any tag value. |
I really don't understand atm how |
and I like short names, think IFD is well enough understood/ |
pub struct MyReader {
// inner_reqwest_client: ...,
// buffer: ...
}
impl MyReader {
async fn manually_pre_buffer(&self, range: Range<u64>) {
todo!()
}
}
impl MetadataFetch for MyReader {
fn fetch(
&self,
range: std::ops::Range<u64>,
) -> futures::future::BoxFuture<'_, AsyncTiffResult<Bytes>> {
// Cache this if needed
todo!()
}
}
// #[tokio::test]
async fn read_by_tags(fetch: MyReader) {
let tiff_reader = TiffMetadataReader::try_open(&fetch).await.unwrap();
let mut next_ifd_offset = tiff_reader.next_ifd_offset;
while let Some(ifd_offset) = next_ifd_offset {
let ifd_reader = ImageFileDirectoryReader::open(
&fetch,
ifd_offset,
tiff_reader.bigtiff,
tiff_reader.endianness,
)
.await
.unwrap();
let mut tags = HashMap::with_capacity(ifd_reader.tag_count as _);
for tag_idx in 0..ifd_reader.tag_count {
let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
tags.insert(tag, value);
// INSERT HERE
if matches!(tag, Tag::TileOffsets) {
// Logic for additional pre-buffering
fetch.manually_pre_buffer(0..100000);
}
}
next_ifd_offset = ifd_reader.finish(&fetch).await.unwrap();
}
todo!()
} |
for tag_idx in 0..ifd_reader.tag_count {
let (tag, value) = ifd_reader.read_tag(&fetch, tag_idx).await.unwrap();
tags.insert(tag, value);
// INSERT HERE
if matches!(tag, Tag::TileOffsets) {
// Logic for additional pre-buffering
fetch.manually_pre_buffer(0..100000);
}
} this won't work, since I don't think that's a good idea though |
I just wanted something like: struct MyMetaReader<'a> {
reader: Arc<dyn AsyncFileReader>
prefetch: Bytes,
tile_info_buffer: Mutex<(Range<u64>,Bytes)> // or OnceLock?
}
impl async_tiff::Fetch for MyMetaReader {
fn get_bytes(range: Range<u64>) {
// manually pre-fetch and check buffer, solely based on the range size and whether it fits in the prefetch range
if range.end < self.buffer.len() {
let usize_range = range.start as usize..range.end as usize;
async {Ok(self.buffer.slice(usize_range))}
} else {
let range_len = range.end - range.start;
let estimate = (2*range_len + 4*range_len.isqrt()).max(self.buffer.len());
let fetch_range = range.start..range.start + estimate;
async move {
let res = self.reader.get_bytes(fetch_range).await?;
self.tile_info_buffer = (fetch_range, res);
res.slice(0..range_len);
}.boxed()
}
}
}
async fn main() {
let meta_reader = MetadataReader::try_open(href) // gives an async iterator
let ifds = Vec::new();
let first_ifd = meta_reader.filter_ifd(GEO_TAGS).await?; // get geo data
// calculate how many overviews there are based on first ifd information and geo stuff
let skip_ifds = todo!()
for _ in 0..skip_ifds {
ifds.push(meta_reader.skip_ifd().await?); // skip ifds
}
ifds.push(meta_reader.read_ifd().await?); // get the overview we want
} but now its also stateless or something... Or are there like 2 levels to the api I don't quite get? Do we have an iterator over ifds or over tags? or both? |
Is the additional complexity proposed in this PR worth it? S3 best practices recommend a minimum read size of 8-16MB, which is already substantially larger than most COG headers in the wild. With that in mind, does it really make sense to ever not fetch and decode the entire header, even if the caller only needs 1 IFD? @feefladder I'm curious how big are your COG headers? |
around like 5-8 MB, but the estimate is really good, over-estimating by at most like 3 kB, and the difference between only a 16kB prefetch and a 5MB prefetch is quite noticeable. But that's about #81, not this PR actually I don't really understand what is all going on here |
Sure, so instead of calling read_tag, you can vendor these three lines of code to read the tag yourself. let tag_offset =
self.ifd_start_offset + self.tag_count_byte_size + (self.ifd_entry_byte_size * tag_idx);
let mut cursor = MetadataCursor::new_with_offset(fetch, endianness, tag_offset);
let tag_name = Tag::from_u16_exhaustive(cursor.read_u16().await?); |
I think the only noticeable header skip benefit would be in skipping the tag offsets+byte_counts of the first (few) ifds. |
I don't think I want to merge #81. As it is the code is as close to a direct port of the upstream All feedback I've gotten from @vincentsarago and @geospatial-jeff, two COG experts, a prefetch with an exponential buffer should be enough for our needs and should be a reasonable default. So I don't want to refactor the code here to integrate complex caching or buffering native to the TIFF reader. Instead, I want to export the necessary core APIs from |
I think this PR adds a lot of complexity while not really solving the problem of special-casing TileOffsets and TileByteCounts. If you think 4 match blocks that change cursor state to anywhere in the tiff is more maintainable than 2 match blocks that set the cursor at the end of the tag offset field, sure don't merge #81. I'm sorry, I don't know what happened. I haven't been nice to you and I think the current state of things is not better than it could have been. |
Yes, I think it's more maintainable because it's exactly the same as the upstream code. I don't want to have to think about the specifics of tag parsing. The code as it is just works, and even reviewing your changes is really hard because I would have to research the tag structure and validate that it's correct. You seem to have an extremely specific use case that I and no one I know share. As it stands this entire PR exists to create some low-level APIs that you could use to customize reading in your own code. I might still go ahead with this PR because it makes it easier to support externalized and cached COG metadata, which @geospatial-jeff was interested in exploring. |
Use file APIs directly
This takes heavy inspiration from the Parquet crate. See
ParquetMetaDataReader
,AsyncFileReader
,MetadataFetch
.Change list
TiffMetadataReader
andImageFileDirectoryReader
metadata readersF: MetadataFetch
, which only requires accessing a byte range from the remote source.get_metadata_bytes
fromAsyncFileReader
trait.AsyncFileReader
is used only for reading actual image data.get_image_bytes
toget_bytes
andget_image_byte_ranges
toget_byte_ranges
.get_bytes
as the default implementation forMetadataFetch
, used as the defaultget_metadata
PrefetchReader
toPrefetchMetadataFetch
, as it only concerns reading metadata.ImageFileDirectories
type, replacing it with just aVec<ImageFileDirectory>
.cc @feefladder this gives much deeper access into how we load the metadata.
Closes #29 (even though you have to manually page with
read_next_ifd
)