Fix recursive search in Client.get_items #799

Open: mishaschwartz wants to merge 9 commits into main


@mishaschwartz mishaschwartz commented May 14, 2025

Related Issue(s):

Description:

  • runs a non-recursive search when the recursive argument is False
  • updates tests

PR Checklist:

  • Code is formatted
  • Tests pass
  • Changes are added to CHANGELOG.md

@jsignell jsignell self-requested a review May 15, 2025 13:07
@jsignell jsignell self-assigned this May 15, 2025
Member

@jsignell jsignell left a comment

I have a few suggestions, but thank you so much for opening this PR! I think it'll be a real improvement.

Comment on lines 465 to 467
except APIError:
    child_catalogs = [catalog for catalog, _, _ in self.walk()]
    search = self.search(ids=ids, collections=[self, *child_catalogs])
Member

This seems like it would be pretty easy to do accidentally. I think I'd prefer to just let the error raise and make it a little harder to get every single item in planetary computer for instance.

Author

My concern is that without something like this, many functions that call get_items simply don't work for planetary computer or similar APIs that enforce this required argument. This includes:

  • Client.get_all_items,
  • Client.walk,
  • Client.validate_all,
  • Client.describe,
  • Client.make_all_asset_hrefs_relative,
  • Client.make_all_asset_hrefs_absolute

Note that the spec doesn't say one way or another that these arguments must be optional so I'm guessing that planetary computer's API is still spec compliant technically. However, the examples show that a search without collections should be supported so I don't really know one way or the other how to interpret that:

https://github.com/radiantearth/stac-api-spec/blob/604ade6158de15b8ab068320ca41e25e2bf0e116/item-search/examples.md?plain=1#L27

Otherwise the only way to make this work for APIs like planetary computer is to override the Client class like:

import pystac_client

class Client(pystac_client.Client):
    def search(self, *args, **kwargs):
        if kwargs.get("collections") is None:
            kwargs["collections"] = [self.id, *[catalog.id for catalog, _, _ in self.walk()]]
        return super().search(*args, **kwargs)

pystac_client.client.Client = Client  # so that sub-catalogs also use the updated search method 

If that's the approach we want to go with that's fine, but maybe we should document this workaround in case users want to interact with planetary computer.

What do you think?

Member

Thank you for taking the time to write that all up! I think as long as a clear error surfaces it is fine to have those methods fail on Planetary Computer. Requiring collections is not technically compliant with the spec, so I think it is better to not bake in special handling for this scenario especially since it is likely to result in a surprising user experience (setting collections to include every collection might be very very slow).

@jsignell jsignell removed their assignment May 15, 2025
@mishaschwartz mishaschwartz requested a review from jsignell May 20, 2025 18:24
@mishaschwartz
Author

@jsignell I realize that the default value for recursive should actually be False so that it matches the signature for Catalog.get_items (see: https://github.com/mishaschwartz/pystac/blob/7621455aef4813bfe2da571d88484d20bf616122/pystac/catalog.py#L538).

This is necessary because all of the pystac code assumes that the default value is False so many methods that pystac_client.Client inherits from pystac.Catalog will return unexpected results.

For example: if you have a catalog which contains subcatalogs, then Client.walk will yield items in subcatalogs multiple times (one for each ancestor catalog) since get_items is recursive by default.

I would like to make the default False but I realize that that is a breaking change for the pystac_client's public interface. Is that a change that you're comfortable with me making in this PR?

@jsignell
Member

> @jsignell I realize that the default value for recursive should actually be False so that it matches the signature for Catalog.get_items (see: https://github.com/mishaschwartz/pystac/blob/7621455aef4813bfe2da571d88484d20bf616122/pystac/catalog.py#L538).
>
> This is necessary because all of the pystac code assumes that the default value is False so many methods that pystac_client.Client inherits from pystac.Catalog will return unexpected results.
>
> For example: if you have a catalog which contains subcatalogs, then Client.walk will yield items in subcatalogs multiple times (one for each ancestor catalog) since get_items is recursive by default.
>
> I would like to make the default False but I realize that that is a breaking change for the pystac_client's public interface. Is that a change that you're comfortable with me making in this PR?

What if we just leave the default as None? That way this PR is purely additive -- it adds the option to set recursive=False and otherwise changes nothing.

@codecov-commenter

codecov-commenter commented May 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.22%. Comparing base (21435b0) to head (8a268d2).
Report is 123 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #799      +/-   ##
==========================================
+ Coverage   93.43%   94.22%   +0.79%     
==========================================
  Files          13       15       +2     
  Lines         990     1213     +223     
==========================================
+ Hits          925     1143     +218     
- Misses         65       70       +5     

@mishaschwartz
Author

mishaschwartz commented May 21, 2025

> What if we just leave the default as None? That way this PR is purely additive -- it adds the option to set recursive=False and otherwise changes nothing.

If we leave the default as None that's fine but we would need to change the way that the code treats a None value. Right now None is treated the same as True, we would need the None to be treated the same as False.

Either way, it changes the semantics of Client.get_items() without any arguments (which is the way that it is called in various pystac.Catalog methods).
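
To make the point about argument-free calls concrete, here is a minimal sketch of the three variants under discussion. The function names are hypothetical and each one simply returns whether the call would end up recursing; none of this is the actual pystac-client code.

```python
from typing import Optional


def current(recursive: Optional[bool] = None) -> bool:
    """Behaviour described above: None is treated the same as True."""
    return recursive is None or recursive


def default_false(recursive: bool = False) -> bool:
    """Option 1: change the default value itself to False."""
    return recursive


def none_as_false(recursive: Optional[bool] = None) -> bool:
    """Option 2: keep the None default but treat None as False."""
    return bool(recursive)


# Explicit arguments behave identically in all three variants.
for flag in (True, False):
    assert current(flag) == default_false(flag) == none_as_false(flag) == flag

# Only the argument-free call differs, and both options change it.
no_arg_results = (current(), default_false(), none_as_false())
print(no_arg_results)  # (True, False, False)
```

Under these assumptions, both options flip the result of a bare get_items() call from recursive to non-recursive, which is the semantic change being pointed out.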

@jsignell
Member

> If we leave the default as None that's fine but we would need to change the way that the code treats a None value. Right now None is treated the same as True, we would need the None to be treated the same as False.
>
> Either way, it changes the semantics of Client.get_items() without any arguments (which is the way that it is called in various pystac.Catalog methods).

Sorry why not just let None be treated as True?

@mishaschwartz
Author

> Sorry why not just let None be treated as True?

OK here's an example:

  • pystac.Catalog.walk calls self.get_items() (Note: without any arguments so the default values are used!)
  • pystac.Catalog.get_items has recursive = False by default
  • this means that pystac.Catalog.walk is trying to get items non-recursively
  • pystac_client.Client.get_items has recursive = None by default which is the same as recursive = True
  • pystac_client.Client.walk is inherited from pystac.Catalog but will call pystac_client.Client.get_items
  • Therefore pystac_client.Client.walk will recursively call get_items even though we want it to call it non-recursively

A concrete example:

Let's say you have the following structure:

  • catalog1
    • catalog2
      • item1

Then:

  • pystac.Catalog.walk will tell you that item1 is a child of catalog2 but not catalog1
  • pystac_client.Client.walk will tell you that item1 is a child of catalog1 and catalog2
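
The same situation can be reproduced with toy classes. This is a minimal sketch: Catalog and Client below are hypothetical stand-ins, not the real pystac.Catalog or pystac_client.Client. The subclass only changes the default of recursive, yet the inherited walk starts reporting item1 under both catalogs.

```python
from typing import Dict, Iterator, List, Tuple, Type


class Catalog:
    """Toy stand-in for pystac.Catalog (not the real class)."""

    def __init__(self, id: str) -> None:
        self.id = id
        self.children: List["Catalog"] = []
        self.items: List[str] = []

    def get_items(self, recursive: bool = False) -> Iterator[str]:
        yield from self.items
        if recursive:
            for child in self.children:
                yield from child.get_items(recursive=True)

    def walk(self) -> Iterator[Tuple[str, List[str]]]:
        # As described above, walk() calls get_items() with no arguments,
        # so whatever default the subclass picks is what walk() gets.
        yield self.id, list(self.get_items())
        for child in self.children:
            yield from child.walk()


class Client(Catalog):
    """Toy stand-in for a client whose get_items recurses by default."""

    def get_items(self, recursive: bool = True) -> Iterator[str]:
        return super().get_items(recursive=recursive)


def build(cls: Type[Catalog]) -> Catalog:
    """Build catalog1 -> catalog2 -> item1 out of the given class."""
    root, child = cls("catalog1"), cls("catalog2")
    child.items.append("item1")
    root.children.append(child)
    return root


plain: Dict[str, List[str]] = dict(build(Catalog).walk())
client: Dict[str, List[str]] = dict(build(Client).walk())
print(plain)   # {'catalog1': [], 'catalog2': ['item1']}
print(client)  # {'catalog1': ['item1'], 'catalog2': ['item1']}
```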

@jsignell
Member

Thank you for spelling it out for me. This inheritance model is not pretty. I think I feel ok with changing the default to recursive=False as long as we use recursive=True in get_all_items

@gadomski gadomski self-requested a review May 27, 2025 15:52
Member

@gadomski gadomski left a comment

Thanks for diving in, @mishaschwartz! 🪂-ing in with a review here...I agree generally that a recursive=False call should not recurse, but I'm less convinced that we should default to False. If a user is searching an API with nested collections (which is itself a bit unusual), they can use collection IDs to limit the blast radius of their search, or explicitly pass recursive=False?

-    def get_items(
-        self, *ids: str, recursive: bool | None = None
-    ) -> Iterator["Item_Type"]:
+    def get_items(self, *ids: str, recursive: bool = False) -> Iterator["Item_Type"]:
Member

I'm not convinced we should change the function signature. If we need to change the underlying behavior, that might be ok.

Author

It's up to you but this overrides a method from pystac.Catalog and changes the function signature of the method that it overrides. In my experience, it is best not to change the interface of an inherited method unless absolutely necessary.

An example of how this sort of thing causes problems can be found here: #799 (comment)

Member

> In my experience, it is best not to change the interface of an inherited method unless absolutely necessary.

Agreed, which is why I think it was a mistake to inherit from Catalog in the first place, but here we are 😄

Author

oh well 😆

if recursive is not False:
    search = self.search(ids=ids)
else:
    search = self.search(ids=ids, collections=[self.id])
Member

This doesn't feel quite right, since the client is a Catalog, not a Collection.

Author

You're right that the naming is not ideal but that's the parameter name that the API provides.

Items can be direct children of Catalogs and the API spec does not provide a separate catalogs= parameter to differentiate between catalogs and collections. Specifying the catalog id in the collections parameter works with at least one API implementation (stac-fastapi) but I guess the spec doesn't specify what to do in this edge case.
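
For illustration, this is roughly what such a request body could look like. It is a sketch only: build_search_body is a hypothetical helper, not a pystac-client function. The ids and collections fields are item-search parameters, but whether a server accepts a catalog id inside collections is implementation-dependent, as noted above.

```python
import json
from typing import Any, Dict, Optional, Sequence


def build_search_body(
    ids: Sequence[str], catalog_id: Optional[str] = None
) -> Dict[str, Any]:
    """Assemble a body for POST /search (hypothetical helper).

    When catalog_id is given, it is sent through the collections
    parameter, since there is no separate catalogs parameter.
    """
    body: Dict[str, Any] = {}
    if ids:
        body["ids"] = list(ids)
    if catalog_id is not None:
        # Assumption: the server treats a catalog id like a collection id.
        body["collections"] = [catalog_id]
    return body


non_recursive = build_search_body(["item1"], catalog_id="catalog1")
recursive = build_search_body(["item1"])
print(json.dumps(non_recursive))
print(json.dumps(recursive))
```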

The other option is to skip the option to use the search endpoint for all non-recursive calls and do something like:

        if self.conforms_to(ConformanceClasses.ITEM_SEARCH) and recursive:
            yield from self.search(ids=ids).items()
        else:
            if not self.conforms_to(ConformanceClasses.ITEM_SEARCH):
                self._warn_about_fallback("ITEM_SEARCH")
            for item in super().get_items(
                *ids, recursive=recursive is None or recursive
            ):
                call_modifier(self.modifier, item)
                yield item

Member

> Specifying the catalog id in the collections parameter works with at least one API implementation (stac-fastapi) but I guess the spec doesn't specify what to do in this edge case.

Yeah, we try to expand pystac-client with heuristics to help it work with real-world instances (rather than being strictly spec-enforcing) but this use-case is unusual enough that I'm not sure it's worth the complexity to manage.

I'm still not sure the problem we're trying to solve here is pystac-client's problem. As the original docstring said, we're not using recursive in pystac-client at all, we only use it when we fall back to pystac for non-API searches. So I'm a bit inclined to say "if pystac-client's recursion behavior isn't what you want, just use pystac directly"?

Author

I think that the confusing thing for me as a user of pystac-client is that the recursive behaviour is inconsistent depending on whether it's using the /search endpoint or not.

Currently the behaviour is:

  • if using /search: always recursive
  • otherwise: it depends on the recursive argument
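
The two cases above can be modelled as a small decision function. This is only a sketch: yields_recursively is a hypothetical helper, not part of pystac-client, and the fallback rule (None treated as True) follows the description earlier in this thread.

```python
from typing import Optional


def yields_recursively(
    conforms_to_item_search: bool, recursive: Optional[bool]
) -> bool:
    """Return True if get_items would yield items from sub-catalogs."""
    if conforms_to_item_search:
        # The /search endpoint is used and the recursive flag is ignored.
        return True
    # Fallback to pystac: the flag decides, with None acting like True.
    return recursive is None or recursive


# With /search available, recursive=False has no effect.
assert yields_recursively(True, False) is True
# Without /search, the same argument is honoured.
assert yields_recursively(False, False) is False
```

Under this model, the same call with recursive=False gives different results depending only on which conformance classes the server advertises.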

If the solution is to just say don't use pystac-client in this case then let's at least document this better. Maybe change this

recursive: unused in pystac-client, but needed for falling back to pystac

to

recursive: If this client conforms to the ITEM_SEARCH conformance class, this is unused and this will always yield items recursively. Otherwise, this will only return items recursively if True.

Or something similar that talks about the distinction.

On a personal note... I don't think I'll be able to use pystac-client in my applications if we go this route.

Member

👍🏼 to the docs update. pystac-client is for STAC APIs, not static STAC catalogs, and our fallback to pystac is more of a convenience than a core feature.

Author

Ok that's fine. Just so you know, the documentation talks about pystac in a way that makes it seem like pystac is more than just a convenience so you might understand why people might assume that pystac-client would align more closely with pystac than it does:

In that last link you even have the line (in the consequences heading):

"Special care should be taken to ensure that we do not break any of PySTAC’s functionality through inheritance."

Which is exactly the issue that this PR is trying to address

Member

Yup, I appreciate the call-out. There's been discussions over the years on whether we should even have the two libraries be separate (for one example, stac-utils/pystac#1334 (comment)). Any documentation cleanup/fixes to make things clearer for folks would be appreciated 🙇🏼.

FWIW My current thinking is that if we ever wanted to go to a v1.0 release of pystac-client, we'd want to drop inheritance altogether to avoid these problems.

Author

No worries, that makes sense. I understand now why pystac-client is taking this approach.

I've created a separate PR #800 that just updates the docstring as we discussed.

@jsignell
Member

Thank you @gadomski for being a more opinionated reviewer than me and thank you @mishaschwartz for being flexible on the approach!

Successfully merging this pull request may close these issues.

Client.get_items has surprising recursive behaviour when using the /search endpoint
4 participants