Support for metadata filter suggestions in the web UI #493

Open
pmeier opened this issue Aug 20, 2024 · 4 comments

@pmeier
Member

pmeier commented Aug 20, 2024

In #484 we hardcode the available metadata. We cannot release with that. Instead, we need a way to communicate this information from the backend to the web UI. For this we need:

  1. An endpoint on the API, i.e. GET /corpuses/{name}/metadata

  2. A new abstract method on the SourceStorage class, e.g. list_metadata (name TBD). The return value should be dict[str, tuple[type, list[Any]]] with the keys being the available metadata keys and the values being a two-tuple of the type and the available values.

    • The type above might also be a str, e.g. "int", "float", etc., if that makes it easier.
    • The source storage might opt to return an empty list for the available values to indicate that no hints for the values are available.

    This function is potentially pretty expensive as one has to query the full database and potentially extract unique values from it. Thus, while we don't need to have it from the first version, we should have caching in mind.

    Lastly, I'm not sure yet if we want to make it strictly abstract, i.e. decorate it with @abstractmethod, because that would require everyone to implement it even if they don't want to work with corpuses. Instead, we could leave it undecorated and raise NotImplementedError in the base implementation, thus pushing the check to runtime. Thoughts?

In addition, 2. also has to be implemented on builtin source storages.
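
To make 2. more concrete, here is a minimal sketch of the undecorated variant. Everything except the names SourceStorage and list_metadata is an assumption for illustration: the corpus_name parameter, the InMemorySourceStorage class, its store method, and the per-corpus cache are hypothetical and not part of the actual code base. The type is returned as a str, as allowed above.

```python
from typing import Any

# Return type proposed above: key -> (type name, available values)
Metadata = dict[str, tuple[str, list[Any]]]


class SourceStorage:
    """Stand-in for the real base class; only the relevant part is shown."""

    def list_metadata(self, corpus_name: str) -> Metadata:
        # Deliberately NOT decorated with @abstractmethod: subclasses that do
        # not work with corpuses can skip it, and the check moves to runtime.
        raise NotImplementedError(
            f"{type(self).__name__} does not support listing metadata"
        )


class InMemorySourceStorage(SourceStorage):
    """Hypothetical storage keeping a list of metadata dicts per corpus."""

    def __init__(self) -> None:
        self._corpuses: dict[str, list[dict[str, Any]]] = {}
        self._metadata_cache: dict[str, Metadata] = {}

    def store(self, corpus_name: str, metadata: dict[str, Any]) -> None:
        self._corpuses.setdefault(corpus_name, []).append(metadata)
        # New data invalidates the cached result for this corpus only.
        self._metadata_cache.pop(corpus_name, None)

    def list_metadata(self, corpus_name: str) -> Metadata:
        if corpus_name not in self._metadata_cache:
            types: dict[str, str] = {}
            values: dict[str, list[Any]] = {}
            # Full scan of the corpus; this is the expensive part.
            for row in self._corpuses.get(corpus_name, []):
                for key, value in row.items():
                    types.setdefault(key, type(value).__name__)
                    seen = values.setdefault(key, [])
                    if value not in seen:
                        seen.append(value)
            self._metadata_cache[corpus_name] = {
                key: (types[key], values[key]) for key in values
            }
        return self._metadata_cache[corpus_name]
```

A storage that cannot provide value hints could return an empty list as the second tuple element, and a storage without corpus support simply inherits the base method and fails only when it is called.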

@pmeier
Member Author

pmeier commented Aug 20, 2024

Afterthought to 1.: in #487 we decided to use None as the sentinel for the default corpus as decided by the source storage. Not sure how this can work through the REST API though, as {name} probably has to be a str and cannot be omitted. @nenb what do you think of removing the None option and instead using a string sentinel, e.g. "default"?
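
A small sketch of why a string sentinel fits a path parameter better than None. The constant DEFAULT_CORPUS_NAME and the helper metadata_path are hypothetical names used only for illustration:

```python
from urllib.parse import quote

# Assumed sentinel value, as proposed above.
DEFAULT_CORPUS_NAME = "default"


def metadata_path(corpus_name: str = DEFAULT_CORPUS_NAME) -> str:
    """Build the request path for the metadata of one corpus.

    A path segment is always a string, so None cannot be sent as-is;
    a string sentinel works without any special casing.
    """
    # Percent-encode the name so that it stays a single path segment.
    return f"/corpuses/{quote(corpus_name, safe='')}/metadata"
```

With this, metadata_path() addresses the default corpus and metadata_path("my docs") addresses a named one. One consequence of the design is that "default" becomes a reserved corpus name.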

@nenb
Contributor

nenb commented Aug 20, 2024

@nenb what do you think of removing the None option and instead using a string sentinel, e.g. "default"?

This seems fine to me.

The rest of the issue also seems fine to me - let's leave list_metadata undecorated for the reason that you outlined.

@pmeier
Member Author

pmeier commented Aug 21, 2024

GET /corpuses/{name}/metadata is not going to cut it, as we potentially have multiple source storages with corpuses. Thus, {name} is not unique. I see two options:

  1. Switch to GET /corpuses/metadata and return a nested dictionary whose two outer layers are keyed by source storage and corpus name. This amplifies the cost issue highlighted above, since we now query all source storages and corpuses at once.
  2. Switch to something along the lines of GET /source-storages/{source_storage_name}/corpuses/{corpus_name} to properly address the right corpus. I'm open to using a different scheme, e.g. putting the corpus first in the path or just passing both names as query parameters.

Maybe a combination of both is a good solution?

  • GET /corpuses/metadata returns everything
  • GET /corpuses/metadata?source_storage=Chroma returns the same object, but only having Chroma as single item in the outer dictionary
  • GET /corpuses/metadata?source_storage=Chroma&corpus_name=default same as above, but only having a single item in the secondary outer dictionary

@blakerosenthal this might also be a good solution for 1. #495 (comment) when using the JSON approach.
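
The combined approach can be sketched as a single selection function with two optional filters. This is only an illustration under assumptions: the name select_metadata and the example data are hypothetical, and a real implementation would apply the filters before querying the source storages rather than after, to avoid the cost issue mentioned above.

```python
from typing import Any, Optional

# key -> (type name, available values)
Metadata = dict[str, tuple[str, list[Any]]]
# source storage name -> corpus name -> metadata
AllMetadata = dict[str, dict[str, Metadata]]


def select_metadata(
    all_metadata: AllMetadata,
    source_storage: Optional[str] = None,
    corpus_name: Optional[str] = None,
) -> AllMetadata:
    """Return the same nested shape, narrowed by the optional filters."""
    result: AllMetadata = {}
    for storage, corpuses in all_metadata.items():
        if source_storage is not None and storage != source_storage:
            continue
        result[storage] = {
            name: metadata
            for name, metadata in corpuses.items()
            if corpus_name is None or name == corpus_name
        }
    return result


# Made-up example data, not taken from a real deployment.
example: AllMetadata = {
    "Chroma": {
        "default": {"year": ("int", [2023, 2024])},
        "papers": {"author": ("str", [])},
    },
    "LanceDB": {"default": {"lang": ("str", ["en"])}},
}
```

Here select_metadata(example) corresponds to GET /corpuses/metadata, select_metadata(example, source_storage="Chroma") to the second request, and select_metadata(example, source_storage="Chroma", corpus_name="default") to the third. The response shape is identical in all three cases, so the web UI needs only one code path to consume it.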

@nenb
Contributor

nenb commented Aug 21, 2024

Maybe a combination of both is a good solution?

  • GET /corpuses/metadata returns everything
  • GET /corpuses/metadata?source_storage=Chroma returns the same object, but only having Chroma as single item in the outer dictionary
  • GET /corpuses/metadata?source_storage=Chroma&corpus_name=default same as above, but only having a single item in the secondary outer dictionary

Makes sense, I'll implement this.
