Splash-ML collection redesign #34

@dylanmcreynolds

Currently, "Tags" are embedded as subdocuments of Datasets, like:

{
    "uid": "ee600210432b8f81ad229c33",
    "type": "file",
    "uri": "/a/b/c.tiff",
    "tags": [
        {
            "name": "label",
            "locator": { "min": 1, "max": 2 },
            "value": "rods",
            "confidence": 0.900,
            "event_id": "wwewere6002104rwerwe81ad229c33"
        },
        {
            "name": "label",
            "value": "peaks",
            "locator": { "min": 1, "max": 2 },
            "confidence": 0.001,
            "event_id": "wwewere6002104rwerwe81ad229c33"
        },
        {
            "name": "geometry",
            "value": "GISAXS",
            "locator": { "min": 1, "max": 2 },
            "confidence": 1.0,
            "event_id": "wwewere6002104rwerwe81ad229c33"
        }
    ]
}

In this design, the Dataset is the primary document and tags only exist inside it.

There has been a lot of discussion about potentially switching to a model where the primary collection is Tag, with each tag carrying a key for the dataset it was applied to, more like:

{
    "name": "geometry",
    "value": "GISAXS",
    "locator": { "min": 1, "max": 2 },
    "confidence": 1.0,
    "event_id": "wwewere6002104rwerwe81ad229c33",
    "dataset": {
        "type": "file",
        "uri": "/a/b/c.tiff"
    }
}
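To make the relationship between the two shapes concrete, here is a minimal sketch of how one embedded Dataset document could be flattened into Tag-primary documents. The function name `dataset_to_tag_docs` is hypothetical, and copying only `type` and `uri` into the `dataset` key simply mirrors the example above; whether the `uid` should be carried along as well is an open design choice.

```python
from typing import Any


def dataset_to_tag_docs(dataset: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one Dataset document into one document per embedded tag."""
    # Mirror the proposed shape: each tag carries a small reference to its dataset.
    dataset_ref = {key: dataset[key] for key in ("type", "uri")}
    return [{**tag, "dataset": dict(dataset_ref)} for tag in dataset.get("tags", [])]


EVENT_ID = "wwewere6002104rwerwe81ad229c33"

example_dataset = {
    "uid": "ee600210432b8f81ad229c33",
    "type": "file",
    "uri": "/a/b/c.tiff",
    "tags": [
        {"name": "label", "value": "rods", "locator": {"min": 1, "max": 2},
         "confidence": 0.9, "event_id": EVENT_ID},
        {"name": "label", "value": "peaks", "locator": {"min": 1, "max": 2},
         "confidence": 0.001, "event_id": EVENT_ID},
        {"name": "geometry", "value": "GISAXS", "locator": {"min": 1, "max": 2},
         "confidence": 1.0, "event_id": EVENT_ID},
    ],
}

tag_docs = dataset_to_tag_docs(example_dataset)
```

Each resulting document stands alone, at the cost of repeating the dataset reference once per tag.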

With Dataset as the primary structure, searches for all datasets and all of their tags should be faster.

With Tags as the primary structure, searches for individual or multiple tags should be faster, as they would not have to return all of the payload of tags that were not queried.
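As a rough illustration of that trade-off, the sketch below runs the same "find every `geometry` = `GISAXS` tag" search against both shapes using plain in-memory lists. The two filter dictionaries show what the equivalent queries might look like, assuming a MongoDB-style store (an assumption based on the "collection" and "subdocument" wording here). The second dataset URI and the function names are hypothetical.

```python
# Possible query filters, assuming a MongoDB-style store (not executed here).
# A plain find with this filter returns the whole Dataset, including every
# embedded tag, unless a projection or aggregation stage trims the array.
EMBEDDED_FILTER = {"tags": {"$elemMatch": {"name": "geometry", "value": "GISAXS"}}}
# In a Tag-primary collection the filter is flat and only matching tags come back.
TAG_FILTER = {"name": "geometry", "value": "GISAXS"}


def find_tags_embedded(datasets, name, value):
    """Dataset-primary: visit each dataset, then dig through its embedded tags."""
    hits = []
    for dataset in datasets:
        for tag in dataset.get("tags", []):
            if tag["name"] == name and tag["value"] == value:
                hits.append((dataset["uri"], tag["confidence"]))
    return hits


def find_tags_flat(tags, name, value):
    """Tag-primary: a single pass over tag documents, no unrelated tags touched."""
    return [
        (tag["dataset"]["uri"], tag["confidence"])
        for tag in tags
        if tag["name"] == name and tag["value"] == value
    ]


datasets = [
    {"uri": "/a/b/c.tiff", "tags": [
        {"name": "label", "value": "rods", "confidence": 0.9},
        {"name": "geometry", "value": "GISAXS", "confidence": 1.0},
    ]},
    {"uri": "/a/b/d.tiff", "tags": [  # hypothetical second dataset
        {"name": "label", "value": "peaks", "confidence": 0.001},
    ]},
]

tags = [
    {"name": "label", "value": "rods", "confidence": 0.9,
     "dataset": {"uri": "/a/b/c.tiff"}},
    {"name": "geometry", "value": "GISAXS", "confidence": 1.0,
     "dataset": {"uri": "/a/b/c.tiff"}},
    {"name": "label", "value": "peaks", "confidence": 0.001,
     "dataset": {"uri": "/a/b/d.tiff"}},
]

embedded_hits = find_tags_embedded(datasets, "geometry", "GISAXS")
flat_hits = find_tags_flat(tags, "geometry", "GISAXS")
```

Both searches find the same tag; the difference is how much unrelated tag payload has to be read or returned along the way.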

I was pretty convinced that we wanted to switch to the Tags collection method, but after talking with @taxe10, it seems the LabelMaker is better served by the existing design. When the LabelMaker loads, it queries for all tags for multiple datasets, which matches the current structure pretty well.

So the question for me now is: can we think of compelling use cases where we want to search on a subset of known tags and receive all of the instances of them and the datasets that they relate to? If not, let's leave the collections as they are.
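For comparison, here is a sketch of what the LabelMaker-style load ("all tags for these datasets") might require if tags lived in their own collection: the tags have to be regrouped by dataset, a step the embedded design gets for free. This is only an illustration under assumed names (`tags_by_dataset`, the second URI); it is not how the LabelMaker actually queries.

```python
from collections import defaultdict


def tags_by_dataset(tag_docs, wanted_uris):
    """Regroup Tag-primary documents into {dataset uri: [tags]} for the given URIs."""
    wanted = set(wanted_uris)
    grouped = defaultdict(list)
    for tag in tag_docs:
        uri = tag["dataset"]["uri"]
        if uri in wanted:
            # Drop the dataset reference; the grouping key already carries it.
            grouped[uri].append({k: v for k, v in tag.items() if k != "dataset"})
    return dict(grouped)


tag_docs = [
    {"name": "label", "value": "rods", "dataset": {"uri": "/a/b/c.tiff"}},
    {"name": "geometry", "value": "GISAXS", "dataset": {"uri": "/a/b/c.tiff"}},
    {"name": "label", "value": "peaks", "dataset": {"uri": "/a/b/d.tiff"}},  # hypothetical
]

loaded = tags_by_dataset(tag_docs, ["/a/b/c.tiff"])
```

With the embedded design, the same result is a direct lookup of each Dataset document and its `tags` array.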

Labels: enhancement