[DataCatalog]: Pretty printing #3913

ElenaKhaustova · 2024-06-03T11:35:21Z

Description

Compiling the catalog at runtime hinders users' ability to assess its structure and contents effectively. They express the need for an improved visual representation of the catalog when printing.

We propose:

Explore the feasibility of developing a dedicated function to compile the catalog.
Implement a "pretty printing" - implement catalog.print() / catalog.__repr__ function specifically tailored to improve the visual representation of the catalog when printed or displayed.

Relates to #1721

Context

There is no a particular requirement from user side on how output should look like. They mention that they expect to have something better than the following to help them understand what's in the catalog when debugging:

"I have Kedro jupyter notebook opening. I have a catalog object. Maybe we could have a nicer representation when you do that. You know how many data sets are available, things like that potentially."

"pretty printing (print(catalog) should give something understandable than a long, a minima catalog.list() and maybe details on the dataset)"

Current catalog printing output

Catalog datasets printing output

The text was updated successfully, but these errors were encountered:

astrojuanlu · 2024-06-06T07:10:34Z

Possibly a couple of bottlenecks for meaningful progress here are

We should open up the DataCatalog abstraction, and instead of making it hide the datasets, just present it as a collection of those. Previous discussions point to this design decision, see for example Re-design io.core and io.data_catalog #1778 (comment)
We should be able to pretty print individual datasets, otherwise we won't be able to do much apart from listing dataset names (although it must also be said that a list of datasets is probably better than a memory address)

astrojuanlu · 2024-06-06T07:12:21Z

Notice also that, despite being somewhat limited in IPython, on Jupyter we can go crazy. See what Dask and Xarray do:

https://tutorial.dask.org/00_overview.html

https://tutorial.xarray.dev/overview/xarray-in-45-min.html

ElenaKhaustova · 2024-06-06T09:16:17Z

This can be solved if #3932 is done, but the last is a much more complex task, and it's unclear whether/when it will be done.

merelcht · 2024-06-10T14:18:39Z

This feels like a duplicate of a #1721 rather than just being related. Can we merge these tickets in a way and create something more actionable as sub task?

Maybe we can just make repr and str do the same for now? I don't see the need to change str, as useless as it might be...

That sounds like the most obvious task we could start with? cc: @noklam @astrojuanlu

astrojuanlu · 2024-06-11T08:41:13Z

I think that comment from #1721 referred to AbstractDataset.__str__. But yes, a better AbstractDataset.__repr__ is needed to show a fancy DataCatalog.__repr__ I believe.

Agreed with having either this or #1721 be the main parent issue, closing the other, and opening sub-issues for specific classes, namely

AbstractDataset
DataCatalog (essentially this issue)
OmegaConfigLoader (or its base class)

noklam · 2024-06-11T09:17:11Z

TL;DR I agree we should keep one of the two tickets only.

This can be solved if #3932 is done, but the last is a much more complex task, and it's unclear whether/when it will be done.

I am not sure #3932 is the full solution to this. The use case of (de)serialisation of DataCatalog is quite different from the interactively use case, and it's likely we don't want to show one big blob of JSON-like structure in a terminal.

I think I am holding my DataCatalog is a collection of datasets argument, the job of DataCatalog is to delegate or collect the implementation from AbstractDataset (we could have a base method in case it is not implemented). I would assume people don't really want to see the metadata of DataCatalog, it's just that they need to access datasets information through DataCatalog

I have the use case of interacting with big Kedro project in mind here. I agree we should keep one of the two tickets only. This ticket is not necessarily contradict with #1721, it could be part of the solution to this use case. i.e. print(catalog) with some basic info, catalog.summary() (like how pandas.DataFrame works), or even catalog.summary(<pipeline>)?

Top level information that I would love to know:

Number of datasets
Names of this datasets (catalog.list), incomplete if dataset factory pattern exists. There is a separate issue discuss this.

If I need further details of the datasets, i will then access it through a specific API (it shouldn't print all datasets details when I do print(catalog), it's too much information). Potentially, it could show one of these following information

The materialised version of dataset definition in catalog.yml
AbstractDataset/dataset.describe()?

astrojuanlu · 2024-06-11T09:20:31Z

I think I am holding my "DataCatalog is a collection of datasets" argument,

Sorry @noklam I didn't quite get if you're in favour or against, could you clarify?

noklam · 2024-06-11T09:53:22Z

@astrojuanlu TL;DR, I think we should close one of the ticket. I don't want to get into the solution yet since I think the scope is not clear enough yet. (Something need to be done for DataCatalog, some are down to the Dataset level as you mentioned.

Compiling the catalog at runtime hinders users' ability to assess its structure and contents effectively. They express the need for an improved visual representation of the catalog when printing. (from this issue description)

The former should be handled separately, while "printing" in my mind is specific to interactive workflow (a.k.a Notebook 95% the time)

I think there are 2 approaches:

We try to customise the printing (i.e. catalog.print, catalog.repr, catalog.str etc)
We show the "raw" constructor as is (as if DataCatalog is a dataclass)

This is actually not too bad, but unfortunately framework and component is competing here. For example catalog._default_pattern is empty, while framework default is a memory dataset. The reason for this is that Framework actually inject the pattern through Runner, but this is very confusing to users that doesn't understand the details. So I lean more to 1.

deepyaman · 2024-06-24T13:31:18Z

We can extend the built-in pprint.PrettyPrinter to support AbstractDataset and DataCatalog. __repr__ can also be made to leverage the pretty printer.

@astrojuanlu In notebooks, you can do quite a bit by providing an HTML representation. Scikit-learn also does this.

If this is not super urgent, I would be happy to prototype something, as I have some recent experience working with the pretty printer (and, because doing so is quite non-trivial, if you've never done it before).

ElenaKhaustova · 2024-07-02T17:22:25Z

A summary of what we suggest:

Implement __repr__ for AbstractDataset and DataCatalog
- we already have an implementation of __str__ method for AbstractDataset based on the dataset's _describe which can be adjusted and moved to __repr__
- implement __repr__ for DataCatalog based on dataset's __repr__
- one of the potential solutions is to extend the built-in pprint.PrettyPrinter
- we might also need to update _describe for MemoryDataset, LambdaDataset, SharedMemoryDataset, and CachedDataset
Keep this issue as a parent and open two corresponding tickets for datasets and catalog
Should we open separate issues for ConfigLoader with autocompletion and meaningful representation? It is mentioned in Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed #1721, but not much context was provided, and potential solutions were not discussed.
Close Improve DataCatalog and ConfigLoader with autocompletion and meaningful representation when it get printed #1721 as it is partly addressed in Key completion for dataset access #3973 and the "meaningful representation part" will be addressed in terms of the current ticket.

ElenaKhaustova · 2024-07-19T16:14:36Z

Related issue: kedro-org/kedro-plugins#769

ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 3, 2024

ElenaKhaustova added this to the Redesign the API for IO (catalog) milestone Jun 3, 2024

iamelijahko mentioned this issue Jun 6, 2024

Research summary of insights for redesigning Kedro's data catalog API #3934

Closed

ElenaKhaustova added the Type: Parent Issue label Jun 25, 2024

ElenaKhaustova assigned ElenaKhaustova and deepyaman Jun 25, 2024

github-actions bot mentioned this issue Jul 1, 2024

Monthly issue metrics report #3975

Closed

This was referenced Jul 2, 2024

Pretty printing: AbstractDataset.__repr__ #3980

Closed

Pretty printing: DataCatalog.__repr__ #3981

Closed

ElenaKhaustova closed this as completed Jul 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DataCatalog]: Pretty printing #3913

[DataCatalog]: Pretty printing #3913

ElenaKhaustova commented Jun 3, 2024 •

edited

Loading

astrojuanlu commented Jun 6, 2024

astrojuanlu commented Jun 6, 2024

ElenaKhaustova commented Jun 6, 2024

merelcht commented Jun 10, 2024

astrojuanlu commented Jun 11, 2024

noklam commented Jun 11, 2024 •

edited

Loading

astrojuanlu commented Jun 11, 2024

noklam commented Jun 11, 2024

deepyaman commented Jun 24, 2024

ElenaKhaustova commented Jul 2, 2024 •

edited

Loading

ElenaKhaustova commented Jul 19, 2024

[DataCatalog]: Pretty printing #3913

[DataCatalog]: Pretty printing #3913

Comments

ElenaKhaustova commented Jun 3, 2024 • edited Loading

Description

Context

astrojuanlu commented Jun 6, 2024

astrojuanlu commented Jun 6, 2024

ElenaKhaustova commented Jun 6, 2024

merelcht commented Jun 10, 2024

astrojuanlu commented Jun 11, 2024

noklam commented Jun 11, 2024 • edited Loading

astrojuanlu commented Jun 11, 2024

noklam commented Jun 11, 2024

deepyaman commented Jun 24, 2024

ElenaKhaustova commented Jul 2, 2024 • edited Loading

ElenaKhaustova commented Jul 19, 2024

ElenaKhaustova commented Jun 3, 2024 •

edited

Loading

noklam commented Jun 11, 2024 •

edited

Loading

ElenaKhaustova commented Jul 2, 2024 •

edited

Loading