Pretty printing: AbstractDataset.__repr__ #3980

ElenaKhaustova · 2024-07-02T17:42:14Z

Description

Parent ticket: #3913

Implement __repr__ for AbstractDataset to give datasets a better printed representation, and then reuse it within DataCatalog.__repr__.

Context

#3913 (comment)
#1721

Possible Implementation

We already have an implementation of the __str__ method for AbstractDataset, based on the dataset's _describe, which can be adjusted and moved to __repr__.
Update _describe for MemoryDataset, LambdaDataset, SharedMemoryDataset, and CachedDataset if needed.
One potential solution is to extend the built-in pprint.PrettyPrinter.
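A minimal sketch of the idea above, assuming a stand-in base class rather than the real AbstractDataset. The names SketchDataset and SketchCSVDataset are hypothetical and used only for illustration; the actual Kedro implementation may differ.

```python
import pprint


class SketchDataset:
    """Hypothetical stand-in for AbstractDataset; not the Kedro class."""

    def _describe(self) -> dict:
        # Subclasses return the attributes worth showing.
        raise NotImplementedError

    def __repr__(self) -> str:
        # Skip unset (None) values, then format each remaining value with pprint.
        described = {k: v for k, v in self._describe().items() if v is not None}
        args = ", ".join(
            f"{key}={pprint.pformat(value, sort_dicts=False)}"
            for key, value in described.items()
        )
        return f"{type(self).__module__}.{type(self).__name__}({args})"


class SketchCSVDataset(SketchDataset):
    """Hypothetical dataset used only to exercise the sketch."""

    def __init__(self, filepath, load_args=None):
        self._filepath = filepath
        self._load_args = load_args

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "load_args": self._load_args}


ds = SketchCSVDataset("data/cars.csv", {"sep": ","})
print(ds)
```

Because the sketch defines no __str__, Python falls back to __repr__ for str(ds) and print(ds), so both show the same text.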


ElenaKhaustova · 2024-07-04T23:30:37Z

I've prototyped two different approaches for printing:

Both approaches implement __repr__() for AbstractDataset based on the specific dataset's _describe() implementation. Since __str__ is not implemented, __repr__() is also called when converting the object to a string, so print(obj) and evaluating obj give the same result.
Pretty printing catalog #3990: the first approach prints a dataset on one line in the format module.Class(arg_1=val_1, ..., arg_n=val_n). Each argument value is formatted with pprint.pformat, and the formatted strings are then joined into the final string.

Pretty printing dataset with indentation #3991: the second approach represents a dataset as a dict, {'module.Class': {'arg_1': val_1, ..., 'arg_n': val_n}}. The dictionary is built first and then formatted with pprint.pformat, which lets us control indentation, but the output is less compact.

For both approaches we can control depth, so we can hide arguments nested beyond a certain level.
Both also look representative enough compared to what we had before; the first is more compact, while the second keeps the indentation.
An alternative option is to expose _pretty_repr and let the user customise width, indentation, and depth.

I'm curious about what you think. Does it feel good enough? Do we want more or less information provided? Which approach seems better?

noklam · 2024-07-05T12:37:17Z

One caveat if you are using pprint: kedro-org/vscode-kedro#33

ElenaKhaustova · 2024-07-08T15:42:40Z

The non-indented version was preferred over the indented one, so I'm closing #3991.

datajoely · 2024-07-09T15:13:31Z

I think this is great! Inspired by @astrojuanlu's post about sklearn (Slack link provided, Linen is not working for me), I had a think about what we could do with __rich_repr__ to provide a more beautiful experience in interactive environments.

The main bits of novelty are:

YAML representations of the load/save arguments
Example of how to import the class explicitly for debugging
Pulled out the first part of the docstring.
Generated a hyperlink to the docs.kedro.org definition
I forgot to add file path, but that part is trivial

Code

import importlib
import re
import sys
import yaml
import kedro
from kedro.io.data_catalog import DataCatalog

csv = """
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ","
    na_values: ["#NA", NA]
  save_args:
    index: False
    date_format: "%Y-%m-%d %H:%M"
    decimal: .
"""

catalog = DataCatalog.from_config(yaml.safe_load(csv))

def find_shortest_import_path(class_path):
    parts = class_path.split('.')
    class_name = parts[-1]
    
    # Iterate over the possible module paths
    for i in range(1, len(parts)):
        module_path = ".".join(parts[:i])
        try:
            module = importlib.import_module(module_path)
            # If the module contains the class name, return the shortest path
            if hasattr(module, class_name):
                return f"{module_path}.{class_name}"
        except ModuleNotFoundError:
            continue
    
    # If no shorter path found, return the full class path
    return class_path


def generate_docs_url(class_path: str) -> str:
    base_module = 'kedro_datasets'
    if class_path.startswith('kedro_datasets'):
        try:    
            split_class_path = find_shortest_import_path(class_path).split('.')
            module_import = split_class_path[0]
            module_install = base_module.replace('_','-')
            version = importlib.import_module('kedro_datasets').__version__
            short_class_path = f"{split_class_path[1]}.{split_class_path[2]}"
            prefix = "https://docs.kedro.org/projects"
            module_loc = f"{module_install}-{version}"
            class_loc = f"{module_import}.{short_class_path}.html"
            url = f"{prefix}/{module_install}/en/{module_loc}/api/{class_loc}"
            return url
        except ModuleNotFoundError:
            return None
    return None

import rich
from rich import box
from rich.layout import Layout
from rich.syntax import Syntax
from rich.table import Table

def build_rich_repr(ds: kedro.io.AbstractDataset):
    load_args = yaml.safe_dump({"load_args":getattr(ds,'_load_args')}).strip()
    save_args = yaml.safe_dump({"save_args":getattr(ds,'_save_args')}).strip()
    # Raw string avoids the invalid "\s" escape warning in a normal string literal
    help_message = re.sub(r"(\n|\t|\s{1,5})", " ", ds.__doc__.split('\n\n')[0].replace('`',''))
    class_path = f"{ds.__module__}.{ds.__class__.__name__}"
    docs_url = generate_docs_url(class_path)
    import_statement = f"from {ds.__module__} import {ds.__class__.__name__}"
    layout = Layout()
    theme = dict(theme="default") if 'ipykernel' in sys.modules else dict()
    syntax_args = dict(padding=1,  line_numbers=True, word_wrap=True, **theme)
    load_args_r = Syntax(load_args, "yaml", **syntax_args)
    save_args_r = Syntax(save_args, "yaml", **syntax_args)
    import_statement_r = Syntax(import_statement, "python", **syntax_args)
    
    t = Table(
        "Attribute", "Value",
        padding=1,
        show_header=False,
        box=box.SIMPLE
    )
    t.add_row("Class documentation", f'[b][link={docs_url}]{class_path}[/link][/b]')
    t.add_row("Docstring snippet", help_message)
    t.add_row("Load arguments", load_args_r)
    t.add_row("Save arguments", save_args_r)
    t.add_row("Import statement\n[i](useful for REPL testing)[/]", import_statement_r)
    return t

rich.print(build_rich_repr(catalog.datasets.cars))

Would yield this in Jupyter

Or this in IPython:

Or this in the default macOS terminal (the hyperlink doesn't work here):

datajoely · 2024-07-09T15:23:23Z

One other piece - does your implementation @ElenaKhaustova expose encrypted database connection strings?

would this YAML:

shuttle_id_dataset:
  type: pandas.SQLQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  credentials: db_credentials
  layer: raw

render out the constructed connection string or not?

ElenaKhaustova · 2024-07-09T18:30:18Z

One other piece - does your implementation @ElenaKhaustova expose encrypted database connection strings?

would this YAML:
shuttle_id_dataset:
  type: pandas.SQLQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  credentials: db_credentials
  layer: raw
render out the constructed connection string or not?

What is exposed depends on the implementation of _describe(): https://github.com/kedro-org/kedro-plugins/blob/be99fecf6cf5ac8f6a0a717c56b06dbc148b26eb/kedro-datasets/kedro_datasets/pandas/sql_dataset.py#L558

  def _describe(self) -> dict[str, Any]:
      load_args = copy.deepcopy(self._load_args)
      return {
          "sql": str(load_args.pop("sql", None)),
          "filepath": str(self._filepath),
          "load_args": str(load_args),
          "execution_options": str(self._execution_options),
      }

So neither the credentials nor the connection string is exposed.
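The point can be illustrated with a small sketch: a __repr__ built from _describe() shows only what _describe() returns. SketchSQLDataset and its attribute names are hypothetical and do not reproduce the kedro-datasets class.

```python
class SketchSQLDataset:
    """Hypothetical dataset; attribute names are illustrative only."""

    def __init__(self, sql, credentials):
        self._sql = sql
        # Kept on the instance for connecting, but never described.
        self._connection_str = credentials["con"]

    def _describe(self) -> dict:
        # Only the query is reported; the connection string is left out.
        return {"sql": self._sql}

    def __repr__(self) -> str:
        args = ", ".join(f"{k}={v!r}" for k, v in self._describe().items())
        return f"{type(self).__name__}({args})"


ds = SketchSQLDataset(
    "select shuttle, shuttle_id from spaceflights.shuttles;",
    {"con": "postgresql://user:secret@host/db"},
)
print(ds)
```

The flip side is that any dataset whose _describe() returns sensitive values would also print them, so _describe() is the place to review when checking what gets exposed.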

ElenaKhaustova · 2024-07-18T13:42:07Z

Solved in #3987

ElenaKhaustova added the "Issue: Feature Request" label (new feature or improvement to existing feature) Jul 2, 2024

ElenaKhaustova added this to the Redesign the API for IO (catalog) milestone Jul 2, 2024

ElenaKhaustova mentioned this issue Jul 2, 2024

Pretty printing: DataCatalog.__repr__ #3981

Closed

ElenaKhaustova self-assigned this Jul 3, 2024

This was referenced Jul 3, 2024

Pretty printing dataset #3987

Merged

Pretty printing dataset with indentation #3991

Closed

Pretty printing catalog #3990

Merged

ElenaKhaustova closed this as completed Jul 18, 2024

github-actions bot mentioned this issue Aug 1, 2024

Monthly issue metrics report #4049

Closed
