Pretty printing: AbstractDataset.__repr__ #3980

Closed
ElenaKhaustova opened this issue Jul 2, 2024 · 7 comments
Labels
Issue: Feature Request New feature or improvement to existing feature


@ElenaKhaustova
Contributor

ElenaKhaustova commented Jul 2, 2024

Description

Parent ticket: #3913

Implement __repr__ for AbstractDataset to give datasets a clearer printed representation, and reuse it within DataCatalog.__repr__.

Context

#3913 (comment)
#1721

Possible Implementation

  • We already have an implementation of __str__ method for AbstractDataset based on the dataset's _describe which can be adjusted and moved to __repr__.
  • Update _describe for MemoryDataset, LambdaDataset, SharedMemoryDataset, and CachedDataset if needed.
  • One of the potential solutions is to extend the built-in pprint.PrettyPrinter.
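The idea in the list above can be sketched roughly as follows. This is a minimal illustration, not Kedro's actual code: `SketchDataset` and `SketchCSVDataset` are hypothetical stand-ins for `AbstractDataset` and a concrete dataset, and skipping `None` values is a choice made for this sketch only.

```python
from abc import ABC, abstractmethod
from pprint import pformat
from typing import Any, Optional


class SketchDataset(ABC):
    """Hypothetical stand-in for AbstractDataset (not Kedro's real class)."""

    @abstractmethod
    def _describe(self) -> dict[str, Any]:
        """Return the attributes that describe this dataset."""

    def __repr__(self) -> str:
        # Assumption of this sketch: unset (None) attributes are skipped.
        described = {k: v for k, v in self._describe().items() if v is not None}
        args = ", ".join(f"{key}={pformat(value)}" for key, value in described.items())
        return f"{type(self).__module__}.{type(self).__name__}({args})"


class SketchCSVDataset(SketchDataset):
    """Hypothetical concrete dataset used only to exercise __repr__."""

    def __init__(self, filepath: str, load_args: Optional[dict[str, Any]] = None):
        self._filepath = filepath
        self._load_args = load_args

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "load_args": self._load_args}


print(SketchCSVDataset("data/cars.csv", {"sep": ","}))
```

Because no `__str__` is defined here, `print(obj)` and `repr(obj)` give the same string.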
@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Jul 4, 2024

I've prototyped two different approaches for printing:

  • Both approaches implement the __repr__() method for AbstractDataset based on the _describe() method of the specific dataset. Since __str__ is not implemented, __repr__() is also called when converting the object to a string, so print(obj) and evaluating obj give the same result.
  • Pretty printing catalog #3990 - the first approach prints a dataset on one line in the format module.Class(arg_1=val_1, ..., arg_n=val_n), where each argument value is formatted with pprint.pformat and the formatted strings are then joined into the final result.

Screenshot 2024-07-05 at 00 27 12

  • Pretty printing dataset with indentation #3991 - the second approach represents the dataset as a dict, {'module.Class': {'arg_1': val_1, ..., 'arg_n': val_n}}: the dictionary is built first and then formatted with pprint.pformat, so we can control indentation, but the result looks less compact.

Screenshot 2024-07-05 at 00 16 11

  • For both approaches we can control depth, so arguments nested beyond a certain level can be hidden.
  • Both look representative enough compared to what we had before; the first is more compact, while the second keeps the indentation.
  • An alternative option is to expose _pretty_repr and let users customise width, indentation and depth.
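To make the comparison concrete, here is a rough sketch of both formats using only `pprint.pformat`. The helper names `one_line_repr` and `nested_repr` are hypothetical and are not taken from #3990 or #3991; `depth` is the standard `pformat` parameter that abbreviates containers nested beyond the given level.

```python
from pprint import pformat
from typing import Any, Optional


def one_line_repr(class_path: str, description: dict[str, Any],
                  depth: Optional[int] = None) -> str:
    """First approach: format each value, then join into Class(arg=val, ...)."""
    args = ", ".join(
        f"{key}={pformat(value, depth=depth, sort_dicts=False)}"
        for key, value in description.items()
    )
    return f"{class_path}({args})"


def nested_repr(class_path: str, description: dict[str, Any],
                indent: int = 1, depth: Optional[int] = None) -> str:
    """Second approach: build one dict, then let pformat lay it out."""
    return pformat({class_path: description}, indent=indent,
                   depth=depth, sort_dicts=False)


description = {
    "filepath": "cars.csv",
    "load_args": {"sep": ",", "na_values": ["#NA"]},
}
print(one_line_repr("pandas.CSVDataset", description))
print(one_line_repr("pandas.CSVDataset", description, depth=1))
print(nested_repr("pandas.CSVDataset", description, depth=2))
```

In the first helper `depth` applies to each argument value separately, while in the second it counts from the outer dictionary, so the same number hides different levels in each approach.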

I'm curious about what you think. Does it feel good enough? Do we want more or less information provided? Which approach seems better?

@noklam
Contributor

noklam commented Jul 5, 2024

One caveat if you are using pprint: kedro-org/vscode-kedro#33

@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Jul 8, 2024

The non-indented version was preferred over the indented one, so I'm closing #3991.

@datajoely
Contributor

datajoely commented Jul 9, 2024

I think this is great! Inspired by @astrojuanlu's post about sklearn (shared as a Slack link; Linen is not working for me), I had a think about what we could do with __rich_repr__ to provide a more beautiful experience in interactive environments.

The main bits of novelty are:

  • YAML representations of the load/save arguments
  • Example of how to import the class explicitly for debugging
  • Pulled out the first part of the docstring.
  • Generated a hyperlink to the docs.kedro.org definition
  • I forgot to add file path, but that part is trivial
Code

import importlib
import re
import sys
import yaml
import kedro
from kedro.io.data_catalog import DataCatalog

csv = """
cars:
  type: pandas.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ","
    na_values: ["#NA", NA]
  save_args:
    index: False
    date_format: "%Y-%m-%d %H:%M"
    decimal: .
"""

catalog = DataCatalog.from_config(yaml.safe_load(csv))

def find_shortest_import_path(class_path):
    parts = class_path.split('.')
    class_name = parts[-1]
    
    # Iterate over the possible module paths
    for i in range(1, len(parts)):
        module_path = ".".join(parts[:i])
        try:
            module = importlib.import_module(module_path)
            # If the module contains the class name, return the shortest path
            if hasattr(module, class_name):
                return f"{module_path}.{class_name}"
        except ModuleNotFoundError:
            continue
    
    # If no shorter path found, return the full class path
    return class_path


def generate_docs_url(class_path: str) -> str:
    base_module = 'kedro_datasets'
    if class_path.startswith('kedro_datasets'):
        try:    
            split_class_path = find_shortest_import_path(class_path).split('.')
            module_import = split_class_path[0]
            module_install = base_module.replace('_','-')
            version = importlib.import_module('kedro_datasets').__version__
            short_class_path = f"{split_class_path[1]}.{split_class_path[2]}"
            prefix = "https://docs.kedro.org/projects"
            module_loc = f"{module_install}-{version}"
            class_loc = f"{module_import}.{short_class_path}.html"
            url = f"{prefix}/{module_install}/en/{module_loc}/api/{class_loc}"
            return url
        except ModuleNotFoundError:
            return None
    return None

import rich
from rich import box
from rich.layout import Layout
from rich.syntax import Syntax
from rich.table import Table

def build_rich_repr(ds: kedro.io.AbstractDataset):
    load_args = yaml.safe_dump({"load_args":getattr(ds,'_load_args')}).strip()
    save_args = yaml.safe_dump({"save_args":getattr(ds,'_save_args')}).strip()
    # Raw string avoids an invalid "\s" escape; a missing docstring falls back to "".
    help_message = re.sub(r"(\n|\t|\s{1,5})", " ", (ds.__doc__ or "").split('\n\n')[0].replace('`',''))
    class_path = f"{ds.__module__}.{ds.__class__.__name__}"
    docs_url = generate_docs_url(class_path)
    import_statement = f"from {ds.__module__} import {ds.__class__.__name__}"
    layout = Layout()
    theme = dict(theme="default") if 'ipykernel' in sys.modules else dict()
    syntax_args = dict(padding=1,  line_numbers=True, word_wrap=True, **theme)
    load_args_r = Syntax(load_args, "yaml", **syntax_args)
    save_args_r = Syntax(save_args, "yaml", **syntax_args)
    import_statement_r = Syntax(import_statement, "python", **syntax_args)
    
    t = Table(
        "Attribute", "Value",
        padding=1,
        show_header=False,
        box=box.SIMPLE
    )
    t.add_row("Class documentation", f'[b][link={docs_url}]{class_path}[/link][/b]')
    t.add_row("Docstring snippet", help_message)
    t.add_row("Load arguments", load_args_r)
    t.add_row("Save arguments", save_args_r)
    t.add_row("Import statement\n[i](useful for REPL testing)[/]", import_statement_r)
    return t

rich.print(build_rich_repr(catalog.datasets.cars))

Would yield this in Jupyter

image

Or this in IPython:

image

Or this in the default macOS terminal (the hyperlink doesn't work here):

image
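The "docstring snippet" step can also be written with the standard library's `inspect.getdoc`, which normalises indentation and returns `None` when there is no docstring. The helper below is a hypothetical sketch of that idea (the name `first_docstring_paragraph` is not part of Kedro or of the code above):

```python
import inspect
import re


def first_docstring_paragraph(obj: object) -> str:
    """Return the first paragraph of obj's docstring as one line of text."""
    doc = inspect.getdoc(obj) or ""  # empty string when no docstring exists
    first = doc.split("\n\n", 1)[0].replace("`", "")
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", first).strip()


class Demo:
    """``Demo`` loads data
    from a file.

    More details here.
    """


print(first_docstring_paragraph(Demo()))
```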

@datajoely
Contributor

One other piece - does your implementation @ElenaKhaustova expose encrypted database connection strings?

would this YAML:

shuttle_id_dataset:
  type: pandas.SQLQueryDataSet
  sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
  credentials: db_credentials
  layer: raw

render out the constructed connection string or not?

@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Jul 9, 2024

> One other piece - does your implementation @ElenaKhaustova expose encrypted database connection strings?
>
> would this YAML:
>
> shuttle_id_dataset:
>   type: pandas.SQLQueryDataSet
>   sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
>   credentials: db_credentials
>   layer: raw
>
> render out the constructed connection string or not?

What is exposed depends on the implementation of _describe(): https://github.com/kedro-org/kedro-plugins/blob/be99fecf6cf5ac8f6a0a717c56b06dbc148b26eb/kedro-datasets/kedro_datasets/pandas/sql_dataset.py#L558

  def _describe(self) -> dict[str, Any]:
      load_args = copy.deepcopy(self._load_args)
      return {
          "sql": str(load_args.pop("sql", None)),
          "filepath": str(self._filepath),
          "load_args": str(load_args),
          "execution_options": str(self._execution_options),
      }

So neither the credentials nor the connection string is exposed.
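The general pattern can be illustrated with a small, self-contained sketch. `SketchSQLDataset` and its `_connection_str` attribute are hypothetical and only mimic the idea; they are not the kedro-datasets implementation. The point is that a `__repr__` built on `_describe()` can only show what `_describe()` returns.

```python
from typing import Any


class SketchSQLDataset:
    """Hypothetical dataset, used only to illustrate the _describe() contract."""

    def __init__(self, sql: str, credentials: dict[str, str]):
        self._sql = sql
        # Kept privately and deliberately left out of _describe().
        self._connection_str = credentials["con"]

    def _describe(self) -> dict[str, Any]:
        return {"sql": self._sql}

    def __repr__(self) -> str:
        args = ", ".join(f"{key}={value!r}" for key, value in self._describe().items())
        return f"{type(self).__name__}({args})"


ds = SketchSQLDataset(
    sql="select shuttle, shuttle_id from spaceflights.shuttles;",
    credentials={"con": "postgresql://user:secret@localhost/db"},
)
print(ds)
```

The flip side is that any dataset whose `_describe()` does return a connection string would print it, so custom datasets need the same care.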

astrojuanlu moved this from In Progress to In Review in Kedro Framework Jul 10, 2024
@ElenaKhaustova
Contributor Author

Solved in #3987
