Skip to content

🎉 Make graphs a first-class citizen in ETL#5530

Closed
pabloarosado wants to merge 33 commits into
masterfrom
wip-prototype-graph-metadata
Closed

🎉 Make graphs a first-class citizen in ETL#5530
pabloarosado wants to merge 33 commits into
masterfrom
wip-prototype-graph-metadata

Conversation

@pabloarosado

@pabloarosado pabloarosado commented Jan 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Create a new graph channel that allows metadata from charts and multidims to live in etl as first-class citizens.

Context

Currently, etl allows you to define metadata for datasets, tables, and indicators. Charts' metadata has never had a place to live in etl, even though they often include some of our most important, public-facing metadata.

Since recently, we can also create metadata for multidims, which bridges that gap somewhat. But they have always lived in a bit of a mixed territory, as "export" steps, partially in ETL and partially in the database.

In this PR I have prototyped an alternative architecture, where charts and multidims can live fully in etl.

Benefits

  • All of our most important metadata can now live in the etl, instead of being spread between etl and database.
  • After a dataset update, Indicator upgrader would not be needed. We would just need to update the DAG, and running the graph steps would automatically update the indicator ids in charts.
    • NOTE: I haven't tested this update workflow, but theoretically it should work.
  • Charts and mdim steps are executed in the same way, namely: etlr slug --graph
  • This opens the doors to streamlining the drafting and creation of charts and mdims from etl (with some wizard). Alternatively, if anyone creates charts (and eventually mdims) from the grapher admin, this also makes it easier to pull those charts into etl.
  • For charts that require some manual work after an update, we can now have a .py script that would make those changes. See graph://emissions/latest/per-capita-co2-vs-average.

Important notes

  • All code has been a back and forth with Claude, so it would need to be reviewed thoroughly if we decide to go ahead.
  • I've tried to touch as little as possible of the existing core code, to ensure nothing breaks.
  • Because of the previous point, there's some logic duplication; if we like this prototype, we could refactor things a bit in collection modules to import some of that logic to grapher/graph.
  • I'm not sure if graph is the best name, but I wanted to avoid, e.g. chart, or grapher, or other things that could cause future headaches given the status quo.

How it works

Graph steps live in etl/steps/graph/ and are defined in dag/graph/*.yml DAG files. They can be:

  1. Metadata-only: Simple .meta.yml files for chart configs or multidim configs.
  2. Metadata + code: A .meta.yml file and a .py file for complex chart or mdim steps.

I replicated some existing charts and mdims. Feel free to run them:

etlr banning-of-chick-culling --graph

Make changes in the yaml file, or in the admin, and see what happens.
You can check chart-diff for changes.

For mdims (covid cases and covid deaths), you can run:

etlr covid-cases --graph
etlr covid-deaths --graph

TODO

If we like the prototype, the following things (at least) would need to happen (they could happen in a separate PR):

  • Clean up code and refactor to avoid logic duplication with collection modules.
  • Check if all relevant fields in charts and multidims are properly handled.
  • Adapt VSCode extension to make graph steps clickable in the DAG.
  • Adapt step upgrader to update dependencies in graph steps.
  • Neither Claude nor myself could easily figure out the logic of mdim slugs. So this needs to be decided.
  • We could consider doing a big migration of charts and mdims to become graphs, or have some tooling to let users migrate individual charts and mdims easily in upcoming updates.

@owidbot

owidbot commented Jan 21, 2026

Copy link
Copy Markdown
Contributor

Quick links (staging server):

Site Dev Site Preview Admin Wizard Docs

Login: ssh owid@staging-site-wip-prototype-graph-metadata

chart-diff: ✅
  • 2/2 reviewed charts
  • Modified: 2/2
  • New: 0/0
  • Rejected: 0
  • Data changes: 1
  • Metadata changes: 0
data-diff: ✅ No differences found

Automatically updated datasets matching excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2026-01-23 10:56:39 UTC
Execution time: 6.06 seconds

@pabloarosado pabloarosado changed the title 🚧 Prototype easy graph metadata 🎉 Make graphs a first-class citizen in ETL Jan 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new graph channel that makes charts and multidimensional visualizations first-class citizens in the ETL pipeline, alongside datasets, tables, and indicators. This allows chart metadata to be version-controlled in ETL rather than living solely in the database.

Changes:

  • Adds a new GraphStep class that manages chart metadata in the grapher database
  • Implements support for both YAML-only charts and charts with Python logic for dynamic configuration
  • Introduces paths.create_graph() helper method for creating charts from Python steps
  • Adds CLI flags --graph, --graph-push, and --graph-pull for managing graph steps
  • Includes example implementations replicating existing charts and multidims (animal welfare, COVID, emissions)

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
etl/steps/__init__.py Implements GraphStep class with run(), is_dirty(), and checksum logic
etl/grapher/graph.py Core logic for upserting charts, pulling from DB, expanding indicator paths, and conflict detection
etl/helpers.py Adds create_graph() method to PathFinder and updates channel detection for graph steps
etl/version_tracker.py Updates path resolution to support graph steps with .meta.yml files
etl/config.py Adds GRAPH, GRAPH_PUSH, and GRAPH_PULL configuration flags
etl/command.py Integrates --graph flag into CLI, adds graph steps to DAG construction, updates validation logic
dag/main.yml Includes new graph-specific DAG files
dag/graph/*.yml Example DAG files defining graph steps for animal welfare, COVID, and emissions
etl/steps/graph/**/*.meta.yml Example YAML metadata files for charts
etl/steps/graph/emissions/latest/per-capita-co2-vs-average.py Example Python step with dynamic threshold calculation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread etl/steps/__init__.py
if "slug" not in metadata:
raise ValueError(
f"Multidim graph step {self.path} requires 'slug' in metadata.\n"
f"Add to {self.metadata_path}:\n"

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message references self.metadata_path but the GraphStep class doesn't have a metadata_path property, only a metadata_file property. This will cause an AttributeError at runtime. Change self.metadata_path to self.metadata_file.

Suggested change
f"Add to {self.metadata_path}:\n"
f"Add to {self.metadata_file}:\n"

Copilot uses AI. Check for mistakes.
Comment thread etl/grapher/graph.py
Comment on lines +796 to +798
# Apply string substitution if yaml_params provided
if yaml_params:
content = content.format(**yaml_params)

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Python's str.format() method for YAML template substitution is potentially unsafe. If an attacker controls the YAML file content, they could inject format string exploits using expressions like {__import__('os').system('malicious_command')}. Consider using a safer templating approach such as Jinja2 with auto-escaping, or restrict yaml_params to only allow simple string replacements in specific, known locations in the template.

Copilot uses AI. Check for mistakes.
Comment thread etl/steps/__init__.py

source_checksum = self.checksum_input()

# Upsert graph to database (graph steps require --graph flag to run)

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "graph steps require --graph flag to run" but this is misleading. Graph steps can be defined and their run() method can be called, but they won't be included in the DAG unless the --graph flag is passed to the etl command (see etl/command.py:490-491). The --graph flag determines whether graph steps are included in the DAG, not whether they can run. Consider updating the comment to clarify: "Upsert graph to database (graph steps are only included in DAG when --graph flag is set)".

Suggested change
# Upsert graph to database (graph steps require --graph flag to run)
# Upsert graph to database (graph steps are only included in DAG when --graph flag is set)

Copilot uses AI. Check for mistakes.
- 0
- '{global_per_capita_threshold}'
- 100000000
columnSlug: '1119914'

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The columnSlug field is hardcoded to '1119914', which appears to be a specific variable ID. This creates a tight coupling to the database state and will break if the variable is recreated or if the chart is used in different environments. Consider either removing this field and letting the grapher auto-populate it from the dimensions, or implementing logic to dynamically resolve it similar to how dimensions are handled. If this field is required for map visualizations, document why it's necessary and how to keep it in sync.

Suggested change
columnSlug: '1119914'

Copilot uses AI. Check for mistakes.
Comment thread etl/command.py
Comment on lines 403 to 409
includes=includes,
excludes=excludes,
grapher=grapher,
graph=graph,
export=export,
private=private,
only=only,

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graph steps require database access to upsert charts, but there's no sanity check for database configuration when --graph flag is used. Consider adding a database sanity check similar to how it's done for --grapher flag. The main() function (outside these changed lines) calls sanity_check_db_settings() when grapher=True but not when graph=True. This could lead to confusing errors if users try to run graph steps without proper database configuration.

Copilot uses AI. Check for mistakes.
Comment thread etl/helpers.py Outdated
Comment on lines 299 to 307
def channel(self) -> CHANNEL:
if self._is_snapshot_file:
return "snapshot" # type: ignore
# Check if this is a graph step: etl/steps/graph/namespace/version/slug
step_dir_parent = self.f.parent.parent.parent.name
if step_dir_parent == "graph":
return "graph" # type: ignore
# Data steps: etl/steps/data/channel/namespace/version/slug
return self.f.parent.parent.parent.name # type: ignore

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The channel property returns "graph" as a CHANNEL type, but "graph" is not included in the CHANNEL Literal type definition in owid.catalog. The CHANNEL type only includes: "snapshot", "garden", "meadow", "grapher", "open_numbers", "examples", "explorers", "external", and "multidim". This will cause type errors in type-checked code. Either add "graph" to the CHANNEL type definition in the catalog library, or use a union type that includes both CHANNEL and a graph-specific type.

Copilot uses AI. Check for mistakes.
Comment thread etl/command.py Outdated
Comment thread etl/steps/__init__.py
Comment on lines +1149 to +1447
class GraphStep(Step):
"""
A step that manages chart metadata in the grapher database.

Graph steps create or update charts based on indicator dependencies.
Single-indicator graphs can inherit metadata from the indicator's grapher_config.
Multi-indicator graphs require explicit metadata in a .meta.yml file.

Example:
graph://fur-farming-ban
"""

path: str
dependencies: List[Step]

def __init__(self, path: str, dependencies: List[Step]) -> None:
# path is now "namespace/version/short_name" like data steps
self.path = path
self.dependencies = dependencies

def __str__(self) -> str:
return f"graph://{self.path}"

@property
def slug(self) -> str:
"""Extract slug from path (namespace/version/slug -> slug)."""
return self.path.split("/")[-1]

@property
def _step_dir(self) -> Path:
"""Directory containing the step files."""
return paths.STEP_DIR / "graph" / self.path

@property
def metadata_file(self) -> Optional[Path]:
"""Path to .meta.yml file."""
meta_file = self._step_dir.with_suffix(".meta.yml")
return meta_file if meta_file.exists() else None

@property
def _python_file(self) -> Optional[Path]:
"""Path to .py file if it exists."""
py_file = self._step_dir.with_suffix(".py")
if py_file.exists():
return py_file
# Check for __init__.py in directory
init_file = self._step_dir / "__init__.py"
return init_file if init_file.exists() else None

def run(self) -> None:
"""Run graph step - either Python file or YAML-only."""
from etl import config

# If GRAPH_PULL is enabled, pull from database instead of pushing
if config.GRAPH_PULL:
self._pull_from_database()
return

# Normal flow: push to database
# Check if there's a Python file
py_file = self._python_file
if py_file:
# Has Python code - run it
self._run_python_step(py_file)
else:
# YAML-only - use existing logic
self._run_yaml_only()

def _run_python_step(self, py_file: Path) -> None:
"""Run Python module for graph step."""
# Import here to avoid circular imports
from importlib import import_module

# Import and run the module
if config.DEBUG:
# Run in isolated environment (same process, but isolated imports)
with isolated_env(py_file.parent):
module_path = py_file.relative_to(paths.BASE_DIR).as_posix().replace("/", ".").replace(".py", "")
step_module = import_module(module_path)
run_module_run(step_module, dest_dir=None) # Graph steps don't have dest_dir
else:
# Run in subprocess (safer, but slower)
args = ["uv", "run", "etl", "d", "run-python-step"]
if config.IPDB_ENABLED:
args.append("--ipdb")
args.extend([str(self), ""]) # Empty string for dest_dir

env = os.environ.copy()
env["PATH"] = os.path.expanduser("~/.cargo/bin") + ":" + env["PATH"]

try:
subprocess.check_call(args, env=env)
except subprocess.CalledProcessError as e:
sys.exit(e.returncode)

def _run_yaml_only(self) -> None:
"""Create graph from YAML metadata - either a single chart or multidimensional collection."""
# Load metadata to determine if this is a multidim collection or normal chart
if not self.metadata_file or not self.metadata_file.exists():
raise ValueError(f"Metadata file not found for {self}")

from etl.grapher.graph import _load_metadata_file

metadata = _load_metadata_file(self.metadata_file)

# Determine if this is a multidimensional collection
if self._is_multidim_collection(metadata):
self._create_multidim_collection(metadata)
else:
self._create_single_chart(metadata)

@staticmethod
def _is_multidim_collection(metadata: dict) -> bool:
"""
Determine if metadata defines a multidimensional collection.

A multidimensional collection has:
- A 'views' array defining multiple chart variants
- Dimensions with choices that users can select between

Examples:
# Multidim: Has views array
views:
- dimensions: {indicator: total}
indicators: {...}
- dimensions: {indicator: people}
indicators: {...}

# Normal chart: No views, just dimensions for indicator mapping
dimensions:
- property: y
catalogPath: grapher/.../indicator
"""
return "views" in metadata and isinstance(metadata["views"], list)

def _create_multidim_collection(self, metadata: dict) -> None:
"""Create a multidimensional collection from YAML config."""
from etl.collection.core.create import create_collection
from etl.dag_helpers import load_dag
from etl.grapher.graph import _expand_indicator_path

# Load DAG to get dependencies
dag = load_dag()
step_name = str(self)
dep_uris = list(dag.get(step_name, []))

# Expand short indicator names in views to full catalog paths
first_indicator_path = None
if "views" in metadata:
for view in metadata["views"]:
if "indicators" in view:
for prop, indicators in view["indicators"].items():
for i, indicator in enumerate(indicators):
if "catalogPath" in indicator:
indicator["catalogPath"] = _expand_indicator_path(indicator["catalogPath"], dep_uris)
# Capture first indicator path for the collection catalog_path
if first_indicator_path is None:
first_indicator_path = indicator["catalogPath"]

# Collection slug for multidim (without version)
# Must be set explicitly in metadata as 'slug'
# Format: namespace/base_dataset#collection_name
# Example: covid/covid#covid_deaths
if "slug" not in metadata:
raise ValueError(
f"Multidim graph step {self.path} requires 'slug' in metadata.\n"
f"Add to {self.metadata_path}:\n"
f" slug: namespace/base#collection"
)

# Build full catalog_path by inserting version from step path
parts = self.path.split("/")
if len(parts) >= 3:
version = parts[1]
slug = metadata["slug"]
# Insert version: covid/covid#covid_deaths → covid/latest/covid#covid_deaths
if "/" in slug:
namespace, rest = slug.split("/", 1)
catalog_path = f"{namespace}/{version}/{rest}"
else:
catalog_path = slug
else:
catalog_path = metadata["slug"]

c = create_collection(
config_yaml=metadata,
dependencies=dep_uris,
catalog_path=catalog_path,
)
c.save()

def _pull_from_database(self) -> None:
"""Pull chart configuration from database and write to .meta.yml file."""
from etl.dag_helpers import load_dag
from etl.grapher.graph import pull_graph

# Get metadata file path (create if doesn't exist)
metadata_file = self._step_dir.with_suffix(".meta.yml")

# Get dependencies from DAG
dag = load_dag()
step_name = str(self)
dep_uris = list(dag.get(step_name, []))

# Pull from database
pull_graph(
slug=self.slug,
metadata_file=metadata_file,
dependencies=dep_uris,
)

def _create_single_chart(self, metadata: dict) -> None:
"""Create a single chart from YAML config."""
from etl import config
from etl.dag_helpers import load_dag
from etl.grapher.graph import upsert_graph

# Load DAG to get original dependency strings
dag = load_dag()
step_name = str(self)
dep_uris = list(dag.get(step_name, []))

if not dep_uris:
raise ValueError(
f"No dependencies found for {step_name} in DAG. "
f"Graph steps must have at least one indicator dependency."
)

source_checksum = self.checksum_input()

# Upsert graph to database (graph steps require --graph flag to run)
upsert_graph(
slug=self.slug,
metadata_file=self.metadata_file,
dependencies=dep_uris,
source_checksum=source_checksum,
graph_push=config.GRAPH_PUSH,
)

def is_dirty(self) -> bool:
"""
Check if graph needs update based on:
- Local index.json doesn't exist or checksum differs
- (If --graph flag) Database chart doesn't exist or has different checksum
"""
from etl import config
from etl.grapher.graph import _load_graph_metadata

# 1. Check if local index.json exists
metadata = _load_graph_metadata(self.slug)
if not metadata:
return True # No metadata file, needs to be created

# 2. Check if any dependency changed
if any(d.is_dirty() for d in self.dependencies):
return True

# 3. Compare local checksum with expected checksum
local_checksum = metadata.get("local_checksum")
expected_checksum = self.checksum_input()

if local_checksum != expected_checksum:
return True

# 4. If --graph flag is set, also check if database needs updating
if config.GRAPH:
from etl.grapher.graph import fetch_graph_db_checksum, has_db_divergence

db_checksum = fetch_graph_db_checksum(self.slug)
if db_checksum != expected_checksum:
return True

# Check if database has diverged from what we last wrote
# This queries the DB to detect manual edits (similar to grapher steps)
if has_db_divergence(self.slug):
return True

return False

def checksum_input(self) -> str:
"""
Return checksum of all inputs to this graph step.
Similar to DataStep.checksum_input() but for graph metadata.
"""
from etl.grapher.graph import calculate_source_checksum

# Convert Step dependencies to URIs
dep_uris = [str(d) for d in self.dependencies]

return calculate_source_checksum(
dependencies=dep_uris,
metadata_file=self.metadata_file,
)

def checksum_output(self) -> str:
"""Return checksum representing the graph step's state."""
return self.checksum_input()


Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new GraphStep class and its associated functionality in etl/grapher/graph.py lack test coverage. The repository has comprehensive tests for other step types (DataStep, SnapshotStep, etc. in tests/test_steps.py), but there are no tests for GraphStep. Consider adding tests to cover: 1) GraphStep creation and initialization, 2) metadata loading and parsing with yaml_params, 3) indicator path expansion logic, 4) is_dirty() detection with various checksum scenarios, 5) error handling for missing dependencies and invalid metadata, 6) pull_graph() functionality.

Copilot uses AI. Check for mistakes.
Comment thread etl/grapher/graph.py
if chart_exists:
log.info("graph.update_chart", chart_id=chart.id, inheritance=is_inheritance_enabled)
config["isInheritanceEnabled"] = is_inheritance_enabled
result = admin_api.update_chart(

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable result is not used.

Suggested change
result = admin_api.update_chart(
admin_api.update_chart(

Copilot uses AI. Check for mistakes.
Comment thread etl/grapher/graph.py
if "#" not in catalog_path:
# Just indicator name - need to find which table(s) contain it
indicator = catalog_path
table_name = None

Copilot AI Jan 23, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable table_name is not used.

Suggested change
table_name = None

Copilot uses AI. Check for mistakes.
@lucasrodes

lucasrodes commented Jan 23, 2026

Copy link
Copy Markdown
Member

Hey Pablo, thanks for working on this. I think there is definitely value in adding charts to our ETL.

I have various comments on the implementation. I haven't been able to review all the files because there are a substantial number of lines of code. But I can have a better look next week.

I know this is a prototype and a kickstart, and I guess you are not expecting a full review (from what we discussed yesterday). Still, I'd like to point out a couple of design aspects I'd handle differently.

A general point I want to make, I suppose, is that we should not rush into finding a solution. I don't think there's urgency at the moment. There are many building blocks and ways to solve this problem, and we should aim to nail it, if possible. I'd suggest shaping this a bit in terms of architectural design (happy to help) & then do incremental steps towards a good & robust solution. A good strategy I like is to start with a "foundational layer" and then develop higher layers of abstraction. This also helps reviewers in general. This is what I did for etl.collections, and more recently with owid-catalog, and I've had a good experience.

So, my suggestion is: (i) let's have a shaping session to envision what we want, and then (ii) draft a roadmap with the different steps to get there. We don't need to do it all in one.

Still, I'll leave some general thoughts below:

  • I would reconsider how MDIMs/Explorers and Charts co-live in ETL.
    • This PR proposes a different channel for Charts, and I think that is not ideal. I would favour re-using our export:// space, maybe with a new channel 'views', 'charts', or so.
    • It seems that your motivation for not using export:// steps is (I quote): "But they have always lived in a bit of a mixed territory, as 'export' steps, partially in ETL and partially in the database". I guess I'd like to better understand this angle.
    • I think that the bar for creating a new prefix should be very high, to avoid confusion. I think the organization we'd use for charts should be similar (if not the same) as with MDIMs/Explorers
    • The same way we can choose how we display a Collection (as an Explorer or an MDIM), I think we should be able to present a collection as a single chart (if there's only one indicator). So we should aim at integrating this logic into Collections, or just redefine that space. But the user experience should feel seamless.
    • I think graph:// is unnecessarily close to grapher://. The first time I saw the term I was a bit confused, as in I thought it was a graph network. I'd suggest using different terms like "chart", "view", or even "viz"/"dataviz". The term "graph" is overused in many contexts, and it's not always clear what it means.
  • ETL-DB sync
    • Keeping charts in ETL and DB in sync is a bit tricky (not impossible!), so this probably merits involving devs too.
    • Not sure how we would identify charts in ADMIN. We currently use IDs in lots of places. Who would decide those IDs? I just fear this could introduce breaking changes to our setup (including the staging servers). Not impossible, but good to do baby steps.
    • We probably need dev knowledge here.
  • If we want charts to be a first-class citizen in ETL, we need a proper object.
    • We do have objects for Datasets, Collections. I'd suggest creating a class View or something similar with the metadata in the correct format, etc. Maybe we could even sync the object View that Collection uses (c.views) with this somehow.
    • This is essential for bringing structure to our workflow. We do this in collections to assist us in detecting correct vs. incorrect configs, performing class operations, etc.
    • There are some details like paths.create_collection returning an actual Collection object, paths.create_dataset returning a Dataset, but paths.create_graph returning an integer. I think this is confusing, and we should try to maintain a similar user interface. There are more examples of this. This could be solved by having proper objects.

@lucasrodes lucasrodes left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving some side comments. Most of them addressed in my comment before.

Comment on lines +3 to +4
graph://animal_welfare/latest/fur-farming-ban:
- data://grapher/animal_welfare/2024-12-17/fur_laws

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a strong preference for avoiding graph://. It's unnecessarily close to grapher://.

Before creating a new prefix, I'd re-consider what we already have and why we would or would not use them here. E.g. data://, export://, etc.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure about "graph", but then I thought that:

  • "view" is confusing; we use this word in analytics a lot for visits (it's already problematic to talk about "the number of views in explorer and mdim views").
  • "chart" would be confusing, because it would sound as if mdims are not included.
  • "viz" could be, though it's a bit informal; "visualization" is inconveniently long.

We could reuse "export" but I think that doesn't fulfil the promise of making graphs a first-class citizen. It sounds as if it's creating a thing outside of etl.

We can brainstorm some alternatives!

@lucasrodes lucasrodes Jan 23, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using export://
Sure. I'm then just confused with the purpose of export://. We have been using it just for explorers and MDIMs, which I think should live in the same space as charts.

In my opinion, explorers, MDIMs, and charts should probably live under the same prefix/channel/namespace.

If we agreed on that, then using a new prefix like graph:// is effectively just a renaming of export://.

Using graph://
I understand your opposition to view, chart, and viz. Especially against "view", which I also find confusing.

"chart," I think, is perfectly fine. For all intents and purposes, MDIMs, Explorers, and grapher charts are indeed just charts. Some have a single view (grapher charts), others have multiple views (Explorers/MDIMs). But all can be considered charts. I think we may be disagreeing on the terminology here, and maybe we need to better define it.

"viz" is fine, though maybe a bit too casual.

Anyway, I am happy to find another term and go with it.

But I really wouldn't go with graph.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding export steps, they are meant to be steps that write outside of etl (we also have export://github and export://s3). They came after the external steps, which are steps that are read from outside of etl.

I agree with having charts and mdims in the same place, that's part of the motivation of this prototype. My preferred choice would be to take mdims out of export, but we can discuss this further.

I like "chart" too, and it's probably the best word. I just decided against it to avoid confusion with the meaning of "chart" under the status quo. But if you think it's significantly better, I'm up for it. We can decide on Monday.

Comment thread dag/main.yml
Comment on lines +807 to +809
- dag/graph/animal_welfare.yml
- dag/graph/covid.yml
- dag/graph/emissions.yml

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new strategy, where we would create new dag yaml files not only based on topic domain but also based on "step type".

I'm not sure now. I'd start small with just one yaml file like dag/charts.yml or so, before committing to multiple files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started indeed with just a graphs.yml file. Then I realised that this would be unrealistic, since it would blow up very quickly. Breaking it out into namespaces felt much more scalable.
Then I also thought that adding charts to existing namespace dag files (mixing data steps and graph steps) would be inconvenient. That's why I created the separate dag files.
No strong opinion, whatever is most convenient, this is easy to change.

Comment thread etl/grapher/graph.py
Comment on lines +54 to +93

# Fields that are auto-managed by the database
DB_MANAGED_FIELDS = {
"id",
"version",
"createdAt",
"updatedAt",
"publishedAt",
"lastEditedAt",
"lastEditedByUserId",
"publishedByUserId",
}

# Fields managed by ETL (not manual overrides)
ETL_MANAGED_FIELDS = {
"dimensions", # Variable IDs managed from catalogPath
"isInheritanceEnabled", # Controlled by ETL config
}

# Schema and metadata fields
METADATA_FIELDS = {
"$schema",
"slug",
"isPublished",
"isIndexable",
"relatedQuestions",
}

# User-editable content fields to check for conflicts
# These are the fields that users typically edit in the admin UI
# and that we want to detect when checking for manual overrides
USER_CONTENT_FIELDS = {
"title",
"subtitle",
"note",
"chartTypes",
"hasMapTab",
"tab",
"selectedEntityNames",
}

@lucasrodes lucasrodes Jan 23, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these definitions I think should be outside of the python file, in an external yaml file that a dev or whoever can easily update. Or best case: programmatically.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, don't worry about the code for now, it's currently a Claude monster, it would need to be properly cleaned and refactored.

@@ -0,0 +1,24 @@
map:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it can be confusing that there is no reference to the indicator being displayed here.

My understanding is that this would come from the DAG file?

I think this can be a bit confusing, as there would be like two (distant) places where one edits the chart config sort of.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first idea was to include indicators in the DAG. Then I realised that this was problematic because (1) for some graphs we have many indicators, (2) it implies too many changes, given that the main unit in our DAG is the dataset. So I decided to just keep the DAG just for datasets.

The indicator is specified in the dimensions (the indicator name happens to be status), and follows the same logic as collections: you can name just the indicator, or the table_name#indicator, or the full catalog path. This way we avoid hardcoding versions in our steps.

dimensions:
  - property: y
    catalogPath: status

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Nice!

Did you reuse some of the logic from collections for this? I remember breaking a bit my brain to resolve indicator names and convert them into full catalog paths given the step dependencies.

While the MDIM config was not showing the complete catalog path, the config submitted to the DB was. This was resolved on Collection.save(). Is this similar in some way here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this I asked Claude to reuse the logic already existing in collections, but Claude said it wasn't possible without making changes to those modules (which I wanted to avoid for now). So Claude added some logic for this. I added a TODO to clarify that some of that logic could be reused from collections to avoid duplication.

@pabloarosado

Copy link
Copy Markdown
Contributor Author

Hey @lucasrodes, thanks a lot for the thorough feedback.

Maybe I wasn't clear enough in the description: This code has been mostly writen by Claude with my guidance. So I don't think it makes sense to have a thorough code review before we agree on the main ideas. It's a prototype to play around, with which we can see if this is a direction we want to go (that said, I tried to make sure to not break any existing code; so we could safely merge simply ignoring graph steps, and iterate).

Leaving aside code quality and specific implementation details, I think the core ideas bring several benefits. I'm a bit disappointed that you haven't addressed any of them. I'm wondering if disagree or think they are not valuable enough.

I'd be happy to discuss this further on Monday. I've added it to the agenda. Thanks!

@lucasrodes

lucasrodes commented Jan 23, 2026

Copy link
Copy Markdown
Member

Hey @pabloarosado, Sorry if my message before sounded dismissive! I am not opposing this at all; I actually appreciate that you took the time to think about how to integrate charts into ETL.

My message was mostly focused on the solution's design and architecture, rather than its motivation. I won't really argue the benefits, which I think are pretty nice, and I can just say that I like all of them! I can see how that was not obvious from my message, apologies!

I'm just weighing the technical challenges, and think that we should shape this a bit more before entering the solution space. But I could be wrong.

Before coding (or vibe coding, for the matter), I'd have a brainstorming/shaping session to discuss and sketch the solution's architecture. I think it'd be nice to solve this in the broader context of ETL (MDIMs, Explorers, Admin)... which is just tricky. My experience with etl.collections can be useful, and a lot of that work can be integrated with these features.

For the work on owid-catalog, I've found it very helpful to have a main tracking issue with different "milestones": #5386.

@pabloarosado

Copy link
Copy Markdown
Contributor Author

Thanks for the clarification @lucasrodes, I'm happy that you see the benefits of this approach.
The rest of this cycle for me is about experimenting and creating quick prototypes, and this is part of it. Having something to play with can give us some inspiration on how to improve our workflow. So this PR does not go against brainstorming, it should help us brainstorm further!

@pabloarosado

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@pabloarosado

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9e95e8917

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread etl/steps/__init__.py
Comment on lines +1231 to +1234
args = ["uv", "run", "etl", "d", "run-python-step"]
if config.IPDB_ENABLED:
args.append("--ipdb")
args.extend([str(self), ""]) # Empty string for dest_dir

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Support graph steps in run-python-step path

In non-DEBUG runs, graph steps invoke etl d run-python-step with a graph://... URI. The runner at etl/run_python_step.py only allows data, data-private, and export, so it raises a ValueError for graph steps. This means any graph step that has a .py file (e.g., the emissions example in this commit) will fail to execute unless DEBUG mode is enabled. Consider updating the runner to accept graph (or using a graph-specific runner) so subprocess execution works.

Useful? React with 👍 / 👎.

Comment thread etl/grapher/graph.py
Comment on lines +448 to +450
# Multiple dependencies - need to check for ambiguity
# For now, use table#indicator format
return f"{table}#{indicator}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve dataset disambiguation on graph pull

When there are multiple dependencies, _shorten_indicator_path always collapses to table#indicator. If two dependencies share the same table name (common when table name equals dataset name), the resulting .meta.yml loses the dataset qualifier. On the next --graph run, _expand_indicator_path will pick the first matching dataset (dependency order comes from a set and is not deterministic), which can point to the wrong variable or fail resolution. Consider emitting dataset/table#indicator whenever the table name is ambiguous across dependencies.

Useful? React with 👍 / 👎.

@github-actions

Copy link
Copy Markdown

This PR has had no activity within the last 30 days. It is considered stale and will be closed in 7 days if no further activity is detected.

@github-actions

github-actions Bot commented Apr 1, 2026

Copy link
Copy Markdown

This PR has had no activity within the last 30 days. It is considered stale and will be closed in 7 days if no further activity is detected.

@github-actions github-actions Bot added the stale label Apr 1, 2026
@github-actions github-actions Bot closed this Apr 9, 2026
@Marigold Marigold reopened this Apr 9, 2026
@github-actions github-actions Bot removed the stale label Apr 10, 2026
@lucasrodes

lucasrodes commented Apr 29, 2026

Copy link
Copy Markdown
Member

@pabloarosado shall we close in favor of #5969 ?

@pabloarosado

Copy link
Copy Markdown
Contributor Author

We can close this PR, and continue with the alternative approach: #5969

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants