🎉 Make graphs a first-class citizen in ETL by pabloarosado · Pull Request #5530 · owid/etl

pabloarosado · 2026-01-21T10:15:39Z

Summary

Create a new graph channel that allows metadata from charts and multidims to live in etl as first-class citizens.

Context

Currently, etl allows you to define metadata for datasets, tables, and indicators. Charts' metadata has never had a place to live in etl, even though they often include some of our most important, public-facing metadata.

Since recently, we can also create metadata for multidims, which bridges that gap somewhat. But they have always lived in a bit of a mixed territory, as "export" steps, partially in ETL and partially in the database.

In this PR I have prototyped an alternative architecture, where charts and multidims can live fully in etl.

Benefits

All of our most important metadata can now live in the etl, instead of being spread between etl and database.
After a dataset update, Indicator upgrader would not be needed. We would just need to update the DAG, and running the graph steps would automatically update the indicator ids in charts.
- NOTE: I haven't tested this update workflow, but theoretically it should work.
Charts and mdim steps are executed in the same way, namely: etlr slug --graph
This opens the doors to streamlining the drafting and creation of charts and mdims from etl (with some wizard). Alternatively, if anyone creates charts (and eventually mdims) from the grapher admin, this also makes it easier to pull those charts into etl.
For charts that require some manual work after an update, we can now have a .py script that would make those changes. See graph://emissions/latest/per-capita-co2-vs-average.

Important notes

All code has been a back and forth with Claude, so it would need to be reviewed thoroughly if we decide to go ahead.
I've tried to touch as little as possible of the existing core code, to ensure nothing breaks.
Because of the previous point, there's some logic duplication; if we like this prototype, we could refactor things a bit in collection modules to import some of that logic to grapher/graph.
I'm not sure if graph is the best name, but I wanted to avoid, e.g. chart, or grapher, or other things that could cause future headaches given the status quo.

How it works

Graph steps live in etl/steps/graph/ and are defined in dag/graph/*.yml DAG files. They can be:

Metadata-only: Simple .meta.yml files for chart configs or multidim configs.
Metadata + code: A .meta.yml file and a .py file for complex chart or mdim steps.

I replicated some existing charts and mdims. Feel free to run them:

etlr banning-of-chick-culling --graph

Make changes in the yaml file, or in the admin, and see what happens.
You can check chart-diff for changes.

For mdims (covid cases and covid deaths), you can run:

etlr covid-cases --graph
etlr covid-deaths --graph

TODO

If we like the prototype, the following things (at least) would need to happen (they could happen in a separate PR):

Clean up code and refactor to avoid logic duplication with collection modules.
Check if all relevant fields in charts and multidims are properly handled.
Adapt VSCode extension to make graph steps clickable in the DAG.
Adapt step upgrader to update dependencies in graph steps.
Neither Claude nor myself could easily figure out the logic of mdim slugs. So this needs to be decided.
We could consider doing a big migration of charts and mdims to become graphs, or have some tooling to let users migrate individual charts and mdims easily in upcoming updates.

owidbot · 2026-01-21T10:18:53Z

Quick links (staging server):

Site Dev	Site Preview	Admin	Wizard	Docs

Login: ssh owid@staging-site-wip-prototype-graph-metadata

chart-diff: ✅

2/2 reviewed charts
Modified: 2/2
New: 0/0
Rejected: 0
Data changes: 1
Metadata changes: 0

data-diff: ✅ No differences found

Edited: 2026-01-23 10:56:39 UTC
Execution time: 6.06 seconds

…-metadata

Copilot

Pull request overview

This PR introduces a new graph channel that makes charts and multidimensional visualizations first-class citizens in the ETL pipeline, alongside datasets, tables, and indicators. This allows chart metadata to be version-controlled in ETL rather than living solely in the database.

Changes:

Adds a new GraphStep class that manages chart metadata in the grapher database
Implements support for both YAML-only charts and charts with Python logic for dynamic configuration
Introduces paths.create_graph() helper method for creating charts from Python steps
Adds CLI flags --graph, --graph-push, and --graph-pull for managing graph steps
Includes example implementations replicating existing charts and multidims (animal welfare, COVID, emissions)

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
`etl/steps/__init__.py`	Implements `GraphStep` class with run(), is_dirty(), and checksum logic
`etl/grapher/graph.py`	Core logic for upserting charts, pulling from DB, expanding indicator paths, and conflict detection
`etl/helpers.py`	Adds `create_graph()` method to PathFinder and updates channel detection for graph steps
`etl/version_tracker.py`	Updates path resolution to support graph steps with `.meta.yml` files
`etl/config.py`	Adds GRAPH, GRAPH_PUSH, and GRAPH_PULL configuration flags
`etl/command.py`	Integrates --graph flag into CLI, adds graph steps to DAG construction, updates validation logic
`dag/main.yml`	Includes new graph-specific DAG files
`dag/graph/*.yml`	Example DAG files defining graph steps for animal welfare, COVID, and emissions
`etl/steps/graph/*/.meta.yml`	Example YAML metadata files for charts
`etl/steps/graph/emissions/latest/per-capita-co2-vs-average.py`	Example Python step with dynamic threshold calculation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-23T12:29:48Z

+        if "slug" not in metadata:
+            raise ValueError(
+                f"Multidim graph step {self.path} requires 'slug' in metadata.\n"
+                f"Add to {self.metadata_path}:\n"


The error message references self.metadata_path but the GraphStep class doesn't have a metadata_path property, only a metadata_file property. This will cause an AttributeError at runtime. Change self.metadata_path to self.metadata_file.

Suggested change

f"Add to {self.metadata_path}:\n"

f"Add to {self.metadata_file}:\n"

Copilot · 2026-01-23T12:29:49Z

+    # Apply string substitution if yaml_params provided
+    if yaml_params:
+        content = content.format(**yaml_params)


Using Python's str.format() method for YAML template substitution is potentially unsafe. If an attacker controls the YAML file content, they could inject format string exploits using expressions like {__import__('os').system('malicious_command')}. Consider using a safer templating approach such as Jinja2 with auto-escaping, or restrict yaml_params to only allow simple string replacements in specific, known locations in the template.

Copilot · 2026-01-23T12:29:49Z

+
+        source_checksum = self.checksum_input()
+
+        # Upsert graph to database (graph steps require --graph flag to run)


The comment states "graph steps require --graph flag to run" but this is misleading. Graph steps can be defined and their run() method can be called, but they won't be included in the DAG unless the --graph flag is passed to the etl command (see etl/command.py:490-491). The --graph flag determines whether graph steps are included in the DAG, not whether they can run. Consider updating the comment to clarify: "Upsert graph to database (graph steps are only included in DAG when --graph flag is set)".

Suggested change

# Upsert graph to database (graph steps require --graph flag to run)

# Upsert graph to database (graph steps are only included in DAG when --graph flag is set)

Copilot · 2026-01-23T12:29:49Z

+      - 0
+      - '{global_per_capita_threshold}'
+      - 100000000
+  columnSlug: '1119914'


The columnSlug field is hardcoded to '1119914', which appears to be a specific variable ID. This creates a tight coupling to the database state and will break if the variable is recreated or if the chart is used in different environments. Consider either removing this field and letting the grapher auto-populate it from the dimensions, or implementing logic to dynamically resolve it similar to how dimensions are handled. If this field is required for map visualizations, document why it's necessary and how to keep it in sync.

Suggested change

columnSlug: '1119914'

Copilot · 2026-01-23T12:29:49Z

        includes=includes,
        excludes=excludes,
        grapher=grapher,
+        graph=graph,
        export=export,
        private=private,
        only=only,


Graph steps require database access to upsert charts, but there's no sanity check for database configuration when --graph flag is used. Consider adding a database sanity check similar to how it's done for --grapher flag. The main() function (outside these changed lines) calls sanity_check_db_settings() when grapher=True but not when graph=True. This could lead to confusing errors if users try to run graph steps without proper database configuration.

Copilot · 2026-01-23T12:29:50Z

    def channel(self) -> CHANNEL:
        if self._is_snapshot_file:
            return "snapshot"  # type: ignore
+        # Check if this is a graph step: etl/steps/graph/namespace/version/slug
+        step_dir_parent = self.f.parent.parent.parent.name
+        if step_dir_parent == "graph":
+            return "graph"  # type: ignore
+        # Data steps: etl/steps/data/channel/namespace/version/slug
        return self.f.parent.parent.parent.name  # type: ignore


The channel property returns "graph" as a CHANNEL type, but "graph" is not included in the CHANNEL Literal type definition in owid.catalog. The CHANNEL type only includes: "snapshot", "garden", "meadow", "grapher", "open_numbers", "examples", "explorers", "external", and "multidim". This will cause type errors in type-checked code. Either add "graph" to the CHANNEL type definition in the catalog library, or use a union type that includes both CHANNEL and a graph-specific type.

Copilot · 2026-01-23T12:29:51Z

+class GraphStep(Step):
+    """
+    A step that manages chart metadata in the grapher database.
+
+    Graph steps create or update charts based on indicator dependencies.
+    Single-indicator graphs can inherit metadata from the indicator's grapher_config.
+    Multi-indicator graphs require explicit metadata in a .meta.yml file.
+
+    Example:
+        graph://fur-farming-ban
+    """
+
+    path: str
+    dependencies: List[Step]
+
+    def __init__(self, path: str, dependencies: List[Step]) -> None:
+        # path is now "namespace/version/short_name" like data steps
+        self.path = path
+        self.dependencies = dependencies
+
+    def __str__(self) -> str:
+        return f"graph://{self.path}"
+
+    @property
+    def slug(self) -> str:
+        """Extract slug from path (namespace/version/slug -> slug)."""
+        return self.path.split("/")[-1]
+
+    @property
+    def _step_dir(self) -> Path:
+        """Directory containing the step files."""
+        return paths.STEP_DIR / "graph" / self.path
+
+    @property
+    def metadata_file(self) -> Optional[Path]:
+        """Path to .meta.yml file."""
+        meta_file = self._step_dir.with_suffix(".meta.yml")
+        return meta_file if meta_file.exists() else None
+
+    @property
+    def _python_file(self) -> Optional[Path]:
+        """Path to .py file if it exists."""
+        py_file = self._step_dir.with_suffix(".py")
+        if py_file.exists():
+            return py_file
+        # Check for __init__.py in directory
+        init_file = self._step_dir / "__init__.py"
+        return init_file if init_file.exists() else None
+
+    def run(self) -> None:
+        """Run graph step - either Python file or YAML-only."""
+        from etl import config
+
+        # If GRAPH_PULL is enabled, pull from database instead of pushing
+        if config.GRAPH_PULL:
+            self._pull_from_database()
+            return
+
+        # Normal flow: push to database
+        # Check if there's a Python file
+        py_file = self._python_file
+        if py_file:
+            # Has Python code - run it
+            self._run_python_step(py_file)
+        else:
+            # YAML-only - use existing logic
+            self._run_yaml_only()
+
+    def _run_python_step(self, py_file: Path) -> None:
+        """Run Python module for graph step."""
+        # Import here to avoid circular imports
+        from importlib import import_module
+
+        # Import and run the module
+        if config.DEBUG:
+            # Run in isolated environment (same process, but isolated imports)
+            with isolated_env(py_file.parent):
+                module_path = py_file.relative_to(paths.BASE_DIR).as_posix().replace("/", ".").replace(".py", "")
+                step_module = import_module(module_path)
+                run_module_run(step_module, dest_dir=None)  # Graph steps don't have dest_dir
+        else:
+            # Run in subprocess (safer, but slower)
+            args = ["uv", "run", "etl", "d", "run-python-step"]
+            if config.IPDB_ENABLED:
+                args.append("--ipdb")
+            args.extend([str(self), ""])  # Empty string for dest_dir
+
+            env = os.environ.copy()
+            env["PATH"] = os.path.expanduser("~/.cargo/bin") + ":" + env["PATH"]
+
+            try:
+                subprocess.check_call(args, env=env)
+            except subprocess.CalledProcessError as e:
+                sys.exit(e.returncode)
+
+    def _run_yaml_only(self) -> None:
+        """Create graph from YAML metadata - either a single chart or multidimensional collection."""
+        # Load metadata to determine if this is a multidim collection or normal chart
+        if not self.metadata_file or not self.metadata_file.exists():
+            raise ValueError(f"Metadata file not found for {self}")
+
+        from etl.grapher.graph import _load_metadata_file
+
+        metadata = _load_metadata_file(self.metadata_file)
+
+        # Determine if this is a multidimensional collection
+        if self._is_multidim_collection(metadata):
+            self._create_multidim_collection(metadata)
+        else:
+            self._create_single_chart(metadata)
+
+    @staticmethod
+    def _is_multidim_collection(metadata: dict) -> bool:
+        """
+        Determine if metadata defines a multidimensional collection.
+
+        A multidimensional collection has:
+        - A 'views' array defining multiple chart variants
+        - Dimensions with choices that users can select between
+
+        Examples:
+            # Multidim: Has views array
+            views:
+              - dimensions: {indicator: total}
+                indicators: {...}
+              - dimensions: {indicator: people}
+                indicators: {...}
+
+            # Normal chart: No views, just dimensions for indicator mapping
+            dimensions:
+              - property: y
+                catalogPath: grapher/.../indicator
+        """
+        return "views" in metadata and isinstance(metadata["views"], list)
+
+    def _create_multidim_collection(self, metadata: dict) -> None:
+        """Create a multidimensional collection from YAML config."""
+        from etl.collection.core.create import create_collection
+        from etl.dag_helpers import load_dag
+        from etl.grapher.graph import _expand_indicator_path
+
+        # Load DAG to get dependencies
+        dag = load_dag()
+        step_name = str(self)
+        dep_uris = list(dag.get(step_name, []))
+
+        # Expand short indicator names in views to full catalog paths
+        first_indicator_path = None
+        if "views" in metadata:
+            for view in metadata["views"]:
+                if "indicators" in view:
+                    for prop, indicators in view["indicators"].items():
+                        for i, indicator in enumerate(indicators):
+                            if "catalogPath" in indicator:
+                                indicator["catalogPath"] = _expand_indicator_path(indicator["catalogPath"], dep_uris)
+                                # Capture first indicator path for the collection catalog_path
+                                if first_indicator_path is None:
+                                    first_indicator_path = indicator["catalogPath"]
+
+        # Collection slug for multidim (without version)
+        # Must be set explicitly in metadata as 'slug'
+        # Format: namespace/base_dataset#collection_name
+        # Example: covid/covid#covid_deaths
+        if "slug" not in metadata:
+            raise ValueError(
+                f"Multidim graph step {self.path} requires 'slug' in metadata.\n"
+                f"Add to {self.metadata_path}:\n"
+                f"  slug: namespace/base#collection"
+            )
+
+        # Build full catalog_path by inserting version from step path
+        parts = self.path.split("/")
+        if len(parts) >= 3:
+            version = parts[1]
+            slug = metadata["slug"]
+            # Insert version: covid/covid#covid_deaths → covid/latest/covid#covid_deaths
+            if "/" in slug:
+                namespace, rest = slug.split("/", 1)
+                catalog_path = f"{namespace}/{version}/{rest}"
+            else:
+                catalog_path = slug
+        else:
+            catalog_path = metadata["slug"]
+
+        c = create_collection(
+            config_yaml=metadata,
+            dependencies=dep_uris,
+            catalog_path=catalog_path,
+        )
+        c.save()
+
+    def _pull_from_database(self) -> None:
+        """Pull chart configuration from database and write to .meta.yml file."""
+        from etl.dag_helpers import load_dag
+        from etl.grapher.graph import pull_graph
+
+        # Get metadata file path (create if doesn't exist)
+        metadata_file = self._step_dir.with_suffix(".meta.yml")
+
+        # Get dependencies from DAG
+        dag = load_dag()
+        step_name = str(self)
+        dep_uris = list(dag.get(step_name, []))
+
+        # Pull from database
+        pull_graph(
+            slug=self.slug,
+            metadata_file=metadata_file,
+            dependencies=dep_uris,
+        )
+
+    def _create_single_chart(self, metadata: dict) -> None:
+        """Create a single chart from YAML config."""
+        from etl import config
+        from etl.dag_helpers import load_dag
+        from etl.grapher.graph import upsert_graph
+
+        # Load DAG to get original dependency strings
+        dag = load_dag()
+        step_name = str(self)
+        dep_uris = list(dag.get(step_name, []))
+
+        if not dep_uris:
+            raise ValueError(
+                f"No dependencies found for {step_name} in DAG. "
+                f"Graph steps must have at least one indicator dependency."
+            )
+
+        source_checksum = self.checksum_input()
+
+        # Upsert graph to database (graph steps require --graph flag to run)
+        upsert_graph(
+            slug=self.slug,
+            metadata_file=self.metadata_file,
+            dependencies=dep_uris,
+            source_checksum=source_checksum,
+            graph_push=config.GRAPH_PUSH,
+        )
+
+    def is_dirty(self) -> bool:
+        """
+        Check if graph needs update based on:
+        - Local index.json doesn't exist or checksum differs
+        - (If --graph flag) Database chart doesn't exist or has different checksum
+        """
+        from etl import config
+        from etl.grapher.graph import _load_graph_metadata
+
+        # 1. Check if local index.json exists
+        metadata = _load_graph_metadata(self.slug)
+        if not metadata:
+            return True  # No metadata file, needs to be created
+
+        # 2. Check if any dependency changed
+        if any(d.is_dirty() for d in self.dependencies):
+            return True
+
+        # 3. Compare local checksum with expected checksum
+        local_checksum = metadata.get("local_checksum")
+        expected_checksum = self.checksum_input()
+
+        if local_checksum != expected_checksum:
+            return True
+
+        # 4. If --graph flag is set, also check if database needs updating
+        if config.GRAPH:
+            from etl.grapher.graph import fetch_graph_db_checksum, has_db_divergence
+
+            db_checksum = fetch_graph_db_checksum(self.slug)
+            if db_checksum != expected_checksum:
+                return True
+
+            # Check if database has diverged from what we last wrote
+            # This queries the DB to detect manual edits (similar to grapher steps)
+            if has_db_divergence(self.slug):
+                return True
+
+        return False
+
+    def checksum_input(self) -> str:
+        """
+        Return checksum of all inputs to this graph step.
+        Similar to DataStep.checksum_input() but for graph metadata.
+        """
+        from etl.grapher.graph import calculate_source_checksum
+
+        # Convert Step dependencies to URIs
+        dep_uris = [str(d) for d in self.dependencies]
+
+        return calculate_source_checksum(
+            dependencies=dep_uris,
+            metadata_file=self.metadata_file,
+        )
+
+    def checksum_output(self) -> str:
+        """Return checksum representing the graph step's state."""
+        return self.checksum_input()
+
+


The new GraphStep class and its associated functionality in etl/grapher/graph.py lack test coverage. The repository has comprehensive tests for other step types (DataStep, SnapshotStep, etc. in tests/test_steps.py), but there are no tests for GraphStep. Consider adding tests to cover: 1) GraphStep creation and initialization, 2) metadata loading and parsing with yaml_params, 3) indicator path expansion logic, 4) is_dirty() detection with various checksum scenarios, 5) error handling for missing dependencies and invalid metadata, 6) pull_graph() functionality.

Copilot · 2026-01-23T12:29:51Z

+        if chart_exists:
+            log.info("graph.update_chart", chart_id=chart.id, inheritance=is_inheritance_enabled)
+            config["isInheritanceEnabled"] = is_inheritance_enabled
+            result = admin_api.update_chart(


Variable result is not used.

Suggested change

result = admin_api.update_chart(

admin_api.update_chart(

Copilot · 2026-01-23T12:29:51Z

+    if "#" not in catalog_path:
+        # Just indicator name - need to find which table(s) contain it
+        indicator = catalog_path
+        table_name = None


Variable table_name is not used.

Suggested change

table_name = None

lucasrodes · 2026-01-23T15:10:06Z

Hey Pablo, thanks for working on this. I think there is definitely value in adding charts to our ETL.

I have various comments on the implementation. I haven't been able to review all the files because there are a substantial number of lines of code. But I can have a better look next week.

I know this is a prototype and a kickstart, and I guess you are not expecting a full review (from what we discussed yesterday). Still, I'd like to point out a couple of design aspects I'd handle differently.

A general point I want to make, I suppose, is that we should not rush into finding a solution. I don't think there's urgency at the moment. There are many building blocks and ways to solve this problem, and we should aim to nail it, if possible. I'd suggest shaping this a bit in terms of architectural design (happy to help) & then do incremental steps towards a good & robust solution. A good strategy I like is to start with a "foundational layer" and then develop higher layers of abstraction. This also helps reviewers in general. This is what I did for etl.collections, and more recently with owid-catalog, and I've had a good experience.

So, my suggestion is: (i) let's have a shaping session to envision what we want, and then (ii) draft a roadmap with the different steps to get there. We don't need to do it all in one.

Still, I'll leave some general thoughts below:

I would reconsider how MDIMs/Explorers and Charts co-live in ETL.
- This PR proposes a different channel for Charts, and I think that is not ideal. I would favour re-using our export:// space, maybe with a new channel 'views', 'charts', or so.
- It seems that your motivation for not using export:// steps is (I quote): "But they have always lived in a bit of a mixed territory, as 'export' steps, partially in ETL and partially in the database". I guess I'd like to better understand this angle.
- I think that the bar for creating a new prefix should be very high, to avoid confusion. I think the organization we'd use for charts should be similar (if not the same) as with MDIMs/Explorers
- The same way we can choose how we display a Collection (as an Explorer or an MDIM), I think we should be able to present a collection as a single chart (if there's only one indicator). So we should aim at integrating this logic into Collections, or just redefine that space. But the user experience should feel seamless.
- I think graph:// is unnecessarily close to grapher://. The first time I saw the term I was a bit confused, as in I thought it was a graph network. I'd suggest using different terms like "chart", "view", or even "viz"/"dataviz". The term "graph" is overused in many contexts, and it's not always clear what it means.
ETL-DB sync
- Keeping charts in ETL and DB in sync is a bit tricky (not impossible!), so this probably merits involving devs too.
- Not sure how we would identify charts in ADMIN. We currently use IDs in lots of places. Who would decide those IDs? I just fear this could introduce breaking changes to our setup (including the staging servers). Not impossible, but good to do baby steps.
- We probably need dev knowledge here.
If we want charts to be a first-class citizen in ETL, we need a proper object.
- We do have objects for Datasets, Collections. I'd suggest creating a class View or something similar with the metadata in the correct format, etc. Maybe we could even sync the object View that Collection uses (c.views) with this somehow.
- This is essential for bringing structure to our workflow. We do this in collections to assist us in detecting correct vs. incorrect configs, performing class operations, etc.
- There are some details like paths.create_collection returning an actual Collection object, paths.create_dataset returning a Dataset, but paths.create_graph returning an integer. I think this is confusing, and we should try to maintain a similar user interface. There are more examples of this. This could be solved by having proper objects.

lucasrodes

leaving some side comments. Most of them addressed in my comment before.

lucasrodes · 2026-01-23T14:25:10Z

+  graph://animal_welfare/latest/fur-farming-ban:
+    - data://grapher/animal_welfare/2024-12-17/fur_laws


I have a strong preference for avoiding graph://. It's unnecessarily close to grapher://.

Before creating a new prefix, I'd re-consider what we already have and why we would or would not use them here. E.g. data://, export://, etc.

I was unsure about "graph", but then I thought that:

"view" is confusing; we use this word in analytics a lot for visits (it's already problematic to talk about "the number of views in explorer and mdim views").

"chart" would be confusing, because it would sound as if mdims are not included.

"viz" could be, though it's a bit informal; "visualization" is inconveniently long.

We could reuse "export" but I think that doesn't fulfil the promise of making graphs a first-class citizen. It sounds as if it's creating a thing outside of etl.

We can brainstorm some alternatives!

Using export://
Sure. I'm then just confused with the purpose of export://. We have been using it just for explorers and MDIMs, which I think should live in the same space as charts.

In my opinion, explorers, MDIMs, and charts should probably live under the same prefix/channel/namespace.

If we agreed on that, then using a new prefix like graph:// is effectively just a renaming of export://.

Using graph://
I understand your opposition to view, chart, and viz. Especially against "view", which I also find confusing.

"chart," I think, is perfectly fine. For all intents and purposes, MDIMs, Explorers, and grapher charts are indeed just charts. Some have a single view (grapher charts), others have multiple views (Explorers/MDIMs). But all can be considered charts. I think we may be disagreeing on the terminology here, and maybe we need to better define it.

"viz" is fine, though maybe a bit too casual.

Anyway, I am happy to find another term and go with it.

But I really wouldn't go with graph.

Regarding export steps, they are meant to be steps that write outside of etl (we also have export://github and export://s3). They came after the external steps, which are steps that are read from outside of etl.

I agree with having charts and mdims in the same place, that's part of the motivation of this prototype. My preferred choice would be to take mdims out of export, but we can discuss this further.

I like "chart" too, and it's probably the best word. I just decided against it to avoid confusion with the meaning of "chart" under the status quo. But if you think it's significantly better, I'm up for it. We can decide on Monday.

lucasrodes · 2026-01-23T14:27:11Z

+  - dag/graph/animal_welfare.yml
+  - dag/graph/covid.yml
+  - dag/graph/emissions.yml


This is a new strategy, where we would create new dag yaml files not only based on topic domain but also based on "step type".

I'm not sure now. I'd start small with just one yaml file like dag/charts.yml or so, before committing to multiple files.

I started indeed with just a graphs.yml file. Then I realised that this would be unrealistic, since it would blow up very quickly. Breaking it out into namespaces felt much more scalable.
Then I also thought that adding charts to existing namespace dag files (mixing data steps and graph steps) would be inconvenient. That's why I created the separate dag files.
No strong opinion, whatever is most convenient, this is easy to change.

lucasrodes · 2026-01-23T14:30:43Z

+
+# Fields that are auto-managed by the database
+DB_MANAGED_FIELDS = {
+    "id",
+    "version",
+    "createdAt",
+    "updatedAt",
+    "publishedAt",
+    "lastEditedAt",
+    "lastEditedByUserId",
+    "publishedByUserId",
+}
+
+# Fields managed by ETL (not manual overrides)
+ETL_MANAGED_FIELDS = {
+    "dimensions",  # Variable IDs managed from catalogPath
+    "isInheritanceEnabled",  # Controlled by ETL config
+}
+
+# Schema and metadata fields
+METADATA_FIELDS = {
+    "$schema",
+    "slug",
+    "isPublished",
+    "isIndexable",
+    "relatedQuestions",
+}
+
+# User-editable content fields to check for conflicts
+# These are the fields that users typically edit in the admin UI
+# and that we want to detect when checking for manual overrides
+USER_CONTENT_FIELDS = {
+    "title",
+    "subtitle",
+    "note",
+    "chartTypes",
+    "hasMapTab",
+    "tab",
+    "selectedEntityNames",
+}


Most of these definitions I think should be outside of the python file, in an external yaml file that a dev or whoever can easily update. Or best case: programmatically.

Sure, don't worry about the code for now, it's currently a Claude monster, it would need to be properly cleaned and refactored.

lucasrodes · 2026-01-23T14:36:27Z

@@ -0,0 +1,24 @@
+map:


i think it can be confusing that there is no reference to the indicator being displayed here.

My understanding is that this would come from the DAG file?

I think this can be a bit confusing, as there would be like two (distant) places where one edits the chart config sort of.

My first idea was to include indicators in the DAG. Then I realised that this was problematic because (1) for some graphs we have many indicators, (2) it implies too many changes, given that the main unit in our DAG is the dataset. So I decided to just keep the DAG just for datasets.

The indicator is specified in the dimensions (the indicator name happens to be status), and follows the same logic as collections: you can name just the indicator, or the table_name#indicator, or the full catalog path. This way we avoid hardcoding versions in our steps.

dimensions: - property: y catalogPath: status

I see. Nice!

Did you reuse some of the logic from collections for this? I remember breaking a bit my brain to resolve indicator names and convert them into full catalog paths given the step dependencies.

While the MDIM config was not showing the complete catalog path, the config submitted to the DB was. This was resolved on Collection.save(). Is this similar in some way here?

For this I asked Claude to reuse the logic already existing in collections, but Claude said it wasn't possible without making changes to those modules (which I wanted to avoid for now). So Claude added some logic for this. I added a TODO to clarify that some of that logic could be reused from collections to avoid duplication.

pabloarosado · 2026-01-23T15:56:00Z

Hey @lucasrodes, thanks a lot for the thorough feedback.

Maybe I wasn't clear enough in the description: This code has been mostly writen by Claude with my guidance. So I don't think it makes sense to have a thorough code review before we agree on the main ideas. It's a prototype to play around, with which we can see if this is a direction we want to go (that said, I tried to make sure to not break any existing code; so we could safely merge simply ignoring graph steps, and iterate).

Leaving aside code quality and specific implementation details, I think the core ideas bring several benefits. I'm a bit disappointed that you haven't addressed any of them. I'm wondering if disagree or think they are not valuable enough.

I'd be happy to discuss this further on Monday. I've added it to the agenda. Thanks!

lucasrodes · 2026-01-23T16:49:24Z

Hey @pabloarosado, Sorry if my message before sounded dismissive! I am not opposing this at all; I actually appreciate that you took the time to think about how to integrate charts into ETL.

My message was mostly focused on the solution's design and architecture, rather than its motivation. I won't really argue the benefits, which I think are pretty nice, and I can just say that I like all of them! I can see how that was not obvious from my message, apologies!

I'm just weighing the technical challenges, and think that we should shape this a bit more before entering the solution space. But I could be wrong.

Before coding (or vibe coding, for the matter), I'd have a brainstorming/shaping session to discuss and sketch the solution's architecture. I think it'd be nice to solve this in the broader context of ETL (MDIMs, Explorers, Admin)... which is just tricky. My experience with etl.collections can be useful, and a lot of that work can be integrated with these features.

For the work on owid-catalog, I've found it very helpful to have a main tracking issue with different "milestones": #5386.

pabloarosado · 2026-01-23T17:24:51Z

Thanks for the clarification @lucasrodes, I'm happy that you see the benefits of this approach.
The rest of this cycle for me is about experimenting and creating quick prototypes, and this is part of it. Having something to play with can give us some inspiration on how to improve our workflow. So this PR does not go against brainstorming, it should help us brainstorm further!

pabloarosado · 2026-01-28T14:37:08Z

@codex review

chatgpt-codex-connector · 2026-01-28T14:37:14Z

To use Codex here, create a Codex account and connect to github.

pabloarosado · 2026-01-28T14:39:57Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9e95e8917

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-01-28T14:47:02Z

+            args = ["uv", "run", "etl", "d", "run-python-step"]
+            if config.IPDB_ENABLED:
+                args.append("--ipdb")
+            args.extend([str(self), ""])  # Empty string for dest_dir


Support graph steps in run-python-step path

In non-DEBUG runs, graph steps invoke etl d run-python-step with a graph://... URI. The runner at etl/run_python_step.py only allows data, data-private, and export, so it raises a ValueError for graph steps. This means any graph step that has a .py file (e.g., the emissions example in this commit) will fail to execute unless DEBUG mode is enabled. Consider updating the runner to accept graph (or using a graph-specific runner) so subprocess execution works.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-01-28T14:47:02Z

+    # Multiple dependencies - need to check for ambiguity
+    # For now, use table#indicator format
+    return f"{table}#{indicator}"


Preserve dataset disambiguation on graph pull

When there are multiple dependencies, _shorten_indicator_path always collapses to table#indicator. If two dependencies share the same table name (common when table name equals dataset name), the resulting .meta.yml loses the dataset qualifier. On the next --graph run, _expand_indicator_path will pick the first matching dataset (dependency order comes from a set and is not deterministic), which can point to the wrong variable or fail resolution. Consider emitting dataset/table#indicator whenever the table name is ambiguous across dependencies.

Useful? React with 👍 / 👎.

github-actions · 2026-02-28T07:09:57Z

This PR has had no activity within the last 30 days. It is considered stale and will be closed in 7 days if no further activity is detected.

github-actions · 2026-04-01T07:34:29Z

This PR has had no activity within the last 30 days. It is considered stale and will be closed in 7 days if no further activity is detected.

lucasrodes · 2026-04-29T23:22:22Z

@pabloarosado shall we close in favor of #5969 ?

pabloarosado · 2026-04-30T06:54:51Z

We can close this PR, and continue with the alternative approach: #5969

🚧 Prototype easy graph metadata

0faebec

github-actions Bot assigned pabloarosado Jan 21, 2026

pabloarosado added 19 commits January 21, 2026 13:06

Create graph step to have chart metadata in ETL

17bda99

Improve the logic of graph steps

884f494

Add the --graph flag

232cd04

Fix graph step logic

8aec469

Unify channel name

85eb5c6

Add config of chick culling chart

d586e13

Fix graph push

405db64

Add mdim covid example and some refactoring

680a1fe

Simplify code and avoid unnecessary changes to core modules

596ca1b

Create folder of graph dag files

de2a9b0

Ensure --graph steps execute --grapher dependencies

f6407e1

Improve format

9337893

Allow mdims with just a yaml file

4d7cfa0

Fix conflicts

aaa1c0b

Let graph steps have code

490f49d

Avoid hardcoding paths in meta.yml files

d3d7966

Add --graph-pull and clean code

f7a861c

Merge branch 'master' of github.com:owid/etl into wip-prototype-graph…

9253c59

…-metadata

Fix minor bugs

a89a87c

pabloarosado changed the title ~~🚧 Prototype easy graph metadata~~ 🎉 Make graphs a first-class citizen in ETL Jan 23, 2026

pabloarosado added 7 commits January 23, 2026 11:16

Improve format

98b9361

Merge branch 'master' of github.com:owid/etl into wip-prototype-graph…

cfa59b7

…-metadata

Improve emission graph steps

5cb2fda

Minor fix

8c3c606

Fix missing indicator metadata fields

b3c4423

Minor tweaks to animal_welfare graph steps

b5837b1

Rename covid steps

8a27f2d

Copilot started reviewing on behalf of pabloarosado January 23, 2026 12:23 View session

Copilot AI reviewed Jan 23, 2026

View reviewed changes

pabloarosado added 4 commits January 23, 2026 13:34

Remove unnecessary code

bebcc1b

Clean up unnecessary code

33263b0

Remove spurious old comment

7a488b7

Remove unnecessary comment

a9e95e8

lucasrodes reviewed Jan 23, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jan 28, 2026

View reviewed changes

github-actions Bot added the stale label Feb 28, 2026

pabloarosado removed the stale label Mar 1, 2026

github-actions Bot added the stale label Apr 1, 2026

github-actions Bot closed this Apr 9, 2026

Marigold reopened this Apr 9, 2026

github-actions Bot removed the stale label Apr 10, 2026

💄🤖 Resolve conflicts & fix type errors after merge with master

5835fd6

pabloarosado mentioned this pull request Apr 23, 2026

🎉 Charts as first-class ETL citizens via simple mdims #5969

Closed

pabloarosado closed this Apr 30, 2026

pabloarosado deleted the wip-prototype-graph-metadata branch April 30, 2026 06:55

pabloarosado mentioned this pull request Jun 2, 2026

tracking: Bringing charts and mdims closer together in etl and grapher #6069

Open

	f"Add to {self.metadata_path}:\n"
	f"Add to {self.metadata_file}:\n"


		source_checksum = self.checksum_input()

		# Upsert graph to database (graph steps require --graph flag to run)

		graph://animal_welfare/latest/fur-farming-ban:
		- data://grapher/animal_welfare/2024-12-17/fur_laws

Uh oh!

Conversation

pabloarosado commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Benefits

Important notes

How it works

TODO

Uh oh!

owidbot commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

lucasrodes commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lucasrodes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucasrodes Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucasrodes Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

pabloarosado commented Jan 21, 2026 •

edited

Loading

owidbot commented Jan 21, 2026 •

edited

Loading

lucasrodes commented Jan 23, 2026 •

edited

Loading

lucasrodes Jan 23, 2026 •

edited

Loading

lucasrodes Jan 23, 2026 •

edited

Loading

lucasrodes commented Jan 23, 2026 •

edited

Loading

lucasrodes commented Apr 29, 2026 •

edited

Loading