🎉 Make graphs a first-class citizen in ETL#5530
Conversation
|
Quick links (staging server):
Login: chart-diff: ✅
data-diff: ✅ No differences foundAutomatically updated datasets matching excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included Edited: 2026-01-23 10:56:39 UTC |
There was a problem hiding this comment.
Pull request overview
This PR introduces a new graph channel that makes charts and multidimensional visualizations first-class citizens in the ETL pipeline, alongside datasets, tables, and indicators. This allows chart metadata to be version-controlled in ETL rather than living solely in the database.
Changes:
- Adds a new
GraphStepclass that manages chart metadata in the grapher database - Implements support for both YAML-only charts and charts with Python logic for dynamic configuration
- Introduces
paths.create_graph()helper method for creating charts from Python steps - Adds CLI flags
--graph,--graph-push, and--graph-pullfor managing graph steps - Includes example implementations replicating existing charts and multidims (animal welfare, COVID, emissions)
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
etl/steps/__init__.py |
Implements GraphStep class with run(), is_dirty(), and checksum logic |
etl/grapher/graph.py |
Core logic for upserting charts, pulling from DB, expanding indicator paths, and conflict detection |
etl/helpers.py |
Adds create_graph() method to PathFinder and updates channel detection for graph steps |
etl/version_tracker.py |
Updates path resolution to support graph steps with .meta.yml files |
etl/config.py |
Adds GRAPH, GRAPH_PUSH, and GRAPH_PULL configuration flags |
etl/command.py |
Integrates --graph flag into CLI, adds graph steps to DAG construction, updates validation logic |
dag/main.yml |
Includes new graph-specific DAG files |
dag/graph/*.yml |
Example DAG files defining graph steps for animal welfare, COVID, and emissions |
etl/steps/graph/**/*.meta.yml |
Example YAML metadata files for charts |
etl/steps/graph/emissions/latest/per-capita-co2-vs-average.py |
Example Python step with dynamic threshold calculation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if "slug" not in metadata: | ||
| raise ValueError( | ||
| f"Multidim graph step {self.path} requires 'slug' in metadata.\n" | ||
| f"Add to {self.metadata_path}:\n" |
There was a problem hiding this comment.
The error message references self.metadata_path but the GraphStep class doesn't have a metadata_path property, only a metadata_file property. This will cause an AttributeError at runtime. Change self.metadata_path to self.metadata_file.
| f"Add to {self.metadata_path}:\n" | |
| f"Add to {self.metadata_file}:\n" |
| # Apply string substitution if yaml_params provided | ||
| if yaml_params: | ||
| content = content.format(**yaml_params) |
There was a problem hiding this comment.
Using Python's str.format() method for YAML template substitution is potentially unsafe. If an attacker controls the YAML file content, they could inject format string exploits using expressions like {__import__('os').system('malicious_command')}. Consider using a safer templating approach such as Jinja2 with auto-escaping, or restrict yaml_params to only allow simple string replacements in specific, known locations in the template.
|
|
||
| source_checksum = self.checksum_input() | ||
|
|
||
| # Upsert graph to database (graph steps require --graph flag to run) |
There was a problem hiding this comment.
The comment states "graph steps require --graph flag to run" but this is misleading. Graph steps can be defined and their run() method can be called, but they won't be included in the DAG unless the --graph flag is passed to the etl command (see etl/command.py:490-491). The --graph flag determines whether graph steps are included in the DAG, not whether they can run. Consider updating the comment to clarify: "Upsert graph to database (graph steps are only included in DAG when --graph flag is set)".
| # Upsert graph to database (graph steps require --graph flag to run) | |
| # Upsert graph to database (graph steps are only included in DAG when --graph flag is set) |
| - 0 | ||
| - '{global_per_capita_threshold}' | ||
| - 100000000 | ||
| columnSlug: '1119914' |
There was a problem hiding this comment.
The columnSlug field is hardcoded to '1119914', which appears to be a specific variable ID. This creates a tight coupling to the database state and will break if the variable is recreated or if the chart is used in different environments. Consider either removing this field and letting the grapher auto-populate it from the dimensions, or implementing logic to dynamically resolve it similar to how dimensions are handled. If this field is required for map visualizations, document why it's necessary and how to keep it in sync.
| columnSlug: '1119914' |
| includes=includes, | ||
| excludes=excludes, | ||
| grapher=grapher, | ||
| graph=graph, | ||
| export=export, | ||
| private=private, | ||
| only=only, |
There was a problem hiding this comment.
Graph steps require database access to upsert charts, but there's no sanity check for database configuration when --graph flag is used. Consider adding a database sanity check similar to how it's done for --grapher flag. The main() function (outside these changed lines) calls sanity_check_db_settings() when grapher=True but not when graph=True. This could lead to confusing errors if users try to run graph steps without proper database configuration.
| def channel(self) -> CHANNEL: | ||
| if self._is_snapshot_file: | ||
| return "snapshot" # type: ignore | ||
| # Check if this is a graph step: etl/steps/graph/namespace/version/slug | ||
| step_dir_parent = self.f.parent.parent.parent.name | ||
| if step_dir_parent == "graph": | ||
| return "graph" # type: ignore | ||
| # Data steps: etl/steps/data/channel/namespace/version/slug | ||
| return self.f.parent.parent.parent.name # type: ignore |
There was a problem hiding this comment.
The channel property returns "graph" as a CHANNEL type, but "graph" is not included in the CHANNEL Literal type definition in owid.catalog. The CHANNEL type only includes: "snapshot", "garden", "meadow", "grapher", "open_numbers", "examples", "explorers", "external", and "multidim". This will cause type errors in type-checked code. Either add "graph" to the CHANNEL type definition in the catalog library, or use a union type that includes both CHANNEL and a graph-specific type.
| class GraphStep(Step): | ||
| """ | ||
| A step that manages chart metadata in the grapher database. | ||
|
|
||
| Graph steps create or update charts based on indicator dependencies. | ||
| Single-indicator graphs can inherit metadata from the indicator's grapher_config. | ||
| Multi-indicator graphs require explicit metadata in a .meta.yml file. | ||
|
|
||
| Example: | ||
| graph://fur-farming-ban | ||
| """ | ||
|
|
||
| path: str | ||
| dependencies: List[Step] | ||
|
|
||
| def __init__(self, path: str, dependencies: List[Step]) -> None: | ||
| # path is now "namespace/version/short_name" like data steps | ||
| self.path = path | ||
| self.dependencies = dependencies | ||
|
|
||
| def __str__(self) -> str: | ||
| return f"graph://{self.path}" | ||
|
|
||
| @property | ||
| def slug(self) -> str: | ||
| """Extract slug from path (namespace/version/slug -> slug).""" | ||
| return self.path.split("/")[-1] | ||
|
|
||
| @property | ||
| def _step_dir(self) -> Path: | ||
| """Directory containing the step files.""" | ||
| return paths.STEP_DIR / "graph" / self.path | ||
|
|
||
| @property | ||
| def metadata_file(self) -> Optional[Path]: | ||
| """Path to .meta.yml file.""" | ||
| meta_file = self._step_dir.with_suffix(".meta.yml") | ||
| return meta_file if meta_file.exists() else None | ||
|
|
||
| @property | ||
| def _python_file(self) -> Optional[Path]: | ||
| """Path to .py file if it exists.""" | ||
| py_file = self._step_dir.with_suffix(".py") | ||
| if py_file.exists(): | ||
| return py_file | ||
| # Check for __init__.py in directory | ||
| init_file = self._step_dir / "__init__.py" | ||
| return init_file if init_file.exists() else None | ||
|
|
||
| def run(self) -> None: | ||
| """Run graph step - either Python file or YAML-only.""" | ||
| from etl import config | ||
|
|
||
| # If GRAPH_PULL is enabled, pull from database instead of pushing | ||
| if config.GRAPH_PULL: | ||
| self._pull_from_database() | ||
| return | ||
|
|
||
| # Normal flow: push to database | ||
| # Check if there's a Python file | ||
| py_file = self._python_file | ||
| if py_file: | ||
| # Has Python code - run it | ||
| self._run_python_step(py_file) | ||
| else: | ||
| # YAML-only - use existing logic | ||
| self._run_yaml_only() | ||
|
|
||
| def _run_python_step(self, py_file: Path) -> None: | ||
| """Run Python module for graph step.""" | ||
| # Import here to avoid circular imports | ||
| from importlib import import_module | ||
|
|
||
| # Import and run the module | ||
| if config.DEBUG: | ||
| # Run in isolated environment (same process, but isolated imports) | ||
| with isolated_env(py_file.parent): | ||
| module_path = py_file.relative_to(paths.BASE_DIR).as_posix().replace("/", ".").replace(".py", "") | ||
| step_module = import_module(module_path) | ||
| run_module_run(step_module, dest_dir=None) # Graph steps don't have dest_dir | ||
| else: | ||
| # Run in subprocess (safer, but slower) | ||
| args = ["uv", "run", "etl", "d", "run-python-step"] | ||
| if config.IPDB_ENABLED: | ||
| args.append("--ipdb") | ||
| args.extend([str(self), ""]) # Empty string for dest_dir | ||
|
|
||
| env = os.environ.copy() | ||
| env["PATH"] = os.path.expanduser("~/.cargo/bin") + ":" + env["PATH"] | ||
|
|
||
| try: | ||
| subprocess.check_call(args, env=env) | ||
| except subprocess.CalledProcessError as e: | ||
| sys.exit(e.returncode) | ||
|
|
||
| def _run_yaml_only(self) -> None: | ||
| """Create graph from YAML metadata - either a single chart or multidimensional collection.""" | ||
| # Load metadata to determine if this is a multidim collection or normal chart | ||
| if not self.metadata_file or not self.metadata_file.exists(): | ||
| raise ValueError(f"Metadata file not found for {self}") | ||
|
|
||
| from etl.grapher.graph import _load_metadata_file | ||
|
|
||
| metadata = _load_metadata_file(self.metadata_file) | ||
|
|
||
| # Determine if this is a multidimensional collection | ||
| if self._is_multidim_collection(metadata): | ||
| self._create_multidim_collection(metadata) | ||
| else: | ||
| self._create_single_chart(metadata) | ||
|
|
||
| @staticmethod | ||
| def _is_multidim_collection(metadata: dict) -> bool: | ||
| """ | ||
| Determine if metadata defines a multidimensional collection. | ||
|
|
||
| A multidimensional collection has: | ||
| - A 'views' array defining multiple chart variants | ||
| - Dimensions with choices that users can select between | ||
|
|
||
| Examples: | ||
| # Multidim: Has views array | ||
| views: | ||
| - dimensions: {indicator: total} | ||
| indicators: {...} | ||
| - dimensions: {indicator: people} | ||
| indicators: {...} | ||
|
|
||
| # Normal chart: No views, just dimensions for indicator mapping | ||
| dimensions: | ||
| - property: y | ||
| catalogPath: grapher/.../indicator | ||
| """ | ||
| return "views" in metadata and isinstance(metadata["views"], list) | ||
|
|
||
| def _create_multidim_collection(self, metadata: dict) -> None: | ||
| """Create a multidimensional collection from YAML config.""" | ||
| from etl.collection.core.create import create_collection | ||
| from etl.dag_helpers import load_dag | ||
| from etl.grapher.graph import _expand_indicator_path | ||
|
|
||
| # Load DAG to get dependencies | ||
| dag = load_dag() | ||
| step_name = str(self) | ||
| dep_uris = list(dag.get(step_name, [])) | ||
|
|
||
| # Expand short indicator names in views to full catalog paths | ||
| first_indicator_path = None | ||
| if "views" in metadata: | ||
| for view in metadata["views"]: | ||
| if "indicators" in view: | ||
| for prop, indicators in view["indicators"].items(): | ||
| for i, indicator in enumerate(indicators): | ||
| if "catalogPath" in indicator: | ||
| indicator["catalogPath"] = _expand_indicator_path(indicator["catalogPath"], dep_uris) | ||
| # Capture first indicator path for the collection catalog_path | ||
| if first_indicator_path is None: | ||
| first_indicator_path = indicator["catalogPath"] | ||
|
|
||
| # Collection slug for multidim (without version) | ||
| # Must be set explicitly in metadata as 'slug' | ||
| # Format: namespace/base_dataset#collection_name | ||
| # Example: covid/covid#covid_deaths | ||
| if "slug" not in metadata: | ||
| raise ValueError( | ||
| f"Multidim graph step {self.path} requires 'slug' in metadata.\n" | ||
| f"Add to {self.metadata_path}:\n" | ||
| f" slug: namespace/base#collection" | ||
| ) | ||
|
|
||
| # Build full catalog_path by inserting version from step path | ||
| parts = self.path.split("/") | ||
| if len(parts) >= 3: | ||
| version = parts[1] | ||
| slug = metadata["slug"] | ||
| # Insert version: covid/covid#covid_deaths → covid/latest/covid#covid_deaths | ||
| if "/" in slug: | ||
| namespace, rest = slug.split("/", 1) | ||
| catalog_path = f"{namespace}/{version}/{rest}" | ||
| else: | ||
| catalog_path = slug | ||
| else: | ||
| catalog_path = metadata["slug"] | ||
|
|
||
| c = create_collection( | ||
| config_yaml=metadata, | ||
| dependencies=dep_uris, | ||
| catalog_path=catalog_path, | ||
| ) | ||
| c.save() | ||
|
|
||
| def _pull_from_database(self) -> None: | ||
| """Pull chart configuration from database and write to .meta.yml file.""" | ||
| from etl.dag_helpers import load_dag | ||
| from etl.grapher.graph import pull_graph | ||
|
|
||
| # Get metadata file path (create if doesn't exist) | ||
| metadata_file = self._step_dir.with_suffix(".meta.yml") | ||
|
|
||
| # Get dependencies from DAG | ||
| dag = load_dag() | ||
| step_name = str(self) | ||
| dep_uris = list(dag.get(step_name, [])) | ||
|
|
||
| # Pull from database | ||
| pull_graph( | ||
| slug=self.slug, | ||
| metadata_file=metadata_file, | ||
| dependencies=dep_uris, | ||
| ) | ||
|
|
||
| def _create_single_chart(self, metadata: dict) -> None: | ||
| """Create a single chart from YAML config.""" | ||
| from etl import config | ||
| from etl.dag_helpers import load_dag | ||
| from etl.grapher.graph import upsert_graph | ||
|
|
||
| # Load DAG to get original dependency strings | ||
| dag = load_dag() | ||
| step_name = str(self) | ||
| dep_uris = list(dag.get(step_name, [])) | ||
|
|
||
| if not dep_uris: | ||
| raise ValueError( | ||
| f"No dependencies found for {step_name} in DAG. " | ||
| f"Graph steps must have at least one indicator dependency." | ||
| ) | ||
|
|
||
| source_checksum = self.checksum_input() | ||
|
|
||
| # Upsert graph to database (graph steps require --graph flag to run) | ||
| upsert_graph( | ||
| slug=self.slug, | ||
| metadata_file=self.metadata_file, | ||
| dependencies=dep_uris, | ||
| source_checksum=source_checksum, | ||
| graph_push=config.GRAPH_PUSH, | ||
| ) | ||
|
|
||
| def is_dirty(self) -> bool: | ||
| """ | ||
| Check if graph needs update based on: | ||
| - Local index.json doesn't exist or checksum differs | ||
| - (If --graph flag) Database chart doesn't exist or has different checksum | ||
| """ | ||
| from etl import config | ||
| from etl.grapher.graph import _load_graph_metadata | ||
|
|
||
| # 1. Check if local index.json exists | ||
| metadata = _load_graph_metadata(self.slug) | ||
| if not metadata: | ||
| return True # No metadata file, needs to be created | ||
|
|
||
| # 2. Check if any dependency changed | ||
| if any(d.is_dirty() for d in self.dependencies): | ||
| return True | ||
|
|
||
| # 3. Compare local checksum with expected checksum | ||
| local_checksum = metadata.get("local_checksum") | ||
| expected_checksum = self.checksum_input() | ||
|
|
||
| if local_checksum != expected_checksum: | ||
| return True | ||
|
|
||
| # 4. If --graph flag is set, also check if database needs updating | ||
| if config.GRAPH: | ||
| from etl.grapher.graph import fetch_graph_db_checksum, has_db_divergence | ||
|
|
||
| db_checksum = fetch_graph_db_checksum(self.slug) | ||
| if db_checksum != expected_checksum: | ||
| return True | ||
|
|
||
| # Check if database has diverged from what we last wrote | ||
| # This queries the DB to detect manual edits (similar to grapher steps) | ||
| if has_db_divergence(self.slug): | ||
| return True | ||
|
|
||
| return False | ||
|
|
||
| def checksum_input(self) -> str: | ||
| """ | ||
| Return checksum of all inputs to this graph step. | ||
| Similar to DataStep.checksum_input() but for graph metadata. | ||
| """ | ||
| from etl.grapher.graph import calculate_source_checksum | ||
|
|
||
| # Convert Step dependencies to URIs | ||
| dep_uris = [str(d) for d in self.dependencies] | ||
|
|
||
| return calculate_source_checksum( | ||
| dependencies=dep_uris, | ||
| metadata_file=self.metadata_file, | ||
| ) | ||
|
|
||
| def checksum_output(self) -> str: | ||
| """Return checksum representing the graph step's state.""" | ||
| return self.checksum_input() | ||
|
|
||
|
|
There was a problem hiding this comment.
The new GraphStep class and its associated functionality in etl/grapher/graph.py lack test coverage. The repository has comprehensive tests for other step types (DataStep, SnapshotStep, etc. in tests/test_steps.py), but there are no tests for GraphStep. Consider adding tests to cover: 1) GraphStep creation and initialization, 2) metadata loading and parsing with yaml_params, 3) indicator path expansion logic, 4) is_dirty() detection with various checksum scenarios, 5) error handling for missing dependencies and invalid metadata, 6) pull_graph() functionality.
| if chart_exists: | ||
| log.info("graph.update_chart", chart_id=chart.id, inheritance=is_inheritance_enabled) | ||
| config["isInheritanceEnabled"] = is_inheritance_enabled | ||
| result = admin_api.update_chart( |
There was a problem hiding this comment.
Variable result is not used.
| result = admin_api.update_chart( | |
| admin_api.update_chart( |
| if "#" not in catalog_path: | ||
| # Just indicator name - need to find which table(s) contain it | ||
| indicator = catalog_path | ||
| table_name = None |
There was a problem hiding this comment.
Variable table_name is not used.
| table_name = None |
|
Hey Pablo, thanks for working on this. I think there is definitely value in adding charts to our ETL. I have various comments on the implementation. I haven't been able to review all the files because there are a substantial number of lines of code. But I can have a better look next week. I know this is a prototype and a kickstart, and I guess you are not expecting a full review (from what we discussed yesterday). Still, I'd like to point out a couple of design aspects I'd handle differently. A general point I want to make, I suppose, is that we should not rush into finding a solution. I don't think there's urgency at the moment. There are many building blocks and ways to solve this problem, and we should aim to nail it, if possible. I'd suggest shaping this a bit in terms of architectural design (happy to help) & then do incremental steps towards a good & robust solution. A good strategy I like is to start with a "foundational layer" and then develop higher layers of abstraction. This also helps reviewers in general. This is what I did for So, my suggestion is: (i) let's have a shaping session to envision what we want, and then (ii) draft a roadmap with the different steps to get there. We don't need to do it all in one. Still, I'll leave some general thoughts below:
|
lucasrodes
left a comment
There was a problem hiding this comment.
leaving some side comments. Most of them addressed in my comment before.
| graph://animal_welfare/latest/fur-farming-ban: | ||
| - data://grapher/animal_welfare/2024-12-17/fur_laws |
There was a problem hiding this comment.
I have a strong preference for avoiding graph://. It's unnecessarily close to grapher://.
Before creating a new prefix, I'd re-consider what we already have and why we would or would not use them here. E.g. data://, export://, etc.
There was a problem hiding this comment.
I was unsure about "graph", but then I thought that:
- "view" is confusing; we use this word in analytics a lot for visits (it's already problematic to talk about "the number of views in explorer and mdim views").
- "chart" would be confusing, because it would sound as if mdims are not included.
- "viz" could be, though it's a bit informal; "visualization" is inconveniently long.
We could reuse "export" but I think that doesn't fulfil the promise of making graphs a first-class citizen. It sounds as if it's creating a thing outside of etl.
We can brainstorm some alternatives!
There was a problem hiding this comment.
Using export://
Sure. I'm then just confused with the purpose of export://. We have been using it just for explorers and MDIMs, which I think should live in the same space as charts.
In my opinion, explorers, MDIMs, and charts should probably live under the same prefix/channel/namespace.
If we agreed on that, then using a new prefix like graph:// is effectively just a renaming of export://.
Using graph://
I understand your opposition to view, chart, and viz. Especially against "view", which I also find confusing.
"chart," I think, is perfectly fine. For all intents and purposes, MDIMs, Explorers, and grapher charts are indeed just charts. Some have a single view (grapher charts), others have multiple views (Explorers/MDIMs). But all can be considered charts. I think we may be disagreeing on the terminology here, and maybe we need to better define it.
"viz" is fine, though maybe a bit too casual.
Anyway, I am happy to find another term and go with it.
But I really wouldn't go with graph.
There was a problem hiding this comment.
Regarding export steps, they are meant to be steps that write outside of etl (we also have export://github and export://s3). They came after the external steps, which are steps that are read from outside of etl.
I agree with having charts and mdims in the same place, that's part of the motivation of this prototype. My preferred choice would be to take mdims out of export, but we can discuss this further.
I like "chart" too, and it's probably the best word. I just decided against it to avoid confusion with the meaning of "chart" under the status quo. But if you think it's significantly better, I'm up for it. We can decide on Monday.
| - dag/graph/animal_welfare.yml | ||
| - dag/graph/covid.yml | ||
| - dag/graph/emissions.yml |
There was a problem hiding this comment.
This is a new strategy, where we would create new dag yaml files not only based on topic domain but also based on "step type".
I'm not sure now. I'd start small with just one yaml file like dag/charts.yml or so, before committing to multiple files.
There was a problem hiding this comment.
I started indeed with just a graphs.yml file. Then I realised that this would be unrealistic, since it would blow up very quickly. Breaking it out into namespaces felt much more scalable.
Then I also thought that adding charts to existing namespace dag files (mixing data steps and graph steps) would be inconvenient. That's why I created the separate dag files.
No strong opinion, whatever is most convenient, this is easy to change.
|
|
||
| # Fields that are auto-managed by the database | ||
| DB_MANAGED_FIELDS = { | ||
| "id", | ||
| "version", | ||
| "createdAt", | ||
| "updatedAt", | ||
| "publishedAt", | ||
| "lastEditedAt", | ||
| "lastEditedByUserId", | ||
| "publishedByUserId", | ||
| } | ||
|
|
||
| # Fields managed by ETL (not manual overrides) | ||
| ETL_MANAGED_FIELDS = { | ||
| "dimensions", # Variable IDs managed from catalogPath | ||
| "isInheritanceEnabled", # Controlled by ETL config | ||
| } | ||
|
|
||
| # Schema and metadata fields | ||
| METADATA_FIELDS = { | ||
| "$schema", | ||
| "slug", | ||
| "isPublished", | ||
| "isIndexable", | ||
| "relatedQuestions", | ||
| } | ||
|
|
||
| # User-editable content fields to check for conflicts | ||
| # These are the fields that users typically edit in the admin UI | ||
| # and that we want to detect when checking for manual overrides | ||
| USER_CONTENT_FIELDS = { | ||
| "title", | ||
| "subtitle", | ||
| "note", | ||
| "chartTypes", | ||
| "hasMapTab", | ||
| "tab", | ||
| "selectedEntityNames", | ||
| } |
There was a problem hiding this comment.
Most of these definitions I think should be outside of the python file, in an external yaml file that a dev or whoever can easily update. Or best case: programmatically.
There was a problem hiding this comment.
Sure, don't worry about the code for now, it's currently a Claude monster, it would need to be properly cleaned and refactored.
| @@ -0,0 +1,24 @@ | |||
| map: | |||
There was a problem hiding this comment.
i think it can be confusing that there is no reference to the indicator being displayed here.
My understanding is that this would come from the DAG file?
I think this can be a bit confusing, as there would be like two (distant) places where one edits the chart config sort of.
There was a problem hiding this comment.
My first idea was to include indicators in the DAG. Then I realised that this was problematic because (1) for some graphs we have many indicators, (2) it implies too many changes, given that the main unit in our DAG is the dataset. So I decided to just keep the DAG just for datasets.
The indicator is specified in the dimensions (the indicator name happens to be status), and follows the same logic as collections: you can name just the indicator, or the table_name#indicator, or the full catalog path. This way we avoid hardcoding versions in our steps.
dimensions:
- property: y
catalogPath: status
There was a problem hiding this comment.
I see. Nice!
Did you reuse some of the logic from collections for this? I remember breaking a bit my brain to resolve indicator names and convert them into full catalog paths given the step dependencies.
While the MDIM config was not showing the complete catalog path, the config submitted to the DB was. This was resolved on Collection.save(). Is this similar in some way here?
There was a problem hiding this comment.
For this I asked Claude to reuse the logic already existing in collections, but Claude said it wasn't possible without making changes to those modules (which I wanted to avoid for now). So Claude added some logic for this. I added a TODO to clarify that some of that logic could be reused from collections to avoid duplication.
|
Hey @lucasrodes, thanks a lot for the thorough feedback. Maybe I wasn't clear enough in the description: This code has been mostly writen by Claude with my guidance. So I don't think it makes sense to have a thorough code review before we agree on the main ideas. It's a prototype to play around, with which we can see if this is a direction we want to go (that said, I tried to make sure to not break any existing code; so we could safely merge simply ignoring graph steps, and iterate). Leaving aside code quality and specific implementation details, I think the core ideas bring several benefits. I'm a bit disappointed that you haven't addressed any of them. I'm wondering if disagree or think they are not valuable enough. I'd be happy to discuss this further on Monday. I've added it to the agenda. Thanks! |
|
Hey @pabloarosado, Sorry if my message before sounded dismissive! I am not opposing this at all; I actually appreciate that you took the time to think about how to integrate charts into ETL. My message was mostly focused on the solution's design and architecture, rather than its motivation. I won't really argue the benefits, which I think are pretty nice, and I can just say that I like all of them! I can see how that was not obvious from my message, apologies! I'm just weighing the technical challenges, and think that we should shape this a bit more before entering the solution space. But I could be wrong. Before coding (or vibe coding, for the matter), I'd have a brainstorming/shaping session to discuss and sketch the solution's architecture. I think it'd be nice to solve this in the broader context of ETL (MDIMs, Explorers, Admin)... which is just tricky. My experience with For the work on |
|
Thanks for the clarification @lucasrodes, I'm happy that you see the benefits of this approach. |
|
@codex review |
|
To use Codex here, create a Codex account and connect to github. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a9e95e8917
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| args = ["uv", "run", "etl", "d", "run-python-step"] | ||
| if config.IPDB_ENABLED: | ||
| args.append("--ipdb") | ||
| args.extend([str(self), ""]) # Empty string for dest_dir |
There was a problem hiding this comment.
Support graph steps in run-python-step path
In non-DEBUG runs, graph steps invoke etl d run-python-step with a graph://... URI. The runner at etl/run_python_step.py only allows data, data-private, and export, so it raises a ValueError for graph steps. This means any graph step that has a .py file (e.g., the emissions example in this commit) will fail to execute unless DEBUG mode is enabled. Consider updating the runner to accept graph (or using a graph-specific runner) so subprocess execution works.
Useful? React with 👍 / 👎.
| # Multiple dependencies - need to check for ambiguity | ||
| # For now, use table#indicator format | ||
| return f"{table}#{indicator}" |
There was a problem hiding this comment.
Preserve dataset disambiguation on graph pull
When there are multiple dependencies, _shorten_indicator_path always collapses to table#indicator. If two dependencies share the same table name (common when table name equals dataset name), the resulting .meta.yml loses the dataset qualifier. On the next --graph run, _expand_indicator_path will pick the first matching dataset (dependency order comes from a set and is not deterministic), which can point to the wrong variable or fail resolution. Consider emitting dataset/table#indicator whenever the table name is ambiguous across dependencies.
Useful? React with 👍 / 👎.
|
This PR has had no activity within the last 30 days. It is considered stale and will be closed in 7 days if no further activity is detected. |
|
This PR has had no activity within the last 30 days. It is considered stale and will be closed in 7 days if no further activity is detected. |
|
@pabloarosado shall we close in favor of #5969 ? |
|
We can close this PR, and continue with the alternative approach: #5969 |
Summary
Create a new
graphchannel that allows metadata from charts and multidims to live in etl as first-class citizens.Context
Currently, etl allows you to define metadata for datasets, tables, and indicators. Charts' metadata has never had a place to live in etl, even though they often include some of our most important, public-facing metadata.
Since recently, we can also create metadata for multidims, which bridges that gap somewhat. But they have always lived in a bit of a mixed territory, as "export" steps, partially in ETL and partially in the database.
In this PR I have prototyped an alternative architecture, where charts and multidims can live fully in etl.
Benefits
etlr slug --graphgraph://emissions/latest/per-capita-co2-vs-average.Important notes
grapher/graph.graphis the best name, but I wanted to avoid, e.g.chart, orgrapher, or other things that could cause future headaches given the status quo.How it works
Graph steps live in
etl/steps/graph/and are defined indag/graph/*.ymlDAG files. They can be:.meta.ymlfiles for chart configs or multidim configs..meta.ymlfile and a.pyfile for complex chart or mdim steps.I replicated some existing charts and mdims. Feel free to run them:
Make changes in the yaml file, or in the admin, and see what happens.
You can check chart-diff for changes.
For mdims (covid cases and covid deaths), you can run:
TODO
If we like the prototype, the following things (at least) would need to happen (they could happen in a separate PR):