Feat: more cores for loading #4427

themisvaltinos · 2025-05-15T18:15:00Z

This update adds a ProcessPoolExecutor for parallel loading of a project's models. It also adds a mock executor for single-process scenarios, such as when the system doesn't support fork. Finally, it refactors optimized_query_cache_pool to use this unified execution logic, eliminating the need for separate sequential and parallel code paths.

izeigerman · 2025-05-19T17:13:50Z

Makefile

@@ -64,7 +64,7 @@ engine-up: engine-clickhouse-up engine-mssql-up engine-mysql-up engine-postgres-
 engine-down: engine-clickhouse-down engine-mssql-down engine-mysql-down engine-postgres-down engine-spark-down engine-trino-down

 fast-test:
-	pytest -n auto -m "fast and not cicdonly" && pytest -m "isolated"
+	pytest -n auto -m "fast and not cicdonly" && pytest -m "isolated" && pytest -m "isolated2"


This label name is not very descriptive. Can we be more precise?

izeigerman · 2025-05-19T17:38:30Z

sqlmesh/core/loader.py

+                            for row in YAML().load(file.read())
+                        ]
+
+                    cache.put(external_models, path)


I don't like the fact that we now need to manually manage the cache everywhere by splitting the load from the get. Any way to preserve the previous declarative API?

I'm not sure. This is the simplest approach, because all of the loading happens in separate processes that must be completely independent of each other; we only communicate through the file system.
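For illustration, the put/get split discussed here can be sketched as a small file-backed cache: a worker process writes its result with put, and the parent reads it back with get. This is only a sketch under assumptions. The FileCache class, its key scheme, and the use of pickle are hypothetical and are not the cache implemented in this PR; only the put(value, path) and get(path) call shapes come from the diff.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, List, Optional


class FileCache:
    """Hypothetical file-backed cache: workers call put(), the parent calls get()."""

    def __init__(self, cache_dir: Path) -> None:
        self._cache_dir = cache_dir
        self._cache_dir.mkdir(parents=True, exist_ok=True)

    def _entry_path(self, source: Path) -> Path:
        # Key the entry on the resolved source path plus its modification time,
        # so an edited file never resolves to a stale entry.
        digest = hashlib.sha1(str(source.resolve()).encode()).hexdigest()
        return self._cache_dir / f"{digest}_{source.stat().st_mtime_ns}.pkl"

    def put(self, value: List[Any], source: Path) -> None:
        entry = self._entry_path(source)
        tmp = entry.with_suffix(".tmp")
        tmp.write_bytes(pickle.dumps(value))
        # Rename into place so a reader never sees a half-written entry.
        tmp.replace(entry)

    def get(self, source: Path) -> Optional[List[Any]]:
        entry = self._entry_path(source)
        if not entry.exists():
            return None
        return pickle.loads(entry.read_bytes())
```

Because the only shared state is the cache directory, the writer and the reader can live in different processes without passing the loaded objects back through the executor.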

izeigerman · 2025-05-19T17:39:26Z

sqlmesh/core/loader.py

+                        models[model.fqn] = model
+
+        if paths:
+            defaults = dict(


This overlaps with the defaults that is in the dict itself. Can we have a more descriptive name for this?

Renamed to model_loading_defaults for clarity; I couldn't think of a better alternative.

izeigerman · 2025-05-19T18:11:46Z

sqlmesh/core/loader.py

+    _selected_gateway = selected_gateway
+
+
+def load_sql_models(path: Path) -> t.Tuple[Path, list[Model]]:


I don't see the 1st item of the return value being used anywhere. Is it used?

No, this was left over from before I added the futures_to_paths dictionary. The path is already accessible, so it's no longer needed. I removed it and updated _load_sql_models to use the simplified return value: loaded = future.result() instead of unpacking with _.
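A minimal sketch of the futures_to_paths pattern described in this reply, assuming a standard concurrent.futures executor. The names load_all and fake_load are hypothetical, and the error message format is illustrative rather than taken from the PR. Since the dictionary already maps each future back to its path, the worker only needs to return the loaded items.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def load_all(
    executor: Executor,
    load_fn: Callable[[Path], List[str]],
    paths: List[Path],
) -> Tuple[List[str], List[str]]:
    """Submit one task per path and collect results and errors as they finish."""
    futures_to_paths: Dict[Future, Path] = {
        executor.submit(load_fn, path): path for path in paths
    }
    loaded_all: List[str] = []
    errors: List[str] = []
    for future in as_completed(futures_to_paths):
        path = futures_to_paths[future]  # the path comes from the dict, not the result
        try:
            loaded = future.result()
            loaded_all.extend(loaded)
        except Exception as ex:
            errors.append(f"Failed to load '{path}': {ex}")
    return loaded_all, errors


def fake_load(path: Path) -> List[str]:
    """Hypothetical loader used only to exercise load_all."""
    if path.suffix != ".sql":
        raise ValueError("not a SQL file")
    return [path.stem]
```

The function accepts any Executor, so the same code path runs with a process pool, a thread pool, or a synchronous stand-in. Results arrive in completion order, so callers should not rely on the order of the returned list.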

izeigerman · 2025-05-19T18:16:16Z

sqlmesh/core/loader.py

-                    self._track_file(seed_path)
+            if errors:
+                error_string = "\n".join(errors)
+                raise ConfigError(f"Failed to load models\n\n{error_string}")


[Nit] I believe "Failed to load models\n\n" is redundant.

izeigerman · 2025-05-19T18:18:24Z

pytest.ini

@@ -8,6 +8,7 @@ markers =
    remote: test that involves interacting with a remote DB
    cicdonly: test that only runs on CI/CD
    isolated: tests that need to run sequentially usually because they use fork
+    isolated2: tests that need to run isolated because they interfere


Why can't we use isolated here, btw?

I've renamed the marker to registry_isolation and updated all references to it, to indicate that this test needs isolation because of an issue with the registry. This particular test, test_duplicate_python_model_names_raise_error (added in #3945), succeeds when run in isolation but breaks if it is grouped with the rest of the tests, which is why it was originally marked isolated. But keeping it under isolated, alongside the forking tests, causes those tests to fail as well, since this test is designed to raise an error.

izeigerman · 2025-05-19T18:19:02Z

sqlmesh/core/loader.py

+    except ConfigError:
+        from sqlmesh.core.console import get_console
+
+        get_console().log_warning(


Does this respect --ignore-warnings when this runs in a worker process?

Good catch. I tested it and your suspicion is correct: it wasn't working because the console wasn't shared across processes. I added a console parameter to _init_model_defaults to pass self._console to the worker processes when creating the process pool, so that each worker calls set_console(console) to match the parent's console and preserves ignore_warnings along with all other console settings.
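The idea of handing the parent's console to each worker can be sketched with the initializer and initargs parameters of ProcessPoolExecutor. Everything below is a stand-in written for illustration: SimpleConsole, init_worker, load_with_warning, and run are hypothetical, and the local set_console and get_console only mimic the names used in this thread, not the sqlmesh implementations.

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleConsole:
    """Hypothetical stand-in for a console that can suppress warnings."""

    ignore_warnings: bool = False
    messages: List[str] = field(default_factory=list)

    def log_warning(self, message: str) -> None:
        if not self.ignore_warnings:
            self.messages.append(message)


# Each process has its own module-level console; a new worker starts with this default.
_console = SimpleConsole()


def set_console(console: SimpleConsole) -> None:
    global _console
    _console = console


def get_console() -> SimpleConsole:
    return _console


def init_worker(console: SimpleConsole) -> None:
    # Runs once in every worker before it accepts tasks, so the worker's
    # console carries the same settings as the parent's.
    set_console(console)


def load_with_warning(name: str) -> int:
    get_console().log_warning(f"could not load {name}")
    return len(get_console().messages)


def run(names: List[str], console: SimpleConsole) -> List[int]:
    # The console is pickled and sent to each worker through initargs.
    with ProcessPoolExecutor(
        max_workers=2, initializer=init_worker, initargs=(console,)
    ) as pool:
        return list(pool.map(load_with_warning, names))
```

Note that the worker receives a copy of the console, not the parent's object: settings such as ignore_warnings carry over, but anything a worker records stays in that worker's process.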

izeigerman · 2025-05-19T18:20:11Z

sqlmesh/core/loader.py

+        pass
+
+    @abc.abstractmethod
+    def get(self, path: Path) -> t.List[Model]:


This interface is only needed in one place but impacts all other places. Can we keep both interfaces (get + put AND get_or_load_models) and revert unrelated changes?

Reverted to the previous interface everywhere except for SQL models.

izeigerman · 2025-05-19T19:46:52Z

sqlmesh/utils/process.py

+            future.set_exception(e)
+        return future
+
+    def map(self, fn, *iterables, timeout=None, chunksize=1):


Don't you need a shutdown method too?

Since shutdown is called in __exit__, and since the synchronous executor runs everything in the main process, there are no resources to release or clean up, and futures are complete from the start; this is why I didn't add it. But I've revised it to include shutdown, to keep the API similar to Python's ProcessPoolExecutor.
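For reference, a synchronous executor with the same surface as the standard library executors might look roughly like the sketch below. It is an approximation for discussion, not the class in sqlmesh/utils/process.py; the class name and the body of map are assumptions, and only the submit error handling and the map signature follow the diff hunk above.

```python
from concurrent.futures import Executor, Future
from typing import Any, Callable, Iterable, Iterator


class SynchronousExecutor(Executor):
    """Runs every task immediately in the calling process."""

    def submit(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Future:
        # The future is already resolved by the time it is returned.
        future: Future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as e:
            future.set_exception(e)
        return future

    def map(
        self,
        fn: Callable[..., Any],
        *iterables: Iterable[Any],
        timeout: Any = None,
        chunksize: int = 1,
    ) -> Iterator[Any]:
        # timeout and chunksize are accepted for API compatibility and ignored.
        futures = [self.submit(fn, *args) for args in zip(*iterables)]
        # Exceptions surface when the results are consumed, as with the
        # standard library executors.
        return (future.result() for future in futures)

    def shutdown(self, wait: bool = True, *, cancel_futures: bool = False) -> None:
        # Nothing to release: no worker processes or threads were started.
        pass
```

Subclassing Executor gives the context-manager behaviour for free, since the base class __exit__ calls shutdown(wait=True).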

themisvaltinos force-pushed the toby/core branch from 053cb48 to 143d9bb Compare May 16, 2025 07:33

themisvaltinos requested review from izeigerman, tobymao and a team May 16, 2025 18:18

themisvaltinos force-pushed the toby/core branch from 3a9f8b8 to b5bd762 Compare May 19, 2025 13:40

tobymao and others added 10 commits May 19, 2025 17:12

feat: more cores for loading

34c1a1c

add mock executor; fix loader; adapt unit tests

472452f

adapt the query_cache_pool

d025f9f

fix python errors

c5f2845

add as_completed in loader

9db8c8a

fix circleci 3.9 test

b3cdcf9

try to identify circleci issue

417fc5a

rebase; cleanup code

267579c

extend test

0fd5849

fix process pool for microsoft windows

08ba90c

themisvaltinos force-pushed the toby/core branch from b5bd762 to 08ba90c Compare May 19, 2025 14:14

izeigerman reviewed May 19, 2025

themisvaltinos added 2 commits May 20, 2025 18:43

refactors and improvements

85ed2e0

revert to use get_or_load_models for nonsqlmodels and dbt

43db517
