This proposal is still very much a work in progress. I'm using this space as a mechanism to share my current thinking as well as to work through it myself. I'm planning to continue to refine this with further thought and discussion. Comments / questions very welcome.
While dbt today is increasingly functional, there are areas of its design that could be improved. These problems fall into three categories: a) problems that prevent the user from accomplishing certain tasks at all, b) problems that make certain tasks more challenging than they should be, and c) problems that cause an unintuitive or otherwise undesirable user experience. The problems we know of today in each of these categories are outlined below.
The following tasks are not possible today:
- Passing different states into models. For example, passing a map of pre-trained weights into a model responsible for scoring the results of an algorithm. There are a large number of opportunities for creative model-building that require state to be passed into the model.
- Pre-defining run configurations. Running certain models with certain configurations in concert with one another is not currently possible. For instance, the following is an increasingly common need in more complex dbt deployments:
For 23 hours out of a day, run model_a and model_b as defined in their default configurations. Once a day, run model_a with its default configuration but run model_b as full-refresh.
The following tasks are more challenging than they should be:
- Configuring models from packages. The only way to do this today is to use `dbt_project.yml`, which has clear drawbacks (outlined below).
- Overriding models from packages. It is a fairly normal flow to want to make some change to the modeling functionality provided by a package. Today, this is accomplished by disabling certain models in a package and then making new ones locally, but this is a fragile, unintuitive approach.
- Writing smaller data transformations. Currently, models are one-per-file, which encourages users to build fewer, larger models so as to avoid a multitude of model files. This violates the UNIX philosophy of software design.
- Sharing configuration across multiple models. The only way to accomplish this today is within `dbt_project.yml` (discussed below).
The following tasks are unintuitive or present otherwise undesirable user experience:
- Including a package with models in it. Today, simply by adding a package to the list of dependencies, all the models in that package will automatically be built on the next project run. This is not how packages typically work in software development, where users must call code in packages explicitly.
The `dbt_project.yml` file gets its own section because of the large number of problems we've seen it cause over time.
[TODO]
While it's possible that all of the above problems could be solved independently, I propose that the reason they continue to be problems is that they all arise from dbt's lack of certain core constructs, and once these constructs are added, solutions fall into place.
- Limit "magic" in favor of explicit programming.
- The developer must call the framework; the framework should not call user code.
- Favor code over convention or configuration.
These are meaningful changes to our dbt design philosophy. Currently, we rely heavily on magic, convention, and configuration to give dbt a default experience that many non-technical analysts can pick up and use without needing to think too hard. The path from `dbt init` to making a new file to writing a select statement to `dbt run` was short and achievable for our early adopter users.
The goal of these changes in design philosophy is not to reduce the importance of the new user experience. The goal is to build a system that does not rely on magic or convention, but instead provides intelligent default behaviors, on top of powerful core constructs, that deliver an excellent first-time user experience. In this way, we can achieve both a good initial experience and allow that experience to deepen in complexity as users and their projects get more sophisticated.
There are three fundamentally new constructs required for this proposal. This section will outline each.
```
{% model %}
select * from {{ref('base_model')}}
{% endmodel %}

{% test %} ... {% endtest %}

{% archive %} ... {% endarchive %}
```
We've discussed some form of this for a long time, but the reason it needs to be implemented as a part of this larger effort is not just to allow users to define multiple models in a single file (the original reason it was discussed), but rather to create an empty space in which to write code that is not, by default, attached to some object (a model, test, macro, etc.). We'll need this ability to write plain Jinja code in order to accomplish other goals.
In this proposal, each core dbt asset (tests, archives, etc.) should have a Jinja block to define itself. In this way, every single source file in a dbt project is simply a Jinja file. That Jinja file can contain blocks of any type, or no blocks at all. We can then do away with the configuration in `dbt_project.yml` that defines objects in specific folders.
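To make this concrete, here's a sketch of what a single source file might look like under this proposal. The block names and the ability to share a Jinja variable across blocks are my assumptions; nothing above specifies how blocks are named.

```
{# one file, multiple blocks, plus free-floating Jinja (hypothetical syntax) #}
{% set excluded_statuses = ('deleted', 'test') %}

{% model customers %}
select * from {{ ref('base_customers') }}
where status not in {{ excluded_statuses }}
{% endmodel %}

{% model active_customers %}
select * from {{ ref('customers') }}
where last_seen_at > current_date - 30
{% endmodel %}

{% test not_null_customer_id %}
select count(*) from {{ ref('customers') }} where customer_id is null
{% endtest %}
```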
```
{% job full_refresh(models = "model_a, model_b") %}
    config(
        models = "model_a",
        full_refresh = true
    )
    config(
        models = "model_b",
        full_refresh = false
    )
{% endjob %}
```
The `job` block is conceptually straightforward: it defines one or more selectors of models to run within its signature, and then provides an opportunity to configure the state of those models within its body. Arbitrary Jinja can be written in the body to call macros, statements, etc. The point of all code written within a job is to provide configuration (state) to the models for the scope of that job.
Jobs can be directly invoked from the CLI, as discussed later.
If we are moving towards a philosophy wherein users call the framework, we need an entry point where code begins executing so that the user can take control from there. In Python, the file that is called from the command line is frequently just called `main.py`, and it calls out to other areas of the program to actually execute the functionality of the script. `main.jinja2` would be the only file that dbt executes explicitly. From there, the user controls what gets executed.
A basic `main.jinja2` file will define several named jobs, including a default job definition. A more sophisticated `main.jinja2` might include multiple statements and make many configuration calls based on their results. Any of this code could be written directly within `main.jinja2` or in macros called from there.
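For illustration, a minimal `main.jinja2` under this proposal might look something like the following. The selector strings and the bare `config()` call style mirror the examples elsewhere in this document; all of the specifics are up for discussion.

```
{# main.jinja2 -- the single entry point that dbt executes (hypothetical) #}

{# global configuration, evaluated before any job runs #}
config(
    scope = "snowplow.*",
    enabled = true
)

{# the job dbt falls back to when no job name is supplied on the CLI #}
{% job default(models = "[local_project].*") %}
{% endjob %}

{# a named job, invoked as: dbt run --job nightly #}
{% job nightly(models = "[local_project].*, snowplow.*") %}
    config(
        models = "snowplow.*",
        full_refresh = true
    )
{% endjob %}
```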
- In this proposal, configuration for any object is a series of arbitrary key / value pairs. Users should be able to equivalently call `config(materialized = 'table')` and `config(coefs = [[0,1],[2,3]])`, where `coefs` is an arbitrary user-defined key.
- Our usage of the word "configure" grew out of a time when the only state models could accept was explicit configuration for how they would be materialized. In this proposal, models can accept arbitrary state, not just materialization configuration. It's not clear to me whether "configure" is the appropriate term for "supply state", but I have continued to use that word with the proviso that this is what is meant.
- Any core dbt object should be able to be configured: models, tests, archives. I'm not 100% clear on the use cases here, but I feel fairly strongly that all objects should handle state identically.
- `config()` should be callable from inside or outside a block. If `config()` is called within a `model` block, for instance, it should configure that model. If `config()` is called from within `main.jinja2`, it should additionally require a `scope` argument. Scope should define what is being configured and can contain any valid selector. It is worth thinking further about the way selectors would work in this environment; it's possible that the second tier after `[package name]` would need to be the object type (`snowplow.models.*`). I think this is a detail we can iron out if we decide to move forwards along this path.
- In this proposal, it's necessary to be able to configure any object multiple times. Subsequent calls to `config()` should overwrite earlier state when a new value for a key is provided. (A sketch of these semantics follows this list.)
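Here is a sketch of these semantics taken together. The exact syntax, particularly how `scope` selectors are spelled, is an assumption:

```
{# inside a block: config() applies to the enclosing object #}
{% model scored_users %}
    config(materialized = 'table')    {# built-in key #}
    config(coefs = [[0,1],[2,3]])     {# arbitrary user-defined key #}
    select * from {{ ref('users') }}
{% endmodel %}

{# outside a block (e.g. in main.jinja2): a scope argument is required #}
config(scope = "snowplow.models.*", materialized = 'view')

{# a later call overwrites earlier state for the same key #}
config(scope = "snowplow.models.*", materialized = 'table')
```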
This proposal has us moving away from defining "rules" around state and towards explicitly defining state via code. The order in which code is executed, then, is critical. That order is as follows:
- Compilation. Objects can define their own state by calling `config()` from inside their blocks. Compilation will continue to be the very first thing that happens upon invocation, so this state gets evaluated first.
- `main.jinja2`. The next code that dbt executes will be `main.jinja2`. All `config()` calls that happen here are next in the evaluation stack.
- CLI. The final stage in execution is to add a `config()` call using the state supplied from the CLI, and then to execute the appropriate job. (All three stages are sketched below.)
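Concretely, the three stages might interact like this. The syntax is hypothetical throughout, and the translation of CLI flags into a `config()` call is my assumption:

```
{# 1. Compilation: state defined inside the block is evaluated first #}
{% model model_b %}
    config(full_refresh = false)
    select * from {{ ref('events') }}
{% endmodel %}

{# 2. main.jinja2: this call overwrites the compile-time value #}
config(scope = "model_b", full_refresh = true)

{# 3. CLI: a flag such as `dbt run --job default --full-refresh` would be
   translated into a final config() call, overwriting both of the above
   for the models selected by the job. #}
```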
In this proposal, the CLI user has two options:
- Invoke dbt by providing a job name. Each `job` will have a unique name and users can invoke it by providing that name. In this option, additional flags are added as a `config()` call that executes after `main.jinja2`. See Execution Flow below for more.
- Invoke dbt without providing a job name. In this case, dbt will look for a job called `default` and run it if found, providing all additional flags as above. This default job can be defined by the user in whatever mechanism they choose, although the default project provided by `dbt init` should include the definition of a default job that simply selects all models and runs them with their default configuration.
Let's assume that we had all of the various weights for a classification model stored in the database already, likely loaded there by some scheduled job not executed within dbt. The user would then:
- Write a model that requires a variable called `coefs`, where `coefs` is a list of variable names and coefficients. This model would iterate over this list to construct a SQL statement that generates the model scores.
- Write a job that includes a statement to get the coefficients from the database, formats them into an appropriate data structure, and then supplies them as configuration to the model.
- Invoke dbt from the command line with `dbt run --job score_things`. (A sketch of this flow follows the list.)
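A sketch of what this could look like end to end. The `statement(...)` call is a placeholder for whatever query-running mechanism dbt ultimately exposes, and the way a configured key like `coefs` becomes available inside the model body is also assumed:

```
{# models/scores.jinja2 -- a model parameterized by pre-trained weights #}
{% model scores %}
select
    user_id,
    {% for name, coef in coefs %}
    {{ coef }} * {{ name }}{% if not loop.last %} +{% endif %}
    {% endfor %}
    as score
from {{ ref('features') }}
{% endmodel %}

{# main.jinja2 -- fetch the weights and supply them to the model #}
{% job score_things(models = "scores") %}
    {% set weights = statement("select name, coef from analytics.weights") %}
    config(
        models = "scores",
        coefs = weights
    )
{% endjob %}
```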
This is a single, fairly simple use case, but I think it's an elegant solution to the problem from end to end. I would be very proud to document this use case, and to have someone like Kickstarter use it and mention it in a talk.
Currently, `dbt run` automatically runs all models in the local project plus all dependencies. Moving forwards, the default job can be defined by the user as they choose. Likely the default job installed by `dbt init` will be:
```
{% job default(models = "[local_project].*") %}
{% endjob %}
```
The user could redefine this job to be:
```
{% job default(models = "[local_project].*, snowplow.*") %}
{% endjob %}
```
...or a sensible sub-selection of Snowplow models. Theoretically, it's possible that a user would define their job to run all models in all packages, which would function identically to the way dbt functions today. I'm not sure that's a good idea, but I also don't think we should explicitly prevent it. Additionally, the Snowplow package could be configured globally within `main.jinja2` or within the definition of a specific job.
Notes:
- This does not currently deal with overriding models from within a package, although I do think it gives us the tools to do so. I think this should probably happen explicitly when opening a model block, but I'd rather discuss this before writing that proposal up.