Orchestra docs #2237

Merged 16 commits on Feb 19, 2025
4 changes: 2 additions & 2 deletions docs/website/docs/general-usage/credentials/index.md
@@ -5,13 +5,13 @@ keywords: [credentials, secrets.toml, secrets, config, configuration, environmen
---
import DocCardList from '@theme/DocCardList';

-`dlt` pipelines usually require configurations and credentials. These can be set up in [various ways](setup):
+`dlt` pipelines usually require configurations and credentials. These can be set up in [various ways](./setup):

1. Environment variables
2. Configuration files (`secrets.toml` and `config.toml`)
3. Key managers and vaults

-`dlt` automatically extracts configuration settings and secrets based on flexible [naming conventions](setup/#naming-convention). It then [injects](advanced/#injection-mechanism) these values where needed in code.
+`dlt` automatically extracts configuration settings and secrets based on flexible [naming conventions](./setup/#naming-convention). It then [injects](./advanced/#injection-mechanism) these values where needed in code.

# Learn details about

231 changes: 231 additions & 0 deletions docs/website/docs/walkthroughs/deploy-a-pipeline/deploy-with-orchestra.md
@@ -0,0 +1,231 @@
---
title: Deploy with Orchestra
description: How to deploy a dlt pipeline on Orchestra
keywords: [Orchestra, dlt, elt, data pipeline]
---

# Deploy on Orchestra

## What is Orchestra?

[Orchestra](https://getorchestra.io) is a unified control plane for Data and AI operations. In addition to providing an end-to-end, declarative [orchestration tool](https://www.getorchestra.io/solutions), Orchestra offers a lightweight execution environment for running Python, which makes it a great fit for getting started with dlt.

Orchestra can also run [dbt core](https://www.getorchestra.io/product/orchestra-dbt-core-tm), which makes it well suited to chaining dlt ingestion jobs with downstream transformations.

You can use [Orchestra's Free Tier](https://app.getorchestra.io?utm_campaign=dlt_docs) to get started with dlt straight away.

## Building data pipelines with `dlt`

**`dlt`** is an open-source Python library that allows you to declaratively load data sources
into well-structured tables or datasets. It does this through automatic schema inference and evolution.
The library simplifies building data pipelines by providing functionality to support the entire extract
and load process.
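For orientation, here is a minimal, self-contained sketch of a `dlt` pipeline, independent of the Orchestra setup below. The pipeline, dataset, and table names are illustrative; it loads plain Python objects into DuckDB and lets `dlt` infer the schema:

```py
import dlt

# A minimal sketch: dlt infers the "users" table schema from the dicts below.
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="demo",
)
info = pipeline.run(
    [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}],
    table_name="users",
)
print(info)
```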

### How does `dlt` integrate with Orchestra?

Fundamentally, `dlt` is Python code. This means that to run `dlt`, you should first set up the [Python Connector](https://docs.getorchestra.io/docs/integrations/utility/python/) in Orchestra. You can find a step-by-step [dlt video tutorial here](https://www.youtube.com/watch?v=2V94CRXFZXk).

When configuring the Python connection, ensure that environment variables such as secrets are added using the [proper naming convention](../../general-usage/credentials/setup).

For example:

```json
{
  "DESTINATION__BIGQUERY__LOCATION": "*******st2",
  "DESTINATION__BIGQUERY__CREDENTIALS__CLIENT_EMAIL": "*******com",
  "DESTINATION__BIGQUERY__CREDENTIALS__PROJECT_ID": "*******114",
  "DESTINATION__BIGQUERY__CREDENTIALS__PRIVATE_KEY": "*******--\n",
  "SOURCES__HUBSPOT__API_KEY": "*******1e4"
}
```
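For reference, the double underscores in these variable names map to `dlt` config sections: the same values could live in a `secrets.toml` file as below. This is a sketch with hypothetical placeholder values, since the real ones are masked above:

```toml
[destination.bigquery]
location = "<location>"  # hypothetical placeholder values throughout

[destination.bigquery.credentials]
client_email = "<service-account-email>"
project_id = "<project-id>"
private_key = "<private-key>"

[sources.hubspot]
api_key = "<hubspot-api-key>"
```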

![Configure a dlt Python Integration in Orchestra](images/orchestra_dlt_credential.png)

#### Entry point

After configuring your `.dlt` project, ensure that the pipelines you want to run can be invoked from a single point of entry, i.e. a single file.

Imported file: `pipeline.py`

```py
from typing import List

import dlt

from hubspot import hubspot, hubspot_events_for_objects, THubspotObjectType
from setup_logger import build_logger

logger = build_logger("debug.log")


def run_pipeline() -> None:
    """
    This function loads all resources from HubSpot CRM.

    Returns:
        None
    """
    logger.info("Starting to run dlt Pipeline")

    # Create a dlt pipeline object with the pipeline name, dataset name, and destination database type
    # Add dev_mode=(True or False) if you need your pipeline to create the dataset in your destination
    p = dlt.pipeline(
        pipeline_name="hubspot",
        dataset_name="dlt_hubspot",
        destination='bigquery',
    )

    data = hubspot()
    data.companies.bind(props=['hs_time_in_opportunity', 'hs_analytics_first_visit_timestamp'])

    # Run the pipeline with the HubSpot source connector
    info = p.run(data)

    # Print information about the pipeline run
    print(info)
    logger.info("Completed loading data from HubSpot to BigQuery")


def load_crm_data_with_history() -> None:
    """
    Loads all HubSpot CRM resources and property change history for each entity.
    The history entries are loaded into one table per resource, named
    `{resource_name}_property_history`, e.g. `contacts_property_history`.

    Returns:
        None
    """
    # Create a dlt pipeline object with the pipeline name, dataset name, and destination database type
    # Add dev_mode=(True or False) if you need your pipeline to create the dataset in your destination
    p = dlt.pipeline(
        pipeline_name="hubspot",
        dataset_name="hubspot_dataset",
        destination='bigquery',
    )

    # Configure the source with `include_history` to enable property history load; history is disabled by default
    data = hubspot(include_history=True)
    data.contacts.bind()

    # Run the pipeline with the HubSpot source connector
    info = p.run(data)

    # Print information about the pipeline run
    print(info)


def load_crm_objects_with_custom_properties() -> None:
    """
    Loads CRM objects, reading only properties defined by the user.
    """
    # Create a dlt pipeline object with the pipeline name, dataset name, and destination database type
    # Add dev_mode=(True or False) if you need your pipeline to create the dataset in your destination
    p = dlt.pipeline(
        pipeline_name="hubspot",
        dataset_name="hubspot_dataset",
        destination='bigquery',
    )

    source = hubspot()

    # By default, all the custom properties of a CRM object are extracted,
    # ignoring those driven by HubSpot (prefixed with `hs_`).

    # To read fields in addition to the custom ones:
    # source.contacts.bind(props=["date_of_birth", "degree"])

    # To read only two particular fields:
    source.contacts.bind(props=["date_of_birth", "degree"], include_custom_props=False)

    # Run the pipeline with the HubSpot source connector
    info = p.run(source)

    # Print information about the pipeline run
    print(info)


def load_web_analytics_events(
    object_type: THubspotObjectType, object_ids: List[str]
) -> None:
    """
    This function loads web analytics events for a list of objects in `object_ids` of type `object_type`.

    Returns:
        None
    """
    # Create a dlt pipeline object with the pipeline name, dataset name, and destination database type
    p = dlt.pipeline(
        pipeline_name="hubspot",
        dataset_name="hubspot_dataset",
        destination='bigquery',
        dev_mode=False,
    )

    # You can get many resources by calling this function for various object types
    resource = hubspot_events_for_objects(object_type, object_ids)
    # and load them together by passing the resources in a list
    info = p.run([resource])

    # Print information about the pipeline run
    print(info)
```

Executed file: `run_dlt_pipelines.py` (the name must match the `python -m run_dlt_pipelines` command in the YAML below)

```py
from pipeline import run_pipeline

if __name__ == "__main__":
    run_pipeline()
```
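Before wiring this into Orchestra, you can verify the entry point locally with the same command the Task will run, assuming your dependencies are listed in a `requirements.txt`:

```sh
pip install -r requirements.txt
python -m run_dlt_pipelines
```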

#### Add to a Pipeline

After doing this, you should add a Python Task to a Pipeline like so:

![Adding a Python Task that runs dlt in Orchestra](images/orchestra_add_dlt_step.png)

This will generate the following `.yml`:

```yaml
version: v1
name: 'Archive: Python2'
pipeline:
  32be8199-4e28-437e-9a3a-2b1a10107bd9:
    tasks:
      64569afa-ed14-43d4-8c33-64e6b8309593:
        integration: PYTHON
        integration_job: PYTHON_EXECUTE_SCRIPT
        parameters:
          command: python -m run_dlt_pipelines
          package_manager: PIP
          python_version: '3.12'
          build_command: pip install -r requirements.txt
        depends_on: []
        condition: null
        name: Run DLT
        tags: []
        connection: orchestra_python_96778
        operation_metadata: null
        treat_failure_as_warning: null
        configuration: null
```
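Note that the `build_command` assumes a `requirements.txt` at the root of your repository. A minimal sketch for this HubSpot-to-BigQuery example might contain just the following (the version pin is illustrative):

```text
dlt[bigquery]>=1.5
```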

Next, run the Pipeline in Orchestra. Once it completes, you can view the status of the dlt job alongside the other Tasks in the end-to-end data pipeline.


![Visualising dlt in Orchestra](images/dlt_orchestra_node.png)

_Left to right: the dlt Task Node, the raw logs from the dlt task, the Task parameters panel_.



### Additional resources

- [Try dlt with Orchestra now](https://app.getorchestra.io/signup)
- [Running dbt core in Orchestra](https://www.getorchestra.io/product/orchestra-dbt-core-tm)
- [Product demos](https://www.youtube.com/@OrchestraDataPlatform)

1 change: 1 addition & 0 deletions docs/website/sidebars.js
@@ -296,6 +296,7 @@ const sidebars = {
         'walkthroughs/deploy-a-pipeline/deploy-with-dagster',
         'walkthroughs/deploy-a-pipeline/deploy-with-prefect',
         'walkthroughs/deploy-a-pipeline/deploy-with-modal',
+        'walkthroughs/deploy-a-pipeline/deploy-with-orchestra',
       ]
     },
     {