
bmiller-dh/datahub-data-access-workflow-examples


This repo contains examples for using DataHub Data Access Workflows in conjunction with the DataHub Actions Framework to integrate with external tools.

  • Configuration – Requirements, env vars, and Quick start to run the mock server and all four pipelines.
  • Step-by-step guide – Data Access Request → list pending → approve → grant pipeline → mock server (full flow).
  • Creating and deleting Data Access Workflows – Scripts to create (upsert) or list/delete workflow definitions.
  • See WORKFLOWS.md for how to implement: (1) Data Product Access Requests, (2) Asset Certification, (3) Metadata Approval (Proposal Workflows), and (4) Glossary Approval using what’s built here.

Configuration

Scripts and action pipelines use environment variables for the DataHub URL and token (no secrets in repo).

Requirements and dependencies

To run the scripts and action pipelines (and optionally the mock server), you need:

  • Python 3.10+ – Scripts and DataHub Actions CLI; install via python.org or your OS package manager.
  • pip – Installs the project and dependencies; use a virtual environment (python3 -m venv venv).
  • Project packages – Install with pip install -e . (see Dependencies). Includes acryl-datahub-actions and requests.
  • Docker – Optional. Quick start uses Docker for the mock server if available; otherwise it runs the mock server with Python (mock-server/requirements.txt).
  • DataHub instance – A DataHub deployment (e.g. DataHub Cloud) and a Personal Access Token.
  1. Copy the example env file and set your values:
    cp .env.example .env
    # Edit .env: set DATAHUB_URL and DATAHUB_TOKEN
  2. Load the variables before running (e.g. in your shell or via source .env if your shell supports it):
    export DATAHUB_URL=https://<instance-name>.acryl.io
    export DATAHUB_TOKEN=<personal access token>
    For a .env file, you can use set -a && source .env && set +a (bash/zsh) or a tool like direnv. Do not commit .env.
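The scripts fail fast if either variable is missing. A minimal sketch of that pattern (the helper name is illustrative, not from this repo):

```python
import os

def load_datahub_config(env=os.environ):
    """Read DATAHUB_URL and DATAHUB_TOKEN, failing fast with a clear message."""
    url = env.get("DATAHUB_URL")
    token = env.get("DATAHUB_TOKEN")
    missing = [name for name, value in
               [("DATAHUB_URL", url), ("DATAHUB_TOKEN", token)] if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Normalize a trailing slash so URL joins behave predictably.
    return url.rstrip("/"), token
```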

Required variables:

  • DATAHUB_URL – DataHub instance URL (e.g. https://<instance-name>.acryl.io)
  • DATAHUB_TOKEN – Personal Access Token

Quick start (mock server + all four pipelines)

After Configuration is set up (pip install -e ., .env with DATAHUB_URL and DATAHUB_TOKEN), you can start the mock server and all four action pipelines in one go:

./quick_start.sh

This script:

  • Ensures the Data Access Workflow exists and updates workflowId in src/grant-external-permissions-pipeline.yaml (runs create_data_access_workflow.py; idempotent).
  • Builds and runs the mock server at http://localhost:8000 (uses Docker if available, otherwise starts it with Python from mock-server/).
  • Starts the grant-external-permissions, certification-event, metadata-proposal, and glossary-proposal pipelines in the background.

Press Ctrl+C to stop all pipelines and the mock server. If the workflow step fails (e.g. no network), run Step 2 manually and set workflowId in the pipeline YAML.


Step-by-step: Data Access Request → Approve → Grant (e.g. mock server)

End-to-end flow: create a workflow, run the grant pipeline, list your pending requests, approve one, and see the approval event hit an external endpoint (mock server).

Prerequisites: DataHub instance (e.g. DataHub Cloud), Personal Access Token, Python 3.10+.

Step 1 – One-time setup

cd datahub-data-access-workflow-examples
python3 -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -e .
cp .env.example .env
# Edit .env: set DATAHUB_URL and DATAHUB_TOKEN

Step 2 – Create the Data Access Workflow (once per instance)

Optional if you use Quick start: the quick start script runs this and updates the pipeline YAML for you.

# From datahub-data-access-workflow-examples with venv activated and .env set
set -a && source .env && set +a
python scripts/data_access/create_data_access_workflow.py

Note the workflow URN printed (e.g. urn:li:actionWorkflow:xxxxxxxx-xxxx-...). Open src/grant-external-permissions-pipeline.yaml and set filter.event.parameters.workflowId to the short ID (the UUID part, e.g. xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx).
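Converting the printed URN to the short ID is just stripping the URN prefix. A sketch of the conversion this step asks you to do by hand (the function name is illustrative):

```python
def workflow_short_id(workflow_urn: str) -> str:
    """Return the UUID part of an actionWorkflow URN,
    e.g. 'urn:li:actionWorkflow:<uuid>' -> '<uuid>'."""
    prefix = "urn:li:actionWorkflow:"
    if not workflow_urn.startswith(prefix):
        raise ValueError(f"expected an actionWorkflow URN, got: {workflow_urn}")
    return workflow_urn[len(prefix):]
```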

Step 3 – (Optional) Run the mock server (with UI)

To simulate an external system that receives approved requests and to use a web UI to Approve/Deny requests (and to see metadata/glossary/certification events when those pipelines run):

cd mock-server
docker build -t grant-permissions-mock .
docker run --rm -d -p 8000:8000 --name grant-permissions-mock --env-file ../.env grant-permissions-mock
# Open http://localhost:8000/ to see pending requests and Approve/Deny buttons
# Check: curl http://localhost:8000/health

Rebuild the image (docker build -t grant-permissions-mock .) after pulling or changing mock-server code so the dashboard shows all sections (approvals, metadata proposals, glossary proposals, certification events). The pipeline is configured to POST to http://localhost:8000/grant_permissions when a request is approved. With --env-file ../.env, the mock server can list pending requests and call DataHub’s review API when you click Approve or Deny in the UI. See mock-server/README.md for a full test procedure.

Step 4 – Start the grant-external-permissions pipeline

cd datahub-data-access-workflow-examples
source venv/bin/activate
set -a && source .env && set +a
datahub actions -c src/grant-external-permissions-pipeline.yaml

Leave this running (or run in background). It will receive events when requests are approved.

Step 5 – List pending requests assigned to you

From the repo root (workflows/) you can use the wrapper (uses the project venv and .env automatically):

cd /path/to/workflows
python scripts/data_access/list_pending_data_access_requests.py --mine --status PENDING

Or from inside the examples project:

cd datahub-data-access-workflow-examples
source venv/bin/activate && set -a && source .env && set +a
python scripts/data_access/list_pending_data_access_requests.py --mine --status PENDING

Copy a request URN from the output (e.g. urn:li:actionRequest:xxxxxxxx-...).

Step 6 – Approve (or reject) the request

cd datahub-data-access-workflow-examples
source venv/bin/activate && set -a && source .env && set +a
python scripts/data_access/review_data_access_request.py --request-urn "urn:li:actionRequest:xxxxxxxx-..." --result ACCEPTED --comment "Approved"

To reject: use --result REJECTED.

Step 7 – Verify the external system received the approval

If the mock server is running, the pipeline will POST the approval payload to it.

  • In the mock UI: Open http://localhost:8000/ (External Authorization Service Simulator) and scroll to “Recent approvals received (from pipeline)” (auto-refreshes every 5 seconds). The dashboard also has sections for Recent metadata proposals, Recent glossary proposals, and Recent certification events when those pipelines are running and POST to the mock server.
  • In logs: docker logs grant-permissions-mock — you should see a line like POST /grant_permissions received with entityUrn, actorUrn, result: ACCEPTED.

How to run the other workflows (2, 3, 4)

Use the same setup: from datahub-data-access-workflow-examples, activate the venv and load .env, then run the pipeline. Each pipeline logs matching events; you can set external_uri in the pipeline config to POST to your own service (or to the mock server at http://localhost:8000/... so events appear on the dashboard).

If you use the global datahub CLI (e.g. installed via pipx), note that the custom actions (metadata_proposal_action, glossary_proposal_action, certification_event_action) live in this repo. Run once so the CLI can load them:

pipx inject acryl-datahub /path/to/datahub-data-access-workflow-examples

After you change action code, run:

pipx inject --force acryl-datahub /path/to/datahub-data-access-workflow-examples

and restart the pipeline so it uses the latest code.

2. Asset Certification

What it does: Listens for structured property changes on entities (e.g. when someone completes a Compliance Form that sets a certification property).

Run the pipeline:

cd datahub-data-access-workflow-examples
source venv/bin/activate
set -a && source .env && set +a
datahub actions -c src/certification-event-pipeline.yaml

Optional – create a VERIFICATION Compliance Form via API:
Create a structured property in DataHub first (Govern → Settings → Structured Properties), then:

python scripts/certification/create_asset_certification_form.py --form-id asset-cert-2024 --property-urn "urn:li:structuredProperty:yourCertPropertyUrn"
# Or pass just the property id: --property-urn "yourCertPropertyUrn"

How to trigger: In DataHub, add or change a structured property on a dataset (e.g. via a Compliance Form or the asset profile). The pipeline will log the event.
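An action like this needs to decide whether a given change event concerns the certification property. A sketch of that check, assuming a simplified event shape (a flat dict with the changed property's URN in a modifier field — the real event schema may differ):

```python
def is_certification_event(event: dict, cert_property_urn: str) -> bool:
    """Return True when a structured-property change touches the
    certification property. The 'category' and 'modifier' keys here
    are illustrative, not the exact DataHub event schema."""
    return (event.get("category") == "STRUCTURED_PROPERTY"
            and event.get("modifier") == cert_property_urn)
```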


3. Metadata Approval (proposals)

What it does: Listens for metadata proposals (tag, owner, domain, description, structured property). Logs each proposal; optional external_uri to forward to ticketing/approval.

Run the pipeline:

cd datahub-data-access-workflow-examples
source venv/bin/activate
set -a && source .env && set +a
datahub actions -c src/metadata-proposal-pipeline.yaml

How to trigger: In DataHub, propose a tag, owner, domain, description, or structured property on an asset (e.g. suggest a tag on a dataset). The pipeline will log it. To forward to a URL, add external_uri: "http://your-service/..." under action.config in src/metadata-proposal-pipeline.yaml.
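Since the pipeline handles five proposal kinds, the action's logging amounts to a small dispatch over the proposal type. A sketch under an assumed event shape (the 'proposalType' and 'entityUrn' keys are illustrative):

```python
def describe_proposal(event: dict) -> str:
    """Produce a one-line log message for a metadata proposal event.
    Covers the five kinds the pipeline listens for; keys are assumptions."""
    kinds = {
        "TAG": "tag",
        "OWNER": "owner",
        "DOMAIN": "domain",
        "DESCRIPTION": "description",
        "STRUCTURED_PROPERTY": "structured property",
    }
    kind = kinds.get(event.get("proposalType"), "unknown")
    return f"{kind} proposal on {event.get('entityUrn', '<unknown entity>')}"
```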


4. Glossary Approval (proposals)

What it does: Listens for glossary proposals (new term or term association to an asset). Logs each proposal; optional external_uri to forward.

Run the pipeline:

cd datahub-data-access-workflow-examples
source venv/bin/activate
set -a && source .env && set +a
datahub actions -c src/glossary-proposal-pipeline.yaml

How to trigger: In DataHub, propose a new glossary term or propose adding a glossary term to a dataset. The pipeline will log it. To forward to a URL, add external_uri under action.config in src/glossary-proposal-pipeline.yaml.


Dependencies

pip install -e .

Creating a Data Access Workflow

The script scripts/data_access/create_data_access_workflow.py creates a Data Access Workflow in your DataHub instance using the GraphQL upsertActionWorkflow mutation. The workflow defines how users request access to datasets (e.g. from the dataset profile or home page) and how approvals are routed.

When to use it

  • First-time setup – You need at least one Data Access Workflow for the grant, simple, and create-external-access-request pipelines to receive events.
  • Quick start – The quick start script runs this for you and updates workflowId in src/grant-external-permissions-pipeline.yaml. You only need to run the script manually if you are not using quick start or if you want a fresh workflow.

Prerequisites

  • DATAHUB_URL and DATAHUB_TOKEN set (see Configuration).
  • A token with permission to create/manage workflows (e.g. platform admin or “Manage Workflows”).

Command

# From repo root, with .env loaded (e.g. set -a && source .env && set +a)
python scripts/data_access/create_data_access_workflow.py

What it creates

  • A workflow named “External Auth Data Access Workflow” (category ACCESS) with:
    • A form (business justification, access duration, optional permanent-access justification).
    • Entry points: Home page (“Request Dataset Access”) and entity profile (“Request Access”).
    • One approval step assigned to entity owners.
  • The script prints the workflow URN, e.g. urn:li:actionWorkflow:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.

After creating

  • In DataHub, open any Dataset and use the “Request Access” (unlock) action to submit a request.
  • For the grant-external-permissions (and other) pipelines to receive events, set filter.event.parameters.workflowId in the pipeline YAML to the UUID part of that URN (e.g. xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx). Quick start does this automatically for the grant pipeline.
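The rewrite quick start performs can be sketched as a one-line substitution on the YAML text, assuming the key appears exactly once as `workflowId:` (the actual quick-start script may do this differently):

```python
import re

def set_workflow_id(yaml_text: str, short_id: str) -> str:
    """Rewrite the 'workflowId: ...' value in pipeline YAML text.
    Raises if the key is absent or appears more than once."""
    new_text, count = re.subn(r"(workflowId:\s*).*", rf"\g<1>{short_id}", yaml_text)
    if count != 1:
        raise ValueError(f"expected exactly one workflowId key, found {count}")
    return new_text
```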

Examples

Simple Action

The file src/simple_action.py implements a simple example DataHub Action that listens for all events pertaining to Data Access Requests and prints them to your terminal.

  1. Set DATAHUB_URL and DATAHUB_TOKEN (see Configuration).
  2. Edit src/simple-pipeline.yaml: set filter.event.parameters.workflowId to the URN of your Data Access Workflow (must be already created).

To run:

datahub actions -c src/simple-pipeline.yaml

Grant External Permissions Action

The file src/grant_external_permissions_action.py implements a simple example DataHub Action that listens for accepted Data Access Requests and sends them off to an external system. This can be used to approve data requests inside DataHub itself, then grant permissions in an external system.

  1. Set DATAHUB_URL and DATAHUB_TOKEN (see Configuration).
  2. Edit src/grant-external-permissions-pipeline.yaml: set filter.event.parameters.workflowId to the URN of your Data Access Workflow (must be already created).
  3. Implement logic that grants permissions inside an external system. Currently, this action simply sends the raw parameters to an imaginary HTTP endpoint.

To run:

datahub actions -c src/grant-external-permissions-pipeline.yaml

Create External Access Request Action

The file src/create_external_access_request_action.py implements a simple example DataHub Action that listens for new Data Access Requests and sends them off to an external system. This can be used to allow users to request data inside DataHub itself, but send the request off to an external system to manage the request workflow and lifecycle elsewhere.

  1. Set DATAHUB_URL and DATAHUB_TOKEN (see Configuration).
  2. Edit src/create-external-access-request-pipeline.yaml: set filter.event.parameters.workflowId to the URN of your Data Access Workflow (must be already created).
  3. Implement logic that creates a request in an external system and stores a reference tying it back to the access request inside DataHub. Currently, this action simply sends the raw parameters to an imaginary HTTP endpoint.

To run:

datahub actions -c src/create-external-access-request-pipeline.yaml

This should also be paired with logic to take the result of the workflow in the external system and push it back into DataHub, so users can be notified that their request has been approved or denied. An example script for programmatically approving or denying a Data Access Request is scripts/data_access/review_data_access_request.py.
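The push-back step amounts to mapping the external system's decision onto the script's --result values. A sketch that builds the argument list for review_data_access_request.py (the external status names 'granted'/'denied' are assumptions about your external system, not part of this repo):

```python
def review_args(request_urn: str, external_status: str, comment: str = "") -> list[str]:
    """Map an external decision to the review script's CLI arguments."""
    mapping = {"granted": "ACCEPTED", "denied": "REJECTED"}
    result = mapping.get(external_status.lower())
    if result is None:
        raise ValueError(f"unmapped external status: {external_status}")
    args = ["--request-urn", request_urn, "--result", result]
    if comment:
        args += ["--comment", comment]
    return args
```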

Set DATAHUB_URL and DATAHUB_TOKEN (see Configuration).

To list pending (or all) Data Access Requests and get their URNs:

python scripts/data_access/list_pending_data_access_requests.py                    # all workflow form requests
python scripts/data_access/list_pending_data_access_requests.py --status PENDING   # pending only
python scripts/data_access/list_pending_data_access_requests.py --mine --status PENDING   # only requests assigned to you (so you can approve them)

Then approve or reject:

python scripts/data_access/review_data_access_request.py --request-urn <request_urn> --result ACCEPTED --comment "Approved"
# or --result REJECTED

Deleting Data Access Workflows

The script scripts/data_access/delete_data_access_workflows.py lists and deletes Data Access Workflow definitions in DataHub. It uses the GraphQL deleteActionWorkflow mutation. Use it to remove duplicate or unwanted workflows (e.g. after running the create script or quick start multiple times).

What deletion does

  • Removes the workflow definition – New requests can no longer be created with that workflow.
  • Does not remove existing requests – Requests that were already submitted from that workflow remain in DataHub (pending or completed); you can still list and approve/reject them with list_pending_data_access_requests.py and review_data_access_request.py.

Prerequisites

  • DATAHUB_URL and DATAHUB_TOKEN set (see Configuration).
  • A token with permission to manage workflows (e.g. “Manage Workflows”).

List workflows (no delete)

python scripts/data_access/delete_data_access_workflows.py --list-only

Prints each workflow URN, name, and category. Use this to confirm which workflows exist before deleting.

Delete workflows

  • Delete all workflows returned by the list (same as listing then deleting every one):

    python scripts/data_access/delete_data_access_workflows.py --delete-all
  • Delete one or more by URN (copy URNs from --list-only or from the DataHub UI):

    python scripts/data_access/delete_data_access_workflows.py --urn "urn:li:actionWorkflow:<uuid1>" --urn "urn:li:actionWorkflow:<uuid2>"

After deleting

  • To have a single workflow again (and to wire the grant pipeline), run create_data_access_workflow.py once, or run ./quick_start.sh, which creates the workflow and updates workflowId in src/grant-external-permissions-pipeline.yaml.

Other pipelines (see WORKFLOWS.md)

  • src/metadata-proposal-pipeline.yaml – Metadata Approval (#3). Listens for metadata proposals (tag, owner, domain, description, structured property). Uses metadata_proposal_action; optional external_uri to forward to a ticketing/approval system.
  • src/glossary-proposal-pipeline.yaml – Glossary Approval (#4). Listens for glossary proposals (term association, create glossary term). Uses glossary_proposal_action; optional external_uri to forward.
  • src/certification-event-pipeline.yaml – Asset Certification (#2). Listens for structured property changes on entities (e.g. from Compliance Forms). Uses certification_event_action; optional external_uri to forward.

Run with: datahub actions -c src/<pipeline>.yaml (after setting env vars).

Asset Certification (Compliance Forms)

For Requirement #2 (Asset Certification), use DataHub Compliance Forms (VERIFICATION type) and Structured Properties. Optionally create a form via API:

python scripts/certification/create_asset_certification_form.py --form-id asset-cert-2024 --property-urn "urn:li:structuredProperty:yourCertProperty"

Create the structured property in DataHub first (Govern > Settings > Structured Properties). Then assign the form to entities via the UI or GraphQL batchAssignForm.
