fix(ingest/dataplex): preserve upstream platform in cross-platform lineage URNs by javabrett · Pull Request #16771 · datahub-project/datahub

javabrett · 2026-03-25T01:09:35Z

Summary

Fixes the Dataplex lineage connector generating incorrect upstream URNs when the GCP Data Lineage API reports cross-platform relationships (e.g. GCS file -> BigQuery external table). Two issues fixed:

Wrong platform (commit 1): get_lineage_for_table() used the target entry's platform for all upstream URNs, so a GCS upstream got dataPlatform:bigquery instead of dataPlatform:gcs. Fixed by adding platform field to LineageEdge and preserving the upstream's platform from the FQN.
Wrong entry ID format (commit 2): The GCP Data Lineage API returns GCS FQNs as bucket.`path/*.csv` but DataHub's GCS source creates URNs with bucket/path (slash-separated, no backticks, no glob). Added _normalize_gcs_entry_id() to transform the format so upstream URNs resolve to real GCS-ingested entities.

Changes

Commit 1: Platform fix

Add platform field to LineageEdge dataclass
Rename _extract_entry_id_from_fqn() to _extract_platform_and_entry_id_from_fqn() returning (platform, entry_id) tuple (old name kept as backward-compatible wrapper)
Update build_lineage_map() to extract and store the upstream platform from the FQN
Update get_lineage_for_table() to use lineage_edge.platform instead of the target's platform parameter
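
A minimal sketch of the platform fix described above, not the connector's actual code. The `LineageEdge` here is a simplified stand-in, and `extract_platform_and_entry_id` and `upstream_urn` are hypothetical helper names. It assumes FQNs have the form `platform:entry_id`, as in the `gcs:bucket/path` example from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LineageEdge:
    """Simplified, hypothetical stand-in for the connector's LineageEdge."""

    entry_id: str
    platform: str  # new field: platform parsed from the upstream FQN


def extract_platform_and_entry_id(fqn: str) -> Tuple[Optional[str], str]:
    """Split 'platform:entry_id' on the first colon only.

    Returns (None, fqn) when no platform prefix is present.
    """
    platform, sep, entry_id = fqn.partition(":")
    if not sep:
        return None, fqn
    return platform, entry_id


def upstream_urn(edge: LineageEdge, env: str = "PROD") -> str:
    """Build the dataset URN from the edge's own platform, not the target's."""
    return (
        f"urn:li:dataset:(urn:li:dataPlatform:{edge.platform},"
        f"{edge.entry_id},{env})"
    )


# A GCS upstream of a BigQuery external table keeps its own platform.
platform, entry_id = extract_platform_and_entry_id("gcs:my-bucket/data")
edge = LineageEdge(entry_id=entry_id, platform=platform or "bigquery")
print(upstream_urn(edge))
```

The fallback to `"bigquery"` when no prefix is found is an assumption made for this sketch only.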

Commit 2: GCS entry ID normalization

Add _normalize_gcs_entry_id() method: strips backticks, replaces dot bucket separator with slash, removes trailing glob patterns
Call it from _extract_platform_and_entry_id_from_fqn() when platform is gcs
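
The three normalization steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `normalize_gcs_entry_id` is a hypothetical free function, the dot is treated as the bucket separator only when it is immediately followed by a backtick (bucket names may themselves contain dots), and only a trailing `/*` or `/*.ext` is treated as a glob. The real method may handle these cases differently.

```python
import re

# Assumption: a "trailing glob" is "/*" optionally followed by ".ext".
_TRAILING_GLOB = re.compile(r"/\*(\.[A-Za-z0-9]+)?$")


def normalize_gcs_entry_id(entry_id: str) -> str:
    """Convert a Data Lineage API style GCS entry ID to bucket/path form."""
    if "`" in entry_id:
        # bucket.`path/...` -> split at the dot that precedes the backtick.
        bucket, _, path = entry_id.partition(".`")
        path = path.replace("`", "")
        normalized = f"{bucket}/{path}" if path else bucket.replace("`", "")
    else:
        # No backticks: assume the ID is already slash-separated.
        normalized = entry_id
    # Drop a trailing glob such as /*.csv or /*.parquet, then any stray slash.
    normalized = _TRAILING_GLOB.sub("", normalized)
    return normalized.rstrip("/")


for raw in (
    "my-bucket.`data/*.csv`",    # backticks + glob
    "my-bucket.`data/file.csv`", # backticks, no glob
    "my-bucket/data/*.parquet",  # no backticks, glob
    "my-bucket",                 # bucket only
    "my-bucket/data",            # already normalized
):
    print(f"{raw!r} -> {normalize_gcs_entry_id(raw)!r}")
```

Keying the separator on the backtick avoids rewriting dots that belong to the bucket name, at the cost of leaving an unquoted `bucket.path` form untouched.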

Test plan

test_cross_platform_lineage_preserves_upstream_platform - verifies GCS upstream gets dataPlatform:gcs URN (commit 1)
test_normalize_gcs_entry_id - unit tests for normalization edge cases: backtick+glob, no glob, no backticks, bucket only, already normalized (commit 2)
test_gcs_entry_id_normalized_to_datahub_format - end-to-end test through full pipeline verifying URN name matches DataHub GCS source format (commit 2)
All 37 existing + new dataplex lineage tests pass
Verify with real GCP Data Lineage API data (GCS -> BigQuery external table lineage)

🤖 Generated with Claude Code

…neage URNs

When the GCP Data Lineage API reports a GCS file as upstream of a BigQuery external table, the Dataplex connector was generating the upstream URN with the target's platform (bigquery) instead of the source's platform (gcs). This created phantom entities that never matched GCS-ingested datasets.

Root cause: _extract_entry_id_from_fqn() parsed the platform from the FQN (e.g. "gcs:bucket/path") but discarded it, and LineageEdge had no platform field. get_lineage_for_table() then fell back to the target's platform.

Fix: add platform field to LineageEdge, extract it from the FQN during build_lineage_map(), and use it in get_lineage_for_table() when building the upstream URN.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-25T01:09:47Z

Linear: ING-2062

codecov · 2026-03-25T01:13:09Z

Codecov Report

❌ Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...ahub/ingestion/source/dataplex/dataplex_lineage.py	95.65%	1 Missing ⚠️

…o DataHub format

The GCP Data Lineage API returns GCS FQNs as bucket.`path/*.csv` but DataHub's GCS source creates URNs with bucket/path (slash-separated, no backticks, no glob). Without normalization, the GCS upstream URNs from Dataplex lineage never resolve to real GCS-ingested entities.

Add _normalize_gcs_entry_id() to transform the Data Lineage API format: strip backticks, replace bucket.path dot separator with slash, and remove trailing glob patterns (/*.csv, /*.parquet, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-25T01:59:23Z

Your PR has been assigned to @maggiehays (maggie) for review (ING-2062).

sgomezvillamor · 2026-03-25T17:51:25Z

I'm fully rewriting the dataplex connector in #16723.
I cannot promise that all issues will be solved there, but the new implementation definitely scales better to incorporate more and more entry types.
In particular, my PR currently lacks support for gcs, to be completed soon.

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Mar 25, 2026

javabrett requested review from sgomezvillamor and treff7es March 25, 2026 01:32

github-actions bot requested a review from maggiehays March 25, 2026 01:59

maggiehays added the needs-review label (Label for PRs that need review from a maintainer.) Mar 25, 2026

maggiehays added the pending-submitter-response label (Issue/request has been reviewed but requires a response from the submitter) and removed the needs-review label (Label for PRs that need review from a maintainer.) Mar 25, 2026

javabrett wants to merge 2 commits into master from javabrett/fix-dataplex-cross-platform-lineage-urn

3 participants
