fix(ingest/dataplex): preserve upstream platform in cross-platform lineage URNs #16771

Open
javabrett wants to merge 2 commits into master from javabrett/fix-dataplex-cross-platform-lineage-urn
Conversation

@javabrett
Contributor

@javabrett javabrett commented Mar 25, 2026

Summary

Fixes the Dataplex lineage connector, which generated incorrect upstream URNs when the GCP Data Lineage API reported cross-platform relationships (e.g. GCS file -> BigQuery external table). This PR fixes two issues:

  • Wrong platform (commit 1): get_lineage_for_table() used the target entry's platform for all upstream URNs, so a GCS upstream got dataPlatform:bigquery instead of dataPlatform:gcs. Fixed by adding a platform field to LineageEdge and preserving the upstream's platform from the FQN.

  • Wrong entry ID format (commit 2): The GCP Data Lineage API returns GCS FQNs as bucket.`path/*.csv` but DataHub's GCS source creates URNs with bucket/path (slash-separated, no backticks, no glob). Added _normalize_gcs_entry_id() to transform the format so upstream URNs resolve to real GCS-ingested entities.
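
To make the effect of the two issues concrete, here is a minimal sketch of the URN strings involved. The `dataset_urn` helper, the bucket and path names, and the `PROD` environment are assumptions for illustration only and are not taken from the connector code.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub-style dataset URN string (illustrative helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Upstream entry ID in the shape the summary describes (hypothetical names).
raw_entry_id = "my-bucket.`sales/*.csv`"

# Before this PR: the target's platform and the raw entry ID are used,
# so the URN points at an entity that no source ever ingests.
before = dataset_urn("bigquery", raw_entry_id)

# After commit 1 only: the platform is correct, but the name still
# differs from the bucket/path form.
after_commit_1 = dataset_urn("gcs", raw_entry_id)

# After both commits: platform and name are in the bucket/path form.
after_commit_2 = dataset_urn("gcs", "my-bucket/sales")

print(before)
print(after_commit_1)
print(after_commit_2)
```

Only the last URN can resolve to a GCS-ingested dataset, which is why both commits are needed.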

Changes

Commit 1: Platform fix

  • Add platform field to LineageEdge dataclass
  • Rename _extract_entry_id_from_fqn() to _extract_platform_and_entry_id_from_fqn() returning (platform, entry_id) tuple (old name kept as backward-compatible wrapper)
  • Update build_lineage_map() to extract and store the upstream platform from the FQN
  • Update get_lineage_for_table() to use lineage_edge.platform instead of the target's platform parameter
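
The shape of the commit 1 change can be sketched roughly as follows. This is a simplified stand-in, not the connector's code: the real LineageEdge has other fields, the function names here drop the leading underscore, and the "platform:entry_id" FQN layout is assumed from the "gcs:bucket/path" example in the commit message.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LineageEdge:
    # Hypothetical, reduced version of the dataclass: only the fields
    # needed to show the idea.
    entry_id: str
    platform: str  # new: the upstream's own platform, taken from the FQN


def extract_platform_and_entry_id(fqn: str) -> Tuple[str, str]:
    """Split an FQN of the assumed form "platform:entry_id" on the first colon."""
    platform, sep, entry_id = fqn.partition(":")
    if not sep or not platform or not entry_id:
        raise ValueError(f"Unexpected FQN format: {fqn!r}")
    return platform, entry_id


def extract_entry_id(fqn: str) -> str:
    """Backward-compatible wrapper that keeps the old return type."""
    return extract_platform_and_entry_id(fqn)[1]


# Building an edge now keeps the platform instead of discarding it.
platform, entry_id = extract_platform_and_entry_id("gcs:my-bucket/sales")
edge = LineageEdge(entry_id=entry_id, platform=platform)
```

Returning a tuple from the new function while keeping a thin wrapper under the old name lets existing callers continue to work unchanged.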

Commit 2: GCS entry ID normalization

  • Add _normalize_gcs_entry_id() method: strips backticks, replaces the dot that separates the bucket name from the path with a slash, and removes trailing glob patterns
  • Call it from _extract_platform_and_entry_id_from_fqn() when platform is gcs
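
A minimal sketch of the three normalization steps, under stated assumptions: the function below is not the PR's implementation, and it only treats a dot as the bucket separator when it is immediately followed by a backtick (bucket names can themselves contain dots, so an unquoted dot is left alone here). The bucket and path names are hypothetical.

```python
import re

# A final path segment that contains a wildcard, e.g. "/*.csv" or "/part-*.parquet".
_TRAILING_GLOB = re.compile(r"/[^/]*\*[^/]*$")


def normalize_gcs_entry_id(entry_id: str) -> str:
    """Convert bucket.`path/*.ext` into bucket/path (illustrative only)."""
    bucket, sep, quoted_path = entry_id.partition(".`")
    if sep:
        # The dot in front of the opening backtick separates bucket from path.
        entry_id = f"{bucket}/{quoted_path}"
    # Drop any remaining backticks, then a trailing glob segment if present.
    entry_id = entry_id.replace("`", "")
    entry_id = _TRAILING_GLOB.sub("", entry_id)
    return entry_id.rstrip("/")


# The edge cases named in the test plan, with made-up inputs.
examples = [
    "my-bucket.`sales/2024/*.csv`",  # backtick + glob
    "my-bucket.`sales/2024`",        # no glob
    "my-bucket/sales/2024/*.csv",    # no backticks
    "my-bucket",                     # bucket only
    "my-bucket/sales/2024",          # already normalized
]
for raw in examples:
    print(raw, "->", normalize_gcs_entry_id(raw))
```

The function is idempotent on already-normalized input, so it is safe to call on every GCS entry ID regardless of which form the API returned.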

Test plan

  • test_cross_platform_lineage_preserves_upstream_platform - verifies GCS upstream gets dataPlatform:gcs URN (commit 1)
  • test_normalize_gcs_entry_id - unit tests for normalization edge cases: backtick+glob, no glob, no backticks, bucket only, already normalized (commit 2)
  • test_gcs_entry_id_normalized_to_datahub_format - end-to-end test through full pipeline verifying URN name matches DataHub GCS source format (commit 2)
  • All 37 existing + new dataplex lineage tests pass
  • Verify with real GCP Data Lineage API data (GCS -> BigQuery external table lineage)

🤖 Generated with Claude Code

…neage URNs

When the GCP Data Lineage API reports a GCS file as upstream of a BigQuery
external table, the Dataplex connector was generating the upstream URN with
the target's platform (bigquery) instead of the source's platform (gcs).
This created phantom entities that never matched GCS-ingested datasets.

Root cause: _extract_entry_id_from_fqn() parsed the platform from the FQN
(e.g. "gcs:bucket/path") but discarded it, and LineageEdge had no platform
field. get_lineage_for_table() then fell back to the target's platform.

Fix: add platform field to LineageEdge, extract it from the FQN during
build_lineage_map(), and use it in get_lineage_for_table() when building
the upstream URN.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Linear: ING-2062

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 25, 2026
@codecov

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ahub/ingestion/source/dataplex/dataplex_lineage.py 95.65% 1 Missing ⚠️

…o DataHub format

The GCP Data Lineage API returns GCS FQNs as bucket.`path/*.csv` but
DataHub's GCS source creates URNs with bucket/path (slash-separated,
no backticks, no glob). Without normalization, the GCS upstream URNs
from Dataplex lineage never resolve to real GCS-ingested entities.

Add _normalize_gcs_entry_id() to transform the Data Lineage API format:
strip backticks, replace bucket.path dot separator with slash, and
remove trailing glob patterns (/*.csv, /*.parquet, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Your PR has been assigned to @maggiehays (maggie) for review (ING-2062).

@maggiehays maggiehays added the needs-review Label for PRs that need review from a maintainer. label Mar 25, 2026
@sgomezvillamor
Contributor

I'm fully rewriting the Dataplex connector in #16723.
I can't promise that all issues will be solved there, but the new implementation definitely scales better as more entry types are added.
In particular, my PR currently lacks GCS support, which will be completed soon.

@maggiehays maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 25, 2026
Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-response Issue/request has been reviewed but requires a response from the submitter

3 participants