feat(powerbi): table-to-table lineage and calculated column lineage #16768

Open
askumar27 wants to merge 10 commits into master from feat/powerbi-lineage-enhancements

askumar27 commented Mar 24, 2026

📋 Summary

Emit lineage edges when a PowerBI table or calculated column references another table in the same dataset — covering both M-Query identifier references and DAX expressions.

🎯 Motivation

PowerBI datasets frequently contain tables whose expressions reference other tables in the same dataset by name — both through M-Query (e.g. Table.Combine({tblA, tblB}), bare DimDate) and through DAX calculated tables (e.g. summarize('FMS Lookup', 'FMS Lookup'[FMSID])). Additionally, calculated columns often reference columns from sibling tables using DAX functions like RELATED(Customers[Name]).

Before this change, DataHub produced no lineage for any of these cases. The parser saw no external database connection and silently returned an empty result. This left large gaps in the lineage graph for customers whose PowerBI models rely heavily on intra-dataset references.

🔧 Changes Overview

New Features:

  • M-Query table-to-table lineage: Bare identifiers and #"Quoted Names" that are unresolved in the M-Query let scope — and match a sibling table name — are now emitted as TRANSFORMED upstream edges
  • DAX calculated table lineage: When the M-Query bridge raises a parse error (DAX is not M-Query), expressions are routed to a new PyDAXLexer-based resolver that extracts table references from DAX calculated table expressions
  • DAX calculated column lineage: Column.expression fields containing DAX (e.g. RELATED(Customers[Name])) are parsed and emitted as column-level (FineGrainedLineage) upstream edges

New Files:

  • m_query/dax_resolver.py — PyDAXLexer integration with two public functions: extract_dax_table_references() for table-level and extract_dax_column_lineage() for column-level DAX lineage
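To make the two output shapes concrete, here is a minimal sketch of table-level and column-level extraction from a DAX string. It uses plain regular expressions as a stand-in for PyDAXLexer (whose API is not reproduced here), and the names find_table_refs and find_column_refs are hypothetical, not the functions added in this PR. It ignores string literals and other DAX syntax that a real lexer would handle.

```python
import re
from typing import FrozenSet, List, Tuple

# Either a 'Quoted Name' or a bare word. A simplification of DAX tokens.
_NAME = re.compile(r"'([^']+)'|([A-Za-z_][A-Za-z0-9_]*)")
# Table[Column] or 'Quoted Table'[Column].
_QUALIFIED_REF = re.compile(r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_]*))\[([^\]]+)\]")


def find_table_refs(expression: str, sibling_tables: FrozenSet[str]) -> List[str]:
    """Return sibling tables named in the expression, in first-seen order."""
    by_lower = {name.lower(): name for name in sibling_tables}
    found: List[str] = []
    for match in _NAME.finditer(expression):
        candidate = (match.group(1) or match.group(2)).lower()
        table = by_lower.get(candidate)
        if table is not None and table not in found:
            found.append(table)
    return found


def find_column_refs(
    expression: str, sibling_tables: FrozenSet[str]
) -> List[Tuple[str, str]]:
    """Return (table, column) pairs for qualified refs to known sibling tables."""
    by_lower = {name.lower(): name for name in sibling_tables}
    refs: List[Tuple[str, str]] = []
    for match in _QUALIFIED_REF.finditer(expression):
        table = by_lower.get((match.group(1) or match.group(2)).lower())
        if table is not None:
            refs.append((table, match.group(3)))
    return refs
```

With siblings {"FMS Lookup", "Customers"}, the first function picks up 'FMS Lookup' from a summarize(...) call, and the second returns ("Customers", "Name") for RELATED(Customers[Name]). Function names such as CALENDAR are ignored simply because they do not match any sibling table.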

Modifications:

  • m_query/data_classes.py: Lineage dataclass gains a powerbi_table_upstreams: List[str] field (backward-compatible default)
  • m_query/resolver.py: New resolve_to_table_references() walks the parsed M-Query AST to find unresolved sibling references
  • m_query/parser.py: Two new hooks in get_upstream_tables(): one in the MQueryParseError except block (DAX path), one in the if not data_access_func_details: branch (M-Query path)
  • powerbi.py: Mapper.extract_lineage() converts powerbi_table_upstreams to URNs, adds the calculated column CLL loop, and broadens the emission guard to if upstream or cll_lineage:

Dependencies:

  • Added PyDAXLexer==0.3.0 to the powerbi extras group in setup.py

🏗️ Architecture/Design Notes

Three independent resolution paths, same output shape:

M-Query expression (valid)  →  resolve_to_table_references()  →  powerbi_table_upstreams
DAX expression              →  extract_dax_table_references()  →  powerbi_table_upstreams
Column.expression (DAX)     →  extract_dax_column_lineage()    →  FineGrainedLineage

All table-to-table results flow through a single powerbi_table_upstreams: List[str] field on Lineage and are converted to UpstreamClass(type=TRANSFORMED) entries in powerbi.py. Column lineage flows through the existing FineGrainedLineage path already used for SQL-parsed column lineage.
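As a rough illustration of that flow, the sketch below uses stand-in dataclasses rather than the real DataHub classes. Lineage and Upstream here are simplified placeholders, and make_table_urn with its "dataset_id.table_name" naming is a hypothetical helper, not the URN scheme used in powerbi.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lineage:
    """Stand-in for the Lineage dataclass; only the relevant fields are shown."""

    upstreams: List[str] = field(default_factory=list)
    # New field: defaults to an empty list, so existing call sites keep working.
    powerbi_table_upstreams: List[str] = field(default_factory=list)


@dataclass
class Upstream:
    """Stand-in for UpstreamClass."""

    dataset: str
    type: str


def make_table_urn(dataset_id: str, table_name: str) -> str:
    # Hypothetical naming scheme, for illustration only.
    return f"urn:li:dataset:(urn:li:dataPlatform:powerbi,{dataset_id}.{table_name},PROD)"


def to_upstreams(lineage: Lineage, dataset_id: str) -> List[Upstream]:
    """Convert sibling table names into TRANSFORMED upstream entries."""
    return [
        Upstream(dataset=make_table_urn(dataset_id, name), type="TRANSFORMED")
        for name in lineage.powerbi_table_upstreams
    ]
```

Because the new field uses a per-instance default, a Lineage() built by older code converts to an empty list of upstreams and nothing extra is emitted.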

Detection of DAX vs M-Query: The powerquery-parser bridge raises MQueryParseError for any DAX expression (DAX is syntactically incompatible with M-Query). This is used as the natural routing signal — no heuristic needed.
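The routing can be sketched as a plain try/except around the M-Query parse. Everything below is illustrative: MQueryParseError is redefined locally as a stand-in, route_expression is a hypothetical name, and fake_parse is a toy parser that rejects a fixed set of sample strings instead of calling the real powerquery-parser bridge.

```python
from typing import Callable, List


class MQueryParseError(Exception):
    """Stand-in for the error raised by the M-Query parser bridge."""


def route_expression(
    expression: str,
    parse_m_query: Callable[[str], object],
    resolve_m_query_refs: Callable[[object], List[str]],
    resolve_dax_refs: Callable[[str], List[str]],
) -> List[str]:
    """Try the M-Query path first; fall back to the DAX path on a parse error."""
    try:
        tree = parse_m_query(expression)
    except MQueryParseError:
        # The parse failure itself is the routing signal for DAX.
        return resolve_dax_refs(expression)
    return resolve_m_query_refs(tree)


# Toy parser for demonstration only: it "fails" on known DAX samples.
DAX_SAMPLES = {"summarize('FMS Lookup', 'FMS Lookup'[FMSID])"}


def fake_parse(expression: str) -> object:
    if expression in DAX_SAMPLES:
        raise MQueryParseError("not valid M-Query")
    return expression
```

Passing the parser and resolvers in as callables keeps the sketch independent of the real modules; in the PR the equivalent branch lives in the MQueryParseError except block of get_upstream_tables().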

False positive protection: Sibling table matching only fires when (1) the identifier is genuinely unresolved in the M-Query let scope, AND (2) the name exactly matches (case-insensitive) a real table in the same dataset. Variables, function names, and literals are not matched.
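The two conditions can be expressed as a small filter. This is a sketch under stated assumptions: match_sibling_tables is a hypothetical name, the identifiers are assumed to have already been collected from the parsed expression, and let-scope names are compared exactly while sibling table names are compared case-insensitively.

```python
from typing import Iterable, List, Set


def match_sibling_tables(
    referenced_identifiers: Iterable[str],
    let_scope_names: Set[str],
    sibling_tables: Iterable[str],
) -> List[str]:
    """Keep identifiers that are unresolved locally AND name a sibling table."""
    by_lower = {name.lower(): name for name in sibling_tables}
    matched: List[str] = []
    for identifier in referenced_identifiers:
        if identifier in let_scope_names:
            # Condition 1 fails: the name is a local variable in the let scope.
            continue
        table = by_lower.get(identifier.lower())
        if table is not None and table not in matched:
            # Condition 2 holds: the name matches a real table in the dataset.
            matched.append(table)
    return matched
```

For example, with a let scope containing Source and sibling tables DimDate and FactSales, the identifiers Source, dimdate, and Table.Combine yield only DimDate.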

Intentionally excluded: AnalysisServices.Database (semantic layer, no relational table concept) and UsageMetricsDataConnector (PowerBI internal metrics). Measure-to-column lineage is deferred.

🧪 Testing

  • 9 new unit tests in tests/unit/powerbi/test_table_to_table_lineage.py covering:
    • Bare M-Query identifier → sibling table (DimDate)
    • Quoted identifier (#"tbl_PayrollHistory")
    • Table.Combine({tblA, tblB}) → multiple siblings
    • External source regression (Sql.Database still produces upstreams, not powerbi_table_upstreams)
    • DAX summarize('FMS Lookup', ...) → sibling table
    • DAX built-in (CALENDAR(...)) with no matching sibling → no refs
    • extract_dax_column_lineage with RELATED(Customers[Name]) → column ref
    • Unknown table in DAX → no refs
    • Intra-table measure reference ([Total Sales]) → filtered out
  • All existing test_powerbi_parser.py tests pass (pattern handler regression)
  • mypy + ruff clean across 2184 source files

📊 Impact Assessment

Affected Components: PowerBI ingestion source only (m_query/, powerbi.py)

Breaking Changes: None. The powerbi_table_upstreams field on Lineage has a default of [], so all existing call sites are unaffected. External source lineage (SQL Server, Snowflake, etc.) is unchanged.

Performance Impact: Negligible. The M-Query AST walk is O(n) in node count. PyDAXLexer is only invoked for expressions that already failed M-Query parsing — no additional parsing cost for standard M-Query tables.

Risk Level: Low. New code paths only activate for expressions that previously returned empty lineage, so the worst-case regression is still empty lineage (same as before).

- Add proper return type `Optional[Any]` to `_get_dax_expression` (removes type: ignore)
- Parameterize `frozenset` to `FrozenSet[str]` in `extract_dax_table_references`
- Add comment explaining `stacklevel=3` in the ImportError warning
- Fix misleading assertion message in test_bare_identifier_references_sibling_table
github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Mar 24, 2026
github-actions bot commented

Linear: ING-2059

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 77.06422% with 50 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines                                   Patch %   Lines
...tahub/ingestion/source/powerbi/m_query/resolver.py      81.39%    16 Missing ⚠️
...b/ingestion/source/powerbi/m_query/dax_resolver.py      69.38%    15 Missing ⚠️
...on/src/datahub/ingestion/source/powerbi/powerbi.py      40.00%    15 Missing ⚠️
...datahub/ingestion/source/powerbi/m_query/parser.py      85.71%    4 Missing ⚠️

askumar27 changed the title from "feat(powerbi): table-to-table lineage and calculated column lineage (ING-1905)" to "feat(powerbi): table-to-table lineage and calculated column lineage" on Mar 24, 2026
datahub-connector-tests bot commented Mar 24, 2026

Connector Tests Results

All connector tests passed for commit 02d6902


DAX expressions like [Total Sales] with no table qualifier reference a
column in the current table. Previously these were silently dropped;
now they emit a ColumnRef pointing to table_urn itself, capturing
same-table column-level dependencies.

Also removes the early-return guard on empty sibling_table_urns since
intra-table refs are valid even when a table has no siblings.
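A minimal sketch of that behaviour follows, again using a regex as a stand-in for the lexer. ColumnRef here is a simplified placeholder and resolve_column_refs is a hypothetical name; only the fallback to table_urn for unqualified references mirrors the commit description.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnRef:
    """Stand-in for a column reference: owning table URN plus column name."""

    table: str
    column: str


# Optional table qualifier (bare or 'quoted') followed by [Column].
_REF = re.compile(r"(?:'([^']+)'|([A-Za-z_][A-Za-z0-9_]*))?\[([^\]]+)\]")


def resolve_column_refs(
    expression: str, table_urn: str, sibling_table_urns: Dict[str, str]
) -> List[ColumnRef]:
    """Resolve qualified refs to siblings and unqualified refs to the current table."""
    by_lower = {name.lower(): urn for name, urn in sibling_table_urns.items()}
    refs: List[ColumnRef] = []
    for match in _REF.finditer(expression):
        table_name = match.group(1) or match.group(2)
        column = match.group(3)
        if table_name is None:
            # No qualifier: the column belongs to the table being processed.
            refs.append(ColumnRef(table_urn, column))
            continue
        urn = by_lower.get(table_name.lower())
        if urn is not None:
            refs.append(ColumnRef(urn, column))
    return refs
```

Note that there is no early return when sibling_table_urns is empty: an expression such as [Total Sales] still produces a reference to the current table.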
Labels

ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer.
