feat(powerbi): table-to-table lineage and calculated column lineage#16768
Open
- Add proper return type `Optional[Any]` to `_get_dax_expression` (removes type: ignore)
- Parameterize `frozenset` to `FrozenSet[str]` in `extract_dax_table_references`
- Add comment explaining `stacklevel=3` in the ImportError warning
- Fix misleading assertion message in test_bare_identifier_references_sibling_table
Linear: ING-2059
Connector Tests Results: All connector tests passed. Autogenerated by the connector-tests CI pipeline.
ligfx reviewed Mar 24, 2026
metadata-ingestion/tests/unit/powerbi/test_table_to_table_lineage.py
DAX expressions like [Total Sales] with no table qualifier reference a column in the current table. Previously these were silently dropped; now they emit a ColumnRef pointing to table_urn itself, capturing same-table column-level dependencies. Also removes the early-return guard on empty sibling_table_urns since intra-table refs are valid even when a table has no siblings.
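The behaviour described in this commit message can be sketched roughly as follows. This is a simplified illustration, not the PR's code: it uses a regular expression where the PR uses PyDAXLexer, and `ColumnRef`, `resolve_column_refs`, and the URN strings are hypothetical stand-ins introduced only for this example.

```python
import re
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class ColumnRef:
    """Hypothetical stand-in for a (table URN, column name) pair."""

    table: str
    column: str


# An optional table qualifier ('Quoted Name' or BareName) immediately
# followed by a bracketed column name, e.g. Customers[Name] or [Total Sales].
_COLUMN_REF = re.compile(r"(?:'([^']+)'|([A-Za-z_]\w*))?\[([^\]]+)\]")


def resolve_column_refs(
    expression: str,
    table_urn: str,
    sibling_table_urns: Mapping[str, str],
) -> List[ColumnRef]:
    """Map column references in a DAX expression to column refs.

    Unqualified references such as [Total Sales] point at table_urn itself.
    Qualified references are kept only if the qualifier names a sibling table.
    """
    siblings = {name.lower(): urn for name, urn in sibling_table_urns.items()}
    refs: List[ColumnRef] = []
    for quoted, bare, column in _COLUMN_REF.findall(expression):
        qualifier = quoted or bare
        if not qualifier:
            # No table qualifier: the column lives in the current table.
            refs.append(ColumnRef(table_urn, column))
        elif qualifier.lower() in siblings:
            refs.append(ColumnRef(siblings[qualifier.lower()], column))
    return refs
```

Note that the function does not return early when `sibling_table_urns` is empty, so an expression containing only `[Total Sales]` still yields a reference to the current table.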
📋 Summary
Emit lineage edges when a PowerBI table or calculated column references another table in the same dataset — covering both M-Query identifier references and DAX expressions.
🎯 Motivation
PowerBI datasets frequently contain tables whose expressions reference other tables in the same dataset by name, both through M-Query (e.g. `Table.Combine({tblA, tblB})`, bare `DimDate`) and through DAX calculated tables (e.g. `summarize('FMS Lookup', 'FMS Lookup'[FMSID])`). Additionally, calculated columns often reference columns from sibling tables using DAX functions like `RELATED(Customers[Name])`.
Before this change, DataHub produced zero lineage for all of these cases. The parser saw no external database connection and silently returned an empty result. This left large gaps in the lineage graph for customers whose PowerBI models use intra-dataset references heavily.
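As a rough illustration of what "referencing a sibling table by name" means for M-Query, the sketch below assumes the identifiers used in an expression and the names bound in its `let` scope have already been extracted. It keeps only identifiers that are not bound locally and that match a table in the same dataset, ignoring case. The function name and its inputs are hypothetical and are not taken from the PR.

```python
from typing import Iterable, List


def match_sibling_tables(
    referenced: Iterable[str],
    let_bound: Iterable[str],
    sibling_tables: Iterable[str],
) -> List[str]:
    """Return sibling table names referenced by unresolved identifiers.

    referenced:     identifiers used in the expression, in order
    let_bound:      names defined in the expression's own let scope
    sibling_tables: names of the other tables in the same dataset
    """
    bound = set(let_bound)
    by_lower = {name.lower(): name for name in sibling_tables}
    matched: List[str] = []
    for identifier in referenced:
        if identifier in bound:
            # Resolved locally (a let variable), so not a table reference.
            continue
        sibling = by_lower.get(identifier.lower())
        if sibling is not None and sibling not in matched:
            matched.append(sibling)
    return matched
```

For an expression like `Table.Combine({tblA, tblB})`, passing `["tblA", "tblB"]` as `referenced` with an empty `let` scope would return both sibling names, provided tables with those names exist in the dataset.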
🔧 Changes Overview
New Features:
- Identifiers (including `#"Quoted Names"`) that are unresolved in the M-Query `let` scope and match a sibling table name are now emitted as `TRANSFORMED` upstream edges
- `Column.expression` fields containing DAX (e.g. `RELATED(Customers[Name])`) are parsed and emitted as column-level (`FineGrainedLineage`) upstream edges
New Files:
- `m_query/dax_resolver.py`: PyDAXLexer integration with two public functions, `extract_dax_table_references()` for table-level and `extract_dax_column_lineage()` for column-level DAX lineage
Modifications:
- `m_query/data_classes.py`: `Lineage` dataclass gains `powerbi_table_upstreams: List[str]` field (backward-compatible default)
- `m_query/resolver.py`: new `resolve_to_table_references()` walks the parsed M-Query AST to find unresolved sibling references
- `m_query/parser.py`: two new hooks in `get_upstream_tables()`, one in the `MQueryParseError` except block (DAX path) and one in the `if not data_access_func_details:` branch (M-Query path)
- `powerbi.py`: `Mapper.extract_lineage()` converts `powerbi_table_upstreams` to URNs, adds the calculated column CLL loop, and broadens the emission guard to `if upstream or cll_lineage:`
Dependencies:
- Adds `PyDAXLexer==0.3.0` to the `powerbi` extras group in `setup.py`
🏗️ Architecture/Design Notes
Three independent resolution paths, same output shape:
All table-to-table results flow through a single `powerbi_table_upstreams: List[str]` field on `Lineage` and are converted to `UpstreamClass(type=TRANSFORMED)` entries in `powerbi.py`. Column lineage flows through the existing `FineGrainedLineage` path already used for SQL-parsed column lineage.
Detection of DAX vs M-Query: The powerquery-parser bridge raises `MQueryParseError` for any DAX expression (DAX is syntactically incompatible with M-Query). This is used as the natural routing signal, so no heuristic is needed.
False positive protection: Sibling table matching only fires when (1) the identifier is genuinely unresolved in the M-Query `let` scope, AND (2) the full name matches, case-insensitively, a real table in the same dataset. Variables, function names, and literals are not matched.
Intentionally excluded: `AnalysisServices.Database` (semantic layer, no relational table concept) and `UsageMetricsDataConnector` (PowerBI internal metrics). Measure-to-column lineage is deferred.
🧪 Testing
`tests/unit/powerbi/test_table_to_table_lineage.py` covering:
- Bare identifier (`DimDate`)
- Quoted identifier (`#"tbl_PayrollHistory"`)
- `Table.Combine({tblA, tblB})` → multiple siblings
- External source (`Sql.Database` still produces `upstreams`, not `powerbi_table_upstreams`)
- `summarize('FMS Lookup', ...)` → sibling table
- DAX expression (`CALENDAR(...)`) with no matching sibling → no refs
- `extract_dax_column_lineage` with `RELATED(Customers[Name])` → column ref
- Bare column reference (`[Total Sales]`) → filtered out
- `test_powerbi_parser.py` tests pass (pattern handler regression)
📊 Impact Assessment
Affected Components: PowerBI ingestion source only (`m_query/`, `powerbi.py`)
Breaking Changes: None. The `powerbi_table_upstreams` field on `Lineage` has a default of `[]`, so all existing call sites are unaffected. External source lineage (SQL Server, Snowflake, etc.) is unchanged.
Performance Impact: Negligible. The M-Query AST walk is O(n) in node count. PyDAXLexer is only invoked for expressions that already failed M-Query parsing, so there is no additional parsing cost for standard M-Query tables.
Risk Level: Low. New code paths only activate for expressions that previously returned empty lineage. The worst case regression is still empty lineage (same as before).
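The backward-compatibility claim under Breaking Changes can be illustrated with a minimal sketch. The `Lineage` class below is a simplified stand-in with plain string fields, not the real class from `m_query/data_classes.py`, and the use of `default_factory` is an assumption about how the `[]` default is expressed (dataclasses reject a bare mutable default).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lineage:
    """Simplified stand-in; the real dataclass has other field types."""

    upstreams: List[str] = field(default_factory=list)
    # New field: defaults to an empty list, so existing callers that never
    # pass it keep working unchanged.
    powerbi_table_upstreams: List[str] = field(default_factory=list)


# A call site written before the new field existed still constructs fine.
old_style = Lineage(upstreams=["external_table"])

# A new call site can populate the sibling-table references.
new_style = Lineage(powerbi_table_upstreams=["DimDate"])
```

Using a factory also means each instance gets its own list, so appending to one instance's `powerbi_table_upstreams` does not leak into another.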