Hybrid distance weighting in AIA #305

@kendrickb-nvidia

Priority Level

Medium

Describe the bug

From the Copilot review on #185: the distance weighting between text and tabular columns in attribute_inference_protection.py is most likely not what was intended. See the code marked "Now get the hybrid distance".

The mixed text+tabular branch’s hybrid weighting below uses len(df_train_use.columns) as the denominator, but df_train_use has already been reduced to tabular-only columns earlier in this function. That makes tab_weight effectively 1 and text_weight > 0 (weights don’t sum to 1), skewing the hybrid distance and potentially changing AIA results. Consider computing weights from the original total column count (e.g., len(text_columns) + len(tabular_columns)) or another normalized scheme that remains valid after splitting.
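To make the review comment concrete, here is a minimal sketch of the two weighting schemes. It is a reconstruction from the description above, not the code in attribute_inference_protection.py: the function names and the example column names are hypothetical, and the only assumption carried over is that the denominator is the column count of the already-reduced, tabular-only frame.

```python
def weights_from_reduced_frame(text_columns, tabular_columns):
    """Hypothetical reconstruction of the reported behavior.

    Assumption: the denominator is len(df_train_use.columns) AFTER
    df_train_use has been reduced to tabular-only columns, so it equals
    the number of tabular columns.
    """
    reduced_column_count = len(tabular_columns)
    tab_weight = len(tabular_columns) / reduced_column_count
    text_weight = len(text_columns) / reduced_column_count
    return tab_weight, text_weight


def weights_from_total_columns(text_columns, tabular_columns):
    """Suggested alternative: normalize by the original total column count."""
    total = len(text_columns) + len(tabular_columns)
    return len(tabular_columns) / total, len(text_columns) / total


# Illustrative column names only.
text_columns = ["review"]
tabular_columns = ["age", "income", "zip", "tenure"]

reported = weights_from_reduced_frame(text_columns, tabular_columns)
normalized = weights_from_total_columns(text_columns, tabular_columns)

print("reported  :", reported, "sum =", round(sum(reported), 6))
print("normalized:", normalized, "sum =", round(sum(normalized), 6))
```

With one text column and four tabular columns, the reported scheme gives a tabular weight of 1 and a text weight of 0.25, while the normalized scheme gives weights that sum to 1.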

Steps/Code to reproduce bug

Run AIA on a dataset that has one or more text columns alongside its tabular columns.

Expected behavior

Assume we have m_tab tabular columns and m_text text columns, with tabular distance d_tab and text distance d_text. Absent any comments to the contrary, the way the code is written suggests the weighting should be (m_tab/(m_tab+m_text)) * d_tab + (m_text/(m_tab+m_text)) * d_text, i.e. a basic weighted average.

But the actual behavior is d_tab + (m_text/m_tab) * d_text. Notably, this is not a weighted average of any sort: for non-negative distances, the hybrid distance is always at least d_tab when text columns exist, and it can exceed both input distances. It is unclear how much of an issue this is downstream, but it is almost certainly not the intended behavior.
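The two formulas can be compared numerically. The sketch below implements them exactly as written in this issue; the function names and the sample distances are made up for illustration and are not taken from the AIA code.

```python
def intended_hybrid(d_tab, d_text, m_tab, m_text):
    """Weighted average of the two distances, weighted by column counts."""
    total = m_tab + m_text
    return (m_tab / total) * d_tab + (m_text / total) * d_text


def actual_hybrid(d_tab, d_text, m_tab, m_text):
    """Behavior described in this issue: tabular weight 1, text weight m_text/m_tab."""
    return d_tab + (m_text / m_tab) * d_text


# Illustrative values: two tabular columns, two text columns.
d_tab, d_text = 0.4, 0.6
m_tab, m_text = 2, 2

intended = intended_hybrid(d_tab, d_text, m_tab, m_text)
actual = actual_hybrid(d_tab, d_text, m_tab, m_text)

print(f"intended: {intended:.3f}")  # lies between d_tab and d_text
print(f"actual:   {actual:.3f}")    # exceeds both inputs in this case
```

In this example the weighted average stays between 0.4 and 0.6, while the value from the actual formula is larger than both input distances.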

Additional context

No response

Labels

bug (Something isn't working)
