Priority Level
Medium
Describe the bug
From the Copilot review on #185: the distance weighting between text and tabular columns in attribute_inference_protection.py is most likely not what was intended. See the code marked "Now get the hybrid distance".
The mixed text+tabular branch’s hybrid weighting below uses len(df_train_use.columns) as the denominator, but df_train_use has already been reduced to tabular-only columns earlier in this function. That makes tab_weight effectively 1 and text_weight > 0 (weights don’t sum to 1), skewing the hybrid distance and potentially changing AIA results. Consider computing weights from the original total column count (e.g., len(text_columns) + len(tabular_columns)) or another normalized scheme that remains valid after splitting.
Steps/Code to reproduce bug
Run AIA on a dataset with one or more text columns alongside tabular columns
Expected behavior
Assume we have m_tab tabular columns and m_text text columns, with tabular distance d_tab and text distance d_text. Absent any comments to the contrary, the way the code is written suggests the weighting should be (m_tab/(m_tab+m_text)) * d_tab + (m_text/(m_tab+m_text)) * d_text, i.e. a basic weighted average.
But the actual behavior is d_tab + (m_text/m_tab) * d_text. Notably, this is not a weighted average of any sort: the weights sum to more than 1, so when text columns exist the hybrid distance is never smaller than d_tab and can exceed both input distances. It is unclear how much of an issue this is downstream, but it is almost certainly not the intended behavior.
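The difference between the two formulas can be checked with a small standalone sketch. The function names and input values below are hypothetical and chosen only for illustration; this is not the code in attribute_inference_protection.py, just the two expressions from this report written out.

```python
def intended_hybrid(d_tab, d_text, m_tab, m_text):
    """Weighted average: each weight is a column share, so the weights sum to 1."""
    total = m_tab + m_text
    return (m_tab / total) * d_tab + (m_text / total) * d_text


def actual_hybrid(d_tab, d_text, m_tab, m_text):
    """Behavior described in this issue: the denominator is the tabular-only
    column count, so the weights are 1 and m_text / m_tab."""
    return (m_tab / m_tab) * d_tab + (m_text / m_tab) * d_text


# Illustrative values only, not taken from a real AIA run.
d_tab, d_text = 0.5, 0.25
m_tab, m_text = 3, 1

intended = intended_hybrid(d_tab, d_text, m_tab, m_text)
actual = actual_hybrid(d_tab, d_text, m_tab, m_text)

print(f"intended: {intended:.4f}")  # lies between d_text and d_tab
print(f"actual:   {actual:.4f}")    # exceeds both inputs for these values
```

With these inputs the intended value stays between the two distances, while the actual value is larger than both, which matches the skew described above.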
Additional context
Code location: Safe-Synthesizer/src/nemo_safe_synthesizer/evaluation/components/attribute_inference_protection.py, line 362 at commit 5ecc0ad