NSS privacy scores do not compute average correctly when 3 or more text columns #203

@kendrickb-nvidia

Description

Priority Level

Medium (Annoying but has workaround)

Describe the bug

Repeatedly calling np.average on pairs of values, in the style of fold, is not equivalent to calling np.average on the entire list at once.
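A minimal sketch of the difference, using made-up scalar values rather than real embeddings:

```python
import numpy as np

values = [0.0, 0.0, 8.0]  # illustrative numbers only, not from any dataset

# Fold-style: average the running result with the next value, pair by pair
folded = values[0]
for v in values[1:]:
    folded = np.average([folded, v])

# Single call over the whole list
direct = np.average(values)

print(folded)  # 4.0, the last value dominates
print(direct)  # 8/3, every value counts equally
```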

The existing code in the AIA and MIA calculations is not correct.

Example snippet that computes a weighted average:

            # TODO: Is this average what we want? When there are more than 2 columns, we will
            # overweight later columns relative to earlier columns.
            norm = embeddings[df.columns[0]][i]
            for j in range(1, len(df.columns)):
                field = df.columns[j]
                norm = np.average([norm, embeddings[field][i]], axis=0)

Steps/Code to reproduce bug

Run on a dataset with 3 or more text columns.

Expected behavior

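All text columns should be weighted equally. Instead, the order of the columns determines the weight: the last column gets 1/2, the one before it 1/4, then 1/8, and so on, with the first two columns sharing the smallest weight.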
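The effective weights can be worked out with a small helper. `fold_weights` is a hypothetical name used only for this illustration; it mirrors the pairwise loop by halving all existing weights each time a new column is folded in:

```python
def fold_weights(n_cols):
    """Effective weight of each column after pairwise fold averaging."""
    weights = [1.0]  # with a single column, it carries all the weight
    for _ in range(1, n_cols):
        # Averaging the running result with a new column halves every
        # earlier weight and gives the new column a weight of 1/2.
        weights = [w / 2 for w in weights] + [0.5]
    return weights

print(fold_weights(3))  # [0.25, 0.25, 0.5]
print(fold_weights(4))  # [0.125, 0.125, 0.25, 0.5]
```

With equal weighting, each of 3 columns would instead get 1/3.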

Additional context

The fix can also make better use of numpy, as recommended by Aaron:

```python
embeddings = {}
for col in df.columns:
    data = [str(r) for r in df[col].to_list()]
    embeddings[col] = embedder.encode(data, show_progress_bar=False, convert_to_numpy=True)

# Stack all column embeddings and compute the mean across columns
stacked = np.stack([embeddings[col] for col in df.columns], axis=0)  # shape: (n_cols, n_rows, embed_dim)
avg_embeddings = np.mean(stacked, axis=0)  # shape: (n_rows, embed_dim)

return pd.DataFrame({"embedding": list(avg_embeddings)})
```
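
As a rough sanity check of the stacked approach, the sketch below swaps the real embedder for random arrays (an assumption made purely for illustration; the column names `a`, `b`, `c` are hypothetical) and compares the stacked mean against both an explicit equal-weight average and the pairwise fold:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, embed_dim = 4, 3

# Stand-in for the embedder output: one (n_rows, embed_dim) array per column
embeddings = {col: rng.normal(size=(n_rows, embed_dim)) for col in ["a", "b", "c"]}
cols = list(embeddings)

# Stacked mean across columns, as in the suggestion above
stacked = np.stack([embeddings[col] for col in cols], axis=0)  # (n_cols, n_rows, embed_dim)
avg_embeddings = np.mean(stacked, axis=0)                       # (n_rows, embed_dim)

# Explicit equal-weight average for comparison
expected = (embeddings["a"] + embeddings["b"] + embeddings["c"]) / 3

# Pairwise fold, mirroring the existing loop
folded = embeddings[cols[0]]
for col in cols[1:]:
    folded = np.average([folded, embeddings[col]], axis=0)

print(np.allclose(avg_embeddings, expected))  # stacked mean matches equal weighting
print(np.allclose(avg_embeddings, folded))    # fold result differs for 3 columns
```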

Labels: bug (Something isn't working)