NSS privacy scores do not compute average correctly when 3 or more text columns #203

### Priority Level

Medium (Annoying but has workaround)

### Describe the bug

Repeatedly calling `np.average` on pairs of values, fold-style, is not equivalent to calling `np.average` once on the entire list.

The existing code in the AIA and MIA calculations averages this way, so it is not correct when there are 3 or more text columns.

Example snippet that computes a weighted average:

```
            # TODO: Is this average what we want? When there are more than 2 columns, we will
            # overweight later columns relative to earlier columns.
            norm = embeddings[df.columns[0]][i]
            for j in range(1, len(df.columns)):
                field = df.columns[j]
                norm = np.average([norm, embeddings[field][i]], axis=0)
```
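A minimal sketch of the discrepancy, assuming three one-hot vectors in place of real embeddings so that each column's effective weight is visible in the result. The helper name `fold_average` is hypothetical; it mirrors the loop in the snippet above.

```python
import numpy as np


def fold_average(vectors):
    """Average pairwise, left to right, like the loop in the snippet above."""
    norm = vectors[0]
    for v in vectors[1:]:
        norm = np.average([norm, v], axis=0)
    return norm


# One-hot stand-ins for the embeddings of three text columns (illustrative only).
vectors = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 1.0, 0.0]),
    np.array([0.0, 0.0, 1.0]),
]

folded = fold_average(vectors)          # weights 1/4, 1/4, 1/2
true_mean = np.mean(vectors, axis=0)    # weights 1/3, 1/3, 1/3

print(folded)
print(true_mean)
```

With two columns the two results coincide, which is why the problem only shows up at 3 or more columns.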

### Steps/Code to reproduce bug

Run on a dataset with 3 or more text columns.

### Expected behavior

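All text columns are weighted equally. Instead, the order of the columns determines the weight: the last column gets 1/2, the one before it 1/4, the one before that 1/8, and so on, with the first two columns each getting the smallest weight (1/2^(n-1) for n columns).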

### Additional context

Further improvements can also make better use of numpy, as recommended by Aaron:

````
embeddings = {}
for col in df.columns:
    data = [str(r) for r in df[col].to_list()]
    embeddings[col] = embedder.encode(data, show_progress_bar=False, convert_to_numpy=True)

# Stack all column embeddings and compute the mean across columns
stacked = np.stack([embeddings[col] for col in df.columns], axis=0)  # shape: (n_cols, n_rows, embed_dim)
avg_embeddings = np.mean(stacked, axis=0)  # shape: (n_rows, embed_dim)

return pd.DataFrame({"embedding": list(avg_embeddings)})
````
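The suggestion above can be exercised end to end with a stub. This is only a sketch: `FakeEmbedder` and `average_text_embeddings` are hypothetical names, and the stub's `encode` simply mimics the call signature used in the snippet rather than any real embedding model.

```python
import numpy as np
import pandas as pd


class FakeEmbedder:
    """Hypothetical stand-in for the real embedder; returns deterministic 2-d vectors."""

    def encode(self, data, show_progress_bar=False, convert_to_numpy=True):
        # Embed each string as [length, vowel count]: arbitrary but reproducible.
        return np.array(
            [[len(s), sum(c in "aeiou" for c in s)] for s in data], dtype=float
        )


def average_text_embeddings(df, embedder):
    """Embed every column, then take an unweighted mean across columns per row."""
    per_column = [
        embedder.encode(
            [str(r) for r in df[col].to_list()],
            show_progress_bar=False,
            convert_to_numpy=True,
        )
        for col in df.columns
    ]
    stacked = np.stack(per_column, axis=0)     # shape: (n_cols, n_rows, embed_dim)
    avg_embeddings = np.mean(stacked, axis=0)  # shape: (n_rows, embed_dim)
    return pd.DataFrame({"embedding": list(avg_embeddings)})


df = pd.DataFrame({"a": ["x", "yy"], "b": ["abc", "e"], "c": ["io", "zzzz"]})
result = average_text_embeddings(df, FakeEmbedder())
print(result)
```

Because the mean is taken over the stacked column axis in one call, reordering the columns of `df` does not change the result.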
