Priority Level
Medium (Annoying but has workaround)
Describe the bug
Repeatedly calling np.average on pairs of values, in the style of a fold, is not equivalent to calling np.average once on the entire list.
As a result, the existing code in the AIA and MIA calculations is not correct.
Example snippet that computes a weighted average:
```python
# TODO: Is this average what we want? When there are more than 2 columns, we will
# overweight later columns relative to earlier columns.
norm = embeddings[df.columns[0]][i]
for j in range(1, len(df.columns)):
    field = df.columns[j]
    norm = np.average([norm, embeddings[field][i]], axis=0)
```
Steps/Code to reproduce bug
Run dataset with 3 or more text columns
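The problem can also be seen without a real dataset. The following is a minimal sketch, not the project's code: `fold_average` is a hypothetical helper that mirrors the loop in the snippet above, and the one-hot vectors are made-up stand-ins for one row's embedding in each of three text columns, chosen so that the result can be read directly as per-column weights.

```python
import numpy as np


def fold_average(vectors):
    """Pairwise fold, mirroring the loop from the snippet in this issue (hypothetical helper)."""
    norm = vectors[0]
    for vec in vectors[1:]:
        norm = np.average([norm, vec], axis=0)
    return norm


# Made-up one-hot "embeddings": position k of the result is the weight given to column k.
col_a = np.array([1.0, 0.0, 0.0])
col_b = np.array([0.0, 1.0, 0.0])
col_c = np.array([0.0, 0.0, 1.0])

folded = fold_average([col_a, col_b, col_c])     # weights 0.25, 0.25, 0.5
equal = np.mean([col_a, col_b, col_c], axis=0)   # weights 1/3, 1/3, 1/3
reordered = fold_average([col_c, col_b, col_a])  # weights 0.5, 0.25, 0.25

print("fold:     ", folded)
print("mean:     ", equal)
print("reordered:", reordered)
```

With only two columns the two approaches agree, which is why the bug needs at least three text columns to show up.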
Expected behavior
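All text columns are weighted equally. Instead, the order of the columns determines the weight: the last column gets 1/2, the one before it 1/4, then 1/8, and so on, with the first two columns sharing the smallest weight.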
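The weights can be derived by unrolling the fold. The notation here is introduced only for illustration: $e_k$ is the embedding of column $k$ for a given row, $n$ is the number of text columns, and $m_k$ is the running value of `norm` after processing column $k$.

$$
m_1 = e_1, \qquad m_k = \tfrac{1}{2}\, m_{k-1} + \tfrac{1}{2}\, e_k \quad (k = 2, \dots, n)
$$

$$
m_n = 2^{-(n-1)}\, e_1 + \sum_{k=2}^{n} 2^{-(n-k+1)}\, e_k
\qquad \text{versus the intended} \qquad
\bar{e} = \frac{1}{n} \sum_{k=1}^{n} e_k
$$

The fold weights still sum to 1, but they are equal only when $n = 2$. For $n = 3$ they are $1/4, 1/4, 1/2$ rather than $1/3$ each.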
Additional context
The fix can also make better use of NumPy, as recommended by Aaron:
```python
embeddings = {}
for col in df.columns:
    data = [str(r) for r in df[col].to_list()]
    embeddings[col] = embedder.encode(data, show_progress_bar=False, convert_to_numpy=True)

# Stack all column embeddings and compute the mean across columns
stacked = np.stack([embeddings[col] for col in df.columns], axis=0)  # shape: (n_cols, n_rows, embed_dim)
avg_embeddings = np.mean(stacked, axis=0)  # shape: (n_rows, embed_dim)
return pd.DataFrame({"embedding": list(avg_embeddings)})
```