Summary
Find optimal
Metrics
Inertia (Within-Cluster Sum of Squares)
- Definition: This is the sum of the squared distances between each data point and the centroid of its cluster. Mathematically, for each cluster $C_i$ with centroid $m_i$, the inertia $W$ is $W = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2$.
- Interpretation: A lower inertia means that data points are closer to the centroids of their respective clusters, which generally implies better clustering. However, inertia is sensitive to the number of clusters $k$: the more clusters you use, the lower the inertia becomes, which can be misleading on its own. A short computation sketch follows this list.
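A minimal sketch, assuming scikit-learn and a feature matrix `X` with one row per sample (for example, one embedding per chunk); the random data here is only a stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: replace X with your real feature matrix (n_samples, n_features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# scikit-learn exposes the within-cluster sum of squares directly.
print("inertia (sklearn):", kmeans.inertia_)

# Equivalent manual computation of W = sum_i sum_{x in C_i} ||x - m_i||^2.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
print("inertia (manual): ", np.sum((X - assigned_centroids) ** 2))
```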
Silhouette Score
- Definition: For each sample $x$, compute $a$ as the average distance from $x$ to all other points in the same cluster, and $b$ as the smallest average distance from $x$ to the points of any other cluster (i.e., the distance to the nearest neighboring cluster). The silhouette score $s$ for that single sample is $s = (b - a) / \max(a, b)$.
- Interpretation: The silhouette score ranges from -1 to 1. A score near 1 means the sample is well clustered; a score near 0 means it lies on or very close to the boundary between two clusters; a score near -1 suggests the sample may have been assigned to the wrong cluster. See the sketch below for computing it per sample and on average.
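A short sketch under the same assumptions (scikit-learn, feature matrix `X`); `silhouette_score` gives the mean over all samples and `silhouette_samples` the per-sample values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Stand-in data: replace with your real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all samples: closer to 1 is better.
print("mean silhouette:", silhouette_score(X, labels))

# Per-sample scores highlight points near (or past) a cluster boundary.
per_sample = silhouette_samples(X, labels)
print("lowest 5 sample scores:", np.sort(per_sample)[:5])
```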
Davies-Bouldin Index
- Definition: For each cluster $i$, its similarity $S(i, j)$ to another cluster $j$ is the ratio of the sum of their average within-cluster distances to the cluster centers to the distance between the two cluster centers. The Davies-Bouldin index $DB$ is the average of the worst-case $S(i, j)$ over all clusters $i$: $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} S(i, j)$.
- Interpretation: A lower $DB$ index signifies better clustering. It measures how well separated the clusters are relative to the dispersion within each cluster; a short example follows.
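scikit-learn also provides this index directly; a minimal sketch under the same assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Stand-in data: replace with your real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Lower is better: compact, well-separated clusters score close to 0.
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))
```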
Gap Statistics
- Definition: For each candidate $k$, run $k$-means on your data and compute the inertia $W_k$. Then generate $B$ reference datasets (for example, drawn uniformly over the bounding box of the data), cluster each with $k$-means, and record their inertias. The Gap statistic is $\mathrm{Gap}(k) = E^*[\log W_k] - \log W_k$, where $E^*[\log W_k]$ is the expected log-inertia under the reference (random) data, estimated by averaging over the $B$ reference datasets.
- Interpretation: A higher Gap statistic indicates that your clustering deviates favorably from what random, structure-free data would produce, giving some confidence that the clustering structure is meaningful. A sketch of the computation follows.
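scikit-learn has no built-in Gap statistic, so the sketch below implements a Tibshirani-style version by hand, drawing the $B$ reference datasets uniformly from the bounding box of `X`; `gap_statistic` is a hypothetical helper name, not a library function:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, random_state=0):
    """Estimate Gap(k) = E*[log W_k] - log W_k using uniform reference data."""
    rng = np.random.default_rng(random_state)

    # Log inertia of the real data.
    log_wk = np.log(
        KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).inertia_
    )

    # Reference datasets drawn uniformly within the bounding box of X.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_log_wks = []
    for _ in range(n_refs):
        X_ref = rng.uniform(mins, maxs, size=X.shape)
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_ref)
        ref_log_wks.append(np.log(km.inertia_))

    return float(np.mean(ref_log_wks) - log_wk)

# Stand-in data: replace with your real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
for k in range(2, 7):
    print(k, round(gap_statistic(X, k), 3))
```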
These metrics, when used collectively, can provide a more comprehensive evaluation of your clustering strategy. You can integrate them into your pipeline and observe how they change with different chunking strategies, which should provide you with more empirical grounding for your choices.
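One way to wire these together, as a sketch: a hypothetical `evaluate_clustering` helper that sweeps candidate values of $k$ over whatever feature matrix a given chunking strategy produces, so the same report can be compared across strategies:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

def evaluate_clustering(X, k_values=(2, 3, 4, 5, 6), random_state=0):
    """Report inertia, silhouette, and Davies-Bouldin for each candidate k."""
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        rows.append({
            "k": k,
            "inertia": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
            "davies_bouldin": davies_bouldin_score(X, km.labels_),
        })
    return rows

# X could be the embeddings produced by one chunking strategy; re-run
# evaluate_clustering on each strategy's embeddings to compare them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
for row in evaluate_clustering(X):
    print(row)
```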