Hi, I encountered numerical overflow issues with the current implementation of the `all_points_core_distance` function, specifically when raising `1 / d_ij` to a large power `d`. For large `d` values or very small distances, this can easily produce extremely large numbers and overflow. A comment in #197 already mentioned this problem.
I used logs to keep the values at a manageable scale and then exponentiate the final result. Here’s the original code snippet:
```python
def all_points_core_distance(distance_matrix, d=2.0):
    """Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.
    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    distance_matrix[distance_matrix != 0] = (
        1.0 / distance_matrix[distance_matrix != 0]
    ) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1

    if result.sum() == 0:
        result = np.zeros(len(distance_matrix))
    else:
        result **= (-1.0 / d)

    return result
```
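For a concrete sense of the failure mode (the values below are chosen purely for illustration): with dimension `d = 200` and a pairwise distance of `0.01`, `(1 / d_ij) ** d` is `1e400`, which exceeds the float64 range (about `1.8e308`) and overflows to `inf`, while the log-domain equivalent stays comfortably finite:

```python
import numpy as np

d = 200.0    # illustrative high dimension
d_ij = 0.01  # illustrative small pairwise distance

with np.errstate(over="ignore"):
    value = np.float64(1.0 / d_ij) ** d  # 100 ** 200 = 1e400

print(value)              # inf: overflowed float64
print(-d * np.log(d_ij))  # the log-domain equivalent, ~921.0, stays finite
```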
Below is a modified version that uses logarithms to avoid numerical overflow. I tested this on several examples and it produces results identical to the original method:
```python
import numpy as np
from scipy.special import logsumexp


def all_points_core_distance(distance_matrix, d=2.0):
    """Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.
    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    N = distance_matrix.shape[0]
    dists = distance_matrix.copy()
    dists[dists == 0] = np.nan

    # Work in the log domain: s_ij = log((1 / d_ij) ** d) = -d * log(d_ij)
    s_ij = -d * np.log(dists)
    np.fill_diagonal(s_ij, -np.inf)

    # Stable log of the row sums via log-sum-exp
    log_S_i = logsumexp(s_ij, axis=1)
    log_m_i = log_S_i - np.log(N - 1)
    log_apcd_i = -(1.0 / d) * log_m_i

    apcd_i = np.exp(log_apcd_i)
    apcd_i[np.isinf(apcd_i)] = 0
    apcd_i[np.isnan(apcd_i)] = 0
    return apcd_i
```
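As a quick sanity check (this is my own sketch, not part of the issue; the condensed helper names `apcd_naive` and `apcd_log` are hypothetical), the two formulations agree on a small random distance matrix where no overflow occurs:

```python
import numpy as np
from scipy.special import logsumexp


def apcd_naive(D, d=2.0):
    # Condensed form of the original computation (works on a copy).
    M = D.copy()
    M[M != 0] = (1.0 / M[M != 0]) ** d
    result = M.sum(axis=1) / (M.shape[0] - 1)
    return np.zeros(len(M)) if result.sum() == 0 else result ** (-1.0 / d)


def apcd_log(D, d=2.0):
    # Condensed log-domain form.
    N = D.shape[0]
    dists = D.astype(float).copy()
    dists[dists == 0] = np.nan
    s_ij = -d * np.log(dists)
    np.fill_diagonal(s_ij, -np.inf)
    log_m_i = logsumexp(s_ij, axis=1) - np.log(N - 1)
    apcd = np.exp(-(1.0 / d) * log_m_i)
    apcd[~np.isfinite(apcd)] = 0
    return apcd


rng = np.random.default_rng(0)
X = rng.random((6, 3))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)

print(np.allclose(apcd_naive(D, d=3.0), apcd_log(D, d=3.0)))  # True
```

Mathematically the two are identical (`exp` of `logsumexp` of `-d * log(d_ij)` is exactly the sum of `(1 / d_ij) ** d`); the log-domain version simply avoids materializing the huge intermediate values.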