
Validity index computation overflows in all-points-core-distance calculation #665

Open
bapierro opened this issue Dec 19, 2024 · 0 comments

Hi, I ran into numerical overflow with the current implementation of the all_points_core_distance function, specifically where 1 / d_ij is raised to the power d. For large d (high-dimensional data) or very small distances, the intermediate values quickly exceed the float64 range and overflow. A comment in #197 already mentioned this problem.
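
To make the failure mode concrete, here is a minimal sketch with made-up numbers (not from a real dataset):

import numpy as np

# With a high dimension d and small pairwise distances, (1 / d_ij) ** d
# exceeds the float64 maximum (~1.8e308) and overflows to inf.
d = 500.0
d_ij = np.array([1e-3, 5e-4, 2e-2])
print((1.0 / d_ij) ** d)  # [inf inf inf], with a RuntimeWarning about overflow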

My fix works in log space to keep the values at a manageable scale and only exponentiates the final result; the identity being used is written out after the snippet. Here’s the original code snippet:

import numpy as np

def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
     #"""
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
        distance_matrix != 0]) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1

    if result.sum() == 0:
        result = np.zeros(len(distance_matrix))
    else:
        result **= (-1.0 / d)

    return result
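
For reference, this is the quantity the code above computes, and the equivalent log-space form my modified version evaluates (writing $d(x_i, x_j)$ for the pairwise distance, $d$ for the dimension, and $N$ for the cluster size):

$$a_{\text{coredist}}(x_i) = \left( \frac{1}{N - 1} \sum_{j \neq i} \left( \frac{1}{d(x_i, x_j)} \right)^{d} \right)^{-1/d}$$

$$\log a_{\text{coredist}}(x_i) = -\frac{1}{d} \left( \operatorname{logsumexp}_{j \neq i}\bigl( -d \, \log d(x_i, x_j) \bigr) - \log(N - 1) \right)$$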

Below is a modified version that uses logarithms to avoid numerical overflow. I tested this on several examples and it produces results identical to the original method:

import numpy as np
from scipy.special import logsumexp

def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    N = distance_matrix.shape[0]

    # Work in log space: s_ij = log((1 / d_ij) ** d) = -d * log(d_ij).
    # Zero distances (the diagonal, and any duplicate points) contribute
    # nothing to the sum, which corresponds to -inf in log space.
    s_ij = np.full(distance_matrix.shape, -np.inf)
    nonzero = distance_matrix != 0
    s_ij[nonzero] = -d * np.log(distance_matrix[nonzero])

    # log of sum_{j != i} (1 / d_ij) ** d, computed stably by logsumexp
    log_S_i = logsumexp(s_ij, axis=1)
    # log of the mean over the N - 1 other points
    log_m_i = log_S_i - np.log(N - 1)
    # the all-points-core-distance is that mean raised to the power -1 / d
    log_apcd_i = -(1.0 / d) * log_m_i
    apcd_i = np.exp(log_apcd_i)

    # Degenerate rows whose distances are all zero come out as inf here;
    # map them (and any NaNs) to 0.
    apcd_i[np.isinf(apcd_i)] = 0
    apcd_i[np.isnan(apcd_i)] = 0
    return apcd_i
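
As a quick sanity check, here is a hypothetical high-dimensional example (random data, not a real benchmark) where the original formulation overflows but the log-space version stays finite:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# 20 points in 500 dimensions with tiny pairwise distances, so
# (1 / d_ij) ** 500 is far beyond the float64 range in the original code.
rng = np.random.default_rng(0)
X = rng.normal(scale=1e-3, size=(20, 500))
D = squareform(pdist(X))

print(all_points_core_distance(D, d=500.0))  # finite core distances, no warnings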