
Validity index computation overflows in all-points-core-distance calculation #665

Open
bapierro opened this issue Dec 19, 2024 · 0 comments

Hi, I ran into numerical overflow with the current implementation of the all_points_core_distance function, specifically where 1 / d_ij is raised to the power d. For large d (high-dimensional data) or very small distances, the intermediate values quickly exceed the float64 range and overflow. A comment in #197 already mentioned this problem.
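
To make the failure mode concrete, here is a minimal sketch with made-up numbers (not from a real dataset):

import numpy as np

# With a high dimension d and small pairwise distances, (1 / d_ij) ** d
# exceeds the float64 maximum (~1.8e308) and overflows to inf.
d = 500.0
d_ij = np.array([1e-3, 5e-4, 2e-2])
print((1.0 / d_ij) ** d)  # [inf inf inf], with a RuntimeWarning about overflow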

My fix works in log space to keep the values at a manageable scale and only exponentiates the final result; the identity being used is written out after the snippet. Here’s the original code snippet:

import numpy as np

def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
     #"""
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
        distance_matrix != 0]) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1

    if result.sum() == 0:
        result = np.zeros(len(distance_matrix))
    else:
        result **= (-1.0 / d)

    return result
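
For reference, this is the quantity the code above computes, and the equivalent log-space form my modified version evaluates (writing $d(x_i, x_j)$ for the pairwise distance, $d$ for the dimension, and $N$ for the cluster size):

$$a_{\text{coredist}}(x_i) = \left( \frac{1}{N - 1} \sum_{j \neq i} \left( \frac{1}{d(x_i, x_j)} \right)^{d} \right)^{-1/d}$$

$$\log a_{\text{coredist}}(x_i) = -\frac{1}{d} \left( \operatorname{logsumexp}_{j \neq i}\bigl( -d \, \log d(x_i, x_j) \bigr) - \log(N - 1) \right)$$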

Below is a modified version that uses logarithms to avoid numerical overflow. I tested this on several examples and it produces results identical to the original method:

import numpy as np
from scipy.special import logsumexp

def all_points_core_distance(distance_matrix, d=2.0):
    """
    Compute the all-points-core-distance for all the points of a cluster.

    Parameters
    ----------
    distance_matrix : array (cluster_size, cluster_size)
        The pairwise distance matrix between points in the cluster.

    d : integer
        The dimension of the data set, which is used in the computation
        of the all-point-core-distance as per the paper.

    Returns
    -------
    core_distances : array (cluster_size,)
        The all-points-core-distance of each point in the cluster

    References
    ----------
    Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J.,
    2014. Density-Based Clustering Validation. In SDM (pp. 839-847).
    """
    N = distance_matrix.shape[0]

    # Work in log space: s_ij = log((1 / d_ij) ** d) = -d * log(d_ij).
    # Zero distances (the diagonal, and any duplicate points) contribute
    # nothing to the sum, which corresponds to -inf in log space.
    s_ij = np.full(distance_matrix.shape, -np.inf)
    nonzero = distance_matrix != 0
    s_ij[nonzero] = -d * np.log(distance_matrix[nonzero])

    # log of sum_{j != i} (1 / d_ij) ** d, computed stably by logsumexp
    log_S_i = logsumexp(s_ij, axis=1)
    # log of the mean over the N - 1 other points
    log_m_i = log_S_i - np.log(N - 1)
    # the all-points-core-distance is that mean raised to the power -1 / d
    log_apcd_i = -(1.0 / d) * log_m_i
    apcd_i = np.exp(log_apcd_i)

    # Degenerate rows whose distances are all zero come out as inf here;
    # map them (and any NaNs) to 0.
    apcd_i[np.isinf(apcd_i)] = 0
    apcd_i[np.isnan(apcd_i)] = 0
    return apcd_i
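
As a quick sanity check, here is a hypothetical high-dimensional example (random data, not a real benchmark) where the original formulation overflows but the log-space version stays finite:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# 20 points in 500 dimensions with tiny pairwise distances, so
# (1 / d_ij) ** 500 is far beyond the float64 range in the original code.
rng = np.random.default_rng(0)
X = rng.normal(scale=1e-3, size=(20, 500))
D = squareform(pdist(X))

print(all_points_core_distance(D, d=500.0))  # finite core distances, no warnings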