evoc.evoc_clusters

evoc.evoc_clusters(data, noise_level=0.5, base_min_cluster_size=5, base_n_clusters=None, approx_n_clusters=None, n_neighbors=15, min_samples=5, n_epochs=50, node_embedding_init='label_prop', symmetrize_graph=True, return_duplicates=False, node_embedding_dim=None, neighbor_scale=1.0, random_state=None, reproducible_flag=True, min_similarity_threshold=0.2, max_layers=10, n_label_prop_iter=20)

Cluster data using the EVoC algorithm.

Parameters:
  • data (array-like of shape (n_samples, n_features)) – The data to cluster. If the data is float valued, cosine distance is used as the metric. If the data is int8 valued, it is assumed to be a quantized embedding and a quantized version of cosine distance is used. If the data is uint8 valued, it is assumed to be a binary embedding and a bitwise Jaccard distance is used.

  • noise_level (float, default=0.5) – The noise level expected in the data. A value of 0.0 will try to cluster more data, at the expense of getting less accurate clustering. A value of 1.0 will try for accurate clusters, discarding more data as noise to do so.

  • base_min_cluster_size (int, default=5) – The minimum number of points in a cluster at the base layer of the clustering. This sets the finest-granularity clustering that will be returned; higher layers are coarser.

  • base_n_clusters (int, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that gives this many clusters for the bottom-most layer of clustering. This affects the base layer computation and allows multiple layers to be built on top of this base. The number of clusters cannot be guaranteed, so the target is approximate, but the algorithm can usually reach this exact number, assuming a reasonable clustering into base_n_clusters exists.

  • approx_n_clusters (int, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that gives this many clusters as the final output. Unlike base_n_clusters, when this parameter is set only a single clustering layer is returned; no hierarchical layers are produced. This is useful when you know the number of clusters you want and don’t need the multi-layer analysis. The number of clusters cannot be guaranteed, so the target is approximate, but the algorithm can usually reach this exact number, assuming a reasonable clustering into approx_n_clusters exists.

  • n_neighbors (int, default=15) – The number of neighbors to use in the nearest neighbor graph construction.

  • min_samples (int, default=5) – The minimum number of samples to use in the density estimation when performing density based clustering on the node embedding.

  • n_epochs (int, default=50) – The number of epochs to use when training the node embedding.

  • node_embedding_init (str or None, default='label_prop') – The method to use to initialize the node embedding. If None, no initialization will be used. If ‘label_prop’, the label propagation method will be used.

  • symmetrize_graph (bool, default=True) – Whether to symmetrize the nearest neighbor graph before using it to construct the node embedding.

  • return_duplicates (bool, default=False) – Whether to return a set of duplicate pairs of points in the data.

  • node_embedding_dim (int or None, default=None) – The number of dimensions to use in the node embedding. If None, a default value of min(max(n_neighbors // 4, 4), 15) will be used.

  • neighbor_scale (float, default=1.0) – The scale factor to use when constructing the nearest neighbor graph. This multiplies the effective number of neighbors used in graph construction (neighbor_scale * n_neighbors). Values > 1.0 create denser graphs with more connectivity, potentially capturing more global structure but at increased computational cost. Values < 1.0 create sparser graphs focused on local structure.

  • random_state (np.random.RandomState or None, default=None) – The random state to use for the random number generator. If None, the random number generator will not be seeded and will use the system time as the seed.

  • reproducible_flag (bool, default=True) – Whether to ensure reproducible results by using deterministic algorithms where possible. When True, the clustering results should be consistent across runs with the same random_state.

  • min_similarity_threshold (float, default=0.2) – The minimum similarity threshold for cluster layer selection. Peaks that result in clusterings with Jaccard similarity above this threshold will be filtered out to ensure diverse cluster layers.

  • max_layers (int, default=10) – The maximum number of cluster layers to return. The algorithm will select up to this many diverse peaks based on persistence and similarity criteria.

  • n_label_prop_iter (int, default=20) – The number of iterations to use in the label propagation algorithm when initializing the node embedding.
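Two of the rules stated in the parameter list can be written out directly. The sketch below is illustrative only: the helper names default_node_embedding_dim and describe_metric are hypothetical and are not part of the evoc API. They restate the documented default for node_embedding_dim and the documented mapping from input dtype to distance.

```python
def default_node_embedding_dim(n_neighbors):
    """Documented default used when node_embedding_dim is None."""
    return min(max(n_neighbors // 4, 4), 15)


def describe_metric(dtype_name):
    """Map a dtype name (e.g. data.dtype.name for a NumPy array) to the
    distance described for the data parameter. Hypothetical helper."""
    if dtype_name.startswith("float"):
        return "cosine"
    if dtype_name == "int8":
        return "quantized cosine"
    if dtype_name == "uint8":
        return "bitwise Jaccard"
    # The documentation does not say how other dtypes are handled.
    return "unspecified"


dim_for_default_neighbors = default_node_embedding_dim(15)   # 15 // 4 = 3, raised to 4
dim_for_many_neighbors = default_node_embedding_dim(100)     # 100 // 4 = 25, capped at 15
metric_for_float32 = describe_metric("float32")
metric_for_uint8 = describe_metric("uint8")
```

With the default n_neighbors=15 the formula yields a 4-dimensional node embedding; the dimension only grows beyond 4 once n_neighbors reaches 20.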

Returns:

  • cluster_layers (list of array-like of shape (n_samples,)) – The clustering of the data at each layer of the clustering. Each layer is a clustering of the data into a different number of clusters.

  • membership_strengths (list of array-like of shape (n_samples,)) – The membership strengths of each point in the clustering at each layer. This gives a measure of how strongly each point belongs to its assigned cluster.

  • nn_inds (array-like of shape (n_samples, n_neighbors)) – Indices of nearest neighbors for each sample.

  • nn_dists (array-like of shape (n_samples, n_neighbors)) – Distance from each sample to each nearest neighbor indexed by nn_inds.

  • duplicates (set of tuple of int) – Only returned if return_duplicates is True. A set of pairs of indices of potential duplicate points in the data.
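Because cluster_layers and membership_strengths are parallel lists with one entry per layer, they can be inspected together. The following is a minimal sketch, not part of evoc: summarize_layers is a hypothetical helper, the toy lists stand in for real output, and it assumes noise points carry the label -1, which this page does not state.

```python
from collections import Counter


def summarize_layers(cluster_layers, membership_strengths, noise_label=-1):
    """Return one summary dict per layer.

    Assumption: points treated as noise are labelled `noise_label` (-1 here).
    """
    summaries = []
    for labels, strengths in zip(cluster_layers, membership_strengths):
        labels = [int(label) for label in labels]
        counts = Counter(labels)
        n_noise = counts.pop(noise_label, 0)  # remaining keys are real clusters
        clustered = [
            float(s) for s, label in zip(strengths, labels) if label != noise_label
        ]
        summaries.append({
            "n_clusters": len(counts),
            "noise_fraction": n_noise / len(labels) if labels else 0.0,
            "mean_strength": sum(clustered) / len(clustered) if clustered else 0.0,
        })
    return summaries


# Toy stand-in for two layers over six points (not real evoc output).
toy_layers = [[0, 0, 1, 1, 2, -1], [0, 0, 0, 0, 1, -1]]
toy_strengths = [[1.0, 0.5, 1.0, 0.5, 1.0, 0.0], [1.0, 1.0, 0.5, 0.5, 1.0, 0.0]]
layer_summaries = summarize_layers(toy_layers, toy_strengths)
```

With the library installed, and assuming the values are returned in the order listed under Returns, the inputs would come from a call such as cluster_layers, membership_strengths, nn_inds, nn_dists = evoc.evoc_clusters(data). The helper iterates over plain sequences, so it should accept NumPy arrays as well as lists.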