API Reference

This section contains the complete API reference for EVoC.

Main Classes and Functions

EVoC

Embedding Vector Oriented Clustering for efficient clustering of high-dimensional embedding vectors such as CLIP-vectors, sentence-transformers output, etc.

evoc_clusters

Cluster data using the EVoC algorithm.

Core Clustering

class evoc.EVoC(noise_level: float = 0.5, base_min_cluster_size: int = 5, base_n_clusters: int | None = None, approx_n_clusters: int | None = None, n_neighbors: int = 15, min_samples: int = 5, n_epochs: int = 50, node_embedding_init: str | None = 'label_prop', symmetrize_graph: bool = True, node_embedding_dim: int | None = None, neighbor_scale: float = 1.0, random_state: int | None = None, min_similarity_threshold: float = 0.2, max_layers: int = 10, n_label_prop_iter=20)[source]

Bases: BaseEstimator, ClusterMixin

Embedding Vector Oriented Clustering for efficient clustering of high-dimensional embedding vectors such as CLIP-vectors, sentence-transformers output, etc. The clustering uses a combination of a node embedding of a nearest neighbour graph, related to UMAP, and a density based clustering approach related to HDBSCAN, improving upon those approaches in efficiency and quality for the specific case of high-dimensional embedding vectors.

Parameters:
  • noise_level (float, default=0.5) – The noise level expected in the data. A value of 0.0 will try to cluster more data, at the expense of getting less accurate clustering. A value of 1.0 will try for accurate clusters, discarding more data as noise to do so.

  • base_min_cluster_size (int, default=5) – The minimum number of points in a cluster at the base layer of the clustering. This gives the finest granularity clustering that will be returned, with less graularity at higher layers.

  • base_n_clusters (int or None, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that will give exactly this many clusters for the bottom-most layer of clustering. This affects the base layer computation and allows multiple layers to be built on top of this base. Since the actual number of clusters cannot be guaranteed this is only approximate, but usually the algorithm can manage to get this exact number, assuming a reasonable clustering into base_n_clusters exists.

  • approx_n_clusters (int, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that will give exactly this many clusters as the final output. Unlike base_n_clusters, when this parameter is set, only a single clustering layer will be returned – no hierarchical layers will be produced. This is useful when you know the exact number of clusters you want and don’t need the multi-layer analysis. Since the actual number of clusters cannot be guaranteed this is only approximate, but usually the algorithm can manage to get this exact number, assuming a reasonable clustering into approx_n_clusters exists.

  • n_neighbors (int, default=15) – The number of neighbors to use in the nearest neighbor graph construction.

  • min_samples (int, default=5) – The minimum number of samples to use in the density estimation when performing density based clustering on the node embedding.

  • n_epochs (int, default=50) – The number of epochs to use when training the node embedding.

  • node_embedding_init (str or None, default='label_prop') – The method to use to initialize the node embedding. If None, no initialization will be used. If ‘label_prop’, the label propagation method will be used.

  • symmetrize_graph (bool, default=True) – Whether to symmetrize the nearest neighbor graph before using it to construct the node embedding.

  • node_embedding_dim (int or None, default=None) – The number of dimensions to use in the node embedding. If None, a default value of min(max(n_neighbors // 4, 4), 15) will be used.

  • neighbor_scale (float, default=1.0) – The scale factor to use when constructing the nearest neighbor graph. This multiplies the effective number of neighbors used in graph construction (neighbor_scale * n_neighbors). Values > 1.0 create denser graphs with more connectivity, potentially capturing more global structure but at increased computational cost. Values < 1.0 create sparser graphs focused on local structure.

  • random_state (int or None, default=None) – The random seed to use for the random number generator. If None, the random number generator will not be seeded and will use the system time as the seed.

  • min_similarity_threshold (float, default=0.2) – The minimum similarity threshold for cluster layer selection. Peaks that result in clusterings with Jaccard similarity above this threshold will be filtered out to ensure diverse cluster layers.

  • max_layers (int, default=10) – The maximum number of cluster layers to return. The algorithm will select up to this many diverse peaks based on persistence and similarity criteria.

  • n_label_prop_iter (int, default=20) – The number of iterations to use in the label propagation algorithm when initializing the node embedding. This parameter controls how many steps the label propagation process takes to converge when node_embedding_init is set to ‘label_prop’.

labels_

An array of labels for the data samples; this is a integer array as per other scikit-learn clustering algorithms. A value of -1 indicates that a point is a noise point and not in any cluster.

Type:

array-like of shape (n_samples,)

membership_strengths_

An array of membership strengths for the data samples; this gives a measure of how strongly each point belongs to each cluster. This is a floating point array with values between 0 and 1.

Type:

array-like of shape (n_samples,)

cluster_layers_

The clustering of the data at each layer of the clustering. Each layer is a clustering of the data into a different number of clusters; the earlier the cluster vector is in this list the finer the granularity of clustering.

Type:

list of array-like of shape (n_samples,)

membership_strength_layers_

The membership strengths of each point in the clustering at each layer.

Type:

list of array-like of shape (n_samples,)

cluster_tree_

A dictionary representing the hierarchical clustering of the data. The keys are tuples of (layer, cluster) and the values are lists of tuples of (layer, cluster) representing the children of the key cluster.

Type:

dict

nn_inds_

Indices of nearest neighbors for each sample.

Type:

array-like of shape (n_samples, n_neighbors)

nn_dists_

Distance from each sample to each nearest neighbor (indexed by nn_inds).

Type:

array-like of shape (n_samples, n_neighbors)

duplicates_

A set of pairs of indices of potential duplicate points in the data.

Type:

set of tuple of int

__init__(noise_level: float = 0.5, base_min_cluster_size: int = 5, base_n_clusters: int | None = None, approx_n_clusters: int | None = None, n_neighbors: int = 15, min_samples: int = 5, n_epochs: int = 50, node_embedding_init: str | None = 'label_prop', symmetrize_graph: bool = True, node_embedding_dim: int | None = None, neighbor_scale: float = 1.0, random_state: int | None = None, min_similarity_threshold: float = 0.2, max_layers: int = 10, n_label_prop_iter=20) None[source]
fit_predict(X, y=None, **fit_params)[source]

Fit the model to the data and return the clustering labels.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data to cluster. If the data is float valued then it is assumed to use cosine distance as a matric. If the data is int8 valued then it is assumed that a quantized embedding is being used and a quantized version of cosine distance is used. If the data is uint8 valued then it is assumed that a binary embedding is being used, and a bitwise Jaccard distance is used.

  • y (array-like of shape (n_samples,), default=None) – Ignored. This parameter exists only for compatibility with scikit-learn’s fit_predict method.

  • **fit_params (dict) – Additional fit parameters. Currently unused, included for compatibility with scikit-learn’s fit_predict interface.

Returns:

labels_ – An array of labels for the data samples; this is a integer array as per other scikit-learn clustering algorithms. A value of -1 indicates that a point is a noise point and not in any cluster.

Return type:

array-like of shape (n_samples,)

fit(X, y=None, **fit_params)[source]

Fit the model to the data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data to cluster. If the data is float valued then it is assumed to use cosine distance as a matric. If the data is int8 valued then it is assumed that a quantized embedding is being used and a quantized version of cosine distance is used. If the data is uint8 valued then it is assumed that a binary embedding is being used, and a bitwise Jaccard distance is used.

  • y (array-like of shape (n_samples,), default=None) – Ignored. This parameter exists only for compatibility with scikit-learn’s fit method.

  • **fit_params (dict) – Additional fit parameters. Currently unused, included for compatibility with scikit-learn’s fit interface.

Returns:

self – Returns the instance itself.

Return type:

sklearn Estimator

property cluster_tree_

dict A dictionary representing the hierarchical clustering of the data.

The keys are tuples of (layer, cluster) and the values are lists of tuples of (layer, cluster) representing the children of the key cluster. This provides a tree structure showing how clusters at different layers relate to each other hierarchically.

Only available after fitting the model.

Returns:

Hierarchical tree structure with (layer, cluster) tuples as keys and lists of child (layer, cluster) tuples as values.

Return type:

dict

Raises:

NotFittedError – If the model has not been fitted yet.

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP-487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.

References

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

evoc.evoc_clusters(data, noise_level=0.5, base_min_cluster_size=5, base_n_clusters=None, approx_n_clusters=None, n_neighbors=15, min_samples=5, n_epochs=50, node_embedding_init='label_prop', symmetrize_graph=True, return_duplicates=False, node_embedding_dim=None, neighbor_scale=1.0, random_state=None, reproducible_flag=True, min_similarity_threshold=0.2, max_layers=10, n_label_prop_iter=20)[source]

Cluster data using the EVoC algorithm.

Parameters:
  • data (array-like of shape (n_samples, n_features)) – The data to cluster. If the data is float valued then it is assumed to use cosine distance as a matric. If the data is int8 valued then it is assumed that a quantized embedding is being used and a quantized version of cosine distance is used. If the data is uint8 valued then it is assumed that a binary embedding is being used, and a bitwise Jaccard distance is used.

  • noise_level (float, default=0.5) – The noise level expected in the data. A value of 0.0 will try to cluster more data, at the expense of getting less accurate clustering. A value of 1.0 will try for accurate clusters, discarding more data as noise to do so.

  • base_min_cluster_size (int, default=5) – The minimum number of points in a cluster at the base layer of the clustering. This gives the finest granularity clustering that will be returned, with less graularity at higher layers.

  • base_n_clusters (int, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that will give exactly this many clusters for the bottom-most layer of clustering. This affects the base layer computation and allows multiple layers to be built on top of this base. Since the actual number of clusters cannot be guaranteed this is only approximate, but usually the algorithm can manage to get this exact number, assuming a reasonable clustering into base_n_clusters exists.

  • approx_n_clusters (int, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that will give exactly this many clusters as the final output. Unlike base_n_clusters, when this parameter is set, only a single clustering layer will be returned – no hierarchical layers will be produced. This is useful when you know the exact number of clusters you want and don’t need the multi-layer analysis. Since the actual number of clusters cannot be guaranteed this is only approximate, but usually the algorithm can manage to get this exact number, assuming a reasonable clustering into approx_n_clusters exists.

  • n_neighbors (int, default=15) – The number of neighbors to use in the nearest neighbor graph construction.

  • min_samples (int, default=5) – The minimum number of samples to use in the density estimation when performing density based clustering on the node embedding.

  • n_epochs (int, default=50) – The number of epochs to use when training the node embedding.

  • node_embedding_init (str or None, default='label_prop') – The method to use to initialize the node embedding. If None, no initialization will be used. If ‘label_prop’, the label propagation method will be used.

  • symmetrize_graph (bool, default=True) – Whether to symmetrize the nearest neighbor graph before using it to construct the node embedding.

  • return_duplicates (bool, default=False) – Whether to return a set of duplicate pairs of points in the data.

  • node_embedding_dim (int or None, default=None) – The number of dimensions to use in the node embedding. If None, a default value of min(max(n_neighbors // 4, 4), 15) will be used.

  • neighbor_scale (float, default=1.0) – The scale factor to use when constructing the nearest neighbor graph. This multiplies the effective number of neighbors used in graph construction (neighbor_scale * n_neighbors). Values > 1.0 create denser graphs with more connectivity, potentially capturing more global structure but at increased computational cost. Values < 1.0 create sparser graphs focused on local structure.

  • random_state (np.random.RandomState or None, default=None) – The random state to use for the random number generator. If None, the random number generator will not be seeded and will use the system time as the seed.

  • reproducible_flag (bool, default=True) – Whether to ensure reproducible results by using deterministic algorithms where possible. When True, the clustering results should be consistent across runs with the same random_state.

  • min_similarity_threshold (float, default=0.2) – The minimum similarity threshold for cluster layer selection. Peaks that result in clusterings with Jaccard similarity above this threshold will be filtered out to ensure diverse cluster layers.

  • max_layers (int, default=10) – The maximum number of cluster layers to return. The algorithm will select up to this many diverse peaks based on persistence and similarity criteria.

  • n_label_prop_iter (int, default=20) – The number of iterations to use in the label propagation algorithm when initializing the node embedding.

Returns:

  • cluster_layers (list of array-like of shape (n_samples,)) – The clustering of the data at each layer of the clustering. Each layer is a clustering of the data into a different number of clusters.

  • membership_strengths (list of array-like of shape (n_samples,)) – The membership strengths of each point in the clustering at each layer. This gives a measure of how strongly each point belongs to each cluster.

  • nn_inds (array-like of shape (n_samples, n_neighbors)) – Indices of nearest neighbors for each sample.

  • nn_dists (array-like of shape (n_samples, n_neighbors)) – Distance from each sample to each nearest neighbor indexed by nn_inds

  • duplicates (set of tuple of int) – Only returned in return_duplicates is True. A set of pairs of indices of potential duplicate points in the data.

Utility Functions

find_peaks

binary_search_for_n_clusters

select_diverse_peaks

build_cluster_tree

find_duplicates

evoc.clustering_utilities.find_peaks(x)[source]
evoc.clustering_utilities.binary_search_for_n_clusters(data, approx_n_clusters, n_threads, *, min_samples=5)[source]
evoc.clustering_utilities.select_diverse_peaks(peaks, total_persistence, sizes, births, deaths, min_similarity_threshold=0.2, max_layers=10)[source]
evoc.clustering_utilities.build_cluster_tree(labels)[source]
evoc.clustering_utilities.find_duplicates(knn_inds, knn_dists)[source]

Tree Operations

mst_to_linkage_tree

condense_tree

extract_leaves

get_cluster_label_vector

get_point_membership_strength_vector

evoc.cluster_trees.mst_to_linkage_tree(sorted_mst)[source]
evoc.cluster_trees.condense_tree(hierarchy, min_cluster_size=10)[source]
evoc.cluster_trees.extract_leaves(condensed_tree, allow_single_cluster=True)[source]
evoc.cluster_trees.get_cluster_label_vector(tree, clusters, cluster_selection_epsilon, n_samples)[source]
evoc.cluster_trees.get_point_membership_strength_vector(tree, clusters, labels)[source]

Graph Construction

knn_graph

Construct a k-nearest neighbor graph using the NN-Descent algorithm.

evoc.knn_graph.knn_graph(data, n_neighbors=30, n_trees=None, leaf_size=None, random_state=None, max_candidates=None, max_rptree_depth=200, n_iters=None, delta=0.001, delta_improv=0.001, n_jobs=None, verbose=False, use_sorted_updates=True)[source]

Construct a k-nearest neighbor graph using the NN-Descent algorithm.

This function builds a k-nearest neighbor graph using random projection trees for initialization followed by the NN-Descent algorithm for refinement. It supports multiple data types (float32 for normalized embeddings, int8 for quantized embeddings, uint8 for binary embeddings) with appropriate distance metrics for each.

Parameters:
  • data (array-like of shape (n_samples, n_features)) – The data for which to compute nearest neighbors. If float32, cosine distance is used. If int8, quantized cosine distance is used. If uint8, Jaccard distance (based on Hamming distance for binary embeddings) is used.

  • n_neighbors (int, default=30) – The number of nearest neighbors to compute for each sample.

  • n_trees (int or None, default=None) – The number of random projection trees to build. If None, defaults to between 4 and 8 depending on the number of available threads.

  • leaf_size (int or None, default=None) – The maximum number of points per leaf in the random projection trees. If None, defaults to max(10, n_neighbors).

  • random_state (int, RandomState instance or None, default=None) – Controls the randomness of the algorithm. Pass an int for reproducible output across multiple function calls.

  • max_candidates (int or None, default=None) – The maximum number of candidate neighbors to evaluate during NN-Descent. If None, defaults to min(60, int(n_neighbors * 1.5)).

  • max_rptree_depth (int, default=200) – Maximum depth of the random projection trees.

  • n_iters (int or None, default=None) – Number of iterations for the NN-Descent algorithm. If None, defaults to max(5, int(round(log2(n_samples)))).

  • delta (float, default=0.001) – Convergence threshold for the NN-Descent algorithm.

  • delta_improv (float, default=0.001) – Improvement threshold for early stopping in NN-Descent.

  • n_jobs (int or None, default=None) – The number of threads to use. If -1, uses all available threads. If None, preserves the current numba thread setting.

  • verbose (bool, default=False) – If True, print progress messages during computation.

  • use_sorted_updates (bool, default=True) – If True, uses a more efficient sorted update strategy in NN-Descent.

Returns:

neighbor_graph – A tuple containing: - indices : array-like of shape (n_samples, n_neighbors)

The indices of the k-nearest neighbors for each sample.

  • distancesarray-like of shape (n_samples, n_neighbors)

    The distances from each sample to its k-nearest neighbors. Distances are transformed to a uniform scale based on the input dtype.

Return type:

tuple of (array, array)

neighbor_graph_matrix

Construct a sparse graph from k-nearest neighbor distances.

evoc.graph_construction.neighbor_graph_matrix(n_neighbors, knn_indices, knn_dists, symmetrize=True)[source]

Construct a sparse graph from k-nearest neighbor distances.

Converts k-nearest neighbor indices and distances into a weighted sparse graph matrix using Gaussian kernel weights. Optionally symmetrizes the graph to create an undirected graph.

Parameters:
  • n_neighbors (float) – The effective number of neighbors. Used in the kernel width (sigma) computation via the smooth_knn_dist function.

  • knn_indices (array-like of shape (n_samples, k)) – The indices of the k-nearest neighbors for each sample.

  • knn_dists (array-like of shape (n_samples, k)) – The distances from each sample to its k-nearest neighbors.

  • symmetrize (bool, default=True) – If True, the graph is symmetrized using the formula: A_sym = A + A^T - A * A^T (union of forward and reverse edges). If False, the graph remains directed (asymmetric).

Returns:

graph – A sparse matrix representing the weighted nearest neighbor graph. The (i, j) entry contains the Gaussian kernel weight from sample i to sample j, or 0 if j is not in the k-nearest neighbors of i. If symmetrize=True, the matrix is symmetric and in CSR format. If symmetrize=False, returns a CSR matrix (asymmetric).

Return type:

scipy.sparse._csr_matrix or scipy.sparse._coo_matrix

Node Embedding

node_embedding

Learn a low-dimensional embedding of a graph using a UMAP-like algorithm.

evoc.node_embedding.node_embedding(graph, n_components, n_epochs, initial_embedding=None, initial_alpha=0.5, negative_sample_rate=1.0, noise_level=0.5, random_state=None, reproducible_flag=True, verbose=False, tqdm_kwds={})[source]

Learn a low-dimensional embedding of a graph using a UMAP-like algorithm.

This function performs stochastic gradient descent optimization to learn a low-dimensional embedding of graph structure. It uses both positive (connected edges) and negative (random) samples to guide the optimization.

Parameters:
  • graph (scipy.sparse matrix, typically csr_matrix or csc_matrix) – A sparse adjacency matrix representing the graph. The weights in the matrix represent connection strengths between nodes.

  • n_components (int) – The number of dimensions in the output embedding.

  • n_epochs (int) – The number of epochs to train the embedding.

  • initial_embedding (array-like of shape (n_vertices, n_components) or None, default=None) – An initial embedding to use as a starting point. If None, a random embedding is generated from a normal distribution with scale 0.25.

  • initial_alpha (float, default=0.5) – The initial learning rate. The learning rate decays linearly over epochs.

  • negative_sample_rate (float, default=1.0) – The rate at which negative samples are drawn relative to positive samples. Controls the ratio of negative to positive updates per epoch.

  • noise_level (float, default=0.5) – Controls the strength of noise in the gradient computation. Higher values increase the tolerance for larger distances before penalizing in the embedding space.

  • random_state (RandomState instance or None, default=None) – Random state for reproducibility. If None, uses system randomness.

  • reproducible_flag (bool, default=True) – If True, uses a deterministic (but slower) update strategy that processes nodes in blocks for reproducibility. If False, uses a faster stochastic approach.

  • verbose (bool, default=False) – If True, display a progress bar during training.

  • tqdm_kwds (dict, default={}) – Additional keyword arguments to pass to tqdm for progress bar customization.

Returns:

embedding – The learned low-dimensional embedding of the graph vertices.

Return type:

array-like of shape (n_vertices, n_components)

label_propagation_init

Initialize a node embedding using label propagation on a sparse graph.

evoc.label_propagation.label_propagation_init(graph, n_label_prop_iter=20, n_embedding_epochs=50, approx_n_parts=512, n_components=2, scaling=0.1, random_scale=1.0, noise_level=0.5, random_state=None, data=None, recursive_init=True, base_init='pca', base_init_threshold=64, upscaling='partition_expander')[source]

Initialize a node embedding using label propagation on a sparse graph.

This function provides a high-quality initialization for node embeddings by combining graph-based label propagation with hierarchical partitioning. For large graphs, it recursively partitions the data and upscales the results. For small graphs, it uses direct methods (PCA, spectral embedding, or random).

Parameters:
  • graph (scipy.sparse matrix) – A sparse adjacency or weighted graph matrix representing connectivity.

  • n_label_prop_iter (int, default=20) – Number of label propagation iterations to perform on the graph.

  • n_embedding_epochs (int, default=50) – Number of epochs when using node embedding for upscaling.

  • approx_n_parts (int, default=512) – Approximate number of partitions to create for recursive partitioning of large graphs. Useful for controlling memory and computation.

  • n_components (int, default=2) – The number of dimensions in the output embedding.

  • scaling (float, default=0.1) – Scaling factor applied to label propagation distances.

  • random_scale (float, default=1.0) – Scaling factor for random noise in the initialization.

  • noise_level (float, default=0.5) – The noise level parameter passed to node embedding algorithms.

  • random_state (RandomState instance or None, default=None) – Controls the randomness of the algorithm. If None, uses system randomness.

  • data (array-like of shape (n_samples, n_features) or None, default=None) – The original data array. Required if base_init=’pca’. Used for direct initialization methods on small graphs.

  • recursive_init (bool, default=True) – If True, uses recursive partitioning for large graphs. If False, applies the base initialization method directly.

  • base_init ({'pca', 'random', 'spectral', 'mds'}, default='pca') – The initialization method to use for small graphs (when graph size is below base_init_threshold). ‘pca’ requires the data parameter.

  • base_init_threshold (int, default=64) – The size threshold below which the base_init method is used directly. Graphs larger than this use recursive partitioning.

  • upscaling ({'partition_expander', 'node_embedding'}, default='partition_expander') – The method to use when upscaling partitions back to the full graph. ‘partition_expander’ uses a fast expansion method, ‘node_embedding’ uses full node embedding (slower but potentially better quality).

Returns:

embedding – The initialized node embedding based on label propagation and graph structure.

Return type:

array-like of shape (n_vertices, n_components)

Algorithm Components

parallel_boruvka

evoc.boruvka.parallel_boruvka(tree, n_threads, min_samples=10, reproducible=False)[source]

build_kdtree

evoc.numba_kdtree.build_kdtree(data, leaf_size=40)[source]