evoc.clustering

evoc.clustering.build_cluster_layers(data, *, min_samples=5, base_min_cluster_size=10, base_n_clusters=None, reproducible_flag=False, min_similarity_threshold=0.2, max_layers=10)

Build hierarchical cluster layers from embedding data.

Parameters:

data (array-like of shape (n_samples, n_features)) – The embedding data to cluster. Typically the output of a node embedding algorithm.
min_samples (int, default=5) – The minimum number of samples to use in the density estimation when performing density based clustering.
base_min_cluster_size (int, default=10) – The minimum number of points in a cluster at the base layer of the clustering. This gives the finest granularity clustering that will be returned.
base_n_clusters (int or None, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that will give exactly this many clusters for the bottom-most layer of clustering. This affects the base layer computation and allows multiple layers to be built on top of this base.
reproducible_flag (bool, default=False) – Whether to ensure reproducible results by using deterministic algorithms where possible.
min_similarity_threshold (float, default=0.2) – The minimum similarity threshold for cluster layer selection. Peaks that result in clusterings with Jaccard similarity above this threshold will be filtered out to ensure diverse cluster layers.
max_layers (int, default=10) – The maximum number of cluster layers to return. The algorithm will select up to this many diverse peaks based on persistence and similarity criteria.

Returns:

cluster_layers (list of array-like of shape (n_samples,)) – The clustering of the data at each layer of the clustering. Each layer is a clustering of the data into a different number of clusters.
membership_strength_layers (list of array-like of shape (n_samples,)) – The membership strengths of each point in the clustering at each layer. This gives a measure of how strongly each point belongs to its assigned cluster.
persistence_scores (list of float) – The persistence scores for each cluster layer, indicating the quality or stability of the clustering at that layer.
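The way min_similarity_threshold and max_layers interact can be pictured as a greedy filter over candidate clusterings. The sketch below is only an illustration: pair_jaccard and select_diverse_layers are hypothetical helpers, and comparing two clusterings by the Jaccard index of their co-clustered point pairs is an assumption, not necessarily the similarity evoc computes internally.

```python
from itertools import combinations

import numpy as np


def pair_jaccard(labels_a, labels_b):
    """Jaccard index of the sets of co-clustered point pairs.

    Noise points (label -1) never form a pair. This is an assumed similarity
    measure for illustration only.
    """
    def pairs(labels):
        return {
            (i, j)
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j] and labels[i] != -1
        }

    a, b = pairs(labels_a), pairs(labels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_diverse_layers(candidates, min_similarity_threshold=0.2, max_layers=10):
    """Greedily keep candidates in order of decreasing persistence.

    candidates is a list of (persistence, labels) tuples. A candidate is
    dropped when its similarity to any already kept layer is above the
    threshold.
    """
    selected = []
    for persistence, labels in sorted(candidates, key=lambda c: c[0], reverse=True):
        if len(selected) == max_layers:
            break
        if all(
            pair_jaccard(labels, kept) <= min_similarity_threshold
            for _, kept in selected
        ):
            selected.append((persistence, labels))
    return selected


# Hand-made candidate clusterings of eight points.
fine = np.array([0, 0, 1, 1, 2, 2, 3, 3])
near_fine = np.array([0, 0, 1, 1, 2, 2, 3, -1])  # almost identical to `fine`
coarse = np.array([0, 0, 0, 0, 1, 1, 1, 1])

candidates = [(0.9, fine), (0.8, near_fine), (0.5, coarse)]
selected = select_diverse_layers(candidates, min_similarity_threshold=0.5)
```

In this toy run the near-duplicate of the finest layer is skipped because it is too similar to a layer that was already kept, while the coarser layer survives.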

evoc.clustering.evoc_clusters(data, noise_level=0.5, base_min_cluster_size=5, base_n_clusters=None, approx_n_clusters=None, n_neighbors=15, min_samples=5, n_epochs=50, node_embedding_init='label_prop', symmetrize_graph=True, return_duplicates=False, node_embedding_dim=None, neighbor_scale=1.0, random_state=None, reproducible_flag=True, min_similarity_threshold=0.2, max_layers=10, n_label_prop_iter=20)

Cluster data using the EVoC algorithm.

Parameters:

data (array-like of shape (n_samples, n_features)) – The data to cluster. If the data is float valued, cosine distance is used as the metric. If the data is int8 valued, it is assumed to be a quantized embedding and a quantized version of cosine distance is used. If the data is uint8 valued, it is assumed to be a binary embedding and a bitwise Jaccard distance is used.
noise_level (float, default=0.5) – The noise level expected in the data. A value of 0.0 will try to cluster more data, at the expense of getting less accurate clustering. A value of 1.0 will try for accurate clusters, discarding more data as noise to do so.
base_min_cluster_size (int, default=5) – The minimum number of points in a cluster at the base layer of the clustering. This gives the finest granularity clustering that will be returned, with less granularity at higher layers.
base_n_clusters (int or None, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that gives this many clusters for the bottom-most layer of clustering. This affects the base layer computation and allows multiple layers to be built on top of this base. The number of clusters cannot be guaranteed, so the target is only approximate, but the algorithm usually achieves this exact number, assuming a reasonable clustering into base_n_clusters exists.
approx_n_clusters (int or None, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that gives this many clusters as the final output. Unlike base_n_clusters, when this parameter is set, only a single clustering layer will be returned – no hierarchical layers will be produced. This is useful when you know the number of clusters you want and don’t need the multi-layer analysis. The number of clusters cannot be guaranteed, so the target is only approximate, but the algorithm usually achieves this exact number, assuming a reasonable clustering into approx_n_clusters exists.
n_neighbors (int, default=15) – The number of neighbors to use in the nearest neighbor graph construction.
min_samples (int, default=5) – The minimum number of samples to use in the density estimation when performing density based clustering on the node embedding.
n_epochs (int, default=50) – The number of epochs to use when training the node embedding.
node_embedding_init (str or None, default='label_prop') – The method to use to initialize the node embedding. If None, no initialization will be used. If ‘label_prop’, the label propagation method will be used.
symmetrize_graph (bool, default=True) – Whether to symmetrize the nearest neighbor graph before using it to construct the node embedding.
return_duplicates (bool, default=False) – Whether to return a set of duplicate pairs of points in the data.
node_embedding_dim (int or None, default=None) – The number of dimensions to use in the node embedding. If None, a default value of min(max(n_neighbors // 4, 4), 15) will be used.
neighbor_scale (float, default=1.0) – The scale factor to use when constructing the nearest neighbor graph. This multiplies the effective number of neighbors used in graph construction (neighbor_scale * n_neighbors). Values > 1.0 create denser graphs with more connectivity, potentially capturing more global structure but at increased computational cost. Values < 1.0 create sparser graphs focused on local structure.
random_state (np.random.RandomState or None, default=None) – The random state to use for the random number generator. If None, the random number generator will not be seeded and will use the system time as the seed.
reproducible_flag (bool, default=True) – Whether to ensure reproducible results by using deterministic algorithms where possible. When True, the clustering results should be consistent across runs with the same random_state.
min_similarity_threshold (float, default=0.2) – The minimum similarity threshold for cluster layer selection. Peaks that result in clusterings with Jaccard similarity above this threshold will be filtered out to ensure diverse cluster layers.
max_layers (int, default=10) – The maximum number of cluster layers to return. The algorithm will select up to this many diverse peaks based on persistence and similarity criteria.
n_label_prop_iter (int, default=20) – The number of iterations to use in the label propagation algorithm when initializing the node embedding.
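The default for node_embedding_dim depends on n_neighbors through the formula quoted above. A minimal sketch of that formula follows; default_node_embedding_dim is a hypothetical helper name, not part of evoc.

```python
def default_node_embedding_dim(n_neighbors):
    """Value described for node_embedding_dim=None: min(max(n_neighbors // 4, 4), 15)."""
    return min(max(n_neighbors // 4, 4), 15)


# The result is clamped to the range [4, 15] regardless of n_neighbors.
dims = {n: default_node_embedding_dim(n) for n in (8, 15, 40, 100)}
```

With the default n_neighbors=15 the formula yields the lower bound of 4, so the embedding dimension only starts to grow once n_neighbors reaches 20.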

Returns:

cluster_layers (list of array-like of shape (n_samples,)) – The clustering of the data at each layer of the clustering. Each layer is a clustering of the data into a different number of clusters.
membership_strengths (list of array-like of shape (n_samples,)) – The membership strengths of each point in the clustering at each layer. This gives a measure of how strongly each point belongs to its assigned cluster.
nn_inds (array-like of shape (n_samples, n_neighbors)) – Indices of nearest neighbors for each sample.
nn_dists (array-like of shape (n_samples, n_neighbors)) – Distance from each sample to each nearest neighbor indexed by nn_inds.
duplicates (set of tuple of int) – Only returned if return_duplicates is True. A set of pairs of indices of potential duplicate points in the data.
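Because duplicates is only present when return_duplicates is True, calling code has to handle two result lengths. The sketch below assumes the function returns a tuple in the order documented above; unpack_evoc_result is a hypothetical helper, and the stand-in tuples are placeholders rather than real evoc output.

```python
def unpack_evoc_result(result, return_duplicates=False):
    """Name the entries of an evoc_clusters-style result tuple.

    Assumes the tuple follows the documented order of the Returns list.
    """
    if return_duplicates:
        cluster_layers, membership_strengths, nn_inds, nn_dists, duplicates = result
    else:
        cluster_layers, membership_strengths, nn_inds, nn_dists = result
        duplicates = set()  # nothing was requested, so report an empty set
    return {
        "cluster_layers": cluster_layers,
        "membership_strengths": membership_strengths,
        "nn_inds": nn_inds,
        "nn_dists": nn_dists,
        "duplicates": duplicates,
    }


# Stand-in tuples with the documented layout (placeholder values only).
without_dups = unpack_evoc_result(([[0, 1]], [[1.0, 1.0]], [[1], [0]], [[0.1], [0.1]]))
with_dups = unpack_evoc_result(
    ([[0, 0]], [[1.0, 1.0]], [[1], [0]], [[0.0], [0.0]], {(0, 1)}),
    return_duplicates=True,
)
```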

class evoc.clustering.EVoC(noise_level: float = 0.5, base_min_cluster_size: int = 5, base_n_clusters: int | None = None, approx_n_clusters: int | None = None, n_neighbors: int = 15, min_samples: int = 5, n_epochs: int = 50, node_embedding_init: str | None = 'label_prop', symmetrize_graph: bool = True, node_embedding_dim: int | None = None, neighbor_scale: float = 1.0, random_state: int | None = None, min_similarity_threshold: float = 0.2, max_layers: int = 10, n_label_prop_iter=20)

Bases: BaseEstimator, ClusterMixin

Embedding Vector Oriented Clustering for efficient clustering of high-dimensional embedding vectors such as CLIP-vectors, sentence-transformers output, etc. The clustering uses a combination of a node embedding of a nearest neighbour graph, related to UMAP, and a density based clustering approach related to HDBSCAN, improving upon those approaches in efficiency and quality for the specific case of high-dimensional embedding vectors.

Parameters:

noise_level (float, default=0.5) – The noise level expected in the data. A value of 0.0 will try to cluster more data, at the expense of getting less accurate clustering. A value of 1.0 will try for accurate clusters, discarding more data as noise to do so.
base_min_cluster_size (int, default=5) – The minimum number of points in a cluster at the base layer of the clustering. This gives the finest granularity clustering that will be returned, with less granularity at higher layers.
base_n_clusters (int or None, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that gives this many clusters for the bottom-most layer of clustering. This affects the base layer computation and allows multiple layers to be built on top of this base. The number of clusters cannot be guaranteed, so the target is only approximate, but the algorithm usually achieves this exact number, assuming a reasonable clustering into base_n_clusters exists.
approx_n_clusters (int or None, default=None) – If not None, the algorithm will attempt to find the granularity of clustering that gives this many clusters as the final output. Unlike base_n_clusters, when this parameter is set, only a single clustering layer will be returned – no hierarchical layers will be produced. This is useful when you know the number of clusters you want and don’t need the multi-layer analysis. The number of clusters cannot be guaranteed, so the target is only approximate, but the algorithm usually achieves this exact number, assuming a reasonable clustering into approx_n_clusters exists.
n_neighbors (int, default=15) – The number of neighbors to use in the nearest neighbor graph construction.
min_samples (int, default=5) – The minimum number of samples to use in the density estimation when performing density based clustering on the node embedding.
n_epochs (int, default=50) – The number of epochs to use when training the node embedding.
node_embedding_init (str or None, default='label_prop') – The method to use to initialize the node embedding. If None, no initialization will be used. If ‘label_prop’, the label propagation method will be used.
symmetrize_graph (bool, default=True) – Whether to symmetrize the nearest neighbor graph before using it to construct the node embedding.
node_embedding_dim (int or None, default=None) – The number of dimensions to use in the node embedding. If None, a default value of min(max(n_neighbors // 4, 4), 15) will be used.
neighbor_scale (float, default=1.0) – The scale factor to use when constructing the nearest neighbor graph. This multiplies the effective number of neighbors used in graph construction (neighbor_scale * n_neighbors). Values > 1.0 create denser graphs with more connectivity, potentially capturing more global structure but at increased computational cost. Values < 1.0 create sparser graphs focused on local structure.
random_state (int or None, default=None) – The random seed to use for the random number generator. If None, the random number generator will not be seeded and will use the system time as the seed.
min_similarity_threshold (float, default=0.2) – The minimum similarity threshold for cluster layer selection. Peaks that result in clusterings with Jaccard similarity above this threshold will be filtered out to ensure diverse cluster layers.
max_layers (int, default=10) – The maximum number of cluster layers to return. The algorithm will select up to this many diverse peaks based on persistence and similarity criteria.
n_label_prop_iter (int, default=20) – The number of iterations to use in the label propagation algorithm when initializing the node embedding. This parameter controls how many steps the label propagation process takes to converge when node_embedding_init is set to ‘label_prop’.

labels_

An array of labels for the data samples; this is an integer array as per other scikit-learn clustering algorithms. A value of -1 indicates that a point is a noise point and not in any cluster.

Type: array-like of shape (n_samples,)

membership_strengths_

An array of membership strengths for the data samples; this gives a measure of how strongly each point belongs to its assigned cluster. This is a floating point array with values between 0 and 1.

Type: array-like of shape (n_samples,)
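Since labels_ marks noise with -1 and membership_strengths_ holds one value in [0, 1] per point, the two arrays can be combined with plain NumPy. The following sketch uses hand-written arrays in place of a fitted model; summarize_labels and keep_confident are hypothetical helpers, and the 0.5 cut-off is an arbitrary choice for illustration.

```python
import numpy as np

# Stand-ins for clusterer.labels_ and clusterer.membership_strengths_.
labels = np.array([0, 0, 1, -1, 1, 0, -1, 2, 2])
strengths = np.array([0.9, 0.4, 0.8, 0.0, 0.95, 0.7, 0.0, 0.6, 0.3])


def summarize_labels(labels):
    """Return ({cluster id: size}, fraction of points labelled as noise)."""
    clustered = labels[labels != -1]
    ids, counts = np.unique(clustered, return_counts=True)
    sizes = dict(zip(ids.tolist(), counts.tolist()))
    noise_fraction = float(np.mean(labels == -1))
    return sizes, noise_fraction


def keep_confident(labels, strengths, min_strength=0.5):
    """Relabel weakly attached points as noise (-1)."""
    return np.where(strengths >= min_strength, labels, -1)


sizes, noise_fraction = summarize_labels(labels)
confident_labels = keep_confident(labels, strengths)
```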

cluster_layers_

The clustering of the data at each layer of the clustering. Each layer is a clustering of the data into a different number of clusters; the earlier the cluster vector is in this list the finer the granularity of clustering.

Type: list of array-like of shape (n_samples,)

membership_strength_layers_

The membership strengths of each point in the clustering at each layer.

Type: list of array-like of shape (n_samples,)
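Because cluster_layers_ is ordered from finest to coarsest, a common task is to pick the layer whose cluster count is closest to some target. The sketch below works on hand-written layers rather than a fitted model; n_clusters and closest_layer are hypothetical helpers.

```python
import numpy as np

# Stand-in for clusterer.cluster_layers_: finest layer first.
cluster_layers = [
    np.array([0, 0, 1, 1, 2, 2, 3, -1]),
    np.array([0, 0, 0, 0, 1, 1, 1, -1]),
]


def n_clusters(labels):
    """Number of distinct clusters, not counting the noise label -1."""
    return len(set(labels.tolist()) - {-1})


def closest_layer(cluster_layers, target):
    """Index of the layer whose cluster count is nearest to target (first on ties)."""
    return min(
        range(len(cluster_layers)),
        key=lambda i: abs(n_clusters(cluster_layers[i]) - target),
    )


counts = [n_clusters(layer) for layer in cluster_layers]
best = closest_layer(cluster_layers, target=2)
```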

cluster_tree_

A dictionary representing the hierarchical clustering of the data. The keys are tuples of (layer, cluster) and the values are lists of tuples of (layer, cluster) representing the children of the key cluster.

Type: dict

nn_inds_

Indices of nearest neighbors for each sample.

Type: array-like of shape (n_samples, n_neighbors)

nn_dists_

Distance from each sample to each nearest neighbor (indexed by nn_inds).

Type: array-like of shape (n_samples, n_neighbors)

duplicates_

A set of pairs of indices of potential duplicate points in the data.

Type: set of tuple of int

__init__(noise_level: float = 0.5, base_min_cluster_size: int = 5, base_n_clusters: int | None = None, approx_n_clusters: int | None = None, n_neighbors: int = 15, min_samples: int = 5, n_epochs: int = 50, node_embedding_init: str | None = 'label_prop', symmetrize_graph: bool = True, node_embedding_dim: int | None = None, neighbor_scale: float = 1.0, random_state: int | None = None, min_similarity_threshold: float = 0.2, max_layers: int = 10, n_label_prop_iter=20) → None

fit_predict(X, y=None, **fit_params)

Fit the model to the data and return the clustering labels.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to cluster. If the data is float valued, cosine distance is used as the metric. If the data is int8 valued, it is assumed to be a quantized embedding and a quantized version of cosine distance is used. If the data is uint8 valued, it is assumed to be a binary embedding and a bitwise Jaccard distance is used.
y (array-like of shape (n_samples,), default=None) – Ignored. This parameter exists only for compatibility with scikit-learn’s fit_predict method.
**fit_params (dict) – Additional fit parameters. Currently unused, included for compatibility with scikit-learn’s fit_predict interface.

Returns:

labels_ – An array of labels for the data samples; this is an integer array as per other scikit-learn clustering algorithms. A value of -1 indicates that a point is a noise point and not in any cluster.

Return type:

array-like of shape (n_samples,)
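The dtype of X selects the distance, so the input array has to be prepared accordingly. The sketch below shows one way to produce the three input kinds with NumPy. The per-row int8 scaling and the sign-based binarization are illustrative assumptions, as is the idea that the uint8 form is a bit-packed array; none of these are prescribed by this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Float input: treated with cosine distance.
float_data = rng.normal(size=(100, 64)).astype(np.float32)

# int8 input: an assumed quantization that scales each row into [-127, 127].
row_scale = np.abs(float_data).max(axis=1, keepdims=True)
int8_data = np.round(float_data / row_scale * 127).astype(np.int8)

# uint8 input: an assumed binarization that keeps the sign of each feature
# and packs eight features into one byte.
binary_data = np.packbits(float_data > 0, axis=1)
```

Packing reduces the 64 float features to 8 bytes per row, which is where the storage saving of a binary embedding comes from.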

fit(X, y=None, **fit_params)

Fit the model to the data.

Parameters:

X (array-like of shape (n_samples, n_features)) – The data to cluster. If the data is float valued, cosine distance is used as the metric. If the data is int8 valued, it is assumed to be a quantized embedding and a quantized version of cosine distance is used. If the data is uint8 valued, it is assumed to be a binary embedding and a bitwise Jaccard distance is used.
y (array-like of shape (n_samples,), default=None) – Ignored. This parameter exists only for compatibility with scikit-learn’s fit method.
**fit_params (dict) – Additional fit parameters. Currently unused, included for compatibility with scikit-learn’s fit interface.

Returns:

self – Returns the instance itself.

Return type:

sklearn Estimator
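Since fit returns the estimator itself, fitting and reading the fitted attributes can be chained. The sketch below wraps that pattern in a hypothetical helper, cluster_embeddings, which imports EVoC lazily so the snippet still loads where evoc is not installed. The toy data generator, make_toy_embeddings, is also an assumption made for illustration, and the final call is left commented out because it requires evoc.

```python
import numpy as np


def cluster_embeddings(data, **params):
    """Fit EVoC on data and return (labels_, membership_strengths_, cluster_layers_)."""
    # Imported here so that defining this function does not require evoc.
    from evoc.clustering import EVoC

    clusterer = EVoC(**params).fit(data)  # fit returns the instance itself
    return (
        clusterer.labels_,
        clusterer.membership_strengths_,
        clusterer.cluster_layers_,
    )


def make_toy_embeddings(n_per_blob=200, n_features=32, seed=0):
    """Three tight Gaussian blobs as a float32 stand-in for embedding vectors."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(3, n_features))
    blobs = [
        center + 0.05 * rng.normal(size=(n_per_blob, n_features))
        for center in centers
    ]
    return np.vstack(blobs).astype(np.float32)


data = make_toy_embeddings()

# Requires evoc to be installed:
# labels, strengths, layers = cluster_embeddings(data, n_neighbors=15, random_state=42)
```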

property cluster_tree_

A dictionary representing the hierarchical clustering of the data.

The keys are tuples of (layer, cluster) and the values are lists of tuples of (layer, cluster) representing the children of the key cluster. This provides a tree structure showing how clusters at different layers relate to each other hierarchically.

Only available after fitting the model.

Returns: Hierarchical tree structure with (layer, cluster) tuples as keys and lists of child (layer, cluster) tuples as values.
Return type: dict
Raises: NotFittedError – If the model has not been fitted yet.
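The (layer, cluster) key structure described above lends itself to a simple recursive walk. The sketch below uses a hand-written dictionary in the documented shape rather than output from a fitted model; leaf_clusters is a hypothetical helper, and it treats any node that is missing from the dictionary or has an empty child list as a leaf, since this page does not say which form leaves take.

```python
def leaf_clusters(tree, node):
    """Collect the childless (layer, cluster) nodes underneath node."""
    children = tree.get(node, [])
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaf_clusters(tree, child))
    return leaves


# Hand-written example: one top cluster splitting into two, then into three.
tree = {
    (2, 0): [(1, 0), (1, 1)],
    (1, 0): [(0, 0), (0, 1)],
    (1, 1): [(0, 2)],
}

all_leaves = leaf_clusters(tree, (2, 0))
right_branch = leaf_clusters(tree, (1, 1))
```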