User Guide
This end-user oriented guide covers EVoC’s features, parameters, and best practices for different use cases. To better understand the available parameters, it helps to begin with an overview of the algorithm and its key components.
Algorithm Overview
EVoC (Embedding Vector Oriented Clustering) combines two key techniques:
- Graph Embedding: Constructs a k-nearest neighbor graph and learns a lower-dimensional embedding (similar to UMAP)
- Density Clustering: Applies hierarchical density-based clustering to the embedding (similar to HDBSCAN and PLSCAN)
The advantage of EVoC is that both stages are optimized specifically for clustering high-dimensional embedding vectors, which improves both speed and cluster quality compared to general-purpose clustering algorithms. In other words, EVoC not only runs much faster than a combination of UMAP and HDBSCAN, but also produces better clusters.
Tailoring the combination of dimension reduction (manifold learning) and density clustering to embedding vectors provides several advantages:
- Efficient processing of dense, high-dimensional data
- Multiple clustering granularities through hierarchical layers
- Robust handling of noise and outliers
- Optimized distance metrics for different embedding types
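To make the first stage more concrete, here is a minimal brute-force sketch of building a k-nearest neighbor graph and then symmetrizing it. This is not EVoC's implementation, which this guide does not describe. The function names `knn_graph` and `symmetrize`, the toy points, and the use of Euclidean distance are all assumptions made purely for illustration.

```python
import math


def knn_graph(points, k):
    """Return {i: set of the k nearest neighbor indices of point i} (brute force)."""
    graph = {}
    for i, p in enumerate(points):
        # Distance from point i to every other point, sorted ascending.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = {j for _, j in dists[:k]}
    return graph


def symmetrize(graph):
    """Add the reverse of every edge, so i -> j implies j -> i."""
    sym = {i: set(nbrs) for i, nbrs in graph.items()}
    for i, nbrs in graph.items():
        for j in nbrs:
            sym[j].add(i)
    return sym


# Hypothetical toy data: four points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
directed = knn_graph(points, k=1)
undirected = symmetrize(directed)
```

In the directed graph each point has exactly one neighbor. After symmetrizing, point 1 is connected to both 0 and 2, because point 2 chose it as its nearest neighbor.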
Parameter Reference
With that core idea of a two-part algorithm in mind, let’s explore the key parameters that control EVoC’s behavior. The parameters can be broadly categorized into three groups:
Core Parameters
These are the main parameters that most users will want to adjust based on their specific dataset and clustering goals:
- base_min_cluster_size : int, default=5
Minimum number of points required to form a cluster at the base (finest) granularity level. Larger values produce fewer, more stable clusters.
- n_neighbors : int, default=15
Number of neighbors used in k-NN graph construction. More neighbors capture more global structure but increase computational cost.
- min_samples : int, default=5
Minimum samples for density estimation in the final clustering step. Should typically match or be smaller than base_min_cluster_size.
Clustering Control
These parameters control the clustering behavior and granularity:
- base_n_clusters : int, optional
Target number of clusters for the base layer. When specified, EVoC will search for the clustering granularity that produces approximately this many clusters, then build additional layers on top.
- approx_n_clusters : int, optional
Target number of clusters for the final output. When specified, EVoC returns only a single clustering layer (no hierarchy) with approximately this many clusters.
- max_layers : int, default=10
Maximum number of hierarchical clustering layers to generate. More layers provide finer control over clustering granularity but increase computation time.
- min_similarity_threshold : float, default=0.2
Minimum Jaccard similarity threshold for layer selection. Prevents nearly identical clustering layers in the hierarchy.
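For intuition about min_similarity_threshold: the Jaccard similarity of two clusters, viewed as sets of point indices, is the size of their intersection divided by the size of their union. The sketch below computes it for one cluster taken from each of two label sequences. The helper names and toy labels are hypothetical, and how EVoC aggregates such similarities when selecting layers is not specified in this guide.

```python
def cluster_members(labels, cluster_id):
    """Set of point indices assigned to `cluster_id`."""
    return {i for i, label in enumerate(labels) if label == cluster_id}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical labels for the same six points at two granularities (-1 is noise).
fine = [0, 0, 0, 1, 1, -1]
coarse = [0, 0, 0, 0, 0, -1]

similarity = jaccard(cluster_members(fine, 0), cluster_members(coarse, 0))
print(f"Jaccard similarity: {similarity:.2f}")
```

Here the fine cluster covers three of the five points in the coarse cluster, so the similarity is 3/5.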
Advanced Parameters
These parameters provide more fine-grained control over the algorithm and are typically only adjusted by advanced users:
- noise_level : float, default=0.5
Controls the noise threshold for cluster membership. Higher values produce more noise points and fewer clusters, while lower values produce more clusters and fewer noise points. In practice this only provides fine-tuning over the amount of noise, and is not as important as base_min_cluster_size and min_samples.
- node_embedding_dim : int, optional
Dimensionality of the intermediate node embedding. If None, defaults to min(max(n_neighbors // 4, 4), 15). Higher dimensions can capture more complex structure but increase computation.
- neighbor_scale : float, default=1.0
Scales the effective number of neighbors (neighbor_scale × n_neighbors). Values > 1.0 create denser graphs, values < 1.0 create sparser graphs focused on local structure.
- n_epochs : int, default=50
Number of optimization epochs for the node embedding. More epochs improve embedding quality but increase computation time.
- node_embedding_init : {‘label_prop’, None}, default=‘label_prop’
Initialization method for the node embedding. ‘label_prop’ uses label propagation for initialization, None uses random initialization.
- n_label_prop_iter : int, default=20
Number of label propagation iterations when using ‘label_prop’ initialization.
- symmetrize_graph : bool, default=True
Whether to make the k-NN graph symmetric. Recommended for most use cases.
- random_state : int, optional
Random seed for reproducible results. When specified, enables deterministic mode.
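As a worked example of the node_embedding_dim default, the snippet below evaluates the formula given above, min(max(n_neighbors // 4, 4), 15), for a few values of n_neighbors. The function name is hypothetical; only the formula comes from this guide.

```python
def default_node_embedding_dim(n_neighbors):
    """Default stated in this guide: min(max(n_neighbors // 4, 4), 15)."""
    return min(max(n_neighbors // 4, 4), 15)


for n_neighbors in (10, 15, 40, 100):
    dim = default_node_embedding_dim(n_neighbors)
    print(f"n_neighbors={n_neighbors:>3} -> node_embedding_dim={dim}")
```

The result is clamped to the range 4 to 15: it stays at 4 for the default n_neighbors=15, only rises above 4 once n_neighbors reaches 20, and is capped at 15 from n_neighbors=60 onward.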
Best Practices
As a general rule, EVoC is designed to be as parameter-free as possible. The default parameters should work well for a wide range of datasets and use cases, and most users will not need to adjust them. The best place to start is to run with the default parameters and then adjust based on the results. That said, here are some best practices for different scenarios:
Working with Hierarchical Output
EVoC provides multiple clustering layers with different granularities:
import numpy as np
from evoc import EVoC
# X is an array of embedding vectors with shape (n_samples, n_features)
clusterer = EVoC(max_layers=5)
clusterer.fit(X)
# Explore different granularities
for i, layer in enumerate(clusterer.cluster_layers_):
    n_clusters = len(np.unique(layer[layer >= 0]))  # labels >= 0 are clusters
    n_noise = np.sum(layer == -1)  # label -1 marks noise points
    persistence = clusterer.persistence_scores_[i]
    print(f"Layer {i}: {n_clusters} clusters, {n_noise} noise points, "
          f"persistence: {persistence:.3f}")
# Use cluster tree for hierarchical analysis
tree = clusterer.cluster_tree_
# ... analyze hierarchical structure ...
Layer 0 is always the most fine-grained layer, as determined by base_min_cluster_size or base_n_clusters.
Each subsequent layer provides a coarser clustering, with fewer clusters. In general the most fine-grained layers
will have the most noise points, and the coarser layers will have fewer noise points. The persistence score
provides a measure of how stable each layer is across different parameter settings, with higher scores indicating more robust clusters.
If you are interested in getting very fine-grained clusters it is worth setting base_min_cluster_size or base_n_clusters
explicitly to ensure you get clustering at that granularity. You can then inspect the other layers to see if the other natural
granularities align with your use case. If you are only interested in a single clustering, you can set approx_n_clusters
to get the layer that is closest to that number of clusters.
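If you already have all the layers and want to pick one after the fact, the selection can be sketched in plain Python as follows. The helper names, toy layers, and persistence values are hypothetical. The sketch only assumes that each layer is a sequence of integer labels with -1 for noise, as in the example above, and it makes no claim to match the search EVoC performs internally for approx_n_clusters.

```python
def count_clusters(labels):
    """Number of distinct non-noise labels (noise is labelled -1)."""
    return len({label for label in labels if label >= 0})


def closest_layer(layers, target_n_clusters):
    """Index of the layer whose cluster count is nearest the target."""
    return min(
        range(len(layers)),
        key=lambda i: abs(count_clusters(layers[i]) - target_n_clusters),
    )


def most_persistent_layer(persistence_scores):
    """Index of the layer with the highest persistence score."""
    return max(range(len(persistence_scores)), key=lambda i: persistence_scores[i])


# Hypothetical layers for eight points, finest first.
layers = [
    [0, 0, 1, 1, 2, 2, 3, -1],  # 4 clusters
    [0, 0, 0, 0, 1, 1, 1, -1],  # 2 clusters
    [0, 0, 0, 0, 0, 0, 0, -1],  # 1 cluster
]
persistence_scores = [0.2, 0.7, 0.4]  # illustrative values only

best_by_count = closest_layer(layers, target_n_clusters=2)
best_by_persistence = most_persistent_layer(persistence_scores)
```

With real output you would pass clusterer.cluster_layers_ and clusterer.persistence_scores_ in place of the toy data. On ties, both helpers return the lowest (finest) layer index.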
You can also make use of the tree structure to analyze how clusters evolve across layers, and to identify stable clusters that persist across multiple layers. Alternatively, you can use the tree structure to create a “mixed” resolution layer by selecting clusters at a given layer, and then also selecting any clusters in lower layers that are not children of any of your selected clusters. This allows you to get a more fine-grained clustering in some parts of the data, while keeping a coarser clustering in other parts of the data.
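The mixed-resolution idea can be sketched as follows. This guide does not document the structure of cluster_tree_, so the sketch assumes a hypothetical format: a dict mapping a (layer, cluster_id) tuple to a list of its child (layer, cluster_id) tuples. Adapt the lookups if the real structure differs. The function names and toy data are likewise illustrative.

```python
def descendants(tree, node):
    """All nodes reachable from `node` by following child links."""
    found = set()
    stack = list(tree.get(node, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.add(child)
            stack.extend(tree.get(child, []))
    return found


def mixed_resolution(layers, tree, top_layer):
    """Combine clusters at `top_layer` with uncovered clusters from finer layers."""
    selected, covered = [], set()
    # Walk from the chosen (coarser) layer down to layer 0 (finest).
    for layer_idx in range(top_layer, -1, -1):
        for cluster_id in sorted({c for c in layers[layer_idx] if c >= 0}):
            node = (layer_idx, cluster_id)
            if node not in covered:
                selected.append(node)
                covered |= descendants(tree, node)
    # Relabel: each selected cluster gets a fresh id; all other points are noise.
    labels = [-1] * len(layers[0])
    for new_id, (layer_idx, cluster_id) in enumerate(selected):
        for point, label in enumerate(layers[layer_idx]):
            if label == cluster_id and labels[point] == -1:
                labels[point] = new_id
    return labels, selected


# Hypothetical toy data: seven points, two layers (layer 0 is finest).
layers = [
    [0, 0, 1, 1, 2, 2, -1],
    [0, 0, 0, 0, -1, -1, -1],
]
tree = {(1, 0): [(0, 0), (0, 1)]}  # assumed parent -> children format

mixed_labels, selected = mixed_resolution(layers, tree, top_layer=1)
```

In this toy case the coarse cluster (1, 0) is kept whole, while the fine cluster (0, 2), which has no selected ancestor, is added alongside it.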
Performance Optimization
Depending on your needs, you may be willing to trade some accuracy for speed, or vice versa. The default EVoC parameters are designed primarily for exploratory clustering, and thus produce clusters very quickly. If you want a more robust, higher-quality clustering, it can be worth tweaking the parameters to spend more time on a better result. For example, for a medium-sized dataset (e.g. 10k-100k points) you can increase the number of epochs and neighbors to get a better embedding, which will lead to better clusters. In such cases you will also likely want to fix a random seed to ensure reproducibility, as the optimization process is stochastic.
clusterer = EVoC(
    n_epochs=150,     # More epochs for better embedding
    random_state=42,  # Fixed seed for reproducible results
)
For larger datasets, you may want to reduce the number of neighbors and epochs to get a faster result, at the cost of some cluster quality. In that case not setting a random seed can actually improve performance, as it allows the algorithm to skip some of the overhead of ensuring reproducibility.
clusterer = EVoC(
    n_neighbors=10,  # Balance between quality and speed
    n_epochs=30,     # Fewer epochs for faster embedding
    max_layers=3,    # Limit hierarchy depth
)
Troubleshooting
- Problem: Too many small clusters
Solution: Increase base_min_cluster_size or noise_level
- Problem: Most points classified as noise
Solution: Decrease noise_level or reduce min_samples
- Problem: Clustering too slow
Solution: Reduce n_neighbors, n_epochs, or max_layers
- Problem: Poor cluster quality
Solution: Increase n_neighbors or n_epochs, or try a different node_embedding_init setting
- Problem: Inconsistent results
Solution: Set random_state for reproducible results
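To decide which of the problems above applies, it can help to summarize a label array first. The following is a minimal sketch using only the standard library. The function name and toy labels are hypothetical, and the only assumption is that noise points are labelled -1 as in the examples earlier in this guide.

```python
from collections import Counter
from statistics import median


def summarize_labels(labels):
    """Cluster count, noise fraction, and median cluster size for one layer."""
    sizes = Counter(label for label in labels if label >= 0)
    n_points = len(labels)
    n_noise = sum(1 for label in labels if label == -1)
    return {
        "n_clusters": len(sizes),
        "noise_fraction": n_noise / n_points if n_points else 0.0,
        "median_cluster_size": median(sizes.values()) if sizes else 0,
    }


# Hypothetical labels for ten points.
labels = [0, 0, 0, 1, 1, -1, -1, 2, 2, 2]
summary = summarize_labels(labels)
print(summary)
```

A high noise_fraction points toward the "most points classified as noise" entry, while many clusters with a small median size points toward the "too many small clusters" entry. What counts as high or small depends on your data, so no thresholds are suggested here.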