User Guide
==========

This end-user oriented guide covers EVoC's features, parameters, and best practices for different use cases. To better understand the available parameters, it helps to begin with an overview of the algorithm and its key components.

Algorithm Overview
------------------

EVoC (Embedding Vector Oriented Clustering) combines two key techniques:

1. **Graph Embedding**: Constructs a k-nearest neighbor graph and learns a lower-dimensional embedding (similar to UMAP)
2. **Density Clustering**: Applies hierarchical density-based clustering to the embedding (similar to HDBSCAN and PLSCAN)

The advantage of EVoC is that it can optimize every stage of this pipeline for the specific task of clustering high-dimensional embedding vectors, providing both improved **performance** and **quality** compared to general-purpose clustering algorithms. That is to say, EVoC not only runs much faster than a combination of UMAP and HDBSCAN, but also produces better clusters as a result.

Tailoring the combination of dimension reduction/manifold learning and density clustering to embedding vectors provides several advantages:

* Efficient processing of dense, high-dimensional data
* Multiple clustering granularities through hierarchical layers
* Robust handling of noise and outliers
* Optimized distance metrics for different embedding types

Parameter Reference
-------------------

With that core idea -- a two-part algorithm -- in mind, let's explore the key parameters that control EVoC's behavior. The parameters can be broadly categorized into three groups:

Core Parameters
~~~~~~~~~~~~~~~

These are the main parameters that most users will want to adjust based on their specific dataset and clustering goals:

**base_min_cluster_size** : int, default=5
    Minimum number of points required to form a cluster at the base (finest) granularity level. Larger values produce fewer, more stable clusters.
**n_neighbors** : int, default=15
    Number of neighbors used in k-NN graph construction. More neighbors capture more global structure but increase computational cost.

**min_samples** : int, default=5
    Minimum samples for density estimation in the final clustering step. Should typically match or be smaller than ``base_min_cluster_size``.

Clustering Control
~~~~~~~~~~~~~~~~~~

These parameters control the clustering behavior and granularity:

**base_n_clusters** : int, optional
    Target number of clusters for the base layer. When specified, EVoC will search for the clustering granularity that produces approximately this many clusters, then build additional layers on top.

**approx_n_clusters** : int, optional
    Target number of clusters for the final output. When specified, EVoC returns only a single clustering layer (no hierarchy) with approximately this many clusters.

**max_layers** : int, default=10
    Maximum number of hierarchical clustering layers to generate. More layers provide finer control over clustering granularity but increase computation time.

**min_similarity_threshold** : float, default=0.2
    Minimum Jaccard similarity threshold for layer selection. Prevents nearly identical clustering layers in the hierarchy.

Advanced Parameters
~~~~~~~~~~~~~~~~~~~

These parameters provide more fine-grained control over the algorithm and are typically only adjusted by advanced users:

**noise_level** : float, default=0.5
    Controls the noise threshold for cluster membership. Higher values produce more noise points and fewer clusters, while lower values produce more clusters and fewer noise points. In practice this only provides fine-tuning over the amount of noise, and is not as important as ``base_min_cluster_size`` and ``min_samples``.

**node_embedding_dim** : int, optional
    Dimensionality of the intermediate node embedding. If None, defaults to ``min(max(n_neighbors // 4, 4), 15)``. Higher dimensions can capture more complex structure but increase computation.
**neighbor_scale** : float, default=1.0
    Scales the effective number of neighbors (neighbor_scale × n_neighbors). Values > 1.0 create denser graphs, values < 1.0 create sparser graphs focused on local structure.

**n_epochs** : int, default=50
    Number of optimization epochs for the node embedding. More epochs improve embedding quality but increase computation time.

**node_embedding_init** : {'label_prop', None}, default='label_prop'
    Initialization method for the node embedding. 'label_prop' uses label propagation for initialization, None uses random initialization.

**n_label_prop_iter** : int, default=20
    Number of label propagation iterations when using 'label_prop' initialization.

**symmetrize_graph** : bool, default=True
    Whether to make the k-NN graph symmetric. Recommended for most use cases.

**random_state** : int, optional
    Random seed for reproducible results. When specified, enables deterministic mode.

Best Practices
--------------

As a general rule, EVoC is designed to be as parameter-free as possible. The default parameters should work well for a wide range of datasets and use cases, and most users will not need to adjust them. The best place to start is to run with the default parameters and then adjust based on the results. That said, here are some best practices for different scenarios:

Working with Hierarchical Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

EVoC provides multiple clustering layers with different granularities:

.. code-block:: python

    import numpy as np
    from evoc import EVoC

    # X is your array of embedding vectors, shape (n_samples, n_features)
    clusterer = EVoC(max_layers=5)
    clusterer.fit(X)

    # Explore different granularities; label -1 marks noise points
    for i, layer in enumerate(clusterer.cluster_layers_):
        n_clusters = len(np.unique(layer[layer >= 0]))
        n_noise = np.sum(layer == -1)
        persistence = clusterer.persistence_scores_[i]
        print(f"Layer {i}: {n_clusters} clusters, {n_noise} noise points, "
              f"persistence: {persistence:.3f}")

    # Use cluster tree for hierarchical analysis
    tree = clusterer.cluster_tree_
    # ... analyze hierarchical structure ...
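
The per-layer label arrays and persistence scores from the example above can also be summarized and compared programmatically. The following is a minimal sketch, not part of EVoC itself: ``summarize_layers`` and ``most_persistent_layer`` are hypothetical helper names, the arrays are synthetic stand-ins for ``cluster_layers_`` and ``persistence_scores_``, and choosing the layer with the highest persistence score is just one possible heuristic.

.. code-block:: python

    import numpy as np

    def summarize_layers(cluster_layers, persistence_scores):
        """Return one (n_clusters, n_noise, persistence) tuple per layer."""
        summary = []
        for layer, score in zip(cluster_layers, persistence_scores):
            layer = np.asarray(layer)
            n_clusters = int(np.unique(layer[layer >= 0]).size)
            n_noise = int(np.sum(layer == -1))  # -1 marks noise points
            summary.append((n_clusters, n_noise, float(score)))
        return summary

    def most_persistent_layer(persistence_scores):
        """Index of the layer with the highest persistence score (a heuristic)."""
        return int(np.argmax(persistence_scores))

    # Synthetic stand-ins for clusterer.cluster_layers_ and
    # clusterer.persistence_scores_ (assumed: one label array per layer).
    layers = [
        np.array([0, 0, 1, 1, 2, 2, -1, -1]),  # layer 0: finest
        np.array([0, 0, 0, 0, 1, 1, 1, -1]),   # layer 1: coarser
    ]
    scores = [0.12, 0.31]

    summary = summarize_layers(layers, scores)
    best = most_persistent_layer(scores)
    labels = layers[best]

With a fitted model, the same helpers would be called on ``clusterer.cluster_layers_`` and ``clusterer.persistence_scores_`` in place of the synthetic arrays.
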

Layer 0 is always the most fine-grained layer, as determined by ``base_min_cluster_size`` or ``base_n_clusters``. Each subsequent layer provides a coarser clustering with fewer clusters. In general, the most fine-grained layers will have the most noise points, and the coarser layers will have fewer. The persistence score measures how stable each layer is across different parameter settings, with higher scores indicating more robust clusters.

If you are interested in very fine-grained clusters, it is worth setting ``base_min_cluster_size`` or ``base_n_clusters`` explicitly to ensure you get a clustering at that granularity. You can then inspect the other layers to see whether the other natural granularities align with your use case. If you are only interested in a single clustering, you can set ``approx_n_clusters`` to get the layer that is closest to that number of clusters.

You can also use the tree structure to analyze how clusters evolve across layers, and to identify stable clusters that persist across multiple layers. Alternatively, you can use the tree structure to create a "mixed" resolution layer by selecting clusters at a given layer, and then also selecting any clusters in lower layers that are not children of any of your selected clusters. This gives you a more fine-grained clustering in some parts of the data while keeping a coarser clustering in other parts.

Performance Optimization
~~~~~~~~~~~~~~~~~~~~~~~~

Depending on your needs, you may be willing to trade off some accuracy for speed, or vice versa. The default EVoC parameters are designed primarily for exploratory clustering, and thus produce clusters very quickly. If you are looking for a more robust, higher-quality clustering, it can be worth tweaking the parameters to spend more time producing a better result. For example, for a medium-sized dataset (e.g.
10k-100k points) you can increase the number of epochs and neighbors to get a better embedding, which will lead to better clusters. In such cases you will also likely want to fix a random seed to ensure reproducibility, as the optimization process is stochastic.

.. code-block:: python

    clusterer = EVoC(
        n_epochs=150,     # More epochs for better embedding
        random_state=42   # Fix the seed for reproducible results
    )

For larger datasets, you may want to reduce the number of neighbors and epochs to get a faster result, at the cost of some cluster quality. In that case, not setting a random seed can actually improve performance, as it allows the algorithm to skip some of the overhead of ensuring reproducibility.

.. code-block:: python

    clusterer = EVoC(
        n_neighbors=10,   # Balance between quality and speed
        n_epochs=30,      # Fewer epochs for faster embedding
        max_layers=3,     # Limit hierarchy depth
    )

Troubleshooting
---------------

**Problem**: Too many small clusters

**Solution**: Increase ``base_min_cluster_size`` or ``noise_level``

**Problem**: Most points classified as noise

**Solution**: Decrease ``noise_level`` or reduce ``min_samples``

**Problem**: Clustering too slow

**Solution**: Reduce ``n_neighbors``, ``n_epochs``, or ``max_layers``

**Problem**: Poor cluster quality

**Solution**: Increase ``n_neighbors`` or ``n_epochs``, or try a different ``node_embedding_init``

**Problem**: Inconsistent results

**Solution**: Set ``random_state`` for reproducible results
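
To decide which of the problems above applies, it can help to put numbers on a clustering before changing any parameters. The sketch below is an illustration rather than part of EVoC: ``diagnose_labels`` and its ``small_cluster_size`` cutoff are hypothetical, and it assumes only that labels are integers with -1 marking noise, as in the layer arrays shown earlier.

.. code-block:: python

    import numpy as np

    def diagnose_labels(labels, small_cluster_size=5):
        """Summarize a label array: cluster count, noise share, small clusters."""
        labels = np.asarray(labels)
        clustered = labels[labels >= 0]  # drop noise points (label -1)
        ids, sizes = np.unique(clustered, return_counts=True)
        return {
            "n_clusters": int(ids.size),
            "noise_fraction": float(np.mean(labels == -1)) if labels.size else 0.0,
            # Clusters at or below the (hypothetical) size cutoff
            "n_small_clusters": int(np.sum(sizes <= small_cluster_size)),
            "median_cluster_size": float(np.median(sizes)) if sizes.size else 0.0,
        }

    # Synthetic labels standing in for one clustering layer
    labels = np.array([0, 0, 0, 1, 1, -1, -1, -1, -1, 2])
    report = diagnose_labels(labels, small_cluster_size=2)

A high ``noise_fraction`` points toward the "most points classified as noise" entry, while a large ``n_small_clusters`` relative to ``n_clusters`` points toward the "too many small clusters" entry.
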