Quick Start Guide

This guide provides a quick introduction to using EVōC for clustering high-dimensional embedding vectors. EVōC specifically targets modern embedding vectors such as those produced by CLIP, sentence-transformers, and other dense vector representations. It seeks to provide fast and effective results with as little parameter tuning as possible.

Basic Usage

The simplest way to use EVōC is to import the EVoC class, create an instance with default parameters, and call fit_predict on your data:

from evoc import EVoC
from sklearn.datasets import make_blobs
import numpy as np

# Generate sample embedding data
blob_data, blob_labels = make_blobs(n_samples=10_000, n_features=512, centers=256)

# Create and fit the clusterer
clusterer = EVoC()
labels = clusterer.fit_predict(blob_data)

# Analyze results
n_clusters = len(np.unique(labels[labels >= 0]))
n_noise = np.sum(labels == -1)

print(f"Found {n_clusters} clusters")
print(f"Noise points: {n_noise}")

EVōC uses the sklearn API, so you can drop it in to any existing clustering workflow that expects a fit_predict method. The default parameters are designed to work well for typical embedding data, but you can adjust them as needed (see the Parameter Selection section below).

Understanding the Output

EVōC uses standard sklearn conventions for its output. After fitting, the clusterer will have the following attributes:

labels_: Cluster labels for each point (-1 for noise)
membership_strengths_: Confidence scores for cluster membership
cluster_layers_: Multiple clustering granularities
cluster_tree_: Hierarchical structure of clusters

The labels_ attribute is the expected vector of cluster assignments you would get from any sklearn clustering algorithm. The membership_strengths_ attribute provides additional information about how strongly each point belongs to its assigned cluster, which can be useful for filtering or analyzing borderline cases; the is equivalent to the probabilities_ attribute in HDBSCAN.

The cluster_layers_ and cluster_tree_ attributes are more novel. EVōC is not a hierarchical clustering algorithm in the traditional sense, instead it produces multiple layers of clustering resolution, that can be results that can be cast into a hierarchy.

# Access different clustering layers
print(f"Available layers: {len(clusterer.cluster_layers_)}")

# Get membership strengths
strengths = clusterer.membership_strengths_
print(f"Average membership strength: {np.mean(strengths):.3f}")

# Access the cluster hierarchy
tree = clusterer.cluster_tree_
print(f"Hierarchical structure: {tree}")

Layers are sorted from most fine-grained (many small clusters) at index 0 to most coarse-grained (fewer large clusters). Each layer is a label vector, just like labels_, but with a different clustering resolution. The labels_ attribute corresponds to the layer that has clusters persisting across the widest range of cluster resolution scales, and is usually the most stable and meaningful clustering result. However, depending on your needs, other cluster layers may be more appropriate.

The cluster_tree_ attribute provides a hierarchical structure of the clusters across layers. It shows how clusters in finer layers relate to clusters in coarser layers, effectively creating a tree of cluster relationships. This can be useful for understanding the multi-scale structure of your data and for selecting clusters at different levels of granularity.

The tree is structured as a dictionary. Each cluster is identified as a tuple of (layer_index, cluster_id), and the value is a list of child clusters in the more fine-grained layers.

Parameter Selection

Key parameters to adjust:

n_neighbors (default=15): Number of neighbors for graph construction. Increase for more global connectivity.
base_min_cluster_size (default=5): Minimum cluster size at the base layer.
approx_n_clusters (default=None): Target number of clusters (returns single layer if specified).

# Example with custom parameters
clusterer = EVoC(
    n_neighbors=25,          # More neighbors for denser graphs
    base_min_cluster_size=10, # Larger minimum clusters
    max_layers=5             # Limit hierarchy depth
)

labels = clusterer.fit_predict(blob_data)

Working with Different Data Types

EVoC automatically detects data types and uses appropriate distance metrics:

float32/float64: Cosine distance (default for embeddings)
int8: Quantized cosine distance
uint8: Bitwise Jaccard distance (for binary embeddings)

We can take out blob data and convert it to different formats to see how EVoC handles them. In practice, you would typically be working with actual embedding data that comes pre-quantized or binarized depending on the model and/or storage format you are using.

embeddings = normalize(blob_data) # Example embedding data

# For standard embeddings (float) X_float = embeddings.astype(np.float32) labels_cosine = EVoC().fit_predict(X_float)

# For quantized embeddings (int8) X_quantized = (StandardScaler().fit_transform(embeddings) * 127).astype(np.int8) labels_quantized = EVoC().fit_predict(X_quantized)

# For binary embeddings (packed uint8) X_binary = np.packbits(embeddings > 0.0, axis=1) labels_binary = EVoC().fit_predict(X_binary)

Next Steps

See the User Guide for detailed parameter explanations
Refer to API Reference for complete API documentation