Quick Start Guide
=================

This guide provides a quick introduction to using EVōC for clustering high-dimensional embedding vectors. EVōC 
specifically targets modern embedding vectors such as those produced by CLIP, sentence-transformers, and other 
dense vector representations. It seeks to provide fast and effective results with as little parameter tuning as possible.

Basic Usage
-----------

The simplest way to use EVōC is to import the ``EVoC`` class, create an instance with default parameters, and call ``fit_predict`` on your data:

.. code-block:: python

   from evoc import EVoC
   from sklearn.datasets import make_blobs
   import numpy as np

   # Generate sample embedding data
   blob_data, blob_labels = make_blobs(n_samples=10_000, n_features=512, centers=256)

   # Create and fit the clusterer
   clusterer = EVoC()
   labels = clusterer.fit_predict(blob_data)

   # Analyze results
   n_clusters = len(np.unique(labels[labels >= 0]))
   n_noise = np.sum(labels == -1)

   print(f"Found {n_clusters} clusters")
   print(f"Noise points: {n_noise}")

EVōC uses the sklearn API, so you can drop it into any existing clustering workflow that expects a ``fit_predict`` method.
The default parameters are designed to work well for typical embedding data, but you can adjust them as needed
(see the Parameter Selection section below).

Understanding the Output
------------------------

EVōC uses standard sklearn conventions for its output. After fitting, the clusterer will have the following attributes:

* **labels_**: Cluster labels for each point (-1 for noise)
* **membership_strengths_**: Confidence scores for cluster membership
* **cluster_layers_**: Multiple clustering granularities 
* **cluster_tree_**: Hierarchical structure of clusters

The ``labels_`` attribute is the expected vector of cluster assignments you would get from any sklearn clustering algorithm. 
The ``membership_strengths_`` attribute provides additional information about how strongly each point belongs to its assigned 
cluster, which can be useful for filtering or analyzing borderline cases; this is equivalent to the ``probabilities_`` attribute
in HDBSCAN.
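
As a minimal sketch of filtering borderline cases, the snippet below relabels weakly assigned points as noise.
The arrays are small stand-ins for ``clusterer.labels_`` and ``clusterer.membership_strengths_``, and the ``0.5``
cutoff is an arbitrary assumption, not a recommended value.

.. code-block:: python

   import numpy as np

   # Stand-in values; in practice use clusterer.labels_ and
   # clusterer.membership_strengths_ from a fitted EVoC instance.
   labels = np.array([0, 0, 1, 1, -1, 2])
   strengths = np.array([0.95, 0.40, 0.80, 0.55, 0.00, 0.30])

   # Hypothetical cutoff; pick a value that suits your data.
   threshold = 0.5

   # Keep the label where the strength meets the cutoff, otherwise mark as noise (-1).
   confident_labels = np.where(strengths >= threshold, labels, -1)

   print(confident_labels)

A stricter cutoff trades coverage for cleaner clusters, so it is worth checking how many points survive at a few different values.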

The ``cluster_layers_`` and ``cluster_tree_`` attributes are more novel. EVōC is not a hierarchical clustering algorithm in the
traditional sense; instead, it produces multiple layers of clustering resolution that can be cast into a
hierarchy.

.. code-block:: python

   # Access different clustering layers
   print(f"Available layers: {len(clusterer.cluster_layers_)}")

   # Get membership strengths
   strengths = clusterer.membership_strengths_
   print(f"Average membership strength: {np.mean(strengths):.3f}")

   # Access the cluster hierarchy
   tree = clusterer.cluster_tree_
   print(f"Hierarchical structure: {tree}")

Layers are sorted from most fine-grained (many small clusters) at index 0 to most coarse-grained (fewer large clusters).
Each layer is a label vector, just like ``labels_``, but with a different clustering resolution. The ``labels_`` attribute 
corresponds to the layer that has clusters persisting across the widest range of cluster resolution scales, and is usually 
the most stable and meaningful clustering result. However, depending on your needs, other cluster layers may be more appropriate.
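
One way to compare resolutions is to count the clusters and noise points in each layer. The sketch below uses a
hand-written list of label vectors as a stand-in for ``clusterer.cluster_layers_``; the helper name
``summarize_layers`` is hypothetical and not part of EVōC.

.. code-block:: python

   import numpy as np

   def summarize_layers(layers):
       """Return a (n_clusters, n_noise) pair for each label vector in ``layers``."""
       summary = []
       for layer_labels in layers:
           layer_labels = np.asarray(layer_labels)
           n_clusters = len(np.unique(layer_labels[layer_labels >= 0]))
           n_noise = int(np.sum(layer_labels == -1))
           summary.append((n_clusters, n_noise))
       return summary

   # Stand-in for clusterer.cluster_layers_, with the finest layer first.
   example_layers = [
       np.array([0, 0, 1, 1, 2, 2, 3, -1]),
       np.array([0, 0, 0, 0, 1, 1, 1, -1]),
   ]

   for index, (n_clusters, n_noise) in enumerate(summarize_layers(example_layers)):
       print(f"Layer {index}: {n_clusters} clusters, {n_noise} noise points")

With a fitted clusterer, passing ``clusterer.cluster_layers_`` in place of ``example_layers`` should give a quick overview of which layer matches the granularity you need.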

The ``cluster_tree_`` attribute provides a hierarchical structure of the clusters across layers. 
It shows how clusters in finer layers relate to clusters in coarser layers, effectively creating a tree of cluster relationships. 
This can be useful for understanding the multi-scale structure of your data and for selecting clusters at 
different levels of granularity.

The tree is structured as a dictionary. Each key is a tuple ``(layer_index, cluster_id)`` identifying a cluster,
and the corresponding value is a list of that cluster's children in the more fine-grained layers.
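
To illustrate how such a dictionary can be walked, the sketch below collects the finest-grained clusters underneath
a given cluster. The ``example_tree`` contents and the ``leaf_descendants`` helper are hypothetical; they only assume
the key and value format described above.

.. code-block:: python

   # Hypothetical tree in the format described above; a fitted clusterer
   # would supply the real one as clusterer.cluster_tree_.
   example_tree = {
       (2, 0): [(1, 0), (1, 1)],
       (1, 0): [(0, 0), (0, 1)],
       (1, 1): [(0, 2)],
   }

   def leaf_descendants(tree, node):
       """Return the childless clusters reachable from ``node``.

       If ``node`` has no children, it is returned as its own only leaf.
       """
       children = tree.get(node, [])
       if not children:
           return [node]
       leaves = []
       for child in children:
           leaves.extend(leaf_descendants(tree, child))
       return leaves

   print(leaf_descendants(example_tree, (2, 0)))

The same recursion can be adapted to stop at a chosen layer index if you want clusters at an intermediate granularity.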

Parameter Selection
-------------------

Key parameters to adjust:

**n_neighbors** (default=15)
   Number of neighbors for graph construction. Increase for more global connectivity.

**base_min_cluster_size** (default=5)
   Minimum cluster size at the base layer.

**approx_n_clusters** (default=None)
   Target number of clusters (returns single layer if specified).

.. code-block:: python

   # Example with custom parameters
   clusterer = EVoC(
       n_neighbors=25,          # More neighbors for denser graphs
       base_min_cluster_size=10, # Larger minimum clusters
       max_layers=5             # Limit hierarchy depth
   )

   labels = clusterer.fit_predict(blob_data)

Working with Different Data Types
---------------------------------

EVōC automatically detects the data type of its input and uses an appropriate distance metric:

* **float32/float64**: Cosine distance (default for embeddings)
* **int8**: Quantized cosine distance  
* **uint8**: Bitwise Jaccard distance (for binary embeddings)

We can take our blob data and convert it to different formats to see how EVōC handles them.
In practice, you would typically be working with actual embedding data that comes
pre-quantized or binarized depending on the model and/or storage format you are using.

.. code-block:: python

   from sklearn.preprocessing import StandardScaler, normalize

   embeddings = normalize(blob_data)  # Example embedding data

   # For standard embeddings (float)
   X_float = embeddings.astype(np.float32)
   labels_cosine = EVoC().fit_predict(X_float)

   # For quantized embeddings (int8); clip first so that values outside
   # the int8 range saturate instead of wrapping around on the cast
   X_scaled = StandardScaler().fit_transform(embeddings) * 127
   X_quantized = np.clip(X_scaled, -127, 127).astype(np.int8)
   labels_quantized = EVoC().fit_predict(X_quantized)

   # For binary embeddings (packed uint8)
   X_binary = np.packbits(embeddings > 0.0, axis=1)
   labels_binary = EVoC().fit_predict(X_binary)
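
For intuition about the binary case, the sketch below computes a Jaccard distance directly on bit-packed ``uint8``
vectors using NumPy. It illustrates the metric only and is not EVōC's internal implementation; the
``packed_jaccard_distance`` helper is a hypothetical name.

.. code-block:: python

   import numpy as np

   def packed_jaccard_distance(a, b):
       """Jaccard distance between two bit vectors stored as packed uint8 arrays."""
       intersection = np.unpackbits(a & b).sum()
       union = np.unpackbits(a | b).sum()
       if union == 0:
           return 0.0  # Convention: two all-zero vectors are treated as identical.
       return 1.0 - intersection / union

   bits_a = np.array([1, 1, 0, 0, 1, 0, 1, 0], dtype=bool)
   bits_b = np.array([1, 0, 0, 0, 1, 1, 1, 0], dtype=bool)

   # Pack eight bits into each uint8 byte, as in the binary example above.
   packed_a = np.packbits(bits_a)
   packed_b = np.packbits(bits_b)

   distance = packed_jaccard_distance(packed_a, packed_b)
   print(f"Jaccard distance: {distance:.2f}")

Here the two vectors share three set bits out of five set in either, so the distance is ``1 - 3/5``.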

Next Steps
----------

* See the :doc:`user_guide` for detailed parameter explanations
* Refer to :doc:`api/index` for complete API documentation