Clustering Tools
Unsupervised learning tools for discovering patterns and groupings in data.
Overview
- Partitioning Methods: K-Means, K-Medoids
- Hierarchical: Agglomerative, Divisive
- Density-Based: DBSCAN, OPTICS, HDBSCAN
- Model-Based: Gaussian Mixture Models
- Evaluation: Silhouette, Calinski-Harabasz
K-Means Clustering
The most popular partitioning algorithm. K-Means requires the cluster count k up front and works best on roughly spherical, similarly sized clusters.
Parameters
- Number of Clusters (k): Target cluster count
- Initialization: k-means++, Random, or Manual
- Max Iterations: Default 300
- Convergence Tolerance: Default 1e-4
Usage
from cyxwiz.clustering import KMeans

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
kmeans.fit(X)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_
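Inertia always decreases as k grows, so a common way to pick k is the elbow heuristic: compute inertia for a range of k values and look for the bend where improvement flattens. The sketch below is library-independent — a minimal Lloyd's-iteration k-means in plain NumPy, not the cyxwiz implementation — just to show the computation:

```python
import numpy as np

def mini_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

# Three well-separated blobs: inertia drops sharply up to k=3, then flattens
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ((0, 0), (5, 0), (0, 5))])
inertias = {k: mini_kmeans(X, k)[2] for k in range(1, 7)}
```

Plotting `inertias` against k would show the elbow at k=3 for this data.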
DBSCAN (Density-Based)
Discovers clusters of arbitrary shape based on point density, labeling points in low-density regions as noise.
Parameters
- Epsilon (eps): Neighborhood radius
- Min Samples: Core point threshold
- Metric: Euclidean, Manhattan, or Cosine
Usage
from cyxwiz.clustering import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Identify noise points (labeled -1)
noise_mask = labels == -1
n_noise = noise_mask.sum()
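A common heuristic for choosing eps is the k-distance plot: compute each point's distance to its min_samples-th nearest neighbor, sort those distances, and pick eps near the "knee" where they start rising steeply. A minimal NumPy sketch of that computation (independent of cyxwiz; the brute-force distance matrix is only suitable for small datasets):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two dense blobs plus a sprinkle of uniform noise
X = np.vstack([
    rng.normal((0, 0), 0.2, size=(50, 2)),
    rng.normal((4, 4), 0.2, size=(50, 2)),
    rng.uniform(-2, 6, size=(10, 2)),
])

min_samples = 5
# Full pairwise distance matrix (fine for small n)
D = np.linalg.norm(X[:, None] - X[None], axis=2)
# Column 0 of each sorted row is the self-distance (0), so index
# min_samples gives the distance to the min_samples-th neighbor
k_dist = np.sort(D, axis=1)[:, min_samples]
k_dist_sorted = np.sort(k_dist)
# Plot k_dist_sorted and pick eps near the knee; noise points sit on
# the steep tail, so their k-distances are the largest values
```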
Hierarchical Clustering
Build a tree of clusters using agglomerative (bottom-up) or divisive (top-down) approaches.
Linkage Methods
| Method | Merge Criterion |
|---|---|
| Ward | Minimizes variance increase |
| Complete | Maximum distance between clusters |
| Average | Average distance between all pairs |
| Single | Minimum distance between clusters |
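The pairwise criteria above (single, complete, average) can be checked numerically on two fixed clusters; Ward is omitted here because it scores merges by variance increase rather than a pairwise distance. A library-independent NumPy sketch:

```python
import numpy as np

# Two small clusters to compare linkage criteria on
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])

# All pairwise distances between the two clusters
D = np.linalg.norm(A[:, None] - B[None], axis=2)

single = D.min()      # nearest pair: 3.0
complete = D.max()    # farthest pair: sqrt(17)
average = D.mean()    # mean over all four pairs
```

By construction, single <= average <= complete always holds, which is why single linkage tends to chain clusters together while complete linkage keeps them compact.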
Usage
from cyxwiz.clustering import AgglomerativeClustering, dendrogram, linkage

# Visualize the merge tree
Z = linkage(X, method='ward')
dendrogram(Z)

# Extract a flat clustering
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X)
Gaussian Mixture Models
Probabilistic clustering that assumes the data is generated by a mixture of Gaussian distributions; each point receives a soft (probabilistic) assignment to every component.
Usage
from cyxwiz.clustering import GaussianMixture

gmm = GaussianMixture(n_components=4, covariance_type='full')
gmm.fit(X)

probs = gmm.predict_proba(X)   # soft assignments (per-component probabilities)
labels = gmm.predict(X)        # hard assignments
samples = gmm.sample(100)      # draw new points from the fitted mixture
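The key difference from K-Means is the soft assignment: each point gets a posterior probability (responsibility) for every component. A minimal 1-D sketch with fixed parameters, independent of the cyxwiz fitting code, showing how those responsibilities are computed:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Two equally weighted components centered at -2 and +2
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
var = 1.0

x = np.array([-2.0, 0.0, 2.0])
# Responsibilities: weighted densities, normalized per point
joint = weights * gauss_pdf(x[:, None], means[None, :], var)
resp = joint / joint.sum(axis=1, keepdims=True)
# resp[0] is almost all component 0; resp[1] is an even 50/50 split
```

A point midway between the means is genuinely ambiguous (responsibility 0.5 each), which a hard K-Means assignment cannot express.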
Evaluation Metrics
| Metric | Description | Range | Best |
|---|---|---|---|
| Silhouette Score | Cohesion vs separation | -1 to 1 | Higher |
| Calinski-Harabasz | Between/within variance ratio | 0 to infinity | Higher |
| Davies-Bouldin | Average cluster similarity | 0 to infinity | Lower |
| Inertia | Within-cluster sum of squares | 0 to infinity | Lower |
Evaluation Usage
from cyxwiz.clustering.metrics import (
silhouette_score,
calinski_harabasz_score,
davies_bouldin_score
)
silhouette = silhouette_score(X, labels)
ch_score = calinski_harabasz_score(X, labels)
db_score = davies_bouldin_score(X, labels)
print(f"Silhouette: {silhouette:.3f}")
print(f"Calinski-Harabasz: {ch_score:.2f}")
print(f"Davies-Bouldin: {db_score:.3f}")
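As a sanity check on the silhouette metric itself, here is a minimal NumPy implementation of the mean silhouette score (library-independent sketch; it assumes every cluster contains at least two points):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette score; every cluster must have >= 2 points."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False
        a = D[i, same].mean()               # mean intra-cluster distance
        b = min(D[i, labels == c].mean()    # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Two tight, well-separated clusters -> score close to 1
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
score = mean_silhouette(X, labels)
```

Since all three metrics share the same `(X, labels)` signature, the same pattern works for sweeping candidate cluster counts and keeping the labeling with the best score.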