
Clustering Tools

Unsupervised learning tools for discovering patterns and groupings in data.

Overview

  • Partitioning Methods: K-Means, K-Medoids
  • Hierarchical: Agglomerative, Divisive
  • Density-Based: DBSCAN, OPTICS, HDBSCAN
  • Model-Based: Gaussian Mixture Models
  • Evaluation: Silhouette, Calinski-Harabasz

K-Means Clustering

K-Means is a widely used partitioning algorithm that works best when clusters are roughly spherical and similar in size.

Parameters

  • Number of Clusters (k): Target cluster count
  • Initialization: k-means++, Random, or Manual
  • Max Iterations: Default 300
  • Convergence Tolerance: Default 1e-4
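To show how the iteration limit and the convergence tolerance interact, here is a minimal standard-library sketch of Lloyd's algorithm. It is an illustration only, not the CyxWiz implementation: the function name `kmeans_sketch` is hypothetical, it uses plain random initialization rather than k-means++, and it assumes convergence is measured by the largest centroid movement.

```python
import math
import random


def kmeans_sketch(points, k, max_iter=300, tol=1e-4, seed=42):
    """Plain Lloyd's algorithm with random initialization (not k-means++)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster stays put
        # Convergence check (assumed rule): stop once no centroid moved
        # farther than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum(
        math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
    )
    return labels, centroids, inertia


# Two well-separated blobs of three points each.
demo_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
demo_labels, demo_centroids, demo_inertia = kmeans_sketch(demo_points, k=2)
print(demo_labels, round(demo_inertia, 3))
```

On data this cleanly separated the loop stops after a handful of iterations, long before the 300-iteration cap. The cap matters mostly for large or overlapping datasets, where centroids can keep drifting by small amounts.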

Usage

from cyxwiz.clustering import KMeans

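# X is the feature matrix, one row per sample
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
kmeans.fit(X)

labels = kmeans.labels_              # cluster index assigned to each sample
centroids = kmeans.cluster_centers_  # coordinates of the fitted centroids
inertia = kmeans.inertia_            # within-cluster sum of squares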

DBSCAN (Density-Based)

Discover clusters of arbitrary shape based on point density. Points that fall in sparse regions are labeled as noise (-1).

Parameters

  • Epsilon (eps): Neighborhood radius
  • Min Samples: Minimum number of points within eps for a point to count as a core point
  • Metric: Euclidean, Manhattan, or Cosine
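The three parameters above can be illustrated with a small standard-library sketch of the DBSCAN procedure. This is not the CyxWiz implementation: the names `dbscan_sketch`, `euclidean`, `manhattan`, and `cosine` are hypothetical, and the sketch assumes a point counts itself toward its own neighborhood, which is a common convention but may differ from what CyxWiz does.

```python
import math


def euclidean(a, b):
    return math.dist(a, b)


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # Cosine distance = 1 - cosine similarity; undefined for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def dbscan_sketch(points, eps, min_samples, metric=euclidean):
    """Label each point with a cluster id, or -1 for noise."""
    n = len(points)
    # Neighborhood of i: every point within eps, including i itself
    # (assumed convention).
    neighbors = [
        [j for j in range(n) if metric(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unlabeled core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if is_core[j]:
                        frontier.append(j)  # only core points keep expanding
        cluster += 1
    return labels


demo_points = [
    (0, 0), (0, 0.3), (0.3, 0),   # dense group
    (5, 5), (5, 5.3), (5.3, 5),   # second dense group
    (20, 20),                     # isolated point
]
print(dbscan_sketch(demo_points, eps=0.5, min_samples=3))
# prints [0, 0, 0, 1, 1, 1, -1]
```

The sketch computes all pairwise distances, so it scales quadratically with the number of points. It is meant to show the roles of eps and min_samples, not to be fast.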

Usage

from cyxwiz.clustering import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

# Identify noise (-1 labels)
noise_mask = labels == -1
n_noise = noise_mask.sum()

Hierarchical Clustering

Build a tree of clusters using agglomerative (bottom-up) or divisive (top-down) approaches.

Linkage Methods

  • Ward: Minimizes variance increase
  • Complete: Maximum distance between clusters
  • Average: Average distance between all pairs
  • Single: Minimum distance between clusters

from cyxwiz.clustering import AgglomerativeClustering, dendrogram, linkage

# Compute the merge hierarchy, then plot it as a dendrogram
Z = linkage(X, method='ward')
dendrogram(Z)

model = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = model.fit_predict(X)
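The linkage methods differ only in how they score the distance between two candidate clusters before merging them. The following standard-library sketch spells out each criterion for two small clusters. The function names are hypothetical, and the Ward function returns the raw increase in within-cluster sum of squares; libraries often report a scaled or square-rooted variant of that quantity.

```python
import math
from itertools import product


def single_linkage(a, b):
    """Minimum distance over all cross-cluster pairs."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Maximum distance over all cross-cluster pairs."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * math.dist(centroid(a), centroid(b)) ** 2


left = [(0, 0), (0, 2)]
right = [(3, 0), (3, 2)]
for name, fn in [
    ("single", single_linkage),
    ("complete", complete_linkage),
    ("average", average_linkage),
    ("ward", ward_increase),
]:
    print(f"{name}: {fn(left, right):.3f}")
```

Single linkage tends to chain elongated clusters together, while complete linkage and Ward favor compact groups, so the choice of method can change the resulting tree substantially on the same data.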

Gaussian Mixture Models

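Probabilistic clustering that assumes the data is drawn from a mixture of Gaussian distributions. Each point receives a membership probability for every component rather than a single hard label.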

from cyxwiz.clustering import GaussianMixture

gmm = GaussianMixture(n_components=4, covariance_type='full')
gmm.fit(X)

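probs = gmm.predict_proba(X)  # membership probability per component
labels = gmm.predict(X)       # most likely component for each sample
samples = gmm.sample(100)     # draw 100 new points from the fitted mixture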
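To make the idea of soft assignment concrete, the sketch below computes membership probabilities for a one-dimensional mixture with known parameters, using Bayes' rule. It is a simplified illustration under stated assumptions: the function names are hypothetical, the parameters are given rather than fitted, and the multivariate case with full covariance matrices is omitted.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability that x was generated by each component."""
    joint = [
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(joint)
    return [j / total for j in joint]


# Two equally weighted components centered at 0 and 4, unit variance.
weights, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
for x in (0.0, 2.0, 3.5):
    probs = responsibilities(x, weights, means, variances)
    hard_label = probs.index(max(probs))  # hard label = most probable component
    print(x, [round(p, 3) for p in probs], hard_label)
```

A point midway between the two means gets a 50/50 split, which is information a hard-labeling method such as K-Means would discard.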

Evaluation Metrics

Metric             Description                    Range          Best
Silhouette Score   Cohesion vs. separation        -1 to 1        Higher
Calinski-Harabasz  Between/within variance ratio  0 to infinity  Higher
Davies-Bouldin     Average cluster similarity     0 to infinity  Lower
Inertia            Within-cluster sum of squares  0 to infinity  Lower
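As an illustration of what "cohesion vs. separation" means, the silhouette score can be computed by hand for a small dataset. The sketch below uses only the standard library and Euclidean distance. The function name `silhouette_sketch` is hypothetical, singleton clusters are scored as 0 by convention, and at least two clusters are required.

```python
import math


def silhouette_sketch(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a (cohesion): mean distance to the other members of p's cluster.
        # The sum includes dist(p, p) = 0, hence the division by len - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b (separation): smallest mean distance to any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


demo_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_sketch(demo_points, [0, 0, 1, 1])  # groups nearby points
bad = silhouette_sketch(demo_points, [0, 1, 0, 1])   # pairs distant points
print(f"good: {good:.3f}, bad: {bad:.3f}")
```

The sensible labeling scores close to 1, while the labeling that pairs distant points scores below 0, matching the "higher is better" direction in the table.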

Evaluation Usage

from cyxwiz.clustering.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score
)

silhouette = silhouette_score(X, labels)
ch_score = calinski_harabasz_score(X, labels)
db_score = davies_bouldin_score(X, labels)

print(f"Silhouette: {silhouette:.3f}")
print(f"Calinski-Harabasz: {ch_score:.2f}")
print(f"Davies-Bouldin: {db_score:.3f}")