Unsupervised Manifold Learning · Financial Fraud Detection

LatentDiscovery.

Comparing clustering philosophies — K-Means, GMM, and HDBSCAN — on a 28-dimensional financial dataset projected into 3D latent space via UMAP to isolate fraudulent transaction topologies.

Dimensions 28 → 3D

Noise Rate 0.9%

Algorithm HDBSCAN

Best Silhouette 0.76↑ (K-Means)

Full Pipeline Overview

Clustering Dashboard

Master 2-row grid: 2D UMAP projections coloured by each method alongside ground-truth fraud overlay (top), and model selection diagnostics including elbow curve, silhouette comparison, and GMM BIC/AIC curve (bottom).

Comparative Analysis

Three Philosophies

Each algorithm encodes a different belief about what a cluster is. The fraud topology demands one.

Method 01

K-Means

k = 4 clusters

K-Means achieves the highest geometric separation (Silhouette 0.7597) because its objective favours compact, roughly spherical clusters of similar size, which is the same geometry Silhouette rewards. It partitions the dense core manifold well but is blind to non-convex fraud tendrils and outlier topology: every point, noise included, is assigned to its nearest centroid rather than flagged.

Silhouette 0.7597

DB Index 0.2885

Clusters 4
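
The K-Means card can be reproduced in outline as follows. This is a minimal sketch, not the project's actual pipeline: it assumes scikit-learn and substitutes synthetic 3D blobs for the real UMAP embedding, so the printed scores will not match the figures above. The variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the 3D UMAP embedding (assumption: not the real data).
embedding, _ = make_blobs(
    n_samples=600, centers=4, n_features=3, cluster_std=0.6, random_state=0
)

# k = 4, as on the card. Every point is assigned to its nearest centroid,
# so K-Means has no way to label a point as noise.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(embedding)

# Silhouette: higher is better. Davies-Bouldin: lower is better.
kmeans_silhouette = silhouette_score(embedding, kmeans_labels)
kmeans_db = davies_bouldin_score(embedding, kmeans_labels)

print(f"Silhouette {kmeans_silhouette:.4f}  DB Index {kmeans_db:.4f}")
```

Both metrics are computed from the embedding and the hard labels only, which is why they can be compared across the three methods.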

Method 02

GMM

n = 10 components

Gaussian Mixture Models add soft, probabilistic assignments and allow ellipsoidal cluster shapes. BIC/AIC model selection settled on n=10 components, capturing more latent structure than K-Means while remaining constrained to parametric Gaussian geometry. Fraud boundaries stay probabilistically blurred rather than topologically isolated.

Silhouette 0.4843

DB Index 0.6032

Clusters 10
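
One common way to run this kind of selection is to fit a mixture for each candidate component count and keep the one with the lowest BIC. The sketch below assumes that minimum-BIC rule, scikit-learn's GaussianMixture, and synthetic data. The page does not state which criterion made the final call, so treat the rule and all names here as assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the 3D UMAP embedding (assumption: not the real data).
embedding, _ = make_blobs(
    n_samples=600, centers=5, n_features=3, cluster_std=0.7, random_state=1
)

candidate_counts = list(range(1, 9))
bic_scores, aic_scores = [], []
for n in candidate_counts:
    gmm = GaussianMixture(n_components=n, covariance_type="full", random_state=0)
    gmm.fit(embedding)
    bic_scores.append(gmm.bic(embedding))  # lower is better
    aic_scores.append(gmm.aic(embedding))  # lower is better

# Assumed rule: pick the component count that minimises BIC.
best_n = candidate_counts[int(np.argmin(bic_scores))]

best_gmm = GaussianMixture(
    n_components=best_n, covariance_type="full", random_state=0
).fit(embedding)

# Soft assignments: one membership probability per point and component.
responsibilities = best_gmm.predict_proba(embedding)
print(f"selected n_components = {best_n}")
```

The rows of `responsibilities` sum to 1, which is the "probabilistically blurred" boundary described above: a point between two components is shared, not isolated.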

★ Selected

Method 03

HDBSCAN

9 clusters + noise isolation

Density-based hierarchical clustering requires no assumption about cluster shape or count. It naturally discovers the irregular, filamentary topology of fraud transaction manifolds, explicitly labels low-density noise (0.9% of points) as outliers rather than assigning them, and produces a fraud enrichment hierarchy directly aligned with the UMAP embedding structure.

Silhouette 0.4231

DB Index 0.5241

Clusters 9

Why the lower Silhouette score wins: Silhouette rewards compact, convex clusters, which is exactly what fraud is not. HDBSCAN's lower score reflects its honest representation of irregular, sparse fraud topologies that centroid- and Gaussian-based methods artificially smooth over.
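
The bias toward compact shapes follows from the definition. For a point i, let a be its mean distance to the other members of its own cluster and b its smallest mean distance to any other cluster; the silhouette is s = (b - a) / max(a, b). A long, thin cluster has a large a by construction, so it scores poorly even when it is a perfectly valid grouping. The pure-Python sketch below uses made-up toy coordinates to show the effect.

```python
from math import dist
from statistics import mean


def silhouette_per_point(points, labels):
    """Return s = (b - a) / max(a, b) for every point."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, q) for q in other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two compact, well-separated squares.
compact_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
compact_labels = [0] * 4 + [1] * 4

# A thin filament lying next to a small compact blob.
filament_points = [(x, 0) for x in range(10)] + [(4.5, 2), (4.5, 2.5), (5, 2), (5, 2.5)]
filament_labels = [0] * 10 + [1] * 4

compact_score = mean(silhouette_per_point(compact_points, compact_labels))
filament_score = mean(silhouette_per_point(filament_points, filament_labels))
print(f"compact: {compact_score:.2f}  filament: {filament_score:.2f}")
```

The filament layout scores well below the compact one although both labelings match the shapes that generated them.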

UMAP Topology

Local vs. Global Structure

UMAP constructs a weighted k-nearest-neighbour graph in the original 28-dimensional space, then optimizes a low-dimensional embedding to preserve that graph's topological structure. The n_neighbors parameter controls the local-to-global tradeoff: small values reveal fine-grained manifold structure; larger values preserve macro-topology. At n_neighbors=15, the embedding resolves both the dense legitimate transaction core and the sparse, filamentary fraud periphery simultaneously.

The 3D projection retains sufficient degrees of freedom to avoid the false separation artefacts that appear in 2D — critical when fraud clusters span multiple topological sheets.
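
To make the role of n_neighbors concrete, the sketch below builds only the first stage described above: a k-nearest-neighbour distance graph in the 28-dimensional space. It uses scikit-learn's `kneighbors_graph` on random stand-in data and stops at raw distances. UMAP itself goes further, converting distances into fuzzy membership weights and then optimising the low-dimensional layout; in the umap-learn package that whole procedure would be `umap.UMAP(n_neighbors=15, n_components=3)`.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Random stand-in for the 28 input features (assumption: not the real data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 28))

# Each row links a point to its 15 nearest neighbours, weighted by distance.
# A smaller n_neighbors keeps only very local links; a larger one ties
# distant regions together and emphasises global structure.
knn_graph = kneighbors_graph(
    features, n_neighbors=15, mode="distance", include_self=False
)

edges_per_point = knn_graph.nnz // knn_graph.shape[0]
print(f"graph shape {knn_graph.shape}, {edges_per_point} edges per point")
```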


Density Heatmaps with Centroid Overlay — All Three Methods

Quantitative Evaluation

Performance Matrix

Method	Clusters	Silhouette ↑	Davies-Bouldin ↓	Verdict
K-Means	4	0.7597	0.2885	Geometric Baseline
GMM	10	0.4843	0.6032	Probabilistic
HDBSCAN	9	0.4231	0.5241	★ Selected

1.29M Credit card transactions in source dataset

15,000 Analysis subset · fraud oversampled to 15%

0.9% HDBSCAN noise isolation rate

Z-Score Feature Importance · Clusters × 14 Features

Feature Attribution

What Defines a Fraud Cluster?

The diverging Magenta↔Cyan Z-score heatmap plots each of the 14 PCA-derived features against every discovered cluster. Large deviations in the Magenta direction (negative Z) mark feature values far below the dataset mean: the fingerprint of suppressed legitimate-transaction patterns that characterises the fraudulent topology.

Clusters with strong, multi-feature signatures are candidates for targeted fraud rules; sparse signatures indicate ambiguous boundary regions requiring probabilistic scoring.
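
A clusters × features Z-score matrix of this kind can be built by standardising each feature over the whole dataset and then averaging within each cluster. That recipe is an assumption, since the page does not spell out how the heatmap values were computed. The sketch uses random data with one deliberately shifted cluster.

```python
import numpy as np

# Random stand-in: 300 points, 14 features, 3 cluster labels (not the real data).
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 14))
labels = rng.integers(0, 3, size=300)

# Make cluster 2 anomalously low on feature 0 so the heatmap has a signature.
features[labels == 2, 0] -= 3.0

# Standardise each feature globally, then average the z-scores per cluster.
z = (features - features.mean(axis=0)) / features.std(axis=0)
cluster_ids = np.unique(labels)
z_matrix = np.vstack([z[labels == c].mean(axis=0) for c in cluster_ids])

# Rows are clusters, columns are features; strongly negative cells would
# appear at the Magenta end of the diverging colour scale.
row, col = np.unravel_index(z_matrix.argmin(), z_matrix.shape)
print(f"most negative cell: cluster {cluster_ids[row]}, feature {col}")
```

The matrix can be passed to a diverging-colormap heatmap such as Seaborn's `heatmap` with `center=0`.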

Fraud Enrichment Analysis

Noise as Signal

In density-based clustering, points rejected as noise are not errors — they are topologically isolated. The two-panel analysis below shows exactly where HDBSCAN noise points sit in the 2D UMAP projection (Magenta) and how each cluster's fraud enrichment rate compares. The highest-enrichment clusters directly correspond to the manifold's low-density periphery.

HDBSCAN noise analysis and fraud enrichment
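
Per-cluster fraud enrichment can be computed from cluster labels and ground-truth fraud flags. The sketch below treats the noise label -1 as its own group and reports both the raw fraud rate and the lift over the overall rate. The lift definition and the tiny label arrays are illustrative assumptions, not the project's data.

```python
# Toy labels: -1 is HDBSCAN noise; 1 in is_fraud marks a fraudulent transaction.
labels = [0, 0, 0, 0, 1, 1, 1, -1, -1, 2, 2, 2]
is_fraud = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0]

overall_rate = sum(is_fraud) / len(is_fraud)

enrichment = {}
for cluster in sorted(set(labels)):
    flags = [f for lab, f in zip(labels, is_fraud) if lab == cluster]
    fraud_rate = sum(flags) / len(flags)
    # Lift > 1 means the group holds more fraud than the dataset average.
    enrichment[cluster] = (fraud_rate, fraud_rate / overall_rate)

for cluster, (fraud_rate, lift) in enrichment.items():
    name = "noise" if cluster == -1 else f"cluster {cluster}"
    print(f"{name}: fraud rate {fraud_rate:.2f}, lift {lift:.2f}")
```

In this toy set the noise group has the highest lift, which is the pattern the section describes for the low-density periphery.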

UMAP 3D Projection, HDBSCAN, Gaussian Mixture Models, K-Means, Scikit-Learn, Matplotlib · Seaborn, 1.29M Transactions