WikiPulse.
A real-time unsupervised topic discovery dashboard. This pipeline ingests Wikipedia's live edit stream (~150 edits/min), computes 384-dim sentence embeddings with MiniLM-L6-v2, clusters semantically related articles using HDBSCAN, and projects the topic landscape to 2D via Procrustes-aligned UMAP. Everything below is live — the data refreshes every 30 seconds from a production backend.
Each topic is named after its most representative article — the one closest to the cluster centroid in embedding space. The keywords below each title are the top c-TF-IDF terms that uniquely characterize the cluster compared to all others. Hover any card for the full title.
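As a rough illustration of the naming rule, the sketch below picks the member of a cluster that lies nearest to the cluster centroid. It assumes Euclidean distance and a plain mean centroid; the function name `representative_index` and the toy data are hypothetical, not taken from the WikiPulse backend.

```python
import numpy as np

def representative_index(embeddings: np.ndarray, labels: np.ndarray, cluster_id: int) -> int:
    """Return the row index of the cluster member closest to the cluster centroid."""
    members = np.flatnonzero(labels == cluster_id)       # rows belonging to this cluster
    centroid = embeddings[members].mean(axis=0)          # mean of the member vectors
    dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
    return int(members[np.argmin(dists)])                # map back to the global row index

# Toy 2D "embeddings": three points in cluster 0, one point in cluster 1.
toy_embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
toy_labels = np.array([0, 0, 0, 1])
print(representative_index(toy_embeddings, toy_labels, 0))
```

With unit-length embeddings, ranking members by Euclidean distance to a fixed centroid gives the same order as ranking them by dot product with that centroid, so either measure would pick the same article.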
Pipeline Methodology
01. Stream Ingestion
Wikimedia EventStreams SSE → async queue → hourly DuckDB rollups. Keeps only non-bot edits to enwiki mainspace articles and reconnects to the stream with exponential backoff.
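A minimal sketch of the filter and the reconnect schedule, under stated assumptions. The field names (`wiki`, `namespace`, `type`, `bot`) follow the public `recentchange` event schema. The helper names, the 1-second base delay, and the 60-second cap are illustrative guesses, not the production settings.

```python
import random

def is_wanted_edit(event: dict) -> bool:
    """Keep only non-bot edits to English Wikipedia articles (namespace 0)."""
    return (
        event.get("wiki") == "enwiki"
        and event.get("namespace") == 0
        and event.get("type") == "edit"
        and not event.get("bot", False)
    )

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0, jitter: float = 0.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based), doubling up to `cap`."""
    delay = min(cap, base * (2 ** attempt))
    # Optional random jitter (a fraction of the delay) spreads out simultaneous reconnects.
    return delay + random.uniform(0.0, jitter * delay)

sample = {"wiki": "enwiki", "namespace": 0, "type": "edit", "bot": False, "title": "Example"}
print(is_wanted_edit(sample), [backoff_delay(n) for n in range(4)])
```

Capping the delay keeps a long outage from pushing the wait between reconnect attempts beyond a minute.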
02. Semantic Embedding
Article summaries encoded to 384-dim unit vectors via all-MiniLM-L6-v2 (a distilled MiniLM model from the Sentence-Transformers library). L2-normalized so cosine similarity reduces to a plain dot product. 24h LRU cache.
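Why normalization matters can be shown in a few lines. The sketch below uses random vectors as stand-ins for real MiniLM embeddings (an assumption made to keep the example dependency-free) and checks that the dot product of unit vectors equals their cosine similarity.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, guarding against division by zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
raw = rng.normal(size=(5, 384))   # stand-in for five 384-dim embeddings
unit = l2_normalize(raw)

# For unit vectors, one matrix product yields all pairwise cosine similarities.
sim = unit @ unit.T
print(sim.shape)
```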
03. Density-Based Clustering
HDBSCAN chosen over K-Means/GMM: no predefined cluster count, handles arbitrary shapes, naturally isolates noise. c-TF-IDF extracts discriminative labels per cluster.
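To make the labeling step concrete, here is a small sketch of one common c-TF-IDF formulation (the one popularized by BERTopic): term frequency within the cluster times log(1 + average words per cluster / total frequency of the term). Whether WikiPulse uses exactly this weighting is an assumption. The whitespace tokenizer, the function name `ctfidf_top_terms`, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def ctfidf_top_terms(cluster_docs: dict, top_n: int = 3) -> dict:
    """Return the top_n highest-scoring terms per cluster under c-TF-IDF."""
    # Treat each cluster as one merged document (bag of lowercase tokens).
    bags = {c: Counter(" ".join(docs).lower().split()) for c, docs in cluster_docs.items()}
    total = Counter()
    for bag in bags.values():
        total.update(bag)
    avg_words = sum(total.values()) / len(bags)   # average token count per cluster

    top = {}
    for c, bag in bags.items():
        n_words = sum(bag.values())
        scores = {
            term: (tf / n_words) * math.log(1 + avg_words / total[term])
            for term, tf in bag.items()
        }
        # Sort by score descending, breaking ties alphabetically for determinism.
        top[c] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return top

toy = {
    0: ["football match football league news", "league cup football"],
    1: ["election vote election news", "vote parliament election"],
}
top_terms = ctfidf_top_terms(toy)
print(top_terms)
```

Terms shared across clusters (here "news") are pushed down by the logarithmic factor, which is what makes the surviving keywords discriminative.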
04. Procrustes-Aligned UMAP
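384D → 2D projection via UMAP. Each new layout is aligned to the previous one with an SVD-based Procrustes rotation fitted on the anchor points present in both runs. This keeps the layout stable between runs, so cluster movements reflect real topic shifts rather than arbitrary changes in the projection's orientation.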
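The alignment step can be sketched with the classic orthogonal Procrustes solution. The example assumes translation plus an orthogonal transform (rotation, possibly with a reflection) and no rescaling. The function name `procrustes_align` and the synthetic layouts are hypothetical, and the real backend may differ in details.

```python
import numpy as np

def procrustes_align(new_xy: np.ndarray, new_anchors: np.ndarray, old_anchors: np.ndarray) -> np.ndarray:
    """Map new_xy into the old layout's frame using anchors seen in both runs."""
    mu_new = new_anchors.mean(axis=0)
    mu_old = old_anchors.mean(axis=0)
    # Cross-covariance between the centered anchor sets.
    m = (new_anchors - mu_new).T @ (old_anchors - mu_old)
    u, _, vt = np.linalg.svd(m)
    r = u @ vt   # orthogonal matrix minimizing ||A r - B|| in the Frobenius norm
    return (new_xy - mu_new) @ r + mu_old

# Synthetic check: rotate and shift an "old" layout, then recover it.
rng = np.random.default_rng(1)
old_layout = rng.normal(size=(20, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
new_layout = old_layout @ rot + np.array([3.0, -2.0])

# Only the first 10 points are treated as shared anchors.
aligned = procrustes_align(new_layout, new_layout[:10], old_layout[:10])
print(np.abs(aligned - old_layout).max())
```

Allowing reflections is a deliberate choice in this sketch, since an unconstrained 2D projection can come out mirrored from one run to the next.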