WikiPulse.

A real-time dashboard for unsupervised topic discovery. The pipeline ingests Wikipedia's live edit stream (~150 edits/min), computes 384-dimensional sentence embeddings of article summaries with all-MiniLM-L6-v2, clusters semantically related articles with HDBSCAN, and projects the topic landscape to 2D via Procrustes-aligned UMAP. Everything below is live: the data refreshes every 30 seconds from a production backend.

Edits / Min
Active Articles
Clusters
Embedded
UMAP Projection
Topic Treemap (size = articles, color = momentum)
Hourly Activity Heatmap (rows = clusters, columns = hours, intensity = edits)
Discovered Topics (sorted by article count)

Each topic is named after its most representative article, the one closest to the cluster centroid in embedding space. The keywords below each title are the top c-TF-IDF terms, those that most strongly distinguish the cluster from all others. Hover over any card to see the full title.
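The naming rule above can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual code: the function name `representative_index` is hypothetical, and it assumes "closest" means smallest Euclidean distance to the mean vector. For unit-length embeddings this gives the same ranking as highest dot product with the centroid.

```python
import numpy as np

def representative_index(embeddings: np.ndarray) -> int:
    """Return the row index of the member closest to the cluster centroid.

    `embeddings` holds one row per article in a single cluster.
    Assumption: Euclidean distance to the mean vector defines "closest".
    """
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    return int(np.argmin(distances))

# Toy cluster of three 2D points; the middle point is nearest the centroid.
titles = ["Article A", "Article B", "Article C"]
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
topic_name = titles[representative_index(points)]
```

With real data, `points` would be the 384-dim vectors of one cluster's articles and `titles` their page titles.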

Pipeline Methodology

01. Stream Ingestion

Wikimedia EventStreams SSE → async queue → hourly DuckDB rollups. Keeps only enwiki mainspace non-bot edits. Reconnects with exponential backoff when the stream drops.
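As a rough sketch of the filtering and reconnection logic described above, the two helpers below use only the standard library. The names `parse_event` and `backoff_delays` are hypothetical, and the field names (`wiki`, `namespace`, `type`, `bot`) are assumed to match the public `recentchange` event schema. The base delay and cap are illustrative values, not the project's settings.

```python
import json
from typing import Iterator, Optional

def parse_event(sse_data: str) -> Optional[dict]:
    """Return the event if it is an enwiki mainspace non-bot edit, else None."""
    try:
        event = json.loads(sse_data)
    except json.JSONDecodeError:
        return None  # skip malformed or partial payloads
    if not isinstance(event, dict):
        return None
    keep = (
        event.get("wiki") == "enwiki"
        and event.get("namespace") == 0      # 0 is the article (main) namespace
        and event.get("type") == "edit"      # assumption: page creations are excluded
        and not event.get("bot", False)
    )
    return event if keep else None

def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield reconnect delays that double after each failure, up to a cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)
```

A consumer loop would call `parse_event` on each SSE `data:` payload, push non-`None` results onto the async queue, and sleep for `next(delays)` seconds before each reconnect attempt, creating a fresh generator once a connection succeeds.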

02. Semantic Embedding

Article summaries are encoded to 384-dim unit vectors via all-MiniLM-L6-v2 (distilled Sentence-BERT). Vectors are L2-normalized so that cosine similarity reduces to a plain dot product. Embeddings are held in a 24h LRU cache.
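The reason for L2-normalizing can be shown with a short sketch. Random vectors stand in for real model output here, and the helper names `l2_normalize` and `cosine_matrix` are hypothetical. Once every row has unit length, one matrix product yields all pairwise cosine similarities.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

def cosine_matrix(unit_vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity; valid only for unit-length rows."""
    return unit_vectors @ unit_vectors.T

# Stand-in for four 384-dim embeddings (random, not real model output).
rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 384))
unit = l2_normalize(raw)
sims = cosine_matrix(unit)
```

In the sentence-transformers library, passing `normalize_embeddings=True` to `encode` applies the same normalization at encoding time.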

03. Density-Based Clustering

HDBSCAN chosen over K-Means/GMM: no predefined cluster count, handles arbitrary shapes, naturally isolates noise. c-TF-IDF extracts discriminative labels per cluster.
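To illustrate how c-TF-IDF can turn cluster assignments into labels, here is a minimal standard-library sketch. It assumes the formulation popularized by BERTopic: term frequency within the cluster, weighted by log(1 + A / f_t), where A is the average word count per cluster and f_t is the term's count across all clusters. The source does not say which variant the pipeline uses. The function name `c_tf_idf` and the whitespace tokenizer are hypothetical simplifications, and the clustering step itself is taken as given.

```python
import math
from collections import Counter

def c_tf_idf(cluster_docs: dict[str, list[str]], top_n: int = 3) -> dict[str, list[str]]:
    """Return the top_n highest-scoring terms for each cluster."""
    if not cluster_docs:
        return {}
    # Treat each cluster as one concatenated document and count its terms.
    counts = {
        cluster: Counter(word for doc in docs for word in doc.lower().split())
        for cluster, docs in cluster_docs.items()
    }
    total = Counter()
    for term_counts in counts.values():
        total.update(term_counts)
    avg_words = sum(total.values()) / len(counts)  # A: average words per cluster

    labels = {}
    for cluster, term_counts in counts.items():
        n_words = sum(term_counts.values())
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in term_counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        labels[cluster] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return labels

clusters = {
    "space": ["the mars rover launch", "the rover lands on mars"],
    "football": ["the cup final goal", "the late goal wins the cup"],
}
labels = c_tf_idf(clusters, top_n=2)
```

In this toy input the word "the" appears in both clusters, so its weight is reduced and the cluster-specific terms rank above it.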

04. Procrustes-Aligned UMAP

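384D → 2D projection via UMAP. Because a UMAP layout's orientation is arbitrary from run to run, each new layout is aligned to the previous one with an SVD-based Procrustes rotation fitted on overlapping anchors (points present in both runs). This keeps the layout stable, so remaining cluster movement mostly reflects real topic shifts rather than layout rotation.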
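The alignment step can be sketched with the classic orthogonal Procrustes solution. This is an illustration under stated assumptions, not the project's implementation: the function name `procrustes_align` is hypothetical, the fit allows rotation or reflection plus translation but no scaling, and anchors are assumed to be supplied as matching rows from the new and previous layouts.

```python
import numpy as np

def procrustes_align(new_layout: np.ndarray,
                     new_anchors: np.ndarray,
                     old_anchors: np.ndarray) -> np.ndarray:
    """Map new_layout into the previous layout's frame using shared anchors.

    new_anchors[i] and old_anchors[i] are the 2D positions of the same
    article in the new and previous runs.
    """
    mu_new = new_anchors.mean(axis=0)
    mu_old = old_anchors.mean(axis=0)
    # Cross-covariance between the centred anchor sets.
    cross = (new_anchors - mu_new).T @ (old_anchors - mu_old)
    u, _, vt = np.linalg.svd(cross)
    rotation = u @ vt  # orthogonal R minimizing ||A R - B|| in the Frobenius norm
    return (new_layout - mu_new) @ rotation + mu_old

# Demo: the "new" run is the old layout rotated by 0.7 rad and shifted.
rng = np.random.default_rng(42)
old = rng.normal(size=(6, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
new = old @ rot + np.array([3.0, -1.0])
# Only the first four points are treated as anchors shared by both runs.
aligned = procrustes_align(new, new[:4], old[:4])
```

In the demo the transform is exactly rigid, so `aligned` coincides with `old`. On real UMAP output the fit is only a least-squares match, and the residual movement is what the dashboard would display.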

Data source & license: Edit data is sourced in real time from the Wikimedia EventStreams API. Article summaries are fetched from the Wikipedia REST API. All Wikipedia content is available under the Creative Commons Attribution-ShareAlike 4.0 license. This project uses the data for non-commercial research and visualization purposes only. Wikimedia® is a registered trademark of the Wikimedia Foundation.
HDBSCAN UMAP Sentence-BERT c-TF-IDF Procrustes DuckDB Supabase FastAPI WebSocket Docker asyncio