Semantic Proprioception: Teaching Data to Understand Itself

machine learning
embeddings
data structures
How LSH density analysis reveals hidden structure in text collections—from customer support tickets to academic papers—using Krapivin hash tables for O(1) semantic awareness.
Author: Justin Donaldson

Published: November 22, 2025

Just as proprioception lets you sense where your body is in space without looking, semantic proprioception gives data the ability to understand its own internal structure. No manual labeling, no predefined categories—just the data revealing patterns within itself.

I’ve built a live demo that shows this in action across three very different datasets: Twitter customer support conversations, ArXiv research papers, and Hacker News discussions. The same technique discovers meaningful themes in all three, adapting to each domain’s unique semantics.

The Core Insight

Traditional clustering requires you to specify how many clusters you want, tune distance thresholds, or provide seed examples. But what if the data could just tell you what patterns exist?

The key is LSH bucket density. When you hash similar embeddings into buckets using Locality-Sensitive Hashing, the density of each bucket reveals something fundamental.

LSH maps high-dimensional vectors to binary signatures using random hyperplanes:

\[h_i(\mathbf{v}) = \begin{cases} 1 & \text{if } \mathbf{w}_i \cdot \mathbf{v} > 0 \\ 0 & \text{otherwise} \end{cases}\]

where \(\mathbf{w}_i\) is a random hyperplane. Combining \(k\) such hash functions creates a bucket signature. Similar vectors collide in the same bucket with high probability:

\[P(h(\mathbf{u}) = h(\mathbf{v})) = 1 - \frac{\theta(\mathbf{u}, \mathbf{v})}{\pi}\]

where \(\theta(\mathbf{u}, \mathbf{v})\) is the angle between vectors. Closer vectors (smaller angle) → higher collision probability.
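
In code, a bucket signature is just the packed sign bits of those dot products. Here is a minimal sketch; the 16 hyperplanes and 384 dimensions are illustrative choices, while the fixed seed mirrors the approach described later in this post:

import numpy as np

# Minimal random-hyperplane LSH sketch: k sign bits -> one integer bucket id.
# k=16 and dim=384 are illustrative choices, not the demo's exact configuration.
rng = np.random.default_rng(12345)
hyperplanes = rng.standard_normal((16, 384))  # one w_i per hash function

def lsh_signature(v):
    """Pack the bits h_i(v) = [w_i . v > 0] into a single bucket id."""
    bits = hyperplanes @ v > 0
    return int(sum(1 << i for i, b in enumerate(bits) if b))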

The density distribution then tells us:

  • Dense buckets (≥5 items): Common themes, frequently occurring concepts
  • Medium buckets (2-4 items): Boundary cases, transitional concepts
  • Sparse buckets (1 item): Novel or rare content

This isn’t just clustering—it’s the data developing awareness of its own distribution.

Why Krapivin Hash Tables Matter

Traditional LSH implementations have a problem: to find dense buckets, you’d have to scan every bucket and count items. That’s O(n) where n is the number of buckets—expensive and slow.

Enter Krapivin hash tables (Krapivin et al. 2025), which provide O(1) density queries through their hierarchical structure. You can instantly ask: “Which buckets have ≥5 items?” without scanning anything.

This transforms LSH from a search index into a semantic awareness system. The data doesn’t just answer “what’s similar to X?”—it can answer “what patterns exist in me?”
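
You don't need the full Krapivin construction to see the flavor of the idea. The toy sketch below (my own illustration, not the actual data structure) keeps the set of dense buckets up to date on every insert, so the density query itself is a constant-time lookup:

from collections import defaultdict

# Toy illustration only -- not the Krapivin structure. The point: if density is
# tracked incrementally, "which buckets have >= 5 items?" is a set lookup, not a
# scan over every bucket.
class DensityIndex:
    def __init__(self, threshold=5):
        self.threshold = threshold
        self.buckets = defaultdict(list)  # bucket_id -> items
        self.dense = set()                # bucket_ids at or above the threshold

    def insert(self, bucket_id, item):
        self.buckets[bucket_id].append(item)
        if len(self.buckets[bucket_id]) >= self.threshold:
            self.dense.add(bucket_id)     # maintained in O(1) per insert

    def dense_buckets(self):
        return self.dense                 # O(1): no scanning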

Three Datasets, One Technique

The demo shows how the same approach works across wildly different domains:

Twitter Customer Support (1,000 tweets)

Discovered themes: Password resets, billing issues, account access, network problems

The short, action-oriented nature of support tickets creates tight, well-defined clusters. Users express problems in similar ways, leading to high-density buckets around common pain points.

ArXiv Research Papers (1,000 abstracts)

Discovered themes: Deep learning architectures, quantum mechanics, genomics, optimization methods

Academic writing has longer, more varied language, but technical concepts still cluster. Papers about “attention mechanisms” use similar terminology even when discussing different applications.

Hacker News (684 posts)

Discovered themes: AI/ML developments, startup advice, privacy concerns, programming tools

HN posts mix news headlines with discussion text. The clusters reflect both trending topics and perennial themes in the tech community.

How It Works

  1. Embed: Use sentence-transformers to convert text → 384- or 768-dimensional vectors
  2. Hash: Apply LSH with a fixed seed (12345) so embeddings from different files map to the same bucket space
  3. Discover: Query Krapivin hash table for buckets with ≥5 items (O(1) operation)
  4. Label: Use an LLM or keyword extraction to generate semantic labels for each dense bucket
  5. Merge: Combine similar themes using Jaccard similarity on tokenized labels:

\[J(A, B) = \frac{|A \cap B|}{|A \cup B|}\]

where \(A\) and \(B\) are sets of tokens from theme labels. Themes with \(J \geq 0.5\) get merged automatically.
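
Here's a small sketch of that merge step, using hypothetical theme labels; in the demo the labels come from the LLM or keyword-extraction step above:

def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| over token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical labels for illustration.
labels = ["password reset help", "reset account password", "billing invoice error"]
tokens = [set(label.lower().split()) for label in labels]

groups = []  # each group is a list of label indices that merged together
for i, t in enumerate(tokens):
    for group in groups:
        if jaccard(t, tokens[group[0]]) >= 0.5:  # merge threshold from the text above
            group.append(i)
            break
    else:
        groups.append([i])
# groups -> [[0, 1], [2]]: the two "password reset" labels collapse into one theme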

All embeddings are pre-computed (~24 MB total), so the demo runs with zero API costs or inference overhead.
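
For reference, producing comparable embeddings for your own text takes only a couple of lines with sentence-transformers; the model name below is an assumption (one of the MiniLM variants mentioned later in this post):

from sentence_transformers import SentenceTransformer

# Model name is an assumption; all-MiniLM-L6-v2 produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["I can't reset my password", "My bill is wrong"])
print(vectors.shape)  # (2, 384)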

The Composability Advantage

Because we use a fixed LSH seed across all files, the bucket spaces are compatible. This means:

  • Add new data files → just compute their LSH signatures → merge with existing index
  • Remove files → delete their entries from affected buckets
  • Query across multiple datasets → buckets naturally align

Traditional approaches would require rebuilding the entire index when adding data. Krapivin hash tables with fixed seeds enable incremental, compositional updates.
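
In practice that merge is just a concatenation. The sketch below assumes a second index file built with the same seed (the arxiv file name is hypothetical):

import polars as pl

# Both indexes share the same bucket space (same LSH seed), so merging is a
# concatenation rather than a rebuild.
combined = pl.concat([
    pl.scan_parquet("twitter_lsh_index.parquet"),
    pl.scan_parquet("arxiv_lsh_index.parquet"),
])

dense_across_datasets = (combined
    .group_by("bucket_id")
    .agg(pl.len().alias("count"))
    .filter(pl.col("count") >= 5)
    .collect())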

Code Example

Here’s how to query dense buckets directly from the Parquet files:

import polars as pl

# Load dense buckets (≥5 items) from Parquet index
dense = (pl.scan_parquet("twitter_lsh_index.parquet")
    .group_by('bucket_id')
    .agg(pl.len().alias('count'))  # items per bucket
    .filter(pl.col('count') >= 5)
    .sort('count', descending=True)
    .collect())

print(f"Found {len(dense)} dense buckets")
# Found 42 dense buckets

# Get contents of bucket 132 (e.g., "password reset" theme)
bucket_contents = (pl.scan_parquet("twitter_lsh_index.parquet")
    .filter(pl.col('bucket_id') == 132)
    .collect())

print(f"Bucket 132 contains {len(bucket_contents)} items")
# Bucket 132 contains 16 items

The key: none of this touches the raw text or the embeddings. Parquet's columnar format plus Polars' lazy evaluation mean we read only the columns each query needs.

Try It Yourself

Direct link: semantic-proprioception-demo.streamlit.app

Source code: github.com/jdonaldson/semantic-proprioception-demo

Select a dataset, choose an embedding model, and watch themes emerge automatically. Click into any theme to see the actual text samples that cluster together.

You can also:

  • Compare how different models (MiniLM-L3/L6/L12, MPNet-base) cluster the same data
  • Adjust the semantic merging threshold to consolidate or separate themes
  • Search for similar items using both brute-force cosine similarity and LSH-accelerated lookup

What This Enables

Semantic proprioception isn’t just about visualization—it unlocks new capabilities:

Hallucination detection: If an LLM generates text with high confidence but low embedding density (sparse bucket), it’s likely hallucinating content outside its training distribution.

Active learning: Sample from sparse regions (novel concepts) or high-entropy buckets (boundary cases) to maximize labeling efficiency.

Content gap analysis: Compare query density (what users search for) vs. corpus density (what you have) to find opportunities.

Concept drift detection: Track density distributions over time windows—sudden shifts indicate changing semantics.
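
As a concrete example of the sparse-bucket check behind the hallucination and novelty ideas above, here's an illustrative sketch (the function and file names are my own, not part of the demo) that looks up how dense a given bucket is in an existing index:

import polars as pl

# Illustrative sketch: given the bucket id of a new piece of text (embedded and
# hashed with the same model and LSH seed as the index), count how many corpus
# items share its bucket. A count of 0 or 1 suggests a sparse, novel region.
def bucket_density(bucket_id, index_path="twitter_lsh_index.parquet"):
    return (pl.scan_parquet(index_path)
        .filter(pl.col("bucket_id") == bucket_id)
        .select(pl.len())
        .collect()
        .item())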

The Research Behind It

Key papers:

  • Krapivin et al. (2025), “Optimal Bounds for Open Addressing Without Reordering”
  • Indyk & Motwani (1998), “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality”

Technical Details

Built with:

  • Streamlit for the interactive UI
  • Polars for fast DataFrame operations
  • sentence-transformers (HuggingFace) for embeddings
  • Krapivin hash tables (Rust + Python bindings) for O(1) density queries
  • Parquet (zstd compression) for efficient storage

Total dataset size: ~24 MB (1,000 tweets + 1,000 papers + 684 HN posts, 4 models each)


The key insight: data can understand itself. Give it the right structure (LSH + Krapivin), and patterns emerge without manual intervention. Not clustering, not search—semantic self-awareness.

Try the demo and see what patterns hide in your own data.
