⚡ The Condensate Theorem
Transformers Are O(n), Not O(n²) → 157x Speedup, 100% Accuracy
The Discovery: Trained language models concentrate >99% of attention mass on a sparse, predictable manifold. We don't approximate. We don't retrain. We simply skip the computation the model would ignore anyway.
$$\mathcal{C}_i = \underbrace{\{0\}}_{\text{Anchor}} \cup \underbrace{\{\text{Window}\}}_{\text{Local}} \cup \underbrace{\text{Top-}k}_{\text{Needles}}$$
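A minimal sketch of how this set could be built for a single query position (PyTorch; the `window`/`top_k` values and the score-based needle selection are illustrative assumptions, not the production selection logic):

```python
import torch

def condensate_indices(scores_i: torch.Tensor, i: int, window: int = 64, top_k: int = 16) -> torch.Tensor:
    """scores_i: causal attention logits of query i over keys 0..i, shape (i+1,)."""
    n = scores_i.shape[0]                                                   # keys visible to query i
    device = scores_i.device
    anchor = torch.tensor([0], device=device)                               # the anchor token
    local = torch.arange(max(0, i - window + 1), i + 1, device=device)      # recent local window
    # Needles: highest-scoring keys outside the anchor/window.
    # (Assumption: score-based selection; the production rule is not published.)
    mask = torch.ones(n, dtype=torch.bool, device=device)
    mask[anchor] = False
    mask[local] = False
    remaining = mask.nonzero(as_tuple=True)[0]
    k = min(top_k, remaining.numel())
    needles = remaining[scores_i[remaining].topk(k).indices] if k > 0 else remaining[:0]
    return torch.unique(torch.cat([anchor, local, needles]))
```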
Try it: Increase the filler multiplier to see how waste GROWS with sequence length!
⏱️ Note: This demo runs full O(n²) attention to prove the theorem mathematically.
The optimized Triton kernel (which achieves the 157x speedup) is not included here; it's available under a commercial license.
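The coverage check behind that claim can be sketched as follows, reusing the `condensate_indices` helper above: compute the full attention distribution, then sum the probability mass that falls inside each C_i (the actual validation script may differ in details):

```python
import torch

def condensate_coverage(q: torch.Tensor, k: torch.Tensor, window: int = 64, top_k: int = 16) -> float:
    """q, k: (n, d) queries/keys for one head. Returns the mean attention mass
    that falls inside C_i, averaged over causal query positions."""
    n, d = q.shape
    scores = (q @ k.T) / d ** 0.5                                       # full O(n^2) logits
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool, device=q.device))
    probs = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1)  # full attention distribution
    covered = []
    for i in range(n):
        idx = condensate_indices(scores[i, : i + 1], i, window, top_k)
        covered.append(probs[i, idx].sum().item())                      # mass captured by C_i
    return sum(covered) / n
```

Random Q/K won't show the effect: the >99% figure is a statement about trained models, so the inputs should come from a real model's attention layers.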
🔧 Demo Simplification: uses a fixed window=64 and top_k=16 for all layers.
The production kernel uses layer-adaptive selection (early layers need broader context, late layers are focused).
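For illustration only, a hypothetical per-layer schedule could look like the sketch below; the actual production schedule is not published, and the numbers here are made up:

```python
def layer_params(layer: int, num_layers: int, base_window: int = 64, base_top_k: int = 16):
    """Hypothetical schedule: wider window and more needles in early layers,
    tapering to the base values in the final layer."""
    frac = layer / max(1, num_layers - 1)        # 0.0 at the first layer, 1.0 at the last
    window = int(base_window * (2.0 - frac))     # e.g. 128 -> 64 over depth
    top_k = int(base_top_k * (2.0 - frac))       # e.g. 32 -> 16 over depth
    return window, top_k
```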
Build Your Test Prompt
Measured Speedups (RTX 4090)
| Sequence Length | Full Attention | Sparse Attention | Speedup |
|---|---|---|---|
| 4,096 | 0.53 ms | 0.07 ms | 7.5x |
| 16,384 | 3.95 ms | 0.17 ms | 23x |
| 65,536 | 61.5 ms | 0.76 ms | 81x |
| 131,072 | 228 ms | 1.45 ms | 157x |
| 1,000,000 | ~14.6 s | ~11.6 ms | 1,257x |
The longer the sequence, the more dramatic the savings!
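The table's numbers come from the proprietary Triton kernel on an RTX 4090. The rough PyTorch sketch below will not reproduce them; it only illustrates the O(n²)-versus-O(n·k) gap, and the shapes, dtypes, and random stand-in index set are all assumptions:

```python
import time
import torch

def time_full_vs_sparse(n: int = 16384, d: int = 64, window: int = 64, top_k: int = 16):
    device = "cuda"
    q = torch.randn(n, d, device=device, dtype=torch.float16)
    k = torch.randn(n, d, device=device, dtype=torch.float16)
    v = torch.randn(n, d, device=device, dtype=torch.float16)

    def full():
        # Full causal attention over all n keys: O(n^2) work.
        return torch.nn.functional.scaled_dot_product_attention(
            q[None, None], k[None, None], v[None, None], is_causal=True)

    def sparse():
        # Naive condensate attention: each query attends to |C_i| = 1 + window + top_k
        # keys. Random indices stand in for the real C_i, since only the amount of
        # work matters for this timing illustration.
        m = 1 + window + top_k
        idx = torch.randint(0, n, (n, m), device=device)
        ks, vs = k[idx], v[idx]                              # (n, m, d) gathers
        s = (ks @ q.unsqueeze(-1)).squeeze(-1) / d ** 0.5    # (n, m) logits
        return (s.softmax(-1).unsqueeze(1) @ vs).squeeze(1)  # (n, d) outputs

    for name, fn in [("full", full), ("sparse", sparse)]:
        fn(); torch.cuda.synchronize()                       # warm-up
        t0 = time.perf_counter()
        fn(); torch.cuda.synchronize()
        print(f"{name:6s}: {(time.perf_counter() - t0) * 1e3:.2f} ms")
```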
Links
- GitHub: Validation Scripts → run `python validate.py` yourself
- Production Kernel License: jorgeruizwilliams@gmail.com
© 2026 Jorge L. Ruiz Williams / NaNZeta LLC | The theorem is MIT licensed. The optimized Triton kernel is proprietary.