Pruned but Not Muted: How Frequency-Aware Token Reduction Saves Vision Transformers
Opening — Why this matters now

Vision Transformers (ViTs) are everywhere: classification, segmentation, medical imaging, robotics. But their quadratic attention cost has become a tax on progress. Because attention scales with the square of the sequence length, every extra token adds disproportionately to compute, memory, and latency. Businesses want ViT-level accuracy, but not the bill from the GPU vendor. Token reduction (merging, pruning, squeezing) has been the industry's workaround. Yet these methods quietly erode the very signal ViTs rely on. Self-attention already behaves like a low-pass filter, smoothing token representations toward a common mean; it is the high-frequency differences between tokens that keep them distinguishable. By stripping away that high-frequency structure, token reduction accelerates an internal entropy spiral known as rank collapse: layer by layer, the model forgets how to differentiate tokens at all. ...
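To make rank collapse concrete, here is a minimal sketch of how one might measure it. It computes an entropy-based "effective rank" of a token matrix: when all singular values matter equally the score approaches the token count, and when one direction dominates it approaches 1. The helper name `effective_rank` and the toy matrices are illustrative assumptions, not code from any paper; the shapes 197 and 768 simply mirror a ViT-B/16 on a 224-pixel input (196 patch tokens plus one CLS token, 768-dim embeddings).

```python
import torch

def effective_rank(tokens: torch.Tensor) -> float:
    """Entropy-based effective rank of a token matrix (N tokens x d dims).

    Values near 1.0 mean the tokens have collapsed onto a single shared
    direction; larger values mean the tokens remain distinguishable.
    """
    s = torch.linalg.svdvals(tokens)             # singular values, length min(N, d)
    p = s / s.sum()                              # normalize into a distribution
    entropy = -(p * torch.log(p + 1e-12)).sum()  # Shannon entropy of the spectrum
    return float(torch.exp(entropy))             # exp(entropy) = effective rank

torch.manual_seed(0)

# Healthy token set: 197 tokens in 768 dims with plenty of independent variation.
healthy = torch.randn(197, 768)

# Collapsed token set: every token sits near one shared direction plus tiny noise,
# mimicking what rank collapse looks like in late transformer layers.
direction = torch.randn(1, 768)
collapsed = direction.repeat(197, 1) + 0.01 * torch.randn(197, 768)

print(f"healthy tokens:   effective rank ~ {effective_rank(healthy):.1f}")
print(f"collapsed tokens: effective rank ~ {effective_rank(collapsed):.1f}")
```

The exponential-of-entropy form is a common smooth proxy for matrix rank: unlike a hard threshold on singular values, it degrades gradually, which makes it useful for tracking how token diversity decays layer by layer as tokens are pruned or merged.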