Images are expensive. Not emotionally, although some product managers do try. They are expensive because modern visual models turn an image into a sequence of tokens, then let those tokens attend to one another. In a Vision Transformer, more tokens usually mean more detail, but also more attention cost. The obvious response is to reduce the number of tokens.

The less obvious problem is that not every token is expensive in the same way.

The paper “Frequency-Aware Token Reduction for Efficient Vision Transformer” makes a useful correction to the usual efficiency story: token reduction should not be treated as a simple cleanup operation that removes “redundant” patches. It is also a frequency operation. Delete or merge the wrong tokens, and the model may become faster while quietly losing the very local detail that made it useful in the first place. [1]

That is the paper’s central contribution. It reframes token pruning as frequency management: preserve tokens that carry high-frequency variation, compress low-frequency content into compact direct-current tokens, and lightly modify attention so the remaining high-frequency tokens do not get smoothed into the same representation anyway. The result is not just another “your ViT, but faster” trick. The interesting part is diagnostic: it explains why some token-reduction methods work better in some layers, models, and image sizes than others.

Token reduction is not just throwing away patches

A Vision Transformer sees an image as a sequence of patch tokens. Self-attention then mixes these tokens through an attention matrix. Since the cost of that attention matrix grows quadratically with the number of tokens, reducing tokens is one of the most direct routes to cheaper inference.
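To make the scaling concrete, here is a minimal sketch that counts the multiply-accumulates in the two attention matrix products of one layer. The function name and the example sizes (197 tokens, width 384) are illustrative assumptions, and the count deliberately ignores the projections, the softmax, and the MLP.

```python
def attention_matmul_macs(num_tokens: int, dim: int) -> int:
    """MACs for Q @ K^T plus A @ V in one self-attention layer.

    Each product costs num_tokens * num_tokens * dim multiply-accumulates
    (summed over heads), so the total is quadratic in the token count.
    """
    return 2 * num_tokens * num_tokens * dim


full = attention_matmul_macs(197, 384)     # e.g. 196 patches + CLS (assumed sizes)
reduced = attention_matmul_macs(138, 384)  # about 30% fewer tokens
print(f"attention MACs kept: {reduced / full:.2f}")
```

Dropping roughly 30% of the tokens leaves about half of the attention matmul cost, which is why token count is such an attractive lever.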

Existing methods usually fall into two families.

One family merges similar tokens. If two image patches look redundant, combine them. The other family prunes tokens. If a token seems unimportant, drop it. Both can reduce multiply-accumulate operations and improve throughput. So far, so sensible.

The paper’s objection is that this view misses what self-attention is already doing to the signal.

Repeated self-attention tends to behave like a low-pass filter. In plain terms, it preserves the broad average component of the feature map while weakening differences among tokens. This is related to rank collapse: token representations become increasingly similar, sometimes approaching a state where different patches carry nearly the same representation. A model that reaches that condition has not become wise and minimalist. It has become smooth.

The paper uses the high-frequency component of a feature matrix to formalize this idea. If $H_f[X]$ denotes the mean-centered, high-pass part of token features, self-attention can reduce it roughly as:

$$ \lVert H_f[\text{SA}(X)] \rVert_F \leq \lambda \lVert H_f[X] \rVert_F $$

where $\lambda < 1$ captures the collapsing tendency.
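The contraction is easy to see on a toy case. The sketch below is an illustration under strong assumptions, not the paper's analysis: it replaces softmax attention with a hand-built row-stochastic matrix that blends the identity with uniform averaging, and it ignores value projections. The names `high_pass`, `toy_attention`, and `alpha` are hypothetical.

```python
import numpy as np


def high_pass(x: np.ndarray) -> np.ndarray:
    """H_f[X]: subtract the mean token (the DC component) from every token."""
    return x - x.mean(axis=0, keepdims=True)


def toy_attention(num_tokens: int, alpha: float) -> np.ndarray:
    """Toy row-stochastic 'attention': mostly identity, plus uniform averaging.

    This is an illustrative stand-in, not a softmax attention map.
    """
    uniform = np.full((num_tokens, num_tokens), 1.0 / num_tokens)
    return (1.0 - alpha) * np.eye(num_tokens) + alpha * uniform


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))          # 16 tokens, 8 channels
attn = toy_attention(16, alpha=0.3)

before = np.linalg.norm(high_pass(x))         # Frobenius norm by default
after = np.linalg.norm(high_pass(attn @ x))
print(after / before)  # equals 1 - alpha for this toy matrix
```

For this toy matrix the shrink factor is exactly $1 - \alpha$, so stacking layers multiplies the high-frequency norm by that factor each time. Real attention maps are messier, but the direction of travel is the same.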

Now insert token reduction into this picture. A pruning or merging matrix $M$ reduces the token set before attention. The paper argues that this operation can further reduce the high-frequency component:

$$ \lVert H_f[\text{SA}(MX)] \rVert_F \leq \lVert H_f[\text{SA}(X)] \rVert_F $$

The equation is not there to decorate the method with algebraic perfume. It explains the failure mode. If a token-reduction method suppresses token diversity, it can accelerate the same smoothing process that already threatens deep attention blocks.
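A simplified version of this effect can be checked without any attention at all. The sketch below assumes a deliberately naive merging rule (average each adjacent pair of tokens) and compares the mean-centered energy before and after; the helper names are hypothetical and the rule is not the one used by any specific method.

```python
import numpy as np


def high_pass(x: np.ndarray) -> np.ndarray:
    """Mean-centered (high-frequency) part of a token matrix."""
    return x - x.mean(axis=0, keepdims=True)


def merge_adjacent_pairs(x: np.ndarray) -> np.ndarray:
    """Average tokens (0,1), (2,3), ...; needs an even number of tokens."""
    num_tokens, dim = x.shape
    return x.reshape(num_tokens // 2, 2, dim).mean(axis=1)


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
merged = merge_adjacent_pairs(x)  # 8 tokens remain

print(np.linalg.norm(high_pass(x)), np.linalg.norm(high_pass(merged)))
```

Averaging equal-sized groups keeps the mean token unchanged but discards all variation inside each group, so the high-frequency norm can only go down.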

This is the misconception worth killing: token reduction is not neutral compression. It changes the signal geometry of the model.

The mechanism: keep the detail tokens, compress the smooth ones

The paper divides tokens into two sets: high-frequency tokens and low-frequency tokens.

High-frequency tokens are the ones contributing more to local variation. In images, these are often associated with boundaries, textures, and discriminative details. Low-frequency tokens carry more of the broad average component of the feature map, closer to the direct-current, or DC, signal.

A naive implementation would calculate the DC signal directly, then compare every token against it. That would defeat the purpose by adding too much computation. Instead, the authors decompose the attention map $A$ into a low-frequency component and a high-frequency residual:

$$ A^{LP} = \frac{1}{n}\mathbf{1}\mathbf{1}^T $$
$$ A^{HP} = A - A^{LP} $$

Here $n$ is the number of tokens and $\mathbf{1}$ is the all-ones column vector.

The low-frequency component broadcasts the mean token representation. The high-frequency residual captures token differences. The method then scores tokens by their column-wise contribution to $A^{HP}$ across heads and queries. Tokens with the highest scores are treated as high-frequency tokens. Tokens with the lowest scores are treated as low-frequency tokens.
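A minimal sketch of that scoring step follows. The aggregation rule is an assumption: it sums the signed residual over queries and averages over heads, and the paper's exact rule may differ. The function names and the toy attention map are hypothetical.

```python
import numpy as np


def high_frequency_scores(attn: np.ndarray) -> np.ndarray:
    """Score each key token by its column mass in A^HP = A - (1/n) 11^T.

    attn has shape (heads, n, n) with rows summing to 1. Summing the signed
    residual over queries and averaging over heads is an assumed scoring
    rule; the paper's exact aggregation may differ.
    """
    n = attn.shape[-1]
    attn_hp = attn - 1.0 / n                  # subtract the uniform low-pass map
    return attn_hp.sum(axis=1).mean(axis=0)   # (heads, n, n) -> (n,)


def split_tokens(scores: np.ndarray, num_keep: int):
    """Return (high-frequency indices, low-frequency indices)."""
    order = np.argsort(-scores)               # highest score first
    return np.sort(order[:num_keep]), np.sort(order[num_keep:])


# Toy attention: every query uses the same key preferences.
logits = np.array([0.0, 2.0, 0.5, 3.0, -1.0, 1.0])
row = np.exp(logits) / np.exp(logits).sum()
attn = np.tile(row, (2, 6, 1))                # 2 heads, 6 queries, 6 keys

scores = high_frequency_scores(attn)
hf_idx, lf_idx = split_tokens(scores, num_keep=3)
print(hf_idx, lf_idx)
```

Because the attention map is already computed by the layer, the score costs one subtraction and a reduction rather than a separate pass over the features.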

From there, the method does three things.

First, it preserves the high-frequency tokens. This protects local detail.

Second, it aggregates the low-frequency tokens into one or more DC tokens. This avoids simply throwing away the broad context. In early layers, where locality still matters, the authors use local DC tokens by grouping spatially adjacent patches. Later layers can use more global aggregation because the features are already more collapsed.

Third, it modifies attention with small learnable reweighting terms. The revised attention matrix gives the model a way to emphasize high-frequency tokens and adjust attention to DC tokens. This matters because a DC token is an average of several tokens, and the attention assigned to that average may not naturally match the combined attention that the individual tokens would have received.
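The mismatch described in the third step can be shown with one query and a handful of logits. The sketch below assumes three identical smooth tokens, so their averaged DC token has the same logit. The paper learns its reweighting terms; the fixed log-size offset used here is the proportional-attention correction from ToMe, swapped in only because it makes the gap easy to see.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


# One query attends to 2 high-frequency tokens and 3 identical smooth tokens.
hf_logits = np.array([1.5, 0.2])
smooth_logit, group_size = 0.4, 3

full = softmax(np.concatenate([hf_logits, np.full(group_size, smooth_logit)]))
mass_on_smooth = full[2:].sum()

# Replace the 3 smooth tokens by one averaged DC token (same logit).
naive = softmax(np.append(hf_logits, smooth_logit))[-1]

# Size-aware correction: add log(group size) to the DC token's logit.
corrected = softmax(np.append(hf_logits, smooth_logit + np.log(group_size)))[-1]

print(mass_on_smooth, naive, corrected)
```

Without any correction the DC token receives less attention than the tokens it replaced did in total. A learnable term gives the model room to fix that, and to go further if fine-tuning prefers it.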

The key design is therefore not “prune harder.” It is:

| Component | What it does | Why it matters |
|---|---|---|
| High-frequency token selection | Keeps tokens that contribute more to token-level variation | Protects boundaries, textures, and discriminative patch differences |
| DC token aggregation | Compresses low-frequency tokens into compact summary tokens | Preserves broad context instead of deleting it |
| Local DC tokens | Uses spatial grouping in earlier layers | Avoids losing local structure before the model has built higher-level features |
| Attention reweighting | Adjusts attention for HF and DC tokens | Prevents the remaining representation from collapsing too easily |

This is a clean mechanism. It is also slightly annoying, in the good scientific way: it says the cheap answer was not quite wrong, just under-specified.
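The first two steps of the mechanism can be sketched in a few lines. This is a simplified illustration, assuming the high-frequency scores are already available, a single global DC token, and no CLS token; `reduce_tokens` is a hypothetical name, not the authors' code.

```python
import numpy as np


def reduce_tokens(x: np.ndarray, scores: np.ndarray, num_keep: int) -> np.ndarray:
    """Keep the num_keep highest-scoring tokens and append one global DC token.

    x: (n, d) patch tokens; scores: (n,) high-frequency scores (assumed given).
    The DC token is the plain mean of the tokens that were not kept.
    """
    order = np.argsort(-scores)
    keep_idx = np.sort(order[:num_keep])   # preserve original spatial order
    low_idx = order[num_keep:]
    dc_token = x[low_idx].mean(axis=0, keepdims=True)
    return np.concatenate([x[keep_idx], dc_token], axis=0)


x = np.arange(12, dtype=float).reshape(6, 2)       # 6 toy tokens, 2 channels
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.0, 0.7])
out = reduce_tokens(x, scores, num_keep=3)
print(out)
```

Six tokens become four: three preserved as-is and one summary of the rest. Nothing is silently dropped, which is the difference from plain pruning.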

The main results: lower compute without the usual accuracy tax

The main classification evidence is on ImageNet-1K with pretrained ViT-family models fine-tuned for 30 epochs. The default reduction setup applies token reduction at the 4th, 7th, and 10th layers, reducing token count by 30% at each selected layer.
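The default schedule is easy to tabulate. The sketch below assumes 196 patch tokens (a 224-pixel image with 16-pixel patches), 12 layers, reduction applied before the listed layer runs, and rounding to the nearest integer. It ignores the CLS token and any DC tokens that are added back, so the counts are approximate.

```python
def token_schedule(num_patches=196, num_layers=12,
                   reduce_at=(4, 7, 10), keep_percent=70):
    """Patch-token count seen by each layer (1-indexed), ignoring CLS and DC tokens.

    Assumes the reduction happens before the listed layer runs and that
    counts are rounded to the nearest integer; both are illustrative choices.
    """
    counts, n = [], num_patches
    for layer in range(1, num_layers + 1):
        if layer in reduce_at:
            n = (n * keep_percent + 50) // 100   # integer round-half-up
        counts.append(n)
    return counts


print(token_schedule())
# [196, 196, 196, 137, 137, 137, 96, 96, 96, 67, 67, 67]
```

Under these assumptions the average layer sees 124 of the original 196 patch tokens, about 63%. Since most ViT compute scales linearly with token count, that is in the same range as the reported MACs reductions.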

The headline pattern is consistent: the proposed method cuts MACs substantially while preserving, and sometimes slightly improving, top-1 accuracy.

| Model | Baseline (top-1 / GMACs) | Frequency-aware (top-1 / GMACs) | Practical interpretation |
|---|---|---|---|
| DeiT-T | 72.2% / 1.3 | 72.3% / 0.8 | Tiny model keeps accuracy while reducing compute |
| DeiT-S | 79.8% / 4.6 | 79.8% / 3.0 | Same reported accuracy with roughly one-third lower MACs |
| DeiT-B | 81.8% / 17.6 | 81.8% / 11.6 | Larger model preserves baseline accuracy with lower compute |
| ViT-S | 78.8% / 4.6 | 79.0% / 3.0 | More visible gain where rank collapse is stronger |
| ViT-S-21K | 81.1% / 4.6 | 81.2% / 3.0 | Similar pattern under stronger pretraining |
| LV-ViT-S | 83.3% / 6.6 | 83.3% / 4.7 | Preserves accuracy under another ViT-style model |

The self-supervised cases are more nuanced. For ViT-B-MAE, the baseline reports 83.7% at 17.6 GMACs, while the proposed method reports 83.2% at 12.1 GMACs. That is a drop, but a smaller one than ToMe and DiffRate show at the same reported MACs. ViT-L-MAE shows a similar pattern: baseline 86.0%, proposed method 85.7%, ToMe and DiffRate 85.6%. For DINO, the method also loses accuracy versus the uncompressed baseline but remains above ToMe in the table.

This matters because MAE and DINO already encourage richer token diversity through their pretraining objectives. The method’s biggest advantage appears where rank collapse is more painful. It is not magic dust. It helps most when the disease it treats is actually present.

The ablations show which parts are doing real work

The paper’s ablations are not side decorations. They explain why the method is structured the way it is.

The high-frequency token ablation tests whether preserving HF tokens actually matters. The answer is yes. Removing high-frequency tokens and retaining mainly low-frequency/DC information causes much larger degradation than removing low-frequency tokens. The authors interpret this as evidence that ViTs continue to rely on high-frequency signals even though self-attention tends to smooth them.

The DC token ablation tests the opposite risk: perhaps high-frequency tokens are all that matter. Not quite. Removing DC tokens also hurts. Low-frequency information is less fragile than high-frequency detail, but it is not disposable. The point is balance: keep the sharp pieces, summarize the smooth background.

The local-window ablation asks where local DC tokens matter. The reported pattern is that local DC tokens help more in earlier layers than later layers. This is consistent with the mechanism: early ViT layers still behave more locally, so overly global compression can lose useful spatial structure. Later layers are already more semantically mixed and more prone to smoothing, so global compression is less dangerous.
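A local DC token can be sketched as a masked average inside each spatial window. The layout (row-major square grid), the window size, and the rule that empty windows emit no token are all assumptions made for illustration; `local_dc_tokens` is a hypothetical helper.

```python
import numpy as np


def local_dc_tokens(x: np.ndarray, low_mask: np.ndarray, grid: int, window: int) -> np.ndarray:
    """One DC token per (window x window) region, averaging its low-frequency tokens.

    x: (grid*grid, d) tokens in row-major order; low_mask: (grid*grid,) bool.
    Windows with no low-frequency token produce no DC token.
    """
    dim = x.shape[1]
    x_grid = x.reshape(grid, grid, dim)
    mask_grid = low_mask.reshape(grid, grid)
    dc = []
    for r in range(0, grid, window):
        for c in range(0, grid, window):
            m = mask_grid[r:r + window, c:c + window]
            if m.any():
                dc.append(x_grid[r:r + window, c:c + window][m].mean(axis=0))
    return np.stack(dc) if dc else np.empty((0, dim))


x = np.arange(16, dtype=float).reshape(16, 1)   # 4x4 grid, 1 channel
low_mask = np.ones(16, dtype=bool)
low_mask[[0, 5]] = False                         # two tokens kept as high-frequency
local = local_dc_tokens(x, low_mask, grid=4, window=2)
print(local.ravel())
```

Setting `window` equal to `grid` collapses this to a single global DC token, which matches the idea of moving from local to global aggregation in later layers.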

The attention-reweighting ablation is modest but informative. On DeiT-S, the paper reports a slight improvement from 79.8% without reweighting to 79.9% with reweighting, and notes faster fine-tuning convergence. That makes reweighting a supporting mechanism, not the hero. The hero is still the frequency-aware selection plus DC aggregation.

| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| HF vs LF pruning ablation | Ablation | High-frequency tokens are disproportionately important for accuracy | That every selected HF token is semantically important to humans |
| DC token removal | Ablation | Low-frequency context still matters after HF preservation | That one DC design is optimal for all tasks |
| Local window size | Sensitivity test | Earlier layers benefit more from local DC aggregation | That the default window schedule is globally optimal |
| Prior-method frequency analysis | Comparison with prior work | Merging can suppress high-frequency content more severely than token selection | That all merging methods fail under every architecture |
| Image-size and patch-size study | Robustness/sensitivity test | Benefits are stronger for longer token sequences | That pruning is safe for very short token sequences |
| Dense prediction extension | Exploratory extension | The idea can transfer beyond classification | That it is production-ready for every segmentation architecture |

This distinction is important. The paper has strong mechanism-aligned evidence, but not every appendix test is a second thesis. Some tests validate the mechanism; others probe where it may generalize.

Why previous token reducers look different through the frequency lens

One of the more useful parts of the paper is its analysis of existing methods.

ToMe, a merging-based method, merges similar tokens. From a redundancy viewpoint, this sounds efficient. From a frequency viewpoint, it risks suppressing variation. The paper’s single-layer reduction analysis shows that ToMe has lower high-frequency amplitude than token-selection methods and even lower than a pooling baseline in the reported comparison. The accuracy penalty is especially severe when merging is applied in shallower layers, where local detail is still valuable.

EViT behaves differently. It uses attention from the CLS token to decide which tokens to retain. The paper finds that EViT and the proposed HF-token selection differ more in early layers but become more similar in deeper layers. The interpretation is elegant: as depth increases, the CLS token gradually incorporates high-frequency signals, so CLS-based selection starts to overlap more with frequency-aware selection.

That explains why some prior pruning methods may work better late in the network. They are not necessarily “understanding” frequency. They may simply become frequency-aligned once the CLS token has accumulated the right information. Earlier in the network, that shortcut is less reliable.

This is the paper’s strongest diagnostic insight for practitioners: layer placement is not a detail hidden in the implementation appendix. It changes what kind of signal the compression method is allowed to destroy.

Long-token settings are where the business case becomes clearer

For business use, the most relevant question is not whether the method wins a benchmark by 0.1 percentage points. It is whether the accuracy-compute tradeoff improves in settings where inference cost actually hurts.

The appendix on different image and patch sizes is especially useful here. The gains are most visible when token sequences are longer.

For ViT-B at 16/384, corresponding to 576 image tokens, the method improves accuracy from 86.6% to 87.0% while reducing GFLOPs from 55.5 to 35.9. At 8/224, corresponding to 784 tokens, it improves accuracy from 85.8% to 86.5% while reducing GFLOPs from 78.3 to 50.0. These are the kinds of cases that matter for high-resolution visual inspection, dense retrieval, document image understanding, medical imaging pipelines, and edge vision systems where finer patching or larger images can quickly make token counts unpleasant.

The same table also gives a warning. At short token lengths, especially 32/224 with only 49 image tokens, pruning hurts badly: ViT-B drops from 80.7% to 77.9%, and ViT-S drops from 76.0% to 73.5%. When the sequence is already short, there may not be much redundant low-frequency mass to remove. A cheap model can become too cheap. Astonishingly, subtraction still subtracts.

So the business reading is not “apply this everywhere.” It is more specific: frequency-aware reduction becomes more attractive as token length grows and as high-resolution inputs make attention cost a real operating constraint.
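The token counts quoted in this section follow directly from the patch and image sizes. A small helper, assuming square images and non-overlapping square patches (the function name is hypothetical), makes it easy to check a planned configuration before benchmarking it.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens for a square image split into non-overlapping square patches."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


for patch, image in [(16, 384), (8, 224), (32, 224)]:
    print(f"{patch}/{image}: {num_patch_tokens(image, patch)} tokens")
# 16/384: 576 tokens
# 8/224: 784 tokens
# 32/224: 49 tokens
```

The two long-sequence settings are the ones where the reported GFLOPs fall by roughly 35% while accuracy rises; the 49-token setting is the one where pruning hurts.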

Throughput evidence is encouraging, but hardware-specific

MAC reductions are useful, but deployment teams care about real throughput. The paper includes end-to-end throughput measurements under FP32 on an NVIDIA RTX 4090.

For ViT-S, the baseline runs at 2,803 images per second with 79.8% accuracy. The proposed method reaches 3,659 images per second with 79.9% accuracy in that benchmark setting. EViT is faster at 3,742 images per second, but with lower accuracy at 79.5%.

For ViT-B, the baseline runs at 918 images per second with 81.8% accuracy. The proposed method reaches 1,310 images per second while preserving 81.8% accuracy. EViT is slightly faster at 1,338 images per second, but reports 81.4% accuracy.

The paper also examines compatibility with FlashAttention. In an FP16 FlashAttention setup on RTX 4090, DeiT-S with FlashAttention reports 9,303 images per second, while the frequency-aware version reports 12,759 images per second. The method is designed to require lightweight post-processing rather than replacing the core optimized attention kernel.

This is promising, but not universal proof. Throughput depends on hardware, precision, batch size, kernel implementation, memory bandwidth, model shape, and serving framework. A method that looks good on an RTX 4090 may still require validation on mobile NPUs, browser inference, cloud GPUs, or specialized accelerators. The paper gives a credible direction, not a procurement guarantee.
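Re-benchmarking on target hardware does not need much machinery. The sketch below is a generic timing loop, not the paper's protocol: `run_batch` is a hypothetical zero-argument callable that the caller supplies, and the warmup and iteration counts are arbitrary defaults.

```python
import time


def images_per_second(run_batch, batch_size: int, warmup: int = 3, iters: int = 10) -> float:
    """Time a callable that processes one batch and return images per second.

    run_batch is a hypothetical zero-argument callable supplied by the caller.
    For GPU inference it must block until the device has finished (for example
    by synchronizing inside the callable), or the timing will be too optimistic.
    """
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


# Stand-in workload so the sketch runs anywhere; replace with real inference.
def dummy_batch():
    sum(i * i for i in range(20_000))


print(f"{images_per_second(dummy_batch, batch_size=64):.0f} images/s")
```

Run the same loop for the baseline and the reduced model at the batch size and precision you actually serve, and compare the ratio rather than the absolute numbers.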

Dense prediction suggests the method is not classification-only

The segmentation appendix is an exploratory extension, but a useful one.

The authors integrate the method into a Segmenter-style pipeline with an ADE20K-pretrained ViT-B encoder. The baseline reports 21.7 images per second and 51.3 mIoU. A light pruning configuration reports 68.0 images per second and 51.1 mIoU. A stronger configuration, pruning 30% at layers 4, 7, and 10, reports 92.1 images per second and 50.8 mIoU.

That is a 0.5 mIoU drop for more than 4x throughput in the reported setup. For dense prediction, where output quality depends heavily on spatial detail, this is an important sanity check. A token method that destroys boundaries would not be very attractive for segmentation, however lovely its classification table looks.

Still, this is one segmentation framework and dataset. Treat it as evidence that the mechanism travels, not proof that it has already arrived everywhere.

The business value is not just cheaper inference; it is safer compression

For companies deploying ViT-based systems, the paper’s practical value lies in how it changes the compression question.

The usual question is: “How many tokens can we remove before accuracy drops?”

The better question is: “Which signals are we removing, and at which layer?”

That shift matters in several business settings.

For visual inspection, high-frequency signals often correspond to edges, scratches, tiny defects, or texture anomalies. These are precisely the details that naive smoothing can erase. A model that runs faster but misses rare defects is not efficient. It is a discounted liability.

For semantic segmentation, boundaries and local structure matter. The dense prediction appendix suggests that frequency-aware pruning can accelerate segmentation while preserving most mIoU, but teams should test boundary-sensitive classes separately. Mean metrics can hide exactly the kind of local failure that frequency arguments are about.

For visual search and retrieval, token reduction can reduce embedding or inference cost, but pruning must preserve discriminative visual cues. If two products differ only by subtle texture, logo detail, or shape edge, low-frequency averaging may make them look closer than they should.

For edge AI, the method is attractive because it targets inference cost without requiring a new model family trained from scratch. The paper applies the method to pretrained models and fine-tunes them. That is operationally easier than rebuilding the architecture. Easier, not effortless.
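For the segmentation case, the point about mean metrics hiding local failures can be checked with per-class IoU. The sketch below is a standard confusion-matrix computation on a toy label map; the class layout is invented for illustration, with class 2 standing in for a thin, boundary-sensitive structure.

```python
import numpy as np


def per_class_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """IoU for each class from integer label maps of identical shape."""
    idx = target.ravel() * num_classes + pred.ravel()
    conf = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    # Classes absent from both prediction and target get NaN, not 0.
    return np.divide(inter, union, out=np.full(num_classes, np.nan), where=union > 0)


target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 2, 2]])
pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 0, 2]])   # one thin-structure pixel lost to class 0

iou = per_class_iou(pred, target, num_classes=3)
print(iou, np.nanmean(iou))
```

A single misplaced pixel halves the IoU of the small class while the mean stays near 0.79. Comparing this vector before and after token reduction shows which classes pay for the speedup.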

A useful deployment framework would look like this:

| Deployment question | What the paper suggests checking |
|---|---|
| Are images high-resolution or patch-dense? | Frequency-aware reduction is more promising when token sequences are long |
| Are errors boundary-sensitive? | Validate on local-detail failure cases, not only aggregate accuracy |
| Is the model already trained with diversity-preserving objectives? | Gains may be smaller for MAE/DINO-style pretraining |
| Is compression applied early? | Use local DC tokens carefully; early layers are more detail-sensitive |
| Is throughput measured on target hardware? | Re-benchmark; RTX 4090 results do not automatically transfer |
| Is the task classification or dense prediction? | Classification evidence is stronger; segmentation evidence is encouraging but narrower |

This is where the paper is genuinely useful for applied AI teams. It gives a vocabulary for auditing compression risk. Token budgets become signal budgets.

Where the claim stops

The paper’s boundaries are clear enough.

First, the experiments are vision-focused. The authors note that rank collapse is relevant to transformer models more broadly and suggest possible extension to multimodal models, but the evidence here is for Vision Transformers. It would be premature to claim the same method directly solves token reduction for language, audio, video-language models, or general multimodal agents.

Second, most classification results rely on fine-tuning with self-distillation from the original model. That is a practical recipe, but it means this is not a zero-effort plug-in. Teams should budget for fine-tuning, validation, and regression testing.

Third, benchmark preservation is not the same as business-risk preservation. Top-1 accuracy, MACs, and throughput are useful, but production deployments often care about rare classes, edge cases, calibration, latency percentiles, explainability, and stability under distribution shift.

Fourth, the method is not always beneficial. The image-size ablation shows meaningful degradation when token length is already short. Efficiency methods have operating regimes. This one is no exception.

Finally, the attention-reweighting component appears supportive rather than transformative. The core idea is frequency-aware selection and DC aggregation. The reweighting helps, but the paper’s own ablation suggests it should not be oversold.

The takeaway: compression should preserve what the model still needs to see

This paper is valuable because it changes the mental model of token reduction. The old story says: find redundancy, remove it, enjoy speed. The frequency-aware story says: before removing tokens, ask whether you are deleting the model’s local variation or only compressing its smooth background.

That distinction is not cosmetic. In Vision Transformers, self-attention already tends to smooth token representations. A careless token-reduction method can accelerate that collapse. A better method should protect high-frequency signals and compress low-frequency content in a controlled way.

The paper’s evidence supports that mechanism across ImageNet classification, different ViT variants, model sizes, prior-method comparisons, dense prediction, throughput tests, FlashAttention compatibility, and token-length sensitivity studies. The result is not a universal deployment recipe, but it is a strong design principle.

For businesses using visual AI, the message is simple: cheaper vision models are useful only if they remain able to see the details that matter. Pruned is fine. Muted is not.

Cognaptus: Automate the Present, Incubate the Future.


  1. Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, and Junmo Kim, “Frequency-Aware Token Reduction for Efficient Vision Transformer,” arXiv:2511.21477, 2025. https://arxiv.org/abs/2511.21477