Four Bits, One Identity Crisis: What W4A4 Video Quantization Actually Breaks

TL;DR for operators

The useful surprise in Tail-Aware HiFloat4 is not that a 4-bit video model gets worse. That part is not exactly a Nobel-level plot twist. The useful surprise is where it gets worse. The paper reports a W4A4 HiFloat4 post-training quantization pipeline for Wan2.2-I2V-A14B, and under matched generation settings the unweighted mean score drops from 0.6800 to 0.5880. But the collapse is concentrated: subject consistency falls from 0.9331 to 0.5324, while aesthetic quality is effectively unchanged, overall consistency is comparable, and motion smoothness drops only slightly from 0.9923 to 0.9803.¹

For operators, that means this is not a simple “low-bit video generation is good enough” story. It is a routing story. Aggressive W4A4 quantization may be acceptable for rough creative drafts, internal concept exploration, background motion, or non-identity-critical clips. It is much less reassuring for workflows where a product, person, mascot, package, vehicle, character, or brand asset must remain recognisably the same across frames.

The method itself is practical: it adapts a ViDiT-Q-style post-training quantization workflow, quantizes 800 main transformer linear layers in HiFloat4 W4A4, keeps 12 boundary linear layers in BF16, and stores compact PTQ state rather than duplicating a full transformed BF16 model. Its calibration trick is also sensible: replace hard maximum activation statistics with high-percentile statistics when building SmoothQuant-style channel masks, so rare activation tails do not consume the entire 4-bit representational budget.

The boundary is equally important. The paper does not report a latency benchmark, hardware cost benchmark, percentile ablation, calibration-size sweep, or comparison against the same pipeline using the old max statistic. It shows that this submitted W4A4 configuration has a specific quality profile. That is valuable. It is not yet a universal licence to push every video workload through four bits and call the finance department heroic.

The headline result is not the average drop

Video generation has an obvious cost problem. A model must process many latent tokens across space, time, denoising steps, and conditioning branches. When the backbone is a large diffusion transformer, every extra frame quietly invoices the GPU. Low-bit inference is the natural temptation: use fewer bits, move less memory, run cheaper arithmetic, and hope the video still looks expensive enough.

The paper tests that temptation in a constrained challenge setting. The target model is Wan2.2-I2V-A14B, an image-to-video model used here through a prompt-driven setup with a fixed blank placeholder image. The quantized system uses HiFloat4 for both weights and activations in selected linear layers. The baseline remains BF16. The sampling protocol is matched: same base checkpoint, same placeholder image condition, same resolution, same frame count, same denoising steps, same guidance scale, and same prompt source.

That matters because it narrows the interpretation. The reported differences are not supposed to come from changing the sampler, changing the prompt pipeline, changing the number of frames, or quietly giving the quantized model easier homework. They come from quantization and PTQ-state restoration.

Here is the core evidence.

| Metric | BF16 | HiFloat4 W4A4 | Delta | Operational reading |
|---|---|---|---|---|
| Imaging quality | 0.7027 | 0.6507 | -0.0520 | Visible quality degrades, but not catastrophically |
| Aesthetic quality | 0.5456 | 0.5458 | +0.0002 | Essentially comparable; do not over-celebrate the extra decimal |
| Overall consistency | 0.2263 | 0.2308 | +0.0045 | Comparable under this metric; not proof of broad superiority |
| Subject consistency | 0.9331 | 0.5324 | -0.4007 | The dominant failure mode |
| Motion smoothness | 0.9923 | 0.9803 | -0.0120 | Temporal flow mostly survives |
| Unweighted mean | 0.6800 | 0.5880 | -0.0920 | Average hides the real damage pattern |
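
The table's last row, and how concentrated the damage is, can be checked with a few lines of arithmetic. The sketch below uses only the scores listed in the table; the variable names and the "share of the net drop" calculation are illustrative additions, not something the paper reports.

```python
# Scores copied from the table: metric -> (BF16, HiFloat4 W4A4).
scores = {
    "imaging_quality": (0.7027, 0.6507),
    "aesthetic_quality": (0.5456, 0.5458),
    "overall_consistency": (0.2263, 0.2308),
    "subject_consistency": (0.9331, 0.5324),
    "motion_smoothness": (0.9923, 0.9803),
}


def unweighted_mean(values):
    """Plain arithmetic mean with equal weight per metric."""
    values = list(values)
    return sum(values) / len(values)


bf16_mean = unweighted_mean(bf16 for bf16, _ in scores.values())
w4a4_mean = unweighted_mean(w4a4 for _, w4a4 in scores.values())

# Per-metric change, then the fraction of the net drop owed to one metric.
deltas = {name: w4a4 - bf16 for name, (bf16, w4a4) in scores.items()}
net_drop = -sum(deltas.values())
subject_share = -deltas["subject_consistency"] / net_drop

print(f"BF16 mean: {bf16_mean:.4f}")
print(f"W4A4 mean: {w4a4_mean:.4f}")
print(f"Subject consistency share of net drop: {subject_share:.1%}")
```

Under this accounting, subject consistency alone contributes roughly 87% of the net drop across the five metrics, which is why the average is a poor summary of what changed.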

The lazy summary is “W4A4 reduces quality.” True, but not useful. The operator-relevant summary is sharper: the model still moves well enough, still often looks aesthetically plausible, and still preserves broad scene continuity, but it becomes much less reliable at keeping the subject intact.

That distinction is not academic garnish. In production video, “looks okay” and “the same object persists correctly” are different service-level agreements.

The misconception: cheaper video will mainly break motion

The intuitive fear with quantized video generation is temporal failure. Frames might flicker. Motion might become jerky. The denoising trajectory might lose coherence. Viewers are sensitive to motion defects, so this sounds like the obvious place to look.

The paper’s reported metric profile points elsewhere. Motion smoothness barely moves relative to BF16. The generated clips, according to the qualitative examples, can preserve scene layout and plausible motion. The damage appears more in fine subject details and local object geometry.

This is the useful correction: W4A4 quantization, at least in this configuration, does not primarily fail by turning video into a flipbook. It fails by weakening the repeated preservation of identity-sensitive information. The person, object, costume, furniture, product geometry, or local detail may drift, blur, collapse, or mutate across the denoising process.

That is exactly the failure mode business users are tempted to miss when they evaluate with a single aggregate score or a quick human glance. A clip can look smooth and still be unusable. A brand mascot that moves fluidly while slowly becoming a cousin of itself is not a win. It is a very efficient way to generate review comments.

What the system actually changes

The paper’s method has three main moving parts.

First, it adapts a ViDiT-Q-style post-training quantization pipeline to Wan2.2-I2V-A14B. This matters because video diffusion transformers are not ordinary feed-forward classifiers. Activations vary across denoising timesteps, spatial tokens, temporal tokens, prompt conditions, and classifier-free guidance branches. Calibration is therefore not just bookkeeping. It decides what numerical range the 4-bit representation will protect.

Second, it applies HiFloat4 W4A4 fake quantization to the main linear layers in both Wan transformer modules. The default configuration quantizes 400 linear layers in Transformer-1 and 400 in Transformer-2, for 800 HiFloat4 linear layers total. It keeps 6 linear layers per transformer in full precision, for 12 BF16 linear layers total. These retained layers are boundary modules such as time/text embedding projections and output projections. No full transformer block is retained in BF16 in the submitted configuration.

Third, it stores compact PTQ state: masks and quantization descriptors needed to reproduce the W4A4 model from the original floating-point checkpoint. The alternative would be to store a full transformed copy of the model weights. The compact-state approach is cleaner operationally because it keeps the quantized artefact explicitly tied to the declared base checkpoint.
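
To make the compact-state idea concrete, here is a minimal sketch of what such an artefact could look like. Everything in it is an assumption for illustration: the class names, the field names, the layer paths, and the percentile value are hypothetical, and the paper's actual state format is not described at this level in the article.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LayerState:
    name: str            # hypothetical layer path inside the base checkpoint
    channel_mask: list   # one float per input channel
    percentile: float    # activation percentile used during calibration


@dataclass
class PTQState:
    base_checkpoint: str  # identifier of the floating-point base model
    number_format: str
    quantized_layers: list = field(default_factory=list)
    full_precision_layers: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the state; no model weights are stored here."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "PTQState":
        raw = json.loads(text)
        raw["quantized_layers"] = [
            LayerState(**item) for item in raw["quantized_layers"]
        ]
        return cls(**raw)


# Illustrative values only; the layer names and 99.9 are made up.
state = PTQState(
    base_checkpoint="Wan2.2-I2V-A14B",
    number_format="HiFloat4 W4A4",
    quantized_layers=[
        LayerState("transformer_1.blocks.0.attn.q", [1.0, 0.5, 2.0], 99.9)
    ],
    full_precision_layers=["transformer_1.head"],
)
restored = PTQState.from_json(state.to_json())
```

The design point is that the artefact names its base checkpoint and carries only masks and descriptors. A real registry would probably pin the base model by content hash rather than by name.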

That last point is not merely storage hygiene. In an organisation comparing many compression variants, “this artefact is a delta from this exact base model under this exact calibration setup” is much easier to govern than a zoo of near-identical model copies with unclear lineage. Model registries already have enough ways to become archaeological sites.

The calibration trick: four bits should not worship outliers

The central technical idea is tail-aware percentile calibration for channel-mask construction.

A linear layer can be rescaled without changing its floating-point computation. If $M = \operatorname{diag}(m)$ is a diagonal channel mask, then:

$$ xW^T = (xM)(WM^{-1})^T $$

Before quantization, this is algebraically equivalent. After quantization, it is not equivalent, because the activation path and weight path are approximated by a limited HiFloat4 code set. The choice of $m$ changes which values receive useful resolution and which values get rounded into embarrassment.
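
A small numerical check illustrates both halves of that statement. This is a sketch under stated assumptions: the tensor shapes and the random mask are arbitrary, and `toy_quantize` is a stand-in uniform symmetric 4-bit quantizer, not the HiFloat4 code set, whose details are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))         # 8 tokens, 16 input channels
W = rng.normal(size=(32, 16))        # 32 output channels, 16 input channels
m = rng.uniform(0.5, 2.0, size=16)   # positive per-channel mask, M = diag(m)

# Floating point: x W^T versus (x M)(W M^{-1})^T.
reference = x @ W.T
rescaled = (x * m) @ (W / m).T
max_float_gap = np.max(np.abs(reference - rescaled))


def toy_quantize(t, levels=16):
    """Stand-in uniform symmetric per-tensor quantizer (NOT HiFloat4)."""
    qmax = levels // 2 - 1
    scale = np.max(np.abs(t)) / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale


# After quantization the two paths no longer agree.
q_plain = toy_quantize(x) @ toy_quantize(W).T
q_masked = toy_quantize(x * m) @ toy_quantize(W / m).T
max_quant_gap = np.max(np.abs(q_plain - q_masked))

print(f"float gap: {max_float_gap:.2e}, quantized gap: {max_quant_gap:.2e}")
```

The floating-point gap sits at rounding-error level, while the quantized outputs differ visibly. That difference is the room the mask has to help or hurt.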

SmoothQuant-style channel balancing builds this mask from weight magnitudes and activation magnitudes. The paper defines a per-channel weight statistic:

$$ w_i = \max_o |W_{o,i}| $$

A conservative activation statistic would be the maximum absolute activation observed during calibration:

$$ a_i^{\max} = \max_j |x_{j,i}| $$

The problem is that a maximum can be held hostage by rare outliers. If one unusual activation spike appears during calibration, a 4-bit quantizer may allocate too much of its tiny representational range to covering that spike. The common values—the ones repeatedly used across many tokens and frames—then receive coarser resolution.

The paper’s proposed alternative is a high-percentile statistic:

$$ a_i^p = Q_p\big(\{|x_{j,i}|\}_j\big) $$

where $Q_p$ is the empirical $p$-th percentile for that channel. The channel mask is then constructed as:

$$ m_i = \frac{w_i^\alpha}{(a_i + \epsilon)^{1-\alpha}} $$

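In this mask, $a_i$ denotes the chosen activation statistic, here the percentile value $a_i^p$ rather than $a_i^{\max}$; $\alpha$ sets the balance between the weight and activation terms, and $\epsilon$ is a small constant that avoids division by zero.

The practical interpretation is simple: the calibration process accepts that some rare extremes may be clipped, in exchange for better resolution over the main body of the activation distribution. This is a reasonable bet in 4-bit video generation, where systematic rounding error across many tokens and frames can be more damaging than a few isolated extremes.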
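
The formulas above can be sketched in NumPy to show how one outlier moves the max statistic but barely touches the percentile. The values of `alpha`, the 99th percentile, the tensor shapes, and the injected spike are illustrative assumptions; the article does not state the paper's actual hyperparameters.

```python
import numpy as np


def channel_mask(weight, activations, alpha=0.5, percentile=None, eps=1e-8):
    """Return (mask, activation_statistic), one entry per input channel.

    weight: (out_features, in_features); activations: (tokens, in_features).
    percentile=None uses the hard maximum, otherwise the empirical percentile.
    """
    w_stat = np.max(np.abs(weight), axis=0)          # w_i = max_o |W_{o,i}|
    abs_act = np.abs(activations)
    if percentile is None:
        a_stat = np.max(abs_act, axis=0)             # a_i^max
    else:
        a_stat = np.percentile(abs_act, percentile, axis=0)  # a_i^p
    mask = w_stat ** alpha / (a_stat + eps) ** (1.0 - alpha)
    return mask, a_stat


rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 4))
activations = rng.normal(size=(1000, 4))
activations[0, 0] = 50.0   # one rare spike in channel 0 (toy data)

mask_max, a_max = channel_mask(weight, activations)
mask_pct, a_pct = channel_mask(weight, activations, percentile=99.0)

print("max statistic:       ", np.round(a_max, 2))
print("99th percentile stat:", np.round(a_pct, 2))
```

In this toy data the single spike sets the max statistic for channel 0 to 50, while the percentile statistic stays near the bulk of the distribution, so the two masks differ most for that channel. Whether that difference improves generated video is exactly what an ablation would need to show.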

But the evidence boundary must be kept clean. The paper introduces and uses this percentile calibration, but it does not report an ablation comparing percentile calibration against a hard-max calibration under otherwise identical conditions. So the article-level claim should be: this is a plausible and implemented mechanism inside the submitted system. It should not be upgraded into “percentile calibration alone caused the reported score profile.” That would be the usual compression-paper overreach, and we are not obliged to participate.

The experiments show a failure profile, not a full optimisation map

The paper’s experimental section is compact. That is not a criticism; challenge reports often are. But it means readers should be precise about what each component proves.

| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Matched BF16 vs W4A4 evaluation | Main evidence | The submitted W4A4 system preserves some global metrics while sharply degrading subject consistency | That W4A4 is production-ready, or that this is the best W4A4 recipe |
| Default calibration and generation settings | Implementation detail and control | The comparison uses 16 calibration prompts, 720 × 1280 resolution, 61 frames, 40 denoising steps, guidance scale 3.5, seed 42 | That the result is robust to other resolutions, prompt distributions, seeds, or calibration sizes |
| Layer conversion table | Implementation detail | The scope of quantization is broad: 800 HiFloat4 linear layers, 12 FP linear layers, no FP transformer blocks | That these exact retained layers are optimal |
| Qualitative examples | Diagnostic illustration | The visual failure pattern aligns with the metric drop: plausible motion and layout, weaker fine subject detail | A comprehensive human preference study |
| Limitation discussion | Boundary statement | Calibration prompts, percentile hyperparameters, missing rotation path, and W4A4 severity are material constraints | A sensitivity analysis of those constraints |

This table is the difference between reading and worshipping. The paper provides enough evidence to diagnose a trade-off. It does not provide enough evidence to map the full design space.

The absent tests matter. There is no reported sweep over percentile $p$. No calibration-prompt-count sweep. No comparison to max-based calibration. No test retaining one or two full transformer blocks in BF16. No run with the optional ViDiT-Q rotation path enabled. No latency or memory benchmark. No study of prompt categories where subject consistency collapses more or less severely.

For a research report, that is acceptable. For a deployment decision, it is an invitation to build your own evaluation harness.

Why subject consistency is the expensive metric

Subject consistency sounds like a metric. In business use, it is often the product.

Consider three different video workflows.

In a mood-board generator, subject drift may be tolerable. The goal is to explore lighting, pacing, composition, and general visual direction. If the chair becomes slightly less chair-like, nobody calls legal.

In a social ad draft, subject drift is more costly. The product must remain recognisable, the packaging cannot mutate, and the human model cannot become visually inconsistent across frames. A smooth video of an inconsistent product is not creative. It is defective.

In a brand-controlled or compliance-sensitive workflow, subject drift is a hard stop. Pharmaceutical packaging, financial-product disclaimers, vehicle designs, character IP, uniforms, safety equipment, and public figures do not get to “mostly persist.” The tolerance is asymmetric: one obvious mutation can invalidate the clip.

This is why the paper’s metric profile is commercially more interesting than the average score. The W4A4 system may preserve enough of the global cinematic envelope to be useful in low-stakes generation. It does not preserve enough subject identity, under the reported evaluation, to be trusted blindly in identity-critical production.

The deployment question is therefore not “Can we quantize video generation?” It is “Which jobs can survive this particular degradation pattern?”

The business value is cheaper iteration, not quality parity

The immediate business pathway is not replacing BF16 generation everywhere. It is tiered generation.

A practical video pipeline could use a W4A4 model for early-stage exploration: prompt search, scene blocking, rough motion testing, storyboard variants, background atmosphere, internal review clips, and bulk ideation. Then it could route selected candidates to a higher-precision model for identity-critical refinement. This is not glamorous. It is also how cost control usually enters production: not as magic, but as filtration.

The compact PTQ state also supports an operationally useful pattern: many compression variants can be compared without creating a full duplicate of every transformed checkpoint. For model governance, that means compression becomes a reproducible configuration attached to a base model, not a mysterious sibling model wandering around the registry with a similar name and worse manners.

The paper directly shows quality metrics under a matched challenge setup. Cognaptus infers the workflow implication: W4A4 is promising as a cheap candidate-generation layer where output review and escalation are already part of the process. What remains uncertain is the actual infrastructure ROI, because the paper does not report measured speedup, VRAM reduction, throughput improvement, or hardware-specific deployment performance.

That distinction matters. “Four-bit arithmetic is attractive” is not the same as “your production stack will be cheaper next quarter.” The latter depends on kernels, hardware support, batching, memory bandwidth, scheduler overhead, video length, serving architecture, fallback rates, and how often rejected clips must be regenerated in higher precision. Finance departments enjoy numbers. They are peculiar like that.
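
One way to keep that conversation honest is to write the tiered-generation arithmetic down. The sketch below assumes a deliberately simple model: every job runs once in W4A4, rejected clips are regenerated once in BF16, and the BF16 result is always accepted. The function names and all cost figures are hypothetical; none of them come from the paper.

```python
def tiered_cost_per_clip(cost_w4a4, cost_bf16, accept_rate):
    """Expected cost per delivered clip under W4A4-first with BF16 fallback."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    return cost_w4a4 + (1.0 - accept_rate) * cost_bf16


def break_even_accept_rate(cost_w4a4, cost_bf16):
    """Acceptance rate above which the tiered path beats always-BF16.

    Solves cost_w4a4 + (1 - r) * cost_bf16 < cost_bf16 for r.
    """
    return cost_w4a4 / cost_bf16


# Hypothetical relative costs: BF16 = 1.0 unit, W4A4 = 0.4 units per clip.
tiered = tiered_cost_per_clip(cost_w4a4=0.4, cost_bf16=1.0, accept_rate=0.7)
threshold = break_even_accept_rate(cost_w4a4=0.4, cost_bf16=1.0)

print(f"tiered cost per clip: {tiered:.2f} (always-BF16 costs 1.00)")
print(f"break-even acceptance rate: {threshold:.0%}")
```

The useful property of this toy model is that it ties savings to the acceptance rate. If identity-sensitive prompts push rejections up, the cheap path can cost more than the expensive one.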

Procurement should evaluate the failure mode, not the compression label

A vendor pitch around this class of method will naturally emphasise W4A4, 4-bit floating-point, post-training quantization, and lower inference cost. Those are relevant. They are not acceptance criteria.

A useful evaluation checklist should separate four questions.

| Procurement question | Why it matters | What to measure |
|---|---|---|
| Does motion survive? | Smoothness defects are visible and easy to reject | Motion smoothness, flicker, temporal coherence, human review |
| Does the subject survive? | Product, person, character, or object identity may be the business asset | Subject consistency, object geometry, logo/package persistence, frame-by-frame inspection |
| Does calibration match the workload? | A 16-prompt calibration set may not cover production distributions | Prompt-category tests, calibration-size sweeps, percentile sensitivity |
| Does compression actually save money? | Quality loss can erase infrastructure savings through rework | Latency, memory, throughput, rejection rate, fallback rate, cost per accepted clip |

The most dangerous procurement mistake is to accept an average video-quality number as a proxy for production readiness. Average metrics are polite. Production failures are not.

If the workflow cares about stable subject identity, the acceptance test must over-sample identity-sensitive prompts. Use branded objects, repeated characters, specific garments, packaging, hands interacting with objects, tools, vehicles, furniture, and scenes where local geometry matters. Then compare the quantized model not only against BF16, but against the cost of human correction and regeneration.

A smooth 61-frame video is not useful if frame 38 decides the product has a new logo.

The missing ablations are not academic housekeeping

The paper’s limitation section correctly names several boundaries: calibration prompts, percentile hyperparameters, the absence of the optional ViDiT-Q rotation path, and the severity of W4A4 activation quantization. For business readers, these are not footnotes. They are where the deployment risk lives.

The missing percentile ablation is especially important. The paper’s mechanism says percentile calibration reduces sensitivity to rare activation tails. Sensible. But without comparing it against the maximum statistic, we cannot quantify how much it helps. Without sweeping $p$, we cannot tell whether the chosen percentile is robust or merely convenient. Without calibration-distribution tests, we cannot tell whether the method behaves differently across prompts with people, animals, text, product objects, fast motion, small details, or complex scenes.
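
A team building its own harness could start with something as small as the sweep below. It is a toy under loud assumptions: the activations are synthetic heavy-tailed samples for a single channel, only magnitudes are quantized, and the quantizer is a uniform 4-bit grid rather than HiFloat4. It measures numerical error, not video quality, so it is a first filter and not a replacement for the missing ablation.

```python
import numpy as np


def sweep_percentiles(abs_activations, percentiles, levels=16):
    """For each percentile, report clipping threshold, clipped share, and MSE."""
    qmax = levels // 2 - 1
    rows = []
    for p in percentiles:
        clip_at = np.percentile(abs_activations, p)
        scale = clip_at / qmax
        # Uniform grid on magnitudes; values above clip_at saturate at qmax.
        quantized = np.clip(np.round(abs_activations / scale), 0, qmax) * scale
        rows.append({
            "percentile": p,
            "clip_at": float(clip_at),
            "clipped_fraction": float(np.mean(abs_activations > clip_at)),
            "mse": float(np.mean((abs_activations - quantized) ** 2)),
        })
    return rows


rng = np.random.default_rng(0)
# Synthetic heavy-tailed magnitudes (Student-t, 3 degrees of freedom).
acts = np.abs(rng.standard_t(df=3, size=100_000))
results = sweep_percentiles(acts, [100.0, 99.99, 99.9, 99.0, 95.0])

for row in results:
    print(row)
```

Lower percentiles clip more values but spend the grid on the bulk of the distribution. Where the error minimum falls depends on the data, which is why a single chosen $p$ deserves a sensitivity check on production-like prompts.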

The missing rotation path also matters. ViDiT-Q includes optional rotation to reduce outliers through equivalent transformations. This paper does not use that path. The authors note that unresolved channel imbalance may remain. That leaves an obvious improvement direction: combine percentile calibration with better outlier handling and selective high-precision retention for identity-sensitive layers.

The lack of speed and memory measurements is another practical boundary. The method is motivated by lower-cost inference, and the use of W4A4 is clearly aligned with that goal. But the paper’s reported evidence is quality-focused. It does not tell an operator how much faster or cheaper the system is on a particular inference stack. That is not a fatal flaw. It just means ROI remains an engineering measurement, not a paragraph in the introduction.

The right operating model is precision routing

The paper’s evidence points toward a precision-routing architecture rather than a one-model replacement decision.

Use W4A4 where the cost of being slightly wrong is low and the value of rapid iteration is high. Use BF16 or a stronger mixed-precision variant where identity stability is contractually, commercially, or reputationally important. Keep the quantized path inside a measured pipeline with explicit gates: subject consistency threshold, prompt-category checks, human review sampling, and automatic escalation for identity-sensitive cases.

This creates a more realistic operating model:

| Workload type | W4A4 suitability | Recommended control |
|---|---|---|
| Internal mood boards | High | Human review before sharing |
| Prompt exploration | High | Cheap bulk generation, then upscale selected candidates |
| Background or atmospheric clips | Medium to high | Check for scene-level coherence |
| Product advertising drafts | Medium | Subject and object consistency gate |
| Brand mascot or character video | Low to medium | Higher-precision fallback by default |
| Regulated or identity-sensitive content | Low | Avoid W4A4 unless validated on the exact content class |
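
The routing table can be expressed as a small policy function plus an escalation gate. This is a sketch, not a recommended configuration: the workload labels, the `Job` type, and the 0.90 threshold are hypothetical, and a per-clip subject-consistency score is assumed to be available from whatever evaluator the pipeline uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    workload: str              # hypothetical workload label
    identity_sensitive: bool   # True when the subject must persist exactly


# Workloads allowed to try the W4A4 path first (labels are illustrative).
W4A4_FIRST = {"mood_board", "prompt_exploration", "background_clip", "ad_draft"}


def choose_precision(job: Job) -> str:
    """Pick the first generation path for a job."""
    if job.identity_sensitive or job.workload not in W4A4_FIRST:
        return "bf16"
    return "w4a4"


def needs_escalation(subject_consistency: float, threshold: float = 0.90) -> bool:
    """Gate a W4A4 clip: escalate to higher precision below the threshold."""
    return subject_consistency < threshold


draft = Job(workload="mood_board", identity_sensitive=False)
mascot = Job(workload="brand_mascot", identity_sensitive=True)

print(choose_precision(draft))    # w4a4
print(choose_precision(mascot))   # bf16
print(needs_escalation(0.62))     # True
```

Keeping the policy in code rather than in tribal knowledge means the threshold and the allow-list can be reviewed, versioned, and tightened when a content class turns out to be more fragile than expected.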

This is the unromantic lesson: compression is not a model property in isolation. It is a workflow policy. The same W4A4 model can be valuable in one stage and unacceptable in another.

The companies that benefit will not be the ones that announce “we quantized video.” They will be the ones that know exactly which failure modes they bought, where they are acceptable, and when to route around them.

The paper’s real contribution is diagnostic discipline

Tail-Aware HiFloat4 is not a grand theory of efficient video generation. It is a practical challenge-system report with a useful metric pattern. That makes it valuable in a different way.

It shows that aggressive 4-bit PTQ can leave broad visual appeal and motion surprisingly intact while damaging subject identity. It gives a concrete pipeline for applying HiFloat4 W4A4 to a dual-transformer Wan2.2 model. It introduces a reasonable tail-aware calibration mechanism. It keeps sensitive boundary layers in high precision. It stores compact PTQ state for reproducibility.

But its most useful contribution for operators is the warning: do not evaluate compressed video models by asking whether the clips “look fine.” Ask what must stay the same across frames. Ask whether your business value lives in motion, style, subject identity, product geometry, or all of them at once. Then measure the thing that can actually break the workflow.

Four bits may be enough to keep the scene moving. They may not be enough to keep the subject itself.

That is not failure. It is information. In AI operations, those are too often confused.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhanfeng Feng, Shuai Guo, Xin Di, Long Peng, Yang Cao, and Zhengjun Zha, “Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2,” arXiv:2605.26628, 2026.
