One Step, Not One Trick: SOM and the Q-Guided MeanFlow Policy
TL;DR for operators

A control policy that needs twenty denoising steps before it can choose one action is not merely “expressive”. It is also late. In online reinforcement learning, that matters because policy inference is not a side calculation; it sits inside the loop that collects the next piece of experience. The paper on Score-Based One-step MeanFlow Policy Optimization, or SOM, tackles this operationally awkward trade-off: diffusion and flow policies can represent multimodal action distributions, but they often pay for that expressiveness through iterative sampling. SOM keeps the generative-policy idea but moves action generation into a one-step MeanFlow policy.[1] ...
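To make the latency argument concrete, here is a minimal sketch that counts function evaluations for a multi-step flow sampler versus a one-step mean-velocity sampler. Everything in it is an assumption for illustration: the names (`CallCounter`, `TARGET`, `sample_multistep`, `sample_one_step`) are hypothetical, the closed-form velocity fields stand in for learned networks, the action is a single scalar, and the time convention (t = 1 is noise, t = 0 is the action) is assumed. It is not SOM's architecture or training procedure; it only shows why one evaluation per action is cheaper than twenty.

```python
import math
import random

TARGET = 0.3  # hypothetical "best action" for one fixed state (toy value)


class CallCounter:
    """Wraps a function and counts how many times it is evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def instantaneous_velocity(z, t):
    # Velocity of the straight path z_t = (1 - t) * TARGET + t * noise.
    # Stand-in for a learned flow network; only valid for t > 0.
    return (z - TARGET) / t


def mean_velocity(z, r, t):
    # Average velocity over [r, t], i.e. (z_t - z_r) / (t - r).
    # On this straight-line toy path it has the same closed form
    # for every r, so r is unused here.
    return (z - TARGET) / t


def sample_multistep(velocity, noise, steps):
    # Euler integration from t = 1 (noise) down to t = 0 (action):
    # one velocity evaluation per step.
    z = noise
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt  # never reaches 0 inside the loop
        z = z - dt * velocity(z, t)
    return z


def sample_one_step(mean_vel, noise):
    # One evaluation: jump across the whole interval [0, 1] at once.
    return noise - mean_vel(noise, 0.0, 1.0)


noise = random.Random(0).gauss(0.0, 1.0)

flow = CallCounter(instantaneous_velocity)
action_multistep = sample_multistep(flow, noise, steps=20)

meanflow = CallCounter(mean_velocity)
action_one_step = sample_one_step(meanflow, noise)

print("multi-step calls:", flow.calls)
print("one-step calls:", meanflow.calls)
```

In this toy both samplers land on the same action, because the path is a straight line and Euler integration is exact on it. The difference is the cost per action: twenty evaluations versus one, paid on every environment step. With learned networks and curved paths the two would not agree exactly, which is the part the toy deliberately leaves out.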