Audio-Models

TL;DR for operators

The paper is not really saying “use a smaller speech model.” That would be too convenient, and reality hates convenience. It is saying something more useful: audio-model efficiency is a budget allocation problem. Model size, audio duration, encoder token resolution, and adaptation depth are different ways to spend compute, and they do not buy the same thing. Agarwal, Gangrade, Pal, and Wu study this across two tasks: automatic speech recognition, using Whisper on LibriSpeech, and speech emotion recognition, using wav2vec2 on CREMA-D.¹ ...
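To make the budget framing concrete, here is a minimal sketch. It is not the authors' method. It assumes a rough forward-pass cost proxy of about 2 FLOPs per parameter per encoder token, a common transformer rule of thumb that ignores attention's quadratic term. The names (`AudioBudget`, `encoder_tokens`, `forward_flops`, `trainable_params`) and every numeric value are hypothetical illustrations, not figures from the paper, and adaptation depth is modeled only as the share of parameters being tuned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioBudget:
    """Hypothetical bundle of the four compute knobs (illustration only)."""
    params: float          # model size, in parameters
    duration_s: float      # audio duration fed to the encoder, in seconds
    tokens_per_s: float    # encoder token resolution, in tokens per second
    trainable_frac: float  # adaptation-depth proxy: share of params tuned

    def encoder_tokens(self) -> float:
        # Sequence length the encoder has to process.
        return self.duration_s * self.tokens_per_s

    def forward_flops(self) -> float:
        # ASSUMPTION: ~2 FLOPs per parameter per token (rule of thumb).
        return 2.0 * self.params * self.encoder_tokens()

    def trainable_params(self) -> float:
        # Parameters that receive gradient updates during adaptation.
        return self.params * self.trainable_frac


# Two made-up configurations with the same proxy forward cost.
small_long = AudioBudget(params=40e6, duration_s=30, tokens_per_s=50,
                         trainable_frac=1.0)
large_short = AudioBudget(params=80e6, duration_s=15, tokens_per_s=50,
                          trainable_frac=0.25)

for name, cfg in [("small_long", small_long), ("large_short", large_short)]:
    print(name, cfg.encoder_tokens(), cfg.forward_flops(),
          cfg.trainable_params())
```

Under this proxy both configurations cost the same per forward pass, yet one spends the budget on audio context and full adaptation while the other spends it on model capacity. The sketch says nothing about which one is more accurate; that is the empirical question the paper addresses.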