Whisper

Whispers Against the Noise: How Contrastive Decoding Tames Long-Form ASR Hallucinations

A transcript is usually treated as boring infrastructure. It sits underneath meeting summaries, call-center analytics, podcast search, earnings-call review, legal discovery, medical documentation, and the cheerful dashboard that tells managers everything is now “AI-powered.” Then the transcript invents a sentence. Not a typo. Not a small mishearing. A fluent, confident, context-shaped sentence that nobody said. In short clips, this is irritating. In long recordings, it becomes structural. One bad segment can become context for the next segment; the next segment inherits the mistake; and soon the system is not transcribing a recording so much as continuing a badly seeded story. ...

When 30 Seconds Isn’t Enough: Engineering Long-Form Bangla ASR & Diarization

Call recordings are rude. They do not arrive in clean 15-second snippets. They run for minutes or hours. Speakers interrupt each other. Background noise leaks in. Someone moves away from the microphone. Someone else speaks over music, traffic, or a ceiling fan that apparently believes it deserves co-author status. This is where many speech AI demos quietly stop being impressive. ...

Whispering Feelings: When ASR Models Learn to Read Emotion

Voice systems have an awkward problem. They are getting better at hearing words, but words are not always the message. A customer says, “Fine.” A patient says, “I’m okay.” A caller says, “No problem.” The transcript is calm. The voice may not be. For call centers, mental-support triage, voice assistants, social robots, and compliance monitoring, that gap is not poetic. It is operational. ...