GPU Scheduling

Mixed Feelings: When LLM Batching Stops Being Obviously Better

Queues are where infrastructure theories go to become invoices. In LLM serving, the popular theory has been simple enough: mix the work. During inference, a model first reads the whole prompt in the prefill phase, then generates output tokens one at a time in the decode phase. Prefill is limited by compute. Decode is limited by memory bandwidth. So the obvious move is to combine them in the same batch, so that a single forward pass carries both prefill and decode tokens and neither resource sits idle. This is mixed batching, and it has become the default posture in modern inference engines. ...
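To make the idea concrete, here is a minimal sketch of how one mixed batch might be assembled. It is not taken from any particular engine: the `Request` record, the `build_mixed_batch` function, and the per-step `token_budget` are all hypothetical names chosen for illustration, and the policy of serving decode first and giving leftover budget to prefill is just one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    rid: str
    prompt_len: int      # tokens in the prompt
    prefilled: int = 0   # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        # A request moves to decode once its whole prompt has been read.
        return self.prefilled >= self.prompt_len


def build_mixed_batch(requests, token_budget):
    """Return a list of (rid, phase, n_tokens) for one forward pass.

    Assumed policy: decode requests each contribute one token, and any
    leftover budget is spent on prompt tokens from prefill requests.
    """
    batch = []
    remaining = token_budget
    # Decode first: each running request generates one token per step.
    for r in requests:
        if r.in_decode and remaining > 0:
            batch.append((r.rid, "decode", 1))
            remaining -= 1
    # Fill what is left with prompt tokens from requests still in prefill.
    for r in requests:
        if not r.in_decode and remaining > 0:
            chunk = min(r.prompt_len - r.prefilled, remaining)
            batch.append((r.rid, "prefill", chunk))
            remaining -= chunk
    return batch


reqs = [
    Request("a", prompt_len=8, prefilled=8),   # already decoding
    Request("b", prompt_len=8, prefilled=8),   # already decoding
    Request("c", prompt_len=100),              # still waiting on prefill
]
batch = build_mixed_batch(reqs, token_budget=16)
print(batch)
```

With a budget of 16 tokens, the two decoding requests take one token each and the remaining 14 go to the new prompt, so a single step carries both kinds of work. How that budget is set, and which phase gets priority, is where the trade-offs start.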