Reducing Language Model Inference Latency using CPU-Assisted Serving
Raffaele Montella
2026-01-01
Abstract
The growing demand for language model (LM) inference is placing significant strain on datacenter resources, particularly GPUs, which are costly and often scarce. This leads service operators to face long request queues or to throttle users to cope with limited GPU availability. The conventional response is to scale out GPU-equipped servers, but this incurs substantial capital and operational expenses. In this work, we propose an alternative strategy that leverages idle CPU nodes, a resource commonly available in modern datacenter clusters. Our approach exploits GPU virtualization to forward GPU API calls from CPU-only nodes to remote GPUs, while performing CPU-intensive computations locally. For LMs where the primary bottleneck is CPU execution rather than GPU utilization, this mechanism allows idle CPUs to effectively augment serving capacity without requiring additional GPUs. Assuming high-speed interconnects typical of modern datacenters, the overhead of remote CPU-GPU communication is amortized, yielding improvements in job completion time and overall throughput. By converting idle CPUs into cost-free contributors to LM serving, our method reduces request queueing delays and provides a practical pathway to increase service efficiency without incurring additional GPU provisioning costs or sacrificing model accuracy, thereby saving on operational expenses. Extensive experimentation on a testbed with ten popular LMs and across four widely used datasets demonstrates that our ready-to-use open-source system can reduce LM inference-serving delays by up to 98%.


