In deep learning, batch processing refers to feeding multiple inputs into a model at once, so that they are handled in a single forward pass. Although batching is essential during training, it can also help manage cost and improve throughput at inference time. Hardware accelerators are optimized for parallelism, so batching helps saturate their compute capacity and often leads to higher throughput.
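To make the idea concrete, here is a minimal sketch in plain Python. `ToyModel`, `make_batches`, and `run_inference` are hypothetical names used only for illustration and do not come from any serving framework. The toy model simply counts how many times it is called, which stands in for the number of forward passes an accelerator would have to execute.

```python
from typing import List, Sequence


def make_batches(inputs: Sequence[float], batch_size: int) -> List[List[float]]:
    """Split inputs into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(inputs[i:i + batch_size]) for i in range(0, len(inputs), batch_size)]


class ToyModel:
    """Hypothetical stand-in for a real model; it counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def predict(self, batch: List[float]) -> List[float]:
        # One call handles the whole batch. A real framework would do this
        # with a single tensor operation on the accelerator.
        self.calls += 1
        return [2.0 * x + 1.0 for x in batch]


def run_inference(model: ToyModel, inputs: Sequence[float], batch_size: int) -> List[float]:
    """Run the model over all inputs, one batch at a time, preserving order."""
    outputs: List[float] = []
    for batch in make_batches(inputs, batch_size):
        outputs.extend(model.predict(batch))
    return outputs


if __name__ == "__main__":
    requests = [0.0, 1.0, 2.0, 3.0, 4.0]

    unbatched = ToyModel()
    run_inference(unbatched, requests, batch_size=1)

    batched = ToyModel()
    run_inference(batched, requests, batch_size=4)

    print("model calls without batching:", unbatched.calls)
    print("model calls with batch_size=4:", batched.calls)
```

The outputs are identical in both cases; only the number of model invocations changes. On real hardware, the benefit depends on how well a larger batch fills the accelerator, so the call count here is only a rough proxy for throughput.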
Batching is useful in several scenarios when deploying models in production. Here we broadly group them into two use cases:
Real-time applications where several inference requests are received from different clients and are dynamically batched and fed to the serving model. Latency is usually important in t …