Dynamic Batching for Enhanced Hardware Utilization
Situation
After successfully onboarding several models onto SambaStudio, our focus shifted to maximizing hardware utilization — a key metric for customer value. We identified dynamic batching as an effective lever, but implementing it required significant changes at the app server layer to manage a request queue that fills dynamically as requests arrive.
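The core idea of a dynamically filling batch can be sketched as a queue-draining loop: block for the first request, then wait a bounded time for more before dispatching a possibly partial batch. This is a minimal illustration only — the function name, batch size, and wait time are hypothetical, not the actual SambaStudio implementation:

```python
import queue
import time

def collect_batch(request_queue, max_batch_size=8, max_wait_ms=50):
    """Drain the request queue into one batch, waiting up to
    max_wait_ms for it to fill before dispatching a partial batch."""
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline hit: ship a partial batch rather than stall
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The timeout is the key trade-off: a longer wait yields fuller batches (better utilization) at the cost of added latency for the first request in the batch.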
Task
Integrate the dynamic batching component into the app server for the LLaMA 2 (7B and 70B) models, and ensure a stable, high-performance release to one of our most demanding customers.
Actions
- Collaborated closely with our chief software architect to implement the batching integration, navigating both the app layer and the complex interactions needed for optimal batch management.
- Identified and resolved subtle performance bottlenecks during testing: optimized the gRPC server's thread count and tuned Python garbage collection for the request sender — both critical to sustaining high throughput.
- Worked with the ML platform team to introduce a Redis-based global queue across workers, significantly improving Time-to-First-Token (TTFT) by reducing queuing delays and enabling better batch efficiency.
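The two server-side tunings above can be sketched as follows. This is a hedged illustration: the specific thread count and GC thresholds are placeholders (the real values came from load testing), and the executor shown is the kind of pool that would be handed to `grpc.server(...)`:

```python
import gc
from concurrent.futures import ThreadPoolExecutor

# Hypothetical value; the right number comes from load testing.
GRPC_WORKER_THREADS = 16

def make_server_executor(max_workers=GRPC_WORKER_THREADS):
    # This pool is what grpc.server(...) uses to handle RPCs: too few
    # threads starves concurrent request handling, too many adds
    # contention without improving throughput.
    return ThreadPoolExecutor(max_workers=max_workers)

def tune_gc_for_sender():
    # Raise the generation-0 allocation threshold so the request
    # sender's many short-lived allocations trigger fewer collections
    # on the hot path (placeholder numbers).
    gc.set_threshold(50_000, 25, 25)
    # Exempt long-lived startup objects from future GC scans entirely.
    gc.freeze()
```

Both knobs matter under sustained load: GC pauses in the sender show up directly as jitter in per-token latency, and an undersized gRPC pool caps concurrency regardless of how fast batches run on the device.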
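The Redis-based global queue replaces per-worker queues with one shared list: any app server pushes, and whichever worker frees up first pops, so requests stop waiting behind a busy worker. A minimal sketch, assuming a redis-py-style client with `lpush`/`brpop` — the key name and payload shape are hypothetical:

```python
import json

# Hypothetical key name for the single queue shared by all workers.
REQUEST_QUEUE_KEY = "llama2:request_queue"

def enqueue_request(client, request_id, prompt):
    # Every app server pushes onto the same list, so work is pulled
    # by the next free worker instead of sitting in a per-worker queue.
    payload = json.dumps({"id": request_id, "prompt": prompt})
    client.lpush(REQUEST_QUEUE_KEY, payload)

def dequeue_request(client, timeout_s=1):
    # BRPOP blocks until an item is available; combined with LPUSH
    # this gives FIFO ordering across all workers.
    item = client.brpop(REQUEST_QUEUE_KEY, timeout=timeout_s)
    if item is None:
        return None  # timed out: no pending requests
    _key, payload = item
    return json.loads(payload)
```

Because the queue is global, a worker that finishes its batch early immediately pulls the oldest pending request, which is what drives the TTFT improvement.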
Result
The LLaMA 2 models shipped successfully with dynamic batching enabled, achieving meaningful improvements in TTFT, throughput, and hardware utilization. The rollout met our customer's high performance bar and demonstrated our ability to deliver efficient, production-ready solutions.