Engineering · December 15, 2024 · 8 min read

Scaling LLM Inference to Millions of Requests

Learn how we architected our infrastructure to handle millions of inference requests per second while maintaining sub-100ms latency.

Sarah Chen

Chief Technology Officer

When we set out to build Infiner, we knew that traditional scaling approaches wouldn't cut it. Large Language Models present unique challenges that require rethinking infrastructure from the ground up.

The Challenge

Modern LLMs are compute-intensive beasts. A single inference request can require billions of floating-point operations, and the memory footprint of these models can exceed 100 GB. Traditional horizontal scaling, where you simply add more identical nodes, doesn't work well when each node needs that much compute and memory.
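
To make those numbers concrete, here is a rough back-of-the-envelope sketch. It assumes a dense decoder-only model, 16-bit weights, and the common rule of thumb of about 2 floating-point operations per parameter per generated token. The 70-billion-parameter figure is a hypothetical example, not a description of any model we serve, and the estimate ignores the KV cache and activations.

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 assumes 16-bit weights (fp16 or bf16).
    """
    return num_params * bytes_per_param / 1e9


def flops_per_token(num_params: float) -> float:
    """Approximate forward-pass FLOPs per generated token.

    Uses the rule of thumb of about 2 FLOPs per parameter for a dense model.
    """
    return 2 * num_params


# Hypothetical 70B-parameter model, chosen only for illustration.
params = 70e9
print(f"weights: {weights_memory_gb(params):.0f} GB")
print(f"compute: {flops_per_token(params):.1e} FLOPs per token")
```

Under these assumptions the weights alone take about 140 GB, which is already more than a single accelerator typically holds.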

Our Approach

We developed a hybrid architecture that combines:

  1. Smart Request Routing: Not all requests are created equal. Short completions can be routed to smaller, faster instances, while complex reasoning tasks go to our most powerful clusters.
  2. Speculative Decoding: By predicting likely token sequences, we can batch verify multiple tokens at once, dramatically improving throughput.
  3. Dynamic Batching: Requests are intelligently grouped based on expected completion length, maximizing GPU utilization without sacrificing latency.
  4. Edge Caching: Common prompts and their variations are cached at edge locations, reducing round-trip time for repeat queries.
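
The first three ideas above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our production code: the `Request` type, the pool names, the length thresholds, and the function names are all hypothetical, and the speculative-decoding check shown is the greedy variant, where a drafted token is kept only if it matches what the main model would have picked.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Request:
    request_id: str
    expected_tokens: int  # hypothetical estimate of completion length


def choose_pool(req: Request, short_threshold: int = 128) -> str:
    """Route short completions to a small pool, everything else to a large one."""
    return "small-fast" if req.expected_tokens <= short_threshold else "large-cluster"


def bucket_requests(
    requests: Sequence[Request],
    bucket_edges: Sequence[int] = (32, 128, 512),
    max_batch_size: int = 4,
) -> List[List[Request]]:
    """Group requests with similar expected lengths into batches."""
    buckets: Dict[int, List[Request]] = {}
    for req in requests:
        # First bucket whose upper edge fits the request; overflow goes last.
        idx = next(
            (i for i, edge in enumerate(bucket_edges) if req.expected_tokens <= edge),
            len(bucket_edges),
        )
        buckets.setdefault(idx, []).append(req)

    batches: List[List[Request]] = []
    for idx in sorted(buckets):
        group = buckets[idx]
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches


def accept_draft_tokens(draft: Sequence[int], target_next: Sequence[int]) -> List[int]:
    """Greedy speculative-decoding check.

    draft holds the tokens proposed by a cheap predictor. target_next[i] is the
    token the main model picks after seeing the prefix plus draft[:i], so it has
    len(draft) + 1 entries, all obtained from one batched verification pass.
    Returns the agreeing prefix of the draft plus one token from the main model.
    """
    if len(target_next) != len(draft) + 1:
        raise ValueError("target_next must have len(draft) + 1 entries")
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_next[i]:
            break
        accepted.append(tok)
    # Either the correction at the first mismatch or a bonus token after a full match.
    accepted.append(target_next[len(accepted)])
    return accepted
```

In this sketch, every verification pass yields at least one token, and more whenever the draft agrees with the main model. The output is the same as decoding one token at a time with the main model alone.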

Results

After implementing these optimizations, we achieved:

  • 99th percentile latency under 100ms for standard completions
  • 10x improvement in throughput compared to naive scaling
  • 40% reduction in cost per token through better resource utilization
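
For readers less familiar with tail-latency metrics, here is a minimal sketch of how a 99th-percentile target can be checked. It uses the nearest-rank method, which is one of several percentile definitions and is an assumption here. The function names are hypothetical, and the sample latencies are synthetic, not measurements from our systems.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


def meets_latency_target(
    samples_ms: Sequence[float], target_ms: float = 100.0, pct: int = 99
) -> bool:
    """True if the pct-th percentile latency is strictly below target_ms."""
    return percentile_nearest_rank(samples_ms, pct) < target_ms


# Synthetic latencies in milliseconds, for illustration only.
synthetic_ms = [20.0] * 98 + [95.0, 400.0]
print(percentile_nearest_rank(synthetic_ms, 99))  # 95.0
print(meets_latency_target(synthetic_ms))         # True
```

A p99 target tolerates the slowest 1% of requests, so the single 400 ms outlier in the synthetic data does not break it.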

Looking Ahead

We're continuing to push the boundaries of what's possible. Our next focus areas include:

  • Multi-region inference with automatic failover
  • Custom silicon optimization for specific model architectures
  • Real-time model switching based on request characteristics

Stay tuned for more technical deep-dives into our infrastructure.