System Design Question Preview
You are designing the serving layer for a large language model product. Users send prompts to an API and expect generated text back, either as a complete response or as a streamed sequence of tokens. Behind the API, the system runs multiple model replicas on GPU machines and must keep those expensive GPUs highly utilized without making user-facing latency unpredictable.