System Design Question Preview
You are building an internal deployment service that rolls out a new model checkpoint to every machine in a compute cluster. The checkpoint is very large, often hundreds of GB, and initially exists only in a single model repository. Before workers can serve traffic, each machine must hold a complete, verified local copy of the model weights.
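To make the "complete and verified" requirement concrete, here is a minimal sketch of what a per-machine check might look like. It assumes, purely for illustration, that the checkpoint is a single file and that the repository publishes its expected size and SHA-256 digest; the question itself specifies neither, and the names `sha256_of_file` and `is_complete_and_verified` are hypothetical.

```python
import hashlib
from pathlib import Path

# Read in fixed-size chunks so memory use stays flat even for files of hundreds of GB.
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, an arbitrary choice for this sketch


def sha256_of_file(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Stream the file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def is_complete_and_verified(path: Path, expected_size: int, expected_sha256: str) -> bool:
    """Return True only if the local copy exists, has the expected size, and matches the digest.

    expected_size and expected_sha256 are assumed to come from the model repository.
    """
    if not path.is_file():
        return False
    # Cheap completeness check first: a truncated download fails here without hashing.
    if path.stat().st_size != expected_size:
        return False
    return sha256_of_file(path) == expected_sha256.lower()
```

A worker would call `is_complete_and_verified(local_path, size, digest)` before marking itself ready to serve. In a real design the checkpoint would more likely be split into chunks with per-chunk digests, so that a corrupted piece can be re-fetched without hashing or re-downloading the whole file.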