Model Sharding
Splitting a model’s weights across multiple GPUs (or between GPU and CPU RAM) when the model is too large to fit on a single device. Model sharding makes it possible to run 70B+ parameter models on consumer hardware by distributing layers across whatever memory is available. The trade-off is slower inference, since activations must cross device boundaries; CPU offload in particular is limited by RAM and PCIe bandwidth. Tools like llama.cpp support CPU/GPU layer offloading out of the box (the `-ngl` flag controls how many layers go to the GPU), while multi-GPU sharding typically requires frameworks such as DeepSpeed or vLLM.
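
The placement logic can be sketched as a toy layer-assignment function. This is a simplified illustration, assuming a uniform per-layer memory cost; real frameworks (e.g. Hugging Face Accelerate's `device_map="auto"`) measure actual tensor sizes and account for activations. The function name and parameters here are hypothetical.

```python
# Toy sketch of model sharding: greedily assign each layer to the
# first device with enough remaining memory. Assumes every layer
# costs the same amount of memory, which real tools do not.

def shard_layers(n_layers, layer_gb, device_budgets_gb):
    """Return a {layer_index: device} placement plan."""
    placement = {}
    remaining = dict(device_budgets_gb)  # per-device free memory (GB)
    for layer in range(n_layers):
        for device, free in remaining.items():
            if free >= layer_gb:
                placement[layer] = device
                remaining[device] = free - layer_gb
                break
        else:
            raise MemoryError(f"layer {layer} does not fit on any device")
    return placement

# Example: an 80-layer model at ~2 GB per layer, one 24 GB GPU,
# spillover into CPU RAM (devices are tried in insertion order).
plan = shard_layers(80, 2.0, {"cuda:0": 24, "cpu": 160})
gpu_layers = sum(1 for d in plan.values() if d == "cuda:0")
# 12 layers fit on the GPU; the remaining 68 fall back to CPU RAM.
```

The greedy first-fit order matters: listing the fastest device first keeps as many layers as possible on the GPU, which is also what llama.cpp's layer-offload flag effectively does.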