Model Sharding
Splitting a model’s weights across multiple GPUs (or between GPU and CPU RAM) when the model is too large to fit on a single device. Model sharding makes it possible to run 70B+ parameter models on consumer hardware by distributing layers across whatever memory is available. The trade-off is slower inference, since activations must cross device boundaries; CPU offload in particular is limited by RAM and PCIe bandwidth. Tools like llama.cpp support CPU/GPU layer offloading out of the box (the `-ngl` flag controls how many layers go to the GPU), while multi-GPU sharding typically requires frameworks such as DeepSpeed or vLLM.
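
The placement logic can be sketched as a toy layer-assignment function. This is a simplified illustration, assuming a uniform per-layer memory cost; real frameworks (e.g. Hugging Face Accelerate's `device_map="auto"`) measure actual tensor sizes and account for activations. The function name and parameters here are hypothetical.

```python
# Toy sketch of model sharding: greedily assign each layer to the
# first device with enough remaining memory. Assumes every layer
# costs the same amount of memory, which real tools do not.

def shard_layers(n_layers, layer_gb, device_budgets_gb):
    """Return a {layer_index: device} placement plan."""
    placement = {}
    remaining = dict(device_budgets_gb)  # per-device free memory (GB)
    for layer in range(n_layers):
        for device, free in remaining.items():
            if free >= layer_gb:
                placement[layer] = device
                remaining[device] = free - layer_gb
                break
        else:
            raise MemoryError(f"layer {layer} does not fit on any device")
    return placement

# Example: an 80-layer model at ~2 GB per layer, one 24 GB GPU,
# spillover into CPU RAM (devices are tried in insertion order).
plan = shard_layers(80, 2.0, {"cuda:0": 24, "cpu": 160})
gpu_layers = sum(1 for d in plan.values() if d == "cuda:0")
# 12 layers fit on the GPU; the remaining 68 fall back to CPU RAM.
```

The greedy first-fit order matters: listing the fastest device first keeps as many layers as possible on the GPU, which is also what llama.cpp's layer-offload flag effectively does.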