The goal was simple: make SAM3 fast enough to run on a laptop CPU in under a second. The encoder — ViT-H at 815M parameters — is the bottleneck. Everything else in the pipeline is cheap.
Why knowledge distillation
We chose TinyViT as the student. At 14.7M parameters it's 58× smaller. The alternative was pruning or quantization, but both felt like losing weight by cutting limbs. KD lets the student learn the shape of the task, not just mimic the weights.
What surprised me
- The distillation converged to 79% IoU retention within the first epoch. We expected it to take much longer.
- CPU speedup was more dramatic than GPU speedup — 19× vs 8.5×. The small model fits in L2 cache; the big one doesn't.
- Benchmarking is harder than training. Writing reproducible latency measurements across three machines took longer than the model itself.
The code is at github.com/ricardotk002/flash-sam3 if you want to look at the distillation setup.