rdiaz.dev
ricardo díaz·software engineer·machine learning

Distilling SAM3 down to something fast

2026-03-15 · Knowledge distillation from ViT-H to TinyViT: what held, what broke, and why the first epoch matters.

The goal was simple: make SAM3 fast enough to run on a laptop CPU in under a second. The encoder — ViT-H at 815M parameters — is the bottleneck. Everything else in the pipeline is cheap.

Why knowledge distillation

We chose TinyViT as the student. At 14.7M parameters it's about 55× smaller than the teacher. The alternatives were pruning and quantization, but both felt like losing weight by cutting limbs. KD lets the student learn the shape of the task by matching the teacher's outputs, rather than inheriting a cut-down copy of its weights.
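
One common way to set this up is feature distillation: the frozen teacher and the student encode the same image, and the student is trained to match the teacher's embedding. The sketch below shows only that loss, in NumPy, on random data. The function name, the tensor shapes, and the choice of mean squared error are assumptions for illustration, not necessarily what flash-sam3 does.

```python
import numpy as np


def embedding_distillation_loss(student_emb, teacher_emb):
    """Mean squared error between student and teacher image embeddings.

    Both arrays are assumed to share a shape, e.g. (batch, channels, h, w).
    The teacher embedding is treated as a fixed target.
    """
    student_emb = np.asarray(student_emb, dtype=np.float64)
    teacher_emb = np.asarray(teacher_emb, dtype=np.float64)
    if student_emb.shape != teacher_emb.shape:
        raise ValueError("student and teacher embeddings must have the same shape")
    return float(np.mean((student_emb - teacher_emb) ** 2))


# Toy example with hypothetical shapes: 2 images, 256 channels, 64x64 grid.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((2, 256, 64, 64))
# A "student" that is close to the teacher but not identical.
student = teacher + 0.1 * rng.standard_normal(teacher.shape)

loss = embedding_distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the same quantity would be computed on framework tensors, with gradients flowing only into the student. If the two encoders produce embeddings of different widths, a small projection layer on the student side is the usual fix.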

What surprised me

  • The distillation reached 79% IoU retention (the student's IoU as a share of the teacher's) within the first epoch. We expected it to take much longer.
  • CPU speedup was more dramatic than GPU speedup — 19× vs 8.5×. The small model fits in L2 cache; the big one doesn't.
  • Benchmarking is harder than training. Writing reproducible latency measurements across three machines took longer than the model itself.
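
To make the speedup and benchmarking points concrete, here is a minimal sketch of a latency measurement that discards warmup runs and reports the median, using only the standard library. The helper names and the two stand-in workloads are hypothetical; the real measurement would call the teacher and student encoders instead.

```python
import statistics
import time


def median_latency(fn, warmup=3, runs=20):
    """Return the median wall-clock time of fn() in seconds.

    Warmup calls are discarded so one-time costs (lazy initialization,
    cold caches) do not skew the result.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Hypothetical stand-ins for the two encoders: the "teacher" does 20x the work.
def fake_teacher():
    sum(i * i for i in range(200_000))


def fake_student():
    sum(i * i for i in range(10_000))


teacher_s = median_latency(fake_teacher)
student_s = median_latency(fake_student)
print(f"teacher {teacher_s * 1e3:.2f} ms, student {student_s * 1e3:.2f} ms, "
      f"speedup {speedup(teacher_s, student_s):.1f}x")
```

The median is used rather than the mean because a single slow run (a background process, a thermal throttle) can drag the mean a long way. Comparing numbers across machines also means recording the thread count and hardware next to each result.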

The code is at github.com/ricardotk002/flash-sam3 if you want to look at the distillation setup.
