Our approach for 5.8× cheaper Kimi K2.6 inference

Kimi K2.6 has 1T parameters with 32B active per token, quantized to INT4. We compare three configurations that serve it at 256K context. The standard configuration, Case A, is an 8×H200 HGX baseboard running tensor parallelism (TP=8) and decode context parallelism (DCP=8) under vLLM. Case C replaces the baseboard with a single H200 that holds 16 hot experts per layer in VRAM and fetches the remaining 368 cold experts per layer on demand from pinned host RAM over PCIe Gen5. Case C × 8 on spot scales Case C horizontally to eight independent instances behind a session-aware load balancer, running on spot pricing with a KV-cache migration orchestrator that moves active sessions when reclaim notifications arrive.

Figure: cost ($/M tokens) versus aggregate output (tok/s). Case A, on-demand: $14.45. Case A, spot + hot spare: $11.56. Case C × 8, on-demand: $5.56. Case C × 8, spot: $2.50.

Inputs

Model             Kimi K2.6 (1T / 32B)
Context           256K
KV precision      FP8
Per-user target   25–30 tok/s
H200 on-demand    $4.00 / hr
H200 spot         $1.60 / hr
8×H200 HGX        $32.00 / hr
PCIe Gen5         64 GB/s

Derivation

Cost per million tokens = hourly $ ÷ (tok/s × 3600) × 10⁶.
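
As a quick check on the formula above, here is a minimal Python sketch. The function name `cost_per_million` is ours, chosen for illustration; the inputs are the hourly prices and throughputs quoted in this post.

```python
def cost_per_million(hourly_usd: float, tok_per_s: float) -> float:
    """Cost in $ per million output tokens.

    hourly_usd: hosting cost in $ per hour
    tok_per_s:  aggregate output throughput in tokens per second
    """
    tokens_per_hour = tok_per_s * 3600
    return hourly_usd / tokens_per_hour * 1e6


if __name__ == "__main__":
    # Case A: 8xH200 HGX at $32/hr producing 615 tok/s
    print(f"Case A: ${cost_per_million(32.00, 615):.2f} / M")
    # Case C: single H200 at $4/hr producing 200 tok/s
    print(f"Case C: ${cost_per_million(4.00, 200):.2f} / M")
```

Every cost figure in the sections that follow is this one function applied to a different pair of hourly price and aggregate throughput.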

Case A: 8×H200 HGX

All weights reside in VRAM, and an NVLink all-reduce runs on every layer. Under vLLM with TP=8 and DCP=8, tensor-parallel collectives consume 30 to 40 percent of HBM bandwidth, which bounds the batch at 22 users at 256K context. Per-user output tokens per second (OTPS) is about 28, and aggregate output is 615 tok/s (22 × 28 ≈ 615). At a hosting cost of $32 per hour:

$32.00 / HR ÷ (615 × 3600) TOK/HR × 1E6  =  $14.45 / M

Case C: single H200 with prefetch

Case C runs on a single H200 paired with 1 TB of pinned host RAM. Sixteen hot experts per layer reside in the H200’s 141 GB of VRAM, while the remaining 368 cold experts per layer sit on the host. A thin predictor runs once per decode step to forecast each layer’s expert assignments from the prior step’s hidden state. Any predicted cold experts are queued for prefetch on a dedicated PCIe Gen5 copy stream while the main compute stream begins the forward pass. The design assumes that the hot set tracks the top-frequency experts closely enough that at most 1 of the 8 active experts per layer is cold at the P95. Under that assumption, each 0.34 ms transfer is hidden behind cumulative prior-layer compute rather than any single attention window, so correctly predicted cold fetches arrive ahead of the layer that needs them and stay off the critical path. This configuration has no NVLink and no all-reduce. Batch is 8, per-user OTPS is 25, and aggregate output is 200 tok/s. Hosting cost is $4 per hour.

$4.00 / HR ÷ (200 × 3600) TOK/HR × 1E6   =  $5.56 / M
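
To make the prefetch budget concrete, the sketch below works backwards from the 0.34 ms transfer time and the 64 GB/s PCIe Gen5 figure to the implied size of one expert, then estimates how much of a decode step the copy stream would be busy. This is a back-of-the-envelope sketch, not a measurement. `HYPOTHETICAL_LAYERS = 61` is our assumption and does not come from this post, and "one cold fetch per layer per step" is our reading of the P95 assumption above.

```python
PCIE_GEN5_BYTES_PER_S = 64e9   # 64 GB/s, from the Inputs table
TRANSFER_S = 0.34e-3           # per-expert transfer time quoted for Case C
PER_USER_OTPS = 25             # Case C per-user output rate (tok/s)
HYPOTHETICAL_LAYERS = 61       # ASSUMPTION: layer count is not stated in this post


def implied_expert_bytes(transfer_s: float, bytes_per_s: float) -> float:
    """Bytes moved in one transfer if the link runs at full bandwidth."""
    return transfer_s * bytes_per_s


def decode_step_s(otps: float) -> float:
    """Wall-clock time available per decode step at a given per-user rate."""
    return 1.0 / otps


def copy_stream_utilization(fetches_per_step: int, transfer_s: float, otps: float) -> float:
    """Fraction of a decode step the copy stream spends on serialized fetches."""
    return fetches_per_step * transfer_s / decode_step_s(otps)


if __name__ == "__main__":
    size_mb = implied_expert_bytes(TRANSFER_S, PCIE_GEN5_BYTES_PER_S) / 1e6
    util = copy_stream_utilization(HYPOTHETICAL_LAYERS, TRANSFER_S, PER_USER_OTPS)
    print(f"implied expert size: {size_mb:.1f} MB")
    print(f"decode step budget:  {decode_step_s(PER_USER_OTPS) * 1e3:.0f} ms")
    print(f"copy stream busy:    {util:.0%} of each step (one cold fetch per layer)")
```

Under these assumptions the copy stream stays well below saturation, which is the condition the overlap argument relies on. A larger number of cold fetches per layer would raise utilization proportionally.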

Case C × 8: eight independent instances

Eight Case-C instances run behind a session-aware load balancer with no cross-node communication. The fleet’s hourly cost is 8 × $4.00 = $32.00, aggregate output is 8 × 200 = 1,600 tok/s, and the fleet supports 8 × 8 = 64 concurrent users at 256K context.

$32.00 / HR ÷ (1,600 × 3600) TOK/HR × 1E6  =  $5.56 / M

Case C × 8 on spot: derivation of $2.50 / M

Spot pricing is 60 percent lower than on-demand, but the cloud provider can reclaim spot instances with 30 seconds to 2 minutes of notice. On the HGX, a single preemption removes all 22 active sessions, so a full hot spare at $12.80 per hour (8 × $1.60) is required. On the eight-instance fleet, a single preemption removes only 8 of 64 active sessions, so a single one-GPU spare at $1.60 per hour is sufficient.

8 × $1.60 (SPOT) + 1 × $1.60 (SPARE)        =  $14.40 / HR
$14.40 / HR ÷ (1,600 × 3600) TOK/HR × 1E6   =  $2.50 / M
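
The same arithmetic reproduces the other spot figure in the chart and the headline ratio. The sketch below assumes the Case A spot configuration is one 8×H200 HGX at spot pricing plus the $12.80 hot spare, which is our reading of the chart label "spot + hot spare". The variable names are illustrative.

```python
def cost_per_million(hourly_usd: float, tok_per_s: float) -> float:
    """Cost in $ per million output tokens."""
    return hourly_usd / (tok_per_s * 3600) * 1e6


H200_SPOT = 1.60        # $/hr per GPU, from the Inputs table
HGX_ON_DEMAND = 32.00   # $/hr for 8xH200 HGX

# Case A on spot: 8 spot GPUs serving, plus a full 8-GPU hot spare (assumed).
case_a_spot_hourly = 8 * H200_SPOT + 8 * H200_SPOT
# Case C x 8 on spot: 8 spot GPUs serving, plus one single-GPU spare.
case_c_spot_hourly = 8 * H200_SPOT + 1 * H200_SPOT

case_a_on_demand = cost_per_million(HGX_ON_DEMAND, 615)
case_a_spot = cost_per_million(case_a_spot_hourly, 615)
case_c_spot = cost_per_million(case_c_spot_hourly, 1600)

if __name__ == "__main__":
    print(f"Case A on-demand:   ${case_a_on_demand:.2f} / M")
    print(f"Case A spot+spare:  ${case_a_spot:.2f} / M")
    print(f"Case C x 8 on spot: ${case_c_spot:.2f} / M")
    print(f"ratio vs Case A on-demand: {case_a_on_demand / case_c_spot:.1f}x")
```

The ratio of Case A on-demand to Case C × 8 on spot is where the 5.8× in the title comes from.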

Prefetcher miss sensitivity

All cost figures above assume the prefetcher meets its architectural target hit rate of 87 percent. When the prefetcher misses, the cold-expert fetch lands on the critical path and per-user throughput drops. The architectural target, published hit rates, and the corresponding cost per million tokens for Case C × 8 on spot:

Hit rate   Source                 $/M
87%        Architectural target   $2.50
85%        fMoE (2025)            $2.65
80%        MoE-Infinity (2024)    $2.84
75%        Pre-gated MoE          $3.10

At the 80 percent hit rate that MoE-Infinity reports with reproducible open-source code, the configuration sits at $2.84 per million tokens. The cited models and batch configurations differ from Kimi K2.6 at 256K context; we expect to tune these techniques further for the workload here.
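
One way to read the table is to invert the cost formula and ask what throughput each row implies. The sketch below does that, assuming the fleet's hourly cost stays at $14.40 and the fleet still serves 64 concurrent users; both are our assumptions for illustration. It does not model how hit rate maps to throughput, it only back-solves from the published $/M column.

```python
FLEET_SPOT_HOURLY_USD = 14.40   # 8 spot H200s + 1 spare, from the derivation above
FLEET_USERS = 64                # 8 instances x batch of 8

# (hit rate, $ per million tokens) rows from the sensitivity table
TABLE = [(0.87, 2.50), (0.85, 2.65), (0.80, 2.84), (0.75, 3.10)]


def implied_tok_per_s(hourly_usd: float, usd_per_million: float) -> float:
    """Aggregate tok/s that yields the given $/M at the given hourly cost."""
    return hourly_usd / usd_per_million * 1e6 / 3600


def implied_rows(rows, hourly_usd=FLEET_SPOT_HOURLY_USD, users=FLEET_USERS):
    """Return (hit_rate, aggregate tok/s, per-user tok/s) for each table row."""
    out = []
    for hit_rate, usd_per_million in rows:
        aggregate = implied_tok_per_s(hourly_usd, usd_per_million)
        out.append((hit_rate, aggregate, aggregate / users))
    return out


if __name__ == "__main__":
    for hit_rate, aggregate, per_user in implied_rows(TABLE):
        print(f"{hit_rate:.0%} hit rate: {aggregate:7.1f} tok/s aggregate, "
              f"{per_user:4.1f} tok/s per user")
```

Under these assumptions, the 87 percent row recovers the 1,600 tok/s and 25 tok/s per user used in the derivation, and the lower rows imply per-user rates below that figure.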

These cost calculations do not include standard industry inference optimizations like speculative decoding.

References

The papers below establish the prefetcher and offloading mechanisms on related MoE models (DeepSeek, Mixtral, Switch, NLLB, Switch-Base-128), not on Kimi K2.6 itself. They are cited as evidence that the techniques generalize across MoE configurations, not as direct measurements of this workload.