Cloudflare open-sourced a lossless LLM compressor that shaves 22% off model weights
Unweight is Cloudflare Research's new BF16 weight compressor. 22% smaller bundles, 13% smaller inference footprint, 30-40% throughput overhead, BSD license.
Cloudflare Research open-sourced Unweight on April 17, a lossless compression scheme for LLM weight tensors that cuts distribution bundles by about 22% and inference memory footprint by about 13%. Unlike quantization, it’s bit-exact: compressed weights decode back to the original BF16 values. The GPU kernels shipped on GitHub under a BSD-3-Clause license, targeting NVIDIA H100 and H200 GPUs.
What Unweight actually does
The problem is memory bandwidth. On an H100, tensor cores can process data roughly 600 times faster than the memory subsystem can feed them. Token generation reads every model weight once per token, so the bottleneck isn’t compute; it’s how fast you can pull those weights into the SMs.
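That ~600x figure falls out of back-of-envelope arithmetic on the published H100 SXM specs (the spec numbers below are approximate public figures, not from the article):

```python
# Compute/bandwidth gap on an H100 (SXM), roughly.
bf16_tflops = 989e12      # dense BF16 tensor-core throughput, FLOP/s (approx.)
hbm_bandwidth = 3.35e12   # HBM3 bandwidth, bytes/s (approx.)
bytes_per_bf16 = 2

weights_per_second = hbm_bandwidth / bytes_per_bf16   # BF16 values streamed in
flops_per_weight = bf16_tflops / weights_per_second
print(round(flops_per_weight))  # ~590 FLOPs available per weight fetched
```

At one read per weight per token, the tensor cores spend most of each step waiting, which is exactly the slack a compressed weight stream can claw back.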
Unweight attacks that by compressing the exponent byte of each BF16 weight with Huffman coding. A BF16 number is 1 sign bit, 8 exponent bits, and 7 mantissa bits. Across a trained transformer layer, about 99% of weights land in the top 16 most common exponent values, so those 8 exponent bits encode into roughly 2.6 bits on average. That alone shrinks the exponent stream by ~30%.
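The skew is easy to see for yourself. The sketch below (illustrative only, not Cloudflare's code) pulls the 8-bit exponent field out of synthetic Gaussian weights standing in for a trained layer, then measures the field's Shannon entropy, which lower-bounds the average Huffman code length:

```python
import numpy as np

# Synthetic weights with a trained-layer-like scale; real layers will differ.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)

# BF16 is the top 16 bits of float32: 1 sign, 8 exponent, 7 mantissa bits,
# so the exponent is bits 23..30 of the float32 pattern.
bits = w.view(np.uint32)
exponents = ((bits >> 23) & 0xFF).astype(np.uint8)

counts = np.bincount(exponents, minlength=256)
p = counts[counts > 0] / counts.sum()
entropy = -(p * np.log2(p)).sum()          # lower bound on Huffman bits/symbol

top16 = np.sort(counts)[::-1][:16].sum() / counts.sum()
print(f"entropy ≈ {entropy:.2f} bits/exponent, top-16 coverage {top16:.1%}")
```

Even for this toy distribution the entropy lands in the same neighborhood as the article's ~2.6 bits, and the 16 most common exponent values cover essentially all weights.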
The sign and mantissa bits are left alone: they're effectively random, so lossless compression can't help there. Unweight also skips attention weights, embeddings, and layer norms entirely, compressing only the MLP gate/up/down projections, which is where the bandwidth pressure actually lives.
The clever part is the decompression
Shipping smaller weights is easy; the hard part is not handing the savings straight back to decompression overhead at inference time. Cloudflare’s solution is a reconstructive matrix multiplication kernel: the compressed weight tiles get pulled from VRAM into on-chip shared memory, decompressed there in 227 KB of the SM’s 228 KB shared-memory budget, then handed directly to the tensor cores without a round-trip through main memory.
Four execution pipelines dynamically pick between full Huffman decode plus cuBLAS, exponent-only decode, palette transcode, and direct palette depending on batch size and workload. An autotuner measures end-to-end throughput on the target card and sweeps configurations per projection, rather than using heuristics.
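The reconstructive idea itself is simple to sketch. This is a conceptual stand-in, not Cloudflare's kernel: `zlib` plays the role of the exponent-Huffman codec, a NumPy scratch array plays the role of SM shared memory, and the full uncompressed matrix never materializes:

```python
import zlib
import numpy as np

def compress_tiles(w, tile=64):
    """Store a weight matrix as independently compressed tiles."""
    tiles = {}
    for i in range(0, w.shape[0], tile):
        for j in range(0, w.shape[1], tile):
            tiles[(i, j)] = zlib.compress(w[i:i+tile, j:j+tile].tobytes())
    return tiles

def reconstructive_matmul(x, tiles, shape, tile=64):
    """Compute x @ W tile by tile, decoding each tile into scratch first."""
    m, n = shape
    out = np.zeros((x.shape[0], n), dtype=np.float32)
    for (i, j), blob in tiles.items():
        scratch = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
        scratch = scratch.reshape(min(tile, m - i), min(tile, n - j))
        out[:, j:j+scratch.shape[1]] += x[:, i:i+scratch.shape[0]] @ scratch
    return out

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, (256, 128)).astype(np.float32)
x = rng.normal(0, 1, (4, 256)).astype(np.float32)
tiles = compress_tiles(w)
assert np.allclose(reconstructive_matmul(x, tiles, w.shape), x @ w, atol=1e-4)
```

The real kernel's job is making that decode step cheap enough, per tile and per SM, that the saved memory traffic outweighs it; the four-pipeline autotuner is how Unweight picks the cheapest decode path per projection.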
On Llama 3.1 8B, end-to-end throughput overhead sits at roughly 41% at batch size 1 and drops to ~30% at batch 1024. That’s the honest trade right now: you save ~3 GB of VRAM per 8B-class model, but you pay 30-40% in tokens/second at current kernel maturity. The team frames this as a research release, not a production drop-in, and lists multiple optimization paths still open.
Who this matters for
Anyone serving models from a GPU where VRAM is the binding constraint. A 70B-parameter model that barely fits on an 80GB H100 today could, with Unweight on the MLP weights, leave room for a longer KV cache or a bigger batch. Anyone distributing weights over the network gets a straight 22% bandwidth win with no quality loss, which is relevant for Cloudflare’s own Workers AI product and for anyone building a model-distribution service.
It’s also a reference point for how to think about lossless compression in 2026. Quantization (INT8, FP8, INT4) gets you 50-75% smaller models but changes outputs in ways that are hard to audit. Unweight’s ceiling is lower, but it’s a free ride on quality: the same bits come out the other side. The two approaches compose; a Q8 model plus Unweight on top is a valid stack.
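The "free ride on quality" claim is mechanical, not statistical, and worth internalizing. A quick contrast (with `zlib` standing in for any lossless codec, and a toy symmetric quantizer standing in for a real Q8 scheme):

```python
import zlib
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(0, 0.02, 10_000).astype(np.float32)

# Lossless: smaller payload, bit-identical weights after the round trip.
restored = np.frombuffer(zlib.decompress(zlib.compress(w.tobytes())),
                         dtype=np.float32)
assert restored.tobytes() == w.tobytes()   # bit-exact, auditable by diff

# Lossy: int8 quantization shrinks 4x, but the round trip changes values.
scale = np.abs(w).max() / 127
q = np.round(w / scale).astype(np.int8)
dequant = q.astype(np.float32) * scale
assert not np.array_equal(dequant, w)      # outputs shift; audit is statistical
```

Verifying a lossless pipeline is a byte comparison; verifying a quantized model means re-running evals and arguing about acceptable drift.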
What this means for you
If you run your own inference on H100s or H200s, grab the unweight-kernels repo and benchmark it on a real workload before you bet on it. The 30-40% throughput hit is real at current kernel maturity; whether that’s worth the VRAM headroom depends entirely on whether you were memory-bound or compute-bound before. If you’re building anything that ships model weights over the network (a HuggingFace mirror, an edge-deployment system, an on-device distribution channel), Unweight is cheap to slot in for distribution-only use: decompress once at load time, keep the uncompressed tensors in memory, and you pay zero inference overhead for the 22% transfer savings. And if you care about lossless-vs-lossy trade-offs in the broader “how do we keep shrinking these models” debate, this paper is worth reading end to end. It’s the first time a major infrastructure company has bet on lossless as a serious lever, not just a curiosity.
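The distribution-only pattern needs no special kernels at all. A minimal sketch, again with `zlib` as a stand-in codec and hypothetical helper names: compress once before shipping, decompress once at startup, and serve from plain in-memory tensors, so per-token inference never touches the codec:

```python
import io
import zlib
import numpy as np

def pack_for_transfer(weights: dict) -> bytes:
    """Serialize and compress a weight dict for shipping over the network."""
    buf = io.BytesIO()
    np.savez(buf, **weights)
    return zlib.compress(buf.getvalue())

def load_at_startup(blob: bytes) -> dict:
    """Decompress once at load time; inference sees ordinary tensors."""
    raw = io.BytesIO(zlib.decompress(blob))
    with np.load(raw) as z:
        return {k: z[k] for k in z.files}

w = {"gate": np.random.default_rng(3).normal(0, 0.02, (64, 64)).astype(np.float32)}
blob = pack_for_transfer(w)
loaded = load_at_startup(blob)
assert np.array_equal(loaded["gate"], w["gate"])   # bit-exact after transfer
```

Swap the `zlib` calls for Unweight's encoder/decoder and you get the ~22% transfer savings with the same zero-overhead serving shape.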
Sources
- Unweight: how we compressed an LLM 22% without sacrificing quality — Cloudflare Blog
- unweight-kernels: Lossless compression of BF16 MLP weights for LLM inference on NVIDIA Hopper GPUs — GitHub / Cloudflare Research
- Lossless MLP Weight Compression for LLM Inference — Cloudflare Research