mzip

Detection-based compression that finds patterns zstd, brotli, and bzip2 miss.

Store the formula, not the data.

TL;DR

A C++17 single-header library that detects mathematical structure and per-file patterns before reaching for an entropy coder. On data that has structure (numeric sequences, templates, columns, prose, audio, gradients, logs) it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, and rar. On generic small source code it is competitive but rarely beats brotli. Every output is round-trip-verified before the encoder commits to it.

2026 update: mzip now ships a context-mixing (bzip3-class) entropy backend + an xz/brotli ensemble backstop (trial-and-keep, never regresses). On a held-out, type-stratified benchmark it is the overall ratio leader with 0 losses across 70 real files. See Primary Benchmark.

Benchmark	Result	Notes
Held-out, type-stratified (38 types, 70 real files)	28 wins · 42 ties · 0 losses — overall 7.43× ratio	The fair, current headline. vs brotli 4.59× (+38%), xz 4.85× (+35%), zstd-22 4.11× (+45%). Real GitHub + scientific time-series + fetched real formats, each vs gzip/bzip2/zstd-19/zstd-22/xz-9e/brotli-11 at max. All roundtrip-verified.
250 synthetic tests (50 types × 5 sizes)	235 / 250 (94.0%), avg 8.26×	Formula-friendly suite (seeded generators) — skews high vs real data. Top ratio among the 8 compressors tested.
enwik9 10 MB Wikipedia prose	2,671,197 bytes	Beats brotli:11 by 5.9%, bzip2:9 by 14.4% — smallest of any standard library compressor here.

Sections: Why · Strengths & limits · Primary benchmark · Synthetic suite · enwik9 prose · Real-world files · Strategies · Quick start

Why mzip?

Most compressors treat their input as an undifferentiated byte stream and rely on LZ77 plus entropy coding. Real data is rarely that unstructured. It has structure that LZ77 cannot reach: a sequential-ID column is a formula v[i] = a + b·i, a JSON API response is a template with variables, an audio waveform is a smooth function. mzip detects that structure first and substitutes the minimal description before the entropy coder ever runs.
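As a rough illustration of "store the formula, not the data", the sketch below detects a column that fits v[i] = a + b·i and regenerates it from three integers. The names `LinearFormula`, `detect_linear`, and `expand_linear` are hypothetical and are not taken from mzip's header; the sketch also assumes that differences between neighbouring values fit in a 64-bit integer.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of an arithmetic sequence v[i] = a + b*i.
struct LinearFormula {
    std::int64_t a;     // first value
    std::int64_t b;     // constant step between neighbours
    std::size_t count;  // number of elements to regenerate
};

// Returns the formula if every element fits v[i] = a + b*i, otherwise nullopt.
// Assumption: v[i] - v[i-1] never overflows int64_t.
std::optional<LinearFormula> detect_linear(const std::vector<std::int64_t>& v) {
    if (v.size() < 2) return std::nullopt;
    const std::int64_t a = v[0];
    const std::int64_t b = v[1] - v[0];
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] - v[i - 1] != b) return std::nullopt;  // structure broken
    }
    return LinearFormula{a, b, v.size()};
}

// Regenerates the original values from the stored formula.
std::vector<std::int64_t> expand_linear(const LinearFormula& f) {
    std::vector<std::int64_t> out;
    out.reserve(f.count);
    std::int64_t value = f.a;
    for (std::size_t i = 0; i < f.count; ++i) {
        out.push_back(value);
        if (i + 1 < f.count) value += f.b;  // avoid stepping past the last value
    }
    return out;
}
```

The stored description has a fixed size no matter how long the column is, which is why an LZ77-based coder cannot match it on this kind of input. How mzip actually serializes a detected formula is not described here.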

A few representative wins (1 MB inputs, vs the best of zstd:19 / brotli:11 / bzip2:9 / xz:9 / 7z / rar):

Input pattern	mzip output	Best other	Advantage
Sequential database IDs	32 B	bzip2: 3.4 KB	106× smaller
Repeating JSON API templates	10 KB	brotli: 49 KB	4.9× smaller
16-bit audio PCM	1.7 KB	bzip2: 4.0 KB	2.4× smaller
RGB image gradient	124 B	brotli: 397 B	3.2× smaller
10 MB enwik9 Wikipedia prose	2.67 MB	brotli:11: 2.83 MB	5.9% smaller

The first four wins come from formula or template detection — algorithmic substitutions LZ77 has no path to. The fifth win comes from a tuned BWT pipeline (capfold + word dict + LZP-after-dict + multi-tree Huffman with per-block dynamic trial) that puts mzip at the top of standard library compressors on long-form English prose.

Key Strengths

mzip is a specialist, not a general archiver. On data that has structure — numeric sequences, templates, columns, prose, audio, gradients, logs, SQL, XML, K8s — it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, or rar. On generic small handwritten source code (a few KB of TypeScript, Markdown, or Python) brotli's 120 KB static dictionary still usually wins. The 94.0% synthetic win rate (formula-friendly suite) and 38.3% real-world win rate (47-file GitHub corpus) are both honest measurements of those two regimes. Read both, then pick the tool for the data you actually have.

When to use mzip

Long-form English text (≥ 1 MB): prose, books, email archives, Wikipedia-class content. mzip beats brotli:11 by 5.9% on enwik9 10 MB.
Generated / templated data: K8s manifests, OpenAPI output, log streams, repeating JSON, SQL INSERTs, generated docs.
Numeric arrays and time series: sequential IDs, timestamps, counters, sensor data, audio PCM, GPS coordinates, float arrays.
Stream-separable formats: CSV, fixed-width logs, HTML, large XML, JSON Lines.
Anything ≥ 256 KB where compression ratio matters more than speed.

When not to use mzip

Latency-bound read paths: mzip decompresses ~38× slower than zstd. If your hot path opens compressed files repeatedly, use zstd.
Small handwritten source code (a few KB of code/config/markdown): brotli usually wins by 3–20%.
Already-compressed or encrypted data: there is nothing to detect, so the RAW path falls back to zstd:19. Just use zstd directly.
General folder backup / archival: ZPAQ, 7z LZMA2, and xz are mature CLI archivers with broader format support. mzip is a library, not an archive format.

Where mzip wins big

Data category	Synthetic win rate	Why
Numeric sequences (IDs, timestamps, counters)	100% (40/40)	Formula compression: `v[i] = a + b·i` beats any LZ77
Binary / audio / sensor	100% (20/20)	Delta + ALP for floats, Paeth/E8E9 predictors
Long-form text (prose / markdown)	100% (15/15)	BWT pipeline + capfold + word dict + LZP-after-dict
Code (JS, Py, Go, Rust, ...) at scale	94.5% (52/55)	Identifier-stream separation, BWT, ZSTD_DICT trial
Logs (access / syslog / JSON log)	80–96%	Columnar separation + BWT per column
Anything ≥ 256 KB	97% (97/100)	More data → more patterns to detect
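The table names the Paeth predictor without showing it. The sketch below is the predictor as defined in the PNG specification, applied to a row-major image with one byte per sample. How and where mzip applies Paeth is not stated in this document, so `Image`, `predict_at`, `paeth_filter`, and `paeth_unfilter` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Image = std::vector<std::uint8_t>;  // row-major, one byte per sample

// Paeth predictor (PNG filter type 4): pick the neighbour closest to
// left + up - up_left, preferring left, then up, then up_left on ties.
std::uint8_t paeth(std::uint8_t left, std::uint8_t up, std::uint8_t up_left) {
    const int p = int(left) + int(up) - int(up_left);
    const int pa = std::abs(p - int(left));
    const int pb = std::abs(p - int(up));
    const int pc = std::abs(p - int(up_left));
    if (pa <= pb && pa <= pc) return left;
    if (pb <= pc) return up;
    return up_left;
}

// Prediction for sample i; neighbours outside the image count as 0.
static std::uint8_t predict_at(const Image& px, std::size_t i, std::size_t width) {
    const bool has_left = (i % width) != 0;
    const bool has_up = i >= width;
    const std::uint8_t left = has_left ? px[i - 1] : std::uint8_t{0};
    const std::uint8_t up = has_up ? px[i - width] : std::uint8_t{0};
    const std::uint8_t up_left =
        (has_left && has_up) ? px[i - width - 1] : std::uint8_t{0};
    return paeth(left, up, up_left);
}

// Replace each sample with (sample - prediction) mod 256.
Image paeth_filter(const Image& img, std::size_t width) {
    assert(width > 0);
    Image residual(img.size());
    for (std::size_t i = 0; i < img.size(); ++i) {
        residual[i] = static_cast<std::uint8_t>(img[i] - predict_at(img, i, width));
    }
    return residual;
}

// Inverse: rebuild samples in order, predicting from already-decoded neighbours.
Image paeth_unfilter(const Image& residual, std::size_t width) {
    assert(width > 0);
    Image img(residual.size());
    for (std::size_t i = 0; i < residual.size(); ++i) {
        img[i] = static_cast<std::uint8_t>(residual[i] + predict_at(img, i, width));
    }
    return img;
}
```

On smooth data the residuals collapse to a handful of repeated small values, which is what makes the stream cheap for the entropy coder that follows.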

Where mzip loses

Scenario	Winner	Why
Small handwritten source code (4–30 KB)	brotli:11	brotli ships a 120 KB pre-built English/web dictionary; per-file approaches can't fully match it without shipping their own
Random / encrypted / already-compressed data	zstd	No patterns to detect — entropy coder is all that's left
Decompression speed	zstd	mzip is ~38× slower to decompress (BWT vs LZ77 inverse). Trade-off, not a bug. See Decompression Speed (the trade-off).

Primary Benchmark (held-out, type-stratified)

This is mzip's fairest real-world measure and the current source of truth. benchmark_types.py compresses 38 content types / 76 files — 70 real files (real GitHub source, real scientific time-series, and fetched real proto/rst/tsv/svg/ndjson/diff) plus 6 labeled synthetic — against every standard at max settings (gzip-9, bzip2-9, zstd-19, zstd-22, xz-9e, brotli-11). Every mzip output is roundtrip-verified (0 failures). Reproduce: bash build_evals.sh && python3 benchmark_types.py.

Real files (overall)	mzip+CM	brotli-11	xz-9e	zstd-22	bzip2-9	gzip-9
Compression ratio	7.43×	4.59×	4.85×	4.11×	3.70×	3.39×

mzip's output is 38% smaller than brotli-11's, 35% smaller than xz-9e's, and 45% smaller than zstd-22's. The context-mixing backend alone contributes +4.42% over mzip-without-CM.
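The percentages follow from the ratios in the table. A minimal sketch of the arithmetic, assuming each overall ratio is total input bytes divided by total output bytes on the same corpus (the function names are made up for this example):

```cpp
#include <cmath>

// Fraction by which compressor A's output is smaller than compressor B's on
// the same input, given compression ratios (input bytes / output bytes).
// Output size is input / ratio, so (out_b - out_a) / out_b = 1 - ratio_b / ratio_a.
double size_reduction(double ratio_a, double ratio_b) {
    return 1.0 - ratio_b / ratio_a;
}

// The same value as a whole percentage, rounded to nearest.
long size_reduction_percent(double ratio_a, double ratio_b) {
    return std::lround(100.0 * size_reduction(ratio_a, ratio_b));
}
```

For example, `size_reduction_percent(7.43, 4.59)` evaluates 1 − 4.59 / 7.43 ≈ 0.382 and returns 38. Note that this is a statement about output size; the ratio itself is about 62% higher (7.43 / 4.59 ≈ 1.62).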

Per-file standing (70 real files): 28 strict wins · 42 framing-ties · 0 losses. On every real file mzip either beats the best standard or lands within 32 B of it. A "framing-tie" means mzip's compressed payload is within 32 B of the best standard; the gap is mzip's self-describing archive header (~10–14 B/file) versus a raw stream, not a compression loss. (Across all 76 files including synthetic: 34 wins / 42 ties / 0 losses.)
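One way to read the win / framing-tie / loss rule is sketched below. This is an assumption about how the standings are scored, not code from benchmark_types.py: it treats any strictly smaller output as a win and counts a tie when mzip is larger by at most 32 bytes.

```cpp
#include <cstddef>

enum class Standing { Win, FramingTie, Loss };

// Hypothetical scoring rule: strict win if mzip is smaller, framing-tie if it
// is larger by no more than tie_margin bytes, loss otherwise.
Standing classify(std::size_t mzip_bytes, std::size_t best_standard_bytes,
                  std::size_t tie_margin = 32) {
    if (mzip_bytes < best_standard_bytes) return Standing::Win;
    if (mzip_bytes - best_standard_bytes <= tie_margin) return Standing::FramingTie;
    return Standing::Loss;
}
```

With this reading, a file where mzip emits 1,012 bytes against a best standard of 1,000 bytes is a framing-tie, because the 12-byte gap is in the range of the archive header.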

Where mzip pulls ahead hardest (real types — mzip+CM ratio vs the best standard for that type):

Type	mzip+CM	Best standard	Why mzip wins
Numeric (sensor floats)	17.63×	xz 2.76×	formula / ALP float coding + `bwt9` backstop
SQL	17.27×	xz 15.98×	columnar separation + CM
Log	16.95×	bzip2 14.44×	per-column separation + CM
XML	11.32×	xz 6.98×	tag / content stream split
Metrics	10.50×	bzip2 10.00×	columnar + CM
JSON	10.15×	brotli 9.18×	JSON columnar + CM
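Several rows above credit columnar separation. The idea can be sketched as follows, under simplifying assumptions that are mine rather than mzip's: rows are newline-terminated, every row has the same number of fields, and fields contain no quoting or escaped separators. `split_columns` and `join_columns` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split delimited rows into one newline-separated stream per column, so that
// similar values (all timestamps, all status codes) sit next to each other.
std::vector<std::string> split_columns(const std::string& text, char sep = ',') {
    std::vector<std::string> columns;
    std::istringstream rows(text);
    std::string row;
    while (std::getline(rows, row)) {
        std::size_t col = 0, start = 0;
        while (true) {
            const std::size_t end = row.find(sep, start);
            if (col == columns.size()) columns.emplace_back();
            const std::size_t len =
                (end == std::string::npos) ? std::string::npos : end - start;
            columns[col] += row.substr(start, len);
            columns[col] += '\n';
            if (end == std::string::npos) break;
            start = end + 1;
            ++col;
        }
    }
    return columns;
}

// Inverse: take one field from each column stream per row.
std::string join_columns(const std::vector<std::string>& columns, char sep = ',') {
    std::string out;
    if (columns.empty()) return out;
    std::vector<std::size_t> pos(columns.size(), 0);
    while (pos[0] < columns[0].size()) {
        for (std::size_t c = 0; c < columns.size(); ++c) {
            const std::size_t end = columns[c].find('\n', pos[c]);
            assert(end != std::string::npos);  // uniform row count assumed
            out.append(columns[c], pos[c], end - pos[c]);
            pos[c] = end + 1;
            out += (c + 1 < columns.size()) ? sep : '\n';
        }
    }
    return out;
}
```

Each column stream can then be handed to whichever backend compresses it best; a real implementation would also need to handle ragged rows and quoted fields.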

How it gets there: two trial-and-keep layers, neither of which can regress (the encoder keeps the smallest roundtrip-verified candidate per block), plus an audit that finds blocks where the expected encoder did not fire:

CM backend (cm_backend.hpp) — a BWT + context-mixing range coder (bzip3-class), wired into bwt9 as mode 2, so every BWT call-site (general blocks, DBF, CSV, BWT_TEXT) trials it and keeps it when smaller. This carries most of mzip's edge on text / code / logs.
Ensemble backstop — each block also trials xz (liblzma -9e) and brotli-11 and keeps the smallest, so mzip never loses where those tools would win (this flipped SQL / YAML / config). bwt9 is itself a universal backstop that catches BWT-friendly data the type detectors miss (this flipped real float / sensor arrays that previously lost to bzip2).
Encoder-firing audit (diagnose_encoders.py) — per-block telemetry (MZIP_STATS=1) cross-referenced with each type's expected encoder and win-gap, to systematically find where a specialized encoder doesn't fire. This is how the numeric losses and an under-representative TypeScript sample were found and fixed.
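The trial-and-keep rule described above can be sketched in a few lines. The `Codec` interface and `trial_and_keep` function are hypothetical stand-ins for this example, not mzip's API; the point is only the selection logic: encode with every candidate, discard anything that fails to decode back to the input, and keep the smallest survivor.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical codec interface for this sketch.
struct Codec {
    std::string name;
    std::function<Bytes(const Bytes&)> encode;
    std::function<Bytes(const Bytes&)> decode;
};

struct Candidate {
    std::string codec_name;
    Bytes payload;
};

// Try every codec on the block, drop any whose output does not round-trip,
// and return the smallest remaining payload (nullopt if none round-trips).
std::optional<Candidate> trial_and_keep(const Bytes& block,
                                        const std::vector<Codec>& codecs) {
    std::optional<Candidate> best;
    for (const Codec& codec : codecs) {
        Bytes encoded = codec.encode(block);
        if (codec.decode(encoded) != block) continue;  // failed round-trip
        if (!best || encoded.size() < best->payload.size()) {
            best = Candidate{codec.name, std::move(encoded)};
        }
    }
    return best;
}
```

Because a new codec can only replace the current best when it is strictly smaller and verified, adding one to the list cannot make the output larger. The cost is that every candidate runs on every block, which is where the low compression throughput comes from.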

Fairness notes. real_bench/ is held-out; the dictionaries are trained on a separate train_corpus/ (mzip is never benchmarked on its own training data); samples/ are synthetic (labeled). mzip is a trial-everything ensemble that trades speed for ratio: it compresses at ~0.2 MB/s. The brotli backstop is gated to ≤1 MB blocks, which gives roughly a 2× speedup on large data (MZIP_MAXRATIO opts out). zstd and brotli are 10–100×+ faster; if read latency matters, use them.

The synthetic suite and 47-file GitHub corpus below predate this harness and remain for reference. The held-out numbers above are the headline.

Benchmark Results

Secondary benchmark — a formula-friendly synthetic suite. The primary held-out benchmark above is the fair real-world measure; these synthetic numbers skew high because the suite includes formula-compressible content (sequential IDs, timestamps, gradients) that real data rarely contains.

All synthetic benchmarks below are generated deterministically by generators.hpp (seeded RNG, seed=42). Click sample links to download the exact input bytes the table reports on.

Roundtrip-verified. mzip's encoder verifies its own output decodes back to the input before returning. Strategies that produce non-roundtrippable bytes are auto-discarded in favor of the next-best valid candidate. All 250 synthetic + 47 real-world results below pass verification.

Overall Compressor Scoreboard (250 tests: 50 types × 5 sizes)

Compressor	Avg Ratio	Range	Wins	Win%	Rank
mzip	8.26x	1.0–32768x	235	94.0%	1
brotli:11	5.78x	1.0–1716x	13	5.2%	2
bzip2:9	5.66x	1.0–1001x	5	2.0%	3
rar:m5	5.97x	1.0–1014x	0	0.0%	4
xz:9	5.89x	1.0–997x	0	0.0%	5
7z:mx9	5.88x	1.0–922x	0	0.0%	6
zstd:19	5.14x	1.0–2641x	0	0.0%	7
gzip:9	4.78x	1.0–240x	0	0.0%	8

Verified: 250/250 roundtrips pass. Total input: 66.60 MB. lz4/snappy excluded (speed-focused, optimize for a different point on the curve).

Decompression Speed (the trade-off)

mzip optimizes for the smallest output and accepts a slower decode in return. If your workload is read-heavy and latency-bound, this is the wrong tool — use zstd. If you write once and store / transfer often, the savings show up on every later read.

Compressor	Decode time (66.6 MB)	Decode speed
zstd:19	92 ms	722 MB/s
mzip	3,523 ms	19 MB/s

zstd is roughly 38× faster to decompress. Most of the gap is BWT-inverse vs LZ77-inverse — a structural difference, not a tuning gap. Compression throughput depends on the strategy chosen; the production trial-everything ensemble (CM + BWT + dictionaries + xz + brotli per block) runs at ~0.2 MB/s — ratio over speed (see the primary benchmark fairness note). Streaming / incremental decode is on the roadmap but not in main yet.
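To make the structural difference concrete, here is a textbook BWT pair: a naive forward transform that sorts all rotations, and the standard inverse via LF-mapping. It is an illustration only and is not mzip's bwt9 code; production implementations build a suffix array instead of sorting rotations.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Output of the forward transform: the last column of the sorted rotation
// matrix plus the row that holds the original string.
struct BwtResult {
    std::string last_column;
    std::size_t primary_index;
};

// Naive forward BWT: sort rotation start offsets, O(n^2 log n) worst case.
BwtResult bwt_forward(const std::string& s) {
    const std::size_t n = s.size();
    std::vector<std::size_t> rot(n);
    std::iota(rot.begin(), rot.end(), std::size_t{0});
    std::sort(rot.begin(), rot.end(), [&](std::size_t x, std::size_t y) {
        for (std::size_t k = 0; k < n; ++k) {
            const unsigned char cx = s[(x + k) % n];
            const unsigned char cy = s[(y + k) % n];
            if (cx != cy) return cx < cy;
        }
        return false;  // identical rotations compare equal
    });
    BwtResult r{std::string(n, '\0'), 0};
    for (std::size_t i = 0; i < n; ++i) {
        r.last_column[i] = s[(rot[i] + n - 1) % n];
        if (rot[i] == 0) r.primary_index = i;
    }
    return r;
}

// Inverse BWT via LF-mapping: rebuilds the text back to front.
std::string bwt_inverse(const BwtResult& r) {
    const std::string& last = r.last_column;
    const std::size_t n = last.size();
    if (n == 0) return {};
    // start[c] = number of bytes smaller than c = first row beginning with c.
    std::vector<std::size_t> start(257, 0);
    for (unsigned char c : last) ++start[c + 1];
    for (int c = 0; c < 256; ++c) start[c + 1] += start[c];
    // lf[i] = row of the rotation obtained by moving last[i] to the front.
    std::vector<std::size_t> lf(n);
    for (std::size_t i = 0; i < n; ++i) {
        lf[i] = start[static_cast<unsigned char>(last[i])]++;
    }
    std::string out(n, '\0');
    std::size_t row = r.primary_index;
    for (std::size_t k = n; k-- > 0;) {
        out[k] = last[row];
        row = lf[row];  // data-dependent jump to an arbitrary row
    }
    return out;
}
```

The final loop is the relevant part: each step jumps to a row that depends on the previous one, so the decoder makes one dependent, effectively random memory access per output byte. An LZ77 decoder mostly copies contiguous runs, which is far friendlier to caches and prefetchers.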

Win Rate by Size

Size	Wins	Total	Win%
4KB	43	50	86.0%
16KB	48	50	96.0%
64KB	47	50	94.0%
256KB	48	50	96.0%
1MB	49	50	98.0%

Top 10 mzip Wins

Type	mzip	2nd Best	Advantage
Database IDs (1MB)	32B (32768x)	3.4KB	106.8x better
Timestamps (1MB)	32B (32768x)	2.7KB	84.2x better
Database IDs (256KB)	32B (8192x)	937B	29.3x better
Timestamps (256KB)	32B (8192x)	772B	24.1x better
Database IDs (64KB)	32B (2048x)	301B	9.4x better
Timestamps (64KB)	32B (2048x)	287B	9.0x better
Image gradient (256KB)	53B (4946x)	323B	6.1x better
Image gradient (64KB)	39B (1680x)	212B	5.4x better
Timestamps (16KB)	32B (512x)	160B	5.0x better
JSON API (1MB)	10KB (104x)	49KB	4.9x better

Where Others Win (Top 10 Gaps)

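Most of the remaining gaps are on small (4–16 KB) text/code/config inputs, where brotli's pre-built 120 KB dictionary gives an edge no per-file approach can fully match without shipping its own dictionary. The larger entries in the table are Terraform (64 KB to 1 MB) and INI config (64 KB, where bzip2 leads).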

Type	Size	mzip	Best	Gap
Terraform	1MB	41KB	brotli: 40KB	+825B
Terraform	64KB	3.4KB	brotli: 3.0KB	+379B
HTML	4KB	1004B	brotli: 821B	+183B
Terraform	256KB	10KB	brotli: 10KB	+161B
JSON log	16KB	2.4KB	brotli: 2.3KB	+148B
Makefile	4KB	809B	brotli: 697B	+112B
JSON log	4KB	866B	brotli: 768B	+98B
INI config	64KB	10KB	bzip2: 10KB	+87B
Unicode text	4KB	626B	brotli: 549B	+77B
Bash	4KB	825B	brotli: 765B	+60B

Long-Form Prose (enwik9)

enwik9 is the canonical benchmark for text compressors — the first 10⁹ bytes of an English Wikipedia dump. The numbers below are mzip vs every standard library compressor on the first 1 MB and 10 MB prefixes.

Compressor	enwik9 1 MB	enwik9 10 MB	Class
zstd:19	312,639	2,921,957	LZ77 + entropy
gzip:9	356,643	3,720,323	LZ77 + Huffman
bzip2:9	294,484	3,054,639	BWT + multi-tree Huffman
xz:9	302,832	2,844,360	LZMA
7z:mx9	302,910	2,844,433	LZMA2
brotli:11	293,057	2,827,632	LZ77 + Huffman + 120 KB static dict
mzip	286,307	2,671,197	BWT + capfold + word dict + LZP-after-dict, per-block dynamic trial
bzip3 †	~245,000	~2,300,000	BWT + LZP + arithmetic + 1-symbol context model (CLI, not a library)
bsc-m03 †	—	—	LZP + BWT + M03 context coder (CLI, research-grade, not a library)

mzip produces the smallest output on both prefixes among standard library compressors. On 10 MB it beats brotli:11 by 5.9%, xz/7z by 6.5%, and bzip2:9 by 14.4%; on 1 MB it beats brotli:11 by 2.4%, xz/7z by 5.8%, and bzip2:9 by 2.9%. These percentages express how much larger each competitor's output is relative to mzip's.

† The honest ceiling. bzip3 and bsc-m03 are CLI archive tools (not embeddable libraries) that replace the entropy backend with adaptive arithmetic coding driven by a post-BWT context model. mzip now closes much of this gap inside the library: cm_backend.hpp adds exactly that — a BWT + context-mixing range coder (bzip3-class), wired as bwt9 mode 2 (see Primary Benchmark above), trial-and-keep so it never regresses. bsc-m03 / ZPAQ / cmix still lead at the cost of much slower compression and no library API.