Cranot/mzip

mzip

Detection-based compression that finds patterns zstd, brotli, and bzip2 miss.

Store the formula, not the data.

TL;DR

A C++17 single-header library that detects mathematical structure and per-file patterns before reaching for an entropy coder. On data that has structure (numeric sequences, templates, columns, prose, audio, gradients, logs) it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, and rar. On generic small source code it is competitive but rarely beats brotli. Every output is round-trip-verified before the encoder commits to it.

2026 update: mzip now ships a context-mixing (bzip3-class) entropy backend + an xz/brotli ensemble backstop (trial-and-keep, never regresses). On a held-out, type-stratified benchmark it is the overall ratio leader with 0 losses across 70 real files. See Primary Benchmark.

| Benchmark | Result | Notes |
|---|---|---|
| Held-out, type-stratified (38 types, 70 real files) | 28 wins · 42 ties · 0 losses; overall 7.43× ratio | The fair, current headline. vs brotli 4.59× (+38%), xz 4.85× (+35%), zstd-22 4.11× (+45%). Real GitHub + scientific time-series + fetched real formats, each vs gzip/bzip2/zstd-19/zstd-22/xz-9e/brotli-11 at max. All roundtrip-verified. |
| 250 synthetic tests (50 types × 5 sizes) | 235 / 250 (94.0%), avg 8.26× | Formula-friendly suite (seeded generators); skews high vs real data. Top ratio among the 8 compressors tested. |
| enwik9 10 MB Wikipedia prose | 2,671,197 bytes | Beats brotli:11 by 5.9%, bzip2:9 by 14.4%; smallest of any standard library compressor here. |

Sections: Why · Strengths & limits · Primary benchmark · Synthetic suite · enwik9 prose · Real-world files · Strategies · Quick start


Why mzip?

Most compressors treat their input as an undifferentiated byte stream and rely on LZ77 + entropy coding. Real data is rarely random. It has structure that LZ77 cannot reach: a sequential-ID column is a formula v[i] = a + b·i, a JSON API response is a template with variables, an audio waveform is a smooth function. mzip detects that structure first and substitutes the minimal description before the entropy coder ever runs.
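To make the v[i] = a + b·i idea concrete, here is a minimal sketch of linear-sequence detection. It is not mzip's actual detector: the names `LinearFormula`, `detect_linear`, and `expand` are hypothetical, and it assumes the values are already parsed into 64-bit integers whose differences do not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of a sequence v[i] = a + b*i.
struct LinearFormula {
    std::int64_t a;     // first value
    std::int64_t b;     // constant step between neighbours
    std::size_t count;  // number of values in the sequence
};

// Returns the formula if every consecutive difference equals the first one,
// otherwise std::nullopt (the caller would fall back to another strategy).
std::optional<LinearFormula> detect_linear(const std::vector<std::int64_t>& v) {
    if (v.size() < 2) return std::nullopt;
    const std::int64_t step = v[1] - v[0];
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] - v[i - 1] != step) return std::nullopt;
    }
    return LinearFormula{v[0], step, v.size()};
}

// Rebuilds the original values from the three stored numbers alone.
std::vector<std::int64_t> expand(const LinearFormula& f) {
    std::vector<std::int64_t> out(f.count);
    for (std::size_t i = 0; i < f.count; ++i) {
        out[i] = f.a + f.b * static_cast<std::int64_t>(i);
    }
    return out;
}
```

However long the column is, the stored description stays three numbers, which is why this kind of input can shrink to a fixed few bytes.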

A few representative wins (1 MB inputs unless the row says otherwise, vs the best of zstd:19 / brotli:11 / bzip2:9 / xz:9 / 7z / rar):

| Input pattern | mzip output | Best other | Advantage |
|---|---|---|---|
| Sequential database IDs | 32 B | bzip2: 3.4 KB | 106× smaller |
| Repeating JSON API templates | 10 KB | brotli: 49 KB | 4.9× smaller |
| 16-bit audio PCM | 1.7 KB | bzip2: 4.0 KB | 2.4× smaller |
| RGB image gradient | 124 B | brotli: 397 B | 3.2× smaller |
| 10 MB enwik9 Wikipedia prose | 2.67 MB | brotli:11: 2.83 MB | 5.9% smaller |

The first four wins come from formula or template detection — algorithmic substitutions LZ77 has no path to. The fifth win comes from a tuned BWT pipeline (capfold + word dict + LZP-after-dict + multi-tree Huffman with per-block dynamic trial) that puts mzip at the top of standard library compressors on long-form English prose.


Key Strengths

mzip is a specialist, not a general archiver. On data that has structure — numeric sequences, templates, columns, prose, audio, gradients, logs, SQL, XML, K8s — it produces meaningfully smaller output than zstd:19, brotli:11, bzip2:9, xz:9, 7z, or rar. On generic small handwritten source code (a few KB of TypeScript, Markdown, or Python) brotli's 120 KB static dictionary still usually wins. The 94.0% synthetic win rate (formula-friendly suite) and 38.3% real-world win rate (47-file GitHub corpus) are both honest measurements of those two regimes. Read both, then pick the tool for the data you actually have.

When to use mzip

  • Long-form English text (≥ 1 MB): prose, books, email archives, Wikipedia-class content. mzip beats brotli:11 by 5.9% on enwik9 10 MB.
  • Generated / templated data: K8s manifests, OpenAPI output, log streams, repeating JSON, SQL INSERTs, generated docs.
  • Numeric arrays and time series: sequential IDs, timestamps, counters, sensor data, audio PCM, GPS coordinates, float arrays.
  • Stream-separable formats: CSV, fixed-width logs, HTML, large XML, JSON Lines.
  • Anything ≥ 256 KB where compression ratio is the budget you care about.

When not to use mzip

  • Latency-bound read paths: mzip decompresses ~38× slower than zstd. If your hot path opens compressed files repeatedly, use zstd.
  • Small handwritten source code (a few KB of code/config/markdown): brotli usually wins by 3–20%.
  • Already-compressed or encrypted data: nothing to detect, RAW path falls back to zstd:19 — just use zstd directly.
  • General folder backup / archival: ZPAQ, 7z LZMA2, and xz are mature CLI archivers with broader format support. mzip is a library, not an archive format.

Where mzip wins big

| Data category | Synthetic win rate | Why |
|---|---|---|
| Numeric sequences (IDs, timestamps, counters) | 100% (40/40) | Formula compression: v[i] = a + b·i beats any LZ77 |
| Binary / audio / sensor | 100% (20/20) | Delta + ALP for floats, Paeth/E8E9 predictors |
| Long-form text (prose / markdown) | 100% (15/15) | BWT pipeline + capfold + word dict + LZP-after-dict |
| Code (JS, Py, Go, Rust, ...) at scale | 94.5% (52/55) | Identifier-stream separation, BWT, ZSTD_DICT trial |
| Logs (access / syslog / JSON log) | 80–96% | Columnar separation + BWT per column |
| Anything ≥ 256 KB | 97% (97/100) | More data → more patterns to detect |
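The "Delta" entry in the table refers to a standard transform: store each value as its difference from the previous one, so slowly changing series turn into small, repetitive residuals that an entropy coder handles well. A minimal sketch, assuming 32-bit unsigned samples (the function names are illustrative and not taken from mzip.hpp):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Replace each value with its difference from the previous value.
// Unsigned arithmetic wraps modulo 2^32, so the transform is always reversible.
std::vector<std::uint32_t> delta_encode(const std::vector<std::uint32_t>& v) {
    std::vector<std::uint32_t> out(v.size());
    std::uint32_t prev = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = v[i] - prev;
        prev = v[i];
    }
    return out;
}

// Inverse transform: a running sum of the residuals restores the input.
std::vector<std::uint32_t> delta_decode(const std::vector<std::uint32_t>& d) {
    std::vector<std::uint32_t> out(d.size());
    std::uint32_t prev = 0;
    for (std::size_t i = 0; i < d.size(); ++i) {
        prev += d[i];
        out[i] = prev;
    }
    return out;
}
```

The transform does not shrink anything by itself; the gain comes from the later coding stage seeing many near-identical residuals instead of ever-growing values.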

Where mzip loses

| Scenario | Winner | Why |
|---|---|---|
| Small handwritten source code (4–30 KB) | brotli:11 | brotli ships a 120 KB pre-built English/web dictionary; per-file approaches can't fully match it without shipping their own |
| Random / encrypted / already-compressed data | zstd | No patterns to detect; the entropy coder is all that's left |
| Decompression speed | zstd | mzip is ~38× slower to decompress (BWT vs LZ77 inverse). Trade-off, not a bug. See Decompression Speed (the trade-off). |

Primary Benchmark (held-out, type-stratified)

This is mzip's fairest real-world measure and the current source of truth. benchmark_types.py compresses 38 content types / 76 files — 70 real files (real GitHub source, real scientific time-series, and fetched real proto/rst/tsv/svg/ndjson/diff) plus 6 labeled synthetic — against every standard at max settings (gzip-9, bzip2-9, zstd-19, zstd-22, xz-9e, brotli-11). Every mzip output is roundtrip-verified (0 failures). Reproduce: bash build_evals.sh && python3 benchmark_types.py.

| Real files (overall) | mzip+CM | brotli-11 | xz-9e | zstd-22 | bzip2-9 | gzip-9 |
|---|---|---|---|---|---|---|
| Compression ratio | 7.43× | 4.59× | 4.85× | 4.11× | 3.70× | 3.39× |

mzip's output is 38% smaller than brotli-11's, 35% smaller than xz-9e's, and 45% smaller than zstd-22's. The context-mixing backend alone contributes +4.42% over mzip-without-CM.
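The size percentages follow directly from the ratios in the table, because the input size cancels when two outputs of the same input are compared. A small sketch of that arithmetic (`percent_smaller` is a hypothetical helper; it assumes each overall ratio is total input size divided by total output size):

```cpp
#include <cassert>
#include <cmath>

// Percent by which compressor A's output is smaller than compressor B's,
// given each compressor's ratio (input size / output size) on the same input.
long percent_smaller(double ratio_a, double ratio_b) {
    assert(ratio_a > 0.0 && ratio_b > 0.0);
    // size_a / size_b == ratio_b / ratio_a, since the input size cancels.
    return std::lround(100.0 * (1.0 - ratio_b / ratio_a));
}
```

For example, `percent_smaller(7.43, 4.59)` reproduces the brotli figure quoted above.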

Per-file standing (70 real files): 28 strict wins · 42 framing-ties · 0 losses. mzip matches or beats the best standard on every real file. A "framing-tie" means mzip's compressed payload is within 32 B of the best standard — the gap is mzip's self-describing archive header (~10–14 B/file) versus a raw stream, not a compression loss. (Across all 76 files including synthetic: 34 wins / 42 ties / 0 losses.)
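The per-file standing can be expressed as a three-way classification. The sketch below is one reading of the rule, not the code in benchmark_types.py: it assumes a strict win means mzip is strictly smaller, and that a framing-tie means mzip is equal to or at most 32 B larger than the best standard. `Standing` and `classify` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

enum class Standing { Win, FramingTie, Loss };

// Tolerance quoted in the text: header overhead up to 32 B counts as a tie.
constexpr std::size_t kFramingToleranceBytes = 32;

// Classifies one file from mzip's size and the best standard compressor's size.
// Assumption: sizes are final on-disk byte counts for the same input file.
Standing classify(std::size_t mzip_bytes, std::size_t best_standard_bytes) {
    if (mzip_bytes < best_standard_bytes) return Standing::Win;
    if (mzip_bytes - best_standard_bytes <= kFramingToleranceBytes) {
        return Standing::FramingTie;
    }
    return Standing::Loss;
}
```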

Where mzip pulls ahead hardest (real types — mzip+CM ratio vs the best standard for that type):

| Type | mzip+CM | Best standard | Why mzip wins |
|---|---|---|---|
| Numeric (sensor floats) | 17.63× | xz 2.76× | formula / ALP float coding + bwt9 backstop |
| SQL | 17.27× | xz 15.98× | columnar separation + CM |
| Log | 16.95× | bzip2 14.44× | per-column separation + CM |
| XML | 11.32× | xz 6.98× | tag / content stream split |
| Metrics | 10.50× | bzip2 10.00× | columnar + CM |
| JSON | 10.15× | brotli 9.18× | JSON columnar + CM |
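Several rows above credit columnar separation. The idea is to regroup row-oriented text so that each column becomes its own stream, which puts similar values next to each other before any backend runs. A minimal sketch under strong assumptions (no quoting, a fixed column count, every row ends in a newline); `to_columns` and `from_columns` are hypothetical names, not mzip's CSV_COLUMNAR code:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits delimiter-separated rows into one stream per column.
// Each field is terminated by '\n' inside its column stream.
std::vector<std::string> to_columns(const std::string& rows, std::size_t ncols,
                                    char delim = ',') {
    std::vector<std::string> cols(ncols);
    std::size_t col = 0;
    for (char c : rows) {
        assert(col < ncols);  // input must not have more fields than ncols
        if (c == delim) {
            cols[col].push_back('\n');
            ++col;
        } else if (c == '\n') {
            cols[col].push_back('\n');
            col = 0;
        } else {
            cols[col].push_back(c);
        }
    }
    return cols;
}

// Inverse: reads one field from each column in turn to rebuild the rows.
std::string from_columns(const std::vector<std::string>& cols, char delim = ',') {
    std::string rows;
    if (cols.empty()) return rows;
    std::vector<std::size_t> pos(cols.size(), 0);
    while (pos[0] < cols[0].size()) {
        for (std::size_t c = 0; c < cols.size(); ++c) {
            const std::size_t end = cols[c].find('\n', pos[c]);
            assert(end != std::string::npos);  // streams must be well formed
            rows.append(cols[c], pos[c], end - pos[c]);
            pos[c] = end + 1;
            rows.push_back(c + 1 < cols.size() ? delim : '\n');
        }
    }
    return rows;
}
```

After the split, a column such as an HTTP method or a status code is a long run of near-identical fields, which is the kind of input a BWT or context-mixing stage compresses best.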

How it gets there — three trial-and-keep layers, none of which can regress (the encoder keeps the smallest roundtrip-verified candidate per block):

  • CM backend (cm_backend.hpp) — a BWT + context-mixing range coder (bzip3-class), wired into bwt9 as mode 2, so every BWT call-site (general blocks, DBF, CSV, BWT_TEXT) trials it and keeps it when smaller. This carries most of mzip's edge on text / code / logs.
  • Ensemble backstop — each block also trials xz (liblzma -9e) and brotli-11 and keeps the smallest, so mzip never loses where those tools would win (this flipped SQL / YAML / config). bwt9 is itself a universal backstop that catches BWT-friendly data the type detectors miss (this flipped real float / sensor arrays that previously lost to bzip2).
  • Encoder-firing audit (diagnose_encoders.py) — per-block telemetry (MZIP_STATS=1) cross-referenced with each type's expected encoder and win-gap, to systematically find where a specialized encoder doesn't fire. This is how the numeric losses and an under-representative TypeScript sample were found and fixed.
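The trial-and-keep rule shared by these layers can be sketched generically: run every candidate, discard any whose output does not decode back to the input, and keep the smallest survivor. This is an illustration of the principle only; `Candidate`, `Choice`, and `pick_smallest_verified` are hypothetical names and do not mirror mzip's internal types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// One candidate strategy: an encoder paired with its decoder.
struct Candidate {
    std::string name;
    std::function<Bytes(const Bytes&)> encode;
    std::function<Bytes(const Bytes&)> decode;
};

struct Choice {
    std::size_t index;  // position of the winning candidate
    Bytes payload;      // its encoded output
};

// Keeps the smallest output that round-trips. Adding a candidate can never
// make the result larger, because a worse candidate is simply not kept.
std::optional<Choice> pick_smallest_verified(const Bytes& block,
                                             const std::vector<Candidate>& candidates) {
    std::optional<Choice> best;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        Bytes enc = candidates[i].encode(block);
        if (best && enc.size() >= best->payload.size()) continue;  // not smaller
        if (candidates[i].decode(enc) != block) continue;          // fails round trip
        best = Choice{i, std::move(enc)};
    }
    return best;
}
```

Placing a lossless fallback (for example a raw copy) first in the list guarantees the function always returns a valid choice.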

Fairness notes. real_bench/ is held-out; the dictionaries are trained on a separate train_corpus/ (mzip is never benchmarked on its own training data); samples/ are synthetic (labeled). mzip is a trial-everything ensemble that trades speed for ratio: compression runs at ~0.2 MB/s. The brotli backstop is gated to ≤1 MB blocks, which is worth ~2× in speed on large data; MZIP_MAXRATIO opts out of that gate. zstd and brotli are 10–100× or more faster; if read latency matters, use them.

The synthetic suite and 47-file GitHub corpus below predate this harness and remain for reference — these held-out numbers are the headline.


Benchmark Results

Secondary benchmark — a formula-friendly synthetic suite. The primary held-out benchmark above is the fair real-world measure; these synthetic numbers skew high because the suite includes formula-compressible content (sequential IDs, timestamps, gradients) that real data rarely contains.

All synthetic benchmarks below are generated deterministically by generators.hpp (seeded RNG, seed=42). Click sample links to download the exact input bytes the table reports on.

Roundtrip-verified. mzip's encoder verifies its own output decodes back to the input before returning. Strategies that produce non-roundtrippable bytes are auto-discarded in favor of the next-best valid candidate. All 250 synthetic + 47 real-world results below pass verification.

Overall Compressor Scoreboard (250 tests: 50 types × 5 sizes)

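| Compressor | Avg Ratio | Range | Wins | Win% | Rank |
|---|---|---|---|---|---|
| mzip | 8.26x | 1.0–32768x | 235 | 94.0% | 1 |
| brotli:11 | 5.78x | 1.0–1716x | 13 | 5.2% | 2 |
| bzip2:9 | 5.66x | 1.0–1001x | 5 | 2.0% | 3 |
| rar:m5 | 5.97x | 1.0–1014x | 0 | 0.0% | 4 |
| xz:9 | 5.89x | 1.0–997x | 0 | 0.0% | 5 |
| 7z:mx9 | 5.88x | 1.0–922x | 0 | 0.0% | 6 |
| zstd:19 | 5.14x | 1.0–2641x | 0 | 0.0% | 7 |
| gzip:9 | 4.78x | 1.0–240x | 0 | 0.0% | 8 |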

Verified: 250/250 roundtrips pass. Total input: 66.60 MB. lz4/snappy excluded (speed-focused, optimize for a different point on the curve).

Decompression Speed (the trade-off)

mzip optimizes for the smallest output and accepts a slower decode in return. If your workload is read-heavy and latency-bound, this is the wrong tool — use zstd. If you write once and store / transfer often, the savings show up on every later read.

| Compressor | Decode time (66.6 MB) | Decode speed |
|---|---|---|
| zstd:19 | 92 ms | 722 MB/s |
| mzip | 3,523 ms | 19 MB/s |

zstd is roughly 38× faster to decompress. Most of the gap is BWT-inverse vs LZ77-inverse — a structural difference, not a tuning gap. Compression throughput depends on the strategy chosen; the production trial-everything ensemble (CM + BWT + dictionaries + xz + brotli per block) runs at ~0.2 MB/s — ratio over speed (see the primary benchmark fairness note). Streaming / incremental decode is on the roadmap but not in main yet.
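To illustrate why inverting a BWT is structurally slower than replaying LZ77 copies, here is a textbook sketch of the transform and its inverse. It is not mzip's pipeline (which uses libsais for the forward direction): the forward function below sorts rotations naively, and `Bwt`, `bwt_forward`, and `bwt_inverse` are illustrative names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

struct Bwt {
    std::string last;     // last column of the sorted rotation matrix
    std::size_t primary;  // row that holds the original string
};

// Naive forward BWT: sorts all rotations. Quadratic in the worst case;
// real encoders build a suffix array instead.
Bwt bwt_forward(const std::string& s) {
    const std::size_t n = s.size();
    std::vector<std::size_t> rot(n);
    std::iota(rot.begin(), rot.end(), std::size_t{0});
    std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) {
            const unsigned char ca = s[(a + k) % n];
            const unsigned char cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    Bwt out{std::string(n, '\0'), 0};
    for (std::size_t i = 0; i < n; ++i) {
        out.last[i] = s[(rot[i] + n - 1) % n];
        if (rot[i] == 0) out.primary = i;
    }
    return out;
}

// Inverse BWT via the last-to-first (LF) mapping.
std::string bwt_inverse(const Bwt& t) {
    const std::size_t n = t.last.size();
    std::array<std::size_t, 256> start{};  // first row beginning with each byte
    for (unsigned char c : t.last) ++start[c];
    std::size_t sum = 0;
    for (auto& s : start) {
        const std::size_t count = s;
        s = sum;
        sum += count;
    }
    std::vector<std::size_t> lf(n);
    for (std::size_t i = 0; i < n; ++i) {
        lf[i] = start[static_cast<unsigned char>(t.last[i])]++;
    }
    std::string out(n, '\0');
    std::size_t row = t.primary;
    for (std::size_t k = n; k-- > 0;) {
        out[k] = t.last[row];
        row = lf[row];  // data-dependent jump anywhere in the block
    }
    return out;
}
```

The final loop is the relevant part: every output byte requires a jump to an unpredictable position in the block, whereas an LZ77 decoder mostly copies contiguous runs. That access pattern is a commonly cited reason BWT decoders are slower, independent of tuning.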

Win Rate by Size

| Size | Wins | Total | Win% |
|---|---|---|---|
| 4KB | 43 | 50 | 86.0% |
| 16KB | 48 | 50 | 96.0% |
| 64KB | 47 | 50 | 94.0% |
| 256KB | 48 | 50 | 96.0% |
| 1MB | 49 | 50 | 98.0% |

Top 10 mzip Wins

| Type | mzip | 2nd Best | Advantage |
|---|---|---|---|
| Database IDs (1MB) | 32B (32768x) | 3.4KB | 106.8x better |
| Timestamps (1MB) | 32B (32768x) | 2.7KB | 84.2x better |
| Database IDs (256KB) | 32B (8192x) | 937B | 29.3x better |
| Timestamps (256KB) | 32B (8192x) | 772B | 24.1x better |
| Database IDs (64KB) | 32B (2048x) | 301B | 9.4x better |
| Timestamps (64KB) | 32B (2048x) | 287B | 9.0x better |
| Image gradient (256KB) | 53B (4946x) | 323B | 6.1x better |
| Image gradient (64KB) | 39B (1680x) | 212B | 5.4x better |
| Timestamps (16KB) | 32B (512x) | 160B | 5.0x better |
| JSON API (1MB) | 10KB (104x) | 49KB | 4.9x better |

Where Others Win (Top 10 Gaps)

Most of the remaining gaps are on small (4–16 KB) text/code/config, where brotli's pre-built 120 KB English dictionary gives an edge no per-file approach can fully match without shipping its own dictionary. The exceptions in the table are Terraform, which trails brotli from 64 KB through 1 MB, and INI config at 64 KB, which trails bzip2.

| Type | Size | mzip | Best | Gap |
|---|---|---|---|---|
| Terraform | 1MB | 41KB | brotli: 40KB | +825B |
| Terraform | 64KB | 3.4KB | brotli: 3.0KB | +379B |
| HTML | 4KB | 1004B | brotli: 821B | +183B |
| Terraform | 256KB | 10KB | brotli: 10KB | +161B |
| JSON log | 16KB | 2.4KB | brotli: 2.3KB | +148B |
| Makefile | 4KB | 809B | brotli: 697B | +112B |
| JSON log | 4KB | 866B | brotli: 768B | +98B |
| INI config | 64KB | 10KB | bzip2: 10KB | +87B |
| Unicode text | 4KB | 626B | brotli: 549B | +77B |
| Bash | 4KB | 825B | brotli: 765B | +60B |

Long-Form Prose (enwik9)

enwik9 is the canonical benchmark for text compressors — the first 10⁹ bytes of an English Wikipedia dump. The numbers below are mzip vs every standard library compressor on the first 1 MB and 10 MB prefixes.

| Compressor | enwik9 1 MB (bytes) | enwik9 10 MB (bytes) | Class |
|---|---|---|---|
| zstd:19 | 312,639 | 2,921,957 | LZ77 + entropy |
| gzip:9 | 356,643 | 3,720,323 | LZ77 + Huffman |
| bzip2:9 | 294,484 | 3,054,639 | BWT + multi-tree Huffman |
| xz:9 | 302,832 | 2,844,360 | LZMA |
| 7z:mx9 | 302,910 | 2,844,433 | LZMA2 |
| brotli:11 | 293,057 | 2,827,632 | LZ77 + Huffman + 120 KB static dict |
| mzip | 286,307 | 2,671,197 | BWT + capfold + word dict + LZP-after-dict, per-block dynamic trial |
| bzip3 | ~245,000 | ~2,300,000 | BWT + LZP + arithmetic + 1-symbol context model (CLI, not a library) |
| bsc-m03 | | | LZP + BWT + M03 context coder (CLI, research-grade, not a library) |

mzip produces the smallest output on both prefixes among standard library compressors — beats brotli:11 by 5.9%, xz/7z by 6.5%, bzip2:9 by 14.4% on 10 MB; beats brotli:11 by 2.4%, xz/7z by 5.8%, bzip2:9 by 2.9% on 1 MB.
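The percentages in this section can be reproduced from the byte counts in the table when the gap is divided by mzip's output size (not by the other compressor's). A small sketch of that calculation; `gap_tenths_of_percent` is a hypothetical helper that returns the result in tenths of a percent to avoid floating-point comparisons:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// How much larger another compressor's output is than mzip's,
// in tenths of a percent of mzip's size (59 means 5.9%).
long gap_tenths_of_percent(std::uint64_t other_bytes, std::uint64_t mzip_bytes) {
    assert(mzip_bytes > 0);
    const double diff =
        static_cast<double>(other_bytes) - static_cast<double>(mzip_bytes);
    return std::lround(1000.0 * diff / static_cast<double>(mzip_bytes));
}
```

For the 10 MB prefix, `gap_tenths_of_percent(2827632, 2671197)` gives the brotli:11 figure and `gap_tenths_of_percent(3054639, 2671197)` gives the bzip2:9 figure.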

The honest ceiling. bzip3 and bsc-m03 are CLI archive tools (not embeddable libraries) that replace the entropy backend with adaptive arithmetic coding driven by a post-BWT context model. mzip now closes much of this gap inside the library: cm_backend.hpp adds exactly that — a BWT + context-mixing range coder (bzip3-class), wired as bwt9 mode 2 (see Primary Benchmark above), trial-and-keep so it never regresses. bsc-m03 / ZPAQ / cmix still lead at the cost of much slower compression and no library API.


Per-Category Tables

Synthetic data, generated at each size by generators.hpp so results are reproducible.

NUMERIC

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Timestamps | 32B vs 287B | 32B vs 772B | 32B vs 2.6KB | 64k 256k 1m |
| Database IDs | 32B vs 301B | 32B vs 937B | 32B vs 3.3KB | 64k 256k 1m |
| Integer array | 3.3KB vs 4.4KB | 12KB vs 17KB | 51KB vs 67KB | 64k 256k 1m |
| GPS coordinates | 9.7KB vs 11KB | 38KB vs 44KB | 154KB vs 179KB | 64k 256k 1m |
| Float temperature | 11KB vs 22KB | 40KB vs 87KB | 151KB vs 331KB | 64k 256k 1m |
| Sensor 16-bit | 26KB vs 27KB | 107KB vs 111KB | 430KB vs 445KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

STRUCTURED

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| GraphQL queries | 2.8KB vs 2.8KB | 7.5KB vs 7.8KB | 25KB vs 28KB | 64k 256k 1m |
| SQL dump | 4.6KB vs 4.7KB | 14KB vs 15KB | 53KB vs 56KB | 64k 256k 1m |
| JSON API | 1016B vs 3.7KB | 2.8KB vs 12KB | 9.8KB vs 48KB | 64k 256k 1m |
| XML document | 1020B vs 2.2KB | 2.9KB vs 8.0KB | 10KB vs 29KB | 64k 256k 1m |
| CSV data | 8.1KB vs 9.8KB | 27KB vs 33KB | 100KB vs 122KB | 64k 256k 1m |
| Base64 data | 47KB vs 48KB | 189KB vs 192KB | 758KB vs 771KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

CODE

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| JavaScript | 4.0KB vs 5.2KB | 10KB vs 12KB | 37KB vs 43KB | 64k 256k 1m |
| Python | 5.2KB vs 5.3KB | 12KB vs 12KB | 33KB vs 40KB | 64k 256k 1m |
| TypeScript | 4.2KB vs 4.4KB | 12KB vs 12KB | 40KB vs 45KB | 64k 256k 1m |
| HTML | 5.4KB vs 5.7KB | 16KB vs 18KB | 61KB vs 68KB | 64k 256k 1m |
| CSS | 4.1KB vs 4.1KB | 11KB vs 11KB | 40KB vs 43KB | 64k 256k 1m |
| Go | 3.3KB vs 3.4KB | 8.1KB vs 8.5KB | 24KB vs 28KB | 64k 256k 1m |
| Rust | 3.5KB vs 3.5KB | 8.9KB vs 9.1KB | 28KB vs 31KB | 64k 256k 1m |
| Java | 3.7KB vs 3.9KB | 10.0KB vs 10KB | 29KB vs 36KB | 64k 256k 1m |
| C | 4.9KB vs 5.2KB | 14KB vs 15KB | 47KB vs 54KB | 64k 256k 1m |
| Bash | 3.5KB vs 3.7KB | 9.9KB vs 10KB | 34KB vs 37KB | 64k 256k 1m |
| PHP | 3.2KB vs 3.3KB | 7.8KB vs 8.6KB | 23KB vs 27KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

CONFIG

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Docker Compose | 1.7KB vs 2.1KB | 5.4KB vs 5.6KB | 17KB vs 19KB | 64k 256k 1m |
| Terraform | 2.8KB vs 3.0KB | 10KB vs 10KB | 41KB vs 40KB | 64k 256k 1m |
| K8s manifests | 2.8KB vs 3.3KB | 7.4KB vs 7.6KB | 21KB vs 24KB | 64k 256k 1m |
| YAML config | 3.8KB vs 3.8KB | 11KB vs 11KB | 38KB vs 41KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

LOG

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Access log | 6.3KB vs 6.8KB | 21KB vs 24KB | 85KB vs 94KB | 64k 256k 1m |
| Nginx access log | 6.5KB vs 6.8KB | 21KB vs 22KB | 82KB vs 87KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

BINARY

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Image gradient | 39B vs 212B | 53B vs 323B | 124B vs 397B | 64k 256k 1m |
| Audio PCM | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 64k 256k 1m |
| Sparse bitmap | 689B vs 880B | 2.6KB vs 3.0KB | 10KB vs 11KB | 64k 256k 1m |
| Protobuf-like | 40KB vs 41KB | 160KB vs 163KB | 640KB vs 650KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

ADDITIONAL

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Natural text | 6.7KB vs 7.2KB | 26KB vs 28KB | 104KB vs 111KB | 64k 256k 1m |
| Markdown docs | 3.6KB vs 3.8KB | 10KB vs 11KB | 36KB vs 41KB | 64k 256k 1m |
| Email headers | 6.6KB vs 6.8KB | 20KB vs 21KB | 75KB vs 79KB | 64k 256k 1m |
| Unicode text | 2.4KB vs 2.4KB | 7.5KB vs 7.5KB | 27KB vs 28KB | 64k 256k 1m |
| Syslog | 8.8KB vs 9.4KB | 32KB vs 34KB | 126KB vs 133KB | 64k 256k 1m |
| Metrics | 7.8KB vs 7.8KB | 29KB vs 29KB | 117KB vs 117KB | 64k 256k 1m |
| JSON log | 7.2KB vs 8.0KB | 29KB vs 30KB | 116KB vs 122KB | 64k 256k 1m |
| Timestamps (jitter) | 14KB vs 15KB | 56KB vs 61KB | 224KB vs 244KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

BUILD

| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Makefile | 3.8KB vs 4.6KB | 12KB vs 16KB | 46KB vs 61KB | 64k 256k 1m |
| package.json | 4.6KB vs 4.8KB | 14KB vs 15KB | 54KB vs 57KB | 64k 256k 1m |
| Cargo.toml | 3.3KB vs 3.5KB | 9.5KB vs 10KB | 32KB vs 38KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

Real-World File Benchmark

Legacy benchmark, superseded by the primary held-out benchmark above (which adds the CM backend, more content types, and more compressors). Kept for historical comparison.

47 files (7.2 MB total) pulled from public GitHub repos — React, Linux kernel, Django, Bootstrap, lodash, plus 20+ programming-language files. Mix of source code, configs, logs, JSON, CSV, markdown. All 47 round-trip-verified.

Reproduce the full per-file table with the real-world benchmark — see Run Benchmarks.

Result

  • mzip wins or ties on 18 / 47 files (38.3%)
  • brotli:11 wins 27, bzip2:9 wins 1, xz:9 wins 1, all others 0

The synthetic 94.0% does not survive intact on real GitHub source code — the synthetic suite includes formula-compressible content (sequential IDs, timestamps, gradients, audio, generated templates) that hand-written code rarely contains. The 12 pre-trained group dictionaries (5 synthetic-trained + 7 real-data-trained on public corpora not in this benchmark) close most of the gap on logs, CSV, JSON, SQL, XML, and structured config. Read both numbers together.

Where mzip wins (top advantages)

| File | Size | mzip ratio | 2nd best | Advantage |
|---|---|---|---|---|
| sql_schema.sql | 4.1 KB | 20.86× | brotli: 1.1 KB | +81.8% |
| java_arraylist.java | 64.6 KB | 9.00× | brotli: 11.2 KB | +36.1% |
| xml_maven.xml | 45.4 KB | 11.29× | brotli: 5.8 KB | +30.3% |
| apache_log_sample.log | 2.26 MB | 22.83× | brotli: 116 KB | +12.5% |
| go_http.go | 128 KB | 4.23× | brotli: 34.3 KB | +11.7% |
| docker-compose.yml | 3.9 KB | 4.23× | brotli: 1.0 KB | +8.2% |
| dashboard.html | 42.5 KB | 34.04× | brotli: 1.3 KB | +7.1% |
| terraform_main.tf | 6.3 KB | 3.83× | brotli: 1.8 KB | +7.1% |
| app.log | 464 KB | 7.90× | bzip2: 60.3 KB | +2.5% |
| lodash.js | 532 KB | 7.85× | bzip2: 69.2 KB | +2.1% |
| models.rs | 16.8 KB | 21.24× | brotli: 826 B | +1.9% |
| nginx_access.log | 417 KB | 12.11× | bzip2: 34.9 KB | +1.3% |
| styles.css | 19.6 KB | 9.28× | bzip2: 2.1 KB | +1.3% |
| handlers.go | 14 KB | 17.78× | brotli: 814 B | +1.2% |
| events.csv | 578 KB | 7.13× | bzip2: 82.0 KB | +1.1% |
| metrics.prom | 176 KB | 10.12× | bzip2: 17.5 KB | +1.1% |
| users.json | 170 KB | 10.11× | bzip2: 17.0 KB | +1.0% |
| linux_kernel.c | 281 KB | 4.41× | bzip2: 64.3 KB | +1.0% |

Where brotli still wins

Brotli's 120 KB pre-built static English/web dictionary holds an edge on small handwritten markdown / source code where the input doesn't match any of mzip's 12 group dictionaries strongly enough.

| File | Size | mzip gap |
|---|---|---|
| contributing.md | 6.6 KB | +11.8% |
| api_docs.md | 17 KB | +10.7% |
| Dockerfile | 4.1 KB | +8.7% |
| json_github_api.json | 6.6 KB | +8.5% |
| vscode_main.ts | 19.8 KB | +7.1% |

…and 22 more, mostly handwritten source code 4–50 KB, typical gap +3% to +7%.

Per-category breakdown

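| Category | mzip wins | Total | Win% |
|---|---|---|---|
| Logs | 3 | 3 | 100% |
| CSV / columnar | 1 | 1 | 100% |
| Metrics | 1 | 1 | 100% |
| XML | 1 | 1 | 100% |
| Web (HTML/CSS) | 2 | 3 | 67% |
| Config files | 2 | 4 | 50% |
| JSON | 1 | 2 | 50% |
| SQL | 1 | 2 | 50% |
| Source code (general) | 6 | 25 | 24% |
| Markdown | 0 | 3 | 0% |
| Other | 0 | 2 | 0% |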

Compression Strategies

The encoder runs detection on every block, picks a candidate strategy from one of five families below, then trials multiple variants per block and keeps the smallest output that round-trip-verifies.

| Family | Picked when | Headline strategies |
|---|---|---|
| Formula / numeric | Bytes look like a generator output: linear, periodic, geometric, modular, smooth | LINEAR_GEN, NUMERIC (delta / strided / ALP), PERIODIC, MODULAR, LINEAR_PRED |
| Templates / structured text | Repeating lines or blocks with a few varying tokens (logs, generated docs, K8s, SQL INSERTs) | TEMPLATE, SECTION_TEMPLATE, ML_TEMPLATE, WORD_TEMPLATE / MULTI_WORD_TEMPLATE, LINE_GROUP_TEMPLATE |
| Stream separation | Fixed columns, tag/content split, key/value records | COLUMNAR / BLOCK_COLUMNAR, CSV_COLUMNAR, JSON_COLUMNAR, HTML_STREAM, URL_STREAM, DBF_CONSTCOL |
| Binary / executable | x86 code, raw RGB, sparse bitmaps, base64 text | E8E9_X86 + LZMA_OPTIMAL, PAETH_RGB, SPARSE, BASE64_DECODE |
| Text backends | None of the above fits: prose, code, config, mixed | BWT_TEXT, BG (single-block BWT for ≥1 MB), MC (per-chunk pick), ZSTD_DICT (4–16 KB code/config), WORD_ENCODED, KV_CONFIG, RAW (fallback to zstd:19) |
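As background for the PAETH_RGB entry, the sketch below shows the Paeth predictor as defined by the PNG specification, applied to a single 8-bit plane. How mzip applies it to RGB data is not described here, so the plane layout and the function names (`paeth`, `predict`, `paeth_filter`, `paeth_unfilter`) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Plane = std::vector<std::uint8_t>;  // row-major, one byte per pixel

// Paeth predictor from the PNG spec: a = left, b = above, c = upper-left.
std::uint8_t paeth(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
    const int p = int(a) + int(b) - int(c);
    const int pa = std::abs(p - a);
    const int pb = std::abs(p - b);
    const int pc = std::abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Prediction for pixel (x, y); neighbours outside the image count as 0.
std::uint8_t predict(const Plane& img, std::size_t w, std::size_t x, std::size_t y) {
    const std::uint8_t a = x > 0 ? img[y * w + x - 1] : 0;
    const std::uint8_t b = y > 0 ? img[(y - 1) * w + x] : 0;
    const std::uint8_t c = (x > 0 && y > 0) ? img[(y - 1) * w + x - 1] : 0;
    return paeth(a, b, c);
}

// Replaces each pixel with (pixel - prediction) modulo 256.
Plane paeth_filter(const Plane& img, std::size_t w, std::size_t h) {
    assert(img.size() == w * h);
    Plane res(img.size());
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            res[y * w + x] =
                static_cast<std::uint8_t>(img[y * w + x] - predict(img, w, x, y));
    return res;
}

// Inverse: predictions use pixels that were already reconstructed.
Plane paeth_unfilter(const Plane& res, std::size_t w, std::size_t h) {
    assert(res.size() == w * h);
    Plane img(res.size());
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            img[y * w + x] =
                static_cast<std::uint8_t>(res[y * w + x] + predict(img, w, x, y));
    return img;
}
```

On a smooth gradient the residuals collapse to a nearly constant stream, which is what makes the data easy for the following entropy stage.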

The BWT text backend (BWT_TEXT / BG) is itself a per-block trial over: pre-RLE on/off, dict size ∈ {64, 128, 192, 255}, number of Huffman trees ∈ {3..7}, LZP-after-dict min-match ∈ {10, 20, 40}, capfold on/off. The smallest valid combination wins.
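The size of that search space is easy to see by enumerating it. The sketch below builds the full Cartesian product of the listed parameter values; the struct and field names are hypothetical, and the real encoder may prune or reorder trials rather than visit every combination.

```cpp
#include <cassert>
#include <vector>

// Hypothetical field names; only the value sets come from the text above.
struct BwtTrialConfig {
    bool pre_rle;
    int dict_size;
    int huffman_trees;
    int lzp_min_match;
    bool capfold;
};

// Enumerates every combination: 2 * 4 * 5 * 3 * 2 configurations.
std::vector<BwtTrialConfig> enumerate_bwt_trials() {
    std::vector<BwtTrialConfig> trials;
    for (bool pre_rle : {false, true})
        for (int dict_size : {64, 128, 192, 255})
            for (int trees = 3; trees <= 7; ++trees)
                for (int lzp : {10, 20, 40})
                    for (bool capfold : {false, true})
                        trials.push_back({pre_rle, dict_size, trees, lzp, capfold});
    return trials;
}
```

Trialing up to 240 variants per block is consistent with the low compression throughput reported in the benchmark notes.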

For the full enum of named strategies (~30) with selection rules, grep BlockType:: and case BlockType:: in mzip.hpp — each block-type carries a one-line // what it does comment at its definition.


Quick Start

Requirements: a C++17 compiler and the zstd library headers + shared object. Install zstd if you don't already have it:

# macOS
brew install zstd

# Debian / Ubuntu
sudo apt install libzstd-dev

# Fedora
sudo dnf install libzstd-devel

# Windows / MSYS2
pacman -S mingw-w64-x86_64-zstd

Option 1: Single-header (recommended)

The amalgamated header bundles mzip + the BWT pipeline + libsais. You only need to add zstd.

// In ONE translation unit:
#include <zstd.h>
#define MZIP_IMPLEMENTATION
#include "mzip_amalgamated.hpp"

// In every other translation unit that uses mzip:
#include "mzip_amalgamated.hpp"

// Usage:
auto compressed   = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());

Option 2: Separate headers

#include <zstd.h>      // include zstd first
#include "mzip.hpp"

auto compressed   = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());

Build

# Single-header build (libsais bundled inside)
g++ -std=c++17 -O3 -march=native -o mzip_cli mzip_cli.cpp -lzstd

# Separate-headers build (needs libsais.c too)
g++ -std=c++17 -O3 -march=native -o mzip_cli mzip_cli.cpp libsais.c -lzstd

If your zstd headers are not on the default search path, add -I/path/to/zstd/include -L/path/to/zstd/lib.

CLI Usage

# Compress
./mzip_cli compress input.bin output.mzip

# Decompress
./mzip_cli decompress output.mzip restored.bin

Run Benchmarks

# Build (assumes zstd is installed; otherwise add -I/-L flags as in Quick Start above)
g++ -std=c++17 -O3 -march=native -o mzip_bench mzip_bench.cpp libsais.c -lzstd

# Synthetic suite — 50 types × 5 sizes = 250 tests, ~10–15 min
./mzip_bench --csv full_bench.csv

# Quick (64 KB only)
./mzip_bench --quick

# Single type, all sizes
./mzip_bench --type graphql

# Single real file
./mzip_bench --file path/to/file.bin

# All 47 real-world files
./mzip_bench --file real_bench/*

# Regenerate the README tables from a fresh CSV
python generate_readme_tables.py full_bench.csv

Held-out type benchmark + encoder audit (CM backend)

# Build all eval binaries (mzip+CM, baseline, zstd sizer, probes) + fetch/derive the extra real corpora
bash build_evals.sh

# Type-stratified benchmark vs gzip/bzip2/zstd-19/zstd-22/xz-9e/brotli-11, held-out real files, roundtrip-verified
python3 benchmark_types.py            # -> bench_types_report.md  (38 types / 76 files)

# Systematic "which encoder fired / what didn't fire" audit (LOSS / MISSED-SPECIAL / BACKSTOP-RELIANT)
python3 diagnose_encoders.py          # -> encoder_audit.md

# Per-block encoder telemetry on any file
MZIP_STATS=1 ./mzip_cm.exe c file out

Files

| File | Description |
|---|---|
| mzip.hpp | Main library; include this |
| mzip_amalgamated.hpp | Single-header build (mzip + BWT + libsais bundled) |
| bwt_compress_v5.hpp / v8.hpp / v9.hpp | BWT pipelines (v5 = current prose backend) |
| word_dict.hpp | Per-file word-dictionary preprocessor (BWT_TEXT helper) |
| cap_fold.hpp | Capital-letter folding (BWT_TEXT helper) |
| bigram_dict.hpp, xml_entity.hpp | Pre-BWT preprocessing candidates (auto-deselected when they don't help) |
| range_coder.hpp | LZMA-style binary range coder (variant backend) |
| mzip_dicts.h | 12 pre-trained zstd group dictionaries (5 synthetic + 7 real-data: MD/YAML/HCL/SQL/XML/CODE/JSON). Embedded ~2 MB. |
| train_corpus/ | Real-world corpus used to train dicts 6–12 (held out from real_bench/); regenerable via train_corpus/fetch.sh. |
| lzma_optimal2.hpp, lzma_decoder.hpp | LZMA optimal encoder + decoder (LZMA_OPTIMAL strategy) |
| mzip_base64.hpp | Base64 detect / decode helper (BASE64_DECODE strategy) |
| generators.hpp | Single source of truth for benchmark / test data |
| libsais.h | BWT suffix array (Apache 2.0) |
| stb_image.h, stb_image_write.h | Image IO (Public Domain), for image strategies |
| mzip_bench.cpp | Benchmark tool; --csv exports results |
| mzip_cli.cpp | Command-line interface |
| mzip_test.cpp | Quick debug / single-type test |
| mzip_unit_tests.cpp | Unit tests for core strategies |
| cm_backend.hpp | BWT + context-mixing (bzip3-class) entropy backend; wired as bwt9 mode 2 + ensemble candidate |
| brotli_shim.hpp / liblzma_shim.hpp | Minimal decls to link brotli / liblzma as ensemble backstop candidates |
| build_evals.sh | Builds all eval binaries (mzip+CM, baseline, zstd sizer, probes) + fetches/derives the held-out corpora |
| benchmark_types.py | Held-out type-stratified benchmark (mzip+CM vs gzip/bzip2/zstd/xz/brotli) → bench_types_report.md |
| diagnose_encoders.py | Encoder-firing audit (MZIP_STATS telemetry → LOSS / MISSED-SPECIAL / BACKSTOP) → encoder_audit.md |
| generate_readme_tables.py | Auto-generate the markdown tables above from full_bench.csv |
| summarize_real_bench.py | Auto-generate real_bench_summary.md from real_bench_results.txt |
| samples/ | Sample files at 4 / 16 / 64 / 256 KB and 1 MB |
| real_bench/ | 47 real-world files used by the real-world benchmark |
| full_bench.csv | Latest synthetic benchmark CSV (one row per (type, size)) |

License & Contact

Dual-licensed: AGPL-3.0 OR commercial.

  • AGPL-3.0 — free for open-source projects. If you deploy mzip as part of a network service (SaaS, hosted API, etc.), the AGPL requires you to make your source available.
  • Commercial — for proprietary or closed-source use, open a GitHub issue tagged commercial-license and I'll follow up. Same for bug reports, benchmarks on your own data, or proposing a new strategy.

Third-party code bundled in the repo: libsais (Apache 2.0), stb_image (Public Domain). zstd is required at link-time but not bundled (BSD).

About

Store the formula, not the data. Detection-based compression: 32 B for 1 MB of sequential IDs (32768x). Beats zstd/brotli/bzip2 on structured data.
