
Open Source / Lossless / Zero Dependencies

LogCrush

Structure-aware log compression that reaches compression ratios up to 66% higher than zstd-19 through CLP-style deterministic parsing and typed columnar encoding.

46.9x

Apache Logs

vs zstd-19: 28.9x (+62%)

25.5x

HDFS Logs

vs zstd-19: 15.3x (+66%)

22.1x

Linux Syslog

vs zstd-19: 19.5x (+13%)

0%

Override Rate

Lossless by design, not by patching

Benchmarks - Compression Ratio (higher is better)

Benchmarked on LogHub datasets. All ratios verified with byte-perfect lossless roundtrip.

The CLP Rewrite - Before vs After

Apache

30.72x -> 46.88x (+53%)

HDFS

16.16x -> 25.49x (+58%)

Linux

14.00x -> 22.09x (+58%)

How It Works

01

Deterministic Schema Parsing

CLP-style single-pass tokenization classifies every token in a line as static text, an integer variable, or a dictionary variable. No ML clustering. No statistical thresholds. Zero reconstruction failures by design.

Jun 9 06:06:20 combo kernel -> log_type + [9, 06, 06, 20, "combo", "kernel"]
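A minimal sketch of this kind of single-pass classification, under assumed rules: a token made only of digits is an integer variable, a token mixing digits with other characters is a dictionary variable, and everything else is static text. The placeholder characters, the delimiter set, and the names `parse_line` and `reconstruct` are hypothetical, not LogCrush's API. With these simplified rules `combo` and `kernel` stay static text, whereas the example above extracts them as dictionary variables, so the real classifier evidently applies rules not shown here.

```python
import re

# Hypothetical placeholder characters marking where a variable was removed.
INT_VAR = "\x11"
DICT_VAR = "\x12"

# Alternating runs of token characters and delimiter characters; together
# they cover every character of the line, so nothing is dropped.
TOKEN_RE = re.compile(r"[A-Za-z0-9_.\-]+|[^A-Za-z0-9_.\-]+")


def parse_line(line):
    """Return (log_type, variables) for one line in a single pass."""
    if INT_VAR in line or DICT_VAR in line:
        raise ValueError("line contains a reserved placeholder character")
    log_type = []
    variables = []
    for token in TOKEN_RE.findall(line):
        if re.fullmatch(r"[0-9]+", token):
            log_type.append(INT_VAR)      # all digits: integer variable
            variables.append(token)       # kept as text so "06" keeps its zero
        elif re.search(r"[0-9]", token):
            log_type.append(DICT_VAR)     # digits mixed with text: dictionary variable
            variables.append(token)
        else:
            log_type.append(token)        # static text and delimiters
    return "".join(log_type), variables


def reconstruct(log_type, variables):
    """Invert parse_line by substituting variables back in order."""
    values = iter(variables)
    return "".join(
        next(values) if ch in (INT_VAR, DICT_VAR) else ch for ch in log_type
    )


line = "Jun 9 06:06:20 combo kernel"
log_type, variables = parse_line(line)
print(variables)
print(reconstruct(log_type, variables) == line)
```

Because every character lands either in the log type or in exactly one variable, reconstruction is a plain substitution and cannot fail on input that parsed successfully.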

02

Columnar Decomposition

Variables are transposed into homogeneous typed columns grouped by schema. All integers from the same template position compress together. Dictionary variables deduplicate into a shared global dictionary.

2,716 Drain3 templates -> 483 deterministic log types on Linux
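The transposition can be sketched as follows, assuming each parsed line arrives as a `(log_type, variables)` pair in which integers are Python `int` values and dictionary variables are strings. The function names, the column layout, and the log type labels `"A"` and `"B"` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def to_columns(records):
    """Transpose per-line variables into per-(log_type, position) columns."""
    dictionary = {}   # string -> id, shared across all log types
    columns = {}      # (log_type, position) -> (kind, list of values)
    order = []        # log type of each line, needed to restore line order
    for log_type, variables in records:
        order.append(log_type)
        for pos, value in enumerate(variables):
            # Assumption: a given position of a log type always has one kind.
            kind = "dict" if isinstance(value, str) else "int"
            _, values = columns.setdefault((log_type, pos), (kind, []))
            if kind == "dict":
                value = dictionary.setdefault(value, len(dictionary))
            values.append(value)
    return order, columns, list(dictionary)


def from_columns(order, columns, dictionary):
    """Rebuild the original records from columns and the shared dictionary."""
    cursor = defaultdict(int)   # next unread row for each log type
    records = []
    for log_type in order:
        row = cursor[log_type]
        cursor[log_type] += 1
        variables = []
        pos = 0
        while (log_type, pos) in columns:
            kind, values = columns[(log_type, pos)]
            value = values[row]
            variables.append(dictionary[value] if kind == "dict" else value)
            pos += 1
        records.append((log_type, variables))
    return records


records = [
    ("A", [9, "combo"]),
    ("B", [5]),
    ("A", [10, "combo"]),
    ("A", [11, "host2"]),
]
order, columns, dictionary = to_columns(records)
print(columns[("A", 0)])   # all integers from position 0 of log type "A"
print(dictionary)
```

Grouping by log type and position is what puts similar values next to each other: the integer column for `("A", 0)` holds only values from that one slot, and the repeated string `combo` is stored once.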

03

Typed Column Encoding

Each column gets a type-specific codec: delta + zigzag varint for integers, digit-preserving encoding for floats, dictionary indices for string variables, and frame-of-reference bitpacking for dense integer sequences.

No override system - every value encodes exactly, or escapes to dictionary
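As an illustration of the integer codec named above, here is a minimal sketch of delta + zigzag varint encoding for one integer column. It uses the standard LEB128-style varint layout; LogCrush's actual byte format is not described on this page, so the function names and the framing are assumptions. Values whose text form would not survive an integer roundtrip (zero-padded fields, for example) are assumed to be routed elsewhere, in line with the escape-to-dictionary rule.

```python
def zigzag(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) ^ -(z & 1)


def encode_column(values):
    """Delta-encode, zigzag, then emit each delta as a little-endian varint."""
    out = bytearray()
    prev = 0
    for value in values:
        z = zigzag(value - prev)
        prev = value
        while z >= 0x80:
            out.append((z & 0x7F) | 0x80)   # low 7 bits, continuation bit set
            z >>= 7
        out.append(z)                       # final byte, continuation bit clear
    return bytes(out)


def decode_column(data):
    """Read varints, undo zigzag, and accumulate the deltas."""
    values = []
    prev = 0
    z = shift = 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(z)
            values.append(prev)
            z = shift = 0
    return values


column = [1000, 1001, 1003, 1002, 1010]
encoded = encode_column(column)
print(len(encoded), decode_column(encoded) == column)
```

Slowly changing columns such as timestamps or process IDs turn into runs of small deltas, each of which fits in a single byte before the final compression pass sees it.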

04

zstd Final Pass

All encoded columns are concatenated and compressed with zstd. The pre-processing creates highly regular byte streams that compress dramatically better than raw log text under LZ-family algorithms.

Structure-aware pre-processing + general-purpose compression = best of both worlds
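The final pass can be sketched as length-prefixed concatenation followed by a general-purpose compressor. This sketch substitutes `zlib` from the Python standard library for zstd so that it runs without third-party packages; the framing (a 4-byte column count, then a 4-byte length before each column) and the function names are assumptions, not LogCrush's container format.

```python
import struct
import zlib


def pack_columns(columns):
    """Concatenate encoded columns with length prefixes, then compress."""
    framed = bytearray(struct.pack("<I", len(columns)))
    for column in columns:
        framed += struct.pack("<I", len(column))
        framed += column
    # zlib is a stand-in here; the page describes zstd for this step.
    return zlib.compress(bytes(framed), 9)


def unpack_columns(blob):
    """Decompress and split the byte stream back into columns."""
    framed = zlib.decompress(blob)
    (count,) = struct.unpack_from("<I", framed, 0)
    offset = 4
    columns = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", framed, offset)
        offset += 4
        columns.append(framed[offset:offset + size])
        offset += size
    return columns


columns = [b"\x01" * 1000, b"", b"combo\x00host2"]
blob = pack_columns(columns)
print(len(blob), unpack_columns(blob) == columns)
```

The length prefixes are what let the decoder split the stream back into columns, so the compressor itself needs no knowledge of the column layout.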

Full Benchmark Results

Dataset      Lines  LogCrush  zstd-3  zstd-9  zstd-19  vs zstd-19
Apache       56K    46.88x    19.03x  24.65x  28.90x   +62%
Thunderbird  3.2M   57.69x    -       -       -        -
HDFS         5M     25.49x    9.98x   11.93x  15.32x   +66%
Linux        25K    22.09x    12.27x  16.68x  19.51x   +13%

All results verified=True (byte-perfect lossless roundtrip). Datasets from LogHub. Thunderbird zstd baselines pending. Benchmarked on Debian Trixie amd64.
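The "vs zstd-19" figures are consistent with a relative gain in compression ratio, which can be checked with a few lines. The formula is an assumption inferred from the numbers in the table, and `gain_percent` is a hypothetical helper name.

```python
def gain_percent(logcrush_ratio, baseline_ratio):
    """Relative improvement in compression ratio, in percent."""
    return (logcrush_ratio / baseline_ratio - 1) * 100


# (LogCrush ratio, zstd-19 ratio) taken from the table above.
results = {
    "Apache": (46.88, 28.90),
    "HDFS": (25.49, 15.32),
    "Linux": (22.09, 19.51),
}

gains = {name: round(gain_percent(ours, zstd19)) for name, (ours, zstd19) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```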