Data Compression Programs

by Matt Mahoney

As of July 23, 2009, this page is no longer maintained. The newest version can be found at http://mattmahoney.net/dc/

All software on this page is open source licensed under GPL and believed to be unencumbered by patents. All downloads include Windows executables and C++ source code for Windows or Linux/UNIX. The source code comments explain how the programs work. The PAQ8, LPAQ, and FPAQ projects have many contributors. Programs are last modified by Matt Mahoney unless otherwise specified. Last software update: June 14, 2009 (zpaq 1.02).

Contents

ZPAQ

ZPAQ is a proposed standard format for highly compressed data that allows new compression algorithms to be developed without breaking compatibility with older programs.

ZPAQ solves the compatibility problem by encoding a description of the decompression algorithm in the compressed data. ZPAQ is based on PAQ-like context mixing algorithms which are top ranked on many benchmarks. The format supports archivers, single file compressors, and memory to memory compression.

The current standard is Level 1 (ZPAQ-1). It is anticipated that future levels (ZPAQ-2, ZPAQ-3, etc.) will be backward compatible, such that newer levels can read archives produced by older programs.

ZPAQ is not patented. All versions of the reference decoder and zpaq compressor are licensed under GPL.

History

ZPAQ development began on Feb. 15, 2009 with a series of mutually incompatible experimental programs (v0.01 through v0.09, now obsolete). Archives created with these versions could only be read by the same version. The level 1 standard became fixed on Mar. 9, 2009 with version 1.00. All versions from 1.00 onward are level 1 compliant and can extract each other's archives.

zpaq v0.01 Open source (C++) and Win32 executables, Feb. 15, 2009.
zpaq v0.02 adds E8E9 transform. Fully supports post-processing. Not compatible with v0.01. Feb. 19, 2009.
zpaq v0.03 modifies MIX, MIX2, IMIX to fix poor compression on large files. Not compatible with v0.02. Feb. 19, 2009.
zpaq v0.04 modifies train() and squash() for improved compression. Not compatible with v0.03. Feb. 21, 2009.
zpaq v0.05 modifies probability representation and mixer weights to prevent mixer overflow and to improve compression for highly redundant data. Not compatible with v0.04. Feb. 26, 2009.
zpaq v0.06 adds SHA1 checksums, replaces IMIX2 with ISSE. Not compatible with v0.05. Feb. 27, 2009.
zpaq v0.07 improves ISSE and bit-history state table. Not compatible with v0.06. Feb. 28, 2009.
zpaq v0.08 adds LZP transform and minor improvements. Not compatible with v0.07. Mar. 8, 2009.
zpaq v0.09 removes counters from ISSE and ICM to improve speed. Not compatible with v0.09. Mar. 9, 2009.
zpaq v1.00 (first level 1 compliant version) includes unzpaq1 candidate reference decoder. Simplified bit history tables. Not compatible with earlier versions. Mar. 12, 2009.
unzpaq 1.01 updates reference decoder comments and help message and fixes some VS2005 compiler issues. Compatible with 1.00. Apr. 27, 2009.
unzpaq 1.02 and zpaq 1.02 closes extracted files immediately after decompression instead of when program exits. Fixes g++ 4.4 warnings. Compatible with 1.00 and 1.01. June 14, 2009.

ZPAQ is intended to replace PAQ and its variants (PAQ8, PAQ9A, LPAQ, LPQ1, etc) with similar or better compression in a portable, standard format. Current versions of PAQ break archive compatibility with each compression improvement. ZPAQ is intended to fix that. I no longer maintain the older PAQ code.

PAQ8

PAQ8 is a series of archivers that achieve very high compression rates at the expense of speed and memory. My latest version is paq8l, Mar. 8, 2007. I am no longer maintaining this code. However, there have been many compression improvements since then written by others, as described in the history section below.

  To compress:   paq8l -5 archive [files_or_folders...]  (creates archive.paq8p)
  To decompress: paq8l -d archive.paq8l [output_folder]
The -5 option is the default. It requires 233 MB of memory for compression and same for decompression. Options may range from -1 (35 MB) to -8 (1712 MB). More memory usually means better compression. In the Windows version you can also compress or decompress by putting paq8l.exe on the desktop and dropping files or folders on it. Compressed files/folders have a .paq8l extension. The input is not deleted.

NOTE: Files can only be decompressed with the same version of PAQ that they were compressed with. However, links to all versions can be found here. Decompression requires the same memory and time as compression.

Papers

All PAQ compressors use a context mixing algorithm described in Wikipedia. A large number of models independently predict the next bit of input. The predictions are combined using a neural network and arithmetic coded. There are specialized models for text, binary data, x86 executable code, BMP, TIFF, and JPEG images (except in the paq8hp* series, which are tuned for English text only). Additional papers:

Projects

PAQ variants licensed under GPL have held the Calgary Corpus Compression Challenge title since Jan. 2004, and the Hutter Prize since its inception in Aug. 2006.

PowerPAQ is a command line and GUI front end for 32 and 64 bit Linux by Eugene Varnavsky.

PeaZip (Giorgio Tani) is a GUI front end for Windows and Linux that supports the paq8o, lpaq1, and many other compression formats.

FreeBSD port, usually a few days behind the latest version.

History

History prior to March 2007 can be found on the PAQ history page.

paq8l was released Mar. 8 2007 by Matt Mahoney.

paq8hp12any (May 20 2007, Alexander Ratushnyak, updated Jan. 9, 2009) is a specialized version which has won the Hutter Prize. It compresses English text better and faster than paq8l, but compresses binary data, images, etc. worse. It includes some dictionary files, which must be in the current directory when run. It does not compress or create directories or have a drag-and-drop interface. Older versions with source code.

  To compress file enwik8 with 1 GB memory:  paq8hp12any -7 enwik8.paq8hp12any enwik8
  To decompress:                             paq8hp12any enwik8.paq8hp12any

paq8fthis2 (Jan Ondrus, Aug 12 2007), is based on paq8f (Matt Mahoney, Feb 28 2006) with improved JPEG compression.

paq8n (Matt Mahoney, Aug. 18, 2007) was created by inserting the JPEG model from paq8fthis2 into paq8l.

paq8o (Andreas Morphis, Aug. 22 2007) is derived from paq8n modified with an improved BMP model.

                 rafale.bmp  Time  lena.bmp  Time
                 ----------  ----  --------  ----
  Original size  4,149,414         786,486
  paq8n -5         634,547    66   419,735    17
  paq8o -5         551,665    78   410,644    21
  paq8osse -5      551,665    72   410,644    20
Compression and decompression times are about the same. Times are timer 3.01 process times in seconds on a 2.2 GHz Athlon-64 3500+, 2 GB, WinXP SP2, 32 bit. Other data compresses the same in paq8n and paq8o.

paq8osse is included in the paq8o distribution. It is for processors supporting SSE2 such as the Pentium 4 and Athlon. It is compatible with paq8o2 but 8% faster.

paq8o ver. 2 by Andreas Morphis, Aug. 24, 2007, is a minor bug fix to paq8o to fix the archive file name extension.

paq8fthis3 by by Jan Ondrus (Sept. 8, 2007) improves the JPEG model of paq8fthis2.

paq8o3 (KZ, Sept. 11, 2007) combines paq8o ver. 2 with the improved JPEG model from paq8fthis3 and the grayscale image PGM model introduced in paq8i by Pavel L. Holoborodko (Aug. 18, 2006), but not included in later PAQ versions until now.

paq8o4 ver 1 (KZ, Sept. 15, 2007) extends the PGM model of paq8o3 to 8 bit grayscale and paletted color BMP files.

paq8o4 ver. 2, released Sept. 17, 2007 by Matt Mahoney, is archive compatible with paq8o4 ver 1. It fixes directory creation and traversal and wildcard arguments, but is 8% slower because it was compiled with g++ instead of Intel C++.

paq8o5 by KZ, Sept. 21, 2007, is paq8o4v2 with the improved StateMap from lpaq1.

paq8fthis4 by Jan Ondrus, Sept. 27, 2007, is paq8f with an improved JPEG model (better than paq8o4/paq8o5). The archive also includes a faster JPEG model with less compression (paq8fthis_fast).

paq8o6 (mirror) by KZ, Sept. 28, 2007, improves on paq8o5 with the new JPEG model from paq8fthis4 and some code optimizations by Enrico Zeidler.

paq8o7 by KZ, Oct. 16, 2007, detects 4, 8 and 24 bit BMP, 8 bit PGM images, and has some improvements to the JPEG model.

paq8o8 by KZ, Oct. 23, 2007, improves JPEG compression further.

paq8o9 (mirror) by KZ, Feb. 16, 2007. Fixes a bug in .bmp detection that caused an infinite loop for files with invalid headers. Added grayscale .rgb support. Note: paq8o9 bug was fixed by KZ on Mar. 4, 2008. Mirror was updated on Mar. 9, 2008. No version number change.

paq8o10t is by KZ, June 11, 2008. Compression is better than paq8o9 on some files but worse on others. Discussion.

paq8p by Andreas Morphis, released Aug. 25, 2008, has greatly improved .wav compression and slightly improved .bmp compression. Discussion. Note: the SSE2 compiled version is not always archive compatible with the MMX version (noted by Rugxulo, Mar. 9, 2009).

here. paq8p1 (Oct. 8, 2008) by Kaidorav fixes some bugs in paq8p.

paq8p update: Jan 8, 2009. Added paq8o8sse2.asm (optimized assembler by Dark Shikari) and corresponding Windows Pentium 4+ compiled executable, paq8p_sse2.exe. It is archive compatible with paq8p.exe and might or might not be faster.

paq8p2 by Jan Ondrus, Mar. 2, 2009. Has some improvements in JPEG compression.

paq8o8z executables source code by Rugxulo, Apr. 25, 2009, contains ports to several different operating systems (DOS, Windows (paq8o8zw.exe), OS/2, Linux 2.4 and 2.6, Sun, etc) and assembly languages. It contains run time CPU detection to select between SSE2, MMX or non vectorized x86 code appropriately. Will decompress paq8o8 archives. Replaces versions released Dec. 25, 2008, Feb. 28, 2009, and Apr. 1, 2009 (removed). Based on paq8o8.

As of May 30, 2009, the latest version is paq8px_v40, with several new versions per day. See encode.ru for discussion.

PAQ9A

paq9a is an experimental archiver with compression and speed similar to lpaq1 and lacking specialized models. It has a different architecture than paq8. It uses LZP preprocessing to speed compression of highly redundant files, although it is slower than lpaq1 on other files. It also uses chains of 2-input mixers rather than a single mixer combining many predictions.

The archive format and command line interface are like lpq1. The archive supports files over 2 GB.

  To compress:   paq9a a archive [-1..-9] [[-s|-c] files...]...
  To decompress: paq9a x archive [new filenames...]
  List contents: paq9a l archive
Options:
  -c compress remaining files (default)
  -s store remaining files
  -1..-9 select memory usage from 18 MB to 1585 MB, default -7 = 405 MB.
         Memory usage is about 12 + 3*2level MB.  Memory
         option must appear first, e.g.

  paq9a a files.paq9a -4 file1 -s file2 -c file3 file4
compresses file1, stores file2, and compresses file3 and file4, creating archive files.paq9a. File names are stored exactly as entered (with or without a path). Wildcards are allowed if compiled with g++. Archives are solid and cannot be updated once created. Files can be renamed when extracted:
  paq9a x files.paq9a foo bar
extracts file1 to foo, file2 to bar, then file3 and file4 without renaming. If a file already exists or if the stored filename has a path to a nonexistent directory, then extraction is skipped.

LPAQ

lpaq1 (July 24, 2007) is a "lite" version of PAQ, faster but with less compression. It is a single file compressor, not an archiver.

  To compress:   lpaq1 5 input output
  To decompress: lpaq1 d input output
5 selects 102 MB memory. Options range from 0 (9 MB) to 9 (1542 MB). Option N uses 6 + 3*2N MB.

lpaq1a by Matt Mahoney, Dec. 21, 2007. This is an experimental program (not a new version) that combines the model from lpaq1 with the asymmetric binary coder from fpaqb. It is about 1% slower than lpaq1, compresses 0.02% larger, and requires an extra 1.25 MB memory during compression. The purpose of the program is to demonstrate an alternative to arithmetic coding with performance that is nearly as good.

I am no longer maintaining this code, but there have been many compression improvements (below):

History

The first version of lpaq1 was released by Matt Mahoney on July 24, 2007.

lpaq1 ver. 2 (Sept. 19, 2007) by Alexander Ratushnyak is archive compatible with lpaq1 with speed optimizations (6% to 10% faster).

lpaq2 (Sept. 20, 2007) by Alexander Ratushnyak improves compression, but is not compatible with lpaq1.

lpaq3 (Sept. 29, 2007) by Alexander Ratushnyak includes a version tuned for large text files (elpaq3), compiled from the same source code with option -DWIKI.

lpaq3a (Sept. 30, 2007) by Alexander Ratushnyak, has slightly better compression on some files, but slightly worse on others. lpaq3a.exe was compiled with Intel C++ instead of g++, which might be faster on some machines (but is slower on my AMD Athlon-64). This archive also contains lpaq3e.exe, which is an Intel compile of elpaq3 (archive compatible).

lpaq4 (Oct. 1, 2007) is by Alexander Ratushnyak. lpaq4e is tuned for large text files.

lpaq5 (Oct. 16, 2007) by Alexander Ratushnyak includes a version for large text files (lpaq5e) with separate compression and decompression programs (lpaq5e-c.exe and lpaq5e-d.exe) all produced from the same source code.

lpaq6 (Oct. 22, 2007) by Alexander Ratushnyak includes a E8E9 filter for better compression of x86 executables. Compression times below are all about 16 seconds.

Compressor  Opt  acrord32.exe  mso97.dll
----------  ---  ------------  ---------
Uncompressed     3,870,784     3,782,416
lpaq5        9   1,287,001     1,651,519
lpaq5e       9   1,329,518     1,684,646
lpaq6        9   1,157,971     1,564,607
lpaq6e       9   1,187,698     1,591,053

lpaq7 by Alexander Ratushnyak, Oct. 31, 2007.

lpaq8 by Alexander Ratushnyak, Dec. 10, 2007. Update Feb. 15, 2008: lpaq8.exe and lpaq8e.exe are no longer packed with Upack 0.399. This caused false alarms by some virus detectors. However the .exe files are now larger (but still under 30 KB). lpaq8e (included) is a version tuned for large text files.

lpaq9m by Alexander Ratushnyak, Feb. 20, 2009, is tuned for the large text benchmark. It includes a dictionary preprocessor (no source code) and a compressed dictionary. A large number of earlier versions (lpaq9e through lpaq9l) are omitted. See the large text benchmark for details.

LPQ1

lpq1 (Dec. 23, 2007, v2 updated May 6, 2008, Matt Mahoney) is an archiver based on lpaq1 with support for files over 2 GB without any Windows or UNIX/Linux specific code (portable C++). The commands should be familiar to users of 7zip and rar. Archives are "solid" and cannot be updated once created.

  To create an archive:  lpq1 a archive [options] files...
  To extract:            lpq1 x archive [new names]
  To list contents:      lpq1 l archive
Compression options are -c to compress (default) or -s to store. Options can be mixed with filenames and apply to all files that follow, overriding previous options, for example:
  lpq1 a foo.lpq file1.txt -s file2.txt file3.txt -c file4.txt
creates archive foo.lpq, compressing file1.txt and file4.txt and storing file2.txt and file3.txt without compression.

Files are extrated in the order they are added and can be renamed during extraction:

  lpq1 x foo.lpq newfile1.txt newfile2.txt
renames file1.txt to newfile1.txt, file2.txt to newfile2.txt, then extracts file3.txt and file4.txt without renaming. Extraction does not clobber files. If any file already exists or cannot be created, then it is skipped. lpaq1 does not create folders. If a file is saved with a path (not recommended) and the directory does not exist during extraction, then the file is skipped.

Version 2 fixes a bug in version 1 that caused it to fail (file not found) when compressing more than about 50 files. Archives format is unchanged.

BBB

bbb (Aug. 31, 2006) is a Big Block BWT (Burrows-Wheeler transform) compressor. It allows blocks as large as 80% of available memory, unlike other BWT compressors (bzip2, GRZipII, sbc, dark, nanozip, etc), which allow only 20%. This allows better compression of very large text files. However, it lacks a mechanism for handling slow sorting of long repeating strings, so it can be slow on highly compressible data.

  To compress/decompress: bbb cmd input output
The command is a sequence of concatenated letters:
  c = compress (default),  d = decompress.
  f = fast mode, needs 5x block size memory, default uses 1.25x block size.
  q = quiet (no output except error messages).
  bN, kN, mN = use block size N bytes, KiB, MiB, default = m4 (compression only).
Commands should be concatenated in any order, e.g. bbb cfm100q foo foo.bbb means compress foo to foo.bbb in fast mode using 100 MiB block size in quiet mode.

bbb uses a memory efficient BWT. For compression, blocks are first context-sorted in small blocks and then merged using temporary files. For the inverse transform, instead of building a linked list, the program builds an index to the approximate location of the next node, then searches linearly for the exact location. Fast mode (f) uses a normal BWT, but requires blocks no larger than 20% of memory. Fast/slow mode does not affect the compressed file format and is independent for compression and decompression. In either case, the context-sorted block is compressed with an adaptive order-0 model and arithmetic coded.

SR2

sr2 (Aug. 3, 2007) is a symbol ranking compressor designed for high speed rather than good compression. It is a single file compressor and uses 6 MB memory. It models the last 3 bytes seen in the order-4 context by order of appearance, and models all other bytes in an order-1 context with arithmetic coding.

  To compress:   sr2 input output
  To decompress: unsr2 input output

History

sr2 by Matt Mahoney, Aug. 3, 2007.

sr3 modified by Nania Francesco Antonio, Oct. 28, 2007. Better compression, a little slower, recognizes some file types. There is only one program (no unsr2).

sr3a modified by Andrew Paterson, Feb. 22, 2008, is about 5% faster than sr3 and compresses 3 bytes smaller. It will also compress and decompress in SR2 and SR3 formats.

FLZP

flzp (June 18, 2008) is a fast byte-oriented LZP compressor. It can be used as a preprocessor to improve the compression ratio and speed of low order compressors, for example:

57,366,279 enwik8.flzp          8 sec (2.2 GHz Athlon-64, WinXP Home)
63,391,013 enwik8.fpaq0         36 sec
39,879,072 enwik8.flzp.fpaq0    8+21 sec
36,800,798 enwik8.ppmd-o2
30,884,687 enwik8.flzp.ppmd-o2
30,017,979 enwik8.ppmd-o3
29,372,279 enwik8.flzp.ppmd-o3

The program uses 8 MB memory. It compresses by dividing the input into blocks up to 64KB, then using any unused bytes in the block to represent LZP match lengths.

FPAQ

FPAQ0 is a simple order-0 arithmetic file compressor for stationary sources (independent bytes with a uniform distribution throughout the file). If you want to build a custom compressor based on your own model, this is a good place to start. The code is much smaller, simpler, and faster than the other PAQ versions. This model only compresses well where the bytes are independent and statistics are uniform throughout the file, so you will probably want to change it. There are several versions.

fpaq0.cpp originally posted to comp.compression on Sept. 3, 2004, posted here Oct. 9, 2004.

  To compress:   fpaq0 c input output
  To decompress: fpaq0 d input output

fpaq1.cpp is a 64-bit version that gets slightly better compression on purely stationary data, but is slower. It works like fpaq0 but compressed files are not compatible. Posted Jan. 10, 2006.

fpaq0b.cpp by Fabio Buffoni uses the same model as fpaq0 but compresses as well as fpaq1 on stationary sources using only 32 bit arithmetic. It uses the coder from paqar/paq6fb which outputs one bit at a time and uses a carry counter achieving 30 bits precision. It is faster than fpaq1 but slower than fpaq0. Posted Jan. 10, 2006.

fpaq0s.cpp by David A. Scott improves compression and speed over fpaq0b.cpp by coding 8-bit symbols without an explicit EOF bit or length field. It uses the end of the compressed file to mark the end of the uncompressed file, saving the O(log(n)) space normally required to encode the length. Posted Jan. 16, 2006.

fpaq.zip by Nikolay Petrov is an assembly language implementation of fpaq0. It maintains archive compatibility but compresses 37% faster and decompresses 46% faster (on a 2.2 GHz Athlon-64 in 32 bit mode, XP home). Feb. 20, 2006.

fpaq02.zip by David Anderson, May 27, 2007. It is a 64 bit version like fpaq1, but extended using the techniques of fpaq0s to exact rational representation of probabilities to 64 bits rather than 12 bits. The result is a tiny improvement in compression (a few bytes per million) but about 2.5 times slower than fpaq1.

fpaq0p.zip by Ilia Muraviev, Apr. 15, 2007, uses an adaptive order 0 model. Instead of keeping a 0,1 count for each context, it keeps a probability and updates it by adjusting by 1/32 of the error. This is faster because it avoids a division instruction.

fpaqa by Matt Mahoney, Dec. 15, 2007, is an adaptive order 0 model like fpaq0p, but is based on Jarek Duda's asymmetric binary coder (ABC) instead of an arithmetic coder. The coder has only one state variable (10-12 bits) so can be implemented using lookup tables without using multiplication. This is the first compressor to implement this coder.

fpaqb by Matt Mahoney, Dec. 18, 2007, updated Dec. 20, 2007, is an improved ABC coder using bytewise I/O and no tables for decompression. It uses one table for compression to avoid division. It has improved compression (near the theoretical limit) and faster decompression. It also uses less memory.

fpaqc by Matt Mahoney, Dec. 24, 2007 (documentation updated Dec. 25, 2007), is mainly a speed-optimized version of fpaqb. It also eliminates block size information from block headers, saving 3 bytes.

fpaq0f by Matt Mahoney, Jan. 28, 2007 uses the bit history in each bitwise context as context, followed by arithmetic coding.

fpaq0f2 by Matt Mahoney, Jan. 30, 2007 modifies fpaq0f by using a simplified bit history consisting of the last 8 bits, plus some minor improvements.

D translation of fpaq0 by Leonardo Maffi was posted Feb. 8, 2008. The DMD compile is about 60% as fast as g++ -O3.

Benchmarks

The benchmarks below are for calgary.tar, the 14 files of the Calgary corpus as a tar file, and for enwik8, a 100 MB text file used in the Large Text Compression Benchmark and Hutter Prize. Compression and decompression times are for enwik8 in seconds on a 2.2 GHz Athlon-64 3500+ in 32-bit Windows XP. Memory is in MB. Options are selected for maximum compression with 2 GB memory (except paq8f with 1 GB) on enwik8. All of the high memory compressors have options to select less memory at the cost of some compression.

As of 2009, timing results are tested on a Gateway M-7301U laptop with 2.0 GHz dual core Pentium T3200 (1MB L2 cache), 3 GB RAM, Vista SP1, 32 bit. Run times (on one core) are similar to my older computer.

Program           calgary.tar   enwik8      Comp  Decmp   Mem    Alg     Best for
------------       ---------  ----------    ----- -----   ----   ---     --------
Original size      3,152,896 100,000,000 
paq8hp12any    -8    594,269  16,230,028     5700  5700   1850   CM      English text (Hutter prize, slow)
paq8o10t       -8    594,587  17,772,821    14425 14372   1591   CM      Best general purpose (very slow)
paq9a          -9    676,659  19,974,112      539   510   1585   LZP+CM  Experimental
lpaq8           9    676,409  19,523,803      368   371   1542   CM      General purpose
lpaq8e          9    681,497  18,982,007      342   347   1542   CM      Large text files
bbb          m100    798,705  20,847,290      369   247    146   BWT     Very large text files (moderate speed)
sr2                  979,349  30,432,506       10    11      6   SR      General purpose (fast)
flzp               1,691,924  57,366,279        8     4      8   LZP     Speed or as a low order preprocessor
fpaq0              1,903,024  63,391,013       33    35     <1   ord 0   Independent bytes, stationary (C++ source)
fpaq               1,903,024  63,391,013       25    25     <1   ord 0   Independent bytes, stationary (assembler)
fpaq0p             1,685,882  61,457,810       13    13     <1   ord 0   Independent bytes, nonstationary (C++ source)
fpaqc              1,681,774  61,270,455       25    18      2   ord 0   Asymmetric coder source
fpaq0f2            1,601,207  56,916,872       21    20     <1   ord 0   Independent bytes, adaptive

ppmd -m256 -o10 -r1  756,106  21,388,296       88    89    256   PPM     General purpose
bzip2 -9             860,097  29,008,758       33    12      8   BWT     General purpose
zip -9             1,023,101  36,445,443       10     1.3    1.2 LZ77    General purpose

Obsolete
--------
paq8osse       -8    594,798  17,916,451    12526 12457   1778   CM
paq8f          -7    605,782  18,289,559     6896  6900    854   CM
paq8l          -8    594,796  17,916,450    13600 13639   1643   CM
paq8n          -8    594,796  17,916,420    13488 13548   1643   CM
paq8o          -8    594,798  17,916,451    13585 13526   1643   CM
paq8o3         -8    594,798  17,916,450    13458 13453   1636   CM
paq8o4 v1      -8    594,798  17,916,450    12678 12656   1636   CM
paq8o5         -8    594,700                              1846   CM
paq8o6         -8    594,734  17,904,721    13953 13972   1712   CM
paq8o7         -8    594,734  17,904,756    13914 13853   1574   CM
paq8o8         -8    594,734  17,904,756    13937 13915   1574   CM

paq8fthis2     -7    605,782                               855   CM
               -8    605,782  18,075,265     6910  6931   1693   CM
paq8fthis3     -8    605,782                              1693   CM
paq8fthis4     -8    605,782                              1693   CM
paq8fthis_fast -8    605,782                              1693   CM     Faster JPEG version of paq8fthis4

lpaq1 v1        9    681,811  19,755,948      365   360   1539   CM
lpaq1 v2        9    681,811  19,755,948      341   346   1539   CM
lpaq2           9    681,854  19,755,471      329   337   1539   CM
lpaq3           9    679,493  19,580,276      369   373   1542   CM
elpaq3          9    683,949  19,392,604      341   345   1542   CM
lpaq3a          9    679,045  19,585,951      424   389   1542   CM
lpaq4           9    679,034  19,583,905      374   375   1542   CM
lpaq4e          9    683,848  19,358,662      346   350   1542   CM
lpaq5           9    680,101  19,455,395      367   368   1542   CM
lpaq5e          9    683,016  19,027,721      348   364   1542   CM
lpaq6           9    679,053  19,562,861      385   391   1542   CM
lpaq6e          9    681,405  19,054,076      369   376   1542   CM
lpaq1a          9    681,943  19,759,778      365   349   1540   CM
drt|lpaq9m      8/9  663,480/ 17,964,751      209   209 774/1542 CM

fpaq1              2,110,511  63,502,003       48    49     <1   ord 0
fpaq0b             1,902,674  63,375,460       46    44     <1   ord 0
fpaq0s             1,902,673  63,375,457       43    42     <1   ord 0
fpaq02             2,110,506  63,501,997      134   132     <1   ord 0
fpaqa              1,682,231  61,310,408       27    24      2   ord 0   Asymmetric binary coder source
fpaqb              1,681,777  61,270,458       26    18      2   ord 0   Bytewise ABC coder source
fpaq0f             1,622,153  58,088,230       27    24     <1   ord 0   Independent bytes, adaptive

Additional benchmarks can be found here:

Large Text Compression Benchmark by Matt Mahoney
Maximum Compression by Werner Bergmans
Black Fox
Squeeze Chart by Stephan Busch
Squxe
Compression Max (in French)
Ultimate Command Line Compression by Johan de Bock
Emilcont Ultracompression by Berto Destasio.

JPEG Benchmarks

The PAQ series compressors beginning with PAQ7*, with the exception of PAQ8HP*, will compress JPEG images, including images embedded in other files (embed). They compress only baseline JPEG (base), which make up about 90-95% of images. They do not compress progressive mode (prog), the other 5-10%. Compression times are in seconds for a10.jpg. Both test files are baseline JPEG.

Program              a10.jpg  dscn3974.jpg  Comp  Base  Prog  Embed Non-JPEG
-------              -------  ------------  ----  ----  ----  ----- --------
Original size        842,468   1,114,198
Stuffit 11 (best)    638,540     834,079     1     yes   yes   no    yes
paq8o8 -6            638,206     827,071    30.8   yes   no    yes   yes
paq8fthis_fast -6    673,112     862,834     9.4   yes   no    yes   yes
PackJPG 2.3          697,822     873,261     1.5   yes   yes   no    no
lprepaq 5            699,692   1,083,737     9.5   yes   no    yes   yes
jpg2dct | ppmd -o2   770,302     964,803     2.0   yes   yes   no    no  (lossy)
rings 0.1            819,169   1,075,239     0.3   yes               yes
ppmd -o2             833,336   1,094,853     1.2   no    no    no    yes

Obsolete
--------
paq8f -6             698,214     880,902    17.6   yes   no    yes   yes
paq8l -6             698,510     881,244    21.5   yes   no    yes   yes
paq8fthis2 -6        660,740     843,264    22.7   yes   no    yes   yes
paq8n -6             661,321     843,920    25.1   yes   no    yes   yes
paq8fthis3 -6        652,266     835,871    23.7   yes   no    yes   yes
paq8o3 -6            652,040     835,556    26.8   yes   no    yes   yes
paq8fthis4 -6        645,102     831,258    29.7   yes   no    yes   yes
paq8o6 -6            643,952     829,142    31.4   yes   no    yes   yes
PackJPG 2.2          697,818     873,261     1.6   yes   no    no    no
paq8o7 -6            642,092     827,713    33.8   yes   no    yes   yes
jpg2dct is a JPEG preprocessor by Jean-Pierre Demailly, Sept. 4, 2007 (ported from Linux to Windows by Matt Mahoney, Sept. 15, 2007). It expands a JPEG file into DCT coefficients, which makes it larger but sometimes more compressible to other compressors. It is lossy in that the inverse transform does not restore exactly the same file, although it does restore a pixel for pixel identical image.

Incompressible Data?

SHARND generates cryptographically secure pseudo random files (I think. I am looking for feedback). These should not be compressible by any algorithm that does not know the key used to generate it. To use:
  sharnd key n
to generate the n-byte random file, sharnd.out. The key can be any ASCII string (quoted if it contains spaces). Data is generated as a series of 20-byte strings, x[1], x[2], x[3]... such that x[i] = SHA1(x[i-1] + key), and x[0] = 0.

sharnd.c, June 7, 2004
sharnd.exe, 29,860 bytes (Windows executable, compiled with g++, UPX)

The 100,000 byte file sharnd_challenge.dat posted on June 21, 2004 was generated by sharnd using a secret key. The SHARND challenge is to guess either the key (an ASCII string less than 80 bytes) or any of the bytes following the data.

Furthermore, without knowing the key, it is believed to be impossible to compress this file, i.e. to find a decompression program such that its size (as source code or executable) plus the size of its input is 99,999 bytes or less. (Of course, such a program does exist. It is posted here, and its input is less than 80 bytes).

BARF - Ultimate Compression?

Compress any nonempty file. Recursively compress files to arbitrarily small sizes. Compress the Calgary corpus to 1 byte per file in under 1 second. Is it possilbe with the Better Archiver with Recursive Functionality? :-)

Matt Mahoney, mmahoney@cs.fit.edu