Lempel-Ziv Welch compression algorithm

Please see disclaimer on main page.

Note: The LZW compression algorithm used to be covered by a patent held by Unisys. Unisys required anyone using the algorithm to purchase a license. However, the patent has now expired and anyone is free to implement the algorithm without a license from Unisys.


Basic overview:

LZW compression creates a table of strings commonly occurring in the data being compressed, and replaces the actual data with references into the table. The table is formed during compression at the same time at which the data is encoded and during decompression at the same time as the data is decoded.

As an example, say we are compressing a data sequence consisting of the values a, b and c in arbitrary order:

            ababcbcabcacab

First of all, a table is created with all possible data values:

   CompTable = {a,b,c}

We create a string, call it 'BuildString', initially empty. We also have a character variable 'EncChar'. We start by letting 'EncChar' equal the next character in the data stream, in this case 'a'. Then, we check if (BuildString + EncChar) (which is 'a' at the moment) is in CompTable. It is, so we loop back and read another value for 'EncChar' ('b'). Now, (BuildString + EncChar) is 'ab' which is not in CompTable - so, we:

  1. Output the index of BuildString (='a'), currently 0.
  2. Add (BuildString + EncChar) to CompTable, so it appears as:
          CompTable = {a,b,c,ab}
  3. Set BuildString to contain only EncChar (='b')

The process is repeated until all data is processed. The idea is that the index into the table will take up less space than the data itself. Note that in the previous example, the next time the sequence 'ab<x>' is encountered, the index number 3 will be written out and 'ab<x>' will be added to the table. In this way, commonly occurring strings are recognized.


Formal Implementation

Compression:

Create string table "CompTable" which must contain all possible data values;
Create string "BuildString" (empty), character "EncChar"
[1] Char EncChar = next char in datastream (if no more, goto [2])
    If (BuildString + EncChar) is in CompTable {
      BuildString = BuildString + EncChar, goto [1]
    } (end if)
    output index of BuildString in CompTable
    BuildString = EncChar
    goto [1]
[2] Output index of BuildString in CompTable
    (Finish)

Decompression:

Create string table "CompTable" which must contain all possible data values;
Create Indexes (into CompTable) "ReadIndex" and "OldIndex"
Create string "BuildString", character "BuildChar"
    ReadIndex = index read from data stream
    goto [2];
[1] ReadIndex = index read from data stream (goto [3] if end);
    if ReadIndex exists in CompTable {
      BuildChar = First char of CompTable entry for ReadIndex;
    } else {
      BuildChar = first character of BuildString;
    } (end if)
    Add (BuildString + BuildChar) to CompTable;
[2] BuildString = CompTable entry for ReadIndex;
    Output BuildString;
    goto [1]
[3] (finish)

Notes

1. The "all possible data values" above refers to possible character values which may be encountered in the data stream. Generally, this will be all possible values a byte can contain (ie 0 to 255) however this may not be the case; when compressing plain text, for instance, certain characters will never be encountered.

2. The number of elements in CompTable increases as compression or decompression occurs. The necessary size of the index (size = width in bits) will therefore change - it is possible to write (or read) a different number of bits depending on this need. Remember, however, that during decompression, whenever an index is read (apart from the very first), there is actually one more element in the table than the decompressor knows about.

3. When using the technique described in 2 (above), there may be situations where the full index number need not be output, only enough bits to distinguish the index number. For instance, if the highest index number is binary 110000, and the index being presently written or read is 110000, then only the first two bits are necessary - the rest of the bits must be zero, since anything else would constitute an illegal index number.

4. When the compression table becomes full, it is usually completely re-initialized. To do this correctly, the program must act as if compression/decompression has finished, and then re-initialize. It is also possible to retain commonly occurring strings, if a 'use count' is maintained.

5. Each new item in the compression table consists of a lower-index item, followed by a single character. Memory can be saved be implementing all elements as an index as well as a single character (this may cause a speed reduction, and the code in decompression may need to implement an 'OldIndex' variable)