PYME.contrib.lzw module

A stream friendly, simple compression library, built around iterators. See L{compress} and L{decompress} for the easiest way to get started.

After the TIFF implementation of LZW, as described at U{http://www.fileformat.info/format/tiff/corion-lzw.htm}

In an even-nuttier-shell, lzw compresses input bytes with integer codes. Starting with codes 0-255 that code to themselves, and two control codes, we work our way through a stream of bytes. When we encounter a pair of codes c1,c2 we add another entry to our code table with the lowest available code and the value value(c1) + value(c2)[0]

Of course, there are details :)

The Details

Our control codes are

  • CLEAR_CODE (codepoint 256). When this code is encountered, we flush the codebook and start over.
  • END_OF_INFO_CODE (codepoint 257). This code is reserved for encoder/decoders over the integer codepoint stream (like the mechanical bit that unpacks bits into codepoints)

When dealing with bytes, codes are emitted as variable length bit strings packed into the stream of bytes.

codepoints are written with varying length
  • initially 9 bits
  • at 512 entries 10 bits
  • at 1025 entries at 11 bits
  • at 2048 entries 12 bits
  • with max of 4095 entries in a table (including Clear and EOI)

code points are stored with their MSB in the most significant bit available in the output character.

>>> import lzw
>>>
>>> mybytes = lzw.readbytes("README.txt")
>>> lessbytes = lzw.compress(mybytes)
>>> newbytes = b"".join(lzw.decompress(lessbytes))
>>> oldbytes = b"".join(lzw.readbytes("README.txt"))
>>> oldbytes == newbytes
True
class PYME.contrib.lzw.BitPacker(initial_code_size)

Bases: object

Translates a stream of lzw codepoints into a variable width packed stream of bytes, for use by L{BitUnpacker}. One of a (potential) set of encoders for a stream of LZW codepoints, intended to behave as closely to the TIFF variable-width encoding scheme as closely as possible.

The inbound stream of integer lzw codepoints are packed into variable width bit fields, starting at the smallest number of bits it can and then increasing the bit width as it anticipates the LZW code size growing to overflow.

This class knows all kinds of intimate things about how it’s upstream codepoint processors work; it knows the control codes CLEAR_CODE and END_OF_INFO_CODE, and (more intimately still), it makes assumptions about the rate of growth of it’s consumer’s codebook. This is ok, as long as the underlying encoder/decoders don’t know any intimate details about their BitPackers/Unpackers

Methods

pack(codepoints) Given an iterator of integer codepoints, returns an iterator over bytes containing the codepoints packed into varying lengths, with bit width growing to accomodate an input code that it assumes will grow by one entry per codepoint seen.

Takes an initial code book size (that is, the count of known codes at the beginning of encoding, or after a clear)

Methods

pack(codepoints) Given an iterator of integer codepoints, returns an iterator over bytes containing the codepoints packed into varying lengths, with bit width growing to accomodate an input code that it assumes will grow by one entry per codepoint seen.
pack(codepoints)

Given an iterator of integer codepoints, returns an iterator over bytes containing the codepoints packed into varying lengths, with bit width growing to accomodate an input code that it assumes will grow by one entry per codepoint seen.

Widths will be reset to the given initial_code_size when the LZW CLEAR_CODE or END_OF_INFO_CODE code appears in the input, and bytes following END_OF_INFO_CODE will be aligned to the next byte boundary.

>>> import lzw
>>> pkr = lzw.BitPacker(258)
>>> [ b for b in pkr.pack([ 1, 257]) ] == [ chr(0), chr(0xC0), chr(0x40) ]
True
class PYME.contrib.lzw.BitUnpacker(initial_code_size)

Bases: object

An adaptive-width bit unpacker, intended to decode streams written by L{BitPacker} into integer codepoints. Like L{BitPacker}, knows about code size changes and control codes.

Methods

unpack(bytesource) Given an iterator of bytes, returns an iterator of integer code points.

initial_code_size is the starting size of the codebook associated with the to-be-unpacked stream.

Methods

unpack(bytesource) Given an iterator of bytes, returns an iterator of integer code points.
unpack(bytesource)

Given an iterator of bytes, returns an iterator of integer code points. Auto-magically adjusts point width when it sees an almost-overflow in the input stream, or an LZW CLEAR_CODE or END_OF_INFO_CODE

Trailing bits at the end of the given iterator, after the last codepoint, will be dropped on the floor.

At the end of the iteration, or when an END_OF_INFO_CODE seen the unpacker will ignore the bits after the code until it reaches the next aligned byte. END_OF_INFO_CODE will not stop the generator, just reset the alignment and the width

>>> import lzw
>>> unpk = lzw.BitUnpacker(initial_code_size=258)
>>> [ i for i in unpk.unpack([ chr(0), chr(0xC0), chr(0x40) ]) ]
[1, 257]
class PYME.contrib.lzw.ByteDecoder

Bases: object

Decodes, combines bit-unpacking and interpreting a codepoint stream, suitable for use with bytes generated by L{ByteEncoder}.

See L{ByteDecoder} for a usage example.

Methods

decodefrombytes(bytesource) Given an iterator over BitPacked, Encoded bytes, Returns an iterator over the uncompressed bytes.

Methods

decodefrombytes(bytesource) Given an iterator over BitPacked, Encoded bytes, Returns an iterator over the uncompressed bytes.
decodefrombytes(bytesource)

Given an iterator over BitPacked, Encoded bytes, Returns an iterator over the uncompressed bytes. Dual of L{ByteEncoder.encodetobytes}. See L{ByteEncoder} for an example of use.

class PYME.contrib.lzw.ByteEncoder(max_width=12)

Bases: object

Takes a stream of uncompressed bytes and produces a stream of compressed bytes, usable by L{ByteDecoder}. Combines an L{Encoder} with a L{BitPacker}.

>>> import lzw
>>>
>>> enc = lzw.ByteEncoder(12)
>>> bigstr = b"gabba gabba yo gabba gabba gabba yo gabba gabba gabba yo gabba gabba gabba yo"
>>> encoding = enc.encodetobytes(bigstr)
>>> encoded = b"".join( b for b in encoding )
>>> encoded
'3\x98LF#\x08\x82\x05\x04\x83\x1eM\xf0x\x1c\x16\x1b\t\x88C\xe1q(4"\x1f\x17\x85C#1X\xec.\x00'
>>>
>>> dec = lzw.ByteDecoder()
>>> decoding = dec.decodefrombytes(encoded)
>>> decoded = b"".join(decoding)
>>> decoded == bigstr
True

Methods

encodetobytes(bytesource) Returns an iterator of bytes, adjusting our packed width between minwidth and maxwidth when it detects an overflow is about to occur.

max_width is the maximum width in bits we want to see in the output stream of codepoints.

Methods

encodetobytes(bytesource) Returns an iterator of bytes, adjusting our packed width between minwidth and maxwidth when it detects an overflow is about to occur.
encodetobytes(bytesource)

Returns an iterator of bytes, adjusting our packed width between minwidth and maxwidth when it detects an overflow is about to occur. Dual of L{ByteDecoder.decodefrombytes}.

class PYME.contrib.lzw.Decoder

Bases: object

Uncompresses a stream of lzw code points, as created by L{Encoder}. Given a list of integer code points, with all unpacking foolishness complete, turns that list of codepoints into a list of uncompressed bytes. See L{BitUnpacker} for what this doesn’t do.

Methods

code_size() Returns the current size of the Decoder’s code book, that is, it’s mapping of codepoints to byte strings.
decode(codepoints) Given an iterable of integer codepoints, yields the corresponding bytes, one at a time, as byte strings of length E{1}.

Creates a new Decoder. Decoders should not be reused for different streams.

Methods

code_size() Returns the current size of the Decoder’s code book, that is, it’s mapping of codepoints to byte strings.
decode(codepoints) Given an iterable of integer codepoints, yields the corresponding bytes, one at a time, as byte strings of length E{1}.
code_size()

Returns the current size of the Decoder’s code book, that is, it’s mapping of codepoints to byte strings. The return value of this method will change as the decode encounters more encoded input, or control codes.

decode(codepoints)

Given an iterable of integer codepoints, yields the corresponding bytes, one at a time, as byte strings of length E{1}. Retains the state of the codebook from call to call, so if you have another stream, you’ll likely need another decoder!

Decoders will NOT handle END_OF_INFO_CODE (rather, they will handle the code by throwing an exception); END_OF_INFO should be handled by the upstream codepoint generator (see L{BitUnpacker}, for example)

>>> import lzw
>>> dec = lzw.Decoder()
>>> ''.join(dec.decode([103, 97, 98, 98, 97, 32, 258, 260, 262, 121, 111, 263, 259, 261, 256]))
'gabba gabba yo gabba'
class PYME.contrib.lzw.Encoder(max_code_size=4096)

Bases: object

Given an iterator of bytes, returns an iterator of integer codepoints, suitable for use by L{Decoder}. The core of the “compression” side of lzw compression/decompression.

Methods

code_size() Returns a count of the known codes, including codes that are implicit in the data but have not yet been produced by the iterator.
encode(bytesource) Given an iterator over bytes, yields the corresponding stream of codepoints.
flush() Yields any buffered codepoints, followed by a CLEAR_CODE, and clears the codebook as a side effect.

When the encoding codebook grows larger than max_code_size, the Encoder will clear its codebook and emit a CLEAR_CODE

Methods

code_size() Returns a count of the known codes, including codes that are implicit in the data but have not yet been produced by the iterator.
encode(bytesource) Given an iterator over bytes, yields the corresponding stream of codepoints.
flush() Yields any buffered codepoints, followed by a CLEAR_CODE, and clears the codebook as a side effect.
code_size()

Returns a count of the known codes, including codes that are implicit in the data but have not yet been produced by the iterator.

encode(bytesource)

Given an iterator over bytes, yields the corresponding stream of codepoints. Will clear the codes at the end of the stream.

>>> import lzw
>>> enc = lzw.Encoder()
>>> [ cp for cp in enc.encode("gabba gabba yo gabba") ]
[103, 97, 98, 98, 97, 32, 258, 260, 262, 121, 111, 263, 259, 261, 256]
flush()

Yields any buffered codepoints, followed by a CLEAR_CODE, and clears the codebook as a side effect.

class PYME.contrib.lzw.PagingDecoder(initial_code_size)

Bases: object

UNTESTED. Dual of PagingEncoder, knows how to handle independantly encoded, END_OF_INFO_CODE delimited chunks of an inbound byte stream

Methods

decodepages(bytesource) Takes an iterator of bytes, returns an iterator of iterators of uncompressed data.
next_page(codepoints) Iterator over the next page of codepoints.
decodepages(bytesource)

Takes an iterator of bytes, returns an iterator of iterators of uncompressed data. Expects input to conform to the output conventions of PagingEncoder(), in particular that “pages” are separated with an END_OF_INFO_CODE and padding up to the next byte boundary.

BUG: Dangling trailing page on decompression.

>>> import lzw
>>> pgdec = lzw.PagingDecoder(initial_code_size=257)
>>> pgdecoded = pgdec.decodepages(
...     ''.join([ '\x80\x1c\xcc\'\x91\x01\xa0\xc2m6',
...               '\x99NB\x03\xc9\xbe\x0b\x07\x84\xc2',
...               '\xcd\xa68|"\x14 3\xc3\xa0\xd1c\x94',
...               '\x02\x02\x80\x18M\xc6A\x01\xd0\xd0e',
...               '\x10\x1c\x8c\xa73\xa0\x80\xc7\x02\x10',
...               '\x19\xcd\xe2\x08\x14\x10\xe0l0\x9e`\x10',
...               '\x10\x80\x18\xcc&\xe19\xd0@t7\x9dLf\x889',
...               '\xa0\xd2s\x80@@' ])
... )
>>> [ b"".join(pg) for pg in pgdecoded ]
['say hammer yo hammer mc hammer go hammer', 'and the rest can go and play', "can't touch this", '']
next_page(codepoints)

Iterator over the next page of codepoints.

class PYME.contrib.lzw.PagingEncoder(initial_code_size, max_code_size)

Bases: object

UNTESTED. Handles encoding of multiple chunks or streams of encodable data, separated with control codes. Dual of PagingDecoder.

Methods

encodepages(pages) Given an iterator of iterators of bytes, produces a single
encodepages(pages)

Given an iterator of iterators of bytes, produces a single iterator containing a delimited sequence of independantly compressed LZW sequences, all beginning on a byte-aligned spot, all beginning with a CLEAR code and all terminated with an END_OF_INFORMATION code (and zero to seven trailing junk bits.)

The dual of PagingDecoder.decodepages

>>> import lzw
>>> enc = lzw.PagingEncoder(257, 2**12)
>>> coded = enc.encodepages([ "say hammer yo hammer mc hammer go hammer", 
...                           "and the rest can go and play",
...                           "can't touch this" ])
...
>>> b"".join(coded)
'\x80\x1c\xcc\'\x91\x01\xa0\xc2m6\x99NB\x03\xc9\xbe\x0b\x07\x84\xc2\xcd\xa68|"\x14 3\xc3\xa0\xd1c\x94\x02\x02\x80\x18M\xc6A\x01\xd0\xd0e\x10\x1c\x8c\xa73\xa0\x80\xc7\x02\x10\x19\xcd\xe2\x08\x14\x10\xe0l0\x9e`\x10\x10\x80\x18\xcc&\xe19\xd0@t7\x9dLf\x889\xa0\xd2s\x80@@'
PYME.contrib.lzw.bitstobytes(bits)

Interprets an indexable list of booleans as bits, MSB first, to be packed into a list of integers from 0 to 256, MSB first, with LSBs zero-padded. Note this padding behavior means that round-trips of bytestobits(bitstobytes(x, width=W)) may not yield what you expect them to if W % 8 != 0

Does NOT pack the returned values into a bytearray or the like.

>>> import lzw
>>> bitstobytes([0, 0, 0, 0, 0, 0, 0, 0, "Yes, I'm True"]) == [ 0x00, 0x80 ]
True
>>> bitstobytes([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]) == [ 0x01, 0x30 ]
True
PYME.contrib.lzw.bytestobits(bytesource)

Breaks a given iterable of bytes into an iterable of boolean values representing those bytes as unsigned integers.

>>> import lzw
>>> [ x for x in lzw.bytestobits(b"\x01\x30") ]
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]
PYME.contrib.lzw.compress(plaintext_bytes)

Given an iterable of bytes, returns a (hopefully shorter) iterable of bytes that you can store in a file or pass over the network or what-have-you, and later use to get back your original bytes with L{decompress}. This is the best place to start using this module.

PYME.contrib.lzw.decompress(compressed_bytes)

Given an iterable of bytes that were the result of a call to L{compress}, returns an iterator over the uncompressed bytes.

PYME.contrib.lzw.filebytes(fileobj, buffersize=1024)

Convenience for iterating over the bytes in a file. Given a file-like object (with a read(int) method), returns an iterator over the bytes of that file.

PYME.contrib.lzw.intfrombits(bits)

Given a list of boolean values, interprets them as a binary encoded, MSB-first unsigned integer (with True == 1 and False == 0) and returns the result.

>>> import lzw
>>> lzw.intfrombits([ 1, 0, 0, 1, 1, 0, 0, 0, 0 ])
304
PYME.contrib.lzw.inttobits(anint, width=None)

Produces an array of booleans representing the given argument as an unsigned integer, MSB first. If width is given, will pad the MSBs to the given width (but will NOT truncate overflowing results)

>>> import lzw
>>> lzw.inttobits(304, width=16)
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]
PYME.contrib.lzw.readbytes(filename, buffersize=1024)

Opens a file named by filename and iterates over the L{filebytes} found therein. Will close the file when the bytes run out.

PYME.contrib.lzw.unpackbyte(b)

Given a one-byte long byte string, returns an integer. Equivalent to struct.unpack(“B”, b)

PYME.contrib.lzw.writebytes(filename, bytesource)

Convenience for emitting the bytes we generate to a file. Given a filename, opens and truncates the file, dumps the bytes from bytesource into it, and closes it