I do a lot of programming in the large and spend far too much time, according to some, thinking about programming in the large concepts such as SOA and CEP but from time to time I like to keep it real and code something detailed and low(er) level. Recently I’ve been doing a little investigation around compression and so I ended up reading up on Huffman Coding. Huffman coding is a way of taking a fixed length encoding and turning it into a shorter variable length encoding.
Back to basics
That’s a little vague and academic so I’m going to assume little knowledge of how computers store data and work my way up and see if I can explain it properly.
Computer storage is made of switches, lots and lots of on/off switches, a switch in storage terms is called a bit and is the smallest possible unit of storage. Switches can be on or off and computers represent that on or off as 1 or 0 respectively. That’s why the classic power switch on most devices is a zero with a one overlayed as in figure 1, 0 and 1 mean off and on to engineers. If I only have 0s and 1s then I’m going to have to find a way of representing numbers bigger than 1 using them.
Getting to second base
Classically we use a numbering system with ten digits zero through to nine, which mathematicians would call base 10 or decimal. However as computers only have two digits they use a counting system base 2 which is referred to as binary. To count in binary we simply increment the number on the right until it reaches its digit limit (1) and then increment the number closest to it on the left whilst reseting that number to zero, if the number on the left is also at its limit we move again to the left and reset it until we find one we can increment, this is exactly the algorithm we use for normal counting. Therefore we have
binary decimal 0 0 1 1 10 2 11 3 100 4 101 5 110 6 111 7 1000 8 1001 9 1010 10 1011 11 1100 12
Interestingly and logically where in decimal the first digit represents units and the second tens and the third hundreds, in decimals they represent 1, 2, 4, 8 and double for each digit to the left. Thus 101 (binary) is
(1 × 1) + (2 × 0) + (4 × 1) = 5 (decimal)
Anyway enough about bases, I think we have that covered off. What we need to understand is you can represent numbers using sequences of zeros and one on computers.
Computers encode many things to files such as text, images, video, CAD models and many more. Let’s choose the simplest thing possible, though, for our example: text. To have text file we must have some way of mapping numbers to/from letters, what computer scientists call character encodings. Whilst conceptually simply these things stike fear into programmers who are smart enough to understand what they are and what a mess we’ve created for ourselves. I’m going to dodge the bullet of fully explain why they are a pain and use the simplest one as an example. US-ASCII, which is the seemly tautological United States-American Standard Code for Information Interchange, as an example character encoding.
US-ASCII defines a mapping of numbers to characters. Each character is represented by an 8 binary digit number known as a byte, remember a binary digit is called a bit, so a byte is an 8-bit number. The reason numbers are stored as 8-bits is so that they can be read off storage easily. Remember there are only zeros and ones on computers. Therefore to know where a number starts and ends there has to be some way of splitting them back apart. A file contains for example:
Which is two bytes of data namely 01001111 and 01001011 or in decimal 79 and 75 or in US-ASCII encoding “OK”.
ASCII defines which numbers represent which characters, for example ‘A’ is 41 (decimal) that is 01000001 (binary). As it just uses 8 bit numbers there are only 256 characters possible (i.e. 00000000 (0) through to 11111111 (255)). Full mappings of numbers to characters for ASCII are availble all over the web.
You might be thinking why bother with the leading 0s on each byte in the “OK” example above, well if I did that I would have
and how on earth do I know which digits belong to which number, it could be 10011111, 001011 or it could be 100111, 11010101 indeed there are 13 possibilities assuming I know there are two numbers represented, otherwise it could be 14 numbers or 1, basically I’m screwed if I don’t know the length of the numbers in bits. I can’t add anything to separate them because all I have is zeros and ones and I can’t use a zero or one to separate them because I won’t have anything left to count with; you can’t count in base 1. Therefore in order to put numbers in storage I need to predetermine a length for those numbers in bits and stick to it. The general building blocks of information in computer science is 8-bit bytes and that is why.
To summarize, we have numbers represented by 8-bit bytes and we have a mapping of those numbers to characters assuming we are using US-ASCII. However forcing all numbers to by 8 bits seems awfully wasteful. Let’s suppose I want to store the sequence of 20 characters:
Well in this case I only need two numbers to do so so I could just store them using a basic character binary encoding of 0=a and 1=b. Therefore rather than using 8 x 20 = 160 bits of storage I could just use 20 bits. To spell it out I could use
Rather than the US-ASCII equivalent:
1000110110001110 1000110110001110 1000110110001110 1000110110001110 1000110110001110
This is huge storage saving in percentage terms. It’s important to point out that an encoding maps one set of symbols to another, in the case of ASCII it is 8-bit bytes to common western characters, for Huffman it is variable length binary sequences to 8-bit bytes. Therefore what’s actually happening here is:
a (ASCII) -> 10001101 (8-bit byte) -> 0 (Huffman)
Now what if I have 3 characters in my sequence:
This presents a problem for the encoding technique above, because I have no obvious way to encode the ‘c’. If I use the next number in the binary sequence namely 10 for c. I have a problem that a sequence such as cba:
has ambigious meaning, it could be abc, cc, bac or cba, not much use. However, and here’s the key insight, suppose I only use zero/one sequences that don’t appears as the beginning of any other sequence to represent 8-bit numbers. Therefore I pick 0=a and because I can’t use a number starting with 0 after picking 0=a I must use something like 10=b leaving 11=c. I can therefore unambigiously encode cba as:
There is nothing else it can be but cba. I don’t need to know the bit length of each number to decode. To work the example the first digit, 1, this isn’t a character, nothing is encoded as 1 so I add the next bit to it and get 11, well that’s c there is no other number starting with 11 so I can unambigiously decode it. Similarly with 10 and 0.
The aim of Huffman Coding is to create a shorter encoding that the original fixed width encoding. Indeed it is a little more than that, it is to get the shortest possible encoding unambigious encoding. How do we find this out? Well first a few things about Huffman Coding. The genius of the algorithm is that it is simple and will always find the optimum (shortest possible) encoding and this encoding will always be less than or equal to the length of the equivalent 8-bit encoding.
Firstly I describe the algorithm then I’ll do any example. As we know US-ASCII is simply a sequence of 8-bit bytes as is every other file on your computer (on the vast majority of modern computers). Therefore we have a sequence of bytes.
Huffman encodings use trees, Huffman trees, to describe their encoding. A Huffman tree is a binary tree, in that each branch gives way to 2 or fewer branches.
So the algorithm:
- Count the number of occurences of each byte in the sequence and put them in a list
- Sort that list in ascending order of freqency
- Pick the two lowest frequency bytes off the top of the table and add them to the tree as two branches on the trunk
- Add the frequencies of those two nodes together and add that part of the tree back to the list and sort the list again in ascending order of frequencies
- If there is more than one item left in the list then go to step 3, otherwise you are done, the last item in the list is the completed tree
In this example we are going to encode the string: “Mississippi hippies”.
Here’s the frequency table:
[_,1] [M,1] [e,1] [h,1] [p,4] [s,5] [i,6]
Note I’ve substituded [space] for the underscore character “_“.
So we take the two smallest values off the top and create our first part of the tree, hopefully the notation is self explanitory the tilda (~) means that it is just a node and the letter before is simply and identifier, the leafs have the character they represent and the frequencies:
z[~,2] / \ / \ [_,1] [M,1]
We add this back to the list
[e,1] [h,1] z[~,2] [p,4] [s,5] [i,6]
Then the next two lowest frequency items
y[~,2] / \ / \ [e,1] [h,1]
Adding it back in gives
y[~,2] z[~,2] [p,4] [s,5] [i,6]
Then the next two (which are both sub-trees) gives
x[~,4] / \ --- --- / \ y[~,2] z[~,2] / \ / \ / \ / \ [e,1] [h,1] [_,1] [M,1]
Adding it back in gives
x[~,4] [p,4] [s,5] [i,6]
w[~,8] / \ / \ x[~,4][p,4] / \ --- --- / \ y[~,2] z[~,2] / \ / \ / \ / \ [e,1] [h,1] [_,1] [M,1]
Back to the list:
[s,5] [i,6] w[~,8]
v[~,11] / \ / \ [s,5] [i,6]
And the final tree:
u[~,19] / \ --- --- / \ w[~,8] v[~,11] / \ / \ / \ / \ x[~,4][p,4] [s,5] [i,6] / \ --- --- / \ y[~,2] z[~,2] / \ / \ / \ / \ [e,1] [h,1] [_,1] [M,1]
There we have it, our final Huffman tree. How do we use it? Well in order to find the encoding for each letter we travel down the tree until we get to it. When we go left we add a 0 when we go right we add 1. For example to encode ‘e’ which is a rare character in our input string we get the following:
Or for ‘i’ which is very common in the input sequence we get
a much sorter encoding. From this tree we can build an encoding table as follows:
_ 0010 M 0011 e 0000 h 0001 p 01 s 10 i 11
Thus we can encode the orginal string “Mississippi hippies” into:
00111110 10111010 11010111 00100001 11010111 000010
Which is 46 bits rather than the (19 x 8) 152 of the orginal. We’ve compressed 19 bytes into 5 bytes (last byte is zero padded).
A couple of notes:
- For input with almost every byte possibility and roughly equal frequency of those bytes compression will be very limited — large movie files for instance.
- For some encodings certain bytes will encode to longer than 8-bit codes, but the shorter encodings will at least offset those
- Huffman tree can be stored very effiently using a fix length format, pick the left then the right, move down to the left and repeat until the tree is traverse. When you hit nodes in the tree that are leafs output 1 followed by the value (not the frequency), when they are nodes that are simple parents of others output 0 followed by nothing.
This was a pretty fiddly blog post, please let me know if I’ve made mistakes in the comments or email.
That’s it, I think.