CIS 3223 Lab12 – Implementing Huffman Encoding
DUE: (before the next lab, 10:30am Friday, April 25, 2008)
Huffman Encoding
Huffman encoding is a greedy
algorithm that finds optimal binary coding of symbols from a sequence such that
the encoded sequence has the minimal bit-length. The pseudocode
of the algorithm is described in Section 5.2 of the Dasgupta
textbook.
TASKS
1. (60%) Your goal is to implement in Matlab
the HUFFMAN algorithm that, given a vector of frequencies f of symbols determines their optimal binary
representation. The length L of vector f (L=length(f)) is the size of the alphabet, and element f(i) is the number of times the i-th
symbol appears in a sequence. The function definition is:
function code = huffman(f)
You should assume that code is a cell array with L
elements, code{i} is the
binary representation of the i-th symbol as a char
array. For example, if the i-th symbol is represented
by binary word 0010, code{i} = ‘0010’.
Test your
code on an 4-symbol example from the textbook. Create
few more examples with larger number of symbols (let us say, of length 6 and 10) for testing. Determine
Huffman encodings by hand. Submit
your examples with the Huffman encoding obtained by hand and the solution given
by your HUFFMAN function.
2. (40%)
Download lab12_sequence.mat
that contains char array seq with 756 symbols from
alphabet {A, B, C, D}. Calculate
frequency of each symbol and apply HUFFMAN to obtain their optimal binary
representation. What is the total bit-length of seq_binary when symbols are converted
to Huffman bitwords. How does it compare with the original size of seq, assuming the ASCII representation of symbols with
8 bits? How does it compare with the simplistic representation that encodes
each symbol with 2 bits?
Let us go a step further –
represent each pair of symbols (such as AA, or CD) as a mega-symbol. Therefore,
there are 16 mega-symbol in the alphabet, and the
sequence seq is represented
as 756/2 mega symbols. Calculate
frequencies of the mega-symbols and run the HUFFMAN to determine their binary
encoding. What is the bitlength of seq when encoded in this way? How does it compare with
the previous solutions?