CIS 3223 Lab12 – Implementing Huffman Encoding

DUE: (before the next lab, 10:30am Friday, April 25, 2008)


 

Huffman Encoding

 

Huffman encoding is a greedy algorithm that finds optimal binary coding of symbols from a sequence such that the encoded sequence has the minimal bit-length. The pseudocode of the algorithm is described in Section 5.2 of the Dasgupta textbook.

 

 

TASKS

1. (60%) Your goal is to implement in Matlab the HUFFMAN algorithm that, given a vector of frequencies f of symbols determines their optimal binary representation. The length L of vector f (L=length(f)) is the size of the alphabet, and element f(i) is the number of times the i-th symbol appears in a sequence. The function definition is:

 

function code = huffman(f)

 

You should assume that code is a cell array with L elements, code{i} is the binary representation of the i-th symbol as a char array. For example, if the i-th symbol is represented by binary word 0010, code{i} = ‘0010’.

Test your code on an 4-symbol example from the textbook. Create few more examples with larger number of symbols (let us say, of length 6 and 10)  for testing. Determine Huffman encodings by hand. Submit your examples with the Huffman encoding obtained by hand and the solution given by your HUFFMAN function.

 

2. (40%) Download lab12_sequence.mat that contains char array seq with 756 symbols from alphabet {A, B, C, D}. Calculate frequency of each symbol and apply HUFFMAN to obtain their optimal binary representation. What is the total bit-length of seq_binary when symbols are converted to Huffman bitwords. How does it compare with the original size of seq, assuming the ASCII representation of symbols with 8 bits? How does it compare with the simplistic representation that encodes each symbol with 2 bits?

Let us go a step further – represent each pair of symbols (such as AA, or CD) as a mega-symbol. Therefore, there are 16 mega-symbol in the alphabet, and the sequence seq is represented as 756/2 mega symbols. Calculate frequencies of the mega-symbols and run the HUFFMAN to determine their binary encoding. What is the bitlength of seq when encoded in this way? How does it compare with the previous solutions?