basic idea behind Huffman coding: to use fewer bits for characters that occur more frequently.
The output from Huffman’s algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol.
Terminology
Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called “prefix-free codes”, that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol)
Problem definition
Given
A set of symbols and their weights (usually proportional to probabilities).
Find
A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).
Process
Two type of nodes
Leaf nodes
1
2
3
4
5typedef struct LeafNode{
Type symbol ; // the type of the thing you want to compress
double weight ; // usually is the frequency of apparence
struct LeafNode * parent ;
}LeafNode ;Internal nodes
1
2
3
4
5
6typedef struct InternalNode{
LeafNode * Lchild ;
LeafNode * Rchild ;
double weight ;
struct InternalNode * parent ; //this is optional
}InternalNode;The process begins with the leaf nodes containing the probabilities of the symbol they represent.
Then, the process takes the two nodes with smallest probability, and creates a new internal node having these two nodes as children. The weight of the new node is set to the sum of the weight of the children.
We then apply the process again, on the new internal node and on the remaining nodes, we repeat this process until only one node remains, which is the root of the Huffman tree.
we’re building a single tree from a group (or forest) of trees. Initially, all the trees have a single node containing a character and the character’s weight.
Decompression
Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream (reaching a leaf node necessarily terminates the search for that particular byte value).
C implementation
1 |
|
In case of there are some non-ascii characters in the file to be compressed, In order to do a lossless decompression, I arbitrarily read the file byte by byte, though this may spoil the frequency of appearance.
1
2
3
4
5$ cat test.txt
a
Í
$ hexdump -C < test.txt > testdump && cat testdump
00000000 61 0a c3 8d 0a |a....|1
2'Í'.encode('utf8')
b'\xc3\x8d'1
2
3
4
5
6
7
8
9
10
11
int main(){
FILE * file = fopen("test.txt" , "rt");
FILE * result = fopen("result.txt" , "wt");
char c = fgetc(file);
while(c != EOF){
fputc(c , result);
c = fgetc(file);
}
return 0;
}1
2
3
4
5$ cat result.txt
a
Í
$ diff test.txt result.txt
$char
is a signed byte, this may lead to some troubles, so I convert it tounsigned char
to represent0-255
1
2
3
4
5
6
7
8
9
10
11
12
int main(){
FILE * file = fopen("test.txt" , "rt");
FILE * result = fopen("result.txt" , "wt");
char c = fgetc(file);
while(c != EOF){
fputc(c , result);
printf("signed : %d\tunsigned :%u\n" , c ,(unsigned char)c);
c = fgetc(file);
}
return 0;
}1
2
3
4
5
6$ gcc frw.c -o frw && ./frw
signed : 97 unsigned :97
signed : 10 unsigned :10
signed : -61 unsigned :195
signed : -115 unsigned :141
signed : 10 unsigned :10use bitwise operation to convert between
byte
andbitstring
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23void char2str( unsigned char number , char * bitstring){
for (int i = 7 ; i >= 0 ; i--){
bitstring[i] = (char)( (number & 0b00000001) + 48);
number >>= 1 ;
}
}
char dqueue(ByteQ * bq){
char res = 0;
for (int i = 0 ; i < 8 ; i++){
char c = bq->head[(bq->front)++ % bq->maxsize];
bq->size--;
if (c == '1'){
res = (res << 1) + 1 ;
}
else if (c == '0') {
res <<= 1;
}
else{
printf("error happens when dqueue\n");
}
}
return res;
}In case of I need to decompress a very large file, I implement a customized
queue
to mimic astream parser
The program seems to work nicely on my MacOS, but on Windows it will encounter a wired memory access error, during mallocing memory, a pointer will get a very small address, like
0x5d
, which will lead to apermission denied
later, I don’t know why.