Information Theory

exercise in the previous class
Write a computer program to construct a Huffman code...
typical application of the data structure “priority queue”
a collection of data items, each associated with a “priority”
enqueue: add an item to the collection
dequeue: retrieve the item with the highest priority
[figure: a priority queue holding the values 0.1, 0.2, 0.3, 0.4, with enqueue and dequeue operations]
tree = data, smaller probability = higher priority
dequeue two trees, join them, and enqueue the result
at the final stage, give labels to the unified single tree
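A minimal sketch of this construction in Python (not the required answer to the exercise; the function name huffman_code and the use of the standard heapq module as the priority queue are my own choices):

```python
import heapq

def huffman_code(probs):
    """probs: dict symbol -> probability.  Returns dict symbol -> codeword."""
    # each queue entry is (probability, counter, tree); the counter only breaks ties
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, t0 = heapq.heappop(heap)        # dequeue the two trees with the
        p1, _, t1 = heapq.heappop(heap)        # smallest probabilities
        heapq.heappush(heap, (p0 + p1, counter, (t0, t1)))   # join them, enqueue the result
        counter += 1
    code = {}
    def walk(tree, word):                      # final stage: label the single remaining tree
        if isinstance(tree, tuple):
            walk(tree[0], word + "0")
            walk(tree[1], word + "1")
        else:
            code[tree] = word or "0"
    walk(heap[0][2], "")
    return code

print(huffman_code({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}))
```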
review of the previous class
Shannon’s source coding theorem
the average codeword length is H(S) or more
the length can be reduced to less than H(S) + ε for any ε > 0
“block” version of Huffman codes
encode by block, and we have more compact coding
there are various approaches to define block patterns
there is an additional remark concerning the block coding
handling the “tail” of the data
block coding cannot be applied if...
the message is too short
we have leftover symbols
the last run is not terminated
no clever solution; we get around the problems by operational devices:
use different coding for the final (leftover) block
use padding, etc.
today’s class
some more source coding schemes
coding for “simple” realization
arithmetic codes
coding for the unknown information sources
Lempel-Ziv algorithms
coding which sacrifices the uniquely decodable property
non-reversible, or lossy, coding
arithmetic code
part I
coding for “simple” realization
realization issue of coding
coding ≈ translation between symbols and codewords
We need a large translation table to obtain good performance.
static approach...
construct the table beforehand, and keep it in the system
the translation is made by a simple table-lookup
large memory is needed → not good for realization
dynamic approach...
the table is not constructed explicitly
the table-lookup is replaced by on-the-fly computation
arithmetic code
arithmetic code...well-known example of the dynamic approach
the idea is sketched by an example:
p(A) = p, p(B) = 1 – p, the block size is n
in numerical examples, we use p = 0.7 and n = 3
We have 2^3 = 8 block patterns w0, ..., w7, listed in dictionary order.
P(wi): the probability that wi occurs
S(wi) = Σ_{j=0}^{i−1} P(wj) = S(w_{i−1}) + P(w_{i−1})   (sum of the probabilities before wi)

 i     |   0     1     2     3     4     5     6     7
 wi    |  AAA   AAB   ABA   ABB   BAA   BAB   BBA   BBB
 P(wi) | 0.343 0.147 0.147 0.063 0.147 0.063 0.063 0.027
 S(wi) | 0.000 0.343 0.490 0.637 0.700 0.847 0.910 0.973
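A small Python sketch (names are mine) that reproduces the P(wi) and S(wi) rows of the table above:

```python
from itertools import product

def block_table(p=0.7, n=3):
    """List (w, P(w), S(w)) for all length-n blocks over {A, B}, in dictionary order."""
    table, s = [], 0.0
    for w in product("AB", repeat=n):            # AAA, AAB, ..., BBB
        word = "".join(w)
        prob = 1.0
        for c in word:                           # P(w) = product of the symbol probabilities
            prob *= p if c == "A" else 1 - p
        table.append((word, prob, s))            # S(w) = sum of the probabilities before w
        s += prob
    return table

for word, prob, start in block_table():
    print(word, round(prob, 3), round(start, 3))
```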
illustration of probabilities
the eight patterns partition the interval [0, 1]:
[figure: the intervals of AAA, AAB, ..., BBB placed side by side, with widths P(wi); e.g., S(BAA) = S(ABB) + P(ABB) = 0.637 + 0.063 = 0.700]
wi occupies the interval Ii = [S(wi), S(wi) + P(wi))
basic idea:
encode wi as a certain value x ∈ Ii
problem to solve:
need a translation between wi and x
about the translation
two directions of the translation:
[encode] the translation from wi to x
[decode] the translation from x to wi
...use recursive computation instead of a static table

[figure: binary tree of prefixes; each node w carries P(w) and S(w) and branches to wA and wB with P(wA), S(wA) and P(wB), S(wB); the leaves AAA, AAB, ..., BBB carry the block probabilities 0.343, 0.147, ..., 0.027]
[encode] the translation from wi to x
recursively determine P( ) and S( ) for prefixes of wi
P(ε) = 1, S(ε) = 0
for a pattern wA, P(wA) = P(w)p, S(wA) = S(w)
for a pattern wB, P(wB) = P(w)(1 – p), S(wB) = S(w) + P(w)p
example: the interval of “ABB”?  (p = P(A) = 0.7)
P(ε) = 1, S(ε) = 0
P(A) = 0.7, S(A) = 0
P(AB) = 0.21, S(AB) = 0.49
P(ABB) = 0.063, S(ABB) = 0.637
the interval of “ABB” is
[0.637, 0.637 + 0.063)
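A minimal Python sketch of this recursion (the function name interval is mine, not from the lecture):

```python
def interval(pattern, p=0.7):
    """Return (S(w), P(w)) for a pattern over {A, B} using the recursion above."""
    start, prob = 0.0, 1.0              # S(ε) = 0, P(ε) = 1
    for c in pattern:
        if c == "A":                    # wA: S(wA) = S(w),          P(wA) = P(w)·p
            prob = prob * p
        else:                           # wB: S(wB) = S(w) + P(w)·p, P(wB) = P(w)·(1 − p)
            start = start + prob * p
            prob = prob * (1 - p)
    return start, prob

s, pr = interval("ABB")
print(round(s, 3), round(s + pr, 3))    # the interval of "ABB": 0.637 0.7
```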
[encode] the translation from wi to x (cont’d)
Given a pattern wi, now we know the interval Ii,
but which x ∈ Ii should be chosen?
we want x ∈ Ii whose binary representation is the shortest.
choose x as S(wi) + P(wi), truncated after the first ⌈−log2 P(wi)⌉ binary places (the digits below that place are dropped)

  S(wi)   = 0.aa...aaa...a
+ P(wi)   = 0.00...0bb...b     (the most significant non-zero bit of P(wi) is at place −log2 P(wi))
  S(wi+1) = 0.aa...acc...c  →  truncate to 0.aa...ac0...0,  so  x = 0.aa...ac

the length of x ≈ −log2 P(wi)
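A rough floating-point sketch of reading off the codeword this way (the helper name codeword is mine; practical arithmetic coders use integer arithmetic precisely to avoid the precision issue mentioned later):

```python
import math

def codeword(start, prob):
    """Codeword of a block with interval [start, start + prob):
    take x = start + prob and keep the first ceil(-log2(prob)) binary places."""
    length = math.ceil(-math.log2(prob))
    x = start + prob
    bits = ""
    for _ in range(length):             # extract 'length' binary digits after the point
        x *= 2
        bit = int(x)
        bits += str(bit)
        x -= bit
    return bits

print(codeword(0.637, 0.063))   # "ABB" -> "1011"; 0.1011 in binary is 0.6875 ∈ [0.637, 0.700)
```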
[decode] the translation from x to wi
goal: given x, determine the leaf node whose interval contains x
almost the same as the first half of the encoding translation
compute, compare, and move to the left or right
example: decode x = 0.600  (p = 0.7)
P(ε) = 1, S(ε) = 0; the threshold S(B) = 0.7 exceeds 0.600 → move to A
P(A) = 0.7, S(A) = 0; the threshold S(AB) = 0.49 does not exceed 0.600 → move to AB
P(AB) = 0.21, S(AB) = 0.49; the threshold S(ABB) = 0.637 exceeds 0.600 → move to ABA
P(ABA) = 0.147, S(ABA) = 0.49
0.600 is contained in the interval of ABA...decoding completed
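A sketch of this decoding walk in Python (again, the function name is mine):

```python
def decode(x, n=3, p=0.7):
    """Find the length-n pattern over {A, B} whose interval contains x."""
    start, prob, word = 0.0, 1.0, ""
    for _ in range(n):
        threshold = start + prob * p    # boundary between the wA and wB sub-intervals
        if x < threshold:
            word += "A"
            prob = prob * p             # keep S(w), shrink to the A part
        else:
            word += "B"
            start = threshold           # skip over the A part
            prob = prob * (1 - p)
    return word

print(decode(0.600))                    # -> "ABA"
```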
performance, summary
an n-symbol pattern w with probability P(w)
→ encoded to a sequence of length ⌈−log2 P(w)⌉
the average codeword length per symbol is
(1/n) Σ_{w∈V^n} P(w) ⌈−log2 P(w)⌉ ≈ (1/n) Σ_{w∈V^n} (−P(w) log2 P(w)) = H(S)
almost optimum coding without using a translation table
however...
we need much computation with good precision
(→ use approximation?)
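A quick numeric check of the average-length formula for the running example (p = 0.7, n = 3), assuming each block is coded with ⌈−log2 P(w)⌉ bits:

```python
import math
from itertools import product

p, n = 0.7, 3
avg = 0.0
for w in product("AB", repeat=n):
    prob = math.prod(p if c == "A" else 1 - p for c in w)
    avg += prob * math.ceil(-math.log2(prob))          # expected bits per block
entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(avg / n, entropy)   # ≈ 0.976 bits/symbol vs. H(S) ≈ 0.881; the gap (due to the ceiling) shrinks as n grows
```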
Lempel-Ziv algorithms
part II
coding for the unknown information sources
probability in advance?
so far, we assumed that the probabilities of symbols are known...
in the real world...
the symbol probabilities are often not known in advance
scan the data twice?
first scan...count the number of symbol occurrences
second scan...Huffman coding
delay of the encoding operation?
overhead to transmit the translation table?
Lempel-Ziv algorithms
for information sources whose symbol probabilities are not known...
LZ77
lha, gzip, zip, zoo, etc.
LZ78
compress, arc, stuffit, etc.
LZW
GIF, TIFF, etc.
work fine for any information source → universal coding
LZ77
proposed by A. Lempel and J. Ziv in 1977
represent a data substring by using
a substring which has occurred previously
algorithm overview
process the data from the beginning
partition the data into blocks in a dynamic manner
represent a block by a three-tuple (i, l, x)
“rewind i symbols, copy l symbols, and append x”
encoding example of LZ77
consider encoding ABCBCDBDCBCD

symbols  | history                              | codeword
A        | first time                           | (0, 0, A)
B        | first time                           | (0, 0, B)
C        | first time                           | (0, 0, C)
B C D    | B, C = (here) − 2;  D ≠ (here) − 2   | (2, 2, D)
B D      | B = (here) − 3;  D ≠ (here) − 3      | (3, 1, D)
C B C D  | C, B, C, D = (here) − 6              | (6, 4, *)
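A rough greedy encoder for this toy LZ77 variant in Python (my own sketch: the window is unbounded, (0, 0, x) is used for first-time symbols, and None plays the role of the * marking the end of the data; real implementations bound the window and the copy length):

```python
def lz77_encode(data):
    """Produce (i, l, x) tuples: rewind i symbols, copy l symbols, append x."""
    out, pos = [], 0
    while pos < len(data):
        best_i, best_l = 0, 0
        for i in range(1, pos + 1):                    # try every rewind distance
            l = 0
            while pos + l < len(data) and data[pos + l - i] == data[pos + l]:
                l += 1                                 # the copy may run past the current position
            if l > best_l:
                best_i, best_l = i, l
        x = data[pos + best_l] if pos + best_l < len(data) else None
        out.append((best_i, best_l, x))
        pos += best_l + 1
    return out

print(lz77_encode("ABCBCDBDCBCD"))
# [(0, 0, 'A'), (0, 0, 'B'), (0, 0, 'C'), (2, 2, 'D'), (3, 1, 'D'), (6, 4, None)]
```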
decoding example of LZ77
decode (0, 0, A), (0, 0, B), (0, 0, C), (2, 2, D), (3, 1, D), (6, 4, *)
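A matching decoder sketch (None again stands for the * above):

```python
def lz77_decode(codewords):
    """Decode (i, l, x) tuples: rewind i symbols, copy l symbols, append x."""
    out = []
    for i, l, x in codewords:
        start = len(out) - i
        for k in range(l):                  # copy one symbol at a time, so overlaps work
            out.append(out[start + k])
        if x is not None:
            out.append(x)
    return "".join(out)

print(lz77_decode([(0, 0, 'A'), (0, 0, 'B'), (0, 0, 'C'),
                   (2, 2, 'D'), (3, 1, 'D'), (6, 4, None)]))   # -> ABCBCDBDCBCD
```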
possible problem:
large block is good, because we can copy more symbols
large block is bad, because a codeword contains a large integer
... the dilemma degrades the performance.
LZ78
proposed by A. Lempel and J. Ziv in 1978
represent a block by a two-tuple (b, x)
“copy the b-th block before, and append x”
encoding example of LZ78
consider encoding ABCBCBCDBCDE

block # | symbols  | history                   | codeword
1       | A        | first time                | (0, A)
2       | B        | first time                | (0, B)
3       | C        | first time                | (0, C)
4       | B C      | B = (here) − 2 blocks     | (2, C)
5       | B C D    | B C = (here) − 1 block    | (1, D)
6       | B C D E  | B C D = (here) − 1 block  | (1, E)
decoding example of LZ78
decode (0, A), (0, B), (0, C), (2, C), (1, D), (1, E)
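A small decoder sketch for these two-tuples (function name mine):

```python
def lz78_decode(codewords):
    """Decode (b, x) tuples: copy the block that appeared b blocks before, append x."""
    blocks = []
    for b, x in codewords:
        block = blocks[-b] if b > 0 else ""    # b = 0 means: no copy, just the new symbol
        blocks.append(block + x)
    return "".join(blocks)

print(lz78_decode([(0, 'A'), (0, 'B'), (0, 'C'),
                   (2, 'C'), (1, 'D'), (1, 'E')]))   # -> ABCBCBCDBCDE
```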
advantage over LZ77:
large block is good, because we can copy more symbols
is there anything wrong with large blocks?
→ the performance is slightly better than LZ77
summary of LZ algorithms
in LZ algorithms, the translation table is constructed adaptively
information sources with unknown symbol probabilities
information sources with memory
LZW: good material to learn about intellectual property
UNISYS, CompuServe, GIF format, ...
non-reversible, or lossy, coding
part III
coding which sacrifices the uniquely decodable property
uniquely decodable? really needed?
We have considered codes which are uniquely decodable.
“encode, decode, and we have the original data”
reversible code, lossless code
Sometimes, the uniquely decodable property is too much.
image, sound, etc...OK if similar to the original
non-reversible code, lossy code
more compactness
individual coding schemes are not given in this lecture...
Shannon’s theorem, reconsidered
Shannon’s source coding theorem:
1. for any code, the average codeword length ≥ H(X)
2. a code with average codeword length < H(X) + ε is constructible
[figure: X → encoder → Y; the receiver tries to recover X from Y]
encoder = communication channel
input X = symbol(s), output Y = codeword(s)
reversible: the knowledge of Y uniquely determines X
entropies: the “AFTER entropy” is 0 → the mutual information I(X; Y) = H(X)
“A codeword is a container of the information with size I(X; Y).”
... A codeword cannot be smaller than H(X).
non-reversible case
non-reversible code:
the knowledge of Y does not eliminate all ambiguity on X
“AFTER entropy” > 0, mutual information I(X; Y) < H(X).
“A codeword is a container of the information with size I(X; Y).”
it can happen that the average codeword length < H (X)
“a code beyond Shannon’s theorem” may exist.
→ the theory is more complicated, cf. rate-distortion theorem
summary of today’s talk
source coding other than Huffman codes
coding for “simple” realization
arithmetic codes
coding for the unknown information sources
Lempel-Ziv algorithms
coding which sacrifices the uniquely decodable property
non-reversible, or lossy, coding
exercise
Encode ABACABADACABAD by the LZ77 algorithm, and decode
its result.
Survey what has happened concerning the patent of LZW.