Coding Trees in the Deflate format - LMU, Informatik, TCS

Coding Trees in the Deflate format
Christoph-Simon Senjak
Lehr- und Forschungseinheit f¨
ur Theoretische Informatik
Institut f¨
ur Informatik
Ludwig-Maximilians-Universit¨
at M¨
unchen
Oettingenstr.67, 80538 M¨
unchen
Agda Implementor’s Meeting 2014
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
1 / 14
Outline
Deflate is probably the most widespread compression format
We use it as case study for low-level verification
Deflates way to store encoding trees (huffman trees) is a lot
more complicated than expected
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
2 / 14
Deflate - Basics
Specified in RFC 1951. (RFC 1952 specifies the GZIP file
format, and RFC 1950 specifies the ZLIB file format. Both are
often confused with Deflate).
Can make use of Huffman-Codings and (extended) Run-length
encoding.
Does not require any specific compression algorithm, even
though some are recommended.
Widely used: HTTP, ZIP/RAR, PNG, SSH
$ apt−c a c h e r d e p e n d s z l i b 1 g | wc − l
1418
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
3 / 14
Deflate - Overview
An informal illustration of the format:
Deflate
::= (’0’ Block)* ’1’ Block (0|1)*
Block
::= ’00’ UncompressedBlock |
’01’ DynamicallyCompBl |
’10’ StaticallyCompBl
UncompressedBlock ::= length ~length bytes
StaticallyCompBl
::= CompBl(standard coding)
DynamicallyCompBl ::= header coding CompBl(coding)
CompBl(c)
::= [^256]* 256
We are interested in the “coding” part.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
4 / 14
Deflate Codings
1. Must be prefix-free, that is, no code prefixes another code, so
they form a tree.
2. The shorter codes must lexicographically precede longer codes.
(RFC 1951, 3.2.2)
3. All codes of a given bit length have lexicographically consecutive
values, in the same order as the symbols they represent. (RFC
1951, 3.2.2)
4. If there is a lexicographically smaller code c of the same length
for some code D(a) in the coding, then it is prefixed by some
image of the coding D:
∀a,c .c D(a) → len c = len D(a) → ∃b .D(b) = [ ] ∧ D(b) c
Point 4 is not explicitly stated in RFC 1951, but the algorithms and
examples it gives imply it.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
5 / 14
Deflate Trees in Agda
Deflate codings can be viewed as coding trees. A first
formalization in Agda:
data DeflateTree {alpha : N} : (range : Subset alpha)
→ (shortHeight : N) → (shortBranchChar : Fin alpha)
→ (longHeight : N) → (longBranchChar : Fin alpha)
→ (notNonFork : Bool) → Set where
leaf : (b : Fin alpha) → DeflateTree ⁅ b ⁆ 0 b 0 b true
forkEq : {ld rd md : N} → {f g : Subset alpha} →
{lc mc1 mc2 rc : Fin alpha} → {tl : Bool}
→ DeflateTree f ld lc md mc1 true → DeflateTree g md mc2 rd rc tl
→ (disjoint : Empty (f ∩ g))
→ (toN mc1 ) < (toN mc2 )
→ DeflateTree (f ∪ g) ld lc rd rc tl
forkNeq : {ld md1 md2 rd : N} → {f g : Subset alpha} → {lc mc1 mc2 rc : Fin alpha}
→ {tl : Bool} → DeflateTree f ld lc md1 mc1 true
→ DeflateTree g md2 mc2 rd rc tl
→ (disjoint : Empty (f ∩ g))
→ md1 < md2 → DeflateTree (f ∪ g) ld lc rd rc tl
nonFork : {b : Bool} → {ld rd : N} → {f : Subset alpha} → {lc rc : Fin alpha}
→ DeflateTree f ld lc rd rc b → DeflateTree f (suc ld) lc (suc rd) rc false
⇒ Complicated. Equivalence to the standard is not obvious.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
6 / 14
Problems with Agda
A lot of small combinatorical facts have to be proved.
Often it is not easy to modularize the proofs.
⇒ Very large proof terms.
Lack of usable libraries, but writing libraries is not the goal of
this project.
Especially no complete library for rational numbers (or is there
one now?).
Compilation is slow and the large proofs sometimes crashed
Agda.
⇒ Switched to coq ; ;
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
7 / 14
Formalization in Coq
Record deflateCoding : Set :=
m kD ef la t eC od in g {
M : nat ;
C : VecLB M ;
prefix_free : forall a b , a <> b ->
(( Vnth C a ) = nil ) + ~ ( prefix ( Vnth C a ) ( Vnth C b )) ;
length_lex : forall a b , ll ( Vnth C a ) <= ll ( Vnth C b ) ->
lex ( Vnth C a ) ( Vnth C b ) ;
char_enc : forall a b , ll ( Vnth C a ) = ll ( Vnth C b ) ->
( f_le a b ) -> lex ( Vnth C a ) ( Vnth C b ) ;
dense : forall a c , c <> nil -> lex c ( Vnth C a ) ->
ll c = ll ( Vnth C a ) ->
exists b , (~ ( Vnth C b ) = nil ) /\ ( prefix ( Vnth C b ) c )
}.
⇒ much closer to the standard.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
8 / 14
Deflate Coding Sequences
A deflate sequence is a sequence (ai )i of code-lengts or 0 that
satisfies Kraft’s inequality
2−ai ≤ 1.
i,ci =0
Main Theorem: The type of lists that satisfy Kraft’s inequality
is isomorphic to the type of Deflate codings.
⇒ Saving a sequence of lengths is not expensive, and that is the
way it is done in Deflate, and is sufficient, as this theorem shows.
As everything is decidable, we can split the proof into a
“uniqueness” and an “existence” part.
We made a detailled informal proof and now formalize it in Coq.
The “uniqueness” part is finished:
Lemma uniqueness :
forall ( D E : deflateCoding ) ( eq : M D = M E ) ,
( Vmap ( ll ( A := bool )) ( vec_id eq ( C D ))) =
( Vmap ( ll ( A := bool )) ( C E )) ->
coding_eq D E .
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
9 / 14
Outline of the Uniqueness-Proof
We may do a proof by contradiction.
So assume we had two distinct deflate codings D and E . Then
there exist n, m such that D(n) = min {D(x) | D(x) = E (x)}
and E (m) = min {E (x) | D(x) = E (x)}
len D(n) > len E (m) implies D(m) D(n), but D(n) was
chosen minimal. Symmetric for len D(n) < len E (m). So
len D(n) = len E (m). Also, E (m) = [], since otherwise
D(m) = E (m) by the length condition. Same for D(n).
By totality of , we know that D(n) E (m) ∨ E (m) D(n).
Both cases are symmetric. Assume D(n) E (m).
By 4, we know that some b exists, such that E (b) ≺ D(n),
therefore E (b) E (m), and thus either b = m or E (b) = D(b).
b = m would imply D(m) = E (m). E (b) = D(b) would imply
D(b) ≺ D(n) which contradicts 1.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
10 / 14
Uniqueness in Coq
About 400 loc, excluding lemmata about lexicographical orders
and prefixes.
Does not yet exploit the symmetric cases.
Advantage (?) of Coq vs Agda: It is much easier to “hack”
proofs.
Things I missed in Coq:
Putting placeholders into terms to fill them out later. In Coq, a
similar functionality is given by the refine tactic, but it is not
the same.
It is harder to see the algorithms behind proofs in Coq code. It
is possible to view the proof terms, though.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
11 / 14
Existence
Here we need to care more about efficiency.
We use a recursive algorithm with a lot of parameters that
depend on each other. Even harder to modularize.
The Algorithm will be similar to the algorithm in RFC 1951, but
not the same, as this algorithm makes extensive use of stateful
operations.
We basically do a recursion on the list of pairs (char,length) of
the given length-function, sorted firstly according to length and
secondly according to char.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
12 / 14
Problems
The distinction between computational and non-computational
definitions: inl, or introl, left. We chose to not use
non-computational definitions wherever possible, at least for
now, probably sacrificing efficiency.
Substitutions with the rewrite tactic do not go into type
parameters (especially for Fin.t which we often use). It is
possible to circumvent this. However, we chose to make
extensive use of dependent induction, which uses JMeq.
Hidden parameters can make it hard to understand why
substitutions are not possible, but the option Set Printing
All can be confusing.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
13 / 14
Conclusion
The algorithm as such can be implemented very efficiently.
Verification is, however, very complicated.
⇒ Even more useful to have a verified reference implementation.
Christoph-Simon Senjak (LMU M¨
unchen)
Deflate Trees
2014-05-26
14 / 14