Document

LCM ver.2: Efficient Mining Algorithms for
Frequent/Closed/Maximal Itemsets
National Institute of Informatics, JAPAN
Takeaki Uno
Masashi Kiyomi National Institute of Informatics, JAPAN
Hiroki Arimura Hokkaido University, JAPAN
1/Nov/2004 Frequent Itemset Mining Implementations ’04
Summary
Our approach
Typical approach
FI mining
Backtracking with Hypercube
Backdecomposition (few freq. Counting) tracking
CI mining
Backtracking with PPC-extension
Apriori with
pruning
(complete enumeration)
(small memory)
MFI mining Backtracking with pruning
Apriori with
pruning
(small memory)
freq. counting Occurrence deliver
Down
project
(linear time computation)
database
array with Anytime database
Trie (FPmaintenance reduction (simple) (fast
tree)
initialization)
maximality
More database reductions
store all
Frequent Itemset Mining
FI mining
CI mining
MFI mining
Backtracking with Hypercube
decomposition (few freq. Counting)
Backtracking
Backtracking with PPC-extension (complete enumeration)(small
memory)
Backtracking with pruning (small memory)
Apriori with pruning
freq.counting Occurrence deliver
(linear time computation)
array with Anytime database
database
(simple) (fast
maintenance reduction
initialization)
maximality check
More database reductions (small memory)
Apriori with pruning
Down
project
Trie (FP-tree)
store all itemsets
• Almost all computation time is spent for frequency counting
⇒ How to reduce
• #FI to be checked
• cost of frequency counting
Hypercube Decomposition [form Ver.1]
• Reduce #FI to be checked
1. Decompose the set of all FI’s
into hypercubes, each of which is
included in an equivalence class
2. Enumerate maximal and
minimal of each hypercube
(with frequency counting)
3. Generate other FI’s between
maximal and minimal
(without frequency counting)
Efficient when support is small
Occurrence Deliver
[ver1]
• Compute the denotations of P ∪{i} for all i’s at once,
by transposing the trimmed database
• Trimmed database is composed of
- items to be added
- transactions including P
database
Trimmed
database
1
3 4 5
A
B
C
4 5
5
3
3
denotation of 1,2,3
denotation of 1,2,4
denotation of 1,2,5
A B C
B C
A
A B
itemset: 1,2
denotation: A,B,C
2
linear time in the size of
trimmed database
Efficient for sparse datasets
Loss of Occurrence Deliver
[new]
• Avoiding frequency counting of infrequent itemset P∪{e} has
been considered to be important
• However, the computation time for such itemsets is 1/3 of all
computation cost on average, in our experiments
(if we sort items by their frequency (size of tuple list))
P∪
3
4
5
6
7
8
9
θ
AD
ELM
ABCEFGH JKLN
ABDEFGI JKLMSTW
BEGILT
MTW
ABCDFGH IKLMNST
Occurrence deliver has an
advantage of its simple structure
Anytime Database Reduction
[new]
• Database reduction: Reduce the database, by [fp-growth, etc]
◆ Remove item e, if e is included in less than θ transactions
or included in all transactions
◆ merge identical transactions into one
• Anytime database reduction: Recursively apply trimming and
this reduction, in the recursion
 database size becomes small in lower levels of the recursion
In the recursion tree, lower level iterations are exponentially many
rather than upper level iterations.  very efficient
Example of Anytime D. R.
[new]
trim  anytime database reduction  trim  anytime database
reduction….
i
j
Array(reduced) vs. Trie (FP-tree)
[new]
initialization is fast (LCM O(||T||) : Trie O(|T|log|T| + ||T||) )
• Trie can compress the trimmed database [fp-growth, etc]
• By experiments for FIMI instances, we compute the average
compression ratio by Trie for trimmed database over all iterations
• #items(cells) in Tries  1/2 average, 1/6 minimum (dense case)
• If Trie is constructed by a binary tree, it needs at least 3 pointers
for each item.
memory use (computation time)  twice, minimum 2/3
Results
Closed Itemset Mining
FI mining
CI mining
MFI mining
freq.counting
database maintenance
Backtracking with Hypercube decomposition (few freq. Counting)
Backtracking
Backtracking with PPC-extension
(complete enumeration)(small
memory)
Apriori with
pruning
Backtracking with pruning (small memory)
Apriori with pruning
Down project
Trie (FP-tree)
Occurrence deliver (linear time computation)
array with Anytime database reduction (simple) (fast initialization)
Maximality More database reductions
store all
check
itemsets
(small memory)
• avoid (prune) non-closed itemsets?
(existing pruning is not complete)
• quickly operate closure?
• How to
• save memory use?
(existing approach uses much memory)
Prefix Preserving Closure Extension
[ver1]
• Prefix preserving closure extension (PPC-extension) is
a variation of closure extension
Def. closure tail of a closed itemset P
⇔ the minimum j s.t. closure (P ∩ {1,…,j}) = P
H = closure(P∪{i}) (closure extension of P)
is a PPC-extension of P
⇔ i > closure tail and H ∩{1,…,i-1} = P ∩{1,…,i-1}
Def.
“Any” closed itemset H is generated from another “unique” closed
itemset by PPC-extension (i.e., from closure(H ∩{1,…,i-1}) )
 no duplication occurs by depth-first search
Example of ppc-extension
[ver1]
φ
• closure extension  acyclic
• ppc extension  tree
{2}
{7,9}
1,2,5,6,7,9
2,3,4,5
T = 1,2,7,8,9
1,7,9
2,7,9
2
closure extension
ppc extension
{1,7,9}
{2,5}
{1,2,7,9}
{2,7,9}
{2,3,4,5}
{1,2,7,8,9}
{1,2,5,6,7,9}
Results
Maximal Frequent Itemset Mining
FI mining
Backtracking with Hypercube decomposition (few freq. Counting)
Backtracking
CI mining
Backtracking with PPC-extension (complete enumeration)(small memory)
Apriori with pruning
Backtracking with pruning
(small memory)
Apriori with
pruning
Occurrence deliver (linear time computation)
array with Anytime database reduction (simple) (fast initialization)
Down project
Trie (FP-tree)
More database reductions
(small memory)
store all
itemsets
MFI mining
freq.counting
database maintenance
maximality
check
• How to
• avoid (prune) non-maximal imteset?
• check maximality quickly?
• save memory? (existing maximality
check and pruning use much memory)
Backtracking-based Pruning
• During backtracking algorithm for FI,
: current itemset
1 2 3
: a MFI including K
• re-sort items s.t.
items of H locate end
• Then, new MFI NEVER be found in
recursive calls w.r.t. items in H
 omit such recursive calls
4
5
6
[new]
7
8
9
10
re-sort
6
8
10
rec. call
We can avoid so many non-MFI’s
4
5
7
9
no rec. call
Fast Maximality Check (CI,MFI)
[new]
• To reduce the computation cost for maximality check,
closedness check, we use more database reduction
• At anytime database reduction, we keep
◆ the intersection of merged transactions, for closure operation
◆ the sum of merged transactions as a weighted transaction database,
for maximality check
• Closure is the intersection of transactions
• Frequency of one more larger itemsets are
sum of transactions in the trimmed database
 By using these reduced databases, computation time becomes short
(no more than frequency counting)
Results
Experiments
CPU, memory, OS: AMD Athron XP 1600+, 224MB, Linux
Compared with: FP-growth, afopt, Mafia, Patriciamine, kDCI
(All these marked high scores at competition FIMI03)
13 datasets of FIMI repository
Result
• Fast at large supports for all instances of FI, CI, MFI
• Fast for all instances for CI (except for Accidents)
• Fast for all sparse datasets of FI, CI, MFI
• Slow only for accidents, T40I10D100K of FI, MFI, and
pumsbstar of MFI
Summary of Results
FI
large
supports
CI
MFI
sparse(7)
LCM
middle(5)
dense(1)
small
supports
FI
CI
MFI
sparse(7)
LCM
LCM
LCM
middle(5)
Both
LCM
Both
dense(1)
Others
LCM
Others
results
Conclusion
• When equivalence classes are large, PPC-extension and
Hypercube decomposition works well
• Anytime database reduction and Occurrence deliver have
advantages on initialization, sparse cases and simplicity compared to
Trie and Down project
• Backtracking-based pruning saves memory usage
• More database reduction works well as much as memory storage
approaches
Future Work
• LCM is weak at MFI mining and dense datasets
• More efficient Pruning for MFI
• Some new data structures for dense cases
• Fast radix sort for anytime database reduction
• IO optimization ?????
List of Datasets
Real datasets
・ BMS-WebVeiw-1
・ BMS-WebVeiw-2
・ BMS-POS
・ Retail
・ Kosarak
・ Accidents
Machine learning benchmark
・ Chess
・ Mushroom
・ Pumsb
・ Pumsb*
・ Connect
Aartificial datasets
・ T10I4D100K
・ T40I10D100K