ppt - Philippe Fournier

Novel Concise Representations of High Utility
Itemsets using Generator Patterns
Philippe Fournier-Viger1
Cheng Wei Wu2
Vincent S. Tseng2
1University
2National
of Moncton, Canada
Cheng Kung University
1
Frequent Itemset Mining
Discovering sets of items frequently occurring in
transactions.
A transaction database
Frequent itemsets
Transaction items
T1
T2
{a, b, c, d, e}
{a, b, e}
T3
T4
{c, d, e}
{a, b, d, e}
minsup = 50 %
Itemset
Support
{e}
100 %
{d, e}
75 %
{b, d, e}
50 %
…
…
Limitations:
• an item can only appear once in a transaction!
• all items have the same unit profit/weight
Thus, will ignore rare itemset having high profit ! (e.g. caviar, wine)
2
High Utility Itemset Mining
• A generalization of FIM:
– items can appear more than once in a transaction
– items have a unit profit (weight)
• Several applications:
– click-stream analysis,
– cross-marketing in retail stores,
– bio-medical applications…
3
High Utility Itemset Mining
Input: transaction database with quantities
TID
T0
T1
T2
T3
T5
items
a(1), b(5), c(1), d(3), (e,1)
b(4), c(3), d(3), e(1)
a(1), c(1), d(1)
a(2), c(6), e(2)
b(2), c(2), e(1)
a threshold minutil
unit profit table
item unit profit
a
b
c
d
e
5$
2$
1$
2$
3$
Output: high-utility itemsets (HUIs),
i.e. itemsets having a utility ≥ minutil
4
A full example
Transation database
Unit profit table
TID
items
item
unit profit
T0
T1
T2
a(1), b(5), c(1), d(3), (e,1)
a
5$
b(4), c(3), d(3), e(1)
b
2$
a(1), c(1), d(1)
c
1$
T3
T5
a(2), c(6), e(2)
d
2$
e
3$
b(2), c(2), e(1)
High utility itemsets
Suppose that
minutil = 25 $
{a,c} : 28$
{a,b,c,d,e}: 25 $
{b,c,d}: 34 $
{b,c,e} : 37 $
{b,d,e} : 36 $
{c, e}: 27$
{a,c,e}: 31 $
{b,c} : 28 $
{b,c,d,e}: 40 $
{b,d} : 30 $
{b,e} : 31 $
5
How to calculate utility?
TID
T0
T1
T2
T3
T5
items
item unit profit
a(1), b(5), c(1), d(3), (e,1)
a
b
c
d
e
b(4), c(3), d(3), e(1)
a(1), c(1), d(1)
a(2), c(6), e(2)
5$
2$
1$
2$
3$
b(2), c(2), e(1)
The utility of an itemset is the sum of the
utility of items (profit x quantity) in
transactions where the itemset appears.
u({a,e} = (1×5$+ 1×3$) + (2×5$ +2×3$) = 24 $
A difficult task!
• In high-utility-itemset mining: utility is not antimonotonic.
• Example:
u({a}) = 20 $
u({a,e}) = 24 $
u({a,b,c}) = 16 $
• In frequent itemset mining: the anti-monotonicity
of the support is used to prune the search space.
• Therefore, algorithms for FIM cannot be directly
applied to HUIM.
How to solve this problem?
• Several algorithms:
– Two-Phase (2005),
– IHUP (2010),
– UP-Growth (2011),
– HUI-Miner (2012),
– FHM (2014)
• Key idea: use an anti-monotonic
measure that is an upper-bound on the
utility of itemsets (e.g. TWU measure)
8
Problem
• Current algorithms generates a huge amount of
HUIs
– not convenient for the user
– algorithm may run out of storage space
– algorithm may fail to terminate
• A solution: mining concise representations of HUIs:
– closed HUI: a high utility itemset which is not included
in another itemset having the same support
– maximal HUI: a high utility itemset which is not
included in another HUIs.
9
Problem (cont’d)
• No work combines « generator patterns » and HUI
mining
• In FIM, a generator is an itemset such that no
proper subset has the same support.
• in FIM, generators provide several benefits:
– can be combined with closed patterns to generate non
redundant association rules,
– may provide higher classification accuracy,
– preferable according to MDL principle,
– may be more efficient than discovering all patterns
 How can we combine generators and the concept of HUIs?
10
Hasse diagram of equivalence classes
Observation 1:
• Some equivalence classes do not contain HUIs (not show)
• We should not mine generators in these equivalence classes
11
Hasse diagram of equivalence classes
Observation 2:
• We name GHUIs, generators in equivalence classes containing at least 1 HUI
• GHUIs may or may not be HUIs.
12
LUGs: those that are not HUI
HUGs : those that are HUI
Some interesting properties
Property 5: In each equiv. class, if an itemset Y is a proper superset of an
itemset X, then u(Y) > u(X)
Property 6: In each equivalence class, the closure has the highest utility
Property 7: In each equivalence class, supersets of a HUGs are HUIs
13
Our goals
1) Design an algorithm to mine only HUGs
2) Design an algorithms to mine all GHUIs (a much harder problem)
3) Evaluate characteristics of these representations
14
FHM (2014)
FHM is to our knowledge the fastest algorithm for HUI mining
– Create a vertical structure named Utility-List for each item.
– Find larger itemsets with depth-first search by appending
items one at a time.
∅
{a}.
{a, b}.
{b}.
{a, c}.
{a, d}.
{b, c}.
{a, c, d}.
{a, b, c}.
{a, b, d}.
{a, b, c, d}.
….
….
….
…
…
FHM (2014)
– Create a vertical structure named Utility-List for each item.
Utility list of {a}
Utility list of {e}
Utility list of {a, e}
TID
util
rutil
TID
util
rutil
TID
util
rutil
T1
5
3
T2
6
5
T2
16
5
T2
10
17
T3
3
5
T3
8
5
T3
5
25
T4
3
0
u({a}) = 20
join
u({e}) = 12
u({a,e}) = 24
16
– The exact utility of an itemset is obtained by joining utilitylists of smaller itemsets (no need to scan database).
– Pruning using remaining utility in utility lists
HUG-Miner
• Utilize the FHM search procedure to explore
the search space of HUIs.
– Property 4: all subsets of a generator must be
generators [11]
• How to check if a HUI is a generator?
– Integrate the DEFME generator checking
mechanism (PAKDD 2014)
– Advantages: can determine if any itemset is a
generator on-the-fly without maintaining
candidates or comparing with other itemsets
17
GHUI-Miner (1)
• Finding GHUIs is much more difficult because we
cannot just explore the search space of HUIs.
• How to determine if a generator appear in an
equivalence class containing a HUI?
– We just need to check if its closure is a HUI.
– But computing the closure on-the-fly by intersection is
too expensive
– Actually, state-of-the-art algorithms for mining both
generators and closed itemsets mines them separately
and link them by post-processing (Szathmary, 2014).
Not an option for HUI mining!
18
GHUI-Miner (2)
Main idea
• First, apply a closed HUI mining algorithm (e.g.
CHUD) and keep closed HUIs in a hash table.
• Second, mine GHUIs using a modified FHM algorithm
– only explore itemsets that are generators (by using
DEFME; Property 4)
– only explore itemsets that are subsets of a closed HUI
– for each generator,
• if it is a HUI, then output as HUG
• if it is not a HUI, then retrieve the closure to
determine if it is a LUG
19
GHUI-Miner (3)
There are also several optimizations.
See the paper for more details!
20
Experimental Evaluation
Four datasets
Dataset
transaction
count
Mushroom 88,162
Kosarak
990,000
Retail
88,162
Foodmart 4,141
distinct item
count
16,470
41,270
16,470
1,559
avg. transaction
length
23
8.09
10.30
4.4
• Foodmart: real utility values
• Other datasets: Unit profit between 1 and 1000 and
quantities between 1 and 5 (normal distribution)
• GHUI-Miner vs HUG-Miner vs CHUD vs FHM
• Java, Windows 7, 5 GB of RAM
21
Execution time
Mushroom
Retail
• Mining HUGs is faster than mining all GHUIs
because LUGs may have very low support,
• Mining closed HUIs is the main cost for mining
GHUIs  in future work, we could replace CHUD
with a faster algorithm
22
Execution time (cont’d)
Kosarak
Foodmart
• Mining HUGs is up to 100 times faster than FHM and CHUD.
• Mining GHUIs is faster than mining all HUIs on Mushroom (46
times faster) and Foodmart.
• mushroom is a dense dataset with few generator
• in general, concise representations work best on dense
datasets
23
Pattern count
• HUGs and GHUIs are compact (up to 46 times
smaller than the set of all HUIs)
• HUG-Miner vs GHUI-Miner:
• HUG-Miner misses some generators but it is
faster
• So there is a trade-off.
24
Conclusion
• We formalized the properties of generators and
equivalence classes in HUI mining.
• We proposed two new concice representations of HUIs:
• GHUIs : Generators of High Utility Itemsets
• HUGs: High-Utility Generators
• We presented two algorithms: GHUI-Miner and HUG-Miner
• Results
 up to 36 times more compact than all HUIs
 can be up to 100 times faster than mining all HUIs
• Perspective: generator and closure found by GHUI-Miner
could be used to generate association rules
Open source Java data mining software, > 80 algorithms
http://www.phillippe-fournier-viger.com/spmf/
25
Thank you. Questions?
Open source Java data mining software, >80 algorithms
http://www.phillippe-fournier-viger.com/spmf/
26
This work has been funded by an NSERC grant from the government of Canada.