Novel Concise Representations of High Utility Itemsets using Generator Patterns Philippe Fournier-Viger1 Cheng Wei Wu2 Vincent S. Tseng2 1University 2National of Moncton, Canada Cheng Kung University 1 Frequent Itemset Mining Discovering sets of items frequently occurring in transactions. A transaction database Frequent itemsets Transaction items T1 T2 {a, b, c, d, e} {a, b, e} T3 T4 {c, d, e} {a, b, d, e} minsup = 50 % Itemset Support {e} 100 % {d, e} 75 % {b, d, e} 50 % … … Limitations: • an item can only appear once in a transaction! • all items have the same unit profit/weight Thus, will ignore rare itemset having high profit ! (e.g. caviar, wine) 2 High Utility Itemset Mining • A generalization of FIM: – items can appear more than once in a transaction – items have a unit profit (weight) • Several applications: – click-stream analysis, – cross-marketing in retail stores, – bio-medical applications… 3 High Utility Itemset Mining Input: transaction database with quantities TID T0 T1 T2 T3 T5 items a(1), b(5), c(1), d(3), (e,1) b(4), c(3), d(3), e(1) a(1), c(1), d(1) a(2), c(6), e(2) b(2), c(2), e(1) a threshold minutil unit profit table item unit profit a b c d e 5$ 2$ 1$ 2$ 3$ Output: high-utility itemsets (HUIs), i.e. itemsets having a utility ≥ minutil 4 A full example Transation database Unit profit table TID items item unit profit T0 T1 T2 a(1), b(5), c(1), d(3), (e,1) a 5$ b(4), c(3), d(3), e(1) b 2$ a(1), c(1), d(1) c 1$ T3 T5 a(2), c(6), e(2) d 2$ e 3$ b(2), c(2), e(1) High utility itemsets Suppose that minutil = 25 $ {a,c} : 28$ {a,b,c,d,e}: 25 $ {b,c,d}: 34 $ {b,c,e} : 37 $ {b,d,e} : 36 $ {c, e}: 27$ {a,c,e}: 31 $ {b,c} : 28 $ {b,c,d,e}: 40 $ {b,d} : 30 $ {b,e} : 31 $ 5 How to calculate utility? TID T0 T1 T2 T3 T5 items item unit profit a(1), b(5), c(1), d(3), (e,1) a b c d e b(4), c(3), d(3), e(1) a(1), c(1), d(1) a(2), c(6), e(2) 5$ 2$ 1$ 2$ 3$ b(2), c(2), e(1) The utility of an itemset is the sum of the utility of items (profit x quantity) in transactions where the itemset appears. u({a,e} = (1×5$+ 1×3$) + (2×5$ +2×3$) = 24 $ A difficult task! • In high-utility-itemset mining: utility is not antimonotonic. • Example: u({a}) = 20 $ u({a,e}) = 24 $ u({a,b,c}) = 16 $ • In frequent itemset mining: the anti-monotonicity of the support is used to prune the search space. • Therefore, algorithms for FIM cannot be directly applied to HUIM. How to solve this problem? • Several algorithms: – Two-Phase (2005), – IHUP (2010), – UP-Growth (2011), – HUI-Miner (2012), – FHM (2014) • Key idea: use an anti-monotonic measure that is an upper-bound on the utility of itemsets (e.g. TWU measure) 8 Problem • Current algorithms generates a huge amount of HUIs – not convenient for the user – algorithm may run out of storage space – algorithm may fail to terminate • A solution: mining concise representations of HUIs: – closed HUI: a high utility itemset which is not included in another itemset having the same support – maximal HUI: a high utility itemset which is not included in another HUIs. 9 Problem (cont’d) • No work combines « generator patterns » and HUI mining • In FIM, a generator is an itemset such that no proper subset has the same support. • in FIM, generators provide several benefits: – can be combined with closed patterns to generate non redundant association rules, – may provide higher classification accuracy, – preferable according to MDL principle, – may be more efficient than discovering all patterns How can we combine generators and the concept of HUIs? 10 Hasse diagram of equivalence classes Observation 1: • Some equivalence classes do not contain HUIs (not show) • We should not mine generators in these equivalence classes 11 Hasse diagram of equivalence classes Observation 2: • We name GHUIs, generators in equivalence classes containing at least 1 HUI • GHUIs may or may not be HUIs. 12 LUGs: those that are not HUI HUGs : those that are HUI Some interesting properties Property 5: In each equiv. class, if an itemset Y is a proper superset of an itemset X, then u(Y) > u(X) Property 6: In each equivalence class, the closure has the highest utility Property 7: In each equivalence class, supersets of a HUGs are HUIs 13 Our goals 1) Design an algorithm to mine only HUGs 2) Design an algorithms to mine all GHUIs (a much harder problem) 3) Evaluate characteristics of these representations 14 FHM (2014) FHM is to our knowledge the fastest algorithm for HUI mining – Create a vertical structure named Utility-List for each item. – Find larger itemsets with depth-first search by appending items one at a time. ∅ {a}. {a, b}. {b}. {a, c}. {a, d}. {b, c}. {a, c, d}. {a, b, c}. {a, b, d}. {a, b, c, d}. …. …. …. … … FHM (2014) – Create a vertical structure named Utility-List for each item. Utility list of {a} Utility list of {e} Utility list of {a, e} TID util rutil TID util rutil TID util rutil T1 5 3 T2 6 5 T2 16 5 T2 10 17 T3 3 5 T3 8 5 T3 5 25 T4 3 0 u({a}) = 20 join u({e}) = 12 u({a,e}) = 24 16 – The exact utility of an itemset is obtained by joining utilitylists of smaller itemsets (no need to scan database). – Pruning using remaining utility in utility lists HUG-Miner • Utilize the FHM search procedure to explore the search space of HUIs. – Property 4: all subsets of a generator must be generators [11] • How to check if a HUI is a generator? – Integrate the DEFME generator checking mechanism (PAKDD 2014) – Advantages: can determine if any itemset is a generator on-the-fly without maintaining candidates or comparing with other itemsets 17 GHUI-Miner (1) • Finding GHUIs is much more difficult because we cannot just explore the search space of HUIs. • How to determine if a generator appear in an equivalence class containing a HUI? – We just need to check if its closure is a HUI. – But computing the closure on-the-fly by intersection is too expensive – Actually, state-of-the-art algorithms for mining both generators and closed itemsets mines them separately and link them by post-processing (Szathmary, 2014). Not an option for HUI mining! 18 GHUI-Miner (2) Main idea • First, apply a closed HUI mining algorithm (e.g. CHUD) and keep closed HUIs in a hash table. • Second, mine GHUIs using a modified FHM algorithm – only explore itemsets that are generators (by using DEFME; Property 4) – only explore itemsets that are subsets of a closed HUI – for each generator, • if it is a HUI, then output as HUG • if it is not a HUI, then retrieve the closure to determine if it is a LUG 19 GHUI-Miner (3) There are also several optimizations. See the paper for more details! 20 Experimental Evaluation Four datasets Dataset transaction count Mushroom 88,162 Kosarak 990,000 Retail 88,162 Foodmart 4,141 distinct item count 16,470 41,270 16,470 1,559 avg. transaction length 23 8.09 10.30 4.4 • Foodmart: real utility values • Other datasets: Unit profit between 1 and 1000 and quantities between 1 and 5 (normal distribution) • GHUI-Miner vs HUG-Miner vs CHUD vs FHM • Java, Windows 7, 5 GB of RAM 21 Execution time Mushroom Retail • Mining HUGs is faster than mining all GHUIs because LUGs may have very low support, • Mining closed HUIs is the main cost for mining GHUIs in future work, we could replace CHUD with a faster algorithm 22 Execution time (cont’d) Kosarak Foodmart • Mining HUGs is up to 100 times faster than FHM and CHUD. • Mining GHUIs is faster than mining all HUIs on Mushroom (46 times faster) and Foodmart. • mushroom is a dense dataset with few generator • in general, concise representations work best on dense datasets 23 Pattern count • HUGs and GHUIs are compact (up to 46 times smaller than the set of all HUIs) • HUG-Miner vs GHUI-Miner: • HUG-Miner misses some generators but it is faster • So there is a trade-off. 24 Conclusion • We formalized the properties of generators and equivalence classes in HUI mining. • We proposed two new concice representations of HUIs: • GHUIs : Generators of High Utility Itemsets • HUGs: High-Utility Generators • We presented two algorithms: GHUI-Miner and HUG-Miner • Results up to 36 times more compact than all HUIs can be up to 100 times faster than mining all HUIs • Perspective: generator and closure found by GHUI-Miner could be used to generate association rules Open source Java data mining software, > 80 algorithms http://www.phillippe-fournier-viger.com/spmf/ 25 Thank you. Questions? Open source Java data mining software, >80 algorithms http://www.phillippe-fournier-viger.com/spmf/ 26 This work has been funded by an NSERC grant from the government of Canada.
© Copyright 2024 ExpyDoc