Finding Food Entity Relationship using User-generated Data in Recipe Service
Youngjoo Chung
Mar 2015
Rakuten, Inc.

Motivation
§ User-generated data contains spelling variations, misspelled words, abbreviations, synonyms, hyponyms, and hypernyms
§ Identifying and normalizing these words is essential for a search service to retrieve recipes relevant to the user's query
“Knowing that an aubergine is the same as an eggplant, or that drumsticks equal chicken legs, simply gives better results.”
Andrea Cutright (Foodily), NYTimes, May 2011

Goal
§ Identify words in synonymy and hyponymy relations in a recipe service
– Expand the thesaurus for recipe search using synonyms
• When users search for サーモン (salmon), we also show the results for サケ (Japanese salmon).
– Show additional results for specific ingredients using hyponyms
• When users search for 海老 (shrimp), we also show the results for ブラックタイガー (tiger prawn).
– Suggest related categories based on ingredients during search
• When users search for ブラックタイガー (tiger prawn), we show the 海老 (shrimp) category as a related category.

Contributions
§ Find hyponyms which cannot be found by text similarity
• 鰯 (sardine) - オイルサーディン (oil sardines)
• 海老 (shrimp) - ブラックタイガー (tiger prawn)
• えんどう豆 (peas) - 絹さや (snow peas)
• きのこ (mushroom) - ブナピー (Bunapi, a white shimeji mushroom)
• ジャガイモ (potato) - メークイン (May Queen)
§ Find synonyms without using text similarity
• サーモン - サケ (both salmon)
§ Find booming ingredients in a specific category
• その他の芋 (other tubers) - 菊芋 (Jerusalem artichoke)
• その他の魚 (other fish) - めかじき (swordfish), ごまめ (dried young anchovies)

Related Work
§ Hyponym relation extraction
• Most work has detected this relationship by exploiting existing knowledge bases such as Wikipedia [Sumida et al., LREC 2008]
• Extract synonym and hyponym relations using the hierarchical layout of Wikipedia and text patterns
➜ Cannot reflect booming ingredients, ingredients referred to by proper names such as brand names (e.g. メークイン (May Queen)), or loan words
§ Synonym detection
• Most previous research has focused on text similarity [Islam et al., EMNLP 2009]
• Find misspelled words by similarity distance or distributional similarity
➜ Cannot find all synonyms, such as loan words (e.g. “サーモン” and “サケ”)

Approach
§ Calculate the relevance between ingredients and category names based on the position in ingredient lists
– Main ingredients are likely to be written in the first position
– Main ingredients are strongly related to the category names
Example recipe in the アスパラ (asparagus) category: http://recipe.rakuten.co.jp/recipe/1740005182

Position of Main Ingredient
[Figure: bar chart of the number of recipes (y-axis, 0 to 200) by position of the main ingredient (x-axis: 1st, 2nd, 3rd, other)]
Among 200 randomly selected recipes, about 90% listed their main ingredient in the first position.

Data
§ Data collection period
• June 2011 to April 2012
§ Number of recipes
• 405,519
• Each recipe belongs to only one category
§ Number of categories
• Layer 1: 9
• Layer 2: 62
• Layer 3: 790
§ Type of level 3 category
• Ingredient (材料名): asparagus, beef
• Meal and dish (料理名): hanan chi fan
• Both (材料・料理名): pasta, curry
• Event (イベント): Christmas, Valentine's Day
• Others (その他): picnic, bento
§ Recipes per category type (pie chart)
Category type | Recipes | Share
Ingredient | 218,138 | 54%
Meal and Dish | 153,638 | 38%
Both | 17,276 | 4%
Event | 5,878 | 2%
Condiment | 5,603 | 1%
Others | 4,986 | 1%

Scoring Strategy
§ Frequency
• Category-relevant ingredients appear frequently in the category
→ Count the occurrences of the ingredient in the category
§ TF-IDF (based on category)
• Category-relevant ingredients appear frequently in the given category and do not appear in other categories
→ Compute the ingredient frequency and the inverse category frequency
§ First Ingredient
• Category-relevant ingredients appear as the first ingredient
→ Count the occurrences as the first ingredient in a given category and normalize by all occurrences in the category

Scoring Strategy
Ingredient i in category c | A: Occr. as 1st ingredient | Occr. as 2nd ingredient | Occr. as ..th ingredient | B: Total occr. of i in c | C: # of categories including ingredient i
小エビ | 10 | 9 | 4 | 50 | 2
片栗粉 | 0 | 3 | 2 | 80 | 5
ブラックタイガー | 3 | 0 | 0 | 3 | 0
§ Create the ingredient-position matrix for each category c
• Each row represents an ingredient in the category c
• Each column represents a position in the ingredient lists of the category c
§ Calculate the relatedness score of ingredient i and category c
• Frequency: Freq(i, c) = B
• TF-IDF: B * log(Z/C), where Z denotes the total number of categories
• First Ingredient: A/B

Experiments
§ Data
– Categories: 227 ingredient-type categories
– Ingredients: 6,998 ingredients
• Each ingredient appeared at least 11 times in its category
– Ingredient-category pairs: 1,101 pairs were selected as positive pairs among 12,769 pairs
§ Evaluation
– Compare coverage and precision with morphological analysis and a Wikipedia-based approach
– Compare F-scores by changing the scoring strategy and the threshold

Categorization of Related Pairs
Category | Detail | Number of pairs | Example
Synonym | Spelling variant | 330 | エビ-えび
Synonym | Abbreviation | 11 | 新じゃが-新じゃがいも
Synonym | Common name | 18 | なすび-なす
Synonym | Misspelled | 6 | アボガド-アボカド
Hyponym | Variety | 320 | ブラックタイガー-エビ
Hyponym | Processed | 214 | オイルサーディン-イワシ
Hyponym | Part | 202 | キャベツの葉-キャベツ
§ About 30% of the pairs (320 of 1,101) were in a variety relationship
• These relations are hard to detect using textual similarity
§ Part and processed ingredients appear frequently in the recipe domain
• These relations should be considered by a search system in the recipe domain

Experimental Result
[Figure: two panels plotted against the score threshold (0.1 to 0.9). Left panel, "Precision and Recall": series FI_P, FREQ_P, TFIDF_P, FI_R, FREQ_R, TFIDF_R. Right panel, "F-score": series FI, FREQ, TFIDF.]
§ The FI approach achieved the highest F-score (0.929 at a threshold of 0.5)
§ The FI approach achieved the highest recall

Comparison with Wikipedia-based Approach
§ Confirmed that the Wikipedia-based approach has a limitation for domain-specific words
– Structure-based approach: http://alaginrc.nict.go.jp/hyponymy/index.html
– Hyponyms listed for えび (shrimp): うしえび, くまえび, くるまえび, さるえび, しばえび, しろえび, ふとみぞえび, あめりかざりがに, うちだざりがに, たすまにあおおざりがに, たんかいざりがに

Analysis
Freq ingredient | Score | TF-IDF ingredient | Score | FirstIngred ingredient | Score
塩 (salt) | 1 | 小海老 (shrimp) | 1 | 有頭海老 (head-on shrimp) | 1
片栗粉 (potato starch) | 0.824 | エビ (shrimp) | 0.512 | ブラックタイガー (tiger prawn) | 0.972
酒 (liquor) | 0.698 | 海老 (shrimp) | 0.5 | バナメイエビ (white-leg shrimp) | 0.944
水 (water) | 0.593 | えび (shrimp) | 0.448 | 小海老 (shrimp) | 0.941
マヨネーズ (mayonnaise) | 0.489 | 小海老 (shrimp) | 0.397 | 甘海老 (shrimp) | 0.909
ケチャップ (ketchup) | 0.452 | 片栗粉 (potato starch) | 0.297 | 甘エビ (shrimp) | 0.875
ケチャ (ketchup) | 0.416 | 酒 (liquor) | 0.28 | エビ (shrimp) | 0.87
§ The Freq approach gives high scores to frequently appearing ingredients; mainly condiments such as salt and water obtain high scores
§ The TF-IDF approach fails to remove condiments that appear mainly in the shrimp category, such as ketchup and potato starch (for Ebi-chili)
§ The FI approach gives high scores only to ingredients that are directly related to the category

Conclusion and Future Work
§ Conclusion
– The proposed approach successfully found words in synonymy / hyponymy relations in the recipe domain
– Related word pairs can be used for category suggestion
– Related word pairs can be used as related keywords
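
Worked sketch of the three scores

The three scores from the Scoring Strategy slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function name `score_ingredients`, the input layout (category name mapped to ordered ingredient lists), and the toy recipes are all hypothetical. A is taken as the count of first-position occurrences and B as the total occurrences of the ingredient in the category (the reading under which First Ingredient = A/B is a ratio between 0 and 1), C as the number of categories containing the ingredient, and Z is assumed to be the total number of categories. The frequency score is returned as the total count B, following the description "count the occurrence of ingredients in the category".

```python
import math
from collections import Counter, defaultdict


def score_ingredients(recipes_by_category):
    """Return {category: {ingredient: {"freq", "tfidf", "first"}}}.

    recipes_by_category maps a category name to a list of recipes,
    each recipe being an ordered list of ingredient names
    (hypothetical input layout, chosen for this sketch).
    """
    first = defaultdict(Counter)  # A: occurrences as the first ingredient
    total = defaultdict(Counter)  # B: total occurrences in the category
    for category, recipes in recipes_by_category.items():
        for ingredients in recipes:
            if not ingredients:
                continue
            first[category][ingredients[0]] += 1
            total[category].update(ingredients)

    # C: number of categories in which each ingredient appears
    categories_with = Counter()
    for counts in total.values():
        categories_with.update(counts.keys())

    z = len(recipes_by_category)  # Z: assumed total number of categories
    scores = {}
    for category, counts in total.items():
        scores[category] = {}
        for ingredient, b in counts.items():
            a = first[category][ingredient]
            c = categories_with[ingredient]  # always >= 1 here
            scores[category][ingredient] = {
                "freq": b,                      # total occurrences (B)
                "tfidf": b * math.log(z / c),   # B * log(Z / C)
                "first": a / b,                 # A / B
            }
    return scores


# Toy data, invented for illustration only.
toy_recipes = {
    "shrimp": [
        ["black tiger", "salt", "starch"],
        ["black tiger", "salt"],
        ["small shrimp", "starch", "salt"],
    ],
    "asparagus": [
        ["asparagus", "salt"],
        ["asparagus", "bacon", "salt"],
    ],
}
scores = score_ingredients(toy_recipes)
```

In the toy data, salt has the highest frequency in the shrimp category but a First Ingredient score of zero, while "black tiger" is always listed first and scores 1.0, which mirrors the behaviour described on the Analysis slide. The scores reported on that slide appear to be scaled to a maximum of 1 for Freq and TF-IDF; the sketch does not attempt that scaling.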