Finding Food Entity Relationship
using User-generated Data in
Recipe Service
Youngjoo Chung
Mar 2015
Rakuten, Inc.
Motivation
§  User-generated data contains spelling variations, misspelled
words, abbreviations, synonyms, hyponyms, and hypernyms
§  Identifying and normalizing these words is essential for a search
service to retrieve recipes relevant to the user’s query
“Knowing that an aubergine is the same as an eggplant, or that
drumsticks equal chicken legs, simply gives better results.”
Andrea Cutright (Foodily), The New York Times, May 2011
Goal
§  Identify words that stand in synonymy and hyponymy relations in a
recipe service
–  Expand the thesaurus for recipe search using synonyms
•  When users search for サーモン (salmon), we also show the results for サケ (Japanese salmon).
–  Show additional results for specific ingredients using
hyponyms
•  When users search for 海老 (shrimp), we also show the results for ブラックタイガー (tiger prawn).
–  Suggest a related category based on ingredients during search
•  When users search for ブラックタイガー (tiger prawn), we show the
海老 (shrimp) category as the related category.
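A minimal sketch of how the three uses above could be wired into a search front end. The dictionaries and the function name `expand_query` are hypothetical; only the word pairs come from the examples above, and the actual service may work differently.

```python
# Hypothetical thesaurus entries built from the example pairs on this slide.
SYNONYMS = {"サーモン": {"サケ"}, "サケ": {"サーモン"}}   # salmon <-> Japanese salmon
HYPONYMS = {"海老": {"ブラックタイガー"}}                 # shrimp -> tiger prawn
HYPERNYMS = {"ブラックタイガー": "海老"}                  # tiger prawn -> shrimp category


def expand_query(term):
    """Return (set of terms to search for, related category or None)."""
    # Search for the term itself, its synonyms, and its hyponyms.
    terms = {term} | SYNONYMS.get(term, set()) | HYPONYMS.get(term, set())
    # Suggest the hypernym category when the term is a known hyponym.
    related_category = HYPERNYMS.get(term)
    return terms, related_category


for query in ("サーモン", "海老", "ブラックタイガー"):
    print(query, expand_query(query))
```

Returning the related category separately keeps the suggestion (a navigation hint) apart from the expanded search terms (which change the result set).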
Contributions
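§  Find hyponyms that cannot be found by text similarity
•  鰯 (sardine) - オイルサーディン (oil sardines)
•  海老 (shrimp) - ブラックタイガー (tiger prawn)
•  えんどう豆 (peas) - 絹さや (snow peas)
•  きのこ (mushroom) - ブナピー (Bunapi, a white beech mushroom)
•  ジャガイモ (potato) - メークイン (May Queen)
§  Find synonyms without using text similarity
•  サーモン (salmon) - サケ (Japanese salmon)
§  Find booming ingredients in a specific category
•  その他の芋 (other tubers) - 菊芋 (Jerusalem artichoke)
•  その他の魚 (other fish) - めかじき (swordfish), ごまめ (dried young anchovies)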
Related Work
§  Hyponym relation extraction
•  Most work has detected this relationship by exploiting existing food
knowledge bases such as Wikipedia [Sumida et al., LREC 2008]
•  Extracts synonym and hyponym relations using the hierarchical layout of
Wikipedia and text patterns
➜ Cannot reflect booming ingredients, ingredients that are referred to by
proper names such as brand names (e.g. メークイン (May Queen)),
or loan words
§  Synonym detection
•  Most previous research has focused on text similarity [Islam et al.,
EMNLP 2009]
•  Finds misspelled words by similarity distance or distributional similarity
➜ Cannot find all synonyms, such as loan words (e.g. “サーモン” and “サケ”)
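The limitation of text similarity can be illustrated with a small sketch. It uses the character-level ratio from Python's standard `difflib` module as a stand-in for the similarity measures in the cited work (an assumption; those papers use other measures), and compares a misspelling pair from this deck (アボガド / アボカド) with the loanword pair サーモン / サケ.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


# Misspelled "avocado" vs. the correct spelling: three of four characters match.
misspelling = char_similarity("アボガド", "アボカド")
# Loanword "salmon" vs. the native word: only the first character matches.
loanword = char_similarity("サーモン", "サケ")

print(f"misspelling pair: {misspelling:.2f}")
print(f"loanword pair:    {loanword:.2f}")
```

Any threshold high enough to accept the misspelling pair while rejecting unrelated words would also reject the loanword pair, which is why a signal other than surface similarity is needed.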
Approach
§  Calculate the relevance between ingredients and category
names based on the position in ingredient lists
–  Main ingredients are likely to be written in the first position
–  Main ingredients are strongly related to the category names
Example: asparagus category (アスパラカテゴリ)
http://recipe.rakuten.co.jp/recipe/1740005182
Position of Main Ingredient
[Figure: bar chart of the number of recipes (y-axis, 0 to 200) by position of the main ingredient in the ingredient list (x-axis: 1st, 2nd, 3rd, other)]
Among 200 randomly selected recipes, about 90% listed their main ingredient in the first position.
Data
§  Data collection period
–  June 2011 to April 2012
§  Number of recipes
–  405,519
–  Each recipe belongs to only one category
§  Number of categories
–  Layer 1: 9
–  Layer 2: 62
–  Layer 3: 790
§  Types of layer 3 category
–  Ingredient (材料名): asparagus, beef
–  Meal and dishes (料理名): hanan chi fan
–  Both (材料・料理名): pasta, curry
–  Event (イベント): Christmas, Valentine’s Day
–  Others (その他): picnic, bento
§  Number of recipes by category type (pie chart)
Category type     Recipes    Share
Ingredient        218,138    54%
Meal and Dish     153,638    38%
Both               17,276     4%
Event               5,878     2%
Condiment           5,603     1%
Others              4,986     1%
Scoring Strategy
§  Frequency
•  Category-relevant ingredients frequently appear in the category
→ Count the occurrences of each ingredient in the category
§  TF-IDF (based on category)
•  Category-relevant ingredients frequently appear in the given
category and do not appear in other categories
→ Compute the ingredient frequency and the inverse category
frequency
§  First Ingredient
•  Category-relevant ingredients appear as the first ingredient
→ Count the occurrences as the first ingredient in a given
category and normalize by all occurrences in the category
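The counts that these three strategies rely on can be collected in one pass over the recipes. The sketch below is illustrative only: the toy recipes, the category labels, and the variable names are all hypothetical, and each recipe is assumed to be an ordered ingredient list belonging to exactly one category.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (category, ordered ingredient list).
recipes = [
    ("shrimp", ["ブラックタイガー", "片栗粉", "塩"]),
    ("shrimp", ["小エビ", "塩", "片栗粉"]),
    ("shrimp", ["塩", "小エビ"]),
    ("potato", ["メークイン", "塩"]),
]

total = defaultdict(Counter)          # total[c][i]: occurrences of i in category c
first = defaultdict(Counter)          # first[c][i]: occurrences of i as 1st ingredient in c
categories_with = defaultdict(set)    # categories_with[i]: categories that contain i

for category, ingredients in recipes:
    for position, ingredient in enumerate(ingredients):
        total[category][ingredient] += 1
        categories_with[ingredient].add(category)
        if position == 0:
            first[category][ingredient] += 1

print(total["shrimp"]["塩"], first["shrimp"]["塩"], len(categories_with["塩"]))
```

Keeping the three tallies separate means each scoring strategy can be computed afterwards without re-reading the recipes.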
Scoring Strategy
Ingredient i in category c | Occ. as 1st ingredient (A) | Occ. as 2nd ingredient | Occ. as ...th ingredient | Total occ. of i in c (B) | # of categories including ingredient i (C)
小エビ (small shrimp) | 10 | 9 | 4 | 50 | 2
片栗粉 (potato starch) | 0 | 3 | 2 | 80 | 5
ブラックタイガー (tiger prawn) | 3 | 0 | 0 | 3 | 0
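§  Create the ingredient-position matrix for each category c
•  Each row represents an ingredient in category c
•  Each column represents a position of ingredients in the ingredient
lists of category c
§  Calculate the relatedness score of ingredient i and category c
(A: occurrences of i as the first ingredient in c; B: total occurrences of i in c; C: number of categories including i; Z: taken to be the total number of categories)
•  Freq(i, c) = B
•  TF-IDF(i, c) = B * log(Z/C)
•  First Ingredient(i, c) = A/B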
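As a worked example, the three scores can be computed from the counts in the example table. This is a sketch under stated assumptions: A is read as the occurrences as first ingredient, B as the total occurrences in the category, and C as the number of categories including the ingredient; the frequency score is taken as the total count B, following the description "count the occurrences of ingredients in the category"; Z = 227 (the number of ingredient-type categories used in the experiments) and the natural logarithm are assumptions, since the slide does not state either.

```python
import math

Z = 227  # assumed total number of categories (not stated on the slide)


def freq_score(b):
    """Frequency: total occurrences of the ingredient in the category."""
    return b


def tfidf_score(b, c, z=Z):
    """Ingredient frequency times inverse category frequency (natural log assumed)."""
    return b * math.log(z / c)


def first_ingredient_score(a, b):
    """Share of the ingredient's occurrences in which it is listed first."""
    return a / b


# (A, B, C) taken from the example table.
small_shrimp = (10, 50, 2)    # 小エビ
potato_starch = (0, 80, 5)    # 片栗粉

for name, (a, b, c) in (("小エビ", small_shrimp), ("片栗粉", potato_starch)):
    print(name, freq_score(b), round(tfidf_score(b, c), 1), first_ingredient_score(a, b))

# ブラックタイガー has A = 3 and B = 3; its C is listed as 0 in the table,
# for which log(Z/C) is undefined, so only the First Ingredient score is shown.
print("ブラックタイガー", first_ingredient_score(3, 3))
```

With these numbers, potato starch outranks small shrimp under both Freq and TF-IDF, while First Ingredient ranks tiger prawn and small shrimp above potato starch.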
Experiments
§  Data
–  Category: 227 ingredient-type categories
–  Ingredients: 6,998 ingredients
•  Each ingredient appeared at least 11 times in the
category
–  Ingredient-category pairs: 1,101 ingredient-category pairs were
selected as positive pairs among 12,769 pairs
§  Evaluation
–  Compare coverage and precision with morphological analysis and a
Wikipedia-based approach
–  Compare F-scores while varying the scoring strategy and threshold
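The threshold sweep in the evaluation can be sketched as follows. The scores and gold pairs below are hypothetical toy values, not the experimental data, and treating a pair as predicted when its score is greater than or equal to the threshold is an assumption.

```python
def precision_recall_f(scores, positives, threshold):
    """Evaluate thresholded scores against gold positive pairs.

    scores: dict mapping (ingredient, category) to a relatedness score.
    positives: set of (ingredient, category) pairs labeled as related.
    """
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    true_positives = len(predicted & positives)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(positives) if positives else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical scores for four ingredient-category pairs.
scores = {
    ("ブラックタイガー", "海老"): 1.0,
    ("小エビ", "海老"): 0.6,
    ("片栗粉", "海老"): 0.3,
    ("塩", "海老"): 0.1,
}
positives = {("ブラックタイガー", "海老"), ("小エビ", "海老")}

for threshold in (0.2, 0.5, 0.9):
    print(threshold, precision_recall_f(scores, positives, threshold))
```

Raising the threshold trades recall for precision, so the F-score is compared across thresholds for each scoring strategy.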
Categorization of Related Pairs
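Category | Detail | Number of pairs | Example
Synonym | Spelling variant | 330 | エビ-えび (shrimp)
Synonym | Abbreviation | 11 | 新じゃが-新じゃがいも (new potato)
Synonym | Common name | 18 | なすび-なす (eggplant)
Synonym | Misspelled | 6 | アボガド-アボカド (avocado)
Hyponym | Variety | 320 | ブラックタイガー-エビ (tiger prawn - shrimp)
Hyponym | Processed | 214 | オイルサーディン-イワシ (oil sardines - sardine)
Hyponym | Part | 202 | キャベツの葉-キャベツ (cabbage leaf - cabbage)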
§  About 30% of pairs were in the variety relationship
•  These relations are hard to detect using textual similarity
§  Part and processed ingredients frequently appear in the recipe domain
•  These relations should be considered by a search system in the
recipe domain
Experimental Result
Evaluation
[Figure: precision and recall (series FI_P, FREQ_P, TFIDF_P, FI_R, FREQ_R, TFIDF_R) plotted against the score threshold (0.1 to 0.9)]
[Figure: F-score of FI, FREQ, and TFIDF plotted against the score threshold (0.1 to 0.9)]
§  The FI approach achieved the highest F-score (0.929 at a threshold
of 0.5)
§  The FI approach achieved the highest recall
Comparison with Wikipedia-based Approach
§  Confirmed that the Wikipedia-based approach is limited for
domain-specific words
–  Structure-based approach
http://alaginrc.nict.go.jp/hyponymy/index.html
–  Hyponyms listed for the hypernym えび (shrimp):
うしえび, くまえび, くるまえび, さるえび, しばえび, しろえび, ふとみぞえび,
あめりかざりがに, うちだざりがに, たすまにあおおざりがに, たんかいざりがに
Analysis
Top-scoring ingredients under each scoring strategy (shrimp category)

Freq
Ingredient | Score
塩 (salt) | 1
片栗粉 (potato starch) | 0.824
酒 (liquor) | 0.698
水 (water) | 0.593
マヨネーズ (mayonnaise) | 0.489
小海老 (shrimp) | 0.452
ケチャ (ketchup) | 0.416

TF-IDF
Ingredient | Score
小海老 (shrimp) | 1
エビ (shrimp) | 0.512
海老 (shrimp) | 0.5
えび (shrimp) | 0.448
片栗粉 (potato starch) | 0.397
ケチャップ (ketchup) | 0.297
酒 (liquor) | 0.28

FirstIngred
Ingredient | Score
有頭海老 (head-on shrimp) | 1
ブラックタイガー (tiger prawn) | 0.972
バナメイエビ (white-leg shrimp) | 0.944
小海老 (shrimp) | 0.941
甘海老 (shrimp) | 0.909
甘エビ (shrimp) | 0.875
エビ (shrimp) | 0.87
§  The Freq approach gives high scores to frequently appearing ingredients;
mainly condiments such as salt and water obtain high scores
§  The TF-IDF approach fails to remove condiments that appear mainly in the Shrimp
category, such as ketchup and potato starch (used for ebi chili)
§  The FI approach gives high scores only to ingredients that are directly related to the
category
Conclusion and Future Work
§  Conclusion
–  The proposed approach successfully found
words in synonymy / hyponymy relations in the
recipe domain
–  Related word pairs can be used for category
suggestion
–  Related word pairs can be used as related
keywords