Automatic Language Identification: A Syntactic Approach
Mahesh Soundalgekar
October 1, 2015
CFILT, IIT Bombay
The Road Map
• Introduction
• System Architecture
• Classification Approaches
• Experimental Results
• Summary and Future Work
Introduction
• Goal: Efficiently crawl Web pages in a given language (Marathi in our case)
• Different languages use the same Devanagari script
 ➢ E.g. Marathi, Sanskrit and Hindi
• The language of a page must therefore be distinguished accurately from the others
• We take a syntactic approach to this problem; with about 2 MB of training data and 10 MB of test data, it reached up to 100% accuracy
System Architecture
HTML Documents in different encodings such as Xdvng, DV-TTYogesh
→ HTML to ASCII
→ Plain Text + Font Information
→ Appropriate Encoding Converter
→ Plain Text in ISCII Encoding
→ Classifier
→ Classification Results
Classification Approaches
• Most Frequently Occurring Common Words
 ➢ E.g. English: the, an, is, at, a, etc.
• N-Grams (Most Frequent Character Sequences)
 ➢ Bi-grams: th, ’s, re, en
 ➢ Tri-grams: the, ing, ion
 ➢ Quad-grams: tion, as in classification, association, gratification, etc.
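As a rough illustration of the N-gram idea, the following Python sketch counts the most frequent character sequences inside the words of a text. The function name `top_ngrams` and the sample sentence are assumptions made for this example and are not taken from the original system.

```python
from collections import Counter


def top_ngrams(text, n, k=5):
    """Return the k most frequent character n-grams found inside the words of text."""
    counts = Counter()
    for word in text.lower().split():
        # Slide a window of width n over each word.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(k)]


if __name__ == "__main__":
    # Made-up sample sentence, only to exercise the function.
    sample = "the classification of the nation and the association"
    print(top_ngrams(sample, 2))
    print(top_ngrams(sample, 4))
```

On realistic amounts of English text, sequences such as "th" or "tion" would be expected near the top of such a list.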
Important Factors
• Size of the Training Data: Important to capture the syntactic essence of a language
• Domains of Training Data: Usage varies from domain to domain and from author to author
• Size of the Test Data: Small test data may not contain enough information for classification
• Linguistic Knowledge: Required for the common words approach
Classifier Architecture
Training Samples → Generate Profiles → Category Profiles
Test Document → Generate Profile → Document Profile
Category Profiles + Document Profile → Measure Profile Distances → Find Minimum Distance → Identify Category
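The flow above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original implementation: the function names, the profile size of 300, the use of N-grams of length 1 to 3, and the choice of the category profile length as the penalty for a missing N-gram are all hypothetical.

```python
from collections import Counter


def build_profile(text, n_max=3, size=300):
    """Rank the most frequent padded character n-grams (lengths 1..n_max) of a text."""
    counts = Counter()
    for word in text.split():
        for n in range(1, n_max + 1):
            padded = "_" + word + "_" * (n - 1)
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def profile_distance(category, document):
    """Sum of rank differences; n-grams missing from the category cost the maximum."""
    rank = {gram: i for i, gram in enumerate(category)}
    max_value = len(category)  # assumed penalty for a missing n-gram
    return sum(abs(rank[g] - i) if g in rank else max_value
               for i, g in enumerate(document))


def classify(document_text, training_samples):
    """Return the category whose profile is closest to the document profile."""
    category_profiles = {name: build_profile(text)
                         for name, text in training_samples.items()}
    document_profile = build_profile(document_text)
    return min(category_profiles,
               key=lambda name: profile_distance(category_profiles[name],
                                                 document_profile))


if __name__ == "__main__":
    # Made-up toy samples; real training data would be far larger.
    samples = {"english": "the cat and the hat", "french": "le chat et le rat"}
    print(classify("the rat", samples))
```

In practice the category profiles would be generated once during training and stored, so that only the document profile has to be built at classification time.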
Common Words Approach
• A list of selected common words is compiled for each language
• The lists are matched against the test document
• The closest match gives the language of the document
• Advantages:
 ➢ Intuitive
 ➢ Computationally Efficient
 ➢ Space Efficient
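The matching step might look like the sketch below, which counts how many tokens of a document occur in each language's list and picks the language with the most hits. The English list comes from the earlier slide; the French list, the function names, and the whitespace tokenization are assumptions made only so that the example has two categories to compare.

```python
COMMON_WORDS = {
    # English list taken from the earlier slide; the French list is a made-up
    # stand-in so that the example has two categories to choose from.
    "english": {"the", "an", "is", "at", "a"},
    "french": {"le", "la", "et", "est", "un"},
}


def common_word_hits(text, words):
    """Count how many whitespace-separated tokens of text appear in words."""
    return sum(1 for token in text.lower().split() if token in words)


def identify_by_common_words(text, word_lists=COMMON_WORDS):
    """Return the language whose common-word list matches the most tokens."""
    return max(word_lists,
               key=lambda lang: common_word_hits(text, word_lists[lang]))


if __name__ == "__main__":
    print(identify_by_common_words("The cat is at a door"))
```

Each document needs only one pass over its tokens and each language needs only a short word list, which is where the efficiency advantages listed above come from.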
Top 5 Marathi Common Words
• व (and)
• आणि (and)
• आहे (is)
• या (this, these)
• ते (that, they)
N-Grams Approach
• JAVA
 ➢ Bi-grams: _J, JA, AV, VA, A_
 ➢ Tri-grams: _JA, JAV, AVA, VA_, A__
 ➢ Quad-grams: _JAV, JAVA, AVA_, VA__, A___
• मदत
 ➢ Bi-grams: _म, मद, दत, त_
 ➢ Tri-grams: _मद, मदत, दत_, t__
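The JAVA example suggests that each word is padded with one underscore in front and N-1 underscores behind before a window of width N is slid over it. A minimal Python sketch of that reading follows; the function name `word_ngrams` is an assumption for illustration.

```python
def word_ngrams(word, n, pad="_"):
    """Return the padded character n-grams of a word: one leading pad, n-1 trailing pads."""
    padded = pad + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(word_ngrams("JAVA", 2))  # ['_J', 'JA', 'AV', 'VA', 'A_']
print(word_ngrams("JAVA", 3))  # ['_JA', 'JAV', 'AVA', 'VA_', 'A__']
print(word_ngrams("JAVA", 4))  # ['_JAV', 'JAVA', 'AVA_', 'VA__', 'A___']
```

The window here moves over individual characters of a Python string, which works for a Devanagari word only when every letter is a single character with no attached vowel signs.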
Measuring Distances
Out_of_Place()
Both profiles are sorted in descending order.

Category profile   Test profile   Out-of-place value
A                  AR             max_value
ER                 AND            2
ING                ER             1
AND                ED             max_value
ON                 ON             0

Distance = 3 + 2 * max_value
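The out-of-place computation can be sketched in Python as below: each N-gram of the test profile contributes the difference between its rank in the two profiles, and an N-gram absent from the category profile contributes max_value. The function name and the default for `max_value` (the length of the category profile) are assumptions; the slide leaves max_value symbolic.

```python
def out_of_place(category_profile, test_profile, max_value=None):
    """Sum, over the test profile, of each n-gram's rank difference to the category profile."""
    if max_value is None:
        # Assumption: a missing n-gram costs as much as the category profile is long.
        max_value = len(category_profile)
    rank = {gram: i for i, gram in enumerate(category_profile)}
    total = 0
    for i, gram in enumerate(test_profile):
        total += abs(rank[gram] - i) if gram in rank else max_value
    return total


category = ["A", "ER", "ING", "AND", "ON"]
test = ["AR", "AND", "ER", "ED", "ON"]
print(out_of_place(category, test, max_value=100))  # 3 + 2 * 100 = 203
```

With the default penalty of 5 the same profiles give 3 + 2 * 5 = 13.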
Extensions to N-Grams Method
• Lowest Granularity
 ➢ आदित्य = अ + ा + ि + द + त + य
• Letter Granularity
 ➢ आदित्य = आ + दि + त + य
• Conjunct Granularity
 ➢ आदित्य = आ + दि + त्य
Experimental Training Setup

Language   Total size of pages (KB)   No. of pages   Average page size (KB)
Marathi    700                        46             15.2
Hindi      600                        24             25
Sanskrit   560                        19             29.5
Category Profiles Generated through Training

Language   No. of handpicked   No. of N-Grams in   No. of N-Grams in   No. of N-Grams in
           Common Words        Atomic Approach     Letter Approach     Conjunct Approach
Marathi    25                  37633               63596               63580
Hindi      25                  15450               26886               26865
Sanskrit   21                  24119               45380               49368
Classification Results

Language   Common Words   Letter Approach   Atomic Approach   Conjunct Approach
Marathi    91%            95%               100%              100%
Hindi      93%            80%               92%               93%
Sanskrit   86%            50%               100%              100%
Summary and Future Work
• Good results have been obtained through syntactic classification
• The common words technique is computationally the most efficient, but less accurate
• Our extensions to N-Grams give the desired accuracy
• The N-Grams technique is robust to syntax errors
• The N-Grams technique does not require linguistic knowledge
• We will use language identification techniques to identify a good starting set of pages for crawling for a general-purpose search engine