Automatic Language Identification – A Syntactic Approach
Mahesh Soundalgekar
October 1, 2015
CFILT, IIT Bombay

The Road Map
• Introduction
• System Architecture
• Classification Approaches
• Experimental Results
• Summary and Future Work

Introduction
• Goal: Efficiently crawl Web pages in a given language; Marathi in our case
• Different languages use the same Devanagari script, e.g. Marathi, Sanskrit and Hindi
• Necessity to accurately distinguish one language from others
• We take a syntactic approach to solve this problem, which has given us excellent results on training data of 2 MB with test data of 10 MB

System Architecture
HTML Documents in different encodings such as Xdvng, DV-TTYogesh
-> HTML to ASCII
-> Plain Text + Font Information
-> Appropriate Encoding Converter
-> Plain Text in ISCII Encoding
-> Classifier
-> Classification Results

Classification Approaches
• Most Frequently Occurring Common Words
  e.g. English: the, an, is, at, a, etc.
• N-Grams (Most Frequent Character Sequences)
  Bi-grams: th, ’s, re, en
  Tri-grams: the, ing, ion
  Quad-grams: tion, as in classification, association, gratification, etc.
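The n-gram idea can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the function names `char_ngrams` and `build_profile`, the `top_k` cut-off and the tie-breaking rule are assumptions made for the example. The underscore padding (one leading marker, n-1 trailing markers) follows the JAVA example on the N-Grams Approach slide.

```python
from collections import Counter


def char_ngrams(word, n, pad="_"):
    """Return the character n-grams of one word.

    Padding convention (taken from the JAVA example in the slides):
    one leading pad character and n-1 trailing pad characters.
    """
    padded = pad + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(text, sizes=(2, 3, 4), top_k=300):
    """Rank the n-grams of `text`, most frequent first.

    `top_k` is a hypothetical cut-off; the slides do not state
    how many n-grams are kept per profile.
    """
    counts = Counter()
    for word in text.split():
        for n in sizes:
            counts.update(char_ngrams(word, n))
    # Descending frequency; ties broken alphabetically so the
    # ranking is deterministic (an assumption of this sketch).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]
```

Calling `char_ngrams("JAVA", 2)` yields `_J, JA, AV, VA, A_`. The sketch slices Python strings per code point, which is closest in spirit to the lowest granularity; the letter and conjunct granularities would need a tokenizer that groups characters first.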
Important Factors
• Size of the Training Data – Important to capture the syntactic essence of a language
• Domains of Training Data – Usages vary from domain to domain, author to author
• Size of the Test Data – Small test data may not contain enough information for classification
• Requirement of linguistic knowledge for common words approach

Classifier Architecture
Training Samples -> Generate Profiles -> Category Profiles
Test Document -> Generate Profile -> Document Profile
Category Profiles + Document Profile -> Measure Profile Distances -> Find Minimum Distance -> Identify Category

Common Words Approach
• List of selected common words
• Matched with the test documents
• Closest match will give the language of the document
• Advantages:
  Intuitive
  Computationally Efficient
  Space Efficient

Top 5 Marathi Common Words
• व
• आणि
• आहे
• या
• ते

N-Grams Approach
• JAVA
  Bi-grams: _J, JA, AV, VA, A_
  Tri-grams: _JA, JAV, AVA, VA_, A__
  Quad-grams: _JAV, JAVA, AVA_, VA__, A___
• मदत
  Bi-grams: _म, मद, दत, त_
  Tri-grams: _मद, मदत, दत_, त__

Measuring Distances
Out_of_Place()

Category profile       Test profile           Out-of-place value
(descending order)     (descending order)
A                      AR                     max_value
ER                     AND                    2
ING                    ER                     1
AND                    ED                     max_value
ON                     ON                     0

Distance = 3 + 2 * max_value

Extensions to N-Grams Method
• Lowest Granularity: आदित्य = अ + ा + ि + द + त + य
• Letter Granularity: आदित्य = आ + दि + त + य
• Conjunct Granularity: आदित्य = आ + दि + त्य

Experimental Training Setup

Language    Total size of pages (KB)    No. of pages    Average page size (KB)
Marathi     700                         46              15.2
Hindi       600                         24              25
Sanskrit    560                         19              29.5

Category Profiles Generated through Training

Language    Handpicked common words    N-grams (Atomic)    N-grams (Letter)    N-grams (Conjunct)
Marathi     25                         37633               63596               63580
Hindi       25                         15450               26886               26865
Sanskrit    21                         24119               45380               49368

Classification Results

Language    Common Words    Atomic Approach    Letter Approach    Conjunct Approach
Marathi     91%             95%                100%               100%
Hindi       93%             80%                92%                93%
Sanskrit    86%             50%                100%               100%

Summary and Future Work
• Good results have been obtained through syntactic classification
• The common words technique is computationally the most efficient, but less accurate
• Our extensions to N-Grams give the desired accuracy
• The N-grams technique is robust to syntax errors
• The N-grams technique does not require linguistic knowledge
• We will use language identification techniques to identify a good starting set of pages for crawling for a general-purpose search engine
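
The out-of-place measure from the Measuring Distances slide, together with the minimum-distance step of the classifier architecture, can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the names `out_of_place` and `classify` are hypothetical, and because the slides leave `max_value` undefined, the sketch defaults it to the length of the longer profile.

```python
def out_of_place(category_profile, test_profile, max_value=None):
    """Sum of rank displacements between two ranked n-gram lists.

    Each n-gram of the test profile contributes the absolute
    difference between its rank in the two profiles, or
    `max_value` when it is missing from the category profile.
    """
    if max_value is None:
        # Assumption: the slides do not define max_value, so use
        # the length of the longer profile as the penalty.
        max_value = max(len(category_profile), len(test_profile))
    rank = {gram: i for i, gram in enumerate(category_profile)}
    total = 0
    for i, gram in enumerate(test_profile):
        if gram in rank:
            total += abs(i - rank[gram])
        else:
            total += max_value
    return total


def classify(test_profile, category_profiles):
    """Return the category whose profile is closest to the test profile.

    `category_profiles` maps a language name to its ranked n-gram list.
    """
    return min(
        category_profiles,
        key=lambda lang: out_of_place(category_profiles[lang], test_profile),
    )


# The example from the Measuring Distances slide.
category = ["A", "ER", "ING", "AND", "ON"]
test = ["AR", "AND", "ER", "ED", "ON"]
distance = out_of_place(category, test, max_value=10)  # 3 + 2 * 10
```

With the slide's profiles, AND and ER are displaced by 2 and 1 places, ON is in place, and AR and ED are absent from the category profile, which gives the distance 3 + 2 * max_value.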