Automatic Language Identification



Presentation Transcript

Automatic Language Identification – A Syntactic Approach: 

Mahesh Soundalgekar

The Road Map: 

- Introduction
- System Architecture
- Classification Approaches
- Experimental Results
- Summary and Future Work


Introduction:

- Goal: efficiently crawl Web pages in a given language (Marathi in our case).
- Different languages use the same Devanagari script, e.g. Marathi, Sanskrit and Hindi.
- It is therefore necessary to distinguish one language accurately from the others.
- We take a syntactic approach to this problem, which has given us excellent results with 2 MB of training data and 10 MB of test data.


System Architecture:

Input: HTML documents in different encodings, such as Xdvng and DV-TTYogesh
1. HTML to ASCII: produces plain text plus font information
2. Appropriate Encoding Converter: produces plain text in ISCII encoding
3. Classifier: produces the classification results
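The first stage of this pipeline (HTML in, plain text plus font information out) could be sketched as below. This is only an illustration, not the authors' converter: it assumes the font is declared through `<font face="...">` tags, and the names `FontAwareTextExtractor` and `extract_runs` are hypothetical. The later conversion to ISCII depends on per-font mapping tables and is not shown.

```python
from html.parser import HTMLParser


class FontAwareTextExtractor(HTMLParser):
    """Collects (font_face, text) runs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.font_stack = []  # faces of the currently open <font> tags
        self.runs = []        # list of (font face or None, text)

    def handle_starttag(self, tag, attrs):
        # Assumption: fonts are declared as <font face="...">.
        if tag == "font":
            self.font_stack.append(dict(attrs).get("face"))

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            face = self.font_stack[-1] if self.font_stack else None
            self.runs.append((face, text))


def extract_runs(html):
    """Return the text of an HTML string together with its font faces."""
    parser = FontAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.runs
```

Each run keeps its font face so that a later stage can choose the matching encoding converter for that piece of text.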

Classification Approaches: 

- Most frequently occurring common words, e.g. English: the, an, is, at, a
- N-grams (most frequent character sequences)
  - Bi-grams: th, 's, re, en
  - Tri-grams: the, ing, ion
  - Quad-grams: tion, as in classification, association, gratification

Important Factors: 

- Size of the training data: important to capture the syntactic essence of a language
- Domains of the training data: usage varies from domain to domain and from author to author
- Size of the test data: small test data may not contain enough information for classification
- The common words approach requires linguistic knowledge


Classifier Architecture:

Training: Training Samples -> Generate Profiles -> Category Profiles
Testing: Test Document -> Generate Profile -> Document Profile
Classification: Measure Profile Distances -> Find Minimum Distance -> Identify Category
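A minimal sketch of this architecture, assuming profiles are ranked lists of character n-grams and the profile distance is the out-of-place measure. The function names, the n-gram sizes (2 to 4) and the profile cut-off of 300 are assumptions for illustration; the slides do not state these values.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cut-off; the slides do not give a value


def word_ngrams(word, n):
    """Character n-grams of a word, padded with '_' as word boundary."""
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def generate_profile(text, sizes=(2, 3, 4)):
    """Rank the n-grams of a text by descending frequency."""
    counts = Counter()
    for word in text.split():
        for n in sizes:
            counts.update(word_ngrams(word, n))
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:PROFILE_SIZE]


def profile_distance(doc_profile, category_profile, max_value=PROFILE_SIZE):
    """Sum of rank differences; unseen n-grams cost max_value."""
    rank = {g: r for r, g in enumerate(category_profile)}
    return sum(abs(r - rank[g]) if g in rank else max_value
               for r, g in enumerate(doc_profile))


def identify_category(text, category_profiles):
    """Return the category whose profile is closest to the document."""
    doc_profile = generate_profile(text)
    return min(category_profiles,
               key=lambda c: profile_distance(doc_profile, category_profiles[c]))
```

Here `category_profiles` is a dictionary from category name to a profile built with `generate_profile` on that category's training samples.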

Common Words Approach: 

- A list of selected common words is matched against the test documents
- The closest match gives the language of the document
- Advantages:
  - Intuitive
  - Computationally efficient
  - Space efficient
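The matching step could look like the sketch below. It assumes "closest match" means the highest fraction of document tokens found in a language's word list, which the slides do not spell out. The English list is taken from the slides; the Marathi list is an assumed Unicode rendering of the slide's top five words, and the function names are hypothetical.

```python
WORD_LISTS = {
    "english": {"the", "an", "is", "at", "a"},
    # Assumption: Unicode reading of the slide's top 5 Marathi words.
    "marathi": {"व", "आणि", "आहे", "या", "ते"},
}


def common_word_score(text, common_words):
    """Fraction of whitespace-separated tokens that are common words."""
    tokens = [t.lower() for t in text.split()]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in common_words) / len(tokens)


def identify_language(text, word_lists=WORD_LISTS):
    """Return the language whose common words match the text best."""
    return max(word_lists,
               key=lambda lang: common_word_score(text, word_lists[lang]))
```

Only the word lists need to be stored, and one pass over the tokens is enough, which is where the space and time efficiency comes from.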

Top 5 Marathi Common Words: 

व, आणि, आहे, या, ते

N-Grams Approach: 

JAVA
- Bi-grams: _J, JA, AV, VA, A_
- Tri-grams: _JA, JAV, AVA, VA_, A__
- Quad-grams: _JAV, JAVA, AVA_, VA__, A___

¨ÉniÉ
- Bi-grams: _¨É, ¨Én, niÉ, iÉ_
- Tri-grams: _¨Én, ¨ÉniÉ, niÉ_, iÉ__
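The JAVA example follows a simple padding rule: one leading underscore and n - 1 trailing underscores. A short sketch that reproduces the lists above (the function name `ngrams` is hypothetical):

```python
def ngrams(word, n):
    """Padded character n-grams: one leading '_' and n - 1 trailing '_'."""
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n, label in [(2, "Bi-grams"), (3, "Tri-grams"), (4, "Quad-grams")]:
    print(label + ":", ", ".join(ngrams("JAVA", n)))
```

With this padding every word of k letters yields k + 1 n-grams, whatever the value of n.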

Measuring Distances: 

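Both profiles are sorted in descending order of frequency.

Category profile: A, ER, ING, AND, ON
Test profile: AR, AND, ER, ED, ON

Out-of-place value of each n-gram in the test profile:
- AR: max_value (not in the category profile)
- AND: 2
- ER: 1
- ED: max_value (not in the category profile)
- ON: 0

Distance = 3 + 2 * max_value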
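The out-of-place computation for this example can be sketched as follows. The profile contents are read off the slide; the function name and the concrete `max_value` of 10 are assumptions chosen only to make the arithmetic visible.

```python
def out_of_place_distance(test_profile, category_profile, max_value):
    """Sum, over the test n-grams, of how far each is from its category rank."""
    rank_in_category = {g: r for r, g in enumerate(category_profile)}
    total = 0
    for r, g in enumerate(test_profile):
        if g in rank_in_category:
            total += abs(r - rank_in_category[g])
        else:
            total += max_value  # n-gram missing from the category profile
    return total


category = ["A", "ER", "ING", "AND", "ON"]
test = ["AR", "AND", "ER", "ED", "ON"]
print(out_of_place_distance(test, category, max_value=10))  # 3 + 2 * 10 = 23
```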

Extensions to N-Grams Method: 

- Letter granularity: +ÉÊniªÉ = +É + Ên + iÉ + ªÉ
- Conjunct granularity: +ÉÊniªÉ = +É + Ên + iªÉ
- Lowest granularity: +ÉÊniªÉ = + + É + Ê + n + iÉ + ªÉ
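For Unicode Devanagari text, comparable granularities could be approximated as in the sketch below. This is an analogy under stated assumptions, not the authors' method: the slides split font-encoded glyphs, whereas this code splits Unicode code points, and the sample word आदित्य is an assumed reading of the slide's example. All function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama, which joins consonants


def lowest_units(word):
    """Lowest granularity: every code point is its own unit."""
    return list(word)


def letter_units(word):
    """Letter granularity: a base letter plus its attached signs."""
    units = []
    for ch in word:
        # Combining marks (vowel signs, virama) attach to the previous letter.
        if units and unicodedata.category(ch) in ("Mn", "Mc"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def conjunct_units(word):
    """Conjunct granularity: letters joined by a virama form one unit."""
    units = []
    for unit in letter_units(word):
        if units and units[-1].endswith(VIRAMA):
            units[-1] += unit
        else:
            units.append(unit)
    return units


word = "आदित्य"  # assumed Unicode form of the slide's example word
print(len(lowest_units(word)), len(letter_units(word)), len(conjunct_units(word)))
```

N-grams can then be built over any of these unit lists instead of over raw characters.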

Experimental Training Setup: 


Category Profiles Generated through Training: 


Classification Results: 


Summary and Future Work: 

- Good results have been obtained through syntactic classification.
- The common words technique is computationally the most efficient, but less accurate.
- Our extensions to the N-grams method give the desired accuracy.
- The N-grams technique is robust to syntax errors.
- The N-grams technique does not require linguistic knowledge.
- Future work: we will use language identification techniques to identify a good starting set of pages for the crawling activities of a general-purpose search engine.