Automatic Language Identification

Presentation Transcript

Automatic Language Identification – A Syntactic Approach: 

Mahesh Soundalgekar

The Road Map: 

Introduction
System Architecture
Classification Approaches
Experimental Results
Summary and Future Work

Introduction: 

Goal: Efficiently crawl Web pages in a given language (Marathi, in our case).
Different languages use the same Devanagari script, e.g. Marathi, Sanskrit and Hindi.
It is therefore necessary to distinguish one language accurately from the others.
We take a syntactic approach to this problem, which has given excellent results with 2 MB of training data and 10 MB of test data.

System Architecture: 

HTML documents in different encodings (such as Xdvng, DV-TTYogesh)
-> HTML to ASCII: plain text + font information
-> Appropriate encoding converter: plain text in ISCII encoding
-> Classifier: classification results
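The first stage of the architecture turns HTML into plain text plus font information. The slides do not say how this is done, so the following is only a minimal sketch under the assumption that the font is declared with `<font face="...">` tags. The class name `FontTextExtractor` and the function `extract_runs` are hypothetical; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collect (font face, text) runs from an HTML document.

    Assumption: fonts are declared with <font face="..."> tags.
    """

    def __init__(self):
        super().__init__()
        self.font_stack = []  # faces of the currently open <font> tags
        self.runs = []        # collected (font, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            # attrs is a list of (name, value) pairs; face may be absent.
            self.font_stack.append(dict(attrs).get("face"))

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            font = self.font_stack[-1] if self.font_stack else None
            self.runs.append((font, text))


def extract_runs(html):
    """Return the list of (font, text) runs found in html."""
    parser = FontTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.runs
```

Each run could then be passed to a converter chosen by its font name; the font-to-ISCII mapping itself is not given in the slides and is not sketched here.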

Classification Approaches: 

Most frequently occurring common words
  e.g. English: the, an, is, at, a, etc.
N-grams (most frequent character sequences)
  Bi-grams: th, ’s, re, en
  Tri-grams: the, ing, ion
  Quad-grams: tion, as in classification, association, gratification, etc.

Important Factors: 

Size of the training data: important for capturing the syntactic essence of a language
Domains of the training data: usage varies from domain to domain and from author to author
Size of the test data: small test data may not contain enough information for classification
Requirement of linguistic knowledge for the common words approach

Classifier Architecture: 

Training samples -> generate profiles -> category profiles
Test document -> generate profile -> document profile
Category profiles + document profile -> measure profile distances -> find minimum distance -> identify category
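The flow in this diagram can be sketched independently of how profiles and distances are computed. In the hedged example below, `train` and `classify` are hypothetical names, and the profile and distance functions are passed in as parameters, since the slides define them separately for each approach.

```python
def train(training_samples, make_profile):
    """Build one category profile per language from its training text."""
    return {category: make_profile(text)
            for category, text in training_samples.items()}


def classify(document, category_profiles, make_profile, distance):
    """Return (closest category, all distances) for a test document."""
    doc_profile = make_profile(document)
    distances = {category: distance(profile, doc_profile)
                 for category, profile in category_profiles.items()}
    # The category at minimum distance is the identified language.
    return min(distances, key=distances.get), distances
```

As a stand-in for illustration only, a profile could be the set of words in a text and the distance the size of the symmetric difference between two such sets; the N-gram profile and its rank-based distance would plug into the same two parameters.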

Common Words Approach: 

A list of selected common words is matched against the test document.
The closest match gives the language of the document.
Advantages:
  Intuitive
  Computationally efficient
  Space efficient
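A minimal sketch of this matching step follows. The slides do not define "closest match", so the scoring rule here (total occurrences of a language's common words) is an assumption. The function name `identify_by_common_words` is hypothetical, the English list comes from the Classification Approaches slide, and the Marathi entries are an assumed Unicode reading of the five words on the next slide.

```python
from collections import Counter

# Illustrative word lists only; a real system would derive them from training text.
COMMON_WORDS = {
    "english": {"the", "an", "is", "at", "a"},
    "marathi": {"व", "आणि", "आहे", "या", "ते"},  # assumed reading of the slide
}


def identify_by_common_words(text, common_words=COMMON_WORDS):
    """Return the language whose common words occur most often in text.

    Returns None when no common word of any language is found.
    """
    tokens = Counter(text.lower().split())
    scores = {lang: sum(tokens[word] for word in words)
              for lang, words in common_words.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The lookup needs only one pass over the document and one small word set per language, which is consistent with the efficiency advantages listed above.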

Top 5 Marathi Common Words: 

व, आणि, आहे, या, ते

N-Grams Approach: 

JAVA
  Bi-grams: _J, JA, AV, VA, A_
  Tri-grams: _JA, JAV, AVA, VA_, A__
  Quad-grams: _JAV, JAVA, AVA_, VA__, A___
मदत
  Bi-grams: _म, मद, दत, त_
  Tri-grams: _मद, मदत, दत_, त__
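The padding convention in the JAVA example (one leading underscore and n-1 trailing underscores) can be reproduced in a few lines. This is a sketch inferred from the example, not the authors' code, and `word_ngrams` is a hypothetical name.

```python
def word_ngrams(word, n, pad="_"):
    """Return the padded character n-grams of a single word.

    The word gets one leading pad character and n-1 trailing ones,
    matching the JAVA example on the slide.
    """
    padded = pad + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For example, `word_ngrams("JAVA", 3)` returns `['_JA', 'JAV', 'AVA', 'VA_', 'A__']`. On Unicode Devanagari text this version splits by code point, which is only one of the possible granularities.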

Measuring Distances: 

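Category profile, sorted in descending order: A, ER, ING, AND, ON
Test profile, sorted in descending order: AR, AND, ER, ED, ON
Out-of-place value of each test N-gram: AR = max_value, AND = 2, ER = 1, ED = max_value, ON = 0
An N-gram's out-of-place value is the difference between its ranks in the two profiles; an N-gram missing from the category profile scores max_value.
Distance = 3 + 2 * max_value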
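The out-of-place computation on this slide can be checked with a short sketch. The function name `out_of_place_distance` is hypothetical, the two profiles are as read from the slide, and defaulting `max_value` to the length of the category profile is an assumption, since the slide leaves its value open.

```python
def out_of_place_distance(category_profile, test_profile, max_value=None):
    """Sum the rank differences between a test profile and a category profile.

    Both profiles are lists of n-grams sorted by descending frequency.
    An n-gram absent from the category profile contributes max_value.
    """
    if max_value is None:
        # Assumption: use the category profile length as the penalty.
        max_value = len(category_profile)
    rank = {gram: i for i, gram in enumerate(category_profile)}
    total = 0
    for i, gram in enumerate(test_profile):
        total += abs(rank[gram] - i) if gram in rank else max_value
    return total
```

With the category profile `["A", "ER", "ING", "AND", "ON"]`, the test profile `["AR", "AND", "ER", "ED", "ON"]` and `max_value=5`, the result is 13, which equals 3 + 2 * max_value.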

Extensions to N-Grams Method: 

Letter granularity: आदित्य = आ + दि + त + य
Conjunct granularity: आदित्य = आ + दि + त्य
Lowest granularity: आदित्य = अ + ा + ि + द + त + य
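The three granularities can be approximated on Unicode text as follows. This is a sketch under stated assumptions rather than the authors' method: the slides work on font-encoded or ISCII text, so the units differ slightly (in Unicode, आ is a single code point and the virama is explicit). The function name `split_units` is hypothetical, and the example word is assumed to be आदित्य.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (halant)


def split_units(word, granularity):
    """Split a Devanagari word into units at the given granularity.

    "lowest":   one unit per Unicode code point
    "letter":   a base character plus the marks attached to it
    "conjunct": like "letter", but consonants joined by a virama stay together
    """
    if granularity not in ("lowest", "letter", "conjunct"):
        raise ValueError("unknown granularity: " + granularity)
    if granularity == "lowest":
        return list(word)
    units = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        joins_conjunct = (granularity == "conjunct"
                          and bool(units) and units[-1].endswith(VIRAMA))
        if units and (is_mark or joins_conjunct):
            units[-1] += ch   # attach to the previous unit
        else:
            units.append(ch)  # start a new unit
    return units
```

For आदित्य this yields four units at letter granularity (आ, दि, त्, य) and three at conjunct granularity (आ, दि, त्य). N-grams would then be built over these units instead of over single characters.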

Experimental Training Setup: 

Experimental Training Setup

Category Profiles Generated through Training: 

Category Profiles Generated through Training

Classification Results: 

Classification Results

Summary and Future Work: 

Good results have been obtained through syntactic classification.
The common words technique is computationally the most efficient, but less accurate.
Our extensions to N-grams give the desired accuracy.
The N-grams technique is robust to syntax errors.
The N-grams technique does not require linguistic knowledge.
We will use language identification techniques to identify a good starting set of pages for crawling for the general-purpose search engine.