Automatic Language Identification
Added: May 31, 2007. Category: Education. License: All Rights Reserved.

Presentation Transcript

Automatic Language Identification – A Syntactic Approach
Mahesh Soundalgekar

The Road Map
  Introduction
  System Architecture
  Classification Approaches
  Experimental Results
  Summary and Future Work

Introduction
  Goal: efficiently crawl Web pages in a given language (Marathi in our case).
  Different languages use the same Devanagari script, e.g. Marathi, Sanskrit and Hindi.
  It is therefore necessary to distinguish one language accurately from the others.
  We take a syntactic approach to this problem, which has given us excellent results with 2 MB of training data and 10 MB of test data.

System Architecture
  HTML documents in different encodings (such as Xdvng, DV-TTYogesh)
  -> HTML to ASCII
  -> plain text + font information
  -> Appropriate Encoding Converter
  -> plain text in ISCII encoding
  -> Classifier
  -> classification results

Classification Approaches
  Most frequently occurring common words, e.g. English: the, an, is, at, a, etc.
  N-grams (most frequent character sequences):
    Bi-grams: th, 's, re, en
    Tri-grams: the, ing, ion
    Quad-grams: tion, as in classification, association, gratification, etc.

Important Factors
  Size of the training data: it must be large enough to capture the syntactic essence of a language.
  Domains of the training data: usage varies from domain to domain and from author to author.
  Size of the test data: a small test document may not contain enough information for classification.
  The common words approach requires linguistic knowledge.

Classifier Architecture
  Training samples -> generate profiles -> category profiles
  Test document -> generate profile -> document profile
  Category profiles + document profile -> measure profile distances -> find minimum distance -> identify category

Common Words Approach
  A list of selected common words is matched against the test document.
  The closest match gives the language of the document.
  Advantages: intuitive, computationally efficient, space efficient.

Top 5 Marathi Common Words
  व, आणि, आहे, या, ते

N-Grams Approach
  JAVA
    Bi-grams: _J, JA, AV, VA, A_
    Tri-grams: _JA, JAV, AVA, VA_, A__
    Quad-grams: _JAV, JAVA, AVA_, VA__, A___
  मदत
    Bi-grams: _म, मद, दत, त_
    Tri-grams: _मद, मदत, दत_, त__

Measuring Distances
  Both profiles are sorted in descending order of frequency.

  Rank  Category profile  Test profile  Out-of-place
  1     A                 AR            max_value
  2     ER                AND           2
  3     ING               ER            1
  4     AND               ED            max_value
  5     ON                ON            0

  Distance = 3 + 2 * max_value

Extensions to N-Grams Method
  Letter granularity: आदित्य = आ + दि + त + य
  Conjunct granularity: आदित्य = आ + दि + त्य
  Lowest granularity: आदित्य = अ + ा + ि + द + त + य

Experimental Training Setup

Category Profiles Generated through Training

Classification Results

Summary and Future Work
  Syntactic classification has given good results.
  The common words technique is the most computationally efficient, but it is less accurate.
  Our extensions to the N-grams method give the desired accuracy.
  The N-grams technique is robust to syntax errors.
  The N-grams technique does not require linguistic knowledge.
  We will use language identification techniques to identify a good starting set of pages for crawling for the general-purpose search engine.
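
The N-gram profiles and the out-of-place distance described in the slides can be sketched in Python roughly as follows. This is only an illustrative sketch, not the author's implementation: the function names, the profile size of 300, the choice of n = 2 to 4, the alphabetical tie-breaking, and the default value of max_value are all assumptions made for the example. The padding with underscores follows the JAVA example from the N-Grams Approach slide.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cut-off; the slides do not state a profile size


def ngrams(word, n):
    """Character n-grams of one word, padded with '_' as in the JAVA example."""
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def profile(text, max_n=4, size=PROFILE_SIZE):
    """Return the most frequent n-grams (n = 2..max_n), most frequent first."""
    counts = Counter()
    for word in text.split():
        for n in range(2, max_n + 1):
            counts.update(ngrams(word, n))
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:size]


def out_of_place(category_profile, test_profile, max_value=PROFILE_SIZE):
    """Sum of rank differences; n-grams missing from the category cost max_value."""
    rank = {gram: i for i, gram in enumerate(category_profile)}
    return sum(
        abs(rank[gram] - i) if gram in rank else max_value
        for i, gram in enumerate(test_profile)
    )


def classify(text, category_profiles):
    """Pick the category whose profile is closest to the document profile."""
    doc = profile(text)
    return min(
        category_profiles,
        key=lambda lang: out_of_place(category_profiles[lang], doc),
    )


if __name__ == "__main__":
    category = ["A", "ER", "ING", "AND", "ON"]
    test = ["AR", "AND", "ER", "ED", "ON"]
    # Reproduces the slide: distance = 3 + 2 * max_value
    print(out_of_place(category, test, max_value=10))
```

With the two five-entry profiles from the Measuring Distances slide and max_value set to 10, out_of_place returns 23, which matches 3 + 2 * max_value. Keeping max_value fixed rather than tied to the length of each category profile avoids favouring categories that happen to have short profiles.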