SIGIR04 (added November 20, 2007)

Presentation Transcript

Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval (CLIR): Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien. Academia Sinica, Taiwan.

Outline: Introduction. The Proposed Approaches: Anchor-Text-Based Approach; Search-Result-Based Approach. Experiments. Applications: LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html). Discussions & Conclusion.

Query Translation for CLIR: [Diagram: source query (S), query translation with translation dictionaries, translated query (T), mono-lingual IR.]

Problem: Most queries are proper nouns. [Diagram: source query (S), query translation, translated query (T), mono-lingual IR; example queries: George Bush, Sheffield, Yahoo, Document Classification.]

Observation from Query Logs: Most real queries are short (2.3 English words [Silverstein’98] & 3.18
Chinese characters [Pu’02]) and out-of-dictionary (82.9% of high-frequency query terms). 12.4% unknown English queries for Chinese documents. Most of their Chinese translations are also found in the logs. Demand for translation.

The Web as Corpora: [Diagram: source query (S), query translation using the Web (anchor texts [Lu TOIS’04], search-result pages), translated query (T), mono-lingual IR.]

Purpose: To increase translation coverage (unknown queries, general domains). To improve CLIR performance (query expansion, combination of multiple translation approaches). To benefit cross-language Web search (speed).

Difference from Conventional Approaches: [content not preserved]

Our Ideas: Anchor-Text-Based Approach [Lu TOIS’04]; Search-Result-Based Approach.

Anchor Text in Multiple Languages [Lu’04]: Anchor text is the descriptive part of a link on a Web page.

Probabilistic Inference Model [Lu’04]: Page authority; co-occurrence.

Drawbacks of Anchor-Text-Based Approach: Limited domains. Powerful spiders required. Large training corpora. More network bandwidth & storage.

Our Ideas: Anchor-Text-Based Approach [Lu TOIS’04]; Search-Result-Based Approach.

Multilingual Search-Result Pages: The search-result page in Chinese for the English query “Yahoo”. [Figure: search-result page with snippets.]

Correct Translations: Mixed-language characteristic in Chinese pages.

Relevant Translations: Effective query expansion.

Observation: Coverage of top-ranked translation candidates in search-result pages: 95% for popular queries, 70% for random queries. Many relevant translations found.

Challenges: To extract translation candidates with correct lexical boundaries. To select correct or relevant translation candidates. To integrate extracted translations from different approaches to improve CLIR performance.

Search-Result-Based Approach: [Diagram: source query (S), search engine(s), search-result pages, term extraction, translation candidates, translation selection, translated query (T).]

Challenge 1: Term Extraction: SCP (Symmetric Conditional Probability): cohesion holding the words together; low-frequency or long terms tend to be discarded [Silva’99]. CD (Context Dependency): dependence on the left- or right-adjacent word/character; low-frequency or long terms can be extracted [Chien’97].

Term Extraction (II): SCPCD: a combination of SCP and CD. PAT-tree as data structure. LocalMaxs as key term selection algorithm. No threshold. Performance: [table not preserved]

Challenge 2: Translation Selection: Query term S: Yahoo. Translation candidates T1, T2, ..., Tn: 雅虎 (Yahoo!), 奇摩 (Kimo), 雅虎台灣 (Yahoo! Taiwan). Similarity estimation: S and Ti frequently co-occur in the same pages (not true for synonyms); S and Ti have similar co-occurring context terms.

Chi-Square Test: A statistical method based on co-occurrence. Each translation only needs 3 Web searches.

Slide25: Boolean Query [formula not preserved]

Context Vector Analysis: A vector space model based on co-occurring context terms as feature vectors. Weighting scheme: [formula not preserved]. Similarity measure: [formula not preserved].

Comparison of Chi-Square and Context Vector Methods: FE: feature extraction. N: number of translation candidates. [Table not preserved.]

Challenge 3: CLIR: Retrieval model [Xu’01]: [formula not preserved]

Estimation of P(s|t): Consider various ranges of similarity values. [Formula not preserved; its legend reads: score ranking in method m; assigned weight for each m.]

Experiments: Experiments on the NTCIR-2 English-Chinese task. Experiments on translating Web-query terms. Experiments on translating scientists’ names and disease names (English-to-Chinese/Japanese/Korean).
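The SCP measure named under Challenge 1 can be illustrated with a minimal sketch. It assumes the commonly cited definition from [Silva’99], SCP(w1..wn) = p(w1..wn)^2 divided by the average of p(w1..wi) * p(wi+1..wn) over all split points i, with probabilities estimated as relative frequencies in a toy character sequence. The function names and the toy text are hypothetical; the CD measure, the combined SCPCD score, the PAT-tree, and LocalMaxs are not shown.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp(ngram, counts, total):
    """Symmetric Conditional Probability of an n-gram (n >= 2).

    Assumed form: p(w1..wn)^2 / mean over split points i of
    p(w1..wi) * p(wi+1..wn), with p(g) = count(g) / total.
    """
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP is defined for n-grams of length >= 2")

    def p(gram):
        return counts[gram] / total

    avg = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avg if avg > 0 else 0.0


# Toy character sequence built from the candidates shown in the slides.
tokens = list("雅虎台灣雅虎奇摩")
counts = ngram_counts(tokens, max_n=3)
total = len(tokens)

cohesive = scp(("雅", "虎"), counts, total)   # characters that always occur together
boundary = scp(("虎", "台"), counts, total)   # pair that straddles a term boundary
```

In this toy text the pair 雅虎 scores higher than 虎台, which is the behavior the slide describes as cohesion holding the words together.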
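The chi-square test for a query term S and a candidate T can be sketched from page counts alone. The sketch below assumes the standard 2x2 contingency form of the statistic and assumes that the three Web searches return the number of pages matching "S AND T", "S", and "T"; the exact Boolean queries on the slide are not preserved. The function name, the page counts, and the total number of indexed pages are hypothetical.

```python
def chi_square(n_both, n_s, n_t, n_total):
    """Chi-square statistic for the 2x2 co-occurrence table of S and T.

    n_both : pages containing both S and T
    n_s    : pages containing S
    n_t    : pages containing T
    n_total: assumed total number of pages in the index
    """
    a = n_both                         # S and T
    b = n_s - n_both                   # S without T
    c = n_t - n_both                   # T without S
    d = n_total - n_s - n_t + n_both   # neither
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom


# Hypothetical page counts for two candidates of the same query term.
strong = chi_square(n_both=50, n_s=100, n_t=100, n_total=1000)
weak = chi_square(n_both=20, n_s=100, n_t=100, n_total=1000)
independent = chi_square(n_both=10, n_s=100, n_t=100, n_total=1000)
```

A candidate that co-occurs with the query far more often than chance gets a larger score. As the slide notes, this signal can miss synonyms, which need not appear on the same pages.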
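Context vector analysis compares the terms that co-occur with S against those that co-occur with each Ti. Since the weighting and similarity formulas are not preserved in the transcript, the sketch below assumes a common choice: tf-idf weights (term frequency normalized by the maximum frequency, times log(N/n)) and cosine similarity. The function names, the document-frequency table, and the snippets are hypothetical.

```python
import math
from collections import Counter


def context_vector(snippets, doc_freq, n_docs):
    """Build a tf-idf context vector from tokenized search-result snippets.

    Assumed weight: (tf / max tf) * log(n_docs / document frequency).
    Terms missing from doc_freq are treated as occurring in one document.
    """
    tf = Counter(term for snippet in snippets for term in snippet)
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {
        term: (freq / max_tf) * math.log(n_docs / doc_freq.get(term, 1))
        for term, freq in tf.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical snippets retrieved for a query term and for one candidate.
doc_freq = {"web": 10, "search": 100, "portal": 50}
query_vec = context_vector([["web", "search"], ["web", "portal"]], doc_freq, 1000)
cand_vec = context_vector([["web", "portal"], ["portal"]], doc_freq, 1000)
similarity = cosine(query_vec, cand_vec)
```

Because the comparison uses shared context terms rather than direct co-occurrence, it can still relate S and Ti when the two rarely appear on the same page.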
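The legend for the estimation of P(s|t) mentions a score ranking in each method m and an assigned weight for each m, but the formula itself is missing. One plausible reading, offered here only as an assumption, is a weighted reciprocal-rank sum over the methods. The function name, the weights, and the rankings below are hypothetical.

```python
def combine_by_rank(rankings, weights):
    """Merge ranked candidate lists from several methods.

    Assumed score: sum over methods m of weight[m] / rank of the
    candidate in method m (rank starts at 1; absent candidates add 0).
    Returns (candidate, score) pairs sorted by descending score.
    """
    scores = {}
    for method, ranked in rankings.items():
        weight = weights[method]
        for rank, candidate in enumerate(ranked, start=1):
            scores[candidate] = scores.get(candidate, 0.0) + weight / rank
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings for the query term "Yahoo" from three methods.
rankings = {
    "chi_square": ["奇摩", "雅虎", "雅虎台灣"],
    "context_vector": ["雅虎", "雅虎台灣", "奇摩"],
    "anchor_text": ["雅虎"],
}
weights = {"chi_square": 1.0, "context_vector": 1.0, "anchor_text": 1.0}
combined = combine_by_rank(rankings, weights)
```

With equal weights, a candidate ranked highly by several methods rises to the top even when one method prefers a different candidate.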
Experiments on the NTCIR-2 English-Chinese Task

Translation Performance: Hong Kong law parallel text collection (238K para.) [Kwok’01]. Web corpora: search results; anchor-text collection (109K URLs) [Lu’04]; search result + anchor text.

Performance Metric: Top-k inclusion rate: the percentage of queries whose translations could be found in the first k extracted translations.

Translation Performance (II): CV (context vector) has higher precision rates than X2 (chi-square). CV+X2 has better performance than CV or X2.

Translation Performance (III): AT (anchor text) has higher precision rates than CV+X2. CV+X2 has higher coverage rates than AT. The two are complementary. CV+X2+AT has the best performance.

Extracted Correct Translations: [examples not preserved]

Extracted Relevant Translations: [examples not preserved]

CLIR Performance: Dic: LDC English-Chinese lexicon (102K entries). SR: X2+CV. SR+AT: X2+CV+AT. All: X2+CV+AT+Dictionary.

CLIR Performance (II): Top-1 inclusion rate: 50.3%, 61.2%. Dic has higher precision rates than SR and SR+AT at K = 1.

CLIR Performance (III): Top-3 inclusion rate: 68.0%, 78.1%. SR or SR+AT has higher precision rates than Dic when K > 3.
CLIR Performance (III): Starting to converge.

CLIR Performance (IV): CLIR performance improvement by translating OOV terms. Using only dictionary vs. using dictionary + our approaches. Improvement: 0.043, 0.061, 0.064, 0.059, 0.063, 0.064. OOV inclusion rate: 68.1%, 81.8%, 86.3%.

Experiments on Translating Web-Query Terms: Web-query logs and test query sets. [Tables not preserved.]

Web-Query Translation Performance: Popular Web queries > random Web queries. AT performs worse for random Web queries.

Web-Query Translation Performance in Different Types: Popular query set (search-result-based approach): Place > People > Computer & Network > Others > Organization.

Common Nouns and Verbs: The proposed search-result-based approach is less reliable for common terms.

Experiments on Translating Scientists’ Names and Technical Terms (English-to-Chinese/Japanese/Korean)

An Example of Multilingual Translation

Applications: LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html): A cross-language meta-search engine that provides an online translation service of query terms for cross-language Web search.

Application: Sheffield: transliteration; industry city in mid U.K.; Sheffield Univ.;
Sheffield Hallam Univ.

Discussion and Conclusions: Advantages: Can translate unknown queries to improve CLIR performance. Can provide query expansion for CLIR. Can extract translations with multiple meanings. Flexible for query specification. Useful for online cross-language Web search. Disadvantages: Dependent on the employed search engines. Does not perform well for common terms. Not applicable to language pairs without the mixed-language characteristic.

Conclusion: Jaguar: Jaguar (car); Jaguar (animal).

Conclusion: Have a temperature of 38°C; pneumonia; SARS (severe acute respiratory syndrome).

Thank you for your attention! Q&A
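The top-k inclusion rate used throughout the evaluation follows directly from its definition on the Performance Metric slide: the share of queries for which a correct translation appears among the first k extracted candidates. The sketch below is a minimal illustration; the function name and the toy data are hypothetical, and the placeholder strings stand in for real translations.

```python
def top_k_inclusion_rate(ranked_candidates, gold, k):
    """Fraction of queries whose correct translation is in the top k.

    ranked_candidates: query -> list of extracted translations, best first
    gold             : query -> set of acceptable translations
    Queries with no extracted candidates count as misses.
    """
    if not gold:
        return 0.0
    hits = 0
    for query, correct in gold.items():
        top_k = ranked_candidates.get(query, [])[:k]
        if any(candidate in correct for candidate in top_k):
            hits += 1
    return hits / len(gold)


# Toy data: the Yahoo candidates come from the slides; q2 and q3 are placeholders.
ranked = {
    "Yahoo": ["奇摩", "雅虎", "雅虎台灣"],
    "q2": ["t2_correct", "t2_other"],
    "q3": ["t3_wrong"],
}
gold = {"Yahoo": {"雅虎"}, "q2": {"t2_correct"}, "q3": {"t3_correct"}}

top1 = top_k_inclusion_rate(ranked, gold, k=1)
top2 = top_k_inclusion_rate(ranked, gold, k=2)
```

The rate can only stay the same or grow as k increases, which matches the reported top-3 figures being higher than the top-1 figures.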