SIGIR04


Presentation Transcript

Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval (CLIR): 

Translating Unknown Queries with Web Corpora for Cross-Language Information Retrieval (CLIR)
Pu-Jen Cheng, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien
Academia Sinica, Taiwan

Outline: 

Outline
- Introduction
- The Proposed Approaches
  - Anchor-Text-Based Approach
  - Search-Result-Based Approach
- Experiments
- Applications
  - LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html)
- Discussions & Conclusion

Query Translation for CLIR: 

Query Translation for CLIR
[Diagram: source query (S) -> query translation, using translation dictionaries -> translated query (T) -> monolingual IR]

Problem Most queries are proper nouns: 

Problem: most queries are proper nouns
[Diagram: source query (S) -> query translation -> translated query (T) -> monolingual IR]
Example queries: George Bush, Sheffield, Yahoo, Document Classification

Observation from Query Logs: 

Observation from Query Logs
Most real queries are:
- Short (2.3 English words [Silverstein’98]; 3.18 Chinese characters [Pu’02])
- Out-of-dictionary (82.9% of high-frequency query terms)
Problem:
- 12.4% unknown English queries for Chinese documents
- Most of their Chinese translations are also found in the logs
- Demand for translation

The Web as Corpora: 

The Web as Corpora
[Diagram: source query (S) -> query translation -> translated query (T) -> monolingual IR, with the Web as the translation resource]
Web resources: anchor texts [Lu TOIS’04] and search-result pages

Purpose: 

Purpose
- To increase translation coverage
  - Unknown queries
  - General domains
- To improve CLIR performance
  - Query expansion
  - Combination of multiple translation approaches
- To benefit cross-language Web search
  - Speed

Difference from Conventional Approaches: 

Difference from Conventional Approaches Idea

Our Ideas: 

Our Ideas
- Anchor-Text-Based Approach [Lu TOIS’04]
- Search-Result-Based Approach

Anchor Text in Multiple Languages: 

Anchor Text in Multiple Languages [Lu’04]
Anchor text: the descriptive text of a link on a Web page

Probabilistic Inference Model: 

Probabilistic Inference Model [Lu’04]
Labels on the model: Page Authority, Co-occurrence

Slide13: 

Drawbacks of Anchor-Text-Based Approach
- Limited domains
- Powerful spiders required
- Large training corpora
- More network bandwidth & storage

Our Ideas: 

Our Ideas
- Anchor-Text-Based Approach [Lu TOIS’04]
- Search-Result-Based Approach

Multilingual Search-Result Pages: 

Multilingual Search-Result Pages
[Screenshot: the Chinese search-result page for the English query “Yahoo”, with its snippets marked]

Correct Translations: 

Correct Translations
Mixed-language characteristic of Chinese pages

Relevant Translations: 

Relevant Translations
Effective query expansion

Observation: 

Observation
Coverage of top-ranked translation candidates in search-result pages:
- Popular queries: 95%
- Random queries: 70%
Many relevant translations found

Slide19: 

Challenges
- To extract translation candidates with correct lexical boundaries
- To select correct or relevant translation candidates
- To integrate extracted translations from different approaches to improve CLIR performance

Search-Result-Based Approach: 

Search-Result-Based Approach
[Diagram: source query (S) -> search engine(s) -> search-result pages -> term extraction -> translation candidates -> translation selection -> translated query (T)]

Challenge 1: Term Extraction: 

Challenge 1: Term Extraction
- SCP (Symmetric Conditional Probability) [Silva’99]
  - Cohesion holding the words together
  - Low-frequency or long terms tend to be discarded
- CD (Context Dependency) [Chien’97]
  - Dependence on the left- or right-adjacent word/character
  - Low-frequency or long terms can be extracted
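
The slide names SCP without giving its formula. As a minimal sketch, assuming the commonly cited definition (the squared probability of the whole n-gram divided by the average product of the probabilities of its two parts over every split point), the cohesion score could be computed as follows. The helper names, the toy sentence, and the count/total probability estimate are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp(ngram, counts, total):
    """SCP of an n-gram with n >= 2.

    Probabilities are estimated as count / total (a simplifying assumption).
    The denominator averages p(prefix) * p(suffix) over all n-1 split points.
    """
    ngram = tuple(ngram)
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP needs at least two tokens")
    p_whole = counts[ngram] / total
    avg_split = sum(
        (counts[ngram[:i]] / total) * (counts[ngram[i:]] / total)
        for i in range(1, n)
    ) / (n - 1)
    return p_whole ** 2 / avg_split if avg_split > 0 else 0.0


# Toy corpus (hypothetical): "new york" always occurs as a unit, "is big" does not.
tokens = "new york is big and new york is old".split()
counts = ngram_counts(tokens, max_n=3)
total = len(tokens)
print(scp(("new", "york"), counts, total))
print(scp(("is", "big"), counts, total))
```

A pair whose words only ever appear together scores higher than a pair whose words also appear apart, which is the "cohesion" the slide refers to.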

Term Extraction (II): 

Term Extraction (II)
SCPCD: a combination of SCP and CD
- PAT-tree as data structure
- LocalMaxs as key term selection algorithm
- No threshold
Performance: [figures not captured in the transcript]

Challenge 2: Translation Selection: 

Challenge 2: Translation Selection
Query term (S): Yahoo
Translation candidates (T1, T2, ..., Tn): 雅虎 (Yahoo!), 奇摩 (Kimo), 雅虎台灣 (Yahoo! Taiwan)
Similarity estimation between S and each Ti:
- S and Ti frequently co-occur in the same pages (not true for synonyms)
- S and Ti have similar co-occurring context terms

Chi-Square Test: 

Chi-Square Test
A statistical method based on co-occurrence
Each translation needs only 3 Web searches
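
The slide's formula is not in the transcript. As a sketch under stated assumptions, the standard chi-square statistic for a 2x2 contingency table can be filled in from three page counts per candidate: pages containing both terms, pages containing the source term, and pages containing the candidate. Which three Boolean queries the authors actually issued is not recoverable here, and `n_total` (the number of indexed pages) is a hypothetical constant.

```python
def chi_square(n_both, n_s, n_t, n_total):
    """Chi-square association score from a 2x2 contingency table.

    n_both:  pages containing both s and t
    n_s:     pages containing s
    n_t:     pages containing t
    n_total: assumed total number of indexed pages (hypothetical constant)
    """
    a = n_both                 # s and t
    b = n_s - n_both           # s without t
    c = n_t - n_both           # t without s
    d = n_total - a - b - c    # neither s nor t
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return n_total * (a * d - b * c) ** 2 / denom


# Hypothetical counts: strong co-occurrence vs. statistical independence.
print(chi_square(n_both=50, n_s=100, n_t=100, n_total=1000))
print(chi_square(n_both=10, n_s=100, n_t=100, n_total=1000))
```

The statistic is zero when the two terms co-occur exactly as often as chance predicts and grows as the observed co-occurrence departs from that.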

Slide25: 

Boolean Query Approach

Context Vector Analysis: 

Context Vector Analysis
A vector space model that uses co-occurring context terms as feature vectors
Weighting scheme: [formula not captured in the transcript]
Similarity measure: [formula not captured in the transcript]
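
Neither the weighting scheme nor the similarity measure survived in the transcript. The sketch below assumes a conventional choice, TF-IDF weights with max-frequency normalization and cosine similarity. It illustrates the vector-space idea only and may differ from the formulas on the original slide. All names and numbers are hypothetical.

```python
import math


def weight_vector(term_freqs, doc_freqs, n_docs):
    """Assumed weighting: (tf / max tf) * log(n_docs / df) per context term."""
    max_tf = max(term_freqs.values(), default=0)
    vec = {}
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if max_tf and df:
            vec[term] = (tf / max_tf) * math.log(n_docs / df)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical context-term counts taken from the snippets of a query term
# and of one translation candidate.
doc_freqs = {"search": 10, "portal": 100, "mail": 50}
source_vec = weight_vector({"search": 4, "portal": 2}, doc_freqs, 1000)
cand_vec = weight_vector({"search": 3, "mail": 1}, doc_freqs, 1000)
print(cosine(source_vec, cand_vec))
```

Because the comparison is over shared context terms rather than direct co-occurrence, a candidate can score well even when it rarely appears on the same page as the query term.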

Comparison of Chi-Square and Context Vector Methods: 

Comparison of Chi-Square and Context Vector Methods
[Comparison table not captured in the transcript]
Legend: FE = feature extraction; N = number of translation candidates

Slide28: 

Challenge 3: CLIR
Retrieval model [Xu’01]: [formula not captured in the transcript]

Slide29: 

Estimation of P(s|t)
Consider various ranges of similarity values
[Formula not captured in the transcript; its legend defines a score ranking in method m and an assigned weight for each m]
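
Only the legend of the combination formula survives: a rank per method and a weight per method. One simple reading, shown here purely as an assumption, is a weighted sum of reciprocal ranks. This sidesteps the fact that different methods produce similarity values on different ranges. The method names and weights are hypothetical, and the result is a combined score, not a calibrated P(s|t).

```python
def combine_by_rank(rankings, weights):
    """Merge per-method rankings into one scored list.

    rankings: method name -> list of candidates, best first
    weights:  method name -> weight assigned to that method
    A candidate absent from a method's list contributes nothing for it.
    """
    scores = {}
    for method, ranked in rankings.items():
        w = weights.get(method, 0.0)
        for rank, cand in enumerate(ranked, start=1):
            scores[cand] = scores.get(cand, 0.0) + w / rank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical rankings from two selection methods for three candidates.
rankings = {"chi2": ["t1", "t2"], "cv": ["t2", "t1", "t3"]}
weights = {"chi2": 0.4, "cv": 0.6}
combined = combine_by_rank(rankings, weights)
print(combined)
```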

Experiments: 

Experiments
- Experiments on the NTCIR-2 English-Chinese task
- Experiments on translating Web-query terms
- Experiments on translating scientists’ names and disease names (English-to-Chinese/Japanese/Korean)

Experiments on the NTCIR-2 English-Chinese Task: 

Experiments on the NTCIR-2 English-Chinese Task Evaluation

Translation Performance: 

Translation Performance Hong Kong law parallel text collection (238K para.) [Kwok’01] Evaluation

Translation Performance: 

Translation Performance Web corpora Evaluation

Translation Performance: 

Translation Performance Search results Evaluation

Translation Performance: 

Translation Performance Anchor-text collection (109K URLs) [Lu’04] Evaluation

Translation Performance: 

Translation Performance Search result + anchor text Evaluation

Performance Metric: 

Performance Metric
Top-k inclusion rate: the percentage of queries whose translations could be found in the first k extracted translations
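
The metric defined above can be sketched directly. The data layout (dicts keyed by query) and the sample values are assumptions for illustration, and the function returns a fraction rather than a percentage.

```python
def top_k_inclusion_rate(extracted, gold, k):
    """Fraction of queries with a correct translation in the first k candidates.

    extracted: query -> ranked list of extracted translation candidates
    gold:      query -> set of acceptable translations
    """
    if not gold:
        return 0.0
    hits = 0
    for query, answers in gold.items():
        top_k = extracted.get(query, [])[:k]
        if any(cand in answers for cand in top_k):
            hits += 1
    return hits / len(gold)


# Hypothetical results for three queries.
extracted = {"q1": ["a", "b", "c"], "q2": ["x", "y"], "q3": []}
gold = {"q1": {"b"}, "q2": {"z"}, "q3": {"w"}}
print(top_k_inclusion_rate(extracted, gold, k=1))
print(top_k_inclusion_rate(extracted, gold, k=2))
```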

Translation Performance (II): 

Translation Performance (II)
- CV has higher precision rates than X2
- CV+X2 performs better than CV or X2 alone

Translation Performance (III): 

Translation Performance (III)
- AT has higher precision rates than CV+X2
- CV+X2 has higher coverage rates than AT
- The two are complementary

Translation Performance (III): 

Translation Performance (III)
CV+X2+AT has the best performance

Extracted Correct Translations: 

Extracted Correct Translations Evaluation

Extracted Relevant Translations: 

Extracted Relevant Translations Evaluation

CLIR Performance: 

CLIR Performance Evaluation

CLIR Performance: 

CLIR Performance Evaluation

CLIR Performance: 

CLIR Performance
Dic: LDC English-Chinese lexicon (102K entries)

CLIR Performance: 

CLIR Performance
SR: X2+CV

CLIR Performance: 

CLIR Performance
SR+AT: X2+CV+AT

CLIR Performance: 

CLIR Performance
All: X2+CV+AT+Dictionary

CLIR Performance (II): 

CLIR Performance (II)
Dic has higher precision rates than SR and SR+AT at K = 1
Top-1 inclusion rate: 50.3%, 61.2% [method labels not captured in the transcript]

CLIR Performance (III): 

CLIR Performance (III)
SR or SR+AT has higher precision rates than Dic when K > 3
Top-3 inclusion rate: 68.0%, 78.1% [method labels not captured in the transcript]

CLIR Performance (III): 

CLIR Performance (III)
Performance starts to converge

CLIR Performance (IV): 

CLIR Performance (IV)
CLIR performance improvement by translating OOV terms
Compared settings: using only the dictionary vs. using the dictionary + our approaches
Improvement: 0.043, 0.061, 0.064, 0.059, 0.063, 0.064 [row and column labels not captured in the transcript]
OOV inclusion rate: 68.1%, 81.8%, 86.3%

Experiments on Translating Web-Query Terms: 

Experiments on Translating Web-Query Terms
Web-query logs: [details not captured in the transcript]
Test query sets: [details not captured in the transcript]

Slide54: 

Web-Query Translation Performance Evaluation

Slide55: 

Web-Query Translation Performance
Performance is higher for popular Web queries than for random Web queries

Slide56: 

Web-Query Translation Performance
Performance is higher for popular Web queries than for random Web queries

Slide57: 

Web-Query Translation Performance
AT performs worse for random Web queries

Slide58: 

Web-Query Translation Performance in Different Types
Popular query set (search-result-based approach), ordered by performance:
Place > People > Computer & Network > Others > Organization

Common Nouns and Verbs: 

Common Nouns and Verbs
The proposed search-result-based approach is less reliable for common terms

Experiments on Translating Scientists’ Names and Technical Terms (English-to-Chinese/Japanese/Korean): 

Experiments on Translating Scientists’ Names and Technical Terms (English-to-Chinese/Japanese/Korean) Evaluation

An Example of Multilingual Translation: 

An Example of Multilingual Translation Evaluation

Applications: 

Applications
LiveTrans (http://livetrans.iis.sinica.edu.tw/lt.html)
- A cross-language meta-search engine
- Provides an online translation service for query terms in cross-language Web search

Slide63: 

Application Sheffield Transliteration

Slide64: 

Application Industry City in Mid U.K. Sheffield Univ. Sheffield Hallam Univ.

Discussion and Conclusions: 

Discussion and Conclusions
Advantages
- Can translate unknown queries to improve CLIR performance
- Can provide query expansion for CLIR
- Can extract translations with multiple meanings
- Is flexible for query specification
- Is useful for online cross-language Web search
Disadvantages
- Depends on the search engines employed
- Does not perform well for common terms
- Is not applicable to language pairs without the mixed-language characteristic

Slide66: 

Jaguar Jaguar Car Jaguar Animal Conclusion

Slide67: 

Have a temperature; 38°C; pneumonia; SARS (severe acute respiratory syndrome)

Thank you for your attention!: 

Thank you for your attention! Q&A