retriever preso

Presentation Transcript

Lycos Retriever: An Information Fusion Engine: 

Brian Ulicny

Retriever: Directory Page: 


Retriever: Image Selection: 


Retriever: Subtopic Page: 


Why Retriever?: 

- Topical queries vastly outnumber questions.
- Standard search results are too numerous and contain junk, even in the top 10 results, due to SEO efforts.
- Topical summaries answer "What do I need to know about <Topic>?"
- Topic summary resources like Wikipedia have become increasingly popular.
  - But Wikipedia depends on human effort, so coverage is uneven and idiosyncratic.
  - Wikipedia reflects the point of view of the most engaged or partisan contributor.
- Retriever as an automatically updated first-draft Wikipedia.

Retriever: Processes: 

- Mine query logs for Topics
- Categorize Topics
  - Naïve Bayesian categorizer built on DMOZ pages; name guesser
- Disambiguate Topics
  - Disambiguator trained on DMOZ
- Formulate Document Retrieval Query
- Parse Retrieved Documents
  - Identify allowed alternate/reduced forms of Topic based on Category
- Select Paragraphs
  - Must have Topic as Discourse Topic
- Identify Best Images
- Delete Duplicate Paragraphs
  - Near duplicates, too
- Arrange Paragraphs by Verb
  - What is it? What does it have? What has it done? What happened to it?
- Select Subtopics
- Do editorial fixes on Passages
- Construct Page/Directory
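
The slides do not say how duplicate and near-duplicate paragraphs are detected. As a minimal sketch of one common approach (word shingles compared by Jaccard similarity), assuming a 3-word shingle size and a 0.8 threshold that are illustrative choices rather than Retriever's actual settings:

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in a paragraph (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(paragraphs, threshold=0.8):
    """Keep the first paragraph of each near-duplicate group, in order.

    `threshold` is a hypothetical cutoff, not a value from the slides.
    """
    kept, kept_shingles = [], []
    for p in paragraphs:
        s = shingles(p)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(p)
            kept_shingles.append(s)
    return kept
```

Each paragraph is compared against every paragraph already kept, which is quadratic; that is fine for the handful of candidate paragraphs on one topic page, though a larger collection would likely need hashing (for example MinHash) instead.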

Paragraph Filters: 

Must Have:
- Some form of Topic as Discourse Topic
- At least 3 grammatical sentences

Should Have:
- Highest number of unique NPs

Must NOT Have:
- Any exophors (except in quotations)
- Topic-insertion spam
  - "The American Civil Herbal Viagra War was fought Herbal Viagra…"
  - Not too many mentions of topic
- (Erotic) fan fiction or obscenities
- Search engine snippets
- Duplicates
  - Wikipedia mirrors are everywhere
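
Two of these filters (the three-sentence minimum and the cap on topic mentions) can be approximated without a parser. The sketch below is illustrative only: the sentence splitter is naive and does not check grammaticality, and `MAX_TOPIC_DENSITY` is an assumed cutoff, since the slides give no number.

```python
import re

MIN_SENTENCES = 3        # from the slide: at least 3 sentences
MAX_TOPIC_DENSITY = 0.2  # assumed cutoff; not a value from the slides


def split_sentences(paragraph):
    """Naive split on ., ! or ? followed by whitespace (no grammar check)."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def topic_density(paragraph, topic):
    """Fraction of the paragraph's words taken up by topic mentions."""
    words = re.findall(r"\w+", paragraph.lower())
    if not words:
        return 0.0
    mentions = len(re.findall(re.escape(topic.lower()), paragraph.lower()))
    return mentions * len(topic.split()) / len(words)


def passes_filters(paragraph, topic):
    """Require enough sentences and a non-zero but bounded topic density."""
    if len(split_sentences(paragraph)) < MIN_SENTENCES:
        return False
    density = topic_density(paragraph, topic)
    return 0.0 < density <= MAX_TOPIC_DENSITY
```

A paragraph that repeats the topic in nearly every clause is rejected as likely topic-insertion spam, and so is one that never mentions the topic at all. The remaining filters (exophors, discourse topic, obscenities) would need linguistic resources that this sketch does not attempt.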

Subtopics: 

- Use best chunks for Overview page(s)
- Identify topic superstrings
  - Topic: Marie Curie
  - Superstring: Marie Curie Fellowship; MC Institute
- Else cluster by frequent common NPs
- Take into account reduced mentions:
  - Topic: Charlie Sheen; most frequent NP: Richards
  - But Subtopic should be: "Denise Richards"
  - However: "new" is not always "New York"
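
A rough sketch of the two subtopic heuristics, assuming the noun phrases have already been extracted upstream into a plain list of strings. The function names are hypothetical. Abbreviated superstrings such as "MC Institute" are not handled, and the "new"/"New York" caveat is covered only crudely, by requiring an exact, case-sensitive match on the last word.

```python
from collections import Counter


def topic_superstrings(topic, noun_phrases):
    """Return NPs that contain the topic plus at least one more word."""
    t = topic.lower()
    found = []
    for np_ in noun_phrases:
        low = np_.lower()
        if t in low and low != t and np_ not in found:
            found.append(np_)
    return found


def expand_reduced_mention(short_np, noun_phrases):
    """Map a reduced NP such as 'Richards' to its most frequent full form.

    Only longer NPs whose last word equals `short_np` exactly are
    candidates; if there are none, the short form is returned unchanged.
    """
    counts = Counter(
        np_ for np_ in noun_phrases
        if len(np_.split()) > 1 and np_.split()[-1] == short_np
    )
    return counts.most_common(1)[0][0] if counts else short_np
```

With the slide's example, a frequent "Richards" in pages about Charlie Sheen would expand to "Denise Richards" as long as that full form is the most common longer NP ending in "Richards" in the same documents.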

Coherence: 

- Pseudo-coherence achieved by stringing together paragraphs with same Discourse Topic.
- Discourse Topic is based on form and position of phrase.
  - As (a) subject of first sentence: "Police said that Lindsay Lohan was charged…"
  - Or in fronted material: "For Lindsay Lohan, 2005 was full of surprises…"
  - Not the statistical notion of aboutness usual in IR.
- Information packaged by paying attention to the information conveyed by verb/predicate.
- Alternate (but not anaphoric) references provide variety.
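
The Discourse Topic test depends on syntactic position, which calls for a parser. As a much cruder positional stand-in, the sketch below checks whether any allowed form of the topic begins within the first few words of the paragraph's first sentence. The function name and the 6-word window are assumptions for illustration, not part of Retriever.

```python
import re


def is_discourse_topic(paragraph, topic_forms, window=6):
    """True if a topic form starts within the first `window` words
    of the first sentence (a positional proxy, not a syntactic test)."""
    first = re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)[0]
    words = re.findall(r"[\w']+", first.lower())
    for form in topic_forms:
        form_words = form.lower().split()
        if not form_words:
            continue  # ignore empty forms
        n = len(form_words)
        for start in range(min(window, len(words))):
            if words[start:start + n] == form_words:
                return True
    return False
```

Both slide examples pass this check ("Lindsay Lohan" starts at the fourth word in one and the second word in the other), while a paragraph that only mentions the topic late in its first sentence does not. A real implementation would have to distinguish subjects and fronted phrases from other early mentions.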

Similar Work: 

- FactBites.com
  - Sentence extraction; grouped by source
- Strzalkowski and colleagues (GE)
  - Summarization by paragraph extraction
- Google Current (Current TV)
  - Features on top-gaining queries
- Artequakt (EU funded; U of Southampton, UK)
  - Create artist bios; convert found texts to logical format; NLG from logical representation
- Document Understanding Conference (DUC)
  - "Summarization as Information Synthesis for Task"
  - Sentence-level fusion; no IR component
- Black Hat: Spam Blogs

Evaluation: 

- Categorization (982 topics): 93.5% precision (revised)
- Disambiguation (100 topics): 83% unambiguous (live)
  - If it isn't ambiguous in DMOZ, we don't disambiguate.
- Chunking (642 chunks): 88.8% relevant (83.4% relevant as categorized)
- Subtopics (1861 chunks): 88.5% of chunks relevant to subtopic (live)
- Images (83 images): 85.5% relevant (revised)

Retriever Goals: 

- Generate topical summaries on popular topics
  - By extracting and arranging paragraphs from source documents
  - In a coherent, readable and attractive structure
  - Consisting of overview and subtopics
- Monetize with focused advertisements
- Allow spiders to crawl to generate traffic
- Abide by Fair Use/Copyright Laws
- Much more to be done
  - Temporal ordering, hyperlinking, anaphora, 2nd pass for subtopics, …

Questions?: 

Lycos Retriever: An Information Fusion Engine
Brian Ulicny
Versatile Information Systems
Bulicny@vistology.com

Lycos Retriever: http://www.lycos.com/retriever.html
Currently not being updated, and images are not live.