bradshaw ECDL2003
Added: October 19, 2007
License: All Rights Reserved

Presentation Transcript

Redeeming Relevance for Subject Search in Citation Indexes:
Shannon Bradshaw
The University of Iowa
shannon-bradshaw@uiowa.edu

Citation Indexes:
Valuable tools for research
Examples: SCI, CiteSeer, arXiv, CiteBase
Permit traversal of citation networks
Identify significant contributions
Subject search is often the entry point

Subject search:
Query similarity
Citation frequency

Citation frequency:
PageRank
Example: 2 papers similar in terms of relevance, published at roughly the same time
  Paper A cited only by its author
  Paper B cited 10 times by other authors
Paper B likely to have greater priority for reading

Problem:
Boolean retrieval metrics
Many top documents are not relevant
Effective for Web searches
  Any one of several popular pages will do
Not so for users of citation indexes

Reference Directed Indexing (RDI):
Objective: To combine strong measures of both relevance and significance in a single metric
Intuition: The opinions of authors who cite a document effectively distinguish both what a document is about and how important a contribution it makes
Similar to the use of anchor text to index Web documents

Example:
Paper by Ron Azuma and Gary Bishop
On tracking the heads of users in augmented reality systems
Head tracking is necessary in order to generate the correct perspective view

A single reference to Azuma:

Summarizes Azuma paper as…:
A six degrees of freedom tracking system
With additional details:
  Improves dynamic registration
  Optical beacon ceiling tracker
  Linear accelerometers
  Rate gyroscopes

Leveraging multiple citations:
For any document cited more than once…
We can compare the words of all authors
Terms used by many referrers make good index terms for a document

Repeated use of “tracking” and “augmented reality”:

A voting technique:
RDI treats each citing document as a voter
The presence of a query term in referential text is a vote of “yes”
The absence of that term, a “no”
The documents with the most votes for the query terms rank highest

Related Work:
McBryan – World Wide Web Worm
Brin & Page – Google
Chakrabarti et al. – CLEVER
Mendelzon et al. – TOPIC
Bharat et al. – Hilltop
Craswell et al. – Effective Site Finding

Contributions:
Application to scientific literature
“Anchor text” for unrestricted subject search
“Anchor text” for combining measures of relevance and significance

Rosetta:
Experimental system in which we implemented RDI
Term weighting metric: (formula not preserved in the transcript)
Ranking metric: (formula not preserved in the transcript)

Experiments:
10,000 research papers
  Gathered from CiteSeer
  Each document cited at least once
Evaluated
  Retrieval precision
  Impact of search results

Comparison system:
We compared Rosetta to a traditional content-based retrieval system
Comparison system uses TFIDF for term weighting: (formula not preserved in the transcript)
And the Cosine ranking metric: (formula not preserved in the transcript)

Indexing:
Indexed collection in both Rosetta and the TFIDF/Cosine system
Rosetta indexed documents based on references to them
The TFIDF/Cosine system indexed documents based on words used within them
Required that each document was cited at least once to ensure that both systems indexed the same set of documents

Slide20:
As referential text, Rosetta used CiteSeer’s “contexts of citation”

Queries:
32 queries in our test set
Queries were key terms extracted from “Keywords” sections of documents
Queries extracted from sample of 24 documents
Document from which key term was extracted established the topic of interest

Queries:

Relevance assessments:
The topic of interest for a query was the idea identified by the corresponding key term
Relevant documents directly addressed this same topic
Example:
  Query: “force feedback”
  Relevant: Work on providing a sense of touch in VR applications or other computer simulations

Retrieval interface:
Meta-interface
Queried both systems
Used top 10 search results from each system
Integrated all 20 search results
Presented them in random order
No way to determine the source of a retrieved document

Experimental summary:
32 queries drawn from document key terms
Document identified the topic of interest
Relevant documents addressed the same topic
Used a meta-search interface
Evaluated top 10 from both systems
Origin of search results hidden

Precision at top 10:
On average RDI provided a 16.6% improvement over TFIDF/Cosine
1 or 2 more relevant documents in the top 10
Result is significant
  t-test of the mean paired difference
  Test statistic = 3.227
  Significant at a confidence level of 99.5%

Precision at top 10 (cont’d):

Many retrieval errors avoided:
Example: software architecture diagrams
Most papers about software architecture frequently use the term “diagrams”
Few are about tools for diagramming
TFIDF/Cosine system: 0/10 relevant
Rosetta: 4/10 relevant (3 in top 5)
Rosetta made the correct distinction more often

Rosetta Shortcomings:
Retrieval metric sorts search results by number of query terms matched
Some authors reuse portions of text in which other documents are cited

Impact of search results:
A look at the number of citations to documents retrieved for each query
Compared RDI to a baseline provided by the TFIDF/Cosine system
TFIDF/Cosine includes no measure of impact
Seeking only a measure of the relative impact of documents retrieved by RDI on a given topic

Experiment:
For each query…
Calculated the average citations/year for each document
Average publication year for
  Rosetta – 1994
  TFIDF/Cosine – 1995
Found the median number of citations/year for each set of search results
Found the difference between the median for Rosetta and the median for TFIDF/Cosine

Difference in impact:
On average the median citations/year…
  8.9 for Rosetta
  1.5 for the baseline

Difference in impact (cont’d):

Summary of Experiments:
Small study – results are tentative
Surpassed retrieval precision of a widely used relevance-based approach
Consistently retrieved documents that have had a significant impact

Future Work:
Retrieval metric that eliminates Boolean component
Large scale implementation with CiteSeer data
Studies with more sophisticated relevance-based retrieval systems
Comparison with popularity-based retrieval techniques

Contact:
Shannon Bradshaw
The University of Iowa
shannon-bradshaw@uiowa.edu
www.biz.uiowa.edu/sbradshaw
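
Sketch of the voting technique:
The voting idea from the slides can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not Rosetta's actual metric: the term weighting and ranking formulas are not preserved in the transcript, so the sort key here (number of query terms matched, then total "yes" votes) is an assumption based on the "Rosetta Shortcomings" slide. The function names, the tokenizer, and the sample citation contexts are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def rdi_votes(query, contexts_by_doc):
    """Count 'yes' votes for each cited document.

    contexts_by_doc maps a document id to a list of referential texts,
    one per citing document. A citing document votes 'yes' for a query
    term when the term occurs in its referential text, 'no' otherwise.
    Returns {doc_id: (query_terms_matched, total_yes_votes)}.
    """
    terms = set(tokenize(query))
    scores = {}
    for doc_id, contexts in contexts_by_doc.items():
        votes = Counter()
        for context in contexts:
            # Each citing document casts at most one vote per query term.
            votes.update(terms & set(tokenize(context)))
        scores[doc_id] = (len(votes), sum(votes.values()))
    return scores


def rank(query, contexts_by_doc):
    """Rank documents with at least one 'yes' vote (assumed sort key)."""
    scores = rdi_votes(query, contexts_by_doc)
    matched = [doc_id for doc_id, score in scores.items() if score[0] > 0]
    return sorted(matched, key=lambda doc_id: scores[doc_id], reverse=True)


# Invented referential texts, loosely modeled on the Azuma example.
contexts = {
    "azuma": [
        "a six degrees of freedom tracking system for augmented reality",
        "head tracking improves dynamic registration in augmented reality",
        "an optical beacon ceiling tracker",
    ],
    "other": ["tracking of packages in a warehouse"],
    "unrelated": ["a survey of compilers"],
}

print(rank("augmented reality tracking", contexts))  # ['azuma', 'other']
```

Exact word matching is used for simplicity, so "tracker" does not count as a vote for "tracking"; a real system would likely apply stemming and some form of term weighting.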
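
Sketch of the evaluation measures:
The measures named in the experiment slides (precision at top 10, a t-test of the mean paired difference, and median citations/year) can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the sample numbers are invented, and the definition of citations/year (citations divided by years since publication, with a minimum of one year) is an assumption, since the slides do not give one.

```python
import math
import statistics


def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def paired_t_statistic(system_a, system_b):
    """t statistic for the mean paired difference between two systems.

    system_a and system_b hold one score per query, in the same order.
    """
    diffs = [a - b for a, b in zip(system_a, system_b)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error


def median_citations_per_year(docs, as_of_year):
    """Median citations/year over (citation_count, publication_year) pairs.

    Assumed definition: citations divided by years since publication,
    with a minimum of one year.
    """
    rates = [count / max(as_of_year - year, 1) for count, year in docs]
    return statistics.median(rates)


# Invented numbers, for illustration only.
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4))  # 0.5
print(paired_t_statistic([5, 6, 7, 4], [4, 4, 6, 3]))  # 5.0
print(median_citations_per_year([(20, 1993), (9, 2000), (3, 2002)], 2003))  # 3.0
```

The resulting t statistic would be compared against a t distribution with n - 1 degrees of freedom (n being the number of queries) to obtain a significance level.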