Presentation Transcript

Slide1: Automated Metadata Generation and New Resource Discovery Software and Services

Slide2: Steve Mitchell, smitch@ucr.edu
Presentation for IMLS WebWise 2006

Participants in iVia, NSDL iVia and Data Fountains:
* IMLS Library of the University of California, Riverside: Ruth Jackson, Steve Mitchell, Johannes Ruscheinski, Paul Vander Griend
* NSF NSDL Core Integration, Cornell University: Carl Lagoze and John Saylor
* Computer Science Department, Cornell University: Rich Caruana and Thorsten Joachims
* Computer Science Department, University of Massachusetts: Andrew McCallum
* University of California, Riverside, Computing and Communications: Charles Rowley, Jerry Keith and Tim Paul
* Indian Institute of Technology, Bombay: Soumen Chakrabarti

Slide3: iVia and Data Fountains Software and Services Enable Projects to Create or Augment Collections Through:
1. A metadata generation service for natural language fields, including keyphrases and description/summary
2. A metadata generation service for controlled subject vocabularies/schema (LCC and LCSH)
3. A metadata extraction service for natural language and other fields when metatags are supplied on pages (e.g., titles, creators…)

Slide4:
4. A selected, rich full-text identification and extraction service.
Natural language can be retained or processed into the most significant keyphrases.
5. An Internet resource discovery service using expert guided and focused crawlers
6. Both metadata generation and resource discovery are available in semi-automated (expert refinement) and fully automated modes

Slide5: Of use to all who create and maintain portals, subject directories, catalogs or databases that contain collections of Internet resources. This software and service will help these collections scale and meet the challenges of:
* Keeping up with the growing numbers of useful collection/subject specific resources on the Internet
* The relatively small size of most collections, and searches yielding few results
* Controlling the high costs of manually created metadata

Our Services and Software are Open and Community Based:
* Data Fountains metadata generation, rich text extraction and resource discovery services are available through the DF co-op
* iVia and Data Fountains software are open source and freely available (GPL/LGPL) to the community to build with
* NSDL iVia is software/service being integrated into NSDL CI to be of assistance to NSDL associated projects

New Resource Identification Through Focused or Guided Web Crawling:
Used in Internet resource identification geared not towards the whole Internet but towards identifying resources of value to specific subject and other communities.
In comparison with large scale crawlers, focused crawling can cover specialized topics in more depth and keep the crawl fresher or more current because there is less to cover for each crawler.
Among our five crawlers are:
* Expert (or Manually) Guided Crawler (EGC)
* Nalanda iVia Focused Crawler (NiFC)

Expert (or Manually) Guided Crawler (EGC): Drills down and out from a given URL

Nalanda iVia Focused Crawler utilizes:
* Web graph or co-citation analysis to determine important sites for a subject community (the most intensely interlinked)
* Similarity analysis that compares keyphrase profiles for a prospective resource with keyphrase profiles for known high value resources in a subject
* Preferential focused crawling, which allows us to identify and follow only the most relevant links on a page. Nalanda iVia features an “apprentice learner” program that is able to determine, through cues in an HTML page, the most promising links to crawl. This makes focused crawling much more efficient by reducing the total number of links that are crawled.
* A combined HITS and PageRank crawling algorithm improves the crawling.

Following is an illustration of a small unit of a Web graph showing inter-linkage or co-citation among related Internet resources/sites:

Nalanda iVia Focused Crawler controls:

Rich, Full-Text Identification and Harvest:
Rich natural language text is that text most likely to include author intended descriptions of the themes of the resource, e.g., abstracts or introductions.
* Different resource types often have differing areas where rich text can be found and different types of rich text
* Rich text can greatly improve end-user retrieval (via proximity operators) for finely granular terms or phrases, and is one step in, and critical to, improving generation of other forms of metadata
* Simple semantic rules (i.e., “aboutness” indicators) are used to identify rich text
* Rich text can be extracted as found or processed as keyphrases in context. 1-3 pages can be kept.

Automated Record Building: Metadata Creation through Extractors/Classifiers:
* Metatag data is identified and harvested (creator, title, etc., if any). If there is none it is then generated automatically
* Rich, full-text (abstracts, “about”, summaries…) is identified and 1-5 pages harvested and processed into keyphrases
* The most significant natural language keyphrases are identified and extracted
* Annotation-like descriptions are developed from the full-text and keyphrases if not in metatags
* Classifiers are used to model and map key terms to controlled subjects (LCC and LCSH)

Controlled Subject Generation:
* LCC and LCSH applied through classification algorithms that build models mapping natural language to controlled vocabularies
* Millions of training examples used from Library catalogs.
* Difficulty: fairly dirty, inconsistently applied data, but most importantly often not enough training examples with LCC/LCSH/URL
* Algorithms used: kNearestNeighbor/Naïve Bayes, then Logistic Regression, and shortly SVM combined with HMM, among others
* Current IMLS supported research goes for three years to explore/improve specific classifiers, as well as to develop hybrids and suites of classifiers

Automated Record Building: User Metadata Creation/Extraction Options in iVia:

Automated Record Building: User Metadata Creation/Extraction Options in Data Fountains:

Automated Record Building: Semi-automated Keyphrase Generation:

Automated Record Building: Record Ready for Collection or Expert Refinement:

Slide19:
iVia - http://ivia.ucr.edu
Contact Steve Mitchell, smitch@ucr.edu
NSDL iVia - http://nsdl-ivia.ucr.edu
For account contact John Saylor, jms1@cornell.edu
http://datafountains.ucr.edu
For account contact Steve Mitchell, smitch@ucr.edu

Thank You:
Thanks to IMLS, NSF NSDL, the Library and Computing and Communications Group of the University of California at Riverside, FIPSE and the Librarians Association of the University of California for current and past support

Further reading:
Paynter, G., 2005, Developing Practical Automatic Metadata Assignment and Evaluation Tools for Internet Resources. JCDL, Denver.
http://ivia.ucr.edu/projects/publications/Paynter-2005-JCDL-Metadata-Assignment.pdf
Vannevar Bush Best Paper Award, JCDL, 2005
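
Illustrative sketch (not from the slides): the similarity analysis listed for the Nalanda iVia Focused Crawler compares the keyphrase profile of a prospective resource with profiles of known high value resources. The slides do not say which similarity measure iVia uses, so the Python below assumes plain cosine similarity over keyphrase counts. The function names, subject labels and sample keyphrases are all hypothetical.

```python
import math
from collections import Counter


def keyphrase_profile(keyphrases):
    """Build a profile: lowercased keyphrase -> number of occurrences."""
    return Counter(phrase.lower() for phrase in keyphrases)


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two keyphrase profiles, in [0.0, 1.0]."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[p] * profile_b[p] for p in shared)
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)


def best_match(candidate, known_profiles):
    """Return (score, label) for the known profile closest to the candidate."""
    return max(
        (cosine_similarity(candidate, profile), label)
        for label, profile in known_profiles.items()
    )


# Hypothetical profiles for known high value resources in two subjects.
known = {
    "entomology": keyphrase_profile(
        ["insect pests", "integrated pest management", "crop damage"]
    ),
    "astronomy": keyphrase_profile(
        ["star formation", "radio telescope", "galaxy clusters"]
    ),
}

# Hypothetical keyphrases extracted from a prospective resource.
candidate = keyphrase_profile(
    ["Integrated Pest Management", "insect pests", "field trials"]
)

score, label = best_match(candidate, known)
print(label, round(score, 3))
```

A crawler could use a score like this to decide whether a page is close enough to a subject to keep; the threshold would have to be tuned per collection.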
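
Illustrative sketch (not from the slides): the web graph analysis mentioned for the Nalanda iVia Focused Crawler treats the most intensely interlinked sites as the important ones for a subject community. The Python below is textbook HITS hub and authority scoring on a toy graph. It is not the combined HITS and PageRank variant the slides refer to, and the page names, link structure and iteration count are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Textbook HITS. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for target in targets:
                auths[target] += hubs[src]
        # Hub score: sum of authority scores of the pages linked to.
        hubs = {p: sum(auths[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hubs, auths


# Hypothetical subject community: three portals citing two resource sites.
links = {
    "portal-a": ["site-x", "site-y"],
    "portal-b": ["site-x", "site-y"],
    "portal-c": ["site-x"],
    "site-x": [],
    "site-y": [],
}

hubs, auths = hits(links)
top_authority = max(auths, key=auths.get)
print(top_authority)
```

In this toy graph the site cited by all three portals ends up with the highest authority score, which is the kind of signal a focused crawler could use to prioritize sites.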