Added: December 03, 2007

Presentation Transcript

Automated A&I: What it is and who should consider it
9:45am – 10:15am
Marjorie M.K. Hlava, President, Access Innovations, Inc.

What will we cover in 30 minutes?
● Why we do it
● What is automated indexing/abstracting
● The building blocks
● Standards
● The science of it
● What are the parts of the technologies available?
● The pros and cons

Not covering
● Automatically building a taxonomy
● XML – standardizes data
  - Does not find data

Why do we do it?
● “Our ability to create information has substantially outpaced our ability to retrieve Information” (Delphi group 2002)
● There are 250 Megs of information for every human being on earth – and growing
● Infoglut has a negative impact on productivity
● Fortune 500 companies lost $12 billion due to inability to find information in 2003
● 25 – 35% of time is spent looking for information

What to do about it?
● Thesauri control indexing
● Indexing enables search and retrieval
● “The better the controls on language used to organize research, the better the search experience” LW Moulton
● Search promotes productive use and reuse of information
● Search increases worker productivity and innovation

Indexing improves searching
● Subjects (controlled terms)
  - Skill of the indexers
● Keywords
  - Depend on the skill of the searcher

Search works IF
● The content
  - Exists digitally
  - Has good metadata
  - The metadata reflects the content and the user needs
● The users
  - Are trained in using the search tool
  - Can create a query and interpret the result
  - Like the user interface

So….
● We need good metadata
● Good indexing
● How to get it?
● Consistent
● Right depth
● User oriented
● Comprehensive
● Precision saves time!

Basic areas of Automatic Language Processing (ALP)
● Auto Translation
● Auto Indexing
● Auto Abstracting
● Artificial Intelligence
● Searching
● Spell Checking
● Semantic Web
● Natural Language Processing (NLP)
● Computational Linguistics

Natural Language Processing
● Syntactic
● Semantic
● Morphological
● Phraseological
● Lemmatization (stemming)
● Statistical
● Grammatical
● Common Sense

Statistical
● Cluster analysis
● Neural networks
● Co-occurrence
● Bayesian Inference
● Etc.

Basic to all
● Boolean
● Inverted indexes (those computer indexes)

Word and Term Parsing
● Stemming
  - -ing, -ed, -es, -’s, -s’, etc.
● Depluralization
● Truncation
  - Left and right
● Wild cards
  - Organi*ation
● Variant Spellings
  - Centre, center
● Hyphens

Parsing
● Code stripping
  - HTML, MS Word, Photocomp codes, etc.
● Punctuation
  - Useful in metadata identification
● Stop words

What is automated indexing/abstracting?
● Two very different things
  - Abstracting – digesting an object into a short descriptive unit
  - Indexing – applying terms to an object from a controlled vocabulary or thesaurus

Thesaurus and Indexing Standards – ISO
● ISO 2788:1986 Documentation - Guidelines for the establishment and development of monolingual thesauri
● ISO 5964:1985 Documentation - Guidelines for the establishment and development of multilingual thesauri
● ISO 5963:1985 Documentation - Methods for examining documents, determining their subjects, and selecting indexing terms
● ISO 999:1996 Information and documentation - Guidelines for the content, organization and presentation of indexes

ISO TC 46/SC 9
● Information and Documentation - Identification and Description
● TC 46 is ISO's Technical Committee (TC) for information and documentation standards.
● SC 9 is the TC 46 Subcommittee (SC) that develops and maintains ISO standards on the identification and description of information resources.

Thesaurus and Indexing Standards – ANSI/NISO
● ANSI/NISO Z39.19 - 2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
● NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
● NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices by James D. Anderson

Reports to use
● Report on the Workshop on Electronic Thesauri, November 4-5, 1999
  http://www.niso.org/news/events_workshops/thes99rprt.html
● Final Report to the ALCTS/CCS Subject Analysis Committee: Subcommittee on Subject Relationships/Reference Structures June 1997
  http://archive.ala.org/alcts/organization/ccs/sac/rpt97rev.html

Other links
● http://esw.w3.org/topic/SkosDev/ThesaurusLinks/XmlFormats
● MARC-21 XMLSchema.
● Zthes Z39.50 profile for thesaurus navigation (2001).
● TML thesaurus markup language (1999).
● ADL Thesaurus Protocol XML formats (2002).
● MeSH XML format (2001).
● GEMET XML format (2003).
● APAIS XML thesaurus format, an extension of Zthes (2000).
● Open University thesaurus schemas (2002).
● Soergel XML thesaurus specification (2001).

Automatic abstracting
● Works well in some areas:
  - Defined source fields (like HTML headers)
  - Field formatted data
  - XML feeds
  - Simple concepts

View Source

HTML Header

Source text
<title>Access Innovations - Leader in taxonomy development, automated indexing, knowledge capital management</title>
<style type="text/css"> <!-- .style1 {font-weight: bold} --> </style>
</head>
<meta name="author" content="root">
<meta name="copyright" content="© 1996 - 2005 Access Innovations, Inc.">
<meta name="keywords" content="content management, information management, database construction, database design, data processing, SGML, HTML, XML, legacy data, data conversion, data capture, automatic categorization system, machine aided indexer, markup, abstracting, indexing, document management, thesaurus development, taxonomy, taxonomies, ontology, ontologies, topic tree, pre-coordinate indexing, post-coordinate indexing, classification, hierarchy, thesaurus, thesauri, vocabulary control">
<meta name="description" content="Access Innovations leads in machine aided indexing software, thesaurus construction tools, taxonomy management software, abstracting and indexing, XML markup, scanning, text management, project management">
<meta name="robots" content="index, follow">

Resulting auto abstract
Access Innovations**leader in taxonomy development, automated indexing, knowledge capital management especially in content management, information management, database construction, database design, data processing, SGML, HTML, XML, legacy data, data conversion, data capture, automatic categorization system, machine aided indexer, markup, abstracting, indexing, document management, thesaurus development, taxonomy, taxonomies, ontology, ontologies, topic tree, pre-coordinate indexing, post-coordinate indexing, classification, hierarchy, thesaurus, thesauri, vocabulary control. Access Innovations leads in machine aided indexing software, thesaurus construction tools, taxonomy management software, abstracting and indexing, XML markup, scanning, text management, project management

Main strategies for auto indexing
● Rules Based – two types
  - IF THEN – sequel rules
  - IF ELSE IF – syntactic rules
  - Not dependent on the collection
● Co-occurrence
  - Training sets
  - Training
  - Compares to already classified data

Manual indexing – Pros
● Understand the conceptual nuances
● Understand and map to user expectations
● High quality
● High precision
● Use for text and images and AV

Manual indexing – Cons
● Slow
● Inconsistent over time (editorial drift)
● Not scalable
● Highly skilled process – lots of training
● Difficult to manage centrally or to audit behaviors
● Resource intensive (people)
● High cost

Automatic indexing – Pros
● Cost savings (Check the ROI)
● Few resources needed
● Excellent logical consistency (Rules)
● Very scalable
● Can be centrally managed
● Frees staff for other tasks
● Fast

Automatic indexing – Cons
● Quality
● Consistency (co-occurrence)
● IT overhead
● Exceptions hard to manage
● Difficult to extend or train (co-occurrence)
● Might not be synchronized with user behavior

Co-occurrence – Pros
● Possible cost savings
● Possible speed for creation
● Fewer people involved in creation
● Can add Sequel rules
● High tech value to management

Co-occurrence – Cons
● Requires a training set
  - 10 – 35 “perfect records” per term
  - Main terms and synonyms need training
● Can’t deal with exceptions (except with rules)
● Needs homogeneous collections
● Needs a large collection for ROI
● Depends on full text
● Consistency variations
● New terms mean retraining the entire set
● Dependence on IT or the vendor
● Training of editors to use the results

Rules based indexing – Pros
● Easy to modify
● High consistency
● Adaptable
● Word game for intuitive and analytical minds
● Basic rules generated automatically (less than a two hour set up and run time)
● Modify ~20% of rules for greater accuracy
● Takes seconds to change a rule
● 10 seconds to insert a condition (compared to re-collecting training sets)

Rules based indexing – Cons
● Perceived time to build the rules
● Requires editorial attention (not IT)
● Need to evaluate statistics periodically

How was the M.A.I. evaluation conducted?
Phase I: An objective (quantitative) analysis of the M.A.I. statistics generated during testing.
● Data Harmony supplied with electronic format articles and NAL Thesaurus
● Indexing staff trained to use MAIstro
● Indexers tested sample articles from their individual assignments
● Statistics gathered and analyzed
Phase II: A subjective analysis of indexers’ opinions/feelings regarding M.A.I.
● Two-part questionnaire
  - 16 statements requiring responses using an agreement scale
  - 6 essay questions requiring verbal responses
Evaluation carried out by Geoffrey Yeadon @ NAL

Usability testing
● Training by the vendor
● Software usability (user friendliness)
● Software speed
● Applicability/Utility
● Use to assign terms
● Option to use assisted mode
● Worth the time invested
● Quality of indexing
● Use for enhancing production
● Accelerate the indexing process
● Human intervention needed – optional

Factors in the technology evolution
● 1964 COSATI Report
● ASIST Literature
● Cold War
● Increased computer power
● The Internet

What changes might be expected in the near future?
● Indexing taxonomy available for use in the search
● Query expansion using the thesaurus
● More new systems
● Merging of technologies
● Semantic web will disappear
● More synonym rings
● OWL and Z39.19 will merge – cross walk

Thank you for coming!
Glossary Available
Marjorie M. K. Hlava
Access Innovations / Data Harmony
mhlava@accessinn.com
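
Illustrative sketch: query expansion with synonym rings
The "query expansion using the thesaurus" and "more synonym rings" items from the near-future slide can be pictured with a small sketch. This is not Data Harmony or M.A.I. code. The ring contents, the function names (build_lookup, expand_query, to_boolean), and the Boolean output syntax are all assumptions made up for illustration. The idea is only that each query word is replaced by the whole ring of equivalent terms it belongs to, and the rings are then combined with Boolean operators.

```python
from typing import Dict, List, Set

# Hypothetical synonym rings: every term in a ring is treated as equivalent.
SYNONYM_RINGS: List[Set[str]] = [
    {"thesaurus", "thesauri", "controlled vocabulary"},
    {"taxonomy", "taxonomies", "classification"},
]


def build_lookup(rings: List[Set[str]]) -> Dict[str, Set[str]]:
    """Map each (lower-cased) term to the full ring it belongs to."""
    lookup: Dict[str, Set[str]] = {}
    for ring in rings:
        normalized = {term.lower() for term in ring}
        for term in normalized:
            lookup[term] = normalized
    return lookup


def expand_query(query: str, lookup: Dict[str, Set[str]]) -> List[List[str]]:
    """Return one sorted group of equivalent terms per query word.

    Words that are not in any ring stay as a single-term group.
    Only single-word query tokens are matched in this sketch.
    """
    return [sorted(lookup.get(word, {word})) for word in query.lower().split()]


def to_boolean(groups: List[List[str]]) -> str:
    """Render the groups as OR within a ring, AND between query words."""
    clauses = []
    for group in groups:
        terms = [f'"{t}"' if " " in t else t for t in group]
        clauses.append("(" + " OR ".join(terms) + ")" if len(terms) > 1 else terms[0])
    return " AND ".join(clauses)


if __name__ == "__main__":
    lookup = build_lookup(SYNONYM_RINGS)
    print(to_boolean(expand_query("taxonomy software", lookup)))
```

A searcher who types a single variant of a term then also retrieves records indexed under the other members of the ring, which is the point of keeping the thesaurus available at search time.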