
05Hlavaoverview

Added: December 30, 2007 This presentation is Public
Presentation Transcript

Automated A&I: What it is and who should consider it
9:45am – 10:15am
Marjorie M.K. Hlava, President, Access Innovations, Inc.


What will we cover in 30 minutes?
● Why we do it
● What is automated indexing/abstracting
● The building blocks
● Standards
● The science of it
● What are the parts of the technologies available?
● The pros and cons


Not covering
● Automatically building a taxonomy
● XML – standardizes data
  - Does not find data


Why do we do it?
● “Our ability to create information has substantially outpaced our ability to retrieve information” (Delphi Group, 2002)
● There are 250 Megs of information for every human being on earth, and growing
● Infoglut has a negative impact on productivity
● Fortune 500 companies lost $12 billion due to inability to find information in 2003
● 25–35% of time is spent looking for information


What to do about it?
● Thesauri control indexing
● Indexing enables search and retrieval
● “The better the controls on language used to organize research, the better the search experience” (LW Moulton)
● Search promotes productive use and reuse of information
● Search increases worker productivity and innovation


Indexing improves searching
● Subjects (controlled terms)
  - Depend on the skill of the indexers
● Keywords
  - Depend on the skill of the searcher


Search works IF
● The content
  - Exists digitally
  - Has good metadata
  - The metadata reflects the content and the user needs
● The users
  - Are trained in using the search tool
  - Can create a query and interpret the result
  - Like the user interface


So…
● We need good metadata
● Good indexing
● How to get it?
  - Consistent
  - Right depth
  - User oriented
  - Comprehensive
● Precision saves time!


Basic areas of Automatic Language Processing (ALP)
● Auto Translation
● Auto Indexing
● Auto Abstracting
● Artificial Intelligence
● Searching
● Spell Checking
● Semantic Web
● Natural Language Processing (NLP)
● Computational Linguistics


Natural Language Processing
● Syntactic
● Semantic
● Morphological
● Phraseological
● Lemmatization (stemming)
● Statistical
● Grammatical
● Common Sense


Statistical
● Cluster analysis
● Neural networks
● Co-occurrence
● Bayesian Inference
● Etc.


Basic to all
● Boolean
● Inverted indexes (those computer indexes)
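The two building blocks named on this slide can be sketched in a few lines. This is a minimal illustration, not the design of any particular search engine: the sample documents and the function names (`build_inverted_index`, `boolean_and`, `boolean_or`, `boolean_not`) are assumptions made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *words):
    """Ids of documents that contain every query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *words):
    """Ids of documents that contain at least one query word."""
    return set().union(*(index.get(w.lower(), set()) for w in words))

def boolean_not(index, all_ids, word):
    """Ids of documents that do not contain the word."""
    return set(all_ids) - index.get(word.lower(), set())

# Hypothetical three-document collection.
docs = {
    1: "automated indexing with a thesaurus",
    2: "manual indexing by trained indexers",
    3: "automated abstracting of html headers",
}
index = build_inverted_index(docs)
print(boolean_and(index, "automated", "indexing"))
```

The index is built once; each Boolean operator is then a set operation on the posting lists, which is why both ideas sit underneath the other techniques in this talk.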


Word and Term Parsing
● Stemming
  - -ing, -ed, -es, -’s, -s’, etc.
● Depluralization
● Truncation
  - Left and right
● Wild cards
  - Organi*ation
● Variant Spellings
  - Centre, center
● Hyphens
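A rough sketch of these parsing steps follows, assuming a deliberately crude suffix stripper rather than a real stemming algorithm. The suffix list is taken from the slide; the variant table, the hyphen policy, and the function names are hypothetical.

```python
import fnmatch

# Suffixes from the slide, ordered so "-es" is tried before "-s".
SUFFIXES = ["ing", "es", "ed", "'s", "s'", "s"]

def stem(word):
    """Strip one suffix, but only if at least three characters remain."""
    word = word.lower().replace("’", "'")  # treat curly and straight apostrophes alike
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def wildcard_match(pattern, vocabulary):
    """Return vocabulary words matching a pattern; '*' stands for any run of characters."""
    return [w for w in vocabulary if fnmatch.fnmatchcase(w.lower(), pattern.lower())]

# Hypothetical variant-spelling table: variant -> preferred spelling.
VARIANTS = {"centre": "center"}

def normalize(word):
    """Fold hyphens (one possible policy) and map variant spellings."""
    w = word.lower().replace("-", "")
    return VARIANTS.get(w, w)

print(stem("indexing"), stem("indexes"), stem("indexer’s"))
print(wildcard_match("Organi*ation", ["organization", "organisation", "organic"]))
```

Right truncation is the pattern `index*` and left truncation is `*ology`. A stripper this simple overstems some words (it shortens "thesaurus", for instance), which is one reason production systems pair stemming with exception lists.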


Parsing
● Code stripping
  - HTML, MS Word, Photocomp codes, etc.
● Punctuation
  - Useful in metadata identification
● Stop words
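For HTML input, the three steps might look like the sketch below. It uses Python's standard `html.parser` module; the stop-word list is a tiny hypothetical sample, and here punctuation is simply discarded, whereas the slide notes it can also be a useful cue for finding metadata.

```python
from html.parser import HTMLParser
import string

class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def parse(html):
    """Strip markup, replace punctuation with spaces, drop stop words."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.translate(table).lower().split()
    return [w for w in words if w not in STOP_WORDS]

print(parse("<p>The <b>leader</b> in taxonomy development, and indexing.</p>"))
```

This sketch keeps the contents of `<script>` and `<style>` elements as text, so a fuller version would skip those. MS Word and photocomposition codes would need their own strippers.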


What is automated indexing/abstracting?
● Two very different things
● Abstracting – digesting an object into a short descriptive unit
● Indexing – applying terms to an object from a controlled vocabulary or thesaurus
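To make the indexing half concrete, here is a minimal sketch of applying terms from a controlled vocabulary: each entry term found in the text is mapped to its preferred term. The vocabulary entries and the function name `assign_terms` are invented for the illustration and do not come from any real thesaurus.

```python
# Hypothetical controlled vocabulary: entry term (lowercase) -> preferred term.
THESAURUS = {
    "automatic indexing": "Automated indexing",
    "automated indexing": "Automated indexing",
    "machine aided indexing": "Automated indexing",
    "thesauri": "Thesauri",
    "thesaurus": "Thesauri",
}

def assign_terms(text):
    """Return the preferred terms whose entry terms occur in the text."""
    lowered = " ".join(text.lower().split())  # collapse whitespace
    found = {preferred for entry, preferred in THESAURUS.items() if entry in lowered}
    return sorted(found)

print(assign_terms("Machine aided indexing relies on a thesaurus."))
```

Plain substring matching like this can fire inside longer words and knows nothing about context, which is the gap that rules or training sets are meant to close.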


Thesaurus and Indexing Standards – ISO
● ISO 2788:1986 Documentation - Guidelines for the establishment and development of monolingual thesauri
● ISO 5964:1985 Documentation - Guidelines for the establishment and development of multilingual thesauri
● ISO 5963:1985 Documentation - Methods for examining documents, determining their subjects, and selecting indexing terms
● ISO 999:1996 Information and documentation - Guidelines for the content, organization and presentation of indexes


ISO TC 46/SC 9
● Information and Documentation - Identification and Description
● TC 46 is ISO's Technical Committee (TC) for information and documentation standards.
● SC 9 is the TC 46 Subcommittee (SC) that develops and maintains ISO standards on the identification and description of information resources.


Thesaurus and Indexing Standards – ANSI/NISO
● ANSI/NISO Z39.19 - 2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
● NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
● NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices by James D. Anderson


Reports to use
● Report on the Workshop on Electronic Thesauri, November 4-5, 1999
  http://www.niso.org/news/events_workshops/thes99rprt.html
● Final Report to the ALCTS/CCS Subject Analysis Committee: Subcommittee on Subject Relationships/Reference Structures, June 1997
  http://archive.ala.org/alcts/organization/ccs/sac/rpt97rev.html


Other links
http://esw.w3.org/topic/SkosDev/ThesaurusLinks/XmlFormats
● MARC-21 XMLSchema.
● Zthes Z39.50 profile for thesaurus navigation (2001).
● TML thesaurus markup language (1999).
● ADL Thesaurus Protocol XML formats (2002).
● MeSH XML format (2001).
● GEMET XML format (2003).
● APAIS XML thesaurus format, an extension of Zthes (2000).
● Open University thesaurus schemas (2002).
● Soergel XML thesaurus specification (2001).


Automatic abstracting
● Works well in some areas:
  - Defined source fields, like HTML headers
  - Field formatted data
  - XML feeds
  - Simple concepts
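When the source has defined fields, an abstract can be assembled by reading those fields directly. The sketch below does this for an HTML header with the standard `html.parser` module. It is one plausible approach, not the tool used in the talk; the sample page, class name, and choice of fields (title, description, keywords) are assumptions.

```python
from html.parser import HTMLParser

class HeaderFields(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def auto_abstract(html):
    """Join title, description, and keywords into one short descriptive unit."""
    parser = HeaderFields()
    parser.feed(html)
    parser.close()
    parts = [parser.title.strip(),
             parser.meta.get("description", ""),
             parser.meta.get("keywords", "")]
    return " ".join(p for p in parts if p)

# Hypothetical page used only for this example.
SAMPLE_PAGE = (
    '<html><head><title>Example Co</title>'
    '<meta name="description" content="Indexing services.">'
    '<meta name="Keywords" content="taxonomy, thesaurus">'
    '</head><body><p>Body text is ignored.</p></body></html>'
)
print(auto_abstract(SAMPLE_PAGE))
```

The result is only as good as the header fields themselves, which is why this works for well-maintained pages and field-formatted feeds and much less well for free text.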


View Source


HTML Header


Source text
Access Innovations - Leader in taxonomy development, automated indexing, knowledge capital management


Resulting auto abstract
Access Innovations**leader in taxonomy development, automated indexing, knowledge capital management especially in content management, information management, database construction, database design, data processing, SGML, HTML, XML, legacy data, data conversion, data capture, automatic categorization system, machine aided indexer, markup, abstracting, indexing, document management, thesaurus development, taxonomy, taxonomies, ontology, ontologies, topic tree, pre-coordinate indexing, post-coordinate indexing, classification, hierarchy, thesaurus, thesauri, vocabulary control. Access Innovations leads in machine aided indexing software, thesaurus construction tools, taxonomy management software, abstracting and indexing, XML markup, scanning, text management, project management


Main strategies for auto indexing
● Rules based – two types
  - IF THEN – sequel rules
  - IF ELSE IF – syntactic rules
  - Not dependent on the collection
● Co-occurrence
  - Training sets
  - Training
  - Compares to already classified data
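The rules based strategy can be illustrated with a toy rule table. The slide does not define its two rule types in detail, so this is a generic reading of IF THEN and IF / ELSE IF chains, not the rule language of any product; the rules, terms, and function name are hypothetical.

```python
import re

# Hypothetical rules: (trigger word, [(context words, term to assign)], default term).
# An empty condition list is a plain IF THEN rule; a non-empty one is an
# IF ... ELSE IF chain tried in order.
RULES = [
    ("mercury", [({"planet", "orbit"}, "Mercury (planet)"),
                 ({"metal", "toxic"}, "Mercury (element)")], None),
    ("thesaurus", [], "Thesauri"),
]

def index_by_rules(text):
    """Assign terms by evaluating each rule against the words of the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    terms = []
    for trigger, conditions, default in RULES:
        if trigger not in words:
            continue
        for context, term in conditions:
            if words & context:          # first matching condition wins
                terms.append(term)
                break
        else:                            # no condition matched (or none given)
            if default:
                terms.append(default)
    return terms

print(index_by_rules("Mercury is a toxic metal; see the thesaurus"))
```

Because the rules look only at the document in hand, they do not depend on the rest of the collection, and changing the behavior means editing one entry in the table.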


Manual indexing – Pros
● Understand the conceptual nuances
● Understand and map to user expectations
● High quality
● High precision
● Use for text and images and AV


Manual indexing – Cons
● Slow
● Inconsistent over time (editorial drift)
● Not scalable
● Highly skilled process – lots of training
● Difficult to manage centrally or to audit behaviors
● Resource intensive (people)
● High cost


Automatic indexing – Pros
● Cost savings (Check the ROI)
● Few resources needed
● Excellent logical consistency (Rules)
● Very scalable
● Can be centrally managed
● Frees staff for other tasks
● Fast


Automatic indexing – Cons
● Quality
● Consistency (co-occurrence)
● IT overhead
● Exceptions hard to manage
● Difficult to extend or train (co-occurrence)
● Might not be synchronized with user behavior


Co-occurrence – Pros
● Possible cost savings
● Possible speed for creation
● Fewer people involved in creation
● Can add Sequel rules
● High tech value to management


Co-occurrence – Cons
● Requires a training set
  - 10 – 35 “perfect records” per term
● Main terms and synonyms need training
● Can’t deal with exceptions (except with rules)
● Needs homogeneous collections
● Needs a large collection for ROI
● Depends on full text
● Consistency variations
● New terms mean retraining the entire set
● Dependence on IT or the vendor
● Training of editors to use the results
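A toy version of the co-occurrence approach shows where several of these costs come from. It counts which words appear alongside which index terms in already classified records, then scores new text against those counts. The scoring formula, the four training records, and the function names are assumptions for illustration; commercial systems use more elaborate statistics.

```python
from collections import Counter, defaultdict

def train(records):
    """records: iterable of (text, [terms]). Returns word -> Counter of terms."""
    cooc = defaultdict(Counter)
    for text, terms in records:
        for word in set(text.lower().split()):
            for term in terms:
                cooc[word][term] += 1
    return cooc

def suggest(cooc, text, top_n=1):
    """Score each term by the share of its co-occurrences with the text's words."""
    scores = Counter()
    for word in set(text.lower().split()):
        if word in cooc:
            total = sum(cooc[word].values())
            for term, n in cooc[word].items():
                scores[term] += n / total
    return [term for term, _ in scores.most_common(top_n)]

# Hypothetical, already classified training records.
TRAINING = [
    ("soil erosion on farm land", ["Soil"]),
    ("soil nutrients and crop yield", ["Soil"]),
    ("cattle feed and dairy herds", ["Livestock"]),
    ("dairy cattle breeding", ["Livestock"]),
]
model = train(TRAINING)
print(suggest(model, "erosion of farm soil"))
```

A term that never appears in the training records can never be suggested, so adding one means collecting classified examples for it and training again, and text made of unseen words gets no suggestion at all.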


Rules based indexing – Pros
● Easy to modify
● High consistency
● Adaptable
● Word game for intuitive and analytical minds
● Basic rules generated automatically (less than a two hour set up and run time)
● Modify ~20% of rules for greater accuracy
● Takes seconds to change a rule
● 10 seconds to insert a condition (compared to recollecting training sets)


Rules based indexing – Cons
● Perceived time to build the rules
● Requires editorial attention (not IT)
● Need to evaluate statistics periodically


How was the M.A.I. evaluation conducted?
Phase I: An objective (quantitative) analysis of the M.A.I. statistics generated during testing.
● Data Harmony supplied with electronic format articles and NAL Thesaurus
● Indexing staff trained to use MAIstro
● Indexers tested sample articles from their individual assignments
● Statistics gathered and analyzed
Phase II: A subjective analysis of indexers’ opinions/feelings regarding M.A.I.
● Two-part questionnaire
  - 16 statements requiring responses using an agreement scale
  - 6 essay questions requiring verbal responses
Evaluation carried out by Geoffrey Yeadon @ NAL


Usability testing
● Training by the vendor
● Software usability (user friendliness)
● Software speed
● Applicability/Utility
● Use to assign terms
● Option to use assisted mode
● Worth the time invested
● Quality of indexing
● Use for enhancing production
● Accelerate the indexing process
● Human intervention needed – optional


Factors in the technology evolution
● 1964 COSATI Report
● ASIST Literature
● Cold War
● Increased computer power
● The Internet


What changes might be expected in the near future?
● Indexing taxonomy available for use in the search
● Query expansion using the thesaurus
● More new systems
● Merging of technologies
● Semantic web will disappear
● More synonym rings
● OWL and Z39.19 will merge – cross walk
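Query expansion with synonym rings, two of the changes listed here, can be sketched as follows. A synonym ring is treated as a set of equivalent terms, and a query that matches any member is expanded to an OR of all of them. The rings and the quoted OR syntax are hypothetical and not tied to any search engine.

```python
# Hypothetical synonym rings: every term in a ring is treated as equivalent.
SYNONYM_RINGS = [
    {"auto indexing", "automatic indexing", "automated indexing", "machine aided indexing"},
    {"thesaurus", "controlled vocabulary"},
]

def expand_query(query):
    """Expand a query to an OR of its ring members, or return it quoted as is."""
    q = query.lower().strip()
    for ring in SYNONYM_RINGS:
        if q in ring:
            return " OR ".join(f'"{term}"' for term in sorted(ring))
    return f'"{q}"'

print(expand_query("Thesaurus"))
```

A full thesaurus would also let the expansion follow broader, narrower, and related term links; a ring carries only equivalence.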


Thank you for coming!
Glossary Available
Marjorie M. K. Hlava
Access Innovations / Data Harmony
mhlava@accessinn.com