Automated A&I: What it is and who should consider it: 

9:45am – 10:15am
Marjorie M.K. Hlava, President, Access Innovations, Inc.

What will we cover in 30 minutes?: 

● Why we do it
● What is automated indexing/abstracting
● The building blocks
● Standards
● The science of it
● What are the parts of the technologies available?
● The pros and cons

Not covering: 

● Automatically building a taxonomy
● XML – standardizes data
 - Does not find data

Why do we do it?: 

● “Our ability to create information has substantially outpaced our ability to retrieve information” (Delphi Group, 2002)
● There are 250 megabytes of information for every human being on earth, and growing
● Infoglut has a negative impact on productivity
● Fortune 500 companies lost $12 billion due to inability to find information in 2003
● 25 – 35% of time is spent looking for information

What to do about it?: 

● Thesauri control indexing
● Indexing enables search and retrieval
● “The better the controls on language used to organize research, the better the search experience” (LW Moulton)
● Search promotes productive use and reuse of information
● Search increases worker productivity and innovation

Indexing improves searching: 

● Subjects (controlled terms)
 - Skill of the indexers
● Keywords
 - Depend on the skill of the searcher

Search works IF: 

● The content
 - Exists digitally
 - Has good metadata
 - The metadata reflects the content and the user needs
● The users
 - Are trained in using the search tool
 - Can create a query and interpret the result
 - Like the user interface

So….: 

● We need good metadata
● Good indexing
● How to get it?
● Consistent
● Right depth
● User oriented
● Comprehensive
● Precision saves time!

Basic areas of Automatic Language Processing (ALP): 

● Auto Translation
● Auto Indexing
● Auto Abstracting
● Artificial Intelligence
● Searching
● Spell Checking
● Semantic Web
● Natural Language Processes (NLP)
● Computational Linguistics

Natural Language Processing: 

● Syntactic
● Semantic
● Morphological
● Phraseological
● Lemmatization (stemming)
● Statistical
● Grammatical
● Common Sense

Statistical: 

● Cluster analysis
● Neural networks
● Co-occurrence
● Bayesian Inference
● Etc.

Basic to all: 

● Boolean
● Inverted indexes (those computer indexes)
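As an illustration of the two building blocks named on this slide, here is a minimal sketch of an inverted index with Boolean lookups. The sample documents and all function names are assumptions made for the example; they do not come from the presentation or from any particular product.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, a, b):
    """Documents containing both words."""
    return index.get(a, set()) & index.get(b, set())


def boolean_or(index, a, b):
    """Documents containing either word."""
    return index.get(a, set()) | index.get(b, set())


def boolean_not(index, a, b):
    """Documents containing a but not b."""
    return index.get(a, set()) - index.get(b, set())


# Hypothetical three-document collection.
docs = {
    1: "automated indexing with a thesaurus",
    2: "manual indexing by trained staff",
    3: "automated abstracting of web pages",
}
index = build_inverted_index(docs)
```

Each Boolean operator reduces to a set operation on the posting sets, which is why the inverted index underlies the other techniques in the talk.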

Word and Term Parsing: 

● Stemming
 - -ing, -ed, -es, -’s, -s’, etc.
● Depluralization
● Truncation
 - Left and right
● Wild cards
 - Organi*ation
● Variant spellings
 - Centre, center
● Hyphens
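The word-level operations listed on this slide can be sketched roughly as follows. This is a deliberately naive illustration, not the method used by any specific indexing tool: the suffix list comes from the slide, while the variant-spelling table, the minimum stem length, and all function names are assumptions.

```python
from fnmatch import fnmatchcase

# Suffixes from the slide, ordered so "-es" is tried before "-s".
SUFFIXES = ("ing", "ed", "es", "'s", "s'", "s")

# Hypothetical variant-spelling table; a real one would come from the thesaurus.
VARIANTS = {"centre": "center", "organisation": "organization"}


def naive_stem(word, min_stem=3):
    """Strip one listed suffix, keeping at least min_stem characters."""
    word = word.lower().replace("\u2019", "'")  # fold curly apostrophes
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def normalize(word):
    """Drop hyphens and map variant spellings to one preferred form."""
    word = word.lower().replace("-", "")
    return VARIANTS.get(word, word)


def wildcard_match(pattern, word):
    """Match left, right, or internal wild cards such as organi*ation."""
    return fnmatchcase(word.lower(), pattern.lower())
```

A suffix stripper this simple over-stems some words (for example, it would cut the final "s" from "thesaurus"), which is one reason production systems rely on exception lists or dictionary-based lemmatization.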

Parsing: 

● Code stripping
 - HTML, MS Word, Photocomp codes, etc.
● Punctuation
 - Useful in metadata identification
● Stop words
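A minimal sketch of this parsing stage, assuming HTML input only: markup codes are stripped first, then punctuation, then stop words. The tiny stop list and the function names are hypothetical, and MS Word or Photocomp codes would need their own strippers.

```python
import string
from html.parser import HTMLParser

# Small illustrative stop list; production lists are much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_codes(markup):
    """Return the text content of markup with tags removed."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def parse_terms(markup):
    """Strip codes, then punctuation, then stop words."""
    text = strip_codes(markup).lower()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return [w for w in text.translate(table).split() if w not in STOP_WORDS]
```

Because the slide notes that punctuation is useful in metadata identification, a fuller pipeline would probably inspect punctuation before discarding it; this sketch simply throws it away.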

What is automated indexing/abstracting?: 

● Two very different things
● Abstracting – digesting an object into a short descriptive unit
● Indexing – applying terms to an object from a controlled vocabulary or thesaurus

Thesaurus and Indexing Standards – ISO: 

● ISO 2788:1986 Documentation - Guidelines for the establishment and development of monolingual thesauri
● ISO 5964:1985 Documentation - Guidelines for the establishment and development of multilingual thesauri
● ISO 5963:1985 Documentation - Methods for examining documents, determining their subjects, and selecting indexing terms
● ISO 999:1996 Information and documentation - Guidelines for the content, organization and presentation of indexes

ISO TC 46/SC 9: 

Information and Documentation - Identification and Description

TC 46 is ISO's Technical Committee (TC) for information and documentation standards. SC 9 is the TC 46 Subcommittee (SC) that develops and maintains ISO standards on the identification and description of information resources.

Thesaurus and Indexing Standards – ANSI/NISO: 

● ANSI/NISO Z39.19 - 2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
● NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
● NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices by James D. Anderson

Reports to use: 

● Report on the Workshop on Electronic Thesauri, November 4-5, 1999
 - http://www.niso.org/news/events_workshops/thes99rprt.html
● Final Report to the ALCTS/CCS Subject Analysis Committee: Subcommittee on Subject Relationships/Reference Structures, June 1997
 - http://archive.ala.org/alcts/organization/ccs/sac/rpt97rev.html

Other links: 

http://esw.w3.org/topic/SkosDev/ThesaurusLinks/XmlFormats

● MARC-21 XMLSchema.
● Zthes Z39.50 profile for thesaurus navigation (2001).
● TML thesaurus markup language (1999).
● ADL Thesaurus Protocol XML formats (2002).
● MeSH XML format (2001).
● GEMET XML format (2003).
● APAIS XML thesaurus format, an extension of Zthes (2000).
● Open University thesaurus schemas (2002).
● Soergel XML thesaurus specification (2001).

Automatic abstracting: 

● Works well in some areas:
 - Defined source fields, like HTML headers
 - Field formatted data
 - XML feeds
 - Simple concepts

View Source: 

HTML Header: 

Source text: 

<title>Access Innovations - Leader in taxonomy development, automated indexing, knowledge capital management</title>
<style type="text/css">
<!--
.style1 {font-weight: bold}
-->
</style>
</head>
<meta name="author" content="root">
<meta name="copyright" content="&copy; 1996 - 2005 Access Innovations, Inc.">
<meta name="keywords" content="content management, information management, database construction, database design, data processing, SGML, HTML, XML, legacy data, data conversion, data capture, automatic categorization system, machine aided indexer, markup, abstracting, indexing, document management, thesaurus development, taxonomy, taxonomies, ontology, ontologies, topic tree, pre-coordinate indexing, post-coordinate indexing, classification, hierarchy, thesaurus, thesauri, vocabulary control">
<meta name="description" content="Access Innovations leads in machine aided indexing software, thesaurus construction tools, taxonomy management software, abstracting and indexing, XML markup, scanning, text management, project management">
<meta name="robots" content="index, follow">

Resulting auto abstract: 

Access Innovations - leader in taxonomy development, automated indexing, knowledge capital management especially in content management, information management, database construction, database design, data processing, SGML, HTML, XML, legacy data, data conversion, data capture, automatic categorization system, machine aided indexer, markup, abstracting, indexing, document management, thesaurus development, taxonomy, taxonomies, ontology, ontologies, topic tree, pre-coordinate indexing, post-coordinate indexing, classification, hierarchy, thesaurus, thesauri, vocabulary control. Access Innovations leads in machine aided indexing software, thesaurus construction tools, taxonomy management software, abstracting and indexing, XML markup, scanning, text management, project management
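The abstract on this slide appears to be assembled from three header fields: the title, the keywords, and the description. A minimal sketch of that assembly follows, assuming the joining phrase "especially in" seen in the slide's output. The class and function names are hypothetical, and this is not presented as the vendor's actual implementation.

```python
from html.parser import HTMLParser


class HeaderFields(HTMLParser):
    """Collect the <title> text and named <meta> content values."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def auto_abstract(html):
    """Join title, keywords, and description into one abstract."""
    fields = HeaderFields()
    fields.feed(html)
    fields.close()
    parts = [fields.title.strip()]
    if fields.meta.get("keywords"):
        parts.append("especially in " + fields.meta["keywords"].strip() + ".")
    if fields.meta.get("description"):
        parts.append(fields.meta["description"].strip())
    return " ".join(p for p in parts if p)
```

This works only because the source fields are well defined, which is the slide's point: abstracting from structured headers is far easier than digesting free text.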

Main strategies for auto indexing: 

● Rules based – two types
 - IF THEN – sequel rules
 - IF ELSE IF – syntactic rules
 - Not dependent on the collection
● Co-occurrence
 - Training sets
 - Training
 - Compares to already classified data
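To make the rules-based strategy concrete, here is a small sketch of IF THEN and IF / ELSE IF rules that map text words to controlled terms. The rule format, the sample rules, and the names are all assumptions for illustration; the presentation does not show the actual rule syntax of any product.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """IF trigger appears (and every word in requires appears) THEN assign term."""
    trigger: str
    term: str
    requires: tuple = ()


# Hypothetical rules; the two "mercury" rules form an IF / ELSE IF chain.
RULES = [
    Rule("thesauri", "Thesaurus"),
    Rule("mercury", "Mercury (planet)", requires=("orbit",)),
    Rule("mercury", "Mercury (element)"),
]


def index_text(text, rules=None):
    """Assign controlled terms; per trigger, only the first matching rule fires."""
    rules = RULES if rules is None else rules
    words = set(text.lower().split())
    assigned, fired = [], set()
    for rule in rules:
        if rule.trigger in fired or rule.trigger not in words:
            continue
        if all(w in words for w in rule.requires):
            assigned.append(rule.term)
            fired.add(rule.trigger)
    return assigned
```

Nothing here depends on a training collection: changing behavior means editing one rule, which matches the slide's claim that rules are not dependent on the collection.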

Manual indexing – Pros: 

● Understand the conceptual nuances
● Understand and map to user expectations
● High quality
● High precision
● Use for text and images and AV

Manual indexing – Cons: 

● Slow
● Inconsistent over time (editorial drift)
● Not scalable
● Highly skilled process – lots of training
● Difficult to manage centrally or to audit behaviors
● Resource intensive (people)
● High cost

Automatic indexing – Pros: 

● Cost savings (Check the ROI)
● Few resources needed
● Excellent logical consistency (Rules)
● Very scalable
● Can be centrally managed
● Frees staff for other tasks
● Fast

Automatic indexing – Cons: 

● Quality
● Consistency (co-occurrence)
● IT overhead
● Exceptions hard to manage
● Difficult to extend or train (co-occurrence)
● Might not be synchronized with user behavior

Co-occurrence – Pros: 

● Possible cost savings
● Possible speed for creation
● Fewer people involved in creation
● Can add Sequel rules
● High tech value to management

Co-occurrence – Cons: 

● Requires a training set
 - 10 – 35 “perfect records” per term
 - Main terms and synonyms need training
● Can’t deal with exceptions (except with rules)
● Needs homogeneous collections
● Needs a large collection for ROI
● Depends on full text
● Consistency variations
● New terms mean retraining the entire set
● Dependence on IT or the vendor
● Training of editors to use the results
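The dependence on a training set can be seen in even a toy co-occurrence indexer. The sketch below counts which words co-occur with each already assigned term and then ranks terms for new text by those counts. It is a simplified stand-in for statistical approaches in general, not any vendor's algorithm; the sample records and names are hypothetical, and the slide's 10 – 35 records per term is far more than this toy set uses.

```python
from collections import Counter, defaultdict


def train(records):
    """Count how often each word co-occurs with each assigned term."""
    profiles = defaultdict(Counter)
    for text, terms in records:
        words = set(text.lower().split())
        for term in terms:
            profiles[term].update(words)
    return profiles


def suggest(profiles, text, top_n=1):
    """Rank terms by the summed co-occurrence counts of the words in text."""
    words = set(text.lower().split())
    scores = {
        term: sum(counts[w] for w in words)
        for term, counts in profiles.items()
    }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked if scores[t] > 0][:top_n]


# Hypothetical, already classified training records.
records = [
    ("soil erosion on farm land", ["Soil"]),
    ("soil nutrients and crops", ["Soil"]),
    ("cattle feed and dairy herds", ["Livestock"]),
]
profiles = train(records)
```

A term with no training records can never be suggested, so adding a new term means collecting records and rebuilding the profiles, which is the retraining cost the slide describes.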

Rules based indexing – Pros: 

● Easy to modify
● High consistency
● Adaptable
● Word game for intuitive and analytical minds
● Basic rules generated automatically (less than two hours of set up and run time)
● Modify ~20% of rules for greater accuracy
● Takes seconds to change a rule
● 10 seconds to insert a condition (compared to recollecting training sets)

Rules based indexing – Cons: 

● Perceived time to build the rules
● Requires editorial attention (not IT)
● Need to evaluate statistics periodically

How was the M.A.I. evaluation conducted?: 

Phase I: An objective (quantitative) analysis of the M.A.I. statistics generated during testing.
● Data Harmony supplied with electronic format articles and NAL Thesaurus
● Indexing staff trained to use MAIstro
● Indexers tested sample articles from their individual assignments
● Statistics gathered and analyzed

Phase II: A subjective analysis of indexers’ opinions/feelings regarding M.A.I.
● Two-part questionnaire
 - 16 statements requiring responses using an agreement scale
 - 6 essay questions requiring verbal responses

Evaluation carried out by Geoffrey Yeadon @ NAL

Usability testing: 

● Training by the vendor
● Software usability (user friendliness)
● Software speed
● Applicability/Utility
● Use to assign terms
● Option to use assisted mode
● Worth the time invested
● Quality of indexing
● Use for enhancing production
● Accelerate the indexing process
● Human intervention needed – optional

Factors in the technology evolution: 

● 1964 COSATI Report
● ASIST Literature
● Cold War
● Increased computer power
● The Internet

What changes might be expected in the near future?: 

● Indexing taxonomy available for use in the search
● Query expansion using the thesaurus
● More new systems
● Merging of technologies
● Semantic web will disappear
● More synonym rings
● OWL and Z39.19 will merge – cross walk
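Two of the items above, query expansion using the thesaurus and synonym rings, can be illustrated together. This is a minimal sketch under the assumption that a synonym ring is simply a set of terms treated as equivalent at search time; the sample rings and the function name are hypothetical.

```python
# Hypothetical synonym rings: every member of a ring is treated as equivalent.
SYNONYM_RINGS = [
    {"thesaurus", "thesauri", "controlled vocabulary"},
    {"indexing", "subject analysis"},
]


def expand_query(term, rings=None):
    """Return the term plus every synonym from the rings that contain it."""
    rings = SYNONYM_RINGS if rings is None else rings
    term = term.lower()
    expanded = {term}
    for ring in rings:
        if term in ring:
            expanded |= ring
    return sorted(expanded)
```

The expanded list would then be ORed together in the search engine, so a searcher who types one variant still retrieves records indexed under the others.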

Thank you for coming!: 

Glossary available

Marjorie M. K. Hlava
Access Innovations / Data Harmony
mhlava@accessinn.com
