Automated A&I: What it is and who should consider it
9:45am – 10:15am
Marjorie M.K. Hlava, President, Access Innovations, Inc.
What will we cover in 30 minutes?
Why we do it
What is automated indexing/abstracting
The building blocks
Standards
The science of it
What are the parts of the technologies available?
The pros and cons
Not covering
Automatically building a taxonomy
XML – standardizes data
Does not find data
Why do we do it?
“Our ability to create information has substantially outpaced our ability to retrieve Information” (Delphi group 2002)
There are 250 megabytes of information for every human being on earth – and growing
Infoglut has a negative impact on productivity
In 2003, Fortune 500 companies lost $12 billion due to an inability to find information
25 – 35% of time is spent looking for information
What to do about it?
Thesauri control indexing
Indexing enables search and retrieval
“The better the controls on language used to organize research, the better the search experience” LW Moulton
Search promotes productive use and reuse of information
Search increases worker productivity and innovation
Indexing improves searching
Subjects (controlled terms)
Skill of the indexers
Keywords
Depend on the skill of the searcher
Search works IF
The content
Exists digitally
Has good metadata
The metadata reflects the content and the user needs
The users
Are trained in using the search tool
Can create a query and interpret the result
Like the user interface
So….
We need good metadata
Good indexing
How to get it?
Consistent
Right depth
User oriented
Comprehensive
Precision saves time!
Basic areas of Automatic Language Processing (ALP)
Auto Translation
Auto Indexing
Auto Abstracting
Artificial Intelligence
Searching
Spell Checking
Semantic Web
Natural Language Processing (NLP)
Computational Linguistics
Natural Language Processing
Syntactic
Semantic
Morphological
Phraseological
Lemmatization (stemming)
Statistical
Grammatical
Common Sense
Statistical
Cluster analysis
Neural networks
Co-occurrence
Bayesian Inference
Etc.
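To make the co-occurrence idea concrete, here is a minimal sketch that counts how often two words appear in the same document. The function name and the sample documents are illustrative assumptions, not part of any product discussed in this talk; production systems weight and normalize these counts.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each pair of words, how many documents contain both."""
    counts = Counter()
    for text in documents:
        # Use each word once per document and sort so pairs have a stable order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Hypothetical sample documents.
docs = [
    "thesaurus indexing search",
    "indexing search",
    "thesaurus taxonomy",
]
counts = cooccurrence_counts(docs)
print(counts[("indexing", "search")])  # number of documents containing both words
```

Pairs that co-occur often are candidates for being related, which is the signal a co-occurrence engine learns from its training set.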
Basic to all
Boolean
Inverted indexes
(those computer indexes)
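A small sketch of how an inverted index supports Boolean search. The function names and sample documents are assumptions for illustration; real search engines also store positions, handle phrases, and rank results.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Ids of documents that contain every term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Ids of documents that contain at least one term."""
    result = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result

# Hypothetical sample collection.
docs = {
    1: "automatic indexing of journal articles",
    2: "manual indexing by trained staff",
    3: "automatic abstracting of web pages",
}
index = build_inverted_index(docs)
print(boolean_and(index, "automatic", "indexing"))
print(boolean_or(index, "indexing", "abstracting"))
```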
Word and Term Parsing
Stemming
-ing, -ed, -es, -’s, -s’, etc.
Depluralization
Truncation
Left and right
Wild cards
Organi*ation
Variant Spellings
Centre, center
Hyphens
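The word and term parsing steps above can be sketched roughly as follows. This is a deliberately crude illustration: the suffix rules cover only plurals and straight-apostrophe possessives (not -ing or -ed), the variant-spelling table is a hypothetical one-entry example, and wildcard matching uses shell-style patterns from Python's fnmatch module.

```python
import fnmatch
import re

# Hypothetical variant-spelling table; a real one would be much larger.
VARIANTS = {"centre": "center"}

def depluralize(word):
    """Very rough plural stripping; real stemmers handle many more cases."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("es") and word[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(word):
    """Lowercase, drop possessives, map variant spellings, strip plurals."""
    word = word.lower()
    word = re.sub(r"('s|s')$", "", word)
    word = VARIANTS.get(word, word)
    return depluralize(word)

def wildcard_match(pattern, words):
    """Return the words matching a pattern where * stands for any characters."""
    return [w for w in words if fnmatch.fnmatchcase(w.lower(), pattern.lower())]

print(normalize("Taxonomies"), normalize("indexes"), normalize("Centre"))
print(wildcard_match("Organi*ation", ["organization", "organisation", "organic"]))
```

Putting the * at the start of a pattern gives left truncation and at the end gives right truncation.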
Parsing
Code stripping
HTML, MS Word, Photocomp codes, etc.
Punctuation
Useful in metadata identification
Stop words
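A minimal sketch of the parsing steps on this slide, assuming HTML input: strip the markup codes, drop punctuation, and remove stop words. The stop-word list is a short illustrative sample, and the regular-expression tag stripper is a simplification that would not cope with scripts, comments, or MS Word and photocomposition codes.

```python
import re
from html import unescape

# Illustrative stop-word list; production lists are longer and domain specific.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for"}

def strip_markup(text):
    """Replace HTML tags with spaces and decode entities such as &amp;."""
    text = re.sub(r"<[^>]+>", " ", text)
    return unescape(text)

def tokenize(text):
    """Return lowercased words without markup, punctuation, or stop words."""
    plain = strip_markup(text).lower()
    # Keep internal hyphens so that terms like full-text stay in one piece.
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", plain)
    return [w for w in words if w not in STOP_WORDS]

sample = "<p>The <b>indexing</b> of full-text articles &amp; abstracts.</p>"
print(tokenize(sample))
```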
What is automated indexing/abstracting?
Two very different things
Abstracting – digesting an object into a short descriptive unit
Indexing – applying terms to an object from a controlled vocabulary or thesaurus
Thesaurus and Indexing Standards – ISO
ISO 2788:1986
Documentation - Guidelines for the establishment and development of monolingual thesauri
ISO 5964:1985
Documentation - Guidelines for the establishment and development of multilingual thesauri
ISO 5963:1985
Documentation - Methods for examining documents, determining their subjects, and selecting indexing terms
ISO 999:1996
Information and documentation - Guidelines for the content, organization and presentation of indexes
ISO TC 46/SC 9
Information and Documentation - Identification and Description
TC 46 is ISO's Technical Committee (TC) for information and documentation standards.
SC 9 is the TC 46 Subcommittee (SC) that develops and maintains ISO standards on the identification and description of information resources.
Thesaurus and Indexing Standards – ANSI/NISO
ANSI/NISO Z39.19 - 2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices by James D. Anderson
Reports to use
Report on the Workshop on Electronic Thesauri, November 4-5, 1999 http://www.niso.org/news/events_workshops/thes99rprt.html
Final Report to the ALCTS/CCS Subject Analysis Committee: Subcommittee on Subject Relationships/Reference Structures June 1997 http://archive.ala.org/alcts/organization/ccs/sac/rpt97rev.html
Other links
http://esw.w3.org/topic/SkosDev/ThesaurusLinks/XmlFormats
MARC-21 XMLSchema.
Zthes Z39.50 profile for thesaurus navigation (2001).
TML thesaurus markup language (1999).
ADL Thesaurus Protocol XML formats (2002).
MeSH XML format (2001).
GEMET XML format (2003).
APAIS XML thesaurus format, an extension of Zthes (2000).
Open University thesaurus schemas (2002).
Soergel XML thesaurus specification (2001).
Automatic abstracting
Works well in some areas:
Defined source fields
Like HTML Headers
Field formatted data
XML feeds
Simple concepts
View Source
HTML Header
Source text
Access Innovations - Leader in taxonomy development, automated indexing, knowledge capital management
Resulting auto abstract
Access Innovations - leader in taxonomy development, automated indexing, knowledge capital management especially in content management, information management, database construction, database design, data processing, SGML, HTML, XML, legacy data, data conversion, data capture, automatic categorization system, machine aided indexer, markup, abstracting, indexing, document management, thesaurus development, taxonomy, taxonomies, ontology, ontologies, topic tree, pre-coordinate indexing, post-coordinate indexing, classification, hierarchy, thesaurus, thesauri, vocabulary control. Access Innovations leads in machine aided indexing software, thesaurus construction tools, taxonomy management software, abstracting and indexing, XML markup, scanning, text management, project management
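The auto abstract above appears to be assembled from defined fields in the page's HTML header. A minimal sketch of that idea, assuming the title, description, and keywords fields are the sources (the slides do not say exactly which fields or which tool were used), using Python's standard html.parser module:

```python
from html.parser import HTMLParser

class HeadFields(HTMLParser):
    """Collect the <title> text and named <meta> fields from an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def auto_abstract(html_text):
    """Join title, description, and keywords into one short descriptive unit."""
    parser = HeadFields()
    parser.feed(html_text)
    parts = [
        parser.title,
        parser.meta.get("description", ""),
        parser.meta.get("keywords", ""),
    ]
    return ". ".join(p.strip() for p in parts if p.strip())

# Hypothetical page used only for illustration.
page = (
    "<html><head><title>Example Co</title>"
    '<meta name="description" content="Taxonomy tools">'
    '<meta name="keywords" content="indexing, thesaurus">'
    "</head><body><p>Ignored body text</p></body></html>"
)
print(auto_abstract(page))
```

This only works as well as the header fields are maintained, which is why the approach suits defined source fields and simple concepts.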
Main strategies for auto indexing
Rules Based – two types
IF THEN – sequel rules
IF ELSE IF – syntactic rules
Not dependent on the collection
Co-occurrence
Training sets
Training
Compares to already classified data
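A rules-based indexer can be sketched as a list of IF/THEN rules with an optional ELSE branch. The Rule structure, the example rules, and the thesaurus terms below are hypothetical and are not the rule syntax of any particular product; matching is by plain substring, which a real system would replace with proper word and phrase matching.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Rule:
    trigger: str                      # IF this text appears in the document
    term: str                         # THEN assign this thesaurus term
    require: Tuple[str, ...] = ()     # extra conditions that must also appear
    otherwise: Optional[str] = None   # ELSE assign this term (if any)

def apply_rules(text, rules):
    """Return the thesaurus terms assigned to a text, in rule order."""
    lowered = text.lower()
    assigned = []
    for rule in rules:
        if rule.trigger not in lowered:
            continue
        if all(word in lowered for word in rule.require):
            term = rule.term
        else:
            term = rule.otherwise
        if term and term not in assigned:
            assigned.append(term)
    return assigned

# Hypothetical rules and terms.
RULES = [
    Rule("mercury", "Mercury (planet)", require=("orbit",), otherwise="Mercury (element)"),
    Rule("thesaurus", "Thesauri"),
]

print(apply_rules("The thesaurus covers mercury contamination in soil.", RULES))
print(apply_rules("Mercury has the shortest orbit of any planet.", RULES))
```

Because the rules do not depend on the collection, changing the behavior for an exception means editing one rule rather than rebuilding a training set.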
Manual indexing — Pros
Understand the conceptual nuances
Understand and map to user expectations
High quality
High precision
Use for text and images and AV
Manual indexing — Cons
Slow
Inconsistent over time (editorial drift)
Not scalable
Highly skilled process – lots of training
Difficult to manage centrally or to audit behaviors
Resource intensive (people)
High cost
Automatic indexing – Pros
Cost savings (Check the ROI)
Few resources needed
Excellent logical consistency (Rules)
Very scalable
Can be centrally managed
Frees staff for other tasks
Fast
Automatic indexing – Cons
Quality
Consistency (co-occurrence)
IT overhead
Exceptions hard to manage
Difficult to extend or train (co-occurrence)
Might not be synchronized with user behavior
Co-occurrence – Pros
Possible cost savings
Possible speed for creation
Fewer people involved in creation
Can add Sequel rules
High tech value to management
Co-occurrence – Cons
Requires a training set
10 – 35 “perfect records” per term
Main terms and synonyms need training
Can’t deal with exceptions (except with rules)
Needs homogeneous collections
Needs a large collection for ROI
Depends on full text
Consistency variations
New terms mean retraining the entire set
Dependence on IT or the vendor
Training of editors to use the results
Rules based indexing – Pros
Easy to modify
High consistency
Adaptable
Word game for intuitive and analytical minds
Basic rules generated automatically (less than a two hour set up and run time)
Modify ~20% of rules for greater accuracy
Takes seconds to change a rule
10 seconds to insert a condition (compared to recollecting training sets)
Rules based indexing – Cons
Perceived time to build the rules
Requires editorial attention (not IT)
Need to evaluate statistics periodically
How was the M.A.I. evaluation conducted?
Phase I:
An objective (quantitative) analysis of the M.A.I. statistics generated during testing.
● Data Harmony was supplied with articles in electronic format and the NAL Thesaurus
● Indexing staff were trained to use MAIstro
● Indexers tested sample articles from their individual assignments
● Statistics gathered and analyzed
Phase II:
A subjective analysis of indexers’ opinions/feelings regarding M.A.I.
● Two-part questionnaire
- 16 statements requiring responses using an agreement scale
- 6 essay questions requiring verbal responses
Evaluation carried out by Geoffrey Yeadon at NAL
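The slides do not say which statistics were gathered in Phase I. As an assumption, one common way to quantify such a test is to compare the terms suggested by the software with the terms the indexer finally selected, and report precision and recall. The function name and sample terms below are hypothetical.

```python
def term_overlap(suggested, selected):
    """Return (precision, recall) of suggested terms against selected terms."""
    suggested, selected = set(suggested), set(selected)
    hits = suggested & selected
    # Precision: share of suggested terms that the indexer accepted.
    precision = len(hits) / len(suggested) if suggested else 0.0
    # Recall: share of the indexer's terms that the software suggested.
    recall = len(hits) / len(selected) if selected else 0.0
    return precision, recall

# Hypothetical terms for one article.
suggested = ["Soil", "Mercury", "Crops", "Water"]
selected = ["Soil", "Mercury", "Irrigation"]
precision, recall = term_overlap(suggested, selected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```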
Usability testing
Training by the vendor
Software usability (user friendliness)
Software speed
Applicability/Utility
Use to assign terms
Option to use assisted mode
Worth the time invested
Quality of indexing
Use for enhancing production
Accelerate the indexing process
Human intervention needed – optional
Factors in the technology evolution
1964 COSATI Report
ASIST Literature
Cold War
Increased computer power
The Internet
What changes might be expected in the near future?
Indexing taxonomy available for use in the search
Query expansion using the thesaurus
More new systems
Merging of technologies
Semantic web will disappear
More synonym rings
OWL and Z39.19 will merge – cross walk
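Query expansion using the thesaurus, mentioned in the list above, can be sketched as follows. The thesaurus structure, its single entry, and the OR query syntax are illustrative assumptions; actual search engines differ in how expanded queries are expressed.

```python
# Hypothetical thesaurus entry with a synonym ring and narrower terms.
THESAURUS = {
    "indexing": {
        "synonyms": ["subject analysis"],
        "narrower": ["automatic indexing"],
    },
}

def expand_query(term, thesaurus, include_narrower=True):
    """Build an OR query from a term, its synonyms, and optionally narrower terms."""
    key = term.lower()
    terms = [key]
    entry = thesaurus.get(key)
    if entry:
        terms += entry.get("synonyms", [])
        if include_narrower:
            terms += entry.get("narrower", [])
    return " OR ".join(f'"{t}"' for t in terms)

print(expand_query("Indexing", THESAURUS))
```

A term that is not in the thesaurus passes through unchanged, so the searcher never gets fewer results than with the original query.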
Thank you for coming!
Glossary available
Marjorie M. K. Hlava
Access Innovations / Data Harmony
mhlava@accessinn.com