KM Techniques Examples 2005

Knowledge Management Systems: Development and Applications Part II: Techniques and Examples: 

Hsinchun Chen, Ph.D.
McClelland Professor; Director, Artificial Intelligence Lab and Hoffman E-Commerce Lab, The University of Arizona
Founder, Knowledge Computing Corporation
(Dr. Hsinchun Chen, University of Arizona, USA)
Acknowledgement: NSF DLI1, DLI2, NSDL, DG, ITR, IDM, CSS, NIH/NLM, NCI, NIJ, CIA, NCSA, HP, SAP

Slide2: 

Discovering and Managing Knowledge: Text/Web Mining and Digital Library

Knowledge: 

• Revealed underlying assumptions in KM
• Implied different roles of knowledge in organizations
• Textual knowledge: the most efficient way to store, retrieve, and transfer vast amounts of information
• Advanced processing is needed to obtain knowledge; this has traditionally been done by humans
• It is useful to review the discipline of Human-Computer Interaction to understand human analysis needs

Slide6: 

Text Mining: Intersection of IR and AI
Information Retrieval (IR) and Gerard Salton
• Inverted Index, Boolean, and Probabilistic, 1970s
• Expert Systems, User Modeling and Natural Language Processing, 1980s
• Machine Learning for Information Retrieval, 1990s
• Search Engines and Digital Libraries, late 1990s and 2000s

Slide7: 

Text Mining: Intersection of IR and AI
Artificial Intelligence (AI) and Herbert Simon
• General Problem Solvers, 1970s
• Expert Systems, 1980s
• Machine Learning and Data Mining, 1990s
• Agents, Network/Graph Learning, late 1990s and 2000s

Slide8: 

Representing Knowledge
• IR Approach
  • Indexing and Subject Headings
  • Dictionaries, Thesauri, and Classification Schemes
• AI Approach
  • Cognitive Modeling
  • Semantic Networks, Production Systems, Logic, Frames, and Ontologies

Slide9: 

For Web Mining:
• Web mining techniques: resource discovery on the Web, information extraction from Web resources, and uncovering general patterns (Etzioni, 1996)
• Pattern extraction, meta searching, spidering
• Web page summarization (Hearst, 1994; McDonald & Chen, 2002)
• Web page classification (Glover et al., 2002; Lee et al., 2002; Kwon & Lee, 2003)
• Web page clustering (Roussinov & Chen, 2001; Chen et al., 1998; Jain & Dubes, 1988)
• Web page visualization (Yang et al., 2003; Spence, 2001; Shneiderman, 1996)

Slide11: 

Text Mining Techniques:
• Linguistic analysis/NLP: identify key concepts (who/what/where…)
• Statistical/co-occurrence analysis: create automatic thesaurus, link analysis
• Statistical and neural network clustering/categorization: identify similar documents/users/communities and create knowledge maps
• Visualization and HCI: tree/network, 1/2/3D, zooming/detail-in-context

Slide12: 

Text Mining Techniques: Linguistic Analysis
• Word and inverted index: stemming, suffixes, morphological analysis, Boolean, proximity, range, fuzzy search
• Phrasal analysis: noun phrases, verb phrases, entity extraction, mutual information
• Sentence-level analysis: context-free grammar, transformational grammar
• Semantic analysis: semantic grammar, case-based reasoning, frame/script
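
As a rough illustration of the word-level techniques listed above, the sketch below builds an inverted index with very crude suffix stripping and answers a Boolean AND query. The names `simple_stem`, `build_inverted_index`, and `boolean_and`, as well as the sample documents, are assumptions made for this example and are not taken from the AI Lab systems; a real system would use a proper stemmer or morphological analyzer.

```python
from collections import defaultdict

def simple_stem(word):
    """Crude suffix stripping (illustration only, not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs):
    """Map each stemmed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[simple_stem(token.strip(".,;:!?"))].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(simple_stem(t.lower()), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-document collection.
docs = {
    1: "Text mining combines indexing and clustering.",
    2: "Search engines build inverted indexes.",
    3: "Clustering groups similar documents.",
}
index = build_inverted_index(docs)
print(boolean_and(index, "clustering", "mining"))
```

Because both the documents and the query pass through the same stemmer, a query for "indexes" also matches a document that only contains "indexing".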

Automatic Generation of CL: Foundation from NSF/DARPA/NASA Digital Library Initiative-1

Slide14: 

Text Mining Techniques: Statistical/Co-Occurrence Analysis
• Similarity functions: Jaccard, Cosine
• Weighting heuristics
• Bi-gram, tri-gram, N-gram
• Finite State Automata (FSA)
• Dictionaries and thesauri
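
The similarity functions and N-grams named above can be sketched in a few lines. This is a minimal example under simple assumptions: documents are plain token lists, Jaccard is computed over term sets, and cosine over raw term counts with no weighting heuristic applied. The helper names and sample phrases are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def jaccard(a, b):
    """Jaccard similarity: shared terms divided by all distinct terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(a, b):
    """Cosine similarity between raw term-count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

d1 = "knowledge management systems".split()
d2 = "knowledge management techniques".split()
print(jaccard(d1, d2))   # 2 shared terms out of 4 distinct terms
print(cosine(d1, d2))
print(ngrams(d1, 2))
```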

Slide15: 

Automatic Generation of CL: Foundation from NSF/DARPA/NASA Digital Library Initiative-1

Slide16: 

Text Mining Techniques: Clustering/Categorization
• Hierarchical clustering: single-link, multi-link, Ward’s
• Statistical clustering: multi-dimensional scaling (MDS), factor analysis
• Neural network clustering: self-organizing map (SOM)
• Ontologies: directories, classification schemes
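
To make the first item concrete, here is a small sketch of single-link agglomerative clustering: start with one cluster per point and repeatedly merge the two clusters whose closest members are nearest, until `k` clusters remain. The function name, the 2-D sample points, and the stopping rule (a fixed `k` of at least 1) are assumptions for illustration; documents would normally be represented as term vectors rather than coordinates.

```python
import math

def single_link(points, k):
    """Agglomerative clustering with single-link (nearest-member) distance."""
    clusters = [[p] for p in points]

    def dist(c1, c2):
        # Single-link: distance between the closest pair of members.
        return min(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = single_link(points, 3)
print(result)
```

This brute-force version recomputes distances on every pass, which is fine for a handful of points but would not scale to large collections.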

Automatic Generation of CL: Foundation from NSF/DARPA/NASA Digital Library Initiative-1

Slide18: 

KMS Techniques: Visualization/HCI
• Structures: trees/hierarchies, networks
• Dimensions: 1D, 2D, 2.5D, 3D, N-D (glyphs)
• Interactions: zooming, spotlight, fisheye views, fractal views

Automatic Generation of CL:

Slide20: 

Automatic Generation of CL: (Continued)
• Entity extraction and co-reference based on TREC and MUC
• Visualization techniques and HCI
• Text segmentation and summarization

Slide21: 

Integration of CL:
• Ontology-enhanced semantic tagging (e.g., UMLS Semantic Nets)
• Ontology-enhanced query expansion (e.g., WordNet, UMLS Metathesaurus)
• Spreading-activation based term suggestion (e.g., Hopfield net)
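
Spreading-activation term suggestion can be sketched as follows. This is a simplified variant written for illustration, not the AI Lab implementation: query terms are clamped to an activation of 1, every other term's activation is a sigmoid of the weighted sum of its neighbors' activations, and a fixed number of steps is run instead of testing for convergence. The association weights, `threshold`, and `slope` values are all assumed.

```python
import math

def suggest_terms(weights, seeds, steps=5, threshold=0.5, slope=0.1):
    """Rank non-query terms by activation spread from the query terms.

    weights[a][b] is an assumed association strength from term a to term b.
    """
    terms = set(weights) | {b for nbrs in weights.values() for b in nbrs}
    activation = {t: (1.0 if t in seeds else 0.0) for t in terms}
    for _ in range(steps):
        new = {}
        for t in terms:
            if t in seeds:
                new[t] = 1.0  # query terms stay fully active
                continue
            net = sum(activation[s] * weights.get(s, {}).get(t, 0.0)
                      for s in terms)
            # Sigmoid transfer function centered on the threshold.
            new[t] = 1.0 / (1.0 + math.exp(-(net - threshold) / slope))
        activation = new
    ranked = sorted((t for t in terms if t not in seeds),
                    key=lambda t: (-activation[t], t))
    return [(t, activation[t]) for t in ranked]

# Hypothetical co-occurrence thesaurus.
weights = {
    "knowledge": {"management": 0.9, "ontology": 0.6},
    "management": {"portal": 0.7},
    "ontology": {"thesaurus": 0.8},
}
suggestions = suggest_terms(weights, {"knowledge"})
for term, score in suggestions:
    print(term, round(score, 3))
```

Terms two links away from the query ("portal", "thesaurus") become active only after their intermediate neighbors do, which is the behavior that lets the network suggest indirectly related terms.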

YAHOO vs. OOHAY:

• YAHOO: manual, high-precision
• OOHAY: automatic, high-recall
Acknowledgements: NSF, NIH, NLM, NIJ, DARPA

Slide23: 

From YAHOO! to OOHAY?
OOHAY: Object Oriented Hierarchical Automatic Yellowpage

Slide24: 

Text and Web Mining in Digital Libraries: AI Lab Research Prototypes

Web Analysis (1M): Web pages, spidering, noun phrasing, categorization

OOHAY: Visualizing the Web

Slide28: 

OOHAY: Visualizing the Web

Slide29: 

Lessons Learned:
• Web pages are noisy: need filtering
• Spidering needs help: domain lexicons, multi-threads
• SOM is computationally feasible for large-scale applications
• SOM performance for web pages = 50%
• Web knowledge map (directory) is interesting for browsing, not for searching
• Techniques applicable to Intranet and marketing intelligence

News Classification (1M): Chinese news content, mutual information indexing, PAT tree, categorization
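
The idea behind mutual information indexing is that characters which co-occur far more often than chance probably form a phrase, which matters for Chinese because text is not delimited by spaces. The sketch below scores adjacent character pairs with standard pointwise mutual information; this is an assumed stand-in, since the slides do not give the exact formula used, and the sample segments are made up. On such a tiny sample the scores are only indicative, and PMI is known to favor rare characters.

```python
import math
from collections import Counter

def bigram_pmi(segments):
    """Pointwise mutual information for adjacent character pairs."""
    unigrams, bigrams = Counter(), Counter()
    for seg in segments:
        unigrams.update(seg)                 # count single characters
        bigrams.update(zip(seg, seg[1:]))    # count adjacent pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), freq in bigrams.items():
        p_ab = freq / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[a + b] = math.log2(p_ab / (p_a * p_b))
    return scores

# Hypothetical text, already split at punctuation into segments.
segments = ["知識管理", "知識工程", "管理系統", "資訊系統", "資訊管理"]
scores = bigram_pmi(segments)
for pair in ("知識", "識管", "管理", "理系"):
    print(pair, round(scores[pair], 2))
```

Here the true words 知識 (knowledge) and 管理 (management) score higher than the pairs 識管 and 理系 that merely straddle a word boundary.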

Slide37: 

Lessons Learned:
• News readers are not knowledge workers
• News articles are professionally written and precise
• SOM performance for news articles = 85%
• Statistical indexing techniques perform well for Chinese documents
• Corporate users may need multiple sources and dynamic search help
• Techniques applicable to eCommerce (eCatalogs) and ePortal

Personal Agents (1K): Web spidering, meta searching, noun phrasing, dynamic categorization

Slide40: 

OOHAY: CI Spider
1. Enter starting URLs and key phrases to be searched
2. Search results from spiders are displayed dynamically
For project information and free download: http://ai.bpa.arizona.edu
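
The two steps above amount to a breadth-first crawl from the starting URLs that reports pages containing a key phrase. The sketch below is a minimal illustration of that idea, not the CI Spider code: `crawl` and `TOY_WEB` are hypothetical names, and the `fetch` callback is an assumption that stands in for real HTTP retrieval and link extraction so the example runs without network access.

```python
from collections import deque

def crawl(start_urls, fetch, key_phrases, max_pages=50):
    """Breadth-first spider that returns URLs of pages matching a key phrase.

    fetch(url) must return (text, links), or None if the page is unavailable.
    """
    queue = deque(start_urls)
    visited, hits = set(), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        if any(p.lower() in text.lower() for p in key_phrases):
            hits.append(url)
        # Follow outgoing links that have not been visited yet.
        queue.extend(link for link in links if link not in visited)
    return hits

# Hypothetical in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "http://example.org/": ("Home page", ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("Notes on knowledge management", ["http://example.org/c"]),
    "http://example.org/b": ("Cooking recipes", []),
    "http://example.org/c": ("Text mining and knowledge management", ["http://example.org/"]),
}
hits = crawl(["http://example.org/"], TOY_WEB.get, ["knowledge management"])
print(hits)
```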

Slide41: 

OOHAY: CI Spider, Meta Spider, Med Spider
1. Enter starting URLs and key phrases to be searched
2. Search results from spiders are displayed dynamically
For project information and free download: http://ai.bpa.arizona.edu

Slide42: 

OOHAY: Meta Spider, News Spider, Cancer Spider For project information and free download: http://ai.bpa.arizona.edu

Slide43: 

OOHAY: CI Spider, Meta Spider, Med Spider
3. Noun phrases are extracted from the web pages, and the user can select preferred phrases for further summarization.
4. A SOM is generated based on the phrases selected. Steps 3 and 4 can be repeated to refine the results.
For project information and free download: http://ai.bpa.arizona.edu

Slide44: 

Lessons Learned:
• Meta spidering is useful for information consolidation
• Noun phrasing is useful for topic classification (dynamic folders)
• SOM usefulness is suspect for small collections
• Knowledge workers like personalization, client searching, and collaborative information sharing
• Corporate users need multiple sources and dynamic search help
• Techniques applicable to marketing and competitive analyses

CRM Data Analysis (5K): Call center Q/A, noun phrasing, dynamic categorization, problem analysis, agent assistance

Slide48: 

Lessons Learned:
• Call center data are noisy: typos and errors
• Noun phrasing is useful for Q/A classification
• Q/A classification could identify problem areas
• Q/A classification could improve agent productivity: email, online chat, and VoIP
• Q/A classification could improve new agent training
• Techniques applicable to virtual call center and CRM applications

Nano Patent Mapping (100K): Nano patents, content/network analysis and visualization, impact analysis

Data: U.S. NSE Patents: 

Top assignee countries and institutions

Data: U.S. NSE Patents (cont.): 

Top technology fields (US Patent Classification first-level categories)

Content Map Analysis: 

• NSE Grant Content Map (1991 – 1995)
• NSE Patent Content Map (1991 – 1995)

Content Map Analysis: 

• NSE Patent Content Map (1996 – 2000)
• NSE Grant Content Map (1996 – 2000)
* Region color indicates the growth rate of the associated technology topic. The number associated with each color is the actual growth rate for a particular topic (region): # of grants/patents during 1991-1995 / # of grants/patents during 1996-2000. Regions with a growth rate comparable to that of the entire field were assigned the color green.

Sample Patent Citation Networks: 

• Backbone citation network for the field “Chemistry: molecular biology and microbiology” (all patents shown were cited more than five times)
• PI-inventors and their patents form a closely linked cluster within the largest connected component of the backbone citation network
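
One way to read the two statements above is: keep only patents cited more than a threshold number of times, keep the citation links among them, and then find the largest connected component. The sketch below follows that reading. The exact backbone definition used in the study is not given on the slide, so the rule of keeping an edge only when both endpoints pass the threshold is an assumption, as are the function names and the toy citation list.

```python
from collections import Counter, defaultdict, deque

def backbone_edges(citations, min_cites=5):
    """Keep citation edges among patents cited more than min_cites times."""
    cited_count = Counter(cited for _, cited in citations)
    keep = {p for p, c in cited_count.items() if c > min_cites}
    return [(a, b) for a, b in citations if a in keep and b in keep]

def largest_component(edges):
    """Largest connected component, treating citations as undirected links."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from this start node
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical (citing, cited) pairs; threshold lowered to 1 for the toy data.
citations = [("P3", "P1"), ("P4", "P1"), ("P1", "P2"), ("P3", "P2"),
             ("P5", "P6"), ("P7", "P6"), ("P6", "P8")]
edges = backbone_edges(citations, min_cites=1)
print(edges)
print(largest_component(edges))
```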

H1.1 Patent – Number of Cites: 

• H1.1 supported: PI-inventors’ patents scored significantly higher on the number-of-cites measure than most other comparison groups (except IBM)
• Order of the groups: NSF, IBM > Top10, UC, US > EntireSet, Japan > European, Others

H2.1 Inventor – Number of Cites: 

• H2.1 supported: PI-inventors scored significantly higher on the number-of-cites measure than most other comparison groups
• Order of the groups: NSF > Top10, Japan, EntireSet, US, IBM > UC, European, Others
• Japanese inventors scored high on the number-of-cites measure despite the small number of cites for each patent they file

Slide57: 

Lessons Learned:
• Units of analysis: inventors, institutions, and countries
• USPTO patents are clean and comprehensive
• Content and network analyses help reveal trends and key innovations/inventors
• Patent analyses help with impact study

Newsgroup Categorization (1K): Workgroup communication, noun phrasing, dynamic categorization, glyphs visualization

Slide59: 

Thread Disadvantages:
• No sub-topic identification
• Difficult to identify experts
• Difficult to learn participants’ attitude toward the community

Slide60: 

Thread Representation (figure labels: Time, Message, Person, Length of Time)

Slide61: 

People Representation (figure labels: Time, Message, Thread, Length of Time)

Slide62: 

Visual Effects:
• Thickness = how active a sub-topic is
• Length in x-dimension = the time duration of a sub-topic

Slide63: 

Proposed Interface (Interaction Summary)
Visual Effects:
• Healthy sub-garden with many blooming high flowers = popular active sub-topic
• A long, blooming flower is a healthy thread

Slide64: 

Proposed Interface (Expert Indicator)
Visual Effects:
• Healthy sub-garden with many blooming high flowers = popular sub-topic
• A long, blooming people flower is a recognized expert

Slide65: 

Lessons Learned:
• P1000: A picture is indeed worth 1000 words
• Expert identification is critical for KM support
• Glyphs are powerful for capturing multi-dimensional data
• Techniques applicable to collaborative applications, e.g., email, online chats, and newsgroups

GIS Multimedia Data Mining (10GBs): Geoscience data, texture image indexing, multimedia content

Slide67: 

Airphoto analysis: Texture (Gabor filter)
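
For readers unfamiliar with Gabor filters, the sketch below builds a real-valued Gabor kernel (a Gaussian envelope multiplied by a cosine wave at a chosen orientation) and shows that it responds much more strongly to stripes aligned with its orientation. This uses the standard textbook formula; the parameter values, the helper names, and the synthetic striped patch are assumptions for illustration and say nothing about the settings used for the airphoto analysis.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Real part of a Gabor filter on a size x size grid (size assumed odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def response(patch, kernel):
    """Filter response at the patch center: sum of elementwise products."""
    return sum(p * k for prow, krow in zip(patch, kernel)
                     for p, k in zip(prow, krow))

# Synthetic 5x5 patch: intensity varies along x with a period of 4 pixels.
stripes = [[math.cos(2 * math.pi * x / 4.0) for x in range(-2, 3)]
           for _ in range(5)]
r0 = response(stripes, gabor_kernel(5, 4.0, 0.0, 2.0))
r90 = response(stripes, gabor_kernel(5, 4.0, math.pi / 2, 2.0))
print(r0, r90)
```

A texture index would typically collect such responses over several orientations and wavelengths and use them as a feature vector for each image region.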

Slide68: 

AVHRR satellite data: Temperature/vegetation

Slide69: 

Lessons Learned:
• Image analysis techniques are application dependent (unlike text analysis)
• Image killer apps not found yet
• Multimedia applications require integration of data, text, and image mining techniques
• Multimedia KMS not ready for prime-time consumption yet

Slide70: 

Knowledge Management Systems: Future

Other Emerging Categorization Challenges/Opportunities:

• Multilingual terminology and semantic issues
• Web analysis and categorization issues
• E-Commerce information (transactions) classification issues
• Multimedia content and wireless delivery issues
• Future: semantic web, multilingual web, multimedia web, wireless web!

Slide72: 

The Road Ahead
• The Semantic Web: XML, RDF, Ontologies
• The Wireless Web: WML, WIFI, display
• The Multimedia Web: content indexing and analysis
• The Multilingual Web: cross-lingual MT and IR

Slide73: 

For Project Information at AI Lab: http://ai.arizona.edu hchen@eller.arizona.edu