DATA MININGIntroductory and Advanced TopicsPart III : © Prentice Hall 1 DATA MININGIntroductory and Advanced TopicsPart III Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M.H.Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
Data Mining Outline : © Prentice Hall 2 Data Mining Outline PART I
Introduction
Related Concepts
Data Mining Techniques
PART II
Classification
Clustering
Association Rules
PART III
Web Mining
Spatial Mining
Temporal Mining
Web Mining Outline : © Prentice Hall 3 Web Mining Outline Goal: Examine the use of data mining on the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Mining Issues : © Prentice Hall 4 Web Mining Issues Size
>350 million pages (1999)
Grows at about 1 million pages a day
Google indexes 3 billion documents
Diverse types of data
Web Data : © Prentice Hall 5 Web Data Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
Profiles
Registration information
Cookies
Web Mining Taxonomy : © Prentice Hall 6 Web Mining Taxonomy Modified from [zai01]
Web Content Mining : © Prentice Hall 7 Web Content Mining Extends work of basic search engines
Search Engines
IR application
Keyword based
Similarity between query and document
Crawlers
Indexing
Profiles
Link analysis
Crawlers : © Prentice Hall 8 Crawlers Robot (spider) traverses the hypertext sructure in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web (?) and replaces index
Periodic Crawler – visits portions of the Web and updates subset of index
Incremental Crawler – selectively searches the Web and incrementally modifies index
Focused Crawler – visits pages related to a particular subject
Focused Crawler : © Prentice Hall 9 Focused Crawler Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
Classifier which assigns relevance score to each page based on crawl topic.
Distiller to identify hub pages.
Crawler visits pages to based on crawler and distiller scores.
Focused Crawler : © Prentice Hall 10 Focused Crawler Classifier to related documents to topics
Classifier also determines how useful outgoing links are
Hub Pages contain links to many relevant pages. Must be visited even if not high relevance score.
Focused Crawler : © Prentice Hall 11 Focused Crawler
Context Focused Crawler : © Prentice Hall 12 Context Focused Crawler Context Graph:
Context graph created for each seed document .
Root is the sedd document.
Nodes at each level show documents with links to documents at next higher level.
Updated during crawl itself .
Approach:
Construct context graph and classifiers using seed documents as training data.
Perform crawling using classifiers and context graph created.
Context Graph : © Prentice Hall 13 Context Graph
Virtual Web View : © Prentice Hall 14 Virtual Web View Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.
Personalization : © Prentice Hall 15 Personalization Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.
Web Structure Mining : © Prentice Hall 16 Web Structure Mining Mine structure (links, graph) of the Web
Techniques
PageRank
CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.
PageRank : © Prentice Hall 17 PageRank Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming form important pages.
PageRank (cont’d) : © Prentice Hall 18 PageRank (cont’d) PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
PR(i): PageRank for a page i which points to target page p.
Ni: number of links coming out of page i
CLEVER : © Prentice Hall 19 CLEVER Identify authoritative and hub pages.
Authoritative Pages :
Highly important pages.
Best source for requested information.
Hub Pages :
Contain links to highly important pages.
HITS : © Prentice Hall 20 HITS Hyperlink-Induces Topic Search
Based on a set of keywords, find set of relevant pages – R.
Identify hub and authority pages for these.
Expand R to a base set, B, of pages linked to or from R.
Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
HITS Algorithm : © Prentice Hall 21 HITS Algorithm
Web Usage Mining : © Prentice Hall 22 Web Usage Mining Extends work of basic search engines
Search Engines
IR application
Keyword based
Similarity between query and document
Crawlers
Indexing
Profiles
Link analysis
Web Usage Mining Applications : © Prentice Hall 23 Web Usage Mining Applications Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)
Web Usage Mining Activities : © Prentice Hall 24 Web Usage Mining Activities Preprocessing Web log
Cleanse
Remove extraneous information
Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
Count patterns that occur in sessions
Pattern is sequence of pages references in session.
Similar to association rules
Transaction: session
Itemset: pattern (or subset)
Order is important
Pattern Analysis
ARs in Web Mining : © Prentice Hall 25 ARs in Web Mining Web Mining:
Content
Structure
Usage
Frequent patterns of sequential page references in Web searching.
Uses:
Caching
Clustering users
Develop user profiles
Identify important pages
Web Usage Mining Issues : © Prentice Hall 26 Web Usage Mining Issues Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues
Web Log Cleansing : © Prentice Hall 27 Web Log Cleansing Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing not page data (such as figures and code)
Sessionizing : © Prentice Hall 28 Sessionizing Divide Web log into sessions.
Two common techniques:
Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
Data Structures : © Prentice Hall 29 Data Structures Keep track of patterns identified during Web usage mining process
Common techniques:
Trie
Suffix Tree
Generalized Suffix Tree
WAP Tree
Trie vs. Suffix Tree : © Prentice Hall 30 Trie vs. Suffix Tree Trie:
Rooted tree
Edges labeled which character (page) from pattern
Path from root to leaf represents pattern.
Suffix Tree:
Single child collapsed with parent. Edge contains labels of both prior edges.
Trie and Suffix Tree : © Prentice Hall 31 Trie and Suffix Tree
Generalized Suffix Tree : © Prentice Hall 32 Generalized Suffix Tree Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree:
Compressed version of generalized suffix tree
Types of Patterns : © Prentice Hall 33 Types of Patterns Algorithms have been developed to discover different types of patterns.
Properties:
Ordered – Characters (pages) must occur in the exact order in the original session.
Duplicates – Duplicate characters are allowed in the pattern.
Consecutive – All characters in pattern must occur consecutive in given session.
Maximal – Not subsequence of another pattern.
Pattern Types : © Prentice Hall 34 Pattern Types Association Rules
None of the properties hold
Episodes
Only ordering holds
Sequential Patterns
Ordered and maximal
Forward Sequences
Ordered, consecutive, and maximal
Maximal Frequent Sequences
All properties hold
Episodes : © Prentice Hall 35 Episodes Partially ordered set of pages
Serial episode – totally ordered with time constraint
Parallel episode – partial ordered with time constraint
General episode – partial ordered with no time constraint
DAG for Episode : © Prentice Hall 36 DAG for Episode
Spatial Mining Outline : © Prentice Hall 37 Spatial Mining Outline Goal: Provide an introduction to some spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering
Spatial Object : © Prentice Hall 38 Spatial Object Contains both spatial and nonspatial attributes.
Must have a location type attributes:
Latitude/longitude
Zip code
Street address
May retrieve object using either (or both) spatial or nonspatial attributes.
Spatial Data Mining Applications : © Prentice Hall 39 Spatial Data Mining Applications Geology
GIS Systems
Environmental Science
Agriculture
Medicine
Robotics
May involved both spatial and temporal aspects
Spatial Queries : © Prentice Hall 40 Spatial Queries Spatial selection may involve specialized selection comparison operations:
Near
North, South, East, West
Contained in
Overlap/intersect
Region (Range) Query – find objects that intersect a given region.
Nearest Neighbor Query – find object close to identified object.
Distance Scan – find object within a certain distance of an identified object where distance is made increasingly larger.
Spatial Data Structures : © Prentice Hall 41 Spatial Data Structures Data structures designed specifically to store or index spatial data.
Often based on B-tree or Binary Search Tree
Cluster data on disk basked on geographic location.
May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
Quad Tree
R-Tree
k-D Tree
MBR : © Prentice Hall 42 MBR Minimum Bounding Rectangle
Smallest rectangle that completely contains the object
MBR Examples : © Prentice Hall 43 MBR Examples
Quad Tree : © Prentice Hall 44 Quad Tree Hierarchical decomposition of the space into quadrants (MBRs)
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.
Quad Tree Example : © Prentice Hall 45 Quad Tree Example
R-Tree : © Prentice Hall 46 R-Tree As with Quad Tree the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
Lowest level cell has only one object.
Tree maintenance algorithms similar to those for B-trees.
R-Tree Example : © Prentice Hall 47 R-Tree Example
K-D Tree : © Prentice Hall 48 K-D Tree Designed for multi-attribute data, not necessarily spatial
Variation of binary search tree
Each level is used to index one of the dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but successive divisions of the dimension range.
k-D Tree Example : © Prentice Hall 49 k-D Tree Example
Topological Relationships : © Prentice Hall 50 Topological Relationships Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains
Distance Between Objects : © Prentice Hall 51 Distance Between Objects Euclidean
Manhattan
Extensions:
Progressive Refinement : © Prentice Hall 52 Progressive Refinement Make approximate answers prior to more accurate ones.
Filter out data not part of answer
Hierarchical view of data based on spatial relationships
Coarse predicate recursively refined
Progressive Refinement : © Prentice Hall 53 Progressive Refinement
Spatial Data Dominant Algorithm : © Prentice Hall 54 Spatial Data Dominant Algorithm
STING : © Prentice Hall 55 STING STatistical Information Grid-based
Hierarchical technique to divide area into rectangular cells
Grid data structure contains summary information about each cell
Hierarchical clustering
Similar to quad tree
STING : © Prentice Hall 56 STING
STING Build Algorithm : © Prentice Hall 57 STING Build Algorithm
STING Algorithm : © Prentice Hall 58 STING Algorithm
Spatial Rules : © Prentice Hall 59 Spatial Rules Characteristic Rule
The average family income in Dallas is $50,000.
Discriminant Rule
The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association Rule
The average family income in Dallas for families living near White Rock Lake is $100,000.
Spatial Association Rules : © Prentice Hall 60 Spatial Association Rules Either antecedent or consequent must contain spatial predicates.
View underlying database as set of spatial objects.
May create using a type of progressive refinement
Spatial Association Rule Algorithm : © Prentice Hall 61 Spatial Association Rule Algorithm
Spatial Classification : © Prentice Hall 62 Spatial Classification Partition spatial objects
May use nonspatial attributes and/or spatial attributes
Generalization and progressive refinement may be used.
ID3 Extension : © Prentice Hall 63 ID3 Extension Neighborhood Graph
Nodes – objects
Edges – connects neighbors
Definition of neighborhood varies
ID3 considers nonspatial attributes of all objects in a neighborhood (not just one) for classification.
Spatial Decision Tree : © Prentice Hall 64 Spatial Decision Tree Approach similar to that used for spatial association rules.
Spatial objects can be described based on objects close to them – Buffer.
Description of class based on aggregation of nearby objects.
Spatial Decision Tree Algorithm : © Prentice Hall 65 Spatial Decision Tree Algorithm
Spatial Clustering : © Prentice Hall 66 Spatial Clustering Detect clusters of irregular shapes
Use of centroids and simple distance approaches may not work well.
Clusters should be independent of order of input.
Spatial Clustering : © Prentice Hall 67 Spatial Clustering
CLARANS Extensions : © Prentice Hall 68 CLARANS Extensions Remove main memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi Diagram
Voronoi : © Prentice Hall 69 Voronoi
SD(CLARANS) : © Prentice Hall 70 SD(CLARANS) Spatial Dominant
First clusters spatial components using CLARANS
Then iteratively replaces medoids, but limits number of pairs to be searched.
Uses generalization
Uses a learning to to derive description of cluster.
SD(CLARANS) Algorithm : © Prentice Hall 71 SD(CLARANS) Algorithm
DBCLASD : © Prentice Hall 72 DBCLASD Extension of DBSCAN
Distribution Based Clustering of LArge Spatial Databases
Assumes items in cluster are uniformly distributed.
Identifies distribution satisfied by distances between nearest neighbors.
Objects added if distribution is uniform.
DBCLASD Algorithm : © Prentice Hall 73 DBCLASD Algorithm
Aggregate Proximity : © Prentice Hall 74 Aggregate Proximity Aggregate Proximity – measure of how close a cluster is to a feature.
Aggregate proximity relationship finds the k closest features to a cluster.
CRH Algorithm – uses different shapes:
Encompassing Circle
Isothetic Rectangle
Convex Hull
CRH : © Prentice Hall 75 CRH
Temporal Mining Outline : © Prentice Hall 76 Temporal Mining Outline Goal: Examine some temporal data mining issues and approaches.
Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules
Temporal Database : © Prentice Hall 77 Temporal Database Snapshot – Traditional database
Temporal – Multiple time points
Ex:
Temporal Queries : © Prentice Hall 78 Temporal Queries Query
Database
Intersection Query
Inclusion Query
Containment Query
Point Query – Tuple retrieved is valid at a particular point in time.
Types of Databases : © Prentice Hall 79 Types of Databases Snapshot – No temporal support
Transaction Time – Supports time when transaction inserted data
Timestamp
Range
Valid Time – Supports time range when data values are valid
Bitemporal – Supports both transaction and valid time.
Modeling Temporal Events : © Prentice Hall 80 Modeling Temporal Events Techniques to model temporal events.
Often based on earlier approaches
Finite State Recognizer (Machine) (FSR)
Each event recognizes one character
Temporal ordering indicated by arcs
May recognize a sequence
Require precisely defined transitions between states
Approaches
Markov Model
Hidden Markov Model
Recurrent Neural Network
FSR : © Prentice Hall 81 FSR
Markov Model (MM) : © Prentice Hall 82 Markov Model (MM) Directed graph
Vertices represent states
Arcs show transitions between states
Arc has probability of transition
At any time one state is designated as current state.
Markov Property – Given a current state, the transition probability is independent of any previous states.
Applications: speech recognition, natural language processing
Markov Model : © Prentice Hall 83 Markov Model
Hidden Markov Model (HMM) : © Prentice Hall 84 Hidden Markov Model (HMM) Like HMM, but states need not correspond to observable states.
HMM models process that produces as output a sequence of observable symbols.
HMM will actually output these symbols.
Associated with each node is the probability of the observation of an event.
Train HMM to recognize a sequence.
Transition and observation probabilities learned from training set.
Hidden Markov Model : © Prentice Hall 85 Hidden Markov Model Modified from [RJ86]
HMM Algorithm : © Prentice Hall 86 HMM Algorithm
HMM Applications : © Prentice Hall 87 HMM Applications Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
Recurrent Neural Network (RNN) : © Prentice Hall 88 Recurrent Neural Network (RNN) Extension to basic NN
Neuron can obtian input form any other neuron (including output layer).
Can be used for both recognition and prediction applications.
Time to produce output unknown
Temporal aspect added by backlinks.
RNN : © Prentice Hall 89 RNN
Time Series : © Prentice Hall 90 Time Series Set of attribute values over time
Time Series Analysis – finding patterns in the values.
Trends
Cycles
Seasonal
Outliers
Analysis Techniques : © Prentice Hall 91 Analysis Techniques Smoothing – Moving average of attribute values.
Autocorrelation – relationships between different subseries
Yearly, seasonal
Lag – Time difference between related items.
Correlation Coefficient r
Smoothing : © Prentice Hall 92 Smoothing
Correlation with Lag of 3 : © Prentice Hall 93 Correlation with Lag of 3
Similarity : © Prentice Hall 94 Similarity Determine similarity between a target pattern, X, and sequence, Y: sim(X,Y)
Similar to Web usage mining
Similar to earlier word processing and spelling corrector applications.
Issues:
Length
Scale
Gaps
Outliers
Baseline
Longest Common Subseries : © Prentice Hall 95 Longest Common Subseries Find longest subseries they have in common.
Ex:
X = <10,5,6,9,22,15,4,2>
Y = <6,9,10,5,6,22,15,4,2>
Output: <22,15,4,2>
Sim(X,Y) = l/n = 4/9
Similarity based on Linear Transformation : © Prentice Hall 96 Similarity based on Linear Transformation Linear transformation function f
Convert a value form one series to a value in the second
ef – tolerated difference in results
d – time value difference allowed
Prediction : © Prentice Hall 97 Prediction Predict future value for time series
Regression may not be sufficient
Statistical Techniques
ARMA
ARIMA
NN
Pattern Detection : © Prentice Hall 98 Pattern Detection Identify patterns of behavior in time series
Speech recognition, signal processing
FSR, MM, HMM
String Matching : © Prentice Hall 99 String Matching Find given pattern in sequence
Knuth-Morris-Pratt: Construct FSM
Boyer-Moore: Construct FSM
Distance between Strings : © Prentice Hall 100 Distance between Strings Cost to convert one to the other
Transformations
Match: Current characters in both strings are the same
Delete: Delete current character in input string
Insert: Insert current character in target string into string
Distance between Strings : © Prentice Hall 101 Distance between Strings
Frequent Sequence : © Prentice Hall 102 Frequent Sequence
Frequent Sequence Example : © Prentice Hall 103 Frequent Sequence Example Purchases made by customers
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
Frequent Sequence Lattice : © Prentice Hall 104 Frequent Sequence Lattice
SPADE : © Prentice Hall 105 SPADE Sequential Pattern Discovery using Equivalence classes
Identifies patterns by traversing lattice in a top down manner.
Divides lattice into equivalent classes and searches each separately.
ID-List: Associates customers and transactions with each item.
SPADE Example : © Prentice Hall 106 SPADE Example ID-List for Sequences of length 1:
Count for <{A}> is 3
Count for <{A},{D}> is 2
Q1 Equivalence Classes : © Prentice Hall 107 Q1 Equivalence Classes
SPADE Algorithm : © Prentice Hall 108 SPADE Algorithm
Temporal Association Rules : © Prentice Hall 109 Temporal Association Rules Transaction has time:
<TID,CID,I1,I2, …, Im,ts,te>
[ts,te] is range of time the transaction is active.
Types:
Inter-transaction rules
Episode rules
Trend dependencies
Sequence association rules
Calendric association rules
Inter-transaction Rules : © Prentice Hall 110 Inter-transaction Rules Intra-transaction association rules
Traditional association Rules
Inter-transaction association rules
Rules across transactions
Sliding window – How far apart (time or number of transactions) to look for related itemsets.
Episode Rules : © Prentice Hall 111 Episode Rules Association rules applied to sequences of events.
Episode – set of event predicates and partial ordering on them
Trend Dependencies : © Prentice Hall 112 Trend Dependencies Association rules across two database states based on time.
Ex: (SSN,=) (Salary, )
Confidence=4/5
Support=4/36
Sequence Association Rules : © Prentice Hall 113 Sequence Association Rules Association rules involving sequences
Ex:
<{A},{C}> <{A},{D}>
Support = 1/3
Confidence 1
Calendric Association Rules : © Prentice Hall 114 Calendric Association Rules Each transaction has a unique timestamp.
Group transactions based on time interval within which they occur.
Identify large itemsets by looking at transactions only in this predefined interval.