Data Mining, Data Warehousing and Knowledge Discovery Basic Algorithms and Concepts:

Srinath Srinivasa, IIIT Bangalore, sri@iiitb.ac.in

Overview:

- Why Data Mining?
- Data Mining concepts
- Data Mining algorithms
  - Tabular data mining (Association, Classification and Clustering)
  - Sequence data mining
  - Streaming data mining
- Data Warehousing concepts

Why Data Mining:

From a managerial perspective:
- Strategic decision making
- Wealth generation
- Analyzing trends
- Security

Data Mining:

Data mining looks for hidden patterns and trends in data that are not immediately apparent from summarizing the data.
There is no query; instead, the search is driven by an “interestingness criterion”.

Data Mining:

Data + Interestingness criteria = Hidden patterns
The type of patterns found depends on the type of data and the type of interestingness criteria.

Type of Data:

- Tabular (Ex: Transaction data)
  - Relational
  - Multi-dimensional
- Spatial (Ex: Remote sensing data)
- Temporal (Ex: Log information)
  - Streaming (Ex: multimedia, network traffic)
- Spatio-temporal (Ex: GIS)
- Tree (Ex: XML data)
- Graphs (Ex: WWW, BioMolecular data)
- Sequence (Ex: DNA, activity logs)
- Text, Multimedia …

Type of Interestingness:

- Frequency
- Rarity
- Correlation
- Length of occurrence (for sequence and temporal data)
- Consistency
- Repeating / periodicity
- “Abnormal” behavior
- Other patterns of interestingness…

Data Mining vs Statistical Inference:

Statistics: Conceptual Model (Hypothesis) → Statistical Reasoning → “Proof” (Validation of Hypothesis)

Data Mining vs Statistical Inference:

Data mining: Data → Mining Algorithm (based on interestingness) → Pattern (model, rule, hypothesis) discovery

Data Mining Concepts:

Associations and Item-sets:
- An association is a rule of the form: if X then Y. It is denoted X → Y.
  Example: If India wins in cricket, sales of sweets go up.
- If both X → Y and Y → X hold, then X and Y are called an “interesting item-set”.
  Example: People buying school uniforms in June also buy school bags (and people buying school bags in June also buy school uniforms).

Data Mining Concepts:

Support and Confidence:
- The support for a rule (or item-set) R is the ratio of the number of transactions in which R occurs to the total number of transactions.
- The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X, that is, support(X ∪ Y) / support(X).

Data Mining Concepts:

Support and Confidence (example):

Transaction  Items
1            Bag, Uniform, Crayons
2            Books, Bag, Uniform
3            Bag, Uniform, Pencil
4            Bag, Pencil, Books
5            Uniform, Crayons, Bag
6            Bag, Pencil, Books
7            Crayons, Uniform, Bag
8            Books, Crayons, Bag
9            Uniform, Crayons, Pencil
10           Pencil, Uniform, Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
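The two ratios above can be checked mechanically. The following is a minimal sketch, assuming the example table holds ten transactions of three items each (read column by column from the slide); the names `TRANSACTIONS`, `support` and `confidence` are illustrative and not part of the original material.

```python
# Assumed reading of the example table: ten transactions, three items each.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions=TRANSACTIONS):
    """support(lhs and rhs together) divided by support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


print(support({"Bag", "Uniform"}))
print(confidence({"Bag"}, {"Uniform"}))
```

Because confidence is a ratio of two supports, swapping `lhs` and `rhs` generally gives a different value, which is why rules are directional.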

Mining for Frequent Item-sets:

The Apriori Algorithm:
Given a minimum required support s as the interestingness criterion:
1. Search for all individual elements (1-element item-sets) that have a minimum support of s.
2. Repeat:
   - From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of s.
   - This becomes the set of all frequent (i+1)-element item-sets that are interesting.
3. Until the item-set size reaches its maximum.

Mining for Frequent Item-sets:

The Apriori Algorithm (example), using the ten transactions of the support and confidence example:
Let minimum support = 0.3
Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets: {Bag,Uniform}, {Bag,Crayons}, {Bag,Pencil}, {Bag,Books}, {Uniform,Crayons}, {Uniform,Pencil}, {Pencil,Books}

Mining for Frequent Item-sets:

The Apriori Algorithm (example, continued), with minimum support = 0.3:
Interesting 3-element item-sets: {Bag,Uniform,Crayons}
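The level-wise search can be sketched as follows. This is a minimal illustration rather than the optimized algorithm of Agrawal and Srikant: it assumes the ten example transactions are the rows listed in the code, and the function name `apriori` and its return shape (a dict from item-set size to frequent item-sets) are choices made here for illustration.

```python
from itertools import combinations

# Assumed reading of the example table: ten transactions, three items each.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def apriori(transactions, minsup):
    """Return {size: set of frequent item-sets of that size}."""
    n = len(transactions)

    def sup(items):
        return sum(1 for t in transactions if items <= t) / n

    all_items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in all_items if sup(frozenset([i])) >= minsup]
    frequent = {}
    size = 1
    while current:
        frequent[size] = set(current)
        # Join: combine two frequent item-sets that differ in exactly one item.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size + 1}
        # Prune: every sub-item-set of the previous size must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[size]
                             for s in combinations(c, size))}
        current = [c for c in candidates if sup(c) >= minsup]
        size += 1
    return frequent


result = apriori(TRANSACTIONS, 0.3)
for size, itemsets in result.items():
    print(size, sorted(sorted(s) for s in itemsets))
```

The prune step is what gives the algorithm its name: an item-set can only be frequent if all of its subsets are already known to be frequent, so most candidates are discarded before any counting.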

Mining for Association Rules:

Association rules are of the form A → B, and they are directional.
Association rule mining requires two thresholds: minsup and minconf.

Mining for Association Rules:

Mining association rules using Apriori, general procedure:
1. Use Apriori to generate frequent itemsets of different sizes.
2. At each iteration, divide each frequent itemset X into two parts, LHS and RHS. This represents a rule of the form LHS → RHS.
3. The confidence of such a rule is support(X)/support(LHS).
4. Discard all rules whose confidence is less than minconf.

Mining for Association Rules:

Mining association rules using Apriori, example:
The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. It can be divided into the following rules:
{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}

Mining for Association Rules:

Mining association rules using Apriori: confidence for the rules obtained from {Bag, Uniform, Crayons}, computed as support(X)/support(LHS):

Rule                          Confidence
{Bag} → {Uniform, Crayons}    0.375
{Bag, Uniform} → {Crayons}    0.6
{Bag, Crayons} → {Uniform}    0.75
{Uniform} → {Bag, Crayons}    0.429
{Uniform, Crayons} → {Bag}    0.75
{Crayons} → {Bag, Uniform}    0.6

If minconf is 0.7, then we have discovered the following rules…

Mining for Association Rules:

- People who buy a school bag and a set of crayons are likely to buy school uniform.
- People who buy school uniform and a set of crayons are likely to buy a school bag.
The rule {Crayons} → {Bag, Uniform} does not pass: Crayons appears in 5 of the 10 transactions, so its confidence is 0.3/0.5 = 0.6, which is below minconf.
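The rule-generation step can be sketched as below. It recomputes every confidence directly from the transactions, assuming the ten example transactions are the rows listed in the code; `rules_from_itemset` is a hypothetical helper name introduced for this illustration.

```python
from itertools import combinations

# Assumed reading of the example table: ten transactions, three items each.
TRANSACTIONS = [
    {"Bag", "Uniform", "Crayons"},
    {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},
    {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"},
    {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"},
    {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"},
    {"Pencil", "Uniform", "Books"},
]


def support(items, transactions):
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, minconf):
    """Split `itemset` into every LHS/RHS pair and keep rules with confidence >= minconf."""
    itemset = frozenset(itemset)
    whole = support(itemset, transactions)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = whole / support(lhs, transactions)  # support(X) / support(LHS)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules


for lhs, rhs, conf in rules_from_itemset({"Bag", "Uniform", "Crayons"}, TRANSACTIONS, 0.7):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```

Only the support of the whole item-set and of each left-hand side is needed, and both are already available from the frequent item-set search, so in practice no further pass over the data is required.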

Generalized Association Rules:

Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases.

Bill No.  Date        Item
15563     23.10.2003  Books
15563     23.10.2003  Crayons
15564     23.10.2003  Uniform
15564     23.10.2003  Crayons

Generalized Association Rules:

A transaction for the purposes of data mining is obtained by performing a GROUP BY of the purchase table over various fields.

Generalized Association Rules:

- A GROUP BY over Bill No. would show frequent buying patterns across different customers.
- A GROUP BY over Date would show frequent buying patterns across different days.
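As a small illustration of how the grouping field changes what counts as a transaction, the sketch below groups the four purchase rows first by bill number and then by date. The names `PURCHASES` and `group_items` are hypothetical; a real system would typically do this with a GROUP BY in the database.

```python
from collections import defaultdict

# The purchase list from the example: (bill number, date, item).
PURCHASES = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]


def group_items(rows, key_index):
    """Collect the items of all rows that share the same value in column `key_index`."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return dict(groups)


by_bill = group_items(PURCHASES, 0)  # one transaction per bill
by_date = group_items(PURCHASES, 1)  # one transaction per day
print(by_bill)
print(by_date)
```

Grouping by bill yields two transactions of two items each, while grouping by date yields a single transaction holding all three distinct items.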

Classification and Clustering:

Given a set of data elements:
- Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes.
- Clustering groups data elements into different groups, based on the similarity between elements within a single group.

Classification Techniques:

Decision Tree Identification
Classification problem: Weather → Play (Yes, No)

Outlook   Temp  Play?
Sunny     30    Yes
Overcast  15    No
Sunny     16    Yes
Cloudy    27    Yes
Overcast  25    Yes
Overcast  17    No
Cloudy    17    No
Cloudy    35    Yes

Classification Techniques:

Hunt’s method for decision tree identification:
Given N element types and m decision classes:
for i ← 1 to N do
    Add element i to the (i-1)-element item-sets from the previous iteration
    Identify the set of decision classes for each item-set
    If an item-set has only one decision class, then that item-set is done; remove it from subsequent iterations
done

Classification Techniques:

Decision Tree Identification Example

Outlook   Temp      Play?
Sunny     Warm      Yes
Overcast  Chilly    No
Sunny     Chilly    Yes
Cloudy    Pleasant  Yes
Overcast  Pleasant  Yes
Overcast  Chilly    No
Cloudy    Chilly    No
Cloudy    Warm      Yes

After considering Outlook:
Sunny → Yes
Cloudy → Yes/No
Overcast → Yes/No

Classification Techniques:

Decision Tree Identification Example: expanding the Cloudy branch on Temp
Cloudy, Warm → Yes
Cloudy, Chilly → No
Cloudy, Pleasant → Yes

Classification Techniques:

Decision Tree Identification Example: expanding the Overcast branch on Temp
Overcast, Warm → (no examples in the table)
Overcast, Chilly → No
Overcast, Pleasant → Yes

Classification Techniques:

Decision Tree Identification Example: resulting tree
Outlook
- Sunny → Yes
- Cloudy → Yes/No, split on Temp
  - Warm → Yes
  - Chilly → No
  - Pleasant → Yes
- Overcast → Yes/No, split on Temp
  - Chilly → No
  - Pleasant → Yes

Classification Techniques:

Decision Tree Identification Example, observations:
- This is a top-down technique for decision tree identification.
- The decision tree created is sensitive to the order in which items are considered.
- If an N-item-set does not result in a clear decision, the classification classes have to be modeled by rough sets.
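The level-wise procedure used in the example can be sketched as follows. This is one possible reading of the method as presented here, with attributes considered in table order (Outlook, then Temp); the function name `identify_tree` and the output shape are assumptions made for illustration.

```python
# The weather table from the example: (Outlook, Temp, Play?).
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]


def identify_tree(rows, n_attributes):
    """Return {attribute-value prefix: class} for every prefix that decides the class."""
    decided = {}
    open_rows = list(rows)
    for depth in range(1, n_attributes + 1):
        # Group the still-undecided rows by their first `depth` attribute values.
        classes = {}
        for row in open_rows:
            classes.setdefault(row[:depth], set()).add(row[-1])
        # A prefix with a single decision class becomes a leaf.
        for prefix, seen in classes.items():
            if len(seen) == 1:
                decided[prefix] = next(iter(seen))
        # Rows covered by a leaf are removed from subsequent iterations.
        open_rows = [r for r in open_rows if r[:depth] not in decided]
    return decided


for prefix, label in identify_tree(ROWS, 2).items():
    print(prefix, "->", label)
```

Rows that are still undecided after all attributes have been used are simply left out of the result; that is the case in which the classes would have to be modeled differently, for example by rough sets.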

Other Classification Algorithms:

- Quinlan’s depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.
- SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value is calculated for each node, and the nodes having the lowest entropy values are selected and expanded.
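To make "best information gain" concrete, the sketch below computes the entropy-based gain of each attribute on the weather table from the decision-tree example. It illustrates only the selection measure, not Quinlan's or SLIQ's full algorithm, and the function names are chosen here for illustration.

```python
from collections import Counter
from math import log2

# The weather table from the decision-tree example: (Outlook, Temp, Play?).
ROWS = [
    ("Sunny", "Warm", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Sunny", "Chilly", "Yes"),
    ("Cloudy", "Pleasant", "Yes"),
    ("Overcast", "Pleasant", "Yes"),
    ("Overcast", "Chilly", "No"),
    ("Cloudy", "Chilly", "No"),
    ("Cloudy", "Warm", "Yes"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute_index):
    """Entropy of the class labels minus the weighted entropy after splitting."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attribute_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attribute_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gain_outlook = information_gain(ROWS, 0)
gain_temp = information_gain(ROWS, 1)
print(round(gain_outlook, 4), round(gain_temp, 4))
```

On this table Temp has the higher gain, so a gain-driven method would test Temp first, whereas the worked example tested Outlook first. This is one way the order of consideration changes the resulting tree.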

Clustering Techniques:

Clustering partitions the data set into clusters or equivalence classes.
Similarity among members of a class is greater than similarity among members across classes.
Similarity measures: Euclidean distance or other application-specific measures.

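Euclidean Distance for Tables:

[Figure: three-dimensional plot with one axis per attribute: Outlook (Sunny, Cloudy, Overcast), Temp (Warm, Pleasant, Chilly) and decision (Play, Don’t Play). Each table row is a point, for example (Cloudy, Pleasant, Play) and (Overcast, Chilly, Don’t Play).]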
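One way to obtain a Euclidean distance between table rows is to map each attribute value to a coordinate on its own axis. The sketch below does this with a purely hypothetical encoding (the position of each value along its axis, 0, 1, 2); the source does not specify the coordinates, so the resulting numbers are only illustrative.

```python
from math import sqrt

# Hypothetical encoding: each value is mapped to its position along its axis.
AXES = [
    {"Sunny": 0, "Cloudy": 1, "Overcast": 2},
    {"Warm": 0, "Pleasant": 1, "Chilly": 2},
    {"Play": 0, "Don't Play": 1},
]


def to_point(row):
    """Turn a table row into a point with one coordinate per attribute."""
    return tuple(axis[value] for axis, value in zip(AXES, row))


def euclidean(row_a, row_b):
    """Euclidean distance between the points that represent two rows."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(to_point(row_a), to_point(row_b))))


print(euclidean(("Cloudy", "Pleasant", "Play"), ("Overcast", "Chilly", "Don't Play")))
```

The choice of encoding matters: it decides, for instance, whether Sunny is "closer" to Cloudy than to Overcast, so an application-specific measure may be more appropriate for categorical data.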

Clustering Techniques:

General Strategy:
1. Draw a graph connecting items which are close to one another with edges.
2. Partition the graph into maximally connected subcomponents.
3. Construct an MST for the graph.
4. Merge items that are connected by the minimum weight of the MST into a cluster.

Clustering Techniques:

Clustering types:
- Hierarchical clustering: Clusters are formed at different levels by merging clusters at a lower level.
- Partitional clustering: Clusters are formed at only one level.

Clustering Techniques:

Nearest Neighbour Clustering Algorithm:
Given n elements x1, x2, …, xn, and threshold t:
j ← 1, k ← 1, Clusters = {}
repeat
    Find the nearest neighbour of xj
    Let the nearest neighbour be in cluster m
    If distance to nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
    j ← j+1
until j > n
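A minimal sketch of this procedure follows. It assumes that the nearest neighbour of each element is searched among the elements already assigned to a cluster, and that the first element starts the first cluster; neither detail is stated above. The function name and the 0-based cluster labels are illustrative.

```python
def nearest_neighbour_clustering(points, t, distance):
    """Return one cluster label per point, in input order."""
    if not points:
        return []
    labels = [0]  # assumption: the first element starts cluster 0
    k = 1         # number of clusters created so far
    for j in range(1, len(points)):
        # Nearest neighbour among the elements that already have a cluster.
        nearest = min(range(j), key=lambda i: distance(points[j], points[i]))
        if distance(points[j], points[nearest]) > t:
            labels.append(k)  # too far from everything seen so far: new cluster
            k += 1
        else:
            labels.append(labels[nearest])
    return labels


points = [1.0, 1.5, 5.0, 5.2, 9.0]
print(nearest_neighbour_clustering(points, 1.0, lambda a, b: abs(a - b)))
```

The result depends on the order in which elements are read, since each element is compared only with those that came before it.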

Clustering Techniques:

Iterative partitional clustering:
Given n elements x1, x2, …, xn, and k clusters, each with a center:
1. Assign each element to its closest cluster center.
2. After all assignments have been made, compute the centroid of each cluster.
3. Repeat the above two steps with the new centroids until the algorithm converges.
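The three steps can be sketched as below, which is essentially the familiar k-means loop. The initial centers are passed in by the caller, convergence is taken to mean that the centroids stop changing, and `max_iterations` is a safety limit added here; all three are assumptions of this illustration.

```python
def k_means(points, centers, max_iterations=100):
    """Iteratively assign points to the closest center and recompute centroids."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Step 1: assign each element to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Step 2: recompute each centroid (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 3: stop once the centroids no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, clusters = k_means(points, [(0.0, 0.0), (10.0, 0.0)])
print(centers)
```

Different initial centers can lead to different final clusters, so in practice the procedure is often run several times.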

Mining Sequence Data:

Characteristics of Sequence Data:
- A collection of data elements which are ordered sequences.
- In a sequence, each item has an index associated with it.
- A k-sequence is a sequence of length k.
- Support for a k-sequence s is the number of m-sequences (m >= k) which contain s as a subsequence.
- Sequence data: transaction logs, DNA sequences, patient ailment history, …

Mining Sequence Data:

Some Definitions:
- A sequence is a list of itemsets of finite length.
  Example: {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil} … the purchases of a single customer over time…
- The order of items within an itemset does not matter, but the order of itemsets matters.
- A subsequence is a sequence with some itemsets deleted.

Mining Sequence Data:

Some Definitions:
A sequence S’ = {a_1, a_2, …, a_m} is said to be contained within another sequence S if S contains a subsequence {b_1, b_2, …, b_m} such that a_1 ⊆ b_1, a_2 ⊆ b_2, …, a_m ⊆ b_m.
Hence, {pen}{pencil}{ruler,pencil} is contained in {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}.
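The containment test can be sketched with a greedy left-to-right scan: each itemset of the smaller sequence is matched against the earliest remaining itemset of the larger sequence that includes it. The function name `is_contained` is chosen here for illustration.

```python
def is_contained(small, big):
    """True if every itemset of `small` is a subset of a distinct itemset of `big`, in order."""
    position = 0
    for itemset in small:
        # Skip itemsets of `big` that do not include the current itemset.
        while position < len(big) and not set(itemset) <= set(big[position]):
            position += 1
        if position == len(big):
            return False
        position += 1  # the matched itemset cannot be reused
    return True


S = [{"pen", "pencil", "ink"}, {"pencil", "ink"}, {"ink", "eraser"}, {"ruler", "pencil"}]
print(is_contained([{"pen"}, {"pencil"}, {"ruler", "pencil"}], S))
print(is_contained([{"ink"}, {"pen"}], S))
```

Matching at the earliest possible position is safe because it leaves the largest possible remainder of the longer sequence for the itemsets that follow.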

Mining Sequence Data:

Apriori Algorithm for Sequences:
L_1 ← set of all interesting 1-sequences
k ← 1
while L_k is not empty do
    Generate all candidate (k+1)-sequences
    L_(k+1) ← set of all interesting (k+1)-sequences
    k ← k + 1
done

Mining Sequence Data:

Generating Candidate Sequences:
Given L_1, L_2, …, L_k, candidate sequences of L_(k+1) are generated as follows:
For each sequence s in L_k, concatenate s with all new 1-sequences found while generating L_(k-1).

Mining Sequence Data:

Example: minsup = 0.5

Input sequences:
a b c d e
b d a e
a e b d
b e
e a b d a
a a a a
b a a a
c b d b
a b b a b
a b d e

Interesting 1-sequences: a, b, d, e

Candidate 2-sequences:
aa, ab, ad, ae
ba, bb, bd, be
da, db, dd, de
ea, eb, ed, ee

Mining Sequence Data:

Example (continued, same ten input sequences, minsup = 0.5):

Interesting 2-sequences: ab, bd

Candidate 3-sequences:
aba, abb, abd, abe,
aab, bab, dab, eab,
bda, bdb, bdd, bde,
bbd, dbd, ebd

Interesting 3-sequences = {}
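The example can be reproduced with the sketch below. It assumes the ten input sequences are the ones listed in the code, that a pattern is supported by a sequence when it occurs as a (not necessarily contiguous) subsequence, and that candidates are formed by appending or prepending an interesting 1-sequence; these readings are inferred from the example rather than stated in it.

```python
# Assumed reading of the ten example sequences (single-symbol elements).
SEQUENCES = ["abcde", "bdae", "aebd", "be", "eabda",
             "aaaa", "baaa", "cbdb", "abbab", "abde"]


def occurs_in(pattern, sequence):
    """True if `pattern` is a (not necessarily contiguous) subsequence of `sequence`."""
    remaining = iter(sequence)
    # `symbol in remaining` advances the iterator past the first match.
    return all(symbol in remaining for symbol in pattern)


def support(pattern, sequences):
    return sum(occurs_in(pattern, s) for s in sequences) / len(sequences)


def sequence_apriori(sequences, minsup):
    """Return the interesting k-sequences for k = 1, 2, ... as a list of sorted lists."""
    symbols = sorted({c for s in sequences for c in s})
    ones = [c for c in symbols if support(c, sequences) >= minsup]
    levels = []
    current = ones
    while current:
        levels.append(sorted(current))
        # Candidates: extend each interesting sequence by one symbol at either end.
        candidates = ({s + c for s in current for c in ones}
                      | {c + s for s in current for c in ones})
        current = [c for c in candidates if support(c, sequences) >= minsup]
    return levels


print(sequence_apriori(SEQUENCES, 0.5))
```

For 3-sequences the best candidate, abd, occurs in only four of the ten sequences, so the search stops after the second level.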

Mining Sequence Data:

Language Inference:
Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behaviour.
Input set of sequences (aabb, ababcac, abbac, …) → Output state machine

Mining Sequence Data:

- Inferring the syntax of a language given its sentences.
- Applications: discerning behavioural patterns, emergent properties discovery, collaboration modeling, …
- State machine discovery is the reverse of state machine construction.
- Discovery is “maximalist” in nature…

Mining Sequence Data:

“Maximal” nature of language inference:
Input sequences: abc, aabc, aabbc, abbc
[Figure: two state machines for these sequences: the “most general” state machine (transitions labelled a,b,c) and the “most specific” state machine.]

Mining Sequence Data:

“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000)
Given a set of n sequences:
Create a state machine for the first sequence
for j ← 2 to n do
    Create a state machine for the j-th sequence
    Merge this state machine into the earlier state machine as follows:
        Merge all halt states in the new state machine to the halt state in the existing state machine
        If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
done

Mining Sequence Data:

“Shortest-run Generalization” (Srinivasa and Spiliopoulou 2000), example:
Input sequences: aabcb, aac, aabc
[Figure: the state machine after each sequence is merged in.]

Mining Streaming Data:

Characteristics of streaming data:
- Large data sequence
- No storage
- Often an infinite sequence
- Examples: Stock market quotes, streaming audio/video, network traffic

Mining Streaming Data:

Running mean:
Let n = number of items read so far and avg = running average calculated so far.
On reading the next number num:
avg ← (n*avg + num) / (n+1)
n ← n+1
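The update rule translates directly into code. The sketch below keeps only `n` and `avg` between updates, so no part of the stream has to be stored; the class name `RunningMean` is chosen here for illustration.

```python
class RunningMean:
    """Running average over a stream, using constant memory."""

    def __init__(self):
        self.n = 0      # number of items read so far
        self.avg = 0.0  # running average calculated so far

    def update(self, num):
        # avg <- (n*avg + num) / (n+1), then n <- n+1
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        return self.avg


mean = RunningMean()
for value in [2, 4, 6, 8]:
    mean.update(value)
print(mean.avg)
```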

Mining Streaming Data:

Running variance:
The variance is the average of (num - avg)^2 over all numbers read so far, where
(num - avg)^2 = num^2 - 2*num*avg + avg^2
Let
A = sum of num^2 over all numbers read so far
B = sum of 2*num*avg over all numbers read so far
C = sum of avg^2 over all numbers read so far
avg = average of numbers read so far
n = number of numbers read so far

Mining Streaming Data:

Running variance:
On reading the next number num:
avg ← (avg*n + num) / (n+1)
n ← n+1
A ← A + num^2
B ← B + 2*avg*num
C ← C + avg^2
var ← (A - B + C) / n
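A closely related sketch is shown below. It is a variation on the scheme above rather than a literal transcription: instead of accumulating B and C with the average known at each step, it keeps only A (the sum of squares), n and avg, and uses the identity that the sum of (num - avg)^2 with the current average equals A - n*avg^2. This yields the exact population variance of the numbers read so far; the class and attribute names are illustrative.

```python
class RunningVariance:
    """Population variance over a stream, using constant memory."""

    def __init__(self):
        self.n = 0         # number of numbers read so far
        self.avg = 0.0     # average of numbers read so far
        self.sum_sq = 0.0  # A: sum of num^2 over all numbers read so far

    def update(self, num):
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        self.sum_sq += num * num
        return self.variance()

    def variance(self):
        # sum((num - avg)^2) / n  ==  A/n - avg^2  when avg is the current average
        return self.sum_sq / self.n - self.avg ** 2


rv = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.update(value)
print(rv.avg, rv.variance())
```

Subtracting two large, nearly equal quantities can lose precision on long streams of large values, so this form is best read as an illustration of the idea rather than a numerically robust implementation.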

Mining Streaming Data:

α-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999):
- Let streaming data be in the form of “frames”, where each frame comprises one or more data elements.
- Support for data element k within a frame is defined as (#occurrences of k)/(#elements in frame).
- α-Consistency for data element k is the “sustained” support for k over all frames read so far, with a “leakage” of (1-α).

Mining Streaming Data:

α-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999):
level_t(k) = (1-α)*level_(t-1)(k) + α*sup(k)
[Figure: the level for k receives α*sup(k) from each new frame and retains a fraction (1-α) of its previous value.]
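The level update is a weighted blend of the previous level and the support in the newest frame. The sketch below assumes frames are plain lists of elements and that the level starts at 0 before the first frame; `alpha` is the name used here for the weighting parameter, and `update_level` is a hypothetical helper.

```python
def update_level(previous_level, frame, element, alpha):
    """One step of level_t = (1 - alpha) * level_(t-1) + alpha * sup."""
    sup = frame.count(element) / len(frame)  # support of `element` within this frame
    return (1 - alpha) * previous_level + alpha * sup


level = 0.0  # assumption: no support has been observed before the first frame
for frame in [["a", "b"], ["a", "a"], ["b", "b"]]:
    level = update_level(level, frame, "a", alpha=0.5)
print(level)
```

An element keeps a high level only if it keeps appearing: once it stops, its level shrinks by the factor (1 - alpha) with every further frame.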

Data Warehousing:

- A platform for online analytical processing (OLAP).
- Warehouses collect transactional data from several transactional databases and organize them in a fashion amenable to analysis.
- Also called “data marts”.
- A critical component of the decision support system (DSS) of enterprises.
- Some typical DW queries:
  - Which item sells best in each region that has retail outlets?
  - Which advertising strategy is best for South India?
  - Which (age_group/occupation) in South India likes fast food, and which (age_group/occupation) likes to cook?

Data Warehousing:

[Figure: OLTP databases (Order Processing, Inventory, Sales) → Data Cleaning → Data Warehouse (OLAP)]

OLTP vs OLAP:

Transactional Data (OLTP)                    Analysis Data (OLAP)
Small or medium size databases               Very large databases
Transient data                               Archival data
Frequent insertions and updates              Infrequent updates
Small query shadow                           Very large query shadow
Normalization important to handle updates    De-normalization important to handle queries

Data Cleaning:

- Performs logical transformation of transactional data to suit the data warehouse.
- Model of operations → model of enterprise.
- Usually a semi-automatic process.

Data Cleaning:

Transactional tables:
- Orders (Order_id, Price, Cust_id)
- Inventory (Prod_id, Price, Price_chng)
- Sales (Cust_id, Cust_prof, Tot_sales)
Data Warehouse dimensions: Customers, Products, Orders, Inventory, Price, Time

Multi-dimensional Data Model:

[Figure: multi-dimensional data model with dimensions Time (Jan’01, Jun’01, Jan’02, Jun’02), Price, Customers, Products and Orders.]

Some MDBMS Operations:

- Roll-up
- Add dimensions
- Drill-down
- Collapse dimensions
- Vector-distance operations (ex: clustering)
- Vector space browsing

Star Schema:

[Figure: a central fact table linked to four surrounding dimension tables, each labelled Dim Tbl_1 in the original.]

References:

- R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
- R. Agrawal, R. Srikant. Mining Sequential Patterns. Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
- R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant. The Quest Data Mining System. Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
- Surajit Chaudhuri, Umesh Dayal. An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
- Jennifer Widom. Research Problems in Data Warehousing. Proc. of Int'l Conf. on Information and Knowledge Management, 1995.
- A. Shoshani. OLAP and Statistical Databases: Similarities and Differences. Proc. of ACM PODS 1997.
- Panos Vassiliadis, Timos Sellis. A Survey on Logical Models for OLAP Databases. ACM SIGMOD Record.
- M. Gyssens, Laks V.S. Lakshmanan. A Foundation for Multi-Dimensional Databases. Proc. of VLDB 1997, Athens, Greece.
- Srinath Srinivasa, Myra Spiliopoulou. Modeling Interactions Based on Consistent Patterns. Proc. of CoopIS 1999, Edinburgh, UK.
- Srinath Srinivasa, Myra Spiliopoulou. Discerning Behavioral Patterns By Mining Transaction Logs. Proc. of ACM SAC 2000, Como, Italy.
