DATA MINING ALGORITHMS By: Snehal Kulkarni, Sneha Nikam

Outline:

Outline: 0. Introduction I. Data Warehousing II. Data Mining III. Data Mining Algorithms

0. Introduction:

0. Introduction Data warehousing and data mining: what are they, and why now? Relation to the data warehouse

A producer wants to know….:

A producer wants to know….
- Which are our lowest/highest margin customers?
- Who are my customers and what products are they buying?
- Which customers are most likely to go to the competition?
- What impact will new products/services have on revenue and margins?
- What product promotions have the biggest impact on revenue?
- What is the most effective distribution channel?

Data, Data everywhere yet ...:

Data, Data everywhere yet ...
- I can’t find the data I need: data is scattered over the network; many versions, subtle differences
- I can’t get the data I need: need an expert to get the data
- I can’t understand the data I found: available data is poorly documented
- I can’t use the data I found: results are unexpected; data needs to be transformed from one form to another
DATA MINING: UNCOVER HIDDEN INFORMATION

What is a Data Warehouse?:

What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way that they can understand and use in a business context. [Barry Devlin]

What are the users saying...:

What are the users saying... Data should be integrated across the enterprise. Summary data has real value to the organization. Historical data holds the key to understanding data over time. What-if capabilities are required.

What is Data Warehousing?:

What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference. [Forrester Research, April 1996]

Evolution:

Evolution
60’s: Batch reports; hard to find and analyze information; inflexible and expensive, reprogram for every new request
70’s: Terminal-based DSS and EIS (executive information systems); still inflexible, not integrated with desktop tools
80’s: Desktop data access and analysis tools; query tools, spreadsheets, GUIs; easier to use, but only access operational databases
90’s: Data warehousing with integrated OLAP engines and tools

Very Large Data Bases:

Very Large Data Bases
Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing -- It is a process:

Data Warehousing -- It is a process Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible. A decision support database maintained separately from the organization’s operational database.

Data Warehouse Architecture:

Data Warehouse Architecture Source data (relational databases, legacy data, purchased data, ERP systems) passes through extraction and cleansing into an optimized loader, which feeds the data warehouse engine; a metadata repository describes the contents, and analysis and query tools sit on top.

Data Mining works with Warehouse Data:

Data Mining works with Warehouse Data Data warehousing provides the enterprise with a memory. Data mining provides the enterprise with intelligence.

We want to know ...:

We want to know ...
- Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
- Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
- If I raise the price of my product by Rs. 2, what is the effect on my ROI?
- If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
- If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
- Which of my customers are likely to be the most loyal?
Data Mining helps extract such information

Knowledge Discovery (KDD) Process:

Knowledge Discovery (KDD) Process Data mining is the core of the knowledge discovery process: databases undergo data cleaning and data integration into a data warehouse; selection yields the task-relevant data; data mining finds patterns; and pattern evaluation turns them into knowledge.

Data Mining and Business Intelligence :

July 4, 2011 Data Mining and Business Intelligence Increasing potential to support business decisions, from bottom to top: data sources (paper, files, Web documents, scientific experiments, database systems); data preprocessing/integration and data warehouses (DBA); data exploration via statistical summary, querying, and reporting (data analyst); data mining for information discovery (data analyst); data presentation with visualization techniques (business analyst); and decision making (end user).

Data Mining: Confluence of Multiple Disciplines :

Data Mining: Confluence of Multiple Disciplines Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.

Application Areas:

Application Areas Industry -- Application: Finance -- credit card analysis; Insurance -- claims and fraud analysis; Telecommunication -- call record analysis; Transport -- logistics management; Consumer goods -- promotion analysis; Data service providers -- value-added data; Utilities -- power usage analysis

Data Mining in Use:

Data Mining in Use The US Government uses data mining to track fraud. A supermarket becomes an information broker. Basketball teams use it to track game strategy. Other uses: cross selling; warranty claims routing; holding on to good customers; weeding out bad customers.

What makes data mining possible?:

What makes data mining possible? Advances in the following areas are making data mining deployable: data warehousing; better and more data (i.e., operational, behavioral, and demographic); the emergence of easily deployed data mining tools; and the advent of new data mining techniques. -- Gartner Group

Why Not Traditional Data Analysis?:

Why Not Traditional Data Analysis? Tremendous amount of data: algorithms must be highly scalable to handle terabytes of data. High dimensionality of data: microarray data may have tens of thousands of dimensions. High complexity of data: data streams and sensor data; time-series, temporal, and sequence data; structured data, graphs, social networks, and multi-linked data; heterogeneous and legacy databases; spatial, spatiotemporal, multimedia, text, and Web data; software programs, scientific simulations. New and sophisticated applications.

Multi-Dimensional View of Data Mining:

Multi-Dimensional View of Data Mining Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW. Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels. Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. Applications adapted: retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes:

Data Mining: Classification Schemes General functionality: -Descriptive data mining -Predictive data mining Different views lead to different classifications: -Data view: kinds of data to be mined -Knowledge view: kinds of knowledge to be discovered -Method view: kinds of techniques utilized -Application view: kinds of applications adapted

Data mining is not:

Data mining is not:
- Brute-force crunching of bulk data
- “Blind” application of algorithms
- Going to find relationships where none exist
- Presenting data in different ways
- A database-intensive task
- A difficult-to-understand technology requiring an advanced degree in computer science

Definition:

Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Data Mining Development draws on: similarity measures, hierarchical clustering, IR systems, imprecise queries, textual data, and Web search engines; Bayes theorem, regression analysis, the EM algorithm, k-means clustering, time series analysis, neural networks, and decision tree algorithms; algorithm design techniques, algorithm analysis, and data structures; the relational data model and SQL; association rule algorithms, data warehousing, and scalability techniques.

Related Concepts:

Related Concepts Database/OLTP systems, fuzzy sets and logic, information retrieval (Web search engines), dimensional modeling, data warehousing, OLAP/DSS, statistics, machine learning, pattern matching. Goal: survey some areas that are related to data mining.

To summarize ...:

To summarize ... Data warehousing systems are used to “optimize” a business; data mining helps to “maximize” the business.

CLUSTER ANALYSIS:

CLUSTER ANALYSIS By Nikam Sneha

Cluster Analysis:

Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Summary

What’s Clustering?:

What’s Clustering? Clustering is a kind of unsupervised learning: a method of grouping data that share similar trends and patterns, by which large sets of data are grouped into clusters of smaller sets of similar data. Thus, clustering means grouping data, i.e., dividing a large data set into smaller data sets with some similarity.

Examples of Clustering Applications:

Examples of Clustering Applications Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. Land use: identification of areas of similar land use in an earth observation database. Insurance: identifying groups of motor insurance policy holders with a high average claim cost. City planning: identifying groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicenters should be clustered along continent faults.

Data Structure:

Data Structures Cluster analysis typically works on one of two structures: the data matrix (n objects by p variables) or the dissimilarity matrix (an n-by-n table of pairwise dissimilarities d(i, j)).

Type of data in clustering analysis:

Type of data in clustering analysis Interval-scaled variables Binary variables Nominal, ordinal, and ratio variables Variables of mixed types

Interval-valued variables:

Interval-valued variables Standardize the data: calculate the mean absolute deviation s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|), where m_f is the mean of variable f; then calculate the standardized measurement (z-score) z_if = (x_if - m_f) / s_f. Using the mean absolute deviation is more robust than using the standard deviation.

Similarity and Dissimilarity Between Objects:

Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects. Some popular ones include the Minkowski distance: d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q), where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer. If q = 1, d is the Manhattan distance.

Similarity and Dissimilarity Between Objects (Cont.):

Similarity and Dissimilarity Between Objects (Cont.) If q = 2, d is the Euclidean distance: d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2). Properties: d(i, j) >= 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) <= d(i, k) + d(k, j). Also, one can use weighted distance, the parametric Pearson product moment correlation, or other dissimilarity measures.
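The Minkowski family above can be sketched in a few lines of Python (the function name and sample points are illustrative):

```python
# Minkowski distance between two p-dimensional objects; q = 1 gives
# the Manhattan distance, q = 2 gives the Euclidean distance.

def minkowski(i, j, q):
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i, j = (1.0, 2.0), (4.0, 6.0)
print(minkowski(i, j, 1))  # 7.0  (Manhattan: |1-4| + |2-6|)
print(minkowski(i, j, 2))  # 5.0  (Euclidean: sqrt(9 + 16))
```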

Binary Variables:

Binary Variables A contingency table for binary data counts, for objects i and j: q (both 1), r (i = 1, j = 0), s (i = 0, j = 1), and t (both 0). Distance measure for symmetric binary variables: d(i, j) = (r + s) / (q + r + s + t). Distance measure for asymmetric binary variables: d(i, j) = (r + s) / (q + r + s). Jaccard coefficient (a similarity measure for asymmetric binary variables): sim(i, j) = q / (q + r + s).
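These three measures can be sketched directly from the contingency counts (the function name and example vectors are illustrative, not from the slides):

```python
# Dissimilarity between two binary objects via the 2x2 contingency
# counts: q = both 1, r = i only, s = j only, t = both 0.

def binary_dissimilarity(i, j):
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    d_sym = (r + s) / (q + r + s + t)    # symmetric binary distance
    d_asym = (r + s) / (q + r + s)       # asymmetric binary distance
    jaccard = q / (q + r + s)            # Jaccard similarity
    return d_sym, d_asym, jaccard

d_sym, d_asym, jac = binary_dissimilarity([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(d_sym)  # 0.4  (q=1, r=1, s=1, t=2)
```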

Dissimilarity between Binary Variables:

Dissimilarity between Binary Variables Example: gender is a symmetric attribute; the remaining attributes are asymmetric binary; let the values Y and P be set to 1, and the value N be set to 0.

Nominal Variables:

Nominal Variables A generalization of the binary variable in that it can take more than 2 states; e.g., map_color may have 4 states: red, yellow, blue, green. Method 1: simple matching, d(i, j) = (p - m) / p, where m is the number of matches and p is the total number of variables. Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.
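Method 1 (simple matching) is a one-liner; a minimal sketch, with illustrative names and made-up sample records:

```python
# Simple matching for nominal variables: d(i, j) = (p - m) / p.

def nominal_dissimilarity(i, j):
    p = len(i)                                   # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)   # number of matches
    return (p - m) / p

# Two records matching on 2 of 3 nominal variables
print(nominal_dissimilarity(["red", "M", "low"], ["red", "L", "low"]))
```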

Ordinal Variables:

Ordinal Variables An ordinal variable can be discrete or continuous; order is important, e.g., rank. Can be treated like interval-scaled variables: replace x_if by its rank r_if in {1, ..., M_f}; map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if - 1) / (M_f - 1); compute the dissimilarity using methods for interval-scaled variables.
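The rank-to-[0, 1] mapping can be sketched as follows (the function name and the 3-state example are illustrative):

```python
# Map ordinal ranks r_if in {1, ..., M_f} onto [0, 1] via
# z_if = (r_if - 1) / (M_f - 1), as described above.

def ordinal_to_interval(ranks, M_f):
    return [(r - 1) / (M_f - 1) for r in ranks]

# e.g. ranks of a 3-state rating {fair, good, excellent}
print(ordinal_to_interval([1, 2, 3], 3))  # [0.0, 0.5, 1.0]
```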

Ratio-Scaled Variables:

Ratio-Scaled Variables Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt). Methods: treat them like interval-scaled variables (not a good choice! the scale can be distorted); apply a logarithmic transformation, y_if = log(x_if); or treat them as continuous ordinal data and treat their rank as interval-scaled.

Variables of Mixed Types:

Variables of Mixed Types A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio. One may use a weighted formula to combine their effects. If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, d_ij^(f) = 1 otherwise. If f is interval-based: use the normalized distance. If f is ordinal or ratio-scaled: compute the ranks r_if, set z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled.

Major Clustering Approaches (I):

Major Clustering Approaches (I) Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors. Typical methods: k-means, k-medoids, CLARANS. Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON. Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue.

Major Clustering Approaches (II):

Major Clustering Approaches (II) Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE. Model-based approach: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model. Typical methods: EM, SOM, COBWEB.

Typical Alternatives to Calculate the Distance between Clusters:

Typical Alternatives to Calculate the Distance between Clusters Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min(t_ip, t_jq). Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max(t_ip, t_jq). Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg(t_ip, t_jq). Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j). Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j), where a medoid is one chosen, centrally located object in the cluster.
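The first three alternatives can be sketched over explicit point sets (all names and the sample clusters are illustrative; Euclidean distance is assumed between points):

```python
# Single, complete, and average link between two clusters K_i and K_j.

def dist(p, q):
    # Euclidean distance between two points
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(Ki, Kj):
    return min(dist(p, q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    return max(dist(p, q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    return sum(dist(p, q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))

Ki, Kj = [(0.0, 0.0), (0.0, 1.0)], [(3.0, 0.0), (4.0, 0.0)]
print(single_link(Ki, Kj))  # 3.0 (closest pair: (0,0) and (3,0))
```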

Centroid, Radius and Diameter of a Cluster (for numerical data sets):

Centroid, Radius and Diameter of a Cluster (for numerical data sets) Centroid: the “middle” of a cluster. Radius: square root of the average squared distance from any point of the cluster to its centroid. Diameter: square root of the average squared distance between all pairs of points in the cluster.
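A minimal sketch of the three quantities, under the RMS-style definitions above (function names and the sample cluster are illustrative):

```python
# Centroid, radius, and diameter of a cluster of numeric points.

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def radius(points):
    # sqrt of the average squared distance from each point to the centroid
    c = centroid(points)
    return (sum(sq_dist(p, c) for p in points) / len(points)) ** 0.5

def diameter(points):
    # sqrt of the average squared distance over all ordered pairs
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return (total / (n * (n - 1))) ** 0.5

pts = [(0.0, 0.0), (2.0, 0.0)]
print(centroid(pts), radius(pts), diameter(pts))  # (1.0, 0.0) 1.0 2.0
```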

Partitioning Algorithms: Basic Concept:

Partitioning Algorithms: Basic Concept Partitioning method: construct a partition of a database D of n objects into a set of k clusters that minimizes the sum of squared distances. Given k, find a partition of k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: the k-means and k-medoids algorithms. k-means (MacQueen ’67): each cluster is represented by the center of the cluster. k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw ’87): each cluster is represented by one of the objects in the cluster.

The K-Means Clustering Method :

The K-Means Clustering Method Given k, the k-means algorithm is implemented in four steps: (1) partition the objects into k nonempty subsets (e.g., arbitrarily choose k objects as the initial cluster centers); (2) compute the mean (centroid) of each current cluster; (3) assign each object to the cluster whose centroid is nearest; (4) go back to step 2, stopping when the assignments no longer change.

The K-Means Clustering Method (cont.):

The K-Means Clustering Method (cont.) Example (K = 2): arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update again until no object changes cluster. [Figure: a sequence of scatter plots illustrating the assign/update iterations.]
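The assign/update loop illustrated above can be sketched as a minimal k-means (initialization, the stopping test, and all names are illustrative, not from the slides):

```python
import random

def k_means(points, k, iters=100, seed=0):
    # Arbitrarily choose k objects as the initial cluster centers.
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Assignment step: each object goes to its most similar (nearest) center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Update step: recompute each cluster mean (keep the old center
        # if a cluster happens to empty out).
        new_centers = [tuple(sum(p[d] for p in cl) / len(cl)
                             for d in range(len(cl[0]))) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        # Reassign until the means stop changing.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, _ = k_means(pts, 2)
print(sorted(centers))  # [(0.0, 0.5), (10.0, 10.5)]
```

Note that, as the slides point out later, this loop only guarantees a local optimum; the result can depend on the initial choice of centers.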

The K-Means Clustering Method (cont.) :

The K-Means Clustering Method (cont.) Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. Comparing: PAM is O(k(n-k)^2) per iteration, CLARA is O(ks^2 + k(n-k)). Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weakness: applicable only when a mean is defined (then what about categorical data?); need to specify k, the number of clusters, in advance; unable to handle noisy data and outliers; not suitable to discover clusters with non-convex shapes.

What Is the Problem of the K-Means Method?:

What Is the Problem of the K-Means Method? The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

A Typical K-Medoids Algorithm (PAM):

A Typical K-Medoids Algorithm (PAM) With K = 2: arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping a medoid with O_random; if the quality is improved (in the illustrated example, the total cost drops from 26 to 20), perform the swap; loop until no change.
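The swap loop above can be sketched as a minimal PAM-style algorithm (greedy improving swaps over all medoid/non-medoid pairs rather than one random O_random at a time; all names and the sample data are illustrative):

```python
def total_cost(points, medoids):
    # Sum of Euclidean distances from each object to its nearest medoid.
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, m)) ** 0.5
                   for m in medoids)
               for p in points)

def pam(points, k):
    # Arbitrarily choose k objects as the initial medoids, then keep
    # swapping a medoid with a non-medoid whenever the swap lowers the
    # total cost; stop when no improving swap remains.
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate
                    improved = True
    return medoids

pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
print(sorted(pam(pts, 2)))  # [(1.0, 0.0), (11.0, 0.0)]
```

Because the representatives are actual objects, an extreme outlier shifts the medoids far less than it would shift the cluster means.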

Summary:

Summary Cluster analysis groups objects based on their similarity and has wide applications Measure of similarity can be computed for various types of data Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods Outlier detection and analysis are very useful for fraud detection, etc. and can be performed by statistical, distance-based or deviation-based approaches There are still lots of research issues on cluster analysis, such as constraint-based clustering

Questions?:

Questions?


Thank You
