Data warehousing and data mining

Presentation Transcript

Data warehousing and mining:

Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay

Introduction:

Dr. Sunita Sarawagi, Data Warehousing & Mining

Organizations are getting larger and amassing ever-increasing amounts of data.
Historic data encodes useful information about the working of an organization.
However, the data is scattered across multiple sources and in multiple formats.
Data warehousing: the process of consolidating data in a centralized location.
Data mining: the process of analyzing data to find useful patterns and relationships.

Typical data analysis tasks:

Report the per-capita deposits broken down by region and profession.
Are deposits from rural coastal areas increasing over the last five years?
What percent of small business loans were cleared? Why is it less than last year’s?
How did similar businesses that did not take loans perform?
What should be the new rules for loan eligibility?
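The first task above is a plain group-by aggregate. A minimal sketch follows, assuming "per-capita" means the average deposit per account holder. The `accounts` records and the function name are hypothetical and are not taken from the slides.

```python
from collections import defaultdict

# Hypothetical account records: (region, profession, deposit amount).
accounts = [
    ("coastal", "farmer", 1000.0),
    ("coastal", "farmer", 3000.0),
    ("coastal", "trader", 5000.0),
    ("inland", "farmer", 2000.0),
]

def per_capita_deposits(rows):
    """Average deposit per account holder for each (region, profession) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for region, profession, amount in rows:
        totals[(region, profession)] += amount
        counts[(region, profession)] += 1
    return {key: totals[key] / counts[key] for key in totals}

report = per_capita_deposits(accounts)
```

The later questions on the slide (why a figure changed, what the new rules should be) cannot be answered by a single aggregate like this one, which is where OLAP and mining tools come in.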

Data warehouse architecture:

[Figure: data warehouse architecture]
Data sources: Bombay branch, Delhi branch, Calcutta branch, census data, GIS data (operational data, detailed transactional data)
Processing steps: merge, clean, summarize
Data warehouse: relational DBMS+, e.g. Redbrick
Decision support tools: direct query, reporting tools, OLAP, mining tools
Products named in the figure: Oracle, SAS, IMS, Crystal reports, Essbase, Intelligent Miner

Data warehouse construction:

Heterogeneous data integration: merge data from various sources, use fuzzy matches, remove inconsistencies.
Data cleaning: missing data, outliers, clean fields, e.g. names/addresses
  Data mining techniques
Data loading: summarize, create indices.
Products: Prism warehouse manager, Platinum info refiner, info pump, QDB, Vality
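Fuzzy matching of name and address fields can be illustrated with the Python standard library. This is only a sketch under stated assumptions: the normalization rules, the 0.85 threshold, and the sample names are all hypothetical, and the commercial products listed above use their own, more elaborate methods.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())

def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def fuzzy_match(a, b, threshold=0.85):
    """Treat two field values as the same entity if they are similar enough.

    The threshold is an assumed value and would need tuning on real data.
    """
    return similarity(a, b) >= threshold
```

For example, `fuzzy_match("S. Sharma, Mumbai", "s sharma mumbai")` is true because both values normalize to the same string, while unrelated names fall well below the threshold.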

Warehouse maintenance:

Data refresh: when to refresh, and in what form to send updates?
Materialized view maintenance with batch updates.
Query evaluation using materialized views.
Monitoring and reporting tools: HP intelligent warehouse advisor
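Maintaining a materialized view with batch updates means folding each batch of changes into the stored summary instead of recomputing it from the detailed data. The sketch below assumes a sum-style view of total deposits per city. The class name, method names, and figures are hypothetical.

```python
from collections import defaultdict

class DepositsByCity:
    """A summary table (city -> total deposits) kept as a materialized view."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply_batch(self, batch):
        """Fold a batch of (city, delta) updates into the stored totals."""
        for city, delta in batch:
            self.totals[city] += delta

    def query(self, city):
        """Answer from the view without rescanning detailed transactions."""
        return self.totals.get(city, 0.0)

view = DepositsByCity()
view.apply_batch([("Bombay", 500.0), ("Delhi", 200.0)])   # first refresh
view.apply_batch([("Bombay", -100.0), ("Calcutta", 50.0)])  # second refresh
```

Sums and counts can be maintained this way from deltas alone. Aggregates such as minimum or maximum generally cannot, because a deletion may require going back to the detailed data.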

OLAP:

Fast, interactive answers to large aggregate queries.
Multidimensional model: dimensions with hierarchies
  Dim 1: Bank location: branch --> city --> state
  Dim 2: Customer: sub-profession --> profession
  Dim 3: Time: month --> quarter --> year
Measures: loan amount, #transactions, balance
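A roll-up along these hierarchies can be sketched as a lookup from each fact's finest level to a coarser one, followed by a sum of the measure. The branch names, mappings, and amounts below are hypothetical, and only the location and time dimensions are shown.

```python
from collections import defaultdict

# Hypothetical location hierarchy: branch -> city -> state.
BRANCH_TO_CITY = {"Andheri": "Bombay", "Dadar": "Bombay", "Karol Bagh": "Delhi"}
CITY_TO_STATE = {"Bombay": "Maharashtra", "Delhi": "Delhi"}

# Hypothetical fact rows at the finest level: (branch, month, loan amount).
facts = [
    ("Andheri", "1999-01", 10.0),
    ("Dadar", "1999-01", 5.0),
    ("Dadar", "1999-02", 7.0),
    ("Karol Bagh", "1999-01", 4.0),
]

def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn'."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def roll_up(rows, location_level="city", time_level="quarter"):
    """Sum the loan amount at the requested level of each hierarchy."""
    cube = defaultdict(float)
    for branch, month, amount in rows:
        loc = branch
        if location_level in ("city", "state"):
            loc = BRANCH_TO_CITY[loc]
        if location_level == "state":
            loc = CITY_TO_STATE[loc]
        t = quarter_of(month) if time_level == "quarter" else month
        cube[(loc, t)] += amount
    return dict(cube)
```

With this data, `roll_up(facts)` gives one cell per city and quarter, and `roll_up(facts, "state", "month")` gives one cell per state and month.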

OLAP:

Navigational operators: pivot, drill-down, roll-up, select.
Hypothesis-driven search, e.g. factors affecting defaulters:
  view defaulting rate on age, aggregated over other dimensions
  for a particular age segment, detail along profession
Need interactive response to aggregate queries.
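The defaulter example above combines three operators: roll up to age, select one age segment, then drill down along profession. A minimal sketch follows. The loan records and age segments are hypothetical.

```python
from collections import defaultdict

# Hypothetical loan records: (age segment, profession, defaulted?).
loans = [
    ("20-30", "trader", True),
    ("20-30", "trader", False),
    ("20-30", "farmer", True),
    ("20-30", "farmer", True),
    ("30-40", "trader", False),
    ("30-40", "farmer", False),
]

def default_rate(rows, key):
    """Defaulting rate grouped by key(row), aggregated over everything else."""
    defaults, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[key(row)] += 1
        defaults[key(row)] += int(row[2])
    return {k: defaults[k] / totals[k] for k in totals}

# Step 1: view defaulting rate on age, aggregated over other dimensions.
by_age = default_rate(loans, key=lambda r: r[0])

# Step 2: select one age segment and drill down along profession.
segment = [r for r in loans if r[0] == "20-30"]
by_profession = default_rate(segment, key=lambda r: r[1])
```

Each step here rescans the records. An OLAP server aims to answer the same sequence interactively, typically from precomputed aggregates.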

OLAP products:

About 30 OLAP vendors.
Dominant ones:
  Oracle Express: largest market share (20%)
  Arbor Essbase: technology leader
  Microsoft Plato: introduced late last year, rapidly taking over...

Microsoft OLAP strategy:

Plato: OLAP server: powerful, integrating various operational sources
OLE-DB for OLAP: emerging industry standard based on MDX --> extension of SQL for OLAP
Pivot-table services: integrate with Office 2000
Every desktop will have OLAP capability.
Client-side caching and calculations
Partitioned and virtual cube
Hybrid relational and multidimensional storage

Data mining:

Process of semi-automatically analyzing large databases to find interesting and useful patterns.
Overlaps with machine learning, statistics, artificial intelligence and databases, but is
  more scalable in the number of features and instances
  more automated, to handle heterogeneous data

Some basic operations:

Predictive:
  Regression
  Classification
Descriptive:
  Clustering / similarity matching
  Association rules and variants
  Deviation detection

Classification:

Given old data about customers and payments, predict a new applicant’s loan eligibility.
[Figure: data on previous customers (age, salary, profession, location, customer type) is used to build a classifier, expressed as decision rules such as "Salary > 5 L" and "Prof. = Exec"; the classifier is applied to a new applicant’s data to label the applicant good or bad.]

Classification methods:

Nearest neighbor
Regression (linear or any polynomial): a*salary + b*age + c = eligibility score
Decision tree classifier
Probabilistic/generative models
Neural networks
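Two of these methods are small enough to sketch directly: a one-nearest-neighbor classifier and the linear scoring formula. The customer records, the choice of features, and the coefficients `a`, `b`, `c` are all made up for illustration. In practice the coefficients would be fitted to historical data.

```python
import math

# Hypothetical labelled history: ((salary in lakhs, age), label).
previous_customers = [
    ((8.0, 45), "good"),
    ((6.5, 38), "good"),
    ((2.0, 23), "bad"),
    ((1.5, 30), "bad"),
]

def nearest_neighbor(applicant, history):
    """Return the label of the closest previous customer (Euclidean distance)."""
    _, label = min(history, key=lambda item: math.dist(item[0], applicant))
    return label

def eligibility_score(salary, age, a=1.0, b=0.1, c=-5.0):
    """Linear score a*salary + b*age + c, using made-up coefficients."""
    return a * salary + b * age + c
```

Here `nearest_neighbor((7.0, 40), previous_customers)` returns the label of the customer at `(6.5, 38)`. The features are used unscaled, so age differences dominate the distance. A real classifier would normalize the features first.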

Clustering:

Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment history, such that similar customers are in the same cluster.
Key requirement: need a good measure of similarity between instances.
Identify micro-markets and develop policies for each.
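The slides do not name a clustering algorithm. As one possible illustration, the sketch below uses k-means with Euclidean distance as the similarity measure on equal-length payment histories. The data is hypothetical, and starting from the first k series is a simplification chosen to keep the example deterministic.

```python
import math

def mean_series(group):
    """Point-wise mean of a group of equal-length series."""
    return [sum(values) / len(group) for values in zip(*group)]

def k_means(series, k, iterations=10):
    """Assign each series to one of k clusters; returns a list of cluster ids."""
    centers = [list(s) for s in series[:k]]  # deterministic start: first k series
    assignment = []
    for _ in range(iterations):
        # Assign each series to its nearest center.
        assignment = [
            min(range(k), key=lambda j: math.dist(s, centers[j])) for s in series
        ]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [s for s, a in zip(series, assignment) if a == j]
            if members:
                centers[j] = mean_series(members)
    return assignment

# Hypothetical payment histories: days late for four consecutive instalments.
histories = [
    [0, 0, 1, 0],
    [30, 25, 40, 35],
    [1, 0, 0, 2],
    [28, 35, 30, 45],
]
clusters = k_means(histories, k=2)
```

On this data the prompt payers and the late payers end up in separate clusters. Whether Euclidean distance is a good measure of similarity for real payment histories is exactly the key requirement raised above.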

Association rules:

Given a set T of groups of items. Example: set of item sets purchased.
Goal: find all rules on itemsets of the form a --> b such that
  support of a and b > user threshold s
  conditional probability (confidence) of b given a > user threshold c
Examples:
  Milk --> bread
  Purchase of product A --> service B
[Figure: set T of purchased item sets, including "Milk, cereal", "Tea, milk", "Tea, rice, bread" and "cereal"]
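Support and confidence can be computed directly from these definitions. The sketch below is restricted to rules with a single item on each side and checks every pair by brute force. It is not a scalable algorithm such as Apriori. The transaction set `T` is hypothetical, loosely based on the items in the slide's example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def find_rules(transactions, min_support, min_confidence):
    """All single-item rules a --> b whose support and confidence exceed the thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            both = support({lhs, rhs}, transactions)
            if both == 0:
                continue
            # Confidence of rhs given lhs: support(lhs and rhs) / support(lhs).
            confidence = both / support({lhs}, transactions)
            if both > min_support and confidence > min_confidence:
                rules.append((lhs, rhs, both, confidence))
    return rules

# Hypothetical set T of purchased item sets.
T = [
    {"milk", "bread"},
    {"milk", "bread", "cereal"},
    {"tea", "milk"},
    {"tea", "rice", "bread"},
]
rules = find_rules(T, min_support=0.4, min_confidence=0.6)
```

With thresholds s = 0.4 and c = 0.6, only milk --> bread and bread --> milk qualify: the pair appears in 2 of 4 transactions (support 0.5), and each item appears in 3 of 4, giving confidence 2/3.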

Mining market:

Around 20 to 30 mining tool vendors.
Major players: Clementine, IBM’s Intelligent Miner, SGI’s MineSet, SAS’s Enterprise Miner.
All offer pretty much the same set of tools.
Many embedded products: fraud detection, electronic commerce applications.

Conclusions:

The value of warehousing and mining lies in effective decision making based on concrete evidence from old data.
Challenges of heterogeneity and scale in warehouse construction and maintenance.
Grades of data analysis tools: straight querying, reporting tools, multidimensional analysis and mining.
