DW & DM: Introduction :DW & DM: Introduction Babu Ram Dawadi,
IOE Pulchowk Campus,
Tribhuvan University Nepal CH#2, Data Warehousing By: Babu Ram Dawadi
Data Mining & Warehousing :CH#2, Data Warehousing By: Babu Ram Dawadi Data Mining & Warehousing Data Mining:
The process of Discovering meaningful patterns & trends often previously unknown, by shifting large amount of data, using pattern recognition, statistical and Mathematical techniques.
A group of techniques that find relationship that have not previously been discovered
What is Data Mining? :What is Data Mining? The process of discovering meaningful new correlations, patterns
and trends, often previously unknown, by sifting through large
amounts of data, using pattern recognition, statistical and
mathematical techniques
What Is Data Mining? :CH#2, Data Warehousing By: Babu Ram Dawadi What Is Data Mining? Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their “inside stories”:
Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
(Deductive) query processing.
Expert systems
Data Warehousing :CH#2, Data Warehousing By: Babu Ram Dawadi Data Warehousing Data
Raw piece of information that is capable of being moved and store.
Database
An organized collection of such data in which data are managed in tabular form with relationship.
Data Warehouse
System that organizes all the data available in an organization, makes it accessible & usable for the all kinds of data analysis and also allows to create a lots of reports by the use of mining tools.
Data Warehouse… :CH#2, Data Warehousing By: Babu Ram Dawadi Data Warehouse… “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
Data warehousing:
The process of constructing and using data warehouses.
Is the process of extracting & transferring operational data into informational data & loading it into a central data store (warehouse)
Data Warehouse—Integrated :CH#2, Data Warehousing By: Babu Ram Dawadi Data Warehouse—Integrated Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
Data Warehouse—Subject-Oriented :CH#2, Data Warehousing By: Babu Ram Dawadi Data Warehouse—Subject-Oriented Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse—Time Variant :CH#2, Data Warehousing By: Babu Ram Dawadi Data Warehouse—Time Variant The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain “time element”.
Data Warehouse—Non-Volatile :CH#2, Data Warehousing By: Babu Ram Dawadi Data Warehouse—Non-Volatile A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.
Data Warehouse Usage :CH#2, Data Warehousing By: Babu Ram Dawadi Data Warehouse Usage Three kinds of data warehouse applications
Information processing
supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
The Warehousing Approach :CH#2, Data Warehousing By: Babu Ram Dawadi The Warehousing Approach Data
Warehouse Clients Source Source Source . . . Extractor/
Monitor Integration System . . . Metadata Extractor/
Monitor Extractor/
Monitor Information integrated in advance
Stored in WH for direct querying and analysis
General Architecture :CH#2, Data Warehousing By: Babu Ram Dawadi General Architecture Data
Warehouse Query
and
Data Analysis
Component External Sources Data
Integration
Component OLAP Server OLAP queries/
reports data
mining Monitoring
Administration Internal Sources Data acquisition Data extraction Construction & maintenance
Slide 14:CH#2, Data Warehousing By: Babu Ram Dawadi
The Need for Data Analysis :CH#2, Data Warehousing By: Babu Ram Dawadi The Need for Data Analysis Managers must be able to track daily transactions to evaluate how the business is performing
By tapping into the operational database, management can develop strategies to meet organizational goals
Data analysis can provide information about short-term tactical evaluations and strategies
A Comparison of Data Warehouse and Operational Database Characteristics :CH#2, Data Warehousing By: Babu Ram Dawadi A Comparison of Data Warehouse and Operational Database Characteristics
Creating a Data Warehouse :CH#2, Data Warehousing By: Babu Ram Dawadi Creating a Data Warehouse
Factors Common to Data Warehousing :CH#2, Data Warehousing By: Babu Ram Dawadi Factors Common to Data Warehousing Dynamic framework for decision support that is always a work in progress
Must satisfy:
Data integration and loading criteria
Data analysis capabilities with acceptable query performance
End-user data analysis needs
Apply database design procedures
Decision Support Systems :CH#2, Data Warehousing By: Babu Ram Dawadi Decision Support Systems Methodology (or series of methodologies) designed to extract information from data and to use such information as a basis for decision making
Decision support system (DSS):
Arrangement of computerized tools used to assist managerial decision making within a business
Usually requires extensive data “massaging” to produce information
Used at all levels within an organization
Often tailored to focus on specific business areas
Provides ad hoc query tools to retrieve data and to display data in different formats
Decision Support Systems (continued) :CH#2, Data Warehousing By: Babu Ram Dawadi Decision Support Systems (continued) Composed of four main components:
Data store component
Basically a DSS database
Data extraction and filtering component
Used to extract and validate data taken from operational database and external data sources
End-user query tool
Used to create queries that access database
End-user presentation tool
Used to organize and present data
Main Components of a Decision Support System (DSS) :CH#2, Data Warehousing By: Babu Ram Dawadi Main Components of a Decision Support System (DSS)
Transforming Operational Data Into Decision Support Data :CH#2, Data Warehousing By: Babu Ram Dawadi Transforming Operational Data Into Decision Support Data
Designing DSS :CH#2, Data Warehousing By: Babu Ram Dawadi Designing DSS DSS is the more general term referring to all kinds of analysis of existing data in order to make better decisions, like: data mining, OLAP, Simulation etc…
DSS design differs considerably from that of an online transaction processing (OLTP)
In contrast to OLTP, DSS are used only for queries.
Designing DSS :CH#2, Data Warehousing By: Babu Ram Dawadi Designing DSS Designing a DSS seeks particular importance on:
Requirement of the end user
Software requirement
Hardware requirement
End user requirement
Discuss with the end user
People who need to use DSS produce a huge variety of queries
Some are interested only on a particular part of the information so that they may prefer to optimize the application completely in order to speed up the query process.
DSS … :CH#2, Data Warehousing By: Babu Ram Dawadi DSS … Software Requirement
Type of software depends very much on the requirement of the end user.
Working on a client/server environment allows flexibility in choosing the appropriate software for end users.
For data mining, software can be split into two parts:
The first works with the algorithms on the database server
The second work on the local workstation.
DSS… :CH#2, Data Warehousing By: Babu Ram Dawadi DSS… Hardware Requirement
A large DW can contain hundreds of thousands of gaga bytes.
So DW is designed by Engineer with knowledge of both hardware and software
For data mining, it is not always necessary to have a very large database and a large database server.
Slide 27:CH#2, Data Warehousing By: Babu Ram Dawadi ON-LINE ANALYTICAL PROCESSING (OLAP)
Slide 28:CH#2, Data Warehousing By: Babu Ram Dawadi DEFINITION : ‘OLAP applications and tools are those that are designed to ask ad hoc, complex queries of large multidimensional collections of data. It is for this reason that OLAP is often mentioned in the context of Data Warehouses’.
Slide 29:CH#2, Data Warehousing By: Babu Ram Dawadi Example: Three dimensions – Product, Sales Area, and Season
3-D Cube :CH#2, Data Warehousing By: Babu Ram Dawadi 3-D Cube dimensions = 3 Multi-dimensional cube: Fact table view:
Slide 31:CH#2, Data Warehousing By: Babu Ram Dawadi
OLAP Client/Server Architecture :CH#2, Data Warehousing By: Babu Ram Dawadi OLAP Client/Server Architecture
OLAP Server Arrangement :CH#2, Data Warehousing By: Babu Ram Dawadi OLAP Server Arrangement
Slide 34:CH#2, Data Warehousing By: Babu Ram Dawadi roll up By a drill up opperation examine sales
By country rather than city level Drill down
Slice and dice :CH#2, Data Warehousing By: Babu Ram Dawadi Slice and dice Slice: a selection on one dimension of the cube resulting in subcube
Ex: sales data are selected for dimension time using time =spring
dice: defines a subcube by performing a selection on two or more dimensions
Ex: a dice opp. Based on
location=“london” or “glasgow” and
time =spring or summer and
item = “T-shirts” or “Pyjamas”
Slide 36:CH#2, Data Warehousing By: Babu Ram Dawadi slice dice
Designing Distributed Systems :CH#2, Data Warehousing By: Babu Ram Dawadi Designing Distributed Systems Local Area Networks
Components:
Local Area Network (LAN)
Supports a network of personal computers.
Each PC has its own storage and are able to share common devices and software.
Within 100 feet of each other.
Architectures
File Server Architecture
Client Server Architecture
Designing Distributed Systems :CH#2, Data Warehousing By: Babu Ram Dawadi Designing Distributed Systems File Servers
A device that manages files operations and is shared by each client PC that is attached to the LAN.
Acts as an additional hard disk for each client PC.
Limitations of File Servers
Excessive data movement
Client is busy with extensive data manipulation, and the network is transferring large blocks of data.
Need for powerful client workstation
Decentralized data control
Slide 39:CH#2, Data Warehousing By: Babu Ram Dawadi
Designing for Distributed Systems :CH#2, Data Warehousing By: Babu Ram Dawadi Designing for Distributed Systems Client Server Architecture
Application processing is divided between client and server.
Client
Responsible for managing user-interface including presentation of data.
Server
Responsible for data storage and access such as query processing.
Client :CH#2, Data Warehousing By: Babu Ram Dawadi Client Handles user interaction
Data manipulation functions
Server :CH#2, Data Warehousing By: Babu Ram Dawadi Server Also called as the database engine
Handles all database access and control functions
Client/Server Advantages :CH#2, Data Warehousing By: Babu Ram Dawadi Client/Server Advantages Leverage (influence) the benefits of microcomputer technology.
Allows processing to be performed close to the source
Facilitates the use of GUIs and visual presentation techniques.
Encourages the acceptance of open systems.
Business Information: :CH#2, Data Warehousing By: Babu Ram Dawadi Business Information: “How you gather, manage,
and use information
will determine whether you win or lose.”
– Bill Gates
What is BI? :CH#2, Data Warehousing By: Babu Ram Dawadi What is BI? Business Intelligence means using your data assets to make better business decisions.
Business intelligence involves the gathering, management, and analysis of data for the purpose of turning that data into useful information which is then used to improve decision making.
Organizations can then make more strategic decisions about how to administer clients and programs. These practices can also reduce operating costs through more effective financial analysis, risk management, and fraud management.
Business Intelligence solutions start with data warehouses and data marts :CH#2, Data Warehousing By: Babu Ram Dawadi Business Intelligence solutions start with data warehouses and data marts
Data… Information….Decisions :CH#2, Data Warehousing By: Babu Ram Dawadi Data… Information….Decisions
OLTP –vs- OLAP :CH#2, Data Warehousing By: Babu Ram Dawadi OLTP –vs- OLAP On Line Transaction Processing -- OLTP
Maintain a database that is an accurate model of some real-world enterprise
Short simple transactions
Relatively frequent updates
Transactions access only a small fraction of the database
On Line Analytic Processing -- OLAP
Use information in database to guide strategic decisions
Complex queries
Infrequent updates (Load)
Transactions access a large fraction of the database
Example: :CH#2, Data Warehousing By: Babu Ram Dawadi Example: OLTP
Calvin, from College Park, bought a soft toy model: ‘Hobbes’) ; charge his account; deliver the toy from College park warehouse; decrease our inventory of soft toys from that warehouse
OLAP
How many soft toys were sold in all northeast warehouses in the years 2004 and 2005?
Knowledge discovery in databases :CH#2, Data Warehousing By: Babu Ram Dawadi Knowledge discovery in databases KDD is the process of identifying valid,potentially useful and understandable patterns & relationships in data
? Knowledge = patterns & relationships
knowledge discovery =
data preparation + data mining + evaluation/interpretation of discovered patterns/relationships
Nowadays, ? KDD = data mining
Knowledge Discovery in Database Environment (Stages) :CH#2, Data Warehousing By: Babu Ram Dawadi Knowledge Discovery in Database Environment (Stages) There are six stages of KDD which are:
Data selection
Cleaning
Enrichment
Coding
Data mining
reporting
Data Mining: A KDD Process :CH#2, Data Warehousing By: Babu Ram Dawadi Data Mining: A KDD Process Data mining: the core of knowledge discovery process. Data Cleaning Data Integration Databases Data Warehouse Knowledge Task-relevant Data Selection Data Mining Pattern Evaluation
KDD : Data selection :CH#2, Data Warehousing By: Babu Ram Dawadi KDD : Data selection Data Selection
It is the first stage of KDD process in which we collect and select the data set or database required to work with
Data sets are obtained from operational databases
Obtaining information from centralized databases can be difficult, reasons may be:
Data set may need conversion from one format to another
Eg: Excel files to access files
KDD: Data selection :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: Data selection Different quality of data in different parts are available
Making choice on right data is important
Investigations should be made on any data warehouses available in an organization.
A well maintained DW helps to make data selection job convenient by providing right data set necessary for analysis.
Data Cleaning
This is the second stage of KDD.
Data set obtained is never perfectly cleaned.
We may not be aware of to what extent it is polluted.
KDD: Data selection :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: Data selection Data in real world is dirty:
Incomplete: lack attribute values
Noisy: contains errors
Human errors
Not available when collected
Not entered due to misunderstanding
Malfunction of hardware/software
Mistake data entry
Inconsistent: contains discrepancies codes
The cleaning phenomena should try to eliminate all the above mentioned defects by the stage of de-duplication, domain consistency, disambiguation
KDD … :CH#2, Data Warehousing By: Babu Ram Dawadi KDD … Enrichment
Enrichment is the process of adding additional information to the databases or accessing additional databases to obtain extra information.
Eg: an airline company might cooperate with telephone company to enhance its marketing policy. A telephone company maintains large databases comprising the call behavior of customers & create telephone profiles of the basis of these data.
KDD … :CH#2, Data Warehousing By: Babu Ram Dawadi KDD … These telephone profiles could be used by air lines to identify interesting new groups of target customers
So data miners can collect all the necessary information from additional bought – in databases.
Obtaining information from other organizations may involve some tedious procedures.
Coding:
Coding is one of the most important stage where further cleaning and transformation of data is done.
KDD … :CH#2, Data Warehousing By: Babu Ram Dawadi KDD … Coding…
It can range from simple SQL Queries to using sophisticated high level languages depending upon requirement.
Some polluted records can be easily filtered out by using SQL queries. (Eg: Records with most of the field empty can be easily traced and removed)
Coding is the creative activity which involves creative transformation of data.
It can be used to obtained more simpler form of the complete, detailed database.
KDD: Coding :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: Coding Coding…
Example: to the table of the magazine publisher, we can apply following coding steps:
Convert address to region (area codes)
Birth date to age
Divide income by 1000
Divide credit by 1000
Convert owners yes/no to 1/0
Convert purchase date to month starting from 1990
Perform filtering
KDD: Coding :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: Coding Coding: applying steps 1 to 6
KDD: Coding :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: Coding Coding: applying step 7
KDD: DataMining :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: DataMining Data Mining:
All the cleanings, transformations and enrichment are performed on data, so that we can extract the most useful information from it, and this is performed in data mining stage of KDD.
It consists of different rules, techniques, and algorithms used for mining purpose.
These are involved in performing following three tasks:
Knowledge Engineering Tasks
Classification tasks
Problem solving tasks
KDD: DM :CH#2, Data Warehousing By: Babu Ram Dawadi Genetic Algorithms *Association Rules
K- nearest neighbor
Decision Trees Inductive Logic
Programming KDD: DM DM… Knowledge Engineering Tasks Classification Tasks Problem Solving Tasks Different Algorithms Concerned with Different Tasks
KDD: Data mining :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: Data mining Knowledge engineering:
is the process of finding right formal representation of certain body of knowledge in order to represent it in a knowledge based system
Eg: Expert Systems (medical diagnostic system)
Classification tasks:
Classification is the process of dividing data into no. of classes. Eg: class of customers
Problem Solving Tasks:
It involves finding solutions of remedies to the problems that arise. Eg: why are people not going to cinema hall?
KDD… :CH#2, Data Warehousing By: Babu Ram Dawadi KDD… For finding useful patterns in databases, it is necessary to choose right algorithms and right tools.
For choosing right data mining algorithms following three points should be considered:
Quality of input [No. of records, attributes, numeric]
Quality of output [yes/no results, statistics]
Performance [CPU load]
KDD: Reporting :CH#2, Data Warehousing By: Babu Ram Dawadi KDD: Reporting This stage involves documenting the results obtained from learning algorithms.
Any report writer or graphical tools can be used
It basically combines two functions:
Analysis of results obtained from mining.
Application of results to new data
Different data visualization tools like scatter diagrams available for showing different patterns or clusters of data can be used.