Lecture 4

Presentation Transcript

Slide1:

MODULE 4 DATA PREPROCESSING

Overview:

Data preprocessing is an important issue for both data warehousing and data mining, because real-world data tend to be incomplete, noisy, and inconsistent. Data preprocessing includes:
* Data cleaning (also called data cleansing or data scrubbing): the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty or coarse data.
* Data integration: combining data residing in different sources and providing users with a unified view of these data. This process becomes significant in a variety of situations, including commercial ones (for example, when two similar companies need to merge their databases).
* Data transformation: converting a set of data values from the data format of a source data system into the data format of a destination data system.
* Data reduction: the transformation of numerical or alphabetical digital information, derived empirically or experimentally, into a corrected, ordered, and simplified form. The basic concept is the reduction of large amounts of data down to the meaningful parts.

Data PREPROCESSING:

Data PREPROCESSING

Forms of Data Preprocessing:

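[Figure: the forms of data preprocessing. Panels: data cleaning; data integration; data transformation (example: -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48); and a table with attributes A1, A2, A3, ..., A126 and tuples T1, T2, ..., T2000 reduced to attributes A1, A2, A3, ..., A115 and tuples T1, T4, ..., T1456.]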

Overview:

* Data integration combines data from multiple sources to form a coherent data store. Metadata, correlation analysis, data conflict detection, and the resolution of semantic heterogeneity all contribute towards smooth data integration.
* Data cleaning
  - fill in missing values
  - smooth noisy data
  - identify outliers
  - correct data inconsistencies

Overview:

* Data reduction
  - Techniques: data cube aggregation, dimension reduction, data compression, numerosity reduction, and discretization.
  - Used to obtain a reduced representation of the data while minimizing the loss of information content.
* Data transformation
  - Converts the data into forms appropriate for mining.
  - E.g., attribute data may be normalized to fall within a small range such as 0.0 to 1.0.

Data Integration:

Data integration combines data from multiple sources into a coherent data store, e.g., a data warehouse.
* Sources may include multiple databases, data cubes, or flat files.
* Issues in data integration:
  - schema integration
  - redundancy
  - detection and resolution of data value conflicts

Data Integration:

* Schema integration
  - Integrate metadata from different sources.
  - Entity identification problem: identify real-world entities across multiple data sources, e.g., deciding whether A.cust-id and B.cust-# refer to the same attribute.
* Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources may differ.
  - Possible reasons: different representations or different scales, e.g., metric vs. British units.
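A minimal sketch of both issues on this slide, under stated assumptions: the two source tables, the attribute names (cust_id, cust_no, height_cm, height_in), the schema mapping, and the tolerance are all hypothetical and chosen only for illustration.

```python
# Hypothetical records: the same customer as seen by two source systems.
source_a = [{"cust_id": 1, "height_cm": 180.0}]
source_b = [{"cust_no": 1, "height_in": 70.0}]

# Assumed schema mapping (normally derived from metadata): B's name -> A's name.
SCHEMA_MAP_B = {"cust_no": "cust_id"}
CM_PER_INCH = 2.54


def to_unified(record_b):
    """Rename B's attributes to A's names and convert inches to centimeters."""
    unified = {SCHEMA_MAP_B.get(key, key): value for key, value in record_b.items()}
    if "height_in" in unified:
        unified["height_cm"] = unified.pop("height_in") * CM_PER_INCH
    return unified


def find_conflicts(records_a, records_b, tolerance=0.5):
    """List (cust_id, value_a, value_b) where the heights differ by more than tolerance."""
    by_id = {record["cust_id"]: record for record in records_a}
    conflicts = []
    for record in map(to_unified, records_b):
        other = by_id.get(record["cust_id"])
        if other and abs(other["height_cm"] - record["height_cm"]) > tolerance:
            conflicts.append((record["cust_id"], other["height_cm"], record["height_cm"]))
    return conflicts


conflicts = find_conflicts(source_a, source_b)
print(conflicts)
```

Converting to a common unit before comparing separates a pure scale difference from a real disagreement between the sources; how a detected conflict is then resolved is left open here.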

Data Integration:

Redundant data often occur when multiple databases are integrated.
* The same attribute may have different names in different databases.
* One attribute may be a “derived” attribute in another table, e.g., annual revenue.
* Redundant data can often be detected by correlation analysis.
* Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality.
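As an illustration of detecting redundancy by correlation analysis, the sketch below computes the Pearson correlation coefficient between two numeric attributes. The attribute names and values are made up; a coefficient close to +1 or -1 suggests that one attribute may be derivable from the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical attributes: annual revenue is derived from monthly revenue.
monthly_revenue = [10.0, 12.0, 9.0, 15.0]
annual_revenue = [12 * m for m in monthly_revenue]
unrelated = [3.0, 1.0, 4.0, 1.0]

r_derived = pearson(monthly_revenue, annual_revenue)
r_unrelated = pearson(monthly_revenue, unrelated)
print(r_derived, r_unrelated)
```

The function assumes neither attribute is constant, since a zero variance would make the coefficient undefined.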

Data Cleaning : Missing Values:

Methods of filling in missing values:
* Ignore the tuple
* Fill in the missing value manually
* Use a global constant
* Use the attribute mean
* Use the attribute mean for all samples belonging to the same class
* Use the most probable value
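Two of the options listed above, filling with the attribute mean and filling with the mean of the same class, can be sketched as follows. The rows, the attribute name income, and the class label cls are hypothetical; a missing value is represented by None.

```python
from statistics import mean

# Hypothetical samples; None marks a missing value.
rows = [
    {"cls": "yes", "income": 40.0},
    {"cls": "yes", "income": None},
    {"cls": "yes", "income": 60.0},
    {"cls": "no", "income": 20.0},
    {"cls": "no", "income": None},
]


def fill_with_attribute_mean(rows, attr):
    """Replace missing values of attr with the mean of all known values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]}) for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values of attr with the mean of the row's own class.

    Assumes every class has at least one known value for attr.
    """
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(values) for c, values in known.items()}
    return [
        dict(r, **{attr: class_means[r[class_attr]] if r[attr] is None else r[attr]})
        for r in rows
    ]


by_attribute = fill_with_attribute_mean(rows, "income")
by_class = fill_with_class_mean(rows, "income", "cls")
print(by_attribute[1]["income"], by_class[1]["income"], by_class[4]["income"])
# prints: 40.0 50.0 20.0
```

Both functions return new rows and leave the input untouched, so the different strategies can be compared on the same data.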

Data Cleaning: Noisy Data:

* Noise: random error or variance in a measured variable.
* Smooth the data to remove the noise.

Data Cleaning: Noisy Data:

Data smoothing techniques:
* Binning
  - Smooths a sorted data value by consulting its neighborhood, that is, the values around it.
  - The sorted values are distributed into a number of buckets, or bins.
  - Variants: smoothing by bin means, smoothing by bin medians, smoothing by bin boundaries.

Simple Discretization Methods: Binning:

* Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size (a uniform grid).
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N.
  - The most straightforward method, but outliers may dominate the presentation.
  - Skewed data is not handled well.
* Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples.
  - Gives good data scaling.
  - Managing categorical attributes can be tricky.
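The two partitioning schemes can be sketched as below. The function names are illustrative, and the sample values are the price list used in the binning example of this lecture. The equal-width version assumes the highest value is strictly greater than the lowest, and it places the highest value in the last interval.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        index = min(int((v - lo) / width), n_bins - 1)  # hi falls in the last bin
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of roughly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bins, start = [], 0
    for i in range(1, n_bins + 1):
        end = round(i * n / n_bins)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
# prints: [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))
# prints: [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

On this data the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 values each, which shows how equal-width partitioning reacts to an uneven distribution.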

Binning Methods for Data Smoothing:

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
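The smoothing steps of this example can be reproduced with a short sketch. Rounding the means to whole dollars and sending ties to the lower boundary are assumptions made here; the slide does not state either rule.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: a value exactly halfway goes to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


price_bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(price_bins))
# prints: [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(price_bins))
# prints: [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The exact bin means are 9, 22.75, and 29.25, so the values 23 and 29 on the slide are rounded.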

Handling outliers:

* Clustering: outliers may be detected by clustering, where similar values are organized into groups, or clusters. Values that fall outside the clusters may be considered outliers.
* Regression
* Combined computer and human inspection

Data Cleaning : Inconsistent Data:

* Inconsistent data can be corrected manually using external references.
* Source of inconsistency: errors made at data entry, which can be corrected using a paper trace.

Data Reduction:

* Data reduction obtains a reduced representation of the data set that is much smaller in volume but closely maintains the integrity of the original data.
* Mining on the reduced data set should be more efficient, yet produce the same (or almost the same) analytical results.

Data Cube Aggregation:

* The lowest level of a data cube
  - Holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse.
* Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with.
* Reference appropriate levels
  - Use the smallest representation that is sufficient to solve the task.
* Queries regarding aggregated information should be answered using the data cube when possible.

Data Cube Aggregation:

Sales data for company AllElectronics for 1997 - 1999

Quarterly sales, Year = 1997 (the slide also shows panels for Year = 1998 and Year = 1999):
Quarter | Sales
Q1 | $224,000
Q2 | $408,000
Q3 | $350,000
Q4 | $586,000

Annual sales:
Year | Sales
1997 | $1,568,000
1998 | $2,356,000
1999 | $3,594,000
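The step from the quarterly table to the annual table is a simple aggregation. The sketch below uses the 1997 figures from the slide; the dictionary layout keyed by (year, quarter) is an assumption made for illustration.

```python
# Quarterly sales for 1997 as given on the slide, keyed by (year, quarter).
quarterly_sales = {
    (1997, "Q1"): 224_000,
    (1997, "Q2"): 408_000,
    (1997, "Q3"): 350_000,
    (1997, "Q4"): 586_000,
}


def annual_totals(quarterly):
    """Aggregate quarterly sales up to one total per year."""
    totals = {}
    for (year, _quarter), amount in quarterly.items():
        totals[year] = totals.get(year, 0) + amount
    return totals


print(annual_totals(quarterly_sales))
# prints: {1997: 1568000}
```

The result matches the $1,568,000 shown for 1997 in the annual table, and the aggregated table needs one row per year instead of four.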

Dimensionality Reduction:

[Figure: The role of dimension reduction in data mining. Stages shown: data preparation, standard form, dimension reduction, data subset, prediction methods, evaluation.]

Dimensionality Reduction:

* Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.
* Dimensionality reduction reduces the data set size by removing such attributes.

Dimensionality Reduction:

How can we find a good subset of the original attributes?
* The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.

Dimensionality Reduction (Techniques):

Attribute subset selection techniques:
* Forward selection
  - Start with an empty set of attributes. The best of the original attributes is determined and added to the set.
  - At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
* Stepwise backward elimination
  - Start with the full set of attributes.
  - At each step, remove the worst attribute remaining in the set.
* Combination of forward selection and backward elimination
  - At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
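A greedy sketch of forward selection and backward elimination follows. The slide does not say how "best" and "worst" are measured, so the score function here is a stand-in: a hypothetical table of usefulness weights minus a small penalty per attribute. In practice the score would come from a statistical test or from the accuracy of a model trained on the subset.

```python
def forward_selection(attributes, score):
    """Start empty; repeatedly add the attribute that improves the score most."""
    selected, remaining = [], list(attributes)
    best = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:  # no remaining attribute helps: stop
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


def backward_elimination(attributes, score):
    """Start full; repeatedly drop the attribute whose removal scores best."""
    selected = list(attributes)
    best = score(selected)
    while len(selected) > 1:
        worst = max(selected, key=lambda a: score([x for x in selected if x != a]))
        reduced_score = score([x for x in selected if x != worst])
        if reduced_score < best:  # every removal hurts: stop
            break
        selected.remove(worst)
        best = reduced_score
    return selected


# Hypothetical scoring: only A1, A4, and A6 carry information.
WEIGHTS = {"A1": 3.0, "A4": 5.0, "A6": 2.0}


def toy_score(subset):
    return sum(WEIGHTS.get(a, 0.0) for a in subset) - 0.5 * len(subset)


ALL_ATTRIBUTES = ["A1", "A2", "A3", "A4", "A5", "A6"]
print(forward_selection(ALL_ATTRIBUTES, toy_score))
# prints: ['A4', 'A1', 'A6']
print(backward_elimination(ALL_ATTRIBUTES, toy_score))
# prints: ['A1', 'A4', 'A6']
```

Both procedures are greedy, so they are not guaranteed to find the best possible subset; they only make the locally best choice at each step.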

Slide27:

Attribute subset selection techniques:
* Decision tree induction
  - Initial attribute set: {A1, A2, A3, A4, A5, A6}
  - [Figure: decision tree with the tests A4?, A1?, and A6? and leaves labeled Class 1 and Class 2.]
  - Reduced attribute set: {A1, A4, A6}

Data Compression:

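Data compression applies data encoding or transformations to obtain a reduced, or compressed, representation of the original data.
* Lossless: the original data can be reconstructed from the compressed data without any loss of information. Lossless techniques typically allow only limited manipulation of the data.
* Lossy: only an approximation of the original data can be reconstructed.
* Two methods of lossy data compression:
  - Wavelet transforms
  - Principal component analysis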
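To make the idea of lossy compression with a wavelet transform concrete, here is a small sketch using an unnormalized Haar transform (pairwise averages and half-differences). The slide does not specify a wavelet family, so the choice of Haar, the sample data, and the threshold are assumptions. The input length must be a power of two.

```python
def haar_forward(values):
    """Unnormalized Haar transform: repeatedly store pairwise averages and half-differences."""
    out = [float(v) for v in values]
    n = len(out)
    while n > 1:
        half = n // 2
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out


def haar_inverse(coefficients):
    """Invert haar_forward."""
    out = list(coefficients)
    n = 1
    while n < len(out):
        rebuilt = []
        for average, detail in zip(out[:n], out[n:2 * n]):
            rebuilt += [average + detail, average - detail]
        out[:2 * n] = rebuilt
        n *= 2
    return out


def compress(coefficients, threshold):
    """Lossy step: set coefficients smaller than the threshold to zero."""
    return [c if abs(c) >= threshold else 0.0 for c in coefficients]


data = [2, 2, 0, 2, 3, 5, 4, 4]
coefficients = haar_forward(data)
print(coefficients)
# prints: [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
approximation = haar_inverse(compress(coefficients, threshold=1.0))
print(approximation)
# prints: [1.5, 1.5, 0.5, 2.5, 3.0, 5.0, 4.0, 4.0]
```

Only four of the eight coefficients are non-zero after thresholding, and the reconstruction is close to, but not identical to, the original data. That difference is what makes the compression lossy.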

Numerosity Reduction:

Numerosity reduction techniques reduce the data volume by choosing alternative, smaller forms of data representation. Techniques:
* Regression and log-linear models
* Histograms
* Clustering
* Sampling

Histograms:

* A popular data reduction technique.
* Divide the data into buckets and store the average (or sum) for each bucket.
* Can be constructed optimally in one dimension using dynamic programming.
* Related to quantization problems. (pg 126)
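A minimal equal-width histogram can be sketched as follows: each bucket keeps only a count and a sum, from which the average follows. The bucket width of 10 and the reuse of the price list from the binning example are choices made for illustration; this is not the optimal dynamic-programming construction mentioned above.

```python
def build_histogram(values, bucket_width):
    """Map each bucket's lower bound to (count, sum) of the values it holds."""
    buckets = {}
    for v in values:
        start = (v // bucket_width) * bucket_width
        count, total = buckets.get(start, (0, 0))
        buckets[start] = (count + 1, total + v)
    return buckets


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
histogram = build_histogram(prices, 10)
for start in sorted(histogram):
    count, total = histogram[start]
    print(f"[{start}, {start + 10}): count={count}, sum={total}, avg={total / count:.2f}")
```

Twelve values are reduced to four buckets. The price paid is that individual values can no longer be recovered, only the per-bucket summaries.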

Discretization:

* Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
* Discretization: divide the range of a continuous attribute into intervals.
  - Some classification algorithms only accept categorical attributes.
  - Reduce data size by discretization.
  - Prepare for further analysis.

Data Transformation:

* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization:

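For a value $v$ of attribute $A$, with normalized value $v'$:
* min-max normalization, mapping the range $[min_A, max_A]$ onto $[new\_min_A, new\_max_A]$:
  $v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$
* z-score normalization, where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of $A$:
  $v' = \frac{v - \bar{A}}{\sigma_A}$
* normalization by decimal scaling:
  $v' = \frac{v}{10^{j}}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$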
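The three normalization methods named on this slide can be sketched in a few lines. The sample values are made up, and the use of the population standard deviation for the z-score is an assumption, since the slide does not say which variant is meant.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer so that max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(min_max([120, 130, 140]))
# prints: [0.0, 0.5, 1.0]
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
# prints: ([-0.986, 0.917], 3)
```

min_max and z_score assume the attribute is not constant; a constant attribute would lead to a division by zero and would need separate handling.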

Slide35:

Example data: ..\example_data.xls
For automated preprocessing, use http://rosetta.lcb.uu.se/

RAW DATA:

Male | Typical angina | 145 | 233 | TRUE | LV hypertrophy | 150 | No | 2.3 | Downsloping | 0 | Fixed defect | 0 | No
Male | Asymptomatic | 160 | 286 | FALSE | LV hypertrophy | 108 | Yes | 1.5 | Flat | 3 | Normal | 2 | Yes
Male | Asymptomatic | 120 | 229 | FALSE | LV hypertrophy | 129 | Yes | 2.6 | Flat | 2 | Reversable defect | 1 | Yes
Male | Non-anginal pain | 130 | 250 | FALSE | Normal | 187 | No | 3.5 | Downsloping | 0 | Normal | 0 | No
Female | Atypical angina | 130 | 204 | FALSE | LV hypertrophy | 172 | No | 1.4 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 120 | 236 | FALSE | Normal | 178 | No | 0.8 | Upsloping | 0 | Normal | 0 | No
Female | Asymptomatic | 140 | 268 | FALSE | LV hypertrophy | 160 | No | 3.6 | Downsloping | 2 | Normal | 3 | Yes
Female | Asymptomatic | 120 | 354 | FALSE | Normal | 163 | Yes | 0.6 | Upsloping | 0 | Normal | 0 | No
Male | Asymptomatic | 130 | 254 | FALSE | LV hypertrophy | 147 | No | 1.4 | Flat | 1 | Reversable defect | 2 | Yes
Male | Asymptomatic | 140 | 203 | TRUE | LV hypertrophy | 155 | Yes | 3.1 | Downsloping | 0 | Reversable defect | 1 | Yes
Male | Asymptomatic | 140 | 192 | FALSE | Normal | 148 | No | 0.4 | Flat | 0 | Fixed defect | 0 | No
Female | Atypical angina | 140 | 294 | FALSE | LV hypertrophy | 153 | No | 1.3 | Flat | 0 | Normal | 0 | No
Male | Non-anginal pain | 130 | 256 | TRUE | LV hypertrophy | 142 | Yes | 0.6 | Flat | 1 | Fixed defect | 2 | Yes
Male | Atypical angina | 120 | 263 | FALSE | Normal | 173 | No | 0 | Upsloping | 0 | Reversable defect | 0 | No
Male | Non-anginal pain | 172 | 199 | TRUE | Normal | 162 | No | 0.5 | Upsloping | 0 | Reversable defect | 0 | No
Male | Non-anginal pain | 150 | 168 | FALSE | Normal | 174 | No | 1.6 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 110 | 229 | FALSE | Normal | 168 | No | 1 | Downsloping | 0 | Reversable defect | 1 | Yes
Male | Asymptomatic | 140 | 239 | FALSE | Normal | 160 | No | 1.2 | Upsloping | 0 | Normal | 0 | No
Female | Non-anginal pain | 130 | 275 | FALSE | Normal | 139 | No | 0.2 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 130 | 266 | FALSE | Normal | 171 | No | 0.6 | Upsloping | 0 | Normal | 0 | No
Male | Typical angina | 110 | 211 | FALSE | LV hypertrophy | 144 | Yes | 1.8 | Flat | 0 | Normal | 0 | No
Female | Typical angina | 150 | 283 | TRUE | LV hypertrophy | 162 | No | 1 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 120 | 284 | FALSE | LV hypertrophy | 160 | No | 1.8 | Flat | 0 | Normal | 1 | Yes
Male | Non-anginal pain | 132 | 224 | FALSE | LV hypertrophy | 173 | No | 3.2 | Upsloping | 2 | Reversable defect | 3 | Yes
Male | Asymptomatic | 130 | 206 | FALSE | LV hypertrophy | 132 | Yes | 2.4 | Flat | 2 | Reversable defect | 4 | Yes

CLEANED DATA:

Male | Typical angina | 145 | 233 | TRUE | LV hypertrophy | 150 | No | 2.3 | Downsloping | 0 | Fixed defect | 0 | No
Male | Asymptomatic | 160 | 286 | FALSE | LV hypertrophy | 108 | Yes | 1.5 | Flat | 3 | Normal | 2 | Yes
Male | Asymptomatic | 120 | 229 | FALSE | LV hypertrophy | 129 | Yes | 2.6 | Flat | 2 | Reversable defect | 1 | Yes
Male | Non-anginal pain | 130 | 250 | FALSE | Normal | 187 | No | 3.5 | Downsloping | 0 | Normal | 0 | No
Female | Atypical angina | 130 | 204 | FALSE | LV hypertrophy | 172 | No | 1.4 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 120 | 236 | FALSE | Normal | 178 | No | 0.8 | Upsloping | 0 | Normal | 0 | No
Female | Asymptomatic | 140 | 268 | FALSE | LV hypertrophy | 160 | No | 3.6 | Downsloping | 2 | Normal | 3 | Yes
Female | Asymptomatic | 120 | 354 | FALSE | Normal | 163 | Yes | 0.6 | Upsloping | 0 | Normal | 0 | No
Male | Asymptomatic | 130 | 254 | FALSE | LV hypertrophy | 147 | No | 1.4 | Flat | 1 | Reversable defect | 2 | Yes
Male | Asymptomatic | 140 | 203 | TRUE | LV hypertrophy | 155 | Yes | 3.1 | Downsloping | 0 | Reversable defect | 1 | Yes
Male | Asymptomatic | 140 | 192 | FALSE | Normal | 148 | No | 0.4 | Flat | 0 | Fixed defect | 0 | No
Female | Atypical angina | 140 | 294 | FALSE | LV hypertrophy | 153 | No | 1.3 | Flat | 0 | Normal | 0 | No
Male | Non-anginal pain | 130 | 256 | TRUE | LV hypertrophy | 142 | Yes | 0.6 | Flat | 1 | Fixed defect | 2 | Yes
Male | Atypical angina | 120 | 263 | FALSE | Normal | 173 | No | 0 | Upsloping | 0 | Reversable defect | 0 | No
Male | Non-anginal pain | 172 | 199 | TRUE | Normal | 162 | No | 0.5 | Upsloping | 0 | Reversable defect | 0 | No
Male | Non-anginal pain | 150 | 168 | FALSE | Normal | 174 | No | 1.6 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 110 | 229 | FALSE | Normal | 168 | No | 1 | Downsloping | 0 | Reversable defect | 1 | Yes
Male | Asymptomatic | 140 | 239 | FALSE | Normal | 160 | No | 1.2 | Upsloping | 0 | Normal | 0 | No
Female | Non-anginal pain | 130 | 275 | FALSE | Normal | 139 | No | 0.2 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 130 | 266 | FALSE | Normal | 171 | No | 0.6 | Upsloping | 0 | Normal | 0 | No
Male | Typical angina | 110 | 211 | FALSE | LV hypertrophy | 144 | Yes | 1.8 | Flat | 0 | Normal | 0 | No
Female | Typical angina | 150 | 283 | TRUE | LV hypertrophy | 162 | No | 1 | Upsloping | 0 | Normal | 0 | No
Male | Atypical angina | 120 | 284 | FALSE | LV hypertrophy | 160 | No | 1.8 | Flat | 0 | Normal | 1 | Yes
Male | Non-anginal pain | 132 | 224 | FALSE | LV hypertrophy | 173 | No | 3.2 | Upsloping | 2 | Reversable defect | 3 | Yes
Male | Asymptomatic | 130 | 206 | FALSE | LV hypertrophy | 132 | Yes | 2.4 | Flat | 2 | Reversable defect | 4 | Yes

DISCRETIZED DATA:

[60, *) | Male | {Atypical angina, Typical angina} | [139, *) | [223, 265) | TRUE | {LV hypertrophy} | [143, 162) | No | [1.4, *) | {Downsloping} | [*, 1) | {Fixed defect} | 0 | No
[60, *) | Male | {Asymptomatic} | [139, *) | [265, *) | FALSE | {LV hypertrophy} | [*, 143) | Yes | [1.4, *) | {Flat} | [2, *) | {Normal} | 2 | Yes
[60, *) | Male | {Asymptomatic} | [*, 123) | [223, 265) | FALSE | {LV hypertrophy} | [*, 143) | Yes | [1.4, *) | {Flat} | [2, *) | {Reversable defect} | 1 | Yes
[*, 52) | Male | {Non-anginal pain} | [123, 139) | [223, 265) | FALSE | {Normal} | [162, *) | No | [1.4, *) | {Downsloping} | [*, 1) | {Normal} | 0 | No
[*, 52) | Female | {Atypical angina, Typical angina} | [123, 139) | [*, 223) | FALSE | {LV hypertrophy} | [162, *) | No | [1.4, *) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[52, 60) | Male | {Atypical angina, Typical angina} | [*, 123) | [223, 265) | FALSE | {Normal} | [162, *) | No | [0.1, 1.4) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[60, *) | Female | {Asymptomatic} | [139, *) | [265, *) | FALSE | {LV hypertrophy} | [143, 162) | No | [1.4, *) | {Downsloping} | [2, *) | {Normal} | 3 | Yes
[52, 60) | Female | {Asymptomatic} | [*, 123) | [265, *) | FALSE | {Normal} | [162, *) | Yes | [0.1, 1.4) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[60, *) | Male | {Asymptomatic} | [123, 139) | [223, 265) | FALSE | {LV hypertrophy} | [143, 162) | No | [1.4, *) | {Flat} | [1, 2) | {Reversable defect} | 2 | Yes
[52, 60) | Male | {Asymptomatic} | [139, *) | [*, 223) | TRUE | {LV hypertrophy} | [143, 162) | Yes | [1.4, *) | {Downsloping} | [*, 1) | {Reversable defect} | 1 | Yes
[52, 60) | Male | {Asymptomatic} | [139, *) | [*, 223) | FALSE | {Normal} | [143, 162) | No | [0.1, 1.4) | {Flat} | [*, 1) | {Fixed defect} | 0 | No
[52, 60) | Female | {Atypical angina, Typical angina} | [139, *) | [265, *) | FALSE | {LV hypertrophy} | [143, 162) | No | [0.1, 1.4) | {Flat} | [*, 1) | {Normal} | 0 | No
[52, 60) | Male | {Non-anginal pain} | [123, 139) | [223, 265) | TRUE | {LV hypertrophy} | [*, 143) | Yes | [0.1, 1.4) | {Flat} | [1, 2) | {Fixed defect} | 2 | Yes
[*, 52) | Male | {Atypical angina, Typical angina} | [*, 123) | [223, 265) | FALSE | {Normal} | [162, *) | No | [*, 0.1) | {Upsloping} | [*, 1) | {Reversable defect} | 0 | No
[52, 60) | Male | {Non-anginal pain} | [139, *) | [*, 223) | TRUE | {Normal} | [162, *) | No | [0.1, 1.4) | {Upsloping} | [*, 1) | {Reversable defect} | 0 | No
[52, 60) | Male | {Non-anginal pain} | [139, *) | [*, 223) | FALSE | {Normal} | [162, *) | No | [1.4, *) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[*, 52) | Male | {Atypical angina, Typical angina} | [*, 123) | [223, 265) | FALSE | {Normal} | [162, *) | No | [0.1, 1.4) | {Downsloping} | [*, 1) | {Reversable defect} | 1 | Yes
[52, 60) | Male | {Asymptomatic} | [139, *) | [223, 265) | FALSE | {Normal} | [143, 162) | No | [0.1, 1.4) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[*, 52) | Female | {Non-anginal pain} | [123, 139) | [265, *) | FALSE | {Normal} | [*, 143) | No | [0.1, 1.4) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[*, 52) | Male | {Atypical angina, Typical angina} | [123, 139) | [265, *) | FALSE | {Normal} | [162, *) | No | [0.1, 1.4) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[60, *) | Male | {Atypical angina, Typical angina} | [*, 123) | [*, 223) | FALSE | {LV hypertrophy} | [143, 162) | Yes | [1.4, *) | {Flat} | [*, 1) | {Normal} | 0 | No
[52, 60) | Female | {Atypical angina, Typical angina} | [139, *) | [265, *) | TRUE | {LV hypertrophy} | [162, *) | No | [0.1, 1.4) | {Upsloping} | [*, 1) | {Normal} | 0 | No
[52, 60) | Male | {Atypical angina, Typical angina} | [*, 123) | [265, *) | FALSE | {LV hypertrophy} | [143, 162) | No | [1.4, *) | {Flat} | [*, 1) | {Normal} | 1 | Yes
[52, 60) | Male | {Non-anginal pain} | [123, 139) | [223, 265) | FALSE | {LV hypertrophy} | [162, *) | No | [1.4, *) | {Upsloping} | [2, *) | {Reversable defect} | 3 | Yes
[60, *) | Male | {Asymptomatic} | [123, 139) | [*, 223) | FALSE | {LV hypertrophy} | [*, 143) | Yes | [1.4, *) | {Flat} | [2, *) | {Reversable defect} | 4 | Yes

DISCRETIZED DATA:

2 1 0 2 1 1 2 1 0 2 2 0 1 0 0
2 1 2 2 2 0 2 0 1 2 1 2 0 2 1
2 1 2 0 1 0 2 0 1 2 1 2 2 1 1
0 1 1 1 1 0 0 2 0 2 2 0 0 0 0
0 0 0 1 0 0 2 2 0 2 0 0 0 0 0
1 1 0 0 1 0 0 2 0 1 0 0 0 0 0
2 0 2 2 2 0 2 1 0 2 2 2 0 3 1
1 0 2 0 2 0 0 2 1 1 0 0 0 0 0
2 1 2 1 1 0 2 1 0 2 1 1 2 2 1
1 1 2 2 0 1 2 1 1 2 2 0 2 1 1
1 1 2 2 0 0 0 1 0 1 1 0 1 0 0
1 0 0 2 2 0 2 1 0 1 1 0 0 0 0
1 1 1 1 1 1 2 0 1 1 1 1 1 2 1
0 1 0 0 1 0 0 2 0 0 0 0 2 0 0
1 1 1 2 0 1 0 2 0 1 0 0 2 0 0
1 1 1 2 0 0 0 2 0 2 0 0 0 0 0
0 1 0 0 1 0 0 2 0 1 2 0 2 1 1
1 1 2 2 1 0 0 1 0 1 0 0 0 0 0
0 0 1 1 2 0 0 0 0 1 0 0 0 0 0
0 1 0 1 2 0 0 2 0 1 0 0 0 0 0
2 1 0 0 0 0 2 1 1 2 1 0 0 0 0
1 0 0 2 2 1 2 2 0 1 0 0 0 0 0
1 1 0 0 2 0 2 1 0 2 1 0 0 1 1
1 1 1 1 1 0 2 2 0 2 0 2 2 3 1
2 1 2 1 0 0 2 0 1 2 1 2 2 4 1
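As a rough sketch of how a continuous column can be turned into the interval labels and integer codes shown on the two DISCRETIZED DATA slides, the code below uses the cut points 123 and 139, which appear in the intervals [*, 123), [123, 139), and [139, *). The sample values are the first four entries of the third field of the raw data. How the cut points themselves were chosen is not stated on the slides, so they are simply taken as given here.

```python
from bisect import bisect_right


def interval_code(value, cut_points):
    """Index of the half-open interval [c_i, c_(i+1)) that contains value."""
    return bisect_right(cut_points, value)


def interval_label(code, cut_points):
    """Render a code in the slide's notation, with * for an open end."""
    lower = "*" if code == 0 else cut_points[code - 1]
    upper = "*" if code == len(cut_points) else cut_points[code]
    return f"[{lower}, {upper})"


CUT_POINTS = [123, 139]            # taken from the intervals on the slide
raw_column = [145, 160, 120, 130]  # first four raw values of that column

codes = [interval_code(v, CUT_POINTS) for v in raw_column]
labels = [interval_label(c, CUT_POINTS) for c in codes]
print(codes)
# prints: [2, 2, 0, 1]
print(labels)
# prints: ['[139, *)', '[139, *)', '[*, 123)', '[123, 139)']
```

bisect_right makes each interval closed on the left and open on the right, so a value equal to a cut point, such as 139, falls into the interval that starts at that cut point.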

TRANSFORMED DATA:

1 0.5 0 1 0.5 0.5 1 0.5 0 1 1 0 0.5 0 0
1 0.5 1 1 1 0 1 0 0.5 1 0.5 1 0 1 0.5
1 0.5 1 0 0.5 0 1 0 0.5 1 0.5 1 1 0.5 0.5
0 0.5 0.5 0.5 0.5 0 0 1 0 1 1 0 0 0 0
0 0 0 0.5 0 0 1 1 0 1 0 0 0 0 0
0.5 0.5 0 0 0.5 0 0 1 0 0.5 0 0 0 0 0
1 0 1 1 1 0 1 0.5 0 1 1 1 0 3 0.5
0.5 0 1 0 1 0 0 1 0.5 0.5 0 0 0 0 0
1 0.5 1 0.5 0.5 0 1 0.5 0 1 0.5 0.5 1 1 0.5
0.5 0.5 1 1 0 0.5 1 0.5 0.5 1 1 0 1 0.5 0.5
0.5 0.5 1 1 0 0 0 0.5 0 0.5 0.5 0 0.5 0 0
0.5 0 0 1 1 0 1 0.5 0 0.5 0.5 0 0 0 0
0.5 0.5 0.5 0.5 0.5 0.5 1 0 0.5 0.5 0.5 0.5 0.5 1 0.5
0 0.5 0 0 0.5 0 0 1 0 0 0 0 1 0 0
0.5 0.5 0.5 1 0 0.5 0 1 0 0.5 0 0 1 0 0
0.5 0.5 0.5 1 0 0 0 1 0 1 0 0 0 0 0
0 0.5 0 0 0.5 0 0 1 0 0.5 1 0 1 0.5 0.5
0.5 0.5 1 1 0.5 0 0 0.5 0 0.5 0 0 0 0 0
0 0 0.5 0.5 1 0 0 0 0 0.5 0 0 0 0 0
0 0.5 0 0.5 1 0 0 1 0 0.5 0 0 0 0 0
1 0.5 0 0 0 0 1 0.5 0.5 1 0.5 0 0 0 0
0.5 0 0 1 1 0.5 1 1 0 0.5 0 0 0 0 0
0.5 0.5 0 0 1 0 1 0.5 0 1 0.5 0 0 0.5 0.5
0.5 0.5 0.5 0.5 0.5 0 1 1 0 1 0 1 1 3 0.5
1 0.5 1 0.5 0 0 1 0 0.5 1 0.5 1 1 4 0.5

Summary:

* Data preparation is a big issue for both warehousing and mining.
* Data preparation includes:
  - data cleaning and data integration
  - data reduction and feature selection
  - discretization
* Many methods have been developed, but this is still an active area of research.
