8323 Stats - Lesson 2: Principal Components Analysis (untellectualism)
Added: February 20, 2008. Category: Education. License: All Rights Reserved.

Comment by khairilnotodiputro (20 months ago): These presentation materials are good for teaching undergraduate students at my university. I would be happy if I were allowed to download them for education purposes. Thank you. Khairil Notodiputro, Bogor Agricultural University.

Presentation Transcript

Principal Components Analysis:
- Motivation: why principal components
- Describe some important characteristics of the correlation matrix, R
- Introduction to principal components

Slide2:
Example 1. Innovation and Research in Europe (Source: Eurostat). The data we will consider: information on innovation and research (we limit attention to a subset of variables). These variables are probably related, since they all concern the same topic.

Principal Components Analysis - Motivation:
The correlation matrix, R (remember that R is symmetric). We are interested in analyzing the relationships between variables. This could be done by inspecting the entries of R, but that is not simple, and it becomes harder as the number of variables grows.
Principal Components Analysis - Motivation:
Questions about the variables taken into account:
- Are they all correlated with one another?
- Are there groups of variables related to one another?
- Are there isolated variables?
- Which are the most relevant relationships?
- Is it possible to synthesize the variables, reducing the dimension?

PCA answers these questions by synthesizing the original variables. More precisely, the syntheses considered in PCA are linear combinations of the original variables, selected so as to reproduce the information in the data as well as possible (what we mean by information is explained later). Moreover, they describe the main tendencies in the data, i.e., they can be used to describe the relationships between the variables.

?1 How can we describe relationships between variables?
?2 If the variables are related, it should be possible to combine them, reducing the dimension of the problem (a lower number of variables). How should these syntheses be defined? How powerful are they in reproducing the original information? (Here, a definition of information must be provided.)

The data matrix:
- Data matrix: X
- Centroid (vector whose elements are the sample means): $\bar{x}$
- Variance and covariance matrix (matrix whose elements are the variances, on the diagonal, and the covariances): S
- Correlation matrix (matrix whose elements are the correlations, with 1's on the diagonal): R

Information about the (linear) relationships between variables is contained in S and in R. R is the var/cov matrix of the standardized variables, i.e., variables on a common scale (variances = 1, centroid = 0). Since we are going to work with linear combinations, it is sensible to work with standardized variables: in this way we combine comparable variables. Thus, we will consider the standardized data matrix, Z, with entries $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$, where $s_j$ is the standard deviation of the j-th variable.

Before proceeding…
More on vectors and matrices:
Given two vectors, v and u, in the K-dimensional space, we define their inner product as $v'u = \sum_{k=1}^{K} v_k u_k$.

Given one (n × p) matrix Z and one (p × 1) vector v, the product $Zv$ is the (n × 1) vector whose i-th element is $\sum_{j=1}^{p} z_{ij} v_j$.

NB: on the slide, Z and v are drawn as if they had the same number of rows. In fact the number of rows of Z (n) differs from the number of rows of v (p). What must coincide is the number of columns of Z and the number of rows of v.

PCA – preliminary concepts. Var/cov matrices decomposition:
In the following, we limit attention to the correlation matrix. The same concepts illustrated for R also apply to the var/cov matrix, S. Being a var/cov matrix (for standardized data), R is:
- a square matrix (number of rows = number of columns = number of variables, p);
- a symmetric matrix ($r_{jh} = r_{hj}$);
- a positive semi-definite matrix: for any (p × 1) vector v, $v'Rv \ge 0$.
We now consider the eigenvalues and the eigenvectors of R.

PCA – preliminary concepts. Eigenvalues/Eigenvectors:
Consider the (p × p) matrix R. An eigenvalue of R, $\lambda_k$, and its corresponding eigenvector, $g_{(k)}$, are respectively a real number and a (p × 1) vector satisfying the equation $R\,g_{(k)} = \lambda_k\,g_{(k)}$. Usually, for uniqueness, it is also required that $g_{(k)}'g_{(k)} = 1$.

To the (p × p) correlation matrix R, p eigenvalue/eigenvector pairs may be associated: the p eigenvalues of R, ordered from the highest to the lowest, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, and the p eigenvectors associated with them, $g_{(1)}, g_{(2)}, \dots, g_{(p)}$.

PCA – preliminary concepts. Eigenvalues/Eigenvectors:
The eigenvalues, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$:
- The eigenvalues are real numbers (since R is symmetric).
- The $\lambda_k$'s are non-negative (since R is positive semi-definite).
- The number of non-zero eigenvalues coincides with the rank of R.
The rank of R is the maximum number of rows (or columns) of R that are linearly independent (rows are linearly independent if none of them can be written as a linear combination of the others).

The eigenvalues of R are closely related to some characteristics of R: they reproduce the trace and the determinant of R. This means that the eigenvalues reproduce the total variance and the generalized variance of the standardized data. Due to the characteristics of a var/cov matrix, and hence of the correlation matrix R:
$\sum_{k=1}^{p} \lambda_k = \mathrm{tr}(R) = p$ (since the elements on the diagonal of R equal 1),
$\prod_{k=1}^{p} \lambda_k = \det(R)$.

PCA – preliminary concepts. Eigenvalues/Eigenvectors:
The eigenvectors: $g_{(1)}, g_{(2)}, \dots, g_{(p)}$.

PCA – preliminary concepts. Eigenvalues/Eigenvectors:
The p eigenvalues, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, and the p eigenvectors, $g_{(1)}, g_{(2)}, \dots, g_{(p)}$, associated with R can be arranged into the eigenvalue matrix and the eigenvector matrix, written here as $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ and $G = [\,g_{(1)}, \dots, g_{(p)}\,]$.

PCA – preliminary concepts. Decomposition Theorem:
Consider the correlation matrix R and the eigenvalue and eigenvector matrices associated with it. It can be shown that
$R = G \Lambda G' = \sum_{k=1}^{p} \lambda_k\, g_{(k)} g_{(k)}'$.
All the information in R can be obtained from the eigenvalues and eigenvectors. The (p × p) correlation matrix R can be rewritten as the sum of p matrices. Consider one of these matrices, $\lambda_k\, g_{(k)} g_{(k)}'$.

PCA – preliminary concepts. Decomposition Theorem:
The number of matrices necessary to obtain R coincides with the number of non-zero eigenvalues (the rank of R). A matrix is really important in the reconstruction of R if the corresponding eigenvalue is very high.
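The decomposition theorem can be checked numerically. The sketch below assumes NumPy and uses a made-up 3 × 3 correlation matrix (the values are illustrative, not the Eurostat data); the names `lam`, `G` and `terms` are chosen here for the example only.

```python
import numpy as np

# Hypothetical 3x3 correlation matrix (illustrative values, not the Eurostat data).
R = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(R)

# Reorder from the highest to the lowest eigenvalue, as in the slides.
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
G = eigvecs[:, order]          # column k holds the eigenvector g(k)

# One rank-one matrix per eigenvalue/eigenvector pair: lambda_k * g(k) g(k)'.
terms = [lam[k] * np.outer(G[:, k], G[:, k]) for k in range(len(lam))]
R_rebuilt = sum(terms)

print("eigenvalues:", lam)
print("sum of eigenvalues vs trace:", lam.sum(), np.trace(R))
print("product of eigenvalues vs determinant:", lam.prod(), np.linalg.det(R))
print("max reconstruction error:", np.abs(R - R_rebuilt).max())
```

Dropping the last entries of `terms` before summing gives the kind of approximation of R that the theorem suggests when the corresponding eigenvalues are small.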
If some eigenvalues, say the last (p – r) (remember that the eigenvalues are ordered), are very small, we can disregard the matrices associated with them and approximate R using only the first r matrices without losing much information.

Slide14:
PCA – preliminary concepts. Decomposition Theorem. Example 1. Innovation and Research in Europe. For the sake of simplicity we limit attention to only 3 variables. R, its eigenvalues and its eigenvectors have been obtained with SAS. [Tables on the slide: correlation matrix, R; eigenvalues and eigenvectors.] The first matrix (corresponding to the first eigenvalue/eigenvector pair) is the most important in the reconstruction of R (highest eigenvalue).

Slide15:
PCA – preliminary concepts. Decomposition Theorem. Example 1. Innovation and Research in Europe - 3 vars. R, its eigenvalues and its eigenvectors have been obtained with SAS. [On the slide: R written as the sum of three matrices, one per eigenvalue/eigenvector pair.] Matrices associated with the highest eigenvalues are the most important. The last matrix has relatively small entries (these are the errors we would incur if we disregarded it).

Slide16:
Principal Components – Just define them. Consider Z, the standardized data matrix, with n cases (rows) and p variables (columns). The centroid of Z is the origin, 0, and the var/cov matrix of Z is R. $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ and $g_{(1)}, g_{(2)}, \dots, g_{(p)}$ are the p eigenvalues and eigenvectors associated with R. The k-th principal component (PC) associated with Z is defined as
$y_{(k)} = Z\,g_{(k)}$, i.e., $y_{ik} = \sum_{j=1}^{p} z_{ij}\, g_{jk}$.
We have one value of the k-th PC (a score) for each observation. The PC is obtained as a linear combination of the measurements on the p original variables. The weight associated with the j-th original variable is the j-th element of the eigenvector, $g_{jk}$. More precisely, the PC score is a standardized linear combination (the sum of the squared weights is 1). NB: Z and $g_{(k)}$ do not have the same number of rows, BUT the number of columns of Z = p = the number of rows of $g_{(k)}$.

Slide17:
Principal Components – Just define them. Z is the standardized data matrix.
The centroid of Z is the origin, 0, and the var/cov matrix of Z is R. $\lambda_1 \ge \dots \ge \lambda_p$ and $g_{(1)}, \dots, g_{(p)}$ are the p eigenvalues and eigenvectors associated with R. We can define as many PCs as there are original variables, p. To the (n × p) original (standardized) data matrix we can associate the (n × p) matrix of principal component scores, whose k-th column is $y_{(k)} = Z\,g_{(k)}$. Its first column contains the n scores ("observations") on the first PC; its n-th row contains the p scores ("measurements") on the PCs for the n-th observation.

Slide18:
Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars. [Tables on the slide: original data matrix, X; standardized data matrix, Z.]

Slide19:
Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars. R, its eigenvalues and its eigenvectors have been obtained with SAS. [Tables on the slide: correlation matrix, R; eigenvalues and eigenvectors.]
First PC score for ITALY. To obtain it, we calculate a standardized linear combination of Italy's measurements on the original (standardized) variables, with weights given by the elements of the first eigenvector:
$y_{Italy,1} = (-0.14959 \times 0.638045) + (-0.52719 \times 0.390045) + (-0.16914 \times 0.663900) = -0.41336$
In the same way we can obtain the scores on the first PC for all the observations.

Slide20:
Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars.
Second and third PC scores for ITALY. We calculate the standardized linear combinations of the measurements on the original (standardized) variables, with weights given by the elements of the second and of the third eigenvector respectively:
$y_{Italy,2} = (-0.14959 \times (-0.359783)) + (-0.52719 \times 0.913317) + (-0.16914 \times (-0.190808)) = -0.3954$
$y_{Italy,3} = (-0.14959 \times 0.680775) + (-0.52719 \times 0.117116) + (-0.16914 \times (-0.723070)) = -0.04128$

Slide21:
Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars. [Tables on the slide: standardized variables; principal components.]

Slide22:
The PCs are centred on 0. The k-th PC has variance $\lambda_k$.
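As a quick check of the Italy example on Slides 19 and 20, the three scores can be recomputed in one matrix product. This is a minimal sketch assuming NumPy; the numbers are the ones quoted on the slides, while the variable names `z_italy`, `G` and `y_italy` are chosen here for illustration.

```python
import numpy as np

# Standardized measurements for Italy on the 3 variables (values from the slides).
z_italy = np.array([-0.14959, -0.52719, -0.16914])

# Eigenvectors of R as quoted on the slides (SAS output); column k is g(k).
G = np.array([
    [0.638045, -0.359783,  0.680775],
    [0.390045,  0.913317,  0.117116],
    [0.663900, -0.190808, -0.723070],
])

# Each score is the inner product of Italy's row with one eigenvector.
y_italy = z_italy @ G

# Sum of squared weights for each eigenvector (should be close to 1).
squared_norms = (G ** 2).sum(axis=0)

print("PC scores for Italy:", y_italy)
print("sum of squared weights per eigenvector:", squared_norms)
```

The printed scores agree with the slide values up to rounding in the last reported digit.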
Since the eigenvalues are ordered, the first PC has the highest variance, and the PCs have decreasing variances. The PCs are uncorrelated. The variances of the PCs are related to the trace of R (total variance) and to the determinant of R (generalized variance). Principal Components – Characteristics: since we have n observations on each PC, we can calculate the means, the variances and the covariances that characterize the principal components data matrix, Y.

Slide23:
Back to Multivariate samples – Transformations. Let us recall the different transformations of data matrices. The PC transformation looks very similar to the Mahalanobis transformation: from p correlated (standardized) variables we obtain p uncorrelated variables. However, data transformed according to Mahalanobis are standardized (all the variables have variance 1), while the PCs have different variances. Standardized PCs, on the other hand, do have variance 1. Hence: ZM = Standardized(Y), where Y is the matrix of principal components. [Diagram on the slide: X → Z → Y → Standardized(Y), and X → ZM.]

Slide24:
Back to Multivariate samples – Transformations. Example 1. Innovation and Research in Europe (consider the 2 variables internet_access and e_gov_avail from multivariate vectors and samples_2008.ppt). [Two scatter plots with ellipses on the slide.]
STANDARDIZED VARS: variances -> 1 (affects the ORIGINAL AXES); orientation (correlations) -> YES. Notice that the axes of the ellipse do NOT have similar lengths.
MAHALANOBIS TRANSFORMATION: variances -> 1; orientation (correlations) -> NO. Notice that the axes of the ellipse have SIMILAR lengths.

Slide25:
Back to Multivariate samples – Transformations. Example 1. Innovation and Research in Europe (consider the 2 variables internet_access and e_gov_avail from multivariate vectors and samples_2008.ppt). [Two scatter plots with ellipses on the slide.]
STANDARDIZED VARS: variances -> 1 (affects the ORIGINAL AXES); orientation (correlations) -> YES.
Second panel: variances -> not 1; orientation (correlations) -> NO. Notice that the axes of the ellipse have DIFFERENT lengths.
(The PCs have different variances.) This second panel is the PRINCIPAL COMPONENTS TRANSFORMATION.

Slide26:
Principal Components – Characteristics.
PC transformation: from p original (correlated) variables we obtain p uncorrelated variables. Standardized PCs are the original variables transformed according to Mahalanobis. Hence, the Euclidean distance evaluated on standardized PCs coincides with the Mahalanobis distance.
PC transformation (2 dimensions): rotate the original axes so that the ellipse lines up with the coordinate axes, and translate the space so that the ellipse's centre is at the origin. For higher-dimensional spaces: rotate the space so that the ellipsoid lines up with the coordinate axes, and translate the space so that the ellipsoid's centre is at the origin.
The eigenvectors of R describe the directions of the axes of the ellipse/ellipsoid. The PCs have different variances, and their variances coincide with the eigenvalues of R. It can be shown that the eigenvalues are related to the dispersion along the axes of the ellipse/ellipsoid. The largest eigenvalue is the variance along the direction of greatest variance, the next largest eigenvalue is the variance along the next direction of greatest variance, and so on. Hence, the first PC (highest variance) describes the direction of greatest variance, the second PC describes the next direction of greatest variance, and so on.

Slide27:
Principal Components – Characteristics. Example 1. Innovation and Research in Europe (consider the 2 variables internet_access and e_gov_avail from multivariate vectors and samples_2008.ppt). [Scatter plot on the slide with the PC1 and PC2 axes drawn.]
PC1 describes the direction of greatest variance, i.e., the most relevant relationship between the two variables. We can say here that the two variables are positively correlated (notice that they are both positively related to PC1). Some observations run against this tendency: we observe high values of internet_access but low levels of e_gov_avail (Cyprus). PC2 describes the possible deviations from the most relevant relationship/direction.
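The characteristics listed on Slides 22 to 26 (PCs centred on 0, uncorrelated, with variances equal to the eigenvalues, and Euclidean distance on standardized PCs equal to the Mahalanobis distance) can be verified on simulated data. The sketch below assumes NumPy; the data are randomly generated for illustration and have nothing to do with the Eurostat example, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 200 observations on p = 3 correlated variables.
n, p = 200, 3
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.0, 1.0, 0.4],
                   [0.0, 0.0, 1.0]])
X = rng.normal(size=(n, p)) @ mixing

# Standardize with divisor n - 1, so that the var/cov matrix of Z is R.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# Eigenvalues/eigenvectors of R, ordered from the highest to the lowest.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lam, G = eigvals[order], eigvecs[:, order]

Y = Z @ G                      # principal component scores
Y_std = Y / np.sqrt(lam)       # standardized PCs (variance 1)

# Var/cov matrix of Y: diagonal, with the eigenvalues on the diagonal.
S_Y = np.cov(Y, rowvar=False)

# Squared Mahalanobis distance of each observation from the centroid (origin) ...
d2_mahalanobis = np.einsum("ij,jk,ik->i", Z, np.linalg.inv(R), Z)
# ... and squared Euclidean distance from the origin on the standardized PCs.
d2_euclidean = (Y_std ** 2).sum(axis=1)

print("means of the PCs:", Y.mean(axis=0))
print("var/cov matrix of the PCs:\n", S_Y.round(6))
print("eigenvalues:", lam)
print("max distance discrepancy:", np.abs(d2_mahalanobis - d2_euclidean).max())
```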
In higher dimensions the same holds: the first PC describes the most relevant tendency of the cloud of p-dimensional points, the second PC describes the second tendency in order of importance, and so on. Notice that here importance means variance along the axes of the ellipsoid (or ellipse in the 2-dimensional case).

Principal Components Analysis - Motivation:
?1 How can we describe relationships between variables?
DONE: PCs are transformations of the original variables that capture and describe the most important relationships between them. PCs are extracted in order of importance, i.e., the 1st PC describes the most important tendency/relationship, the 2nd PC describes the next most important tendency, and so on.
If we consider all the PCs, all the tendencies of the cloud of points are described. No relationship remains unexplained. In this particular case, the transformation into PCs is simply a rotation of the original space so that the axes coincide with those of the ellipsoid and the relationships between variables are easier to describe and identify.
If we consider r < p PCs we lose information about the dispersion along the (p – r) axes of the ellipsoid with the lowest dispersions (hence the least informative). Remember that the variance along the j-th axis of the ellipsoid is the j-th eigenvalue. So the amount of lost information is $\sum_{j=r+1}^{p} \lambda_j$.

Principal Components Analysis – Syntheses of original vars:
Consider p (standardized) variables, and suppose we are interested in defining a synthesis of these variables.
?2 Is it possible to synthesize variables, reducing dimension? Let us start with a simple problem:
- One synthesis: $Z_1, Z_2, \dots, Z_p \rightarrow Y_1$.
- Which kind of synthesis: for simplicity we consider linear combinations of the original variables, $Y_1 = a_{11} Z_1 + a_{12} Z_2 + \dots + a_{1p} Z_p$.
- Which kind of linear combination: we consider standardized linear combinations, $a_1'a_1 = \sum_{j=1}^{p} a_{1j}^2 = 1$.

Principal Components Analysis – Syntheses of original vars:
If we consider our standardized data matrix, we have the (n × 1) vector $Z a_1$, whose i-th element is $z_i'a_1 = \sum_{j=1}^{p} a_{1j} z_{ij}$. Notice that by combining variables, we replace each p-dimensional row (one measurement on each variable) with a row having only one column. We can expect that something is lost with this operation. We want to determine the "best" synthesis of the original variables, i.e., which vector $a_1$ we should select to obtain a standardized linear combination of the original variables that loses as little information as possible. What is the information we should preserve when going from p-dimensional points to 1-dimensional points?

Principal Components Analysis – Projections of OBS:
Consider the original space and one observation, $z_i$, in this space. Consider the vector $a_1$: it can be viewed as a one-dimensional subspace of the original space. The linear combination $z_i'a_1$ is the coordinate of the projection of $z_i$ onto $a_1$. The projection of a point onto a subspace is an approximation of the position of the point in the original space. The approximation error is the distance between the point and its projection. It is quite natural to select the standardized vector $a_1$ so that the distances between the original p-dimensional points and their 1-dimensional projections are minimized. Hence, we can use the sum of the squared approximation errors to evaluate a synthesis.

Slide32:
Consider two p-dimensional observations in the original space, $z_i$ and $z_h$. Consider the vector $a_1$ and the projections of these points onto it.
The distance between the two observations in the new space (1 dimension) approximates the distance between them in the original space (p dimensions). By reducing the dimension of the space from p to 1, we deform the distances between points: points that are relatively "far" apart in the original space may be "close" in the new one. Principal Components Analysis – Distances between OBS: it is quite natural to select the standardized vector $a_1$ so that the distances between the original p-dimensional points are preserved as well as possible in the 1-dimensional space where their projections lie. Notice that if the points are well represented in the new space (distance from the origin, previous slide), the distances between them are well reproduced too.

Slide33:
SYNTHESIS OF VARIABLES: LOOK AT OBSERVATIONS. From the point of view of the observations, the synthesis of the original variables should be characterized by low approximation errors, E. The approximation error for one observation is the distance between the point in the original p-dimensional space and its projection in the 1-dimensional space. Of course, we have to synthesize all the approximation errors: the sum of the squared errors, $E = \sum_{i=1}^{n} \lVert z_i - (z_i'a_1)\,a_1 \rVert^2$, SHOULD BE SMALL.

Slide34:
Principal Components Analysis – Looking at the VARS. So far we have considered what happens to the observations when we synthesize the original variables, and described some characteristics of the observation space that should ideally be well reproduced in the new space. Now, let us consider the VARIABLES. The synthesis we are going to define should also reproduce the information provided by each variable. This means that the synthesis should reproduce the variances of the original variables as well as possible or, in other words, that it should be as highly correlated as possible with the original variables. Remember that the coefficient of determination between two variables (the squared correlation coefficient) measures the extent to which one variable can reproduce the variance of the other.
Hence, for a generic original variable, $z_{(j)}$, our attention will be focused on $r^2(z_{(j)}, Z a_1)$, the squared correlation between $z_{(j)}$ and the synthesis. If this coefficient of determination is high, the standardized linear combination adequately explains $z_{(j)}$.

Slide35:
SYNTHESIS OF VARIABLES: LOOK AT THE VARIABLES. From the point of view of the variables, the synthesis of the original variables should be characterized by high coefficients of determination with all the variables. Of course, we have to synthesize all these coefficients (for instance through their sum over the p variables): the result SHOULD BE HIGH.

Principal Components Analysis – Syntheses of original vars:
We are interested in finding a synthesis characterized by low (synthesis of) approximation errors (observations) and high (synthesis of) coefficients of determination (variables). It can be shown that the best standardized linear combination in both senses is the one with maximum variance (variance is information). Thus, we should choose $a_1$ (the weights assigned to the original variables to obtain the synthesis) so as to maximize $\mathrm{Var}(Z a_1)$ subject to $a_1'a_1 = 1$.

Principal Components Analysis – Syntheses of original vars:
It can be shown that $\max \mathrm{Var}(Z a_1) = \lambda_1$, the first eigenvalue of R (remember that this is the highest eigenvalue of R), and that $\mathrm{Var}(Z g_{(1)}) = \lambda_1$. Hence the vector maximizing the variance is the eigenvector corresponding to $\lambda_1$. Remember that $Z g_{(1)} = y_{(1)}$, the first principal component. Thus the standardized linear combination with maximum variance is the first PC. Conclusion: the best way to synthesize (standardized) data is the first principal component.

Principal Components Analysis – More Syntheses:
It can be shown that, among the standardized linear combinations $Z a_2$ that are not correlated with the first PC, $\max \mathrm{Var}(Z a_2) = \lambda_2$, the second eigenvalue of R, and that $\mathrm{Var}(Z g_{(2)}) = \lambda_2$. The standardized linear combination we are looking for is $Z g_{(2)} = y_{(2)}$, the second principal component.
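The maximum-variance property of the first PC can be illustrated numerically: no standardized linear combination drawn at random should have a variance above $\lambda_1$. This is only a sketch on simulated data, assuming NumPy; the data matrix and all names are made up for the illustration, and a random search is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical standardized data matrix Z (n = 500 observations, p = 4 variables).
n, p = 500, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.cov(Z, rowvar=False)    # var/cov matrix of Z, i.e. the correlation matrix

# eigh sorts eigenvalues in ascending order: the last pair is the largest.
eigvals, eigvecs = np.linalg.eigh(R)
lam1 = eigvals[-1]
g1 = eigvecs[:, -1]

# Variance of the first PC, Z g(1).
var_pc1 = np.var(Z @ g1, ddof=1)

# Variances of 10,000 random standardized linear combinations (a'a = 1).
A = rng.normal(size=(p, 10_000))
A /= np.linalg.norm(A, axis=0)
var_random = np.var(Z @ A, axis=0, ddof=1)

print("lambda_1:", lam1)
print("Var(Z g(1)):", var_pc1)
print("largest variance among the random combinations:", var_random.max())
```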
More precisely, the second synthesis is defined as follows. Suppose we are interested in finding another standardized linear combination, $Z a_2$, such that:
- $Z a_2$ is not correlated with the first PC;
- $Z a_2$ has maximum variance.
This means that we are looking for a second synthesis that reduces as much as possible the approximation error of the first principal component, i.e., the information we lose when replacing the original variables with the first PC.

Principal Components Analysis – More Syntheses:
- The first PC, PC1, is the standardized linear combination with maximum variance (here variance means information). This variance is $\lambda_1$.
- The second PC, PC2, is the standardized linear combination that is not correlated with PC1 and has maximum variance. This variance is $\lambda_2$.
- …
- The k-th PC, PCk, is the standardized linear combination that is not correlated with the first (k – 1) PCs and has maximum variance. This variance is $\lambda_k$.
- …
The number of PCs we can extract is p, the number of original variables. The p PCs are standardized linear combinations extracted in decreasing order of importance (importance = variance).

Principal Components Analysis – More Syntheses:
Example 1. Innovation and Research in Europe (consider the 2 variables internet_access and e_gov_avail from multivariate vectors and samples_2008.ppt). [Scatter plot on the slide with the PC1 and PC2 axes drawn.]
Notice that if we consider as many PCs as there are original variables, p, the PC transformation consists only of a rotation of the original axes. Notice moreover that the distance of a point from the origin in the original space coincides with its distance from the origin in the space generated by the p PCs. Thus, if we consider all the PCs, the original space is perfectly reproduced. Moreover, remember that the eigenvalues (all of them) perfectly reproduce the total variance and the generalized variance of the original space.
Thus the total and the generalized variance in the space generated by the p PCs coincide with those of the original space.

Principal Components Analysis - Motivation:
?2 Is it possible to synthesize variables, reducing dimension?
DONE: PCs are the standardized linear combinations of the original variables with the lowest approximation errors and the highest coefficients of determination. PCs are extracted in order of importance.
If we consider all the PCs, the p syntheses jointly reproduce the original space (the new space generated by uncorrelated variables is a rotation of the original space). No approximation error is incurred and all the original variables are perfectly reproduced (coefficients of determination all equal to 1).
If we consider r < p PCs we incur an approximation error and we lose information about the variance of one or more original variables. It can be shown that the amount of lost information (as defined above) when considering only the first r PCs is related to the variances of the excluded PCs, which are the last (p – r) eigenvalues:
Lost information $= \sum_{k=r+1}^{p} \lambda_k$,
where lost information = synthesis of squared approximation errors, unexplained portion of variance.

Principal Components Analysis – Subset of PC's:
Suppose we consider only the first r principal components. The amount of lost information (variance along the (p – r) axes of the ellipsoid with the lowest dispersions, approximation errors, portion of unexplained variance) is $\sum_{k=r+1}^{p} \lambda_k$. If the last (p – r) eigenvalues are small, the last (p – r) PCs may be disregarded without losing too much information.
Remember: p eigenvalues/eigenvectors are needed to reproduce the correlation matrix R perfectly (as a sum of p matrices). If the last (p – r) eigenvalues are low, we can approximate R using the first r matrices.
IMPORTANT: we are considering syntheses of lost information.
It may happen that some observations/variables are much better explained than others.

Principal Components Analysis – Subset of PC's:
To measure the informative content of the first r PCs we should refer to the sum of their variances (the first r eigenvalues). The sum of all the eigenvalues coincides with tr(R) = p. Hence, to evaluate the first r PCs we consider
$\frac{\sum_{k=1}^{r} \lambda_k}{p}$,
the proportion of total variance accounted for by the first r PCs. This index ranges between 0 and 1 (1 is reached when all the PCs or, better, all the PCs with variance, i.e., eigenvalue, greater than zero are taken into account). The relative importance of a given PC, say the k-th, is given by $\lambda_k / p$.

Principal Components Analysis:
a. How can we describe relationships between variables?
b. Is it possible to synthesize variables, reducing dimension?
USE r PRINCIPAL COMPONENTS.
Some questions immediately arise:
- How many PCs should we consider? (a. Which tendencies should we consider or ignore? b. How many syntheses should we take into account?)
- What is the relationship between one PC and the original variables? (a. Which are the main tendencies? b. Which variables are the most relevant in determining these tendencies, and thus the most relevant in the combination?)
- How do the selected PCs reproduce the original variables? (a. Which are the correlated variables inducing the main tendencies? Which are the isolated variables? b. Do the selected PCs lose more information with respect to particular variables?)
- How do the selected PCs describe the observations? (a. Which observations lie along, and so dominate or define, the main tendencies? Which observations lie along the ignored directions? b. How well do the selected PCs reproduce the observations? Are there observations with relatively high approximation errors?)
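As a small numerical illustration of the proportion of total variance accounted for by the first r PCs, and of the corresponding lost information, the sketch below assumes NumPy and a made-up set of four eigenvalues (chosen so that they sum to p = 4, as required for a correlation matrix); they are not taken from the Eurostat example.

```python
import numpy as np

# Hypothetical eigenvalues of a 4x4 correlation matrix, ordered from the
# highest to the lowest (illustrative values; they sum to p = 4 = tr(R)).
lam = np.array([2.4, 1.0, 0.4, 0.2])
p = lam.size

relative_importance = lam / p          # lambda_k / p for each PC
explained = np.cumsum(lam) / p         # (lambda_1 + ... + lambda_r) / p
lost = 1.0 - explained                 # share of variance of the excluded PCs

for r in range(1, p + 1):
    print(f"r = {r}: explained = {explained[r - 1]:.3f}, lost = {lost[r - 1]:.3f}")
```

With these made-up values the first two PCs already account for 85% of the total variance, which is the kind of figure one would weigh when deciding how many PCs to keep.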
8323 Stats - Lesson 2: Principle Components Analys untellectualism Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINT lite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 324 Category: Education License: All Rights Reserved Like it (0) Dislike it (0) Added: February 20, 2008 This Presentation is Public Favorites: 0 Presentation Description 8323 Stats - Lesson 2: Principle Components Analysis Comments Posting comment... By: khairilnotodiputro (20 month(s) ago) This presentation materials is good for teaching undergraduate students at my university. I would be happy if I am allowed to download this materials for education purposes, Thank you. Khairil Notodiputro Bogor Agricultural University Saving..... Post Reply Close Saving..... Edit Comment Close Premium member Presentation Transcript Principal Components Analysis: Principal Components Analysis Motivation: why principal components Describe some important characteristics of the correlation matrix, R Introduction to principal componentsSlide2: We consider information on innovation and research (we limit attention to a subset of vars). These variables are probably related, since they are supposed to be related to the same topic. Example1. Innovation and Research in Europe (Source: Eurostat) The data we will considerPrincipal Components Analysis - Motivation: Principal Components Analysis - Motivation The correlation matrix, R (remember R is symmetric) We are interested in analyzing the relationships between vars. This could be done by considering the entries in R, but this operation is not simple and it is even more complicated when the nr of vars is high. 
Principal Components Analysis - Motivation: Principal Components Analysis - Motivation The variables taken into account are all correlated one to each other? Groups of vars related one to each other? Isolated vars? Which are the most relevant relationships? Is it possible to synthesize vars, reducing dimension? PCA answers to these questions by synthesizing the original vars. More precisely, the syntheses considered in PCA are linear combinations of the original vars. These linear combinations are selected so as to reproduce at best information (we will also explain what we mean with information). Moreover, they provide a description of the main tendencies in data, i.e., they can be used to describe the relationships between the vars ?1 How can we describe relationships between vars? If the vars are related, it should be possible to combine them, reducing the dimension of the problem – lower nr of vars. How should these syntheses be defined? How much powerful they are in reproducing the original information? (Here, a definition of information is to be provided). ?2The data matrix: The data matrix Data matrix: X Centroid (vector whose elements are the sample means) : Variance and Covariances matrix (matrix whose elements are the variances – on the diagonal – and the covariances): S Correlations matrix (matrix whose elements are the correlations – 1’s on the diagonal): R Information about the (linear) relationships between vars is contained in S and in R. R is the var/cov matrix of standardized variables, i.e., vars with the same unit of measurement (variances=1, centroid = 0). Since we are going to work with linear combinations, it is sensible to work with standardized vars – in this way we are combining comparable vars. Thus, we will consider the standardized data matrix:Before proceeding…. More on vectors and matrices: Before proceeding…. 
More on vectors and matrices: Given two vectors, v and u, in the K-dimensional space, we define their internal (inner) product as: $v'u = \sum_{k=1}^{K} v_k u_k$. Given one (n × p) matrix Z and one (p × 1) vector v, the product $Zv$ is the (n × 1) vector whose i-th element is $\sum_{j=1}^{p} z_{ij} v_j$, i.e., the internal product between the i-th row of Z and v. NB: written side by side, Z and v may "graphically" appear to have the same nr of rows. Nevertheless, the nr of rows of Z (n) differs from the nr of rows of v (p). Instead, the nr of columns of Z must coincide with the nr of rows of v.

PCA – preliminary concepts. Var/cov matrices decomposition: In the following, we limit attention to the correlation matrix. Nevertheless, the same concepts illustrated for R also apply to the var/cov matrix, S. Being a variance/covariance matrix (for standardized data), R is: a square matrix (nr of rows = nr of columns = nr of vars, p); a symmetric matrix ($r_{jh} = r_{hj}$); a positive semi-definite matrix, i.e., for any (p × 1) vector v, $v'Rv \ge 0$. We now consider the eigenvalues and the eigenvectors of R.

PCA – preliminary concepts. Eigenvalues/Eigenvectors: Consider the (p × p) matrix R. An eigenvalue of R, $\lambda_k$, and its corresponding eigenvector, $g_{(k)}$, are respectively a real number and a non-null (p × 1) vector satisfying the following equation: $R\,g_{(k)} = \lambda_k\,g_{(k)}$. Usually, for uniqueness (up to sign), it is required that: $g_{(k)}'g_{(k)} = 1$. With the (p × p) correlation matrix R, p pairs of eigenvalues/eigenvectors may be associated: the p eigenvalues of R, ordered from the highest to the lowest, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, and the p eigenvectors associated with them, $g_{(1)}, g_{(2)}, \dots, g_{(p)}$.

PCA – preliminary concepts. Eigenvalues/Eigenvectors: The eigenvalues, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$: The eigenvalues are real numbers (since R is symmetric). The $\lambda_k$'s are non-negative (since R is positive semi-definite). The number of non-zero eigenvalues coincides with the rank of R.
The rank of R is the maximum number of rows (columns) of R which are linearly independent (rows are linearly independent if none of them can be written as a linear combination of the others). The eigenvalues of R are closely related to some characteristics of R. Due to the characteristics of a var/cov matrix, and hence of the correlation matrix R, the eigenvalues reproduce the trace and the determinant of R: $\mathrm{tr}(R) = \sum_{k=1}^{p} \lambda_k$ and $\det(R) = \prod_{k=1}^{p} \lambda_k$. This means that the eigenvalues reproduce the total variance and the generalized variance of the standardized data. Since the elements on the diagonal of R are all equal to 1, $\mathrm{tr}(R) = p$, and hence $\sum_{k=1}^{p} \lambda_k = p$.

PCA – preliminary concepts. Eigenvalues/Eigenvectors: The eigenvectors, $g_{(1)}, g_{(2)}, \dots, g_{(p)}$: they have unit length ($g_{(k)}'g_{(k)} = 1$) and, since R is symmetric, they can be taken to be mutually orthogonal ($g_{(k)}'g_{(h)} = 0$ for $k \ne h$).

PCA – preliminary concepts. Eigenvalues/Eigenvectors: The p eigenvalues, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, and the p eigenvectors, $g_{(1)}, g_{(2)}, \dots, g_{(p)}$, associated with R can be arranged into the eigenvalue and eigenvector matrices: $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ and $G = [\,g_{(1)}, g_{(2)}, \dots, g_{(p)}\,]$.

PCA – preliminary concepts. Decomposition Theorem: Let us consider the correlation matrix R and the eigenvalue and eigenvector matrices associated with it. It can be shown that: $R = G \Lambda G' = \sum_{k=1}^{p} \lambda_k\, g_{(k)} g_{(k)}'$. All the information in R can be obtained on the basis of eigenvalues and eigenvectors. The (p × p) correlation matrix R can be rewritten as the sum of p matrices. Let us consider one of these matrices: $\lambda_k\, g_{(k)} g_{(k)}'$ is a (p × p) matrix whose entries are scaled by the eigenvalue $\lambda_k$.

PCA – preliminary concepts. Decomposition Theorem: The number of matrices necessary to obtain R coincides with the number of non-null eigenvalues (the rank of R). A matrix is really important in the reconstruction of R if the corresponding eigenvalue is very high.
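As a numerical aside on the decomposition theorem, here is a minimal sketch in Python/NumPy. Two assumptions: the slides obtain their results with SAS, not Python, and the 3 × 3 matrix below is a hypothetical correlation matrix chosen for illustration, not the Eurostat one. The sketch checks that the eigenvalues reproduce the trace and the determinant, that R is the sum of the p matrices $\lambda_k g_{(k)} g_{(k)}'$, and how large the error is when the matrix of the smallest eigenvalue is disregarded.

```python
import numpy as np

# Illustrative 3 x 3 correlation matrix (hypothetical values, NOT the Eurostat example).
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
p = R.shape[0]

# eigh is designed for symmetric matrices and returns eigenvalues in ascending
# order, so we reorder them from the highest to the lowest, as in the slides.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]      # lambda_1 >= ... >= lambda_p
G = eigvecs[:, order]     # columns are g(1), ..., g(p)

# The eigenvalues reproduce the trace and the determinant of R.
print("sum of eigenvalues:", lam.sum(), "| trace:", np.trace(R))
print("product of eigenvalues:", lam.prod(), "| determinant:", np.linalg.det(R))

# Decomposition theorem: R is the sum of the p matrices lambda_k g(k) g(k)'.
terms = [lam[k] * np.outer(G[:, k], G[:, k]) for k in range(p)]
R_full = sum(terms)

# Approximation using only the first p - 1 matrices.
R_approx = sum(terms[:p - 1])
max_err = np.abs(R - R_approx).max()   # largest entry of the disregarded matrix
print("largest approximation error:", max_err, "| smallest eigenvalue:", lam[-1])
```

The entries of the disregarded matrix cannot exceed the corresponding eigenvalue in absolute value, because the eigenvector has unit length. This is why a small last eigenvalue means a small approximation error.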
If some eigenvalues, say the last (p – r) (remember that the eigenvalues are ordered), are very small, we could disregard the matrices associated with them and approximate R using only the first r matrices, without losing much information.

Slide14: PCA – preliminary concepts. Decomposition Theorem. Example 1. Innovation and Research in Europe. For the sake of simplicity we limit attention to only 3 vars. R, its eigenvalues and its eigenvectors have been obtained with SAS. [Tables: correlation matrix, R; eigenvalues and eigenvectors.] The first matrix (corresponding to the first eigenvalue/eigenvector pair) is the most important in the reconstruction of R (highest eigenvalue).

Slide15: PCA – preliminary concepts. Decomposition Theorem. Example 1. Innovation and Research in Europe - 3 vars. R, its eigenvalues and its eigenvectors have been obtained with SAS. [Tables: correlation matrix, R; eigenvalues and eigenvectors; R written as the sum of three matrices.] The matrices associated with the highest eigenvalues are the most important. The last matrix has relatively low entries (these are the errors we would incur if we disregarded it).

Slide16: Principal Components – Just define them. Consider Z, the standardized data matrix, with n cases (rows) and p vars (columns). The centroid of Z is the origin, 0; the variance and covariance matrix of Z is R. $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ and $g_{(1)}, g_{(2)}, \dots, g_{(p)}$ are the p eigenvalues and eigenvectors associated with R. The k-th principal component (PC) associated with Z is defined as: $y_{(k)} = Z\,g_{(k)}$, i.e., $y_{ik} = \sum_{j=1}^{p} z_{ij}\, g_{jk}$ for the i-th observation. We have one value of the k-th PC (score) for each observation. The PC is obtained as a linear combination of the measurements on the p original variables. The weight associated with the j-th original variable is the j-th element of the eigenvector, $g_{jk}$. More precisely, the PC score is a standardized linear combination (sum of the squared weights = 1). NB: Z and $g_{(k)}$ do not have the same nr of rows, BUT nr of columns of Z = p = nr of rows of $g_{(k)}$.

Slide17: Principal Components – Just define them. Z is the standardized data matrix.
The centroid of Z is the origin, 0; the variance and covariance matrix of Z is R. $\lambda_1 \ge \dots \ge \lambda_p$ and $g_{(1)}, \dots, g_{(p)}$ are the p eigenvalues/eigenvectors associated with R. We can define as many PC's as there are original vars, p. With the (n × p) original (standardized) data matrix, the (n × p) matrix of the principal component scores may be associated: $Y = ZG = [\,y_{(1)}, \dots, y_{(p)}\,]$. Its first column contains the n "observations" (scores) on the first PC; its last row contains the p "measurements" (scores on the p PC's) for the n-th obs.

Slide18: Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars. [Tables: original data matrix, X; standardized data matrix, Z.]

Slide19: Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars. R, its eigenvalues and its eigenvectors have been obtained with SAS. [Tables: correlation matrix, R; eigenvalues and eigenvectors.] ITALY: first PC score for Italy. To obtain it, we calculate a standardized linear combination (l.c.) of the measurements on the original (standardized) variables, with weights given by the elements of the first eigenvector: $y_{\mathrm{Italy},1} = (-0.14959 \times 0.638045) + (-0.52719 \times 0.390045) + (-0.16914 \times 0.663900) = -0.41336$. In the same way we can obtain the scores on the first PC for all the observations.

Slide20: Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars. R, its eigenvalues and its eigenvectors have been obtained with SAS. [Tables: correlation matrix, R; eigenvalues and eigenvectors.] ITALY: second and third PC scores for Italy. Calculate the standardized l.c. of the measurements on the original (standardized) variables, with weights given by the elements of the second and of the third eigenvector respectively: $y_{\mathrm{Italy},2} = (-0.14959 \times -0.359783) + (-0.52719 \times 0.913317) + (-0.16914 \times -0.190808) = -0.3954$ and $y_{\mathrm{Italy},3} = (-0.14959 \times 0.680775) + (-0.52719 \times 0.117116) + (-0.16914 \times -0.723070) = -0.04128$.

Slide21: Principal Component Transformation. Example 1. Innovation and Research in Europe - 3 vars. [Tables: standardized variables; principal components.]

Slide22: The PC's are centred on 0. The k-th PC has variance $\lambda_k$.
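Returning for a moment to the Italy example of the Principal Component Transformation slides, the three scores can be checked in a few lines. This is a minimal sketch in Python/NumPy (an assumption, since the slides use SAS). The standardized measurements for Italy and the eigenvector elements are the figures quoted in the slides; arranging the eigenvectors as the columns of a matrix named `G` is our own convention.

```python
import numpy as np

# Standardized measurements for Italy on the 3 vars (figures quoted in the slides).
z_italy = np.array([-0.14959, -0.52719, -0.16914])

# Eigenvectors of R as quoted in the slides; column k holds g(k).
G = np.array([[0.638045, -0.359783,  0.680775],
              [0.390045,  0.913317,  0.117116],
              [0.663900, -0.190808, -0.723070]])

# Each PC score is the internal product between Italy's row of Z and one eigenvector.
y_italy = z_italy @ G
print("PC scores for Italy:", y_italy)

# The weights define standardized linear combinations: G'G should be close to I.
print("G'G =")
print(np.round(G.T @ G, 4))
```

The printed scores should agree with the values reported in the slides (–0.41336, –0.3954 and –0.04128) up to rounding of the published figures.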
Since the eigenvalues are ordered, the first PC has the highest variance, and the PC's have decreasing variances. The PC's are uncorrelated. The variances of the PC's are related to the trace of R (total variance) and to the determinant of R (generalized variance): $\sum_{k=1}^{p} \mathrm{Var}(y_{(k)}) = \sum_{k=1}^{p} \lambda_k = \mathrm{tr}(R) = p$ and $\prod_{k=1}^{p} \mathrm{Var}(y_{(k)}) = \prod_{k=1}^{p} \lambda_k = \det(R)$. Principal Components – Characteristics: since we have n observations on each PC, we can calculate the means, the variances and the covariances characterizing the principal components data matrix, Y.

Slide23: Back to Multivariate samples – Transformations. Let us recall the different transformations of data matrices. [Diagram: X → Z → Y → Standardized (Y); X → ZM.] The PC transformation looks very similar to the Mahalanobis transformation: from p correlated (standardized) vars we obtain p uncorrelated vars. Nevertheless, data transformed according to Mahalanobis are standardized (all the vars have variance = 1), while the PC's are characterized by different variances. BUT standardized PC's are characterized by variance = 1. Hence: ZM = Standardized (Y) (Y = principal components), at least up to a rotation of the axes, depending on how the Mahalanobis transformation is defined; the distances between obs are the same in either case.

Slide24: Back to Multivariate samples – Transformations. Example 1. Innovation and Research in Europe (consider the 2 vars in multivariate vectors and samples_2008.ppt: internet_access, e_gov_avail). [Scatter plots with ellipses.] STANDARDIZED VARS: Variances -> 1 (affects ORIGINAL AXES); Orientation (correlations) -> YES. Notice that the axes of the ellipse DO NOT have similar lengths. MAHALANOBIS TRANSFORMATION: Variances -> 1; Orientation (correlations) -> NO. Notice that the axes of the ellipse have SIMILAR LENGTHS.

Slide25: Back to Multivariate samples – Transformations. Example 1. Innovation and Research in Europe (consider the 2 vars in multivariate vectors and samples_2008.ppt: internet_access, e_gov_avail). [Scatter plots with ellipses.] STANDARDIZED VARS: Variances -> 1 (affects ORIGINAL AXES); Orientation (correlations) -> YES. After the transformation to PC's: Variances -> are not 1; Orientation (correlations) -> NO. Notice that the axes of the ellipse have DIFFERENT LENGTHS.
(The PC's have different variances under the PRINCIPAL COMPONENTS TRANSF.)

Slide26: Principal Components – Characteristics. PC transformation: from p original (correlated) vars we obtain p uncorrelated vars. Standardized PC's are the original vars transformed according to Mahalanobis (up to a rotation of the axes). Hence, the Euclidean distance evaluated on standardized PC's coincides with the Mahalanobis distance. PC transformation (2 dim): translate the space so that the ellipse's centre is at the origin, and rotate the original axes so that the ellipse lines up with the coordinate axes. For higher-dimensional spaces: translate the space so that the ellipsoid's centre is at the origin, and rotate it so that the ellipsoid lines up with the coordinate axes. The eigenvectors of R describe the directions of the axes of the ellipse/ellipsoid. The PC's have different variances, and their variances coincide with the eigenvalues of R. It can be shown that the eigenvalues are related to the dispersion along the axes of the ellipse/ellipsoid. The largest eigenvalue corresponds to the variance along the direction of greatest variance, the next largest eigenvalue corresponds to the variance along the next direction of greatest variance, and so on. Hence, the first PC (highest variance) describes the direction of greatest variance, the second PC describes the next direction of greatest variance, and so on.

Slide27: Principal Components – Characteristics. Example 1. Innovation and Research in Europe (consider the 2 vars in multivariate vectors and samples_2008.ppt: internet_access, e_gov_avail). [Scatter plot with the PC1 and PC2 axes.] PC1 describes the direction of greatest variance. It describes the most relevant relationship between the two vars considered. We can say here that the two vars are positively correlated (notice that they are both positively related to PC1). Some obs run against this tendency: we observe high values of Internet_access but low levels of E_gov_avail (Cyprus). PC2 describes the possible deviations from the most relevant relationship/direction.
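As an aside on the characteristics listed in the Principal Components – Characteristics slides, the following is a minimal sketch in Python/NumPy (an assumption; the slides use SAS). The data are synthetic, generated with a fixed seed, and have nothing to do with the Eurostat example. The sketch checks that the PC's are centred on 0, that they are uncorrelated with variances equal to the eigenvalues of R, and that the Euclidean distance between two obs on the standardized PC's coincides with their Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): n = 200 obs on p = 3 correlated vars.
n, p = 200, 3
mixing = np.array([[1.0, 0.0, 0.0],
                   [0.8, 0.6, 0.0],
                   [0.3, 0.5, 0.8]])
X = rng.normal(size=(n, p)) @ mixing.T

# Standardized data matrix Z and correlation matrix R.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# Eigenvalues/eigenvectors ordered from the highest to the lowest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lam, G = eigvals[order], eigvecs[:, order]

# Principal components: Y = Z G.
Y = Z @ G
mean_Y = Y.mean(axis=0)            # should be (numerically) zero
cov_Y = np.cov(Y, rowvar=False)    # should be diag(lambda_1, ..., lambda_p)

# Standardized PC's have variance 1.
Y_std = Y / np.sqrt(lam)

# Mahalanobis distance between the first two obs (computed on Z with R^-1)
# versus Euclidean distance between the same obs on the standardized PC's.
d = Z[0] - Z[1]
mahalanobis = np.sqrt(d @ np.linalg.inv(R) @ d)
euclidean_std_pc = np.linalg.norm(Y_std[0] - Y_std[1])

print("variances of the PC's:", np.diag(cov_Y))
print("eigenvalues of R:     ", lam)
print("Mahalanobis:", mahalanobis, "| Euclidean on standardized PC's:", euclidean_std_pc)
```

The two distances agree because $R^{-1} = G \Lambda^{-1} G'$, so $d'R^{-1}d$ is the squared length of $\Lambda^{-1/2} G'd$, which is the difference between the two obs on the standardized PC's.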
In higher dimensions the same holds: the first PC describes the most relevant tendency of the cloud of p-dimensional points, the second PC describes the second tendency in order of importance, and so on. Notice that here importance means variance along the axes of the ellipsoid (or ellipse in the 2-dimensional case).

Principal Components Analysis - Motivation: ?1 How can we describe relationships between vars? DONE: PC's are transformations of the original vars capturing and describing the most important relationships between them. PC's are extracted in order of importance, i.e., the 1st PC describes the most important tendency/relationship, the 2nd PC describes the next most important tendency, and so on. If we consider all the PC's, all the tendencies of the cloud of points are described. No relationship remains unexplained/not described. In this particular case, the transformation into PC's simply consists of a rotation of the original space so that the axes coincide with those of the ellipsoid and the relationships between variables are easier to describe/identify. If we consider r < p PC's we will lose information about the dispersion along the (p – r) axes of the ellipsoid having the lowest dispersions (hence being less informative). Remember that the variance along the j-th axis of the ellipsoid is the j-th eigenvalue. So the amount of lost information is: $\sum_{j=r+1}^{p} \lambda_j$.

Principal Components Analysis – Syntheses of original vars: Consider p (standardized) variables, and suppose we are interested in defining a synthesis of these vars.
Let us start with a simple problem. One synthesis: $Z_1, Z_2, \dots, Z_p \rightarrow Y_1$. Which kind of synthesis? For simplicity we consider linear combinations of the original vars: $Z_1, Z_2, \dots, Z_p \rightarrow Y_1 = a_{11} Z_1 + a_{12} Z_2 + \dots + a_{1p} Z_p$. Which kind of linear combination? We consider standardized linear combinations: $a_1'a_1 = \sum_{j=1}^{p} a_{1j}^2 = 1$. (This addresses ?2: Is it possible to synthesize vars, reducing dimension?)

Principal Components Analysis – Syntheses of original vars: If we consider our standardized data matrix, we have $y_{(1)} = Z a_1$, i.e., $y_{i1} = z_i'a_1$ for the i-th observation. Notice that by combining vars, we replace each p-dimensional row (1 measurement on each variable) with a row vector having only one column. We can expect that something is lost with this operation. We want to determine the "best" synthesis of the original vars, i.e., which vector $a_1$ we should select to obtain a standardized linear combination of the original variables that loses as little information as possible. What is the information we should preserve when going from p-dimensional points to 1-dimensional points?

Principal Components Analysis – Projections of OBS: Consider the original space, and one observation, $z_i$, in this space. Consider the vector $a_1$. This vector can be viewed as a one-dimensional sub-space of the original space. The linear combination $y_{i1} = z_i'a_1$ is the projection of $z_i$ onto $a_1$ (the projected point is $y_{i1} a_1$). The projection of a point onto a subspace is an approximation of the position of the point itself in the original space. The approximation error is the distance between the point and its projection. It is quite natural to select the standardized vector $a_1$ so that the distances between the original p-dimensional points and their 1-dimensional projections are minimized. Hence, we can consider the sum of the squared approximation errors to evaluate a synthesis.

Slide32: Consider two p-dimensional obs in the original space, $z_i$ and $z_h$. Consider the vector $a_1$ and the projections of these points onto it.
The distance between the two obs in the new space (1 dimension) is an approximation of the distance between the two obs in the original space (p dimensions). By reducing the dimension of the space from p to 1, we deform the distances between points: points which are relatively "far" in the original space may be "close" in the new one. Principal Components Analysis – Distances between OBS: It is quite natural to select the standardized vector $a_1$ so that the distances between the original p-dim points are preserved as well as possible in the 1-dim space where their projections lie. Notice that if points are well represented in the new space (distance from the origin, previous slide), distances between them are well reproduced too.

Slide33: Principal Components Analysis – Distances between OBS. SYNTHESIS OF VARIABLES: LOOK AT OBSERVATIONS. If we consider observations, the synthesis of the original variables should be characterized by low approximation errors, E. The approximation error for one observation is the distance between the point in the original p-dimensional space and its projection in the 1-dimensional space. Of course, we have to synthesize all the approximation errors: $E = \sum_{i=1}^{n} \lVert z_i - (z_i'a_1)\,a_1 \rVert^2$ SHOULD BE SMALL.

Slide34: Principal Components Analysis – Looking at the VARS. So far we have considered "what happens" to the obs when we synthesise the original vars, and described some characteristics of the observations space which should ideally be well reproduced in the new space. Now, let us consider the VARS. The synthesis we are going to define should also reproduce the information provided by each var. This means that the synthesis should reproduce as well as possible the variances of the original vars, or, in other words, the synthesis should be correlated as much as possible with the original vars. Remember that the determination coefficient between two vars (the squared correlation coefficient) measures the extent to which one variable can reproduce the variance of the other one.
Hence, for a generic original variable, $z_{(j)}$, our attention will be focused on: $r^2(z_{(j)}, Z a_1)$, the squared correlation between $z_{(j)}$ and the synthesis $Z a_1$. If this determination coefficient is high, the standardized linear combination adequately explains $z_{(j)}$.

Slide35: Principal Components Analysis – Distances between OBS. SYNTHESIS OF VARIABLES: LOOK AT THE VARIABLES. If we consider variables, the synthesis of the original variables should be characterized by high determination coefficients with all the variables. Of course, we have to synthesize all these coefficients: $\sum_{j=1}^{p} r^2(z_{(j)}, Z a_1)$ SHOULD BE HIGH.

Principal Components Analysis – Syntheses of original vars: We are interested in finding a synthesis characterized by low (synthesis of) approximation errors (observations) and high (synthesis of) determination coefficients (variables). It can be shown that the best standardized linear combination in both senses is the one having maximum variance (variance is information). Thus, we should choose $a_1$ (the weights to be assigned to the original variables in order to obtain the synthesis) so as to: maximize $\mathrm{Var}(Z a_1)$ subject to $a_1'a_1 = 1$.

Principal Components Analysis – Syntheses of original vars: It can be shown that: $\max \mathrm{Var}(Z a_1) = \lambda_1$, the first eigenvalue of R (remember that this is the highest eigenvalue of R), and $\mathrm{Var}(Z g_{(1)}) = \lambda_1$. Hence the vector maximizing the variance is the eigenvector corresponding to $\lambda_1$. Remember that $Z g_{(1)} = y_{(1)}$, the first principal component. Thus: the standardized linear combination with maximum variance is the first PC. Conclusion: the best way to synthesize (standardized) data is the first principal component.

Principal Components Analysis – More Syntheses: It can be shown that: $\max \mathrm{Var}(Z a_2) = \lambda_2$ (with $Z a_2$ not correlated with the first PC), the second eigenvalue of R, and $\mathrm{Var}(Z g_{(2)}) = \lambda_2$. The standardized linear combination we are looking for is $Z g_{(2)} = y_{(2)}$, the second principal component.
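As a numerical aside on the claim that the standardized linear combination with maximum variance is the first PC, here is a minimal sketch in Python/NumPy (an assumption; the slides use SAS). The data are synthetic and the helper name `variance_and_error` is hypothetical. The sketch compares the first eigenvector with 1000 randomly drawn standardized weight vectors, looking at both the variance of the synthesis and the sum of the squared approximation errors.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): n = 150 obs on p = 4 correlated vars.
n, p = 150, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

# eigh returns eigenvalues in ascending order: the last pair is (lambda_1, g(1)).
eigvals, eigvecs = np.linalg.eigh(R)
lam1, g1 = eigvals[-1], eigvecs[:, -1]

def variance_and_error(a):
    """Variance of the synthesis Z a and sum of squared approximation errors."""
    scores = Z @ a                    # projections of the obs onto a
    resid = Z - np.outer(scores, a)   # each point minus its projection
    return scores.var(ddof=1), (resid ** 2).sum()

var_g1, err_g1 = variance_and_error(g1)

# 1000 random standardized weight vectors (sum of squared weights = 1).
A = rng.normal(size=(1000, p))
A /= np.linalg.norm(A, axis=1, keepdims=True)
results = np.array([variance_and_error(a) for a in A])

print("Var(Z g1):", var_g1, "| lambda_1:", lam1)
print("largest variance among random syntheses:", results[:, 0].max())
print("error of first PC:", err_g1, "| smallest error among random syntheses:", results[:, 1].min())
```

For any standardized weight vector the two quantities are tied together: (n – 1) times the variance of the synthesis plus the sum of the squared approximation errors equals (n – 1)p, the total sum of squares of Z. Maximizing the variance is therefore the same as minimizing the approximation errors.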
Suppose now we are interested in finding another standardized linear combination, $Z a_2$, such that: $Z a_2$ is not correlated with the first PC; $Z a_2$ has maximum variance. This means that we are looking for a second synthesis which reduces as much as possible the approximation error characterizing the first principal component, i.e., the information we lose when substituting the first PC for the original vars.

Principal Components Analysis – More Syntheses: The first PC, PC1, is the standardized linear combination with maximum variance (here variance means information). This variance is $\lambda_1$. The second PC, PC2, is the standardized linear combination which is not correlated with PC1 and has maximum variance. This variance is $\lambda_2$. … The k-th PC, PCk, is the standardized linear combination which is not correlated with the first (k – 1) PC's and has maximum variance. This variance is $\lambda_k$. … The number of PC's we can extract is p, the number of original vars. The p PC's are standardized linear combinations extracted in decreasing order of importance (importance = variance).

Principal Components Analysis – More Syntheses: Example 1. Innovation and Research in Europe (consider the 2 vars in multivariate vectors and samples_2008.ppt: internet_access, e_gov_avail). [Scatter plot with the PC1 and PC2 axes.] Notice that if we consider as many PC's as there are original variables, p, the PC transformation only consists of a rotation of the original axes. Notice moreover that the distance of a point from the origin in the original space coincides with its distance from the origin in the space generated by the p PC's. Thus, if we consider all the PC's, the original space is perfectly reproduced. Moreover, remember that the eigenvalues (all of them) perfectly reproduce the total variance and the generalized variance characterizing the original space.
Thus the total and the generalized variance in the space generated by the p PC's coincide with those characterizing the original space.

Principal Components Analysis - Motivation: ?2 Is it possible to synthesize vars, reducing dimension? DONE: PC's are standardized linear combinations of the original vars characterized by the lowest approximation errors and the highest determination coefficients. PC's are extracted in order of importance. If we consider all the PC's, the p syntheses jointly reproduce the original space (the new space generated by uncorrelated variables is a rotation of the original space). No approximation error is incurred and all the original vars are perfectly reproduced (determination coefficients all equal to 1). If we consider r < p PC's we incur approximation error and we lose information about the variance of one or more original variables. It can be shown that the amount of lost information (as defined above) when considering only the first r PC's is related to the variances of the excluded PC's, which are the last (p – r) eigenvalues: $\lambda_{r+1} + \dots + \lambda_p$, where lost information = synthesis of squared approximation errors, unexplained portion of variance.

Principal Components Analysis – Subset of PC's: Suppose we consider only the first r principal components. The amount of lost information (variance along the (p – r) axes of the ellipsoid having the lowest dispersions, approximation errors, portion of unexplained variance) is: $\sum_{j=r+1}^{p} \lambda_j$. If the last (p – r) eigenvalues are small, the last (p – r) PC's may be disregarded without losing too much information. Remember: p eigenvalues/eigenvectors are needed to perfectly reproduce the correlation matrix R (sum of p matrices). If the last (p – r) eigenvalues are low, we can approximate R by referring to the first r matrices. IMPORTANT: we are considering syntheses of lost information.
It may happen that some observations/vars are much better explained than others.

Principal Components Analysis – Subset of PC's: To measure the informative content of the first r PC's we should then refer to the sum of their variances (the first r eigenvalues). The sum of all the eigenvalues coincides with tr(R) = p. Hence, to evaluate the first r PC's we consider: $\frac{\sum_{j=1}^{r} \lambda_j}{p}$, the proportion of total variance accounted for by the first r PC's. This index ranges between 0 and 1 (1 is reached when all the PC's or, better, all the PC's characterized by a variance, i.e., eigenvalue, greater than zero are taken into account). The relative importance of a given PC, say the k-th, is given by: $\lambda_k / p$.

Principal Components Analysis: a. How can we describe relationships between vars? b. Is it possible to synthesize vars, reducing dimension? USE r PRINCIPAL COMPONENTS. Some questions immediately arise:

How many PC's should we consider? (a. Which are the tendencies to consider/to ignore? b. How many syntheses should we take into account?)

What is the relationship between one PC and the original vars? (a. Which are the main tendencies? b. Which are the most relevant variables in determining these tendencies, and thus more relevant in the combination?)

How do the selected PC's reproduce the original variables? (a. Which are the correlated vars inducing the main tendencies? Which are the isolated vars? b. Are the considered PC's losing more information w.r.t. particular vars?)

How do the selected PC's describe the observations? (a. Which are the observations lying along, i.e., dominating/defining, the main tendencies? Which are the observations lying along the ignored directions? b. How are the selected PC's reproducing the obs? Are there obs characterized by relatively high approximation errors?)
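The proportion of total variance accounted for by the first r PC's, which bears on the question of how many PC's to consider, can be tabulated in a few lines. This is a minimal sketch in Python/NumPy (an assumption; the slides use SAS), and the 3 × 3 matrix is a hypothetical correlation matrix, not the Eurostat one.

```python
import numpy as np

# Illustrative 3 x 3 correlation matrix (hypothetical values, NOT the Eurostat example).
R = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order;
# reverse them so that lambda_1 >= ... >= lambda_p.
lam = np.linalg.eigvalsh(R)[::-1]
p = lam.size

share = lam / p                # relative importance of each PC: lambda_k / p
cum_share = np.cumsum(share)   # proportion of total variance of the first r PC's
lost = 1.0 - cum_share         # proportion of variance lost when keeping r PC's

for r in range(1, p + 1):
    print(f"r = {r}: explained = {cum_share[r - 1]:.3f}, lost = {lost[r - 1]:.3f}")
```

Dividing by p works only because tr(R) = p for a correlation matrix. With the var/cov matrix S one would divide by the sum of the eigenvalues instead.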