8323 Stats - Lesson 1 - 03 Multivariate Analysis


Multivariate Samples: 

Recall some very basic concepts of univariate and bivariate statistics. Describe multivariate samples. Analyze multivariate samples in a geometrical perspective. Describe distances in the Euclidean space.

The data we will consider: 

Example 1. Innovation and Research in Europe (Source: Eurostat).

Some basic concepts of Univariate and Bivariate statistics: 


Back to basics…. Considering one variable: 

Let us consider one variable of interest, say EPO. In statistics, a commonly used measure of position is the arithmetic (sample) mean, obtained by summing all the observed values and dividing the result by the number of observations. Here, Mean = 127.6987. [Plot: observed EPO values, with Netherlands and Spain highlighted.]

Back to basics…. Considering one variable: 

Variable of interest: EPO. The mean can be used to make a “prediction” about EPO for a generic country without any further information. To evaluate the reliability of the mean as a synthesis of the observed data, we can consider, for each observed value, the error incurred when substituting it with the sample mean (Mean = 127.6987). The TOTAL SUM OF SQUARES is the sum of the squared errors. [Plot: errors incurred when substituting the mean for the values observed for Netherlands and Spain, respectively.]

Back to basics…. Considering one variable: 

Variable of interest: EPO. A synthesis of the errors, and a measure of the reliability of the mean as a synthesis of the observed data, is the (sample) variance. It is (approximately) the average of the squared errors we incur when substituting the observed values with the sample mean, and it is obtained by dividing the Total SS by the number of observations minus 1. The variance of EPO turns out to be 12646.5814. The error we can expect to incur for a generic observation is the square root of the variance, which is called the standard deviation (here about 112.46).
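As a minimal sketch of the quantities just described (mean, Total Sum of Squares, sample variance, standard deviation), the computation can be written in a few lines of Python. The values below are hypothetical toy numbers chosen for easy arithmetic; they are not the Eurostat EPO data.

```python
import math

# Hypothetical toy values, NOT the Eurostat EPO data.
values = [2.0, 4.0, 4.0, 6.0, 9.0]

n = len(values)
mean = sum(values) / n                            # arithmetic (sample) mean
total_ss = sum((x - mean) ** 2 for x in values)   # Total Sum of Squares
variance = total_ss / (n - 1)                     # sample variance (divisor n - 1)
std_dev = math.sqrt(variance)                     # standard deviation

print(mean, total_ss, variance, std_dev)
```

The divisor n - 1 follows the slide's definition of the sample variance.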

Back to basics…. Considering one variable: 

Variable of interest: EPO. In statistics we are mainly concerned with the explanation of variance, i.e., we are interested in explaining why a phenomenon varies, and we look for predictive tools characterized by low prediction errors. So the question now is: can we do better than the mean? That is, can we use external information (other variables) related to EPO, and hence useful to predict the values of EPO with a lower error? In the following we will consider two supporting variables with different characteristics: Region (a categorical variable) and Internet_Access (a numerical variable). We will show how to evaluate the extent to which an external variable provides information about the variable of interest.

Back to basics…. Considering one variable: 

If we consider the region, can our prediction of EPO be better? General Mean = 127.6987. We can use the conditional means (the means of EPO within each region) rather than the general one. This is worthwhile only if the prediction error is considerably lower (it can be shown that, by construction, it is never higher). [Plot: values observed within the regions, with Netherlands and Spain highlighted.]

Back to basics…. Considering one variable: 

Consider the region to improve the prediction of EPO: use the conditional means. To evaluate the reliability of the conditional means as syntheses of the observed EPO data, we can consider the squared difference between each value and the proper conditional mean. The WITHIN SUM OF SQUARES of EPO given Region is the sum of the squared errors incurred when using the conditional means (by region) to predict EPO. [Plot: errors for Netherlands and Spain.]

Back to basics…. Considering one variable: 

Compare the general mean and the conditional means as predictors of EPO. If we use the region, our improvement as compared to the general mean is $R^2 = (\text{Total SS} - \text{Within SS}) / \text{Total SS}$, i.e., the share of the variance of EPO accounted for by Region. The R2 ranges from 0 to 1. It measures the ability of the categorical variable as a predictor of the numerical one.
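The comparison between the general mean and the conditional means can be sketched as follows. The group labels and values are hypothetical (they are not the regions or EPO values of the example), and the R2 is computed as the relative reduction of the sum of squared errors when moving from the general mean to the conditional means.

```python
# Hypothetical toy data: (group label, value) pairs, NOT the Eurostat data.
data = [("North", 8.0), ("North", 10.0), ("North", 12.0),
        ("South", 2.0), ("South", 4.0), ("South", 6.0)]

values = [v for _, v in data]
general_mean = sum(values) / len(values)
total_ss = sum((v - general_mean) ** 2 for v in values)

# Conditional means: one mean per group
groups = {}
for label, v in data:
    groups.setdefault(label, []).append(v)
cond_mean = {label: sum(vs) / len(vs) for label, vs in groups.items()}

# Within SS: squared errors when each value is predicted by its group mean
within_ss = sum((v - cond_mean[label]) ** 2 for label, v in data)

# Share of the variance accounted for by the grouping variable
r_squared = (total_ss - within_ss) / total_ss
print(total_ss, within_ss, r_squared)
```

With these toy numbers the Total SS is 70 and the Within SS is 16, so the grouping accounts for 54/70 of the variance.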

Back to basics…. Considering one variable: 

If we consider Internet_Access, can our prediction of EPO be better? When considering numerical variables, we are interested in evaluating the existence of a linear association between them. To evaluate whether a linear relationship exists, and to determine its direction, we refer to the sample covariance (an absolute measure of linear association).

Back to basics…. Considering one variable: 

If we consider Internet_Access, can our prediction of EPO be better? The covariance between the two variables is Cov(EPO, Int_Acc) = 1868.5152. This measure only indicates that a linear relationship exists and that it is direct (an inspection of the scatter plot confirms this). Nevertheless, the value of the covariance depends upon the units of measurement of the considered variables. A relative measure of linear association is the correlation coefficient, which ranges from -1 to +1. Values close to +1 indicate a strong direct linear association, values close to -1 denote a strong inverse linear association, and values close to zero indicate no linear relationship. Here we have Corr(EPO, Int_Acc) = 0.8527 (strong association).
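The difference between the absolute measure (covariance) and the relative one (correlation) can be illustrated with a small sketch. The data are hypothetical toy values, the helper names are illustrative, and the divisor n - 1 is an assumption in line with the sample variance used earlier. Rescaling one variable changes the covariance but leaves the correlation unchanged.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with divisor n - 1 (assumed convention)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical toy values, NOT the Eurostat data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_rescaled = [100 * v for v in y]   # same variable in a different unit

print(covariance(x, y), covariance(x, y_rescaled))     # depends on the unit
print(correlation(x, y), correlation(x, y_rescaled))   # does not
```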

Back to basics…. Considering one variable: 

If we consider Internet_Access, can our prediction of EPO be better? The high value of the correlation tells us that the observations tend to cluster around a line with a positive slope. This line, shown in the scatterplot, is called the regression line. Its analytical expression can be easily determined: EPO = –60.018 + 4.5934*Int_Acc.

Back to basics…. Considering one variable: 

Consider Internet_Access to improve the prediction of EPO: use the regression line, EPO = –60.018 + 4.5934*Int_Acc. For each observation we can calculate the difference between the observed EPO value and the value predicted using the regression line. In the plot, the error is shown for Spain. The RESIDUAL (ERROR) SUM OF SQUARES of EPO given Int_Acc is the sum of the squared errors incurred when using the line to predict EPO. (The model sum of squares is the part of the Total SS that the line accounts for, i.e., the Total SS minus the residual sum of squares.)

Back to basics…. Considering one variable: 

Compare the general mean and the regression line as predictors of EPO. Notice that we have a considerable decrease in the prediction errors.

Back to basics…. Considering one variable: 

If we consider Internet_Access, can our prediction of EPO be better? If we use the line (a function of Int_Acc), our improvement as compared to the general mean is $R^2 = (\text{Total SS} - \text{Error SS}) / \text{Total SS}$, where Error SS is the sum of the squared errors incurred when using the line; this is the share of the variance of EPO accounted for by Int_Acc. The R2 index ranges from 0 to 1 and measures the ability of the numerical variable to predict the other one. It can be shown that the index coincides with the squared correlation coefficient (here 0.8527² ≈ 0.727). Hence the correlation measures the extent of the linear association, whereas its square measures the share of the variance of one variable that can be explained by the other (numerical) variable.
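The link between the regression line, the sum of squared errors around it, and the squared correlation can be checked numerically. The sketch below assumes the usual least-squares formulas for slope and intercept and uses hypothetical toy values, not the EPO and Internet_Access data.

```python
# Hypothetical toy values, NOT the Eurostat data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
total_ss = sum((b - my) ** 2 for b in y)      # errors around the general mean

# Least-squares regression line: y = intercept + slope * x
slope = sxy / sxx
intercept = my - slope * mx

# Sum of squared errors around the line
error_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

r_squared = (total_ss - error_ss) / total_ss  # relative reduction of the errors
r = sxy / (sxx * total_ss) ** 0.5             # correlation coefficient

print(slope, intercept, r_squared, r ** 2)
```

For these toy values both `r_squared` and `r ** 2` come out at about 0.6, in line with the statement that the R2 index coincides with the squared correlation.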

Data Matrices(Numerical variables only): 


Data matrices : 

Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few variables and a few observations. The country variable is useful to identify the statistical units, but it is not an object of analysis. At the moment we consider only numerical variables. For each observation we have information collected on p variables; for each variable we have information collected on n observations. The data matrix contains the information available for the n cases (rows) on the p variables (columns). Here we have 15 rows (cases, n) and 7 columns (variables, p).

Data matrices : 

Example 1 (continued). Innovation and Research in Europe (subset). To each observation a collection of p values is associated: the realizations observed for each variable on the considered observation. Similarly, to each variable a collection of n values can be associated (the values observed for all the cases). A collection of k values is usually called a vector. To avoid confusion, we will only consider column vectors, with dimension (k × 1), i.e., collections of values arranged in k rows and 1 column. A row (1 × k) vector can always be seen as the transpose of a column (k × 1) vector.

Data matrices : 

Data matrix (n individuals and p variables): $X = [x_{ij}]$, with $i = 1, \dots, n$ and $j = 1, \dots, p$. $x_i$ = vector (p × 1) containing the measurements on the p variables for the i-th case. $x_{(j)}$ = vector (n × 1) containing the n measurements on the j-th variable. Using the transposition operation, a data matrix can be seen as a collection of n row (transposed) vectors $x_i^{\top}$ (cases) and/or as a collection of p column vectors $x_{(j)}$ (variables).

Data matrices : 

Example 1 (continued). Innovation and Research in Europe (subset). In the table: the row vector associated with “Belgium” (measurements on 7 variables) and the column vector associated with EPO (measurements on 15 observations). The element in the i-th row and j-th column, $x_{ij}$, is the value observed for the i-th case on the j-th variable. In this simple example, $x_{13,6}$ is the value of EPO (6th variable) for Belgium (13th observation).
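The row/column reading of a data matrix can be sketched with a plain list of lists. The matrix below is a hypothetical 3 × 2 toy example (not the 15 × 7 Eurostat subset), and the helper names are illustrative; indices are 1-based to match the notation of the slides.

```python
# Hypothetical 3 x 2 data matrix: rows are cases, columns are variables.
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 30.0]]

n, p = len(X), len(X[0])   # n cases, p variables

def row_vector(X, i):
    """Measurements on the p variables for the i-th case (1-based)."""
    return X[i - 1]

def column_vector(X, j):
    """The n measurements on the j-th variable (1-based)."""
    return [row[j - 1] for row in X]

def element(X, i, j):
    """Value observed for the i-th case on the j-th variable (1-based)."""
    return X[i - 1][j - 1]

print(row_vector(X, 2), column_vector(X, 2), element(X, 3, 1))
```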

Data matrices – Vectors : 

A (k × 1) vector can be seen as an oriented line in a k-dimensional space. [Figures: a one-dimensional vector (scalar), with component v1; a two-dimensional vector, with components v1 and v2; a three-dimensional vector, with components v1, v2 and v3.] Vectors of higher dimension cannot be represented in this way.

Data matrices – Vectors (length): 

For a given vector v in the k-dimensional space, we define its length as $\|v\| = \sqrt{v_1^2 + v_2^2 + \dots + v_k^2}$. It is the length of the line connecting v to the origin, 0. [Figures: the length of a vector in one, two and three dimensions.]

Data matrices – Vectors (Distance) : 

Given two vectors, v and u, in the k-dimensional space, we define the Euclidean distance between v and u as the length of the line connecting v to u: $d(v, u) = \sqrt{(v_1 - u_1)^2 + \dots + (v_k - u_k)^2}$. Example in the two-dimensional space: the line connecting v to u is the hypotenuse of a right triangle whose other sides have lengths |v1 – u1| and |v2 – u2|. Note: the length of a vector v coincides with its distance from the origin, 0.
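A short sketch of the two definitions (length of a vector and Euclidean distance between two vectors), with hypothetical vectors chosen so that the results are whole numbers:

```python
import math

def length(v):
    """Length of a vector: square root of the sum of its squared components."""
    return math.sqrt(sum(c * c for c in v))

def euclidean_distance(v, u):
    """Length of the line connecting v to u."""
    return length([a - b for a, b in zip(v, u)])

# Hypothetical two-dimensional vectors.
v = [3.0, 4.0]
u = [6.0, 8.0]
origin = [0.0, 0.0]

print(length(v))                       # 5.0
print(euclidean_distance(v, u))        # 5.0
print(euclidean_distance(v, origin))   # 5.0, same as the length of v
```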

Analyze multivariate samples in a geometrical perspective. Describe distances in the Euclidean space: 

Data matrices : 

A data matrix can be seen as a collection of two kinds of vectors. Row vectors: the $x_i$ lie in the p-dimensional space. Column vectors: the $x_{(j)}$ lie in the n-dimensional space. Hence two spaces can be considered to analyze/describe a data matrix. Of course, these spaces are related to each other. For the sake of simplicity, we will analyze in depth only the space of the observations.

Syntheses of variables: 

How to arrange syntheses of p variables, i.e., how to synthesize the elements of the column vectors? The position. The sample mean (unbiased estimator of the population mean) for the j-th variable (column) is $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$. The vector of the sample means (centroid) is $\bar{x} = (\bar{x}_1, \dots, \bar{x}_p)^{\top}$. It may be seen as the vector associated with the artificial case “mean”: an unobserved case which is average with respect to all the variables. Remember: the mean is not robust (it is sensitive to extreme values).

The space of the observations: 

Consider a graphical representation we are used to: the 2-dimensional space, with the mean of E_gov_indiv and the mean of Internet_Acc marked on the axes. The centroid (the vector whose elements are the sample means) is the centre of gravity of the cloud. It is the point that minimizes the sum of the squared distances from all the points. Note: axes adjusted to have the same scale.

Synthesis of variables: 

The dispersion around the mean. The sample variance (unbiased estimator of the population variance) for the j-th variable (column) is $s_{jj} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$. Notice that it is (approximately) the average of the squared distances between the observed values and the sample mean, i.e., the average of the squared errors we incur when substituting the observed values with the sample mean. The sample standard deviation for the j-th variable (column) is $\sqrt{s_{jj}}$. The Std.Dev. has the same unit of measurement as the variable taken into account. It measures the expected error (below or above the mean) we incur when substituting the mean for a generic case; it can be regarded as the typical distance between a generic value and the mean. Being based upon averages, both the variance and the standard deviation are not robust (they are sensitive to extreme values).

Slide30: 

The space of the observations. Consider again the 2-dimensional space, and the distance from Iceland (IS) to the centroid. Its two components are the absolute difference between the Iceland E_gov_Indiv value and the mean of E_gov_Indiv, and the absolute difference between the Iceland Internet_Acc value and the mean of Internet_Acc. Note: axes adjusted to have the same scale.

Slide31: 

The space of the observations. Consider, in the 2-dimensional space, ALL THE DISTANCES FROM THE POINTS TO THE CENTROID. Var(E_gov_indiv) + Var(Internet_Acc), the SUM of the variances of THE TWO VARIABLES, is proportional to the sum of the squared distances from the observations to the centroid. Note: axes adjusted to have the same scale.
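The proportionality between the sum of the variances and the sum of the squared distances to the centroid can be verified on a few points. The sketch below uses hypothetical 2-dimensional points (not the E_gov_indiv and Internet_Acc values); with sample variances computed with divisor n - 1, the proportionality constant is n - 1.

```python
# Hypothetical 2-D points (x, y), NOT the Eurostat data.
points = [(1.0, 2.0), (3.0, 6.0), (5.0, 4.0)]
n = len(points)

# Centroid: the vector of the two sample means
centroid = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Sum of the squared Euclidean distances from each point to the centroid
sum_sq_dist = sum((p[0] - centroid[0]) ** 2 + (p[1] - centroid[1]) ** 2
                  for p in points)

# Sample variances of the two variables (divisor n - 1)
var_x = sum((p[0] - centroid[0]) ** 2 for p in points) / (n - 1)
var_y = sum((p[1] - centroid[1]) ** 2 for p in points) / (n - 1)

# The two printed quantities coincide
print(sum_sq_dist, (n - 1) * (var_x + var_y))
```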

Synthesis of association between vars: 

The linear association. The sample covariance for the j-th and the h-th variables (columns) is $s_{jh} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ih} - \bar{x}_h)$ (absolute measure of linear association). The sample correlation coefficient for the j-th and the h-th variables is $r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\,s_{hh}}}$ (relative measure of linear association; it ranges from -1 to +1). Remember: being based upon averages, the correlation coefficient is not robust (it is sensitive to extreme values).

Slide33: 

The space of the observations. Consider again the 2-dimensional space. Since the covariance and the correlation actually measure the concentration of the points around a line, both indices give us information about the ORIENTATION of the scatter. Note: axes adjusted to have the same scale.

Slide34: 

Variance and Covariance Matrix. Variances and covariances are arranged in the so-called variance and covariance matrix, $S = [s_{jh}]$, of dimension (p × p). S is a square matrix (the number of rows equals the number of columns). The diagonal elements of S, $s_{jj}$, are the variances (notice that the variance can be regarded as the covariance between one variable and itself). The off-diagonal elements of S, $s_{jh}$, are the covariances. Since $s_{jh} = s_{hj}$, S is a symmetric matrix.

Slide35: 

Correlation Matrix. Correlations are arranged in the correlation matrix, $R = [r_{jh}]$. R is also a square matrix, and its diagonal elements are 1's (the correlation between one variable and itself is 1). Its off-diagonal elements, $r_{jh}$, are the correlations and, of course, R is a symmetric matrix. Due to the relationship between covariances and correlations, $r_{jh} = s_{jh} / \sqrt{s_{jj}\,s_{hh}}$, R can be simply obtained from the variance and covariance matrix.
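As a minimal sketch, the variance and covariance matrix S and the correlation matrix R can be built from a data matrix stored as a list of rows. The function names and the toy data are illustrative assumptions; the divisor n - 1 follows the definition of the sample variance.

```python
import math

def cov_matrix(X):
    """Variance and covariance matrix S of a data matrix (list of rows)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[sum((row[j] - means[j]) * (row[h] - means[h]) for row in X) / (n - 1)
             for h in range(p)]
            for j in range(p)]

def corr_matrix(S):
    """Correlation matrix R: each covariance divided by the two std. devs."""
    p = len(S)
    return [[S[j][h] / math.sqrt(S[j][j] * S[h][h]) for h in range(p)]
            for j in range(p)]

# Hypothetical 5 x 2 data matrix, NOT the Eurostat data.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 4.0], [5.0, 5.0]]

S = cov_matrix(X)
R = corr_matrix(S)
print(S)   # symmetric, variances on the diagonal
print(R)   # symmetric, ones on the diagonal
```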

Slide36: 

The space of the observations. The centroid (the vector whose elements are the sample means) is the centre of gravity of the p-dimensional cloud. The elements of the variance and covariance matrix give us information about the dispersion around the centroid (remember the 2-dimensional example) and about the orientation of the cloud.

Slide37: 

Measuring dispersion. How to synthesize the dispersion of the n cases in the p-dimensional space? Two proposals. TOTAL VARIANCE. As we saw before, the sum of all the variances is proportional to the sum of the squared distances from the points to the centroid. Thus, a first method to evaluate the dispersion of the points in the p-dimensional space is the so-called Total Variance. The Total Variance is the sum of the diagonal elements of the var/cov matrix, S. The sum of the diagonal elements of a square matrix is defined to be its trace. Hence, we have: Total Variance = $\mathrm{tr}(S) = \sum_{j=1}^{p} s_{jj}$. Notice that we are not taking into account the interrelationships between the variables, i.e., the orientation of the cloud.

Slide38: 

The space of the observations. To motivate the second measure of multivariate dispersion, consider the “portion” of the space which is occupied by the data (the area of the ellipse). We will come back to this concept later, but we can intuitively understand that the area of the ellipse (in a higher-dimensional space, the volume of an ellipsoid) is somehow related to the variances and to the covariances, i.e., to all the entries of the var/cov matrix, S.

Slide39: 

Measuring dispersion. THE GENERALIZED VARIANCE. The volume of the ellipsoid containing the points in the p-dimensional space can be shown to be related to a particular synthesis of the elements of S, the so-called determinant of S, |S|. The determinant is a number which can be calculated for a square matrix. It equals zero if two columns of the matrix are proportional, i.e., if they carry the same information. This measure is called the Generalized Variance: Generalized Variance = det(S) = |S|. Hence, to synthesize the dispersion of points in a p-dimensional space, two measures can be used, both related to the elements of the variance and covariance matrix, S. The Total Variance takes into account only the diagonal elements of S, whilst the Generalized Variance is calculated by referring to all the elements of S.
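The two measures of multivariate dispersion can be compared on small matrices. The sketch below is restricted to the 2 × 2 case, where the determinant has a simple closed form; the matrices are hypothetical, and the function names are illustrative.

```python
def total_variance(S):
    """Trace of S: the sum of the variances on the diagonal."""
    return sum(S[j][j] for j in range(len(S)))

def generalized_variance_2x2(S):
    """Determinant of a 2 x 2 variance and covariance matrix."""
    return S[0][0] * S[1][1] - S[0][1] * S[1][0]

# Hypothetical var/cov matrices with the same variances.
S_correlated = [[2.5, 1.5], [1.5, 1.5]]
S_uncorrelated = [[2.5, 0.0], [0.0, 1.5]]
# Second column proportional to the first: the variables share all information.
S_degenerate = [[1.0, 2.0], [2.0, 4.0]]

print(total_variance(S_correlated), total_variance(S_uncorrelated))   # equal
print(generalized_variance_2x2(S_correlated),
      generalized_variance_2x2(S_uncorrelated))                       # different
print(generalized_variance_2x2(S_degenerate))                         # zero
```

The first two matrices have the same Total Variance but different Generalized Variances, since only the determinant takes the covariance into account.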

Slide40: 

The space of the observations. The variance and covariance matrix contains relevant information to describe the points in a p-dimensional space and also information about their distances. We now consider different measures of distance between cases in the p-dimensional space, related to particular transformations of the original variables. Notice first that if the variables are centred on their means, nothing changes as concerns the dispersion of the points: this operation only consists in a change of the origin.

Slide41: 

Multivariate Samples - Transformations. TRANSFORMATION: VARIABLES CENTRED ON THEIR MEANS. From the original data matrix to the centred data matrix. The centred matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself: from all the observations in a given column, say the j-th, the mean of the j-th variable is subtracted, i.e., each $x_{ij}$ is replaced by $x_{ij} - \bar{x}_j$. For the centred data: Centroid = Origin = 0; Var/Cov Matrix: S; Corr Matrix: R.

Slide42: 

A closer look at the distance. The Euclidean distance of a point from the origin is the length of the line connecting the point to the origin. Consider, in the plot of the centred variables, Cyprus and Italy: their distance from the origin, 0, is (almost) the same. This similar distance is due to different combinations of x- and y-deviations from 0. Should the x- and y-deviations be evaluated in the same manner? Notice that the distance of Slovakia from the origin is higher. We will consider this later.

Slide43: 

A closer look at the distance. Remember: the standard deviation of a variable is the typical deviation from the mean. Here Std.Dev.(E_gov_Avail) = 15 and Std.Dev.(Int_Acc) = 21.31. To compare the deviations from the origin adequately (the data are centred), we should take into account the Std.Dev.'s (of course, squared deviations should be compared with variances). Internet_Acc has a higher Std.Dev. Hence, a deviation D from the origin along the horizontal axis should “count less” than a deviation D from the origin along the vertical axis.

Slide44: 

A closer look at the distance. In the Euclidean distance, the deviations are considered in absolute terms. When we are considering variables with different Std.Dev.'s, we should consider relative deviations. To remove the effect of the Std.Dev.'s, thus obtaining comparable deviations, we have to standardize the variables. Standardization of the j-th variable: $z_{ij} = (x_{ij} - \bar{x}_j) / \sqrt{s_{jj}}$. The Euclidean distance between two standardized observations is $d(z_i, z_h) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{hj})^2 / s_{jj}}$. Statistical Distance: a different weight, $1/s_{jj}$, is assigned to the squared deviation of each variable in the calculation of the distance. The statistical distance is proportional to the Euclidean one only if the variances are all equal.
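The equivalence between the statistical distance in the original space and the Euclidean distance between standardized observations can be sketched as follows. The means, variances, and points are hypothetical values (not those of the plotted countries), and the function names are illustrative.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def statistical_distance(x, y, variances):
    """Each squared deviation is weighted by 1 / variance of its variable."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(x, y, variances)))

def standardize(x, means, variances):
    """Subtract the mean and divide by the standard deviation."""
    return [(a - m) / math.sqrt(s) for a, m, s in zip(x, means, variances)]

# Hypothetical centred variables: the first one is more dispersed.
means = [0.0, 0.0]
variances = [4.0, 1.0]
a = [2.0, 0.0]   # deviation along the more dispersed axis
b = [0.0, 1.0]   # smaller deviation along the less dispersed axis

print(euclidean(a, means), euclidean(b, means))    # 2.0 and 1.0
print(statistical_distance(a, means, variances),
      statistical_distance(b, means, variances))   # 1.0 and 1.0
```

The point `a` is twice as far from the origin as `b` in Euclidean terms, but the two are equally distant in statistical terms, because the deviation of `a` lies along the more dispersed axis.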

Slide45: 

A closer look at the distance. The statistical distance (visualization in the original/centred space). The x-deviations are penalized less than the y-deviations, since the x-axis is characterized by a higher dispersion. Hence Cyprus, which shows a higher y-deviation from the origin as compared to Italy, is characterized by a statistical distance from the origin which is higher than that of Italy. Notice that Slovakia now has a statistical distance from 0 which is similar to that of Cyprus. [Plot: curves connecting the points that have the same statistical distance from the origin.]

Slide46: 

Multivariate Samples - Transformations. TRANSFORMATION: STANDARDIZED VARIABLES. From the original data matrix to the standardized data matrix. The standardized matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself and by dividing this difference by the Std.Dev. The centred variables have null means; the standardized variables also have variances all equal to 1 (the unit of measurement is removed). Since Variance = Std.Dev. = 1 for each variable, the covariances coincide with the correlations (Corr = Cov / product of the Std.Dev.'s). For the standardized data: Centroid = Origin = 0; Var/Cov Matrix: R; Corr Matrix: R.

Slide47: 

A closer look at the distance. Euclidean distance in the standardized space. The standardization makes all the differences comparable, so now the Euclidean distance coincides with the statistical distance calculated in the original space. Notice that the cloud still has an orientation. [Panels: Euclidean distance in the original space; statistical distance in the original space; Euclidean distance in the standardized space.]

Slide48: 

A closer look at the distance. In the statistical distance, deviations are adjusted by taking into account the dispersions of the variables, but no attention is paid to the “coherence” between each point and the cloud of points (standardization does not involve correlations). Slovakia and Cyprus are equally statistically distant from the origin. Notice that Lithuania is more statistically distant from the origin. Consider the orientation of the cloud: the line connecting Lithuania to 0 has the same direction as the cloud. This is less true for Slovakia. The line connecting Cyprus to the origin goes against the tendency of the cloud.

Slide49: 

A closer look at the distance. In the statistical distance, the coherence with the orientation of the cloud is not considered. A transformation of the data which removes the effect of the Std.Dev.'s, and also penalizes deviations by considering the orientation of the cloud of points, is the so-called Mahalanobis transformation. The Mahalanobis transformation is a particular linear combination of the considered variables; we do not enter into details here. The so-called Mahalanobis distance is defined as the Euclidean distance calculated on the Mahalanobis-transformed observations.

Slide50: 

Multivariate Samples - Transformations. TRANSFORMATION: MAHALANOBIS. From the original data matrix to the Mahalanobis data matrix. The Mahalanobis distance is the Euclidean distance evaluated after transforming the data according to the Mahalanobis transformation. The variables transformed according to the Mahalanobis transformation have null means, variances all equal to 1 (the unit of measurement is removed), and null correlations (the orientation of the cloud is removed). For the transformed data: Centroid = Origin = 0; Var/Cov Matrix: I (the identity matrix); Corr Matrix: I.

Slide51: 

A closer look at the distance. Mahalanobis Distance: deviations from the origin are adjusted by taking into account both the dispersions of the variables and their correlations (orientation). Now Cyprus, which goes against the orientation of the cloud, is characterized by a Mahalanobis distance from 0 which is higher than that of Slovakia. Notice that Lithuania has a Mahalanobis distance from 0 similar to that of Slovakia. [Plot: curves connecting the points that have the same Mahalanobis distance from the origin.]
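The slides do not give the Mahalanobis transformation explicitly. As a hedged illustration, the sketch below uses the standard quadratic-form expression of the Mahalanobis distance, the square root of (x - c)' S^-1 (x - c), which is an assumption brought in from the general literature rather than a formula taken from this presentation. It is restricted to two variables, so the inverse of S can be written by hand; the matrix and the points are hypothetical.

```python
import math

def mahalanobis_2d(x, centre, S):
    """Mahalanobis distance of x from centre, for a 2 x 2 var/cov matrix S.

    Uses the standard quadratic form sqrt((x - c)' S^-1 (x - c));
    this formula is an assumption, not taken from the slides.
    """
    (a, b), (c, d) = S
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]   # inverse of a 2 x 2 matrix
    dx = [x[0] - centre[0], x[1] - centre[1]]
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Hypothetical var/cov matrix: equal variances, positive covariance.
S = [[2.0, 1.0], [1.0, 2.0]]
origin = [0.0, 0.0]
along = [1.0, 1.0]      # deviation in the direction of the cloud
against = [1.0, -1.0]   # deviation against the orientation of the cloud

print(mahalanobis_2d(along, origin, S))
print(mahalanobis_2d(against, origin, S))
```

The two points have the same Euclidean distance from the origin and, since the variances are equal, the same statistical distance; only the Mahalanobis distance tells them apart, giving the larger value to the point that goes against the orientation of the cloud.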

Slide52: 

A closer look at the distance. Euclidean distance in the Mahalanobis space. By removing both dispersion and correlation, differences are comparable also with respect to their orientation, so now the Euclidean distance coincides with the Mahalanobis distance calculated in the original space. Notice that the cloud has no orientation. [Panels: Euclidean distance (original space); statistical distance (original space); Mahalanobis distance (original space); Euclidean distance in the Mahalanobis space.]

Slide53: 

Multivariate samples - Transformations. Conclusion: by transforming the data via standardization or via the Mahalanobis transformation, we are simply defining a new space such that the Euclidean distance calculated on the transformed points coincides, respectively, with the statistical distance (standardization: deviations are evaluated differently depending on their Std.Dev.'s) or with the Mahalanobis distance (Mahalanobis transformation: deviations are evaluated differently depending on the Std.Dev.'s and on the orientation of the cloud, i.e., on correlations/covariances). So far the latter transformation has not been explicitly defined, due to its analytical complexity, but we will see later how to obtain Mahalanobis-transformed data.