Slide1: Principal Components Analysis: Theory and Practice
Choose the number of PC’s
Evaluate the PC’s with respect to vars and obs
Obtain and analyze the PC’s plots. Comment on the position of the obs, possibly also taking into account categorical grouping variables.
Discuss the influence of outliers on PC’s
Slide2: Example 1. Innovation and Research in Europe (Source: Eurostat)
The data we will consider: information on innovation and research (we limit attention to a subset of vars).
Slide3: PC’s, eigenvectors and eigenvalues Since we are considering 14 variables, we can extract p=14 PC’s.
The PC’s are extracted in a decreasing order of importance, as measured by their variance. Remember: the variance of the k-th PC coincides with the k-th eigenvalue.
In the k-th row of this table we find the k-th eigenvalue, the difference between it and the next eigenvalue, and the proportion of total variance explained by the k-th PC (the eigenvalue divided by tr(R) = p). Notice that the sum of the eigenvalues coincides with p. Moreover, in the last column we find the cumulative proportion of total variance explained by the first k PC’s. Of course, the p PC’s together explain 100% of the total variance, i.e., of the trace of the correlation matrix.
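The columns of the eigenvalue table can be reproduced with a few lines of NumPy. This is only a sketch on synthetic data (the matrix `X` below is hypothetical and is not the Eurostat dataset); it assumes the PC’s are extracted from the correlation matrix, as in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations, 4 correlated variables (NOT the Eurostat data).
latent = rng.normal(size=(50, 1))
X = latent @ np.array([[1.0, 0.8, 0.5, 0.1]]) + rng.normal(size=(50, 4))

R = np.corrcoef(X, rowvar=False)            # correlation matrix of the variables
eigenvalues = np.linalg.eigvalsh(R)[::-1]   # eigvalsh returns ascending order; reverse it
p = R.shape[0]

# Difference between each eigenvalue and the next one (undefined for the last).
difference = np.append(eigenvalues[:-1] - eigenvalues[1:], np.nan)
proportion = eigenvalues / p                # tr(R) = p for a correlation matrix
cumulative = np.cumsum(proportion)

print("k  eigenvalue  difference  proportion  cumulative")
for k in range(p):
    print(f"{k + 1}  {eigenvalues[k]:10.4f}  {difference[k]:10.4f}  "
          f"{proportion[k]:10.4f}  {cumulative[k]:10.4f}")
```

The eigenvalues sum to p and the last cumulative proportion is 1, matching the remarks above.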
Slide4: PC’s, eigenvectors and eigenvalues How many PC’s should we consider? By choosing the number of PC’s we are deciding which tendencies we will ignore, and how many syntheses of the original vars we will not take into account.
We can approach this problem from two points of view:
FIRST : We can try to define “what” (direction/synthesis) is important and should hence be taken into account
Remember that we are working with standardized variables. Hence, all the variances (information) of the original vars are equal to 1.
One idea is to take into account:
The axes characterized by a dispersion higher than 1, (1 is the dispersion characterizing the original axes we are actually substituting)
The syntheses characterized by an amount of information greater than that characterizing the original vars.
This means that attention should be limited to PC’s characterized by variance (eigenvalue) greater than 1. We will then consider a number of PC’s coinciding with the number of eigenvalues greater than 1.
Slide5: PC’s, eigenvectors and eigenvalues Eigenvalues > 1.
This is only a general rule. Our choice may be guided by other considerations:
The amount of variance explained by the PC’s.
For example, we may be interested in retaining a set of PC’s explaining at least 90% of the total variance.
The performance of the PC’s with respect to all the variables (see later, slide 10).
For example, we may be interested in retaining a set of PC’s explaining all the original vars in a satisfactory way (at least, say, 60-70% of their variance). In this case, notice that:
1) According to the Eigenvalue > 1 rule, we should select 3 PC’s.
2) The 3 PC’s together explain 72% of the total variance.
3) The 4° PC has a variance very close to 1.
4) We could be interested in a higher % of explained variance, and we could also be interested in evaluating whether all the variables are well explained.
It is better to consider a larger number of PC’s and to evaluate them before choosing.
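The two selection rules discussed in these slides can give different answers. The sketch below applies both to a set of hypothetical eigenvalues (made up for illustration, chosen to sum to p = 6; they are not the eigenvalues of the innovation data).

```python
import numpy as np

# Hypothetical eigenvalues of a 6 x 6 correlation matrix (they sum to p = 6).
eigenvalues = np.array([2.70, 1.40, 1.05, 0.50, 0.25, 0.10])
p = eigenvalues.size

# Rule 1: keep the PC's whose variance (eigenvalue) exceeds 1.
n_eigen_rule = int(np.sum(eigenvalues > 1.0))

# Rule 2: keep the smallest number of PC's explaining at least 90% of total variance.
cumulative = np.cumsum(eigenvalues) / p
n_variance_rule = int(np.argmax(cumulative >= 0.90)) + 1

print(n_eigen_rule, n_variance_rule)  # 3 4
```

Here the third eigenvalue is just above 1 and the first 3 PC’s explain about 86% of the total variance, so the 90% criterion asks for one more PC than the eigenvalue rule.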
Slide6: PC’s, eigenvectors and eigenvalues We are considering p = 14 vars; hence we can extract p = 14 PC’s and p = 14 eigenvectors. Here we limit attention to the first 6 eigenvectors (those related to the first 6 PC’s). Eigenvectors provide information about the direction of the (6) principal axes of the ellipsoid. They also give the weight of each variable in the standardized linear combinations defining the (first 6) PC’s.
BUT: while indicating the directions of the most relevant tendencies, the eigenvectors do not provide information about the relevance of the tendencies themselves. This means that the eigenvectors describe the new axes but not the dispersion (importance, information) characterizing them. To have a clear description of the meaning of the PC’s we have to consider the correlations between them and the original vars. The correlations also provide important information about the explanatory power of the PC’s.
Slide7: Correlations between the PC’s and the original vars REMEMBER (FOR STANDARDIZED VARS): The elements (rows) of the k-th eigenvector (column vector) are the weights assigned to the original vars. For example, gjk is the weight assigned to the j-th original variable to obtain the k-th principal component.
As we saw before, this weight says nothing about the relevance of the j-th var in the definition of the k-th PC, nor about the relevance of the k-th PC in the reconstruction of the j-th variable.
This information is given by the correlation between the j-th var, z(j) and the k-th PC, y(k).
Slide8: Correlations between the PC’s and the original vars The correlation between the j-th original variable and the k-th PC is given by the j-th element of the k-th eigenvector (direction of the axis, weight in the lin. comb.) MULTIPLIED by the square root of the k-th eigenvalue (std dev of the k-th PC), i.e. by the importance of the k-th PC in the reconstruction of the original vars:
$r(z_{(j)}, y_{(k)}) = g_{jk} \sqrt{\lambda_k}$, where $\lambda_k$ denotes the k-th eigenvalue.
(Tables: elements of the k-th eigenvector; correlations.)
The squared correlation coefficient is the proportion of the variance of the j-th variable, z(j) , accounted for by the k-th PC, y(k). These quantities can be used to evaluate the explanatory power of PC’s with respect to each variable
The correlations between vars and PC’s give us information about the relationship between the considered variables.
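The relation between eigenvector elements and correlations can be checked numerically. The following is a minimal sketch on synthetic standardized data (all names and data are illustrative assumptions): it computes the correlations as eigenvector element times square root of the eigenvalue and compares them with the correlations computed directly from the PC scores.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 60 observations, 4 variables driven by one latent factor.
latent = rng.normal(size=(60, 1))
X = latent @ np.array([[1.0, 0.7, -0.6, 0.2]]) + rng.normal(size=(60, 4))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
R = np.corrcoef(Z, rowvar=False)

eigenvalues, G = np.linalg.eigh(R)                 # eigh returns ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]
eigenvalues, G = eigenvalues[order], G[:, order]   # columns of G are the eigenvectors

Y = Z @ G                                          # PC scores, one column per PC
loadings = G * np.sqrt(eigenvalues)                # corr(z_j, y_k) = g_jk * sqrt(eigenvalue_k)

# Direct computation of the same correlations from the scores.
p = Z.shape[1]
direct = np.corrcoef(np.hstack([Z, Y]), rowvar=False)[:p, p:]

print(np.round(loadings, 3))
print("max abs difference:", np.abs(loadings - direct).max())
```

Squaring `loadings` gives, for each variable (row), the proportion of its variance accounted for by each PC; each row of squared correlations sums to 1 when all p PC’s are used.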
Slide9: Proportion of variance explained by the first PC’s Proportion and cumulative proportion are respectively the % of total variance explained by the k-th PC and by the first k PC’s.
These values are only a synthesis of the performance of the PC’s to recover the variances of the original variables.
This means that for some variables we may have a very high % of variance explained and for other vars a very low % of variance explained. Again, the proportion of variance explained by one PC is the squared correlation coefficient between the PC and one variable. Remember moreover that if we consider p PC’s all the variances are perfectly explained.
We are interested in evaluating the cumulative proportion of the variances of the original vars explained by the first k PC’s. These quantities are obtained as the sum of the squared correlation coefficients between one variable and the first k PC’s.
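As a rough illustration of how these per-variable cumulative proportions can be obtained, the sketch below sums squared correlations across the first k PC’s on synthetic data. The array name `expl` only mimics the EXPL_k naming used in these slides; the data and the 60% flag applied here are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 80 observations, 5 variables driven by two latent factors.
latent = rng.normal(size=(80, 2))
weights = np.array([[1.0, 0.9, 0.8, 0.0, 0.1],
                    [0.0, 0.2, -0.3, 1.0, 0.1]])
X = latent @ weights + 0.7 * rng.normal(size=(80, 5))

R = np.corrcoef(X, rowvar=False)
eigenvalues, G = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, G = eigenvalues[order], G[:, order]

squared_corr = (G ** 2) * eigenvalues     # r_jk^2 = g_jk^2 * eigenvalue_k
expl = np.cumsum(squared_corr, axis=1)    # column k-1: variance of each var explained by first k PC's
flag = expl > 0.60                        # flag proportions higher than 60%

# Global cumulative proportion of total variance (the synthesis over all vars).
global_cumulative = np.cumsum(eigenvalues) / eigenvalues.size

print(np.round(expl, 2))
print(np.round(global_cumulative, 2))
```

The global cumulative proportion is the average over the variables of the per-variable proportions, which is why a satisfactory global value can hide poorly explained vars.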
Slide10: Proportion of variance explained by the first PC’s These are the cumulative proportions with respect to all the original variables (synthesis). In the table below, we consider the proportion of variance of each variable explained by the 1° PC (EXPL_1), by the first 2 PC’s (EXPL_2), by the first 3 PC’s (EXPL_3), and so on. We flag proportions higher than 60%.
We observe that globally the 1° PC explains 52% of the variance and the first 2 PC’s jointly explain 63%. Nevertheless, when we turn attention to the individual vars we observe different reconstruction abilities. Some vars (e.g., Y_Educ_Lev) are not explained at all.
If the original vars are all strongly related, their reconstruction should be homogeneous.
Here 3 PC’s are synthesizing all the vars but 2, ST_grad and Y_Educ_level
50% of the variance of E_gov_avail is explained, even if this proportion is lower than those characterizing the other vars.
To explain all the vars: 5 PC’s
Slide11: Correlation between the PC’s and the original vars By analyzing the correlations between the original vars and the PC’s, we can understand which are the most relevant tendencies and which vars are the most important in defining the PC’s (the best syntheses of them). We have to flag the most relevant correlations somehow. To this end we may consider, for example, the square root of the global % of explained variance. Nevertheless, the correlations have to be analyzed both in relative and in absolute terms (a correlation of 0.26 is low in absolute value, even if it is relatively high when compared to other correlations).
Slide12: Correlation between the PC’s and the original vars 1° PC. Its variance, the highest, is much higher than the variance of the 2° PC. The % of explained variance, 52%, is very high.
This is due to the strong relationships between GERD/GERD industry, innovation of firms (EPO/USTPO), innovation of population/govern (Internet_acc/E_gov_avail), HT exports and expenditures in Educ and in IT. This group of vars is negatively correlated with GERD_govern (the R&D is financed by non-public institutions) and Telec_Exp (im/maturity of TEL sector?). Hence this tendency opposes countries where innovation involves all the levels of the economic society to countries where innovation is promoted by the government and firms are not “strong” enough to finance innovation / obtain innovative results.
NOTICE: should we synthesize all the original vars with the 1° PC, 48% of information is lost. In particular, some vars are not related to the 1° PC – information almost totally lost.
Slide13: Correlation between the PC’s and the original vars The next most relevant tendency (11% of the total variance) evidences a relationship between GERD_abroad, Telec_Exp and Educ_Exp. To this group of vars (also weakly related to the education vars) GERD_industry is opposed. This tendency opposes countries where GERD is financed from abroad (attracted by Educ_Exp and Tel_exp) to countries where GERD is financed by industry. The inverse relationship between the two kinds of financing was not emphasized by the 1° PC, which is describing instead the relationship between GERD_industry and GERD opposed to GERD_govern. Since the relationship between GERD_gov and GERD_abroad is not strong or the inverse relation between GERD_abroad and GERD_govern is not strong, the inverse relationship between the two vars was not described by the 1° PC.
Slide14: Correlation between the PC’s and the original vars The 3° PC describes the relationship between GERD_abroad and HT_exp (another kind of attraction, different from that described by the 2° PC). Now, Edu_Exp and education (Y_Educ_lev) are opposed to GERD_abroad.
The first 3 PC’s are those characterized by eigenv. > 1.
If attention is limited to these PC’s there are some vars which are related to tendencies left unexplained.
The 4° and 5° PC’s are dedicated to the explanation of the two education vars, ST_grad (4° PC) and Y_Educ_Lev (5° PC). These vars appear isolated: each dominates a tendency almost alone and needs one dedicated axis to be described. They do not appear strongly associated with the other vars. Hence, these concepts are only marginally related to innovation as measured by the other vars: these vars have an informative content which is not totally shared with the other vars. Should we think that this content is to be taken into account when studying research and innovation, we would consider at least 5 PC’s. As concerns the 6° PC, it is characterized by very low correlations and only describes one not so relevant tendency.
Slide15: Graphical representation of the correlations In the previous slides we analyzed the correlations using a table. Nevertheless, if the number of variables is very high, it can be worthwhile to use a graphical tool to determine which variables are most correlated with each PC. (Optional additional tool, useful if the previous table is too complicated to be analyzed: too many rows/vars.)
The correlations between the original vars and the PC’s may be represented in the so-called correlation plot. In this plot (one for each pair of PC’s) each variable is represented by a point whose coordinates are its correlations with the two considered PC’s.
One variable is highly correlated with one or both the considered PC’s if it is very close to the circle. (Circles in the plot)
If one variable is close to the origin this means that it is not well explained by the two considered PC’s (triangle in the plot)
Slide16: Graphical representation of the correlations (Three correlation plots; flagged correlations: (abs) > 0.6 in the first plot, (abs) > 0.5 in the second and in the third.) Using correlations we can 1) understand the meaning of the PC’s or, better, the tendency they describe 2) understand if there are isolated vars.
Nevertheless, to have a clearer idea of the % of variance of each variable explained by the PC’s it is worth considering also the determination coefficients (squared correlations).
Slide17: PCA: first conclusions about our data. PC’s as syntheses
If we are interested in defining an innovation index, we should evaluate the performance of the PC’s.
We first observe that one synthesis (the first PC) has a high explanatory power but leaves some vars unexplained and has a low explanatory power for some other vars. To obtain proper syntheses of most of the vars we should refer to 3 PC’s. Also in this case, the education vars are left unexplained.
Hence, if we want to obtain syntheses describing innovation, we could restrict attention to the first 3 PC’s, aware that in this way education is not taken into account.
If instead we are using the PC’s to obtain a few vars replacing the original ones, so that as little information as possible is lost (purely data reduction technique), we should consider 5 PC’s.
Slide18: PCA: looking at the obs Of course a tendency in the data is due to observations “inducing it”. A good way to “look” at the tendencies consists in plotting observations in the principal components space.
Before doing this, we should
Check if all the observations are well represented in this space. If some observations are poorly represented their position in the plots should not be commented
Identify the observations which are mostly responsible for the main tendencies described by the PC’s. Of course, these observations will be very well represented (by definition) in the PC’s space
It is evident that this analysis is worthwhile only if
The number of observations is not too high
The observations are well identified. For example, in our running examples obs are countries. Hence, it is meaningful to consider which are the countries dominating a tendency. Should the obs be characterized by labels which we can not recognize, this operation would be meaningless.
Slide19: PCA: looking at the obs – cosines To evaluate to which extent one observation is reproduced in the PC’s space, we can refer to its cosine. Consider a point in the original space and its projection onto the space of the PC’s (for example in the picture on the left the projection onto the 1-dim space generated by the 1° PC is considered).
Of course, the projection is good (and the approximation error is low) if the angle between the two vectors (original vector and its projection) is small. The cosine of this angle measures the “shadow” of the original vector on its projection: it is the length of the projection divided by the length of the original vector.
Since the PC’s are obtained by rotating the original axes and the vars are standardized, the cosine is simply the ratio between the distance from the point to the origin in the new space and the distance from the point to the origin in the original space (notice that high cosines correspond to low approximation errors). We know that the PC’s were extracted so as to minimize (a synthesis of) approximation errors. Nevertheless, as in the case of vars, also for obs it may happen that some observations have very low approximation errors and other observations have instead very high approximation errors.
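Slide20: PCA: looking at the obs – squared cosines (In the case of standardized vars) The squared cosine between one obs, say the i-th, and one principal component, say the k-th, cos_k, is given by:
$\cos^2_k(i) = y_{ik}^2 / \sum_{h=1}^{p} y_{ih}^2$
where $y_{ik}$ is the score of the i-th obs on the k-th PC. The denominator is the squared distance of the obs from the origin, which is the same in the original space and in the space of all the p PC’s.
(In the case of standardized vars) If attention is limited to the first r PC’s, the squared cosine between one obs and its projection onto this space, cos_1_r, is:
$\cos^2_{1\_r}(i) = \sum_{k=1}^{r} y_{ik}^2 / \sum_{h=1}^{p} y_{ih}^2$
Important: if all the p PC’s are considered, the squared cosine between each obs and its projection onto the space generated by all the p PC’s is 1.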
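The squared cosines can be computed directly from the PC scores. The sketch below uses synthetic data (hypothetical, not the country data) and assumes standardized variables; the names `cos2` and `cos2_first_r` are illustrative stand-ins for cos_k and cos_1_r.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 30 observations, 4 correlated variables.
X = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables

R = np.corrcoef(Z, rowvar=False)
eigenvalues, G = np.linalg.eigh(R)
G = G[:, np.argsort(eigenvalues)[::-1]]            # eigenvectors, decreasing eigenvalue order

Y = Z @ G                                          # PC scores
squared_dist = (Z ** 2).sum(axis=1)                # squared distance from the origin
cos2 = Y ** 2 / squared_dist[:, None]              # squared cosine of each obs with each PC
cos2_first_r = np.cumsum(cos2, axis=1)             # column r-1: quality on the first r PC's

r = 3
poorly_represented = np.flatnonzero(cos2_first_r[:, r - 1] < 0.5)
print("obs with squared cosine below 0.5 on the first 3 PC's:", poorly_represented)
```

Because the PC’s are a rotation of the standardized variables, the squared distance from the origin is unchanged, so the squared cosines of each obs sum to 1 over all p PC’s.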
Slide21: PCA: looking at the obs – squared cosines Let us consider our data. First of all, we print the observations which are poorly represented in the space spanned by the first 3 PC’s. We consider as poorly represented the observations characterized by cos_1_3 < 0.5. We can understand from this result that some observations will remain unexplained if we do not consider also the 4° and the 5° PC’s. For example, Ireland and France (4° PC), Portugal, Czech Rep, Spain (5° PC). Other obs instead will remain not well explained unless we consider a higher number of PC’s: Austria and Slovenia (6° PC), Belgium (even higher PC).
These obs are related to the tendencies described by the last PC’s which are of course only partially recovered by the first 3 PC’s
Slide22: PCA: looking at the obs – squared cosines Now, we are interested in determining which are the best represented obs. If one obs is well represented and it is also characterized by a high PC score, then it will be relevant to the definition of the tendency described by the PC itself. Consider first of all that the 1° PC, being the most powerful synthesis, will be characterized by relatively higher cosines as compared to the other PC’s. The same reasoning holds for the other PC’s: the first ones will be generally characterized by higher cosines. They are describing the most relevant tendencies; since these tendencies are relevant, there are obs inducing them, and these obs are hence well represented on them.
Let us consider the averages of the cosines with the PC’s. We will consider one obs as well represented on one PC if its cosine is higher than the mean (hence, limiting values are 0.45, 0.10, 0.09, 0.08, 0.08, 0.06) or than the median. A stronger condition refers to the 3° quartile.
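The mean, median and third-quartile cut-offs can be derived column by column from the matrix of squared cosines. This is a sketch on synthetic data, under the assumption that the "cosines" compared with the thresholds are the squared cosines defined earlier; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 40 observations, 5 variables with one dominant tendency.
latent = rng.normal(size=(40, 1))
X = latent @ np.array([[1.0, 0.9, 0.8, 0.3, 0.1]]) + rng.normal(size=(40, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigenvalues, G = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
G = G[:, np.argsort(eigenvalues)[::-1]]
Y = Z @ G
cos2 = Y ** 2 / (Z ** 2).sum(axis=1)[:, None]   # squared cosines, obs x PC

mean_cut = cos2.mean(axis=0)                    # one limiting value per PC
median_cut = np.median(cos2, axis=0)
q3_cut = np.percentile(cos2, 75, axis=0)        # stronger condition

well_mean = cos2 > mean_cut                     # well represented w.r.t. the mean
well_q3 = cos2 > q3_cut                         # well represented w.r.t. the 3rd quartile

print("mean thresholds per PC:", np.round(mean_cut, 2))
print("obs above Q3 on the 1st PC:", np.flatnonzero(well_q3[:, 0]))
```

The mean thresholds over all p PC’s add up to 1, since the squared cosines of each obs sum to 1; this is consistent with the first PC carrying by far the largest limiting value.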
Slide23: PCA: Plot of the observations Now we can plot the obs in the space spanned by the PC’s. We consider plots of paired PC’s (1/2, 3/4, 5/6; the last ones are useful to understand where we are losing information).
Types Legend  | H1    H1H2   H2     N
--------------+------------------------------------
Symbol Colors | blue  red    green  cyan
We observe that the blue and red coloured countries are well represented on the 1° PC, the green coloured countries are well represented on the 2° PC.
Of course, to better understand what the observed oppositions mean, we have to recall which vars are most related to the PC’s.
To ease the analysis it is possible to refer to a BIPLOT, i.e., a plot where obs and vars are simultaneously represented.
For the sake of simplicity we restrict attention only to vars strongly correlated with the PC’s. A more detailed analysis could be conducted by analyzing also lower correlations.
Slide24: PCA: Plot of the observations
Types Legend  | H1    H1H2   H2     N
--------------+------------------------------------
Symbol Colors | blue  red    green  cyan
1° PC: GERD/GERD industry, firms (EPO/USTPO) and population/govern (Internet_acc/E_gov_avail) innovation, HT exports and expenditures in Educ and in IT, negatively correlated with GERD_govern (R&D financed by government).
Opposes countries where innovation involves all the levels of the economic society (Sweden, Denmark, Finland, Switzerland, Germany) to countries where innovation is promoted by the government and firms are not “strong” enough to finance innovation / obtain innovative results (Bulgaria, Latvia, Lithuania, Romania).
2° PC: inverse relationship between GERD_abroad and GERD_industry. Opposes countries where GERD is financed from abroad (attracted by Educ_Exp and Tel_exp), i.e. Latvia, Estonia, UK and Austria, to countries where GERD is financed by industry, i.e. Luxembourg, Germany, Turkey, Czech Rep, Spain, Croatia, Italy, Slovakia.
Appreciate here the difference between being well represented and dominating a dimension (e.g., Italy vs Turkey on the 2° PC).
Only vars with correlations > 0.6 are plotted
Slide25: PCA: Plot of the observations
Types Legend  | H3    H3H4   H4     N
--------------+------------------------------------
Symbol Colors | blue  red    green  cyan
3° PC: GERD_abroad and HT_Exports. Opposes countries attracting R&D financing from abroad and exporting HT (UK, Latvia, Luxembourg, Netherlands, Ireland) to countries having opposite characteristics (Croatia, Slovenia, Poland, Norway).
The 4° PC is devoted to the explanation of ST_grad. Mostly Ireland, but also France, Lithuania and Spain, are characterized by a high % of graduates in Science and Technology; the opposite is true for Belgium, Latvia, Luxembourg and Netherlands.
Only vars with correlations > 0.5 are plotted
Slide26: PCA: Plot of the observations
Types Legend  | H5    H5H6   H6     N
--------------+------------------------------------
Symbol Colors | blue  red    green  cyan
Even if these last components are not particularly interesting, we can observe and comment also on the oppositions along them. The main intent here is to understand if and to which extent the observations which are not adequately described by the selected principal components are related to the excluded ones.
Only vars with correlations > 0.5 are plotted
Slide27: PCA and categorical variables In some applications we may have obs whose labels are not particularly evocative. For example, if we are considering the firms operating in certain countries and in certain sectors, we may not know exactly all the firms (there is a great difference between a label such as “Belgium” and a label such as “firm444_be”).
In these situations, the plot of the obs on the PC’s space can not give us clear information about “who is opposed to whom” along the principal axes. When this is the case, it may be sensible to identify on the plot the obs having certain characteristics (for example, if we are considering firms, we can consider the region where they are operating, the sector, or structural characteristics of the firm). In this way we can (hope to) understand if the oppositions in the PC’s space are somehow related to characteristics of the observations.
In other situations, we can be interested in evaluating if cases having certain characteristics are located in particular regions of the PC space. For example, in our running application, we may be interested in understanding if countries in certain regions tend to be located close to one another. If this is true, we can conclude that innovation (as it is measured by the PC’s) is a geographical phenomenon.
We now consider the second problem. The first one will be presented with another application
Slide28: PCA and categorical variables From this plot we can observe an opposition between macro regions (North and NorthWest on the one side, all the other regions on the other side) but only on the first PC.
This shows that innovation as measured by the 1° PC is related to regions. This is not so strong on the 2° PC.
Western  Eastern  North_West  North_East  South_West  South_East  Western_Asia
--------------------------------------------------------------------------------
blue     red      green       cyan        magenta     orange      gold
Slide29: PCA and categorical variables On the map relative to the 3° and 4° PC’s the regional tendency is not evident.
The same procedure may be applied considering the class_gdp indicator.
It may be useful to inspect the relationships between PC’s and one or more categorical variables using side-by-side box plots.
Slide30: PCA and categorical variables It is common practice to assign to some a priori defined groups of observations a score calculated on the basis of PC’s to describe the general tendency of cases in the groups. This can be done only if a clear tendency to group appears from PC’s plots/boxplots.
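One simple way to assign a score to a priori groups is to average the PC scores within each group. The sketch below does this on synthetic data; the group labels, the data and the choice of the mean as the group summary are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical grouping variable (e.g., macro regions) for 45 observations.
groups = np.array(["North", "South", "East"])[rng.integers(0, 3, size=45)]
shift = {"North": 1.5, "South": -1.0, "East": 0.0}   # made-up group effects
base = np.array([shift[g] for g in groups])[:, None]
X = base * np.array([1.0, 0.8, 0.6]) + rng.normal(size=(45, 3))

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigenvalues, G = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
G = G[:, np.argsort(eigenvalues)[::-1]]
Y = Z @ G                                            # PC scores

# Group score: mean PC score of the observations belonging to each group.
group_means = {g: Y[groups == g].mean(axis=0) for g in np.unique(groups)}
for g, m in group_means.items():
    print(g, np.round(m, 2))
```

The sign of each PC is arbitrary, so only the relative positions of the groups along an axis are meaningful. Such group scores are informative only when the plots or box plots show a clear tendency of the groups to separate.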
Slide31: PCA and outliers In the previous slides we analyzed the PC’s to evaluate if and to which extent they describe the information contained in a dataset with respect both to the vars and to the obs.
It is now important to evaluate if some obs, having some peculiar characteristics and being thus really different from the other obs, are influencing the PC’s too much (influential outliers).
The first PC’s are those having maximum variance. Hence, they are usually related to the vars with high variance (information) which are strongly related to one another. It is important to be sure that the first PC’s, which are the most important ones, are not describing tendencies which are actually only due to the presence of outliers. This is an influence analysis.
If some obs are really different from the others, it could happen that they need one “dedicated” tendency to describe their position in the cloud of points. In this case, these obs will probably not influence the first PC’s, but will dominate the last ones. In this sense, the PC’s may also be used to identify outliers.
We will first of all present an illustration of the two situations and then show how PC’s can be used to identify and remove multivariate outliers.
Slide32: PCA and outliers OUTLIERS: EXAMPLE 1
In this 3d plot, three synthetic variables are considered: X1, X2 and X3.
The green point is a point lying in a particular direction, and thus is an outlier.
This obs has a peculiar tendency, so it will probably influence the last PC and not the first ones, which will be dedicated to the description of the strong and evident relationships between the variables.
Slide33: PCA and outliers Here are the plots of the PC’s extracted from the three considered variables (the outlier is marked in the plots). Notice that the outlier is not influencing the most relevant tendencies (PC’s) but the least relevant one (the 3° PC).
Slide34: PCA and outliers Let us consider another example. The synthetic variables considered here, X1 and X3, are inversely correlated. The third variable, X4, is not related to X1 and X3. Some outliers have been inserted, inflating the relationship between X1,X4 and X3,X4. (Scatter plots of X1, X3 and X4.)
Slide35: PCA and outliers Here are the plots of the PC’s extracted from the three considered variables. Notice that the outliers are now influencing also the most relevant tendencies. The first PC describes the inverse relationship between X1 and X3, but also the relationship of the two vars with X4. The same holds for the second PC.
The influence of these outliers on the PC’s should be removed.
Slide36: PCA and outliers In PCA, as in most multivariate techniques, it is important to deal with the problem of outliers, since the analysis is based upon the correlations, and these are not robust measures of association.
It is then important to identify the outliers and to obtain the results without letting the outliers influence them.
A multivariate outlier can be defined as an obs being distant from the cloud of points, i.e., lying along anomalous directions or being so extreme as to induce a direction.
Remember that the Mahalanobis distance measures the distance of the observations from the origin, taking into account both dispersion along the axes and the orientation of the cloud.
Remember moreover that the Mahalanobis distance can be calculated as the statistical distance from the origin in the space spanned by the PC’s (all the PC’s, not only those possibly selected) or, which is the same, as the Euclidean distance of one obs from the origin in the space spanned by the standardized PC’s.
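The equivalence just recalled can be verified numerically. The sketch below, on synthetic standardized data (hypothetical), computes the squared Mahalanobis distance both as the squared Euclidean distance in the space of the standardized PC’s and directly from the inverse correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical data: 40 observations, 3 correlated variables.
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 1.0, 0.4],
                [0.0, 0.0, 1.0]])
X = rng.normal(size=(40, 3)) @ mix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized vars, centred at the origin
n, p = Z.shape

R = np.corrcoef(Z, rowvar=False)
eigenvalues, G = np.linalg.eigh(R)                 # order is irrelevant: all PC's are used

Y = Z @ G                                          # PC scores
Y_std = Y / np.sqrt(eigenvalues)                   # standardized PC's (unit variance)
d2_pc = (Y_std ** 2).sum(axis=1)                   # squared Euclidean distance from the origin

# Direct formula: z_i' R^{-1} z_i for each observation i.
d2_direct = np.einsum("ij,jk,ik->i", Z, np.linalg.inv(R), Z)

print(np.round(d2_pc[:5], 3))
print(np.round(d2_direct[:5], 3))
```

All p PC’s must be used here: dropping the last ones would discard exactly the directions along which some anomalous obs lie.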
How can we flag one observation as an outlier on the basis of its Mahalanobis distance? It can be shown that, if the data are (approximately) multivariate normal, the squared Mahalanobis distance is approximately distributed according to a Chi-square distribution with p degrees of freedom, where p is the number of original vars. The percentiles of the Chi-square distribution (for example the 95-th) may be used to flag a point as an outlier. If the squared distance of one point is greater than the 95-th percentile of the Chi-squared(p) distribution, the point is flagged as an outlier.
Slide37: PCA and outliers (Figure: density of the Chi-squared(p) distribution with the (1-α)-th percentile marked. The area before the (1-α)-th percentile, 1 – α, is the probability that an obs coming from a Chi-squared(p) distribution is lower than the (1-α)-th percentile.) Dist 1 is characterizing an obs which is not an outlier, since the cumulated mass of probability at this point is lower than (1 – α) and its significance level (area “after the point”) is greater than α. Dist 2 is instead characterizing a point which is an outlier, since the cumulated mass of probability at this point is greater than (1 – α) and its significance level (area “after the point”) is lower than α.
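The flagging rule can be sketched as follows. To keep the example free of extra dependencies, the upper-tail area of the Chi-square is computed with the closed form valid for an even number of degrees of freedom (the helper `chi2_sf_even_df` is written for this example only; a library routine such as `scipy.stats.chi2.sf` would cover any df). The data, including the planted anomalous obs, are synthetic.

```python
import math
import numpy as np

def chi2_sf_even_df(x, df):
    """Upper-tail area P(X > x) of a Chi-square with an EVEN number of df."""
    if df % 2 != 0:
        raise ValueError("the closed form used here only covers even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 4))                      # hypothetical "regular" observations
X = np.vstack([X, [8.0, -8.0, 8.0, -8.0]])         # one planted, clearly anomalous obs

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
p = Z.shape[1]
R = np.corrcoef(Z, rowvar=False)
d2 = np.einsum("ij,jk,ik->i", Z, np.linalg.inv(R), Z)   # squared Mahalanobis distances

alpha = 0.05
significance = np.array([chi2_sf_even_df(v, p) for v in d2])  # area "after the point"
outlier = significance < alpha                     # flagged when the area is below alpha

print("flagged observations:", np.flatnonzero(outlier))
```

Even with no real outliers, about 5% of the obs are expected to fall beyond the 95-th percentile, so flagged points should be inspected rather than removed automatically.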
Slide38: PCA and outliers We discussed the PC’s obtained for the data on innovation in EU. But are these results reliable, or are they influenced by the presence of outliers? There are no suspect Mahalanobis distances. The significance levels are all greater than 0.05.
Notice that Italy and Slovenia are the obs closest to the centre of gravity of the cloud.