Slide 1: Application of SPSS in Social Science Research
Reader in Statistics
Chennai – 600 005
Mobile: 94442 21627
7-Dec-09 Dr.R.RAVANAN, Presidency College

BASIC CONCEPTS
Population
Collection of all individuals, objects or items under study; its size is denoted by N
Sample
A part of the population; its size is denoted by n
Variable
Characteristic of an individual or object
Qualitative and quantitative variables
Parameter
Characteristic of the population
Statistic
Characteristic of the sample

Chart on population, sample and statistical inference
Population too large → Sample drawn from the population → Collect data from the sample → Organise data → Analyse the organised data → Draw inference which is applicable to the population

Notations of Population and Sample

Organising a raw data set

Pictorial representation of a data set

Summarising a raw data set on a quantitative variable

Slide 8: Sampling Techniques

Determination of sample size
Three factors for specifying a sample size
SD of the population
Acceptable level of sampling error
Expected Confidence Level
Sample size $n = (ZS/E)^2$
where the acceptable error is $E = Z S_{\bar{x}}$ and the standard error of the mean is $S_{\bar{x}} = S/\sqrt{n}$
S = Sample SD or an estimate of the population SD
Z = Standardized value corresponding to a confidence level
For example, if Z = 1.96, S = 90 and E = 6, then n ≈ 864
If Z = 1.96, S = 90 and E = 12, then n ≈ 216
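The two worked examples above can be checked with a short calculation. The following is a minimal Python sketch of the sample size formula; the function name `sample_size` is chosen here for illustration and is not an SPSS command.

```python
def sample_size(z: float, s: float, e: float) -> float:
    """Return the unrounded sample size n = (Z * S / E) ** 2."""
    if e <= 0:
        raise ValueError("acceptable error E must be positive")
    return (z * s / e) ** 2


# Values taken from the worked examples on this slide.
n_small_error = sample_size(z=1.96, s=90, e=6)    # acceptable error E = 6
n_large_error = sample_size(z=1.96, s=90, e=12)   # acceptable error E = 12

print(round(n_small_error), round(n_large_error))

# Ratio of the two sample sizes: n is proportional to 1 / E**2.
print(n_small_error / n_large_error)
```

In practice the result is often rounded up to the next whole number so that the stated error bound is not exceeded.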
It follows from the above calculation that the sample size can be reduced to one-fourth of its original size by doubling the acceptable error, since n is proportional to $1/E^2$.

Stages in Data Analysis
Editing
Coding
Data Entry
Analysis
Interpretation

Statistical Inference

The Concept of P Value
Given the observed data set, the P value is the smallest significance level at which the null hypothesis is rejected (and the alternative is accepted)
If the P value $\le \alpha$ then reject H0; otherwise accept H0
If the P value $\le 0.01$ then reject H0 at the 1% level of significance
If the P value lies between 0.01 and 0.05 (i.e. $0.01 <$ P value $\le 0.05$) then reject H0 at the 5% level of significance
If the P value $> 0.05$ then accept H0 at the 5% level of significance

Measurement Scales
Types of measurement scales are
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale

The Measurement Principles
Nominal: People or objects with the same scale value are the same on some attribute. The values of the scale have no 'numeric' meaning in the way that you usually think about numbers.
Ordinal: People or objects with a higher scale value have more of some attribute. The intervals between adjacent scale values are indeterminate. Scale assignment is by the property of "greater than," "equal to," or "less than."
Interval: Intervals between adjacent scale values are equal with respect to the attribute being measured. E.g., the difference between 8 and 9 is the same as the difference between 76 and 77.
Ratio: There is a rational zero point for the scale. Ratios are equivalent, e.g., the ratio of 2 to 1 is the same as the ratio of 8 to 4.

Examples of the Measurement Scales

Permissible Arithmetic Operations

Appropriate Statistics

Statistical Inference
There are two types of statistical inferences:
Estimation of population parameters and hypothesis testing.
Hypothesis testing is one of the most important tools of application of statistics to real life problems.
Most often, decisions are required to be made concerning populations on the basis of sample information.
Statistical tests are used in arriving at these decisions.

Five ingredients to statistical test
Null Hypothesis
Level of Significance
Interpretation

Slide 22: Steps in Hypothesis Testing
1. Identify the null hypothesis H0 and the alternate hypothesis H1.
2. Choose $\alpha$. The value should be small, usually less than 10%. It is important to consider the consequences of both types of errors.
3. Select the test statistic and determine its value from the sample data. This value is called the observed value of the test statistic.
4. Compare the observed value of the statistic to the critical value obtained for the chosen $\alpha$.

Types of Error

Slide 24: Use the key by answering the questions in the most relevant way.
1. Have you got more than two samples?
No......go to 2
Yes.....go to 8
2. Have you got one or two samples?
One.....Single sample t-test
Two....go to 3
3. Are your data sets normally distributed (K-S test or Shapiro-Wilk)?
No.......go to 4
Yes......go to 5
4. Do your data sets have any factor in common (dependence), i.e. location or individuals?
No......Mann-Whitney U test
Yes.....Wilcoxon Matched Pairs
5. Do your data sets have any factor in common (dependence), i.e. location or individuals?
No......go to 6
Yes.....paired sample t-test

Slide 25: Use the key by answering the questions in the most relevant way.
6. Do your data sets have equal variances (F-test)?
No......unequal variance t-test
Yes.....go to 7
7. Is n greater or less than 30?
<30.....equal variance t-test or ANOVA
>30.....z-test or ANOVA
8. Are your samples normally distributed and with equal variances?
No......Kruskal-Wallis non-parametric ANOVA
Yes.....go to 9
9. Does your data involve one factor or two factors?
One.....One-way ANOVA (see also Multiple comparison tests)
Two.....Two-way ANOVA (see also Multiple comparison tests)

A Classification of Multivariate Methods

Multivariate Analysis: Classification of Dependence Methods

Multivariate Analysis: Classification of Interdependence Methods

Test of Hypothesis
Test of Hypotheses concerning mean(s).
Test of Hypotheses concerning variance(s).
Test of Hypotheses concerning proportions.

Small Sample Test
Test based on Student’s t Distribution (W.S. Gosset)
Test based on Snedecor’s F Distribution (R.A. Fisher)
Test based on Chi-square Distribution (Karl Pearson)

Type of Statistical Tests and its Characteristics

Example for Tests of Hypotheses concerning Two population means
Sample I: 110, 120, 123, 112, 125
Sample II: 120, 128, 133, 138, 129

Tests of Hypotheses concerning proportion(s)
One-tailed tests concerning single proportion
Two-tailed tests concerning single proportion
One-tailed tests concerning two proportions
Two-tailed tests concerning two proportions

Tests of Hypotheses concerning Variance(s)
One-tailed chi-square test concerning single population variance
Two-tailed chi-square test concerning single population variance
One-tailed F-test concerning equality of two population variances
Two-tailed F-test concerning equality of two population variances

Chi-square test for checking independence of two categorized data
Consider two factors which may or may not influence the observed frequencies formed with respect to combinations of the different levels of the two factors
H0: Factor A and factor B are independent
H1: Factor A and factor B are not independent
Objective: To check whether the null hypothesis is to be accepted based on the value of the chi-square statistic, by placing the significance level $\alpha$ at the right tail of the chi-square distribution.

Chi-square test for goodness of fit
The aim is to fit the data to the nearest distribution, one which represents the data more meaningfully for future analysis. Such fitting of data to the nearest distribution is done using the goodness of fit test
H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution
Objective: To check whether the null hypothesis is to be accepted based on the value of the chi-square statistic, by placing the significance level $\alpha$ at the right tail of the chi-square distribution.

Comparing Multiple Population
Comparing multiple population variances
Comparing multiple population means

Comparing multiple population variances
For more than two populations, it is assumed that the probability distribution (i.e. histogram) of each population is approximately normal.
H0: All the population variances are equal
H1: At least two population variances differ
This test is called Bartlett’s Test
Objective: To check whether the null hypothesis is to be accepted based on the value of the chi-square statistic, by placing the significance level $\alpha$ at the right tail of the chi-square distribution.

Comparing multiple population means
For more than two populations, it is assumed that the probability distribution (i.e. histogram) of each population is approximately normal.
H0: All the population means are equal
H1: At least two population means differ
This test is called Analysis Of Variance (ANOVA)
Data from Unrestricted (independent) samples ( One-way ANOVA)
Data from Block Restricted Samples (Two-way ANOVA)
Objective: To check whether the null hypothesis is to be accepted based on the value of F, by placing the significance level $\alpha$ at the right tail of the Snedecor F distribution.

Example for One Way ANOVA
School I : 45, 54, 35, 43, 48
School II : 54, 65, 67, 55, 52
School III : 87, 65, 75, 79, 67

Non-Parametric Tests
In some situations, the practical data may be non-normal and/or it may not be possible to estimate the parameter(s) of the data
The tests which are used in such situations are called non-parametric tests
Since these tests do not depend on the distribution of the data or on its parameters, they are known as non-parametric (NP) tests or distribution-free tests
NP test can be used even for nominal data (qualitative data like greater or less, etc.) and ordinal data, like ranked data.
NP tests require less calculation, because there is no need to compute parameters.

List of Non-Parametric Tests
1. One-sample tests
   One sample sign test
   Chi-square one sample test
2. Two related samples tests
   Two samples sign test
   Wilcoxon Matched-pairs signed-rank test
3. Two independent samples tests
   Chi-Square test for two independent samples
   Mann-Whitney U test
   Kolmogorov-Smirnov two sample test
4. K related samples tests
   Friedman Two-way Analysis of Variance by Ranks
   The Cochran Q test
5. K independent samples tests
   Chi-Square test for k independent samples
   The extension of the Median test
   Kruskal-Wallis one-way Analysis of Variance by Ranks

One sample sign test
This test is applied where a sample is taken from a population which has a continuous symmetrical distribution and is known to be non-normal, such that the probability (p) of a sample value being less than the mean and the probability of a sample value being more than the mean are each ½.
Classified into four categories
One-tailed one-sample sign tests for small sample
Two-tailed one-sample sign tests for small sample
One-tailed one-sample sign tests for large sample
Two-tailed one-sample sign tests for large sample

Kolmogorov-Smirnov test
It is similar to the chi-square test in that it checks the goodness of fit of a given set of data to an assumed distribution
This test is more powerful for small samples, whereas the chi-square test is suited to large samples
H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution
The K-S test is a one-tailed test. Hence if the calculated value of D is more than the theoretical value of D for a given significance level, then reject H0; otherwise accept H0

Two samples sign test
The two samples sign test is applied where two samples are taken from two populations which have continuous symmetrical distributions and are known to be non-normal
Modified sample value, Zi = + if Xi > Yi
= - if Xi < Yi
= 0 if Xi = Yi
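The sign assignment above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the paired scores are made up, pairs with sign 0 are left out of the count (a common convention, assumed here), and the p-value is the two-tailed binomial version of the test with p = ½.

```python
from math import comb


def two_sample_sign_test(x, y):
    """Two-tailed sign test on paired observations (binomial version).

    Returns (number of '+' signs, number of '-' signs, two-tailed p-value).
    Pairs with Xi == Yi receive sign 0 and are excluded from the count.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    plus = sum(1 for xi, yi in zip(x, y) if xi > yi)
    minus = sum(1 for xi, yi in zip(x, y) if xi < yi)
    n = plus + minus
    k = min(plus, minus)
    # Under H0 the number of '+' signs follows Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)


# Hypothetical paired scores, made up for illustration only.
x = [12, 15, 11, 18, 14, 16, 13, 17]
y = [10, 15, 12, 14, 11, 13, 12, 15]
plus, minus, p_value = two_sample_sign_test(x, y)
print(plus, minus, p_value)
```

H0 is rejected when the p-value does not exceed the chosen significance level.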
Classified into four categories
One-tailed two-sample sign tests with binomial distribution
Two-tailed two-sample sign tests with binomial distribution
One-tailed two-sample sign tests with normal distribution
Two-tailed two-sample sign tests with normal distribution

The Wilcoxon Matched-pairs signed-ranks test
The Wilcoxon test is a most useful test for behavioral scientists
Let di = the difference score for any matched pair
Rank all the di without regard to sign
T = Sum of the ranks with the less frequent sign
Compute Z = [T – E(T)]/SD(T)

Mann-Whitney U Test
The Mann-Whitney U test is an alternative to the two sample t-test
This test is based on the ranks of the observations of two samples put together
Alternate name for this test is Rank-Sum Test
Let R1 = The sum of the ranks of the observations of the first sample
Let R2 = The sum of the ranks of the observations of the second sample
Objective: To check whether the two samples are drawn from different populations having the same distribution
Compute Z = [U – E(U)]/SD(U)
where U = n1n2 + [n1(n1 + 1)/2] - R1
or U = n1n2 + [n2(n2 + 1)/2] - R2

Correlation and Regression Analysis
The Chi-square test measures the association between two or more variables. This test is applicable only when the data are on a nominal scale.
Correlation and regression analysis is used for measuring the relationship between two variables measured on an interval or ratio scale.

Correlation Analysis
Correlation analysis is a statistical technique used to measure the magnitude of the linear relationship between two variables.
Correlation analysis cannot be used in isolation to describe the relationship between variables.
It can be used along with regression analysis to determine the nature of the relationship between two variables.
Thus correlation analysis can be used for further analysis
Two prominent types of correlation Coefficient are
Pearson Product Moment correlation coefficient
Spearman’s Rank correlation coefficient
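The two coefficients named above can be computed directly from their definitions. The sketch below is a plain Python illustration (not SPSS output) using the Mathematics and Statistics marks given in the example for this section; the helper names are chosen here for illustration, and tied values are assumed to share their average rank.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


maths = [89, 58, 78, 79, 86, 58]   # marks in Mathematics
stats = [75, 79, 59, 78, 84, 65]   # marks in Statistics
r = pearson(maths, stats)
rho = spearman(maths, stats)
print(round(r, 3), round(rho, 3))
```

Computing Spearman's coefficient as Pearson's r on the ranks handles the tied Mathematics marks (58 and 58) without a separate correction formula.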
Testing the significance of correlation coefficient
Type I H0: $\rho = 0$ and H1: $\rho \ne 0$
Type II H0: $\rho = r$ and H1: $\rho \ne r$
Type III H0: r1 = r2 and H1: r1 $\ne$ r2
Example:
Marks in Mathematics: 89, 58, 78, 79, 86, 58
Marks in Statistics: 75, 79, 59, 78, 84, 65

Regression Analysis
Regression analysis is used to assess the nature and closeness of relationships between two or more variables
It evaluates the causal effect of one variable on another variable
It is used to predict the variability in the dependent (or criterion) variable based on the information about one or more independent (or predictor) variables.
Two variables : Simple or Linear Regression Analysis
More than two variables : Multiple Regression Analysis

Linear Regression Analysis
Linear regression: $Y = \alpha + \beta X$
where Y : Dependent variable
X : Independent variable
$\alpha$ and $\beta$ : Two constants called regression coefficients
$\beta$ : Slope coefficient, i.e. the change in the value of Y corresponding to a change of one unit in X
$\alpha$ : Y intercept, i.e. the value of Y when X = 0
$R^2$ : The strength of association, i.e. the degree to which the variation in Y can be explained by X.
If $R^2 = 0.10$ then only 10% of the total variation in Y can be explained by the variation in X

Test of significance of Regression Equation
Linear regression: $Y = \alpha + \beta X$
F test is used to test the significance of the linear relationship between two variables Y and X
H0: $\beta = 0$ (There is no linear relationship between Y and X)
H1: $\beta \ne 0$ (There is a linear relationship between Y and X)
Objective: To check whether the estimates from the regression model represent the real world data.

Example for Regression Analysis
School Climate : 25, 34, 55, 45, 56, 49, 65
Academic Achievement: 58, 62, 80, 75, 84, 72, 89

Multivariate Analysis
Multivariate analysis is defined as “all statistical techniques which simultaneously analyse more than two variables on a sample of observations”.
Multivariate analysis helps the researcher in evaluating the relationship between multiple (more than two) variables simultaneously.
Multivariate techniques are broadly classified into two categories:
Dependence Techniques
Interdependence Techniques

A Classification of Multivariate Methods

Multivariate Analysis: Classification of Dependence Methods

Multivariate Analysis: Classification of Interdependence Methods

Discriminant Analysis
Discriminant analysis aims at studying the effect of two or more predictor variables (independent variables) on a certain evaluation criterion
The evaluation criterion may be two or more groups
Two groups such as good or bad, like or dislike, successful or unsuccessful, above expected level or below expected level
Three groups such as good, normal or poor
To check whether the predictor variables discriminate among the groups
To identify the predictor variable which is more important when compared to the other predictor variable(s)
Such analysis is called discriminant analysis
Designing a discriminant function: Y = aX1 + bX2
where Y is a linear composite representing the discriminant function, X1 and X2 are the predictor variables (independent variables) which are having effect on the evaluation criterion of the problem of interest.
Finding the discriminant ratio (K) and determining the variables which account for intergroup difference in terms of group means
This ratio is the maximum possible ratio between the ‘variability between groups’ and the ‘variability within groups’
Finding the critical value which can be used to include a new data set (i.e. new combination of instances for the predictor variables) into its appropriate group
Testing H0: The group means are equal in importance
H1: The group means are not equal in importance
using the F test at a given significance level $\alpha$

Factor Analysis
Factor analysis can be defined as a ‘set of methods in which the observable or manifest responses of individuals on a set of variables are represented as functions of a small number of latent variables called factors’.
Factor analysis helps the researcher to reduce the number of variables to be analyzed, thereby making the analysis easier.
For example, Consider a market researcher at a credit card company who wants to evaluate the credit card usage and behaviour of customers, using various variables. The variables include age, gender, marital status, income level, education, employment status, credit history and family background.
Analysis based on a wide range of variables can be tedious and time consuming.
Using Factor Analysis, the researcher can reduce the large number of variables into a few dimensions called factors that summarize the available data.
It aims at grouping the original input variables into the factors which underlie them.
For example, age, gender and marital status can be combined under a factor called demographic characteristics. Income level, education and employment status can be combined under a factor called socio-economic status. Credit history and family background can be combined under a factor called background status.

Benefits of Factor Analysis
To identify the hidden dimensions or constructs which may not be apparent from direct analysis
To identify relationships between variables
It helps in data reduction
It helps the researcher to cluster the product and population being analyzed.

Terminology in Factor Analysis
Factor: A factor is an underlying construct or dimension that represents a set of observed variables. In the credit card company example, the demographic characteristics, socio-economic status and background status each represent a set of variables.
Factor Loadings: Factor loadings help in interpreting and labeling the factors. They measure how closely the variables in the factor are associated, and are also called factor-variable correlations. Factor loadings are correlation coefficients between the variables and the factors.
Eigen Values: Eigen values measure the variance in all the variables corresponding to the factor. An eigen value is calculated by adding the squares of the factor loadings of all the variables in the factor. It aids in explaining the importance of the factor with respect to the variables. Generally, factors with eigen values of more than 1.0 are considered stable. Factors that have low eigen values (<1.0) may not explain the variance in the variables related to that factor.
Communalities: Communalities, denoted by $h^2$, measure the proportion of variance in each variable explained by the factors extracted. A communality ranges from 0 to 1. A high communality value indicates that most of the variance in the variable is explained by the factors extracted from the factor analysis.
Total Variance explained: The total variance explained is the percentage of the total variance of the variables explained. It is calculated by adding the communality values of all the variables and dividing the sum by the number of variables.
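The definitions of eigen values, communalities and total variance explained given so far can be traced on a small table of factor loadings. The loadings below are hypothetical numbers invented for illustration (they are not results from the credit card example), and the variable names are only labels.

```python
# Hypothetical factor loadings (rows: variables, columns: two factors).
loadings = {
    "age":            (0.8, 0.1),
    "marital status": (0.7, 0.2),
    "income level":   (0.2, 0.9),
    "education":      (0.1, 0.6),
}

n_variables = len(loadings)
n_factors = 2

# Communality h^2 of a variable: sum of its squared loadings across factors.
communalities = {
    name: sum(value ** 2 for value in row) for name, row in loadings.items()
}

# Eigen value of a factor: sum of squared loadings of all variables on it.
eigen_values = [
    sum(row[f] ** 2 for row in loadings.values()) for f in range(n_factors)
]

# Total variance explained: sum of communalities / number of variables.
total_variance_explained = sum(communalities.values()) / n_variables

print(communalities)
print(eigen_values)
print(total_variance_explained)
```

In this made-up table both factors have eigen values above 1.0, so both would be retained under the rule of thumb stated above.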
Factor Variance explained: The factor variance explained is the percentage of the total variance of the variables explained by each factor. It is calculated, for each factor, by adding the squared factor loadings of all the variables on that factor and dividing the sum by the number of variables.

Procedure followed for Factor Analysis
Define the problem
Construct the correlation matrix that measures the relationship between the factors and the variables.
Select an appropriate factor analysis method
Determine the number of factors
Rotation of factors
Interpret the factors
Determine the factor scores

Cluster Analysis
Cluster analysis can be defined as a set of techniques used to classify objects into relatively homogeneous groups called clusters
It involves identifying similar objects and grouping them under homogeneous groups
A cluster is a group of objects that display high correlation with each other and low correlation with objects in other clusters

Procedure in Cluster Analysis
Defining the problem: First define the problem and decide upon the variables based on which the objects are clustered.
Selection of similarity or distance measures: The similarity measure examines the proximity between the objects. Closer or similar objects are grouped together and the farther objects are ignored. There are three major methods to measure the similarity between objects:
Euclidean Distance measures
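As an illustration of the Euclidean distance measure, the sketch below computes the distance between every pair of objects and picks the closest pair, which is the pair a bottom-up clustering method would join first. The three objects and their coordinates are hypothetical.

```python
from itertools import combinations
from math import dist

# Hypothetical objects measured on two variables; values are made up.
objects = {"A": (1.0, 2.0), "B": (4.0, 6.0), "C": (1.5, 2.5)}

# Euclidean distance for every pair of objects.
distances = {
    (p, q): dist(objects[p], objects[q])
    for p, q in combinations(objects, 2)
}

# The pair with the smallest distance is the most similar pair.
closest_pair = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, round(d, 3))
print("closest:", closest_pair)
```

`math.dist` requires Python 3.8 or later; variables measured in different units are usually standardised before distances are computed.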
Selection of clustering approach: To select the appropriate clustering approach. There are two types of clustering approaches:
Hierarchical Clustering approach
Non-Hierarchical Clustering approach
Hierarchical clustering approach: Consists of either a top-down approach or a bottom-up approach. Prominent hierarchical clustering methods are: Single linkage, Complete linkage, Average linkage, Ward’s method and Centroid method.
Non-Hierarchical clustering approach: A cluster center is first determined and all the objects that are within the specified distance from the cluster center are included in the cluster
Deciding on the number of clusters to be selected
Interpreting the clusters

Canonical Correlation Analysis (CCA)
CCA is a way of measuring the linear relationship between two multidimensional variables.
CCA is an extension of multiple regression analysis (MRA)
MRA analyses the linear relationship between a single dependent variable and multiple independent variables.
CCA analyses the linear relationship between multiple dependent variables and multiple independent variables.
For example, a social researcher wants to know the relationship between various work environment factors (like work culture, HR policies, Compensation structure, top management) influencing various employee behaviour elements (Employee productivity, job satisfaction, perception about company)
The linear combinations of each set of variables are called canonical variables or canonical variates.
CCA tries to maximize the correlation between the two canonical variables
For example, let U represent the linear combination of work environment factors
U = a1X1 + a2X2 + a3X3 + a4X4
and V represent the linear combination of employee behaviour factors
V = b1Y1 + b2Y2 + b3Y3 + b4Y4
The coefficients of each canonical variable are called canonical coefficients
To interpret the canonical analysis, the researcher examines the relative magnitude and the sign of the several weights defining each equation and sees if a meaningful interpretation can be given.
Being a complex statistical tool that requires a great investment of effort and computing resources, CCA has not gained as much popularity as statistical tools like multiple regression.

Multivariate Analysis of Variance (MANOVA)
MANOVA examines the relationship between several dependent variables and several independent variables
It tries to examine whether there is any difference between various dependent variables with respect to the independent variables.
For example, an industrial buyer wants to know whether the product from Company A, Company B and Company C differ in terms of various parameters (set by the company) such as quality, customer support, pricing and reliability.
The difference between ANOVA and MANOVA is that while ANOVA deals with problems containing one dependent variable and several independent variables, MANOVA deals with problems containing several dependent variables and several independent variables.
Another major difference is that the ANOVA test ignores interrelationship between the variables. This leads to biased results
MANOVA considers this aspect by testing the mean difference between groups on two or more dependent variables simultaneously.