0AP03 Field college2 0607 Hufflepuff
Added: June 24, 2007. Category: Education. License: All Rights Reserved.

Presentation Transcript

0AP03: Methods and models in behavioral research
Part 2: Understanding statistics using SPSS (Field)
Chris Snijders, c.c.p.snijders@gmail.com
NOTE: www.tue-tm.org/moodle

What is multiple regression?
You have a target variable (Y) that you want to predict using predictor variables X1 through Xn using:
$\hat{Y} = b_0 + b_1 X_1 + \dots + b_n X_n$
where the bi's have to be found in such a way that the estimated Y is close to the real Y. Usually there are two reasons to want this:
- Predicting
- Understanding
Using this model is usually called:
- Multiple regression analysis
- 'Ordinary Least Squares' (OLS)

Simple regression: Y and one X
'Ordinary least squares': we define a concept of 'wrongness', or deviation: it is the distance to the real value, squared.
deviation = $(\text{observed} - \text{model})^2$
Choose the b's so that the sum of these squared deviations is minimized.

Why multiple linear regression?
Suppose I want to predict some Y. For instance: Y = number of goals scored in international soccer games.
My first guess: an important predictor is the height of a player. The taller a player is, the more he will score.
So X1 = height, and the model becomes:
goals = b0 + b1 · height
Now suppose that I run this simple regression analysis and find: indeed, taller players score more goals. That is, b1 is estimated to be significantly larger than 0. [NB: we do not find this in the real data, but let us suppose we do.]

Why multiple linear regression? (2)
So there appears to be a relation between height and number of goals scored.
Possible (although a bit weird) counterargument: 'This finding is spurious. Players who are older are more likely to have played many games. When you have played more games, you are likely to have scored more. If older players tend to be taller, you might actually be measuring the effect of age, not of height.'
[Solution 1] Split the data into two groups: old and young players. Run a simple regression analysis separately for both groups. This is fine, but in general we do not want to do this.
[Solution 2] Multiple regression: include age as a second predictor.
goals = b0 + b1 · height + b2 · age
If we now find that the coefficient b1 is no longer significantly different from 0, this suggests that it is not height that drives the number of goals, but age.

Why linear equations?
Especially in the social sciences, you often do not have a precise equation for the relation between X and Y. Most of the time, we have an idea of the kind 'if X increases, then Y is likely to increase', without any specific idea about the shape of the relation. A linear model is a good start.
And: even if you have a concrete non-linear equation, you can often find a linear approximation (using a Taylor expansion, for instance) that is good enough for all practical purposes.
Moreover, the equation only has to be linear in the coefficients b; the predictors themselves can be non-linear functions of the measured variables! For instance: [example equation not preserved in the transcript]
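The point about non-linear predictors can be illustrated with a small sketch: a model containing an X² term is still linear in its coefficients, so ordinary least squares applies unchanged. The data, the helper name `fit_ols`, and the quadratic form are assumptions made for illustration; they do not come from the slides, and the course itself uses SPSS for this.

```python
def fit_ols(rows, y):
    """Return OLS coefficients for design-matrix rows (intercept column included).

    Hypothetical helper: solves the normal equations (X'X) b = X'y
    by Gaussian elimination with partial pivoting.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    a = [xtx[i] + [xty[i]] for i in range(k)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for c in range(col, k + 1):
                a[row][c] -= factor * a[col][c]
    b = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        tail = sum(a[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (a[i][k] - tail) / a[i][i]
    return b


# Invented, noise-free data generated from y = 1 + 2x + 3x^2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]

# The predictors are 1, x and x^2: non-linear in x, linear in b0, b1, b2.
rows = [[1.0, xi, xi ** 2] for xi in x]
b0, b1, b2 = fit_ols(rows, y)
print(f"b0={b0:.3f}, b1={b1:.3f}, b2={b2:.3f}")
```

Because the data contain no noise, the fit recovers the generating coefficients up to rounding error; with real data the estimates would only approximate them.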
Why is it beautiful …
[1] You can test hypotheses about effects of predictors on targets (Xs on Y), taking into account possibly intervening factors.
[2] It combines several 'separate models' into a single analysis:
- Y compared between two groups: t-test
- Y compared between three groups: anova
- Y compared between three groups and two treatments: (blocked) anova
- the relation between one interval X and Y: correlation
All of these (and more) can be done with multiple regression.
And: more complicated methods are usually a logical consequence of multiple regression.

Software for linear regression
- SPSS
- Alternatives for SPSS (MiniTab, GLIM, Stata, Statistica, …)
- Several freeware packages
- In Excel, using plug-ins (for instance PopTools, which is also freeware)
We use SPSS.

This week's data: airport passengers
Predict the number of passengers at an airport terminal …

Number of passengers
Variables: day (as of May 2005), hour, minutes, passenge, weekday, quarts
Predict passenge from the rest of the data.

What do we need?
Y has to be an interval variable.
X can be:
- Interval
- Ordinal
- Nominal/categorical
But: you have to know how to incorporate them!
<example on blackboard: predicting passengers>

Notations / definitions
Notation: the OLS estimator for Y, 'Y hat', is calculated by choosing values for bi ('bi hat') so that
$\sum_i (Y_i - \hat{Y}_i)^2$
is minimal (= sum of squared residuals).
Note: no statistics is involved! No sampling, no distributions, …. The only thing you need is a definition of deviance (= error = residual).

Visually, this is …
[figures: one X; two Xs]

How well does the model fit?
Take once again the example with one X (simple regression).
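As a minimal sketch of that one-X case, the OLS estimates can be written in closed form and checked against the definition: no other choice of b0 and b1 gives a smaller sum of squared residuals. The five data points and all variable names are invented for illustration.

```python
# Invented toy data: one predictor x, one target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def sum_sq_residuals(b0, b1):
    """Sum of (observed - model)^2 for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Closed-form OLS solution for a single predictor.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
b1_hat = s_xy / s_xx
b0_hat = mean_y - b1_hat * mean_x

ssr = sum_sq_residuals(b0_hat, b1_hat)
print(f"b0_hat={b0_hat:.3f}, b1_hat={b1_hat:.3f}, SSR={ssr:.3f}")
# prints: b0_hat=2.200, b1_hat=0.600, SSR=2.400

# Moving either coefficient away from the OLS values increases the SSR.
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert sum_sq_residuals(b0_hat + db0, b1_hat + db1) > ssr
```

Note that nothing here involves sampling or distributions: the estimates follow purely from the definition of the residual.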
[figure panels: total sum of squares; residual sum of squares; model sum of squares]
Observed = Model + Error

How well does the model fit?
Two ways to assess model fit:
[1] Through the sum of squared errors (SSR): $R^2 = 1 - SS_R / SS_T$
[2] Through correlation: the squared correlation between the observed Y and the predicted $\hat{Y}$
And: [1] and [2] are the same, and called $R^2$.

About R2
Intuitively: it is easier to get higher $R^2$ values when you have more predictor variables X. And, if you only have a handful of cases, your $R^2$ can be high just by coincidence.
To compare between different models (and data sets) we use 'adjusted $R^2$', which takes into account the number of X's (k) and cases (n) you have used:
$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$
In principle: the higher $R^2$, the better. Typical applications in the social sciences get at most about $R^2 = 0.5$.
It is not an absolute criterion: you can have a high $R^2$ and still have learned nothing.

From sample to population
For several reasons, the best fitting values b are not completely equal to their actual values in the population (NB: only here does the statistics come in):
[1] 'Measurement error'
[2] 'Sampling error'
[3] 'Uncontrolled variance'
How can we say something about the real value of bi in the population?
Assumption: the 'real model' in the population is
$Y = b_0 + b_1 X_1 + \dots + b_n X_n + \varepsilon$
with $\varepsilon \sim N(0, \sigma^2)$, and $\sigma$ does not depend on the values of X.

And when this assumption is met…
… you can not only find best fitting values for bi but also confidence intervals for these estimates.
Statistics programs give you:
- the value 'bi-hat' and its standard error for each estimated coefficient
- a (95%) confidence interval
- a t-value
- a p-value
The p-value is the probability of finding an estimate at least as far from zero as the one observed if the null hypothesis (the coefficient equals zero in the population) were true. Loosely: how likely a result this strong would be by accident.
Standard rule: when p < 0.05, we reject the hypothesis that the coefficient in the population equals zero.

What's still missing?
- Outliers
- Interaction effects and transformations of variables
- Multicollinearity
- Violations of assumptions
Next week.

Weekly not-on-the-exam fact
The only game played against the house in a casino in which you can have a positive expectation is blackjack.
Some standard strategies (estimated expectation):
- typical casino player: -2.0% to -15.0%
- 'never bust': -6.0%
- mimic the dealer: -5.7%
- basic strategy: -0.5%
- basic strategy+: -0.0%
- card counting: +1.5% to +2.5%
Standard mistake: player has 10 + 5 = 15 versus dealer 10. Players tend to stand, but you should hit.
Ed Thorp, 'Beat the dealer' (1962)
Google for 'basic strategy blackjack'.

(continued)
[table: blackjack basic strategy]

To Do
- Study the parts in chapter 5 that relate to today's slides (that is more than half of chapter 5).
- PRACTICE! Look at the passenger data and/or the soccer player data and/or the data in Field and see whether you can run and understand a regression analysis.
- Go to www.tue-tm.org/moodle and enroll in 0AP03. The site content is the same as before, but there are some extra possibilities.
- Check out and add to the WIKIs.

www.tue-tm.org/moodle
WIKIs at www.tue-tm.org/moodle
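For the practice item on running and understanding a regression analysis, the fit measures from the slides can be checked by hand on a tiny example. The sketch below uses invented data with one predictor (k = 1) and assumes the standard adjusted R2 formula $1 - (1 - R^2)(n - 1)/(n - k - 1)$; it shows that the two routes to R2 agree and computes the t-value for b1. Turning that t-value into a p-value would additionally require the t-distribution with n - k - 1 degrees of freedom, which is left out here.

```python
import math

# Invented toy data: one predictor x, one target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n, k = len(x), 1  # number of cases, number of predictors

mean_x, mean_y = sum(x) / n, sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / s_xx
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

# Observed = Model + Error, expressed as sums of squares.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ss_model = sum((fi - mean_y) ** 2 for fi in y_hat)

# Route [1]: R2 from the sums of squares.
r2_from_ss = 1 - ss_resid / ss_total

# Route [2]: squared correlation between observed and predicted values.
mean_f = sum(y_hat) / n
cov = sum((yi - mean_y) * (fi - mean_f) for yi, fi in zip(y, y_hat))
var_f = sum((fi - mean_f) ** 2 for fi in y_hat)
r2_from_corr = (cov / math.sqrt(ss_total * var_f)) ** 2

adj_r2 = 1 - (1 - r2_from_ss) * (n - 1) / (n - k - 1)

# Standard error and t-value of b1 (valid under the model assumptions).
se_b1 = math.sqrt(ss_resid / (n - k - 1) / s_xx)
t_b1 = b1 / se_b1

print(f"R2 (sums of squares) = {r2_from_ss:.3f}")
print(f"R2 (correlation)     = {r2_from_corr:.3f}")
print(f"adjusted R2          = {adj_r2:.3f}")
print(f"t-value for b1       = {t_b1:.3f}")
```

With only five cases the adjusted R2 is noticeably lower than R2, which matches the warning on the slides that a handful of cases can produce a high R2 by coincidence.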