Lecture 9 (Jacob). Added: December 29, 2007.

Presentation Transcript

Slide1: Regression Continued… Prediction, Model Evaluation, Multivariate Extensions, & ANOVA

Slide2: *Note: If the categorical variable is ordinal, Spearman rank correlation methods are applicable.

Correlation Review: [Figure: two example scatter plots of y against x (Example 1, Example 2), each divided into quadrants I-IV.]

Simple Linear Regression Review: For each point (xi, yi), find the distance to the new line, y-hat, and to the "naïve" guess, y-bar.

Linear Regression Continued…: Predicted Values. Model Evaluation. No longer "simple": MULTIPLE Linear Regression. Parallels to "Analysis of Variance", aka ANOVA.

1. Predicted Values:
Last week, we conducted hypothesis tests and constructed CIs for the slope of our linear regression model. However, we might also be interested in estimating the mean (and/or individual) value of y for particular values of x.

Predicted Values: Newborn Infant Length Example: Last week, we came up with the least squares line for the mean length of low birth weight babies, with y = length and x = gestational age (weeks). What is the predicted mean length of infants at 20 weeks? 30 weeks? The "hat" denotes an estimate.

Predicted Values: Newborn Infant Length Example: We can make a point estimate. Let x = 29 weeks and compute y-hat. Now we are interested in a CI around this estimate.

Predicted Values: CIs: Confidence interval for y-hat: in order to calculate this interval, we need the standard error of y-hat. Note that we get a different standard error of y-hat for each x.

Slide12: Notice that as x gets further from x-bar, the standard error gets larger (leading to a wider confidence interval). se(y-hat) at 29 weeks = 0.265 cm.

Slide13: Plugging in x = 29 weeks and se(y-hat) = 0.265, the 95% CI for the mean length of infants at 29 weeks of gestation is (36.41, 37.45).

Predicted Values: CIs: We can do the same for an individual infant. Interval for an individual y (often called a prediction interval): in order to calculate this interval, we need the standard error of y, which is always larger than the standard error of y-hat. Note that we get a different standard error of y for each x.

Slide15: Again, as x gets further from x-bar, the standard error gets larger (leading to a wider interval). se(y) at x = 29 (an infant at 29 weeks) = 2.661 cm. There is much more variability at the individual level.

Slide16: Plugging in x = 29 weeks and se(y) = 2.661 (note that the point estimate for an individual y is y-hat), the 95% interval for the length of an individual infant at 29 weeks of gestation is (31.66, 42.20).
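The two intervals at x = 29 weeks can be checked with a little arithmetic. The sketch below is only an illustration and not the lecture's own calculation: it recovers the point estimate as the midpoint of each reported interval and backs out the multiplier implied by each reported standard error. The function names are hypothetical, and no regression coefficients are assumed.

```python
# Values reported on the slides for x = 29 weeks of gestation.
CI_MEAN = (36.41, 37.45)        # 95% CI for the mean length (y-hat)
CI_INDIVIDUAL = (31.66, 42.20)  # 95% interval for an individual infant
SE_MEAN = 0.265                 # standard error of y-hat, in cm
SE_INDIVIDUAL = 2.661           # standard error for an individual y, in cm


def midpoint(interval):
    """Centre of a symmetric interval, i.e. the point estimate."""
    low, high = interval
    return (low + high) / 2


def implied_multiplier(interval, se):
    """Half-width divided by the standard error (the critical value used)."""
    low, high = interval
    return (high - low) / 2 / se


def interval_from(center, se, multiplier):
    """Rebuild the interval center +/- multiplier * se."""
    return (center - multiplier * se, center + multiplier * se)


y_hat = midpoint(CI_MEAN)
print(f"point estimate: {y_hat:.2f} cm")
print(f"multiplier (mean): {implied_multiplier(CI_MEAN, SE_MEAN):.3f}")
print(f"multiplier (individual): {implied_multiplier(CI_INDIVIDUAL, SE_INDIVIDUAL):.3f}")
```

Both intervals share the same centre, and both implied multipliers come out close to 2, as expected for a 95% interval with a large sample. The only thing that changes between the two intervals is the standard error, which is why the individual interval is roughly ten times wider.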
This is a wider interval, compared to (36.41, 37.45) for y-hat.

2. Model Evaluation: Homoscedasticity (residual plots). Coefficient of Determination (R2). Just how well does our model fit the data?

Review of Assumptions: Assumptions of the linear regression model: The y values are distributed according to a normal distribution, with a mean and a variance that is unknown. The relationship between X and Y is linear. The y are independent. For every value x, the standard deviation of the outcomes y is constant. This concept is called homoscedasticity.

Model Evaluation: Homoscedasticity: [Figure: scatter plot of y against x with the fitted line.] Calculate the residual distance for each (xi, yi). In the end, we have n residuals, the ei's.

Model Evaluation: Homoscedasticity: Now we plot each of the ei's. Are the residuals increasing or decreasing as the fitted values get larger?
Here they are fairly consistent across y-hats. Look for outliers; if any are present, you may want to remove them and refit the line. [Figure: residuals (axis from -4 to 4) plotted against fitted values (y-hats).]

Model Evaluation: Homoscedasticity: Example of heteroscedasticity: increasing variability as the fitted values increase. This suggests a nonlinear relationship; we may need to transform. [Figure: residuals (axis from -4 to 4) plotted against fitted values (y-hats), next to a plot of y against x.]

Model Evaluation: Coefficient of Determination: R2 is a measure of the relative predictive power of a model, i.e., the proportion of variability in Y that is explained by the linear regression of Y on X. In simple linear regression it is the Pearson correlation coefficient squared (r2 = R2). R2 ranges between 0 and 1.

Model Evaluation: Coefficient of Determination: The closer to one, the better the model (greater ability to predict). R2 = 1 would imply that your regression model provides perfect predictions (all data points lie on the least-squares regression line). R2 = 0.7 would mean 70% of the variation in the response variable can be explained by the predictor(s).

Model Evaluation: Coefficient of Determination: Given that R-squared is the Pearson correlation coefficient squared, we can solve for one from the other. If x explains none of the variation in y, then r = 0 and R2 = 0.

Model Evaluation: Coefficient of Determination: Adjusted R-squared = R-squared adjusted for the number of variables in the model, i.e., "punished" for additional variables. We want a more parsimonious (simple) model. Note that R-squared does NOT tell you if: the predictor is the true cause of the changes in the dependent variable (CORRELATION ≠ CAUSATION!); the correct regression line was used; variables may have been omitted (multiple linear regression…).

3. Multiple Linear Regression: Extend the simple model to include more variables. Increase our power to make predictions!
The model is no longer a line, but multidimensional. Outcome = function of many variables, e.g., sex, age, race, smoking status, exercise, education, treatment, genetic factors, etc.

Multiple Regression: Naïve model: y = β0 + β1x1. With a second predictor, y = β0 + β1x1 + β2x2, we can now assess the effect of x1 on y while controlling for x2 (a potential "confounder"). We can continue adding predictors. We can even add "interaction" terms (i.e., x1*x2).

Multiple Regression: Interpretation: β0 = y-intercept (when both x1 = 0 and x2 = 0); often not interpretable. β1 = increase in y for every one-unit increase in x1, while holding x2 constant. β2 = increase in y for every one-unit increase in x2, while holding x1 constant.

Multiple Linear Regression: Can incorporate (and control for) many variables: a single (continuous) dependent variable, and multiple independent variables (predictors). The predictors may be of any scale (continuous, nominal, or ordinal).

Multiple Regression: Indicator ("dummy") variables are created and used for categorical variables.
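The slide's own example of dummy coding did not survive in this transcript. As a stand-in, here is a minimal sketch of how indicator variables might be built for a categorical predictor. The helper `dummy_code`, the `is_` naming, and the smoking categories are all hypothetical and chosen only for illustration.

```python
def dummy_code(values, reference):
    """Return one dict of 0/1 indicators per observation.

    One indicator is created for every level except the reference level,
    so a variable with k categories yields k - 1 indicators.
    """
    levels = sorted(set(values) - {reference})
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]


# Made-up categorical predictor with three levels.
smoking = ["never", "former", "current", "never"]

for value, row in zip(smoking, dummy_code(smoking, reference="never")):
    print(value, row)
```

The reference level ("never" here) is the one coded as all zeros, so each coefficient on an indicator is read as a difference from that reference group.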
We need (# categories - 1) "dummy" variables. There is an Analysis of Variance equivalent, which we will cover later.

Multiple Regression: Conduct the F-test for the overall model as before, but now with k and n-k-1 degrees of freedom, where k = # of predictors in the model. Conduct t-tests for the coefficients of predictors as before, but now with n-k-1 degrees of freedom. Note that the F-test is no longer equivalent to the t-test when k > 1.

Multiple Regression: Confounding: Multiple regression can estimate the effect of each variable while controlling for (adjusting for) the effects of other (potentially confounding) variables in the model. Confounding occurs when the effect of a variable of interest is distorted because you do not control for the effect of another "confounding" variable.

Multiple Regression: Confounding: For example, not accounting for confounding: y = β0 + β1x1. Adjusting for the effect of x2: y = β0 + β1x1 + β2x2. By definition, a confounding variable is associated with both the dependent variable and the independent variable (the predictor of interest, i.e., x1). Does β1 change in the second model?
If yes, then there is evidence that x2 is confounding the association between x1 and the dependent variable.

Multiple Regression: Confounding: Assume a model of blood pressure, with alcohol consumption and weight as predictors. Weight and alcohol consumption may be associated. [Diagram: Weight, Alcohol Consumption (confounder for the effect of weight on blood pressure), Blood Pressure.]

Multiple Regression: Effect Modification: Interactions (effect modification) may be investigated. The effect of one variable depends on the level of another variable.

Multiple Regression: Effect Modification: The effect of x1 depends on x2:
x2 = 0 (non-smoker): effect of x1 is β1
x2 = 1 (smoker): effect of x1 is β1 + β3
BP example: If x1 = weight and x2 = smoking status, then the effect on your BP of an additional 10 lbs would be different if you were a smoker vs. a non-smoker.

Multiple Regression: Effect Modification: With the interaction model y = β0 + β1x1 + β2x2 + β3x1x2:
Smoker: y = (β0 + β2) + (β1 + β3)x1
Non-smoker: y = β0 + β1x1
DIFFERENCE = β2 + β3x1
*The difference between smokers and non-smokers depends on x1. New slope and intercept!

Multiple Regression: Effect Modification: DIFFERENCE in slope and intercept. [Figure: y against x.]

Multiple Regression: Confounding or Effect Modification?: Confounding without effect modification: the overall association between the predictor of interest and the dependent variable is not the same as it is after stratifying on a third variable (the "confounder"). However, after stratifying, the association is the same within each stratum.
Effect modification without confounding: the overall association accurately estimates the average effect of the predictor on the dependent variable, but after stratifying on a third variable, that effect differs across strata. Both: the overall association is not a correct estimate of the effect, and the effects differ across subgroups of the third variable.

Multiple Regression: How to build a multiple regression model: Examine two-way scatter plots of potential predictors against your dependent variable. Evaluate those that look associated in a simple linear regression model ("univariate" analysis). Pick out the significant univariate predictors. Use stepwise model building techniques: backwards, forwards, stepwise, best subsets.

Multiple Regression: Like simple linear regression, these models require an assessment of model adequacy and goodness of fit: examination of residuals (comparison of observed vs. predicted values) and the coefficient of determination. Pay attention to adjusted R-squared.

Lead Exposure Example (from Rosner B. Fundamentals of Biostatistics. 5th ed.): Study of the effect of lead exposure on neurological and psychological function in children. Compared the mean finger-wrist tapping score (maxfwt), a measure of neurological function, between exposed (≥ 40 µg/100 mL) and control children (< 40 µg/100 mL). Measured in taps per 10 seconds. We already have the tools to do this in the "naïve" case: the 2-sample t-test.

Lead Exposure Example: Need a dummy variable for exposure: CSCN2 = 1 if the child is exposed, 0 if the child is a control. With the 2-sample t-test, we compared the means of the exposed and the controls.
Now, we can turn this into a simple linear regression model: MAXFWT = α + β×CSCN2 + e. Estimates for each group:
Exposed (CSCN2 = 1): MAXFWT = α + β×1 = α + β
Controls (CSCN2 = 0): MAXFWT = α + β×0 = α
β represents the difference between groups (a one-unit increase in CSCN2). Testing β = 0 is the same as testing whether the mean difference = 0.

Lead Exposure Example: As just shown, MAXFWT(exposed) = α + β = 55.095 - 6.658 = 48.437 and MAXFWT(controls) = α = 55.095. Mean difference = -6.658.

Lead Exposure Example: Equivalent to a two-sample t-test (with equal variances) of H0: μc = μe. t = -3.003, p = 0.0034. The slope of -6.658 is equivalent to the mean difference between exposed and controls.

Lead Exposure Example: R-squared is not strong: exposure group alone explains little of the variability in MAXFWT.

Lead Exposure Example: What other variables are related to neurological function? It is often strongly related to age and gender. Look at scatterplots of both age and gender vs. MAXFWT, separately.
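The equivalence stated above, that regressing on a 0/1 dummy variable gives an intercept equal to the control mean and a slope equal to the difference in group means, can be checked numerically. The sketch below uses made-up tapping scores, not the Rosner data, and a hypothetical helper `fit_simple_ols` that applies the usual least-squares formulas.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for the line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up scores in taps per 10 seconds (illustration only).
controls = [58, 52, 61, 49, 55, 57]
exposed = [47, 50, 44, 53, 46]

# Dummy variable: 0 for controls, 1 for exposed.
dummy = [0] * len(controls) + [1] * len(exposed)
scores = controls + exposed

alpha, beta = fit_simple_ols(dummy, scores)
print("intercept:", alpha, "control mean:", mean(controls))
print("slope:", beta, "mean difference:", mean(exposed) - mean(controls))
```

The intercept matches the control mean and the slope matches the exposed-minus-control difference, which is why testing β = 0 in the regression asks the same question as the two-sample t-test.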
Both show evidence of association. [Figure: MAXFWT plotted against age (years); MAXFWT for males and females.]

Lead Exposure Example: Age in years, and sex coded (1 = Male, 2 = Female). Both appear to be associated with MAXFWT; age is statistically significant (p = 0.0001).

Lead Exposure Example: Our first multiple linear regression model: numerator DF = k = 2 (sum of squares regression); denominator DF = n-k-1 = 92 (sum of squares error).

Lead Exposure Example: Adjusted multiple linear regression model: the coefficients for age and sex haven't changed by much. The coefficient for CSCN2 is smaller in magnitude than the crude (naïve) difference: -5.147, compared with -6.658 taps/10 seconds.

Lead Exposure Example: R-squared is up to 0.56 (from 0.09 in the simple model). Note: adjusted R-squared compensates for added complexity in the model. Since R-squared never decreases as more variables are added, and we want to keep things as simple as we can, adjusted R-squared takes that into account.

Lead Exposure Example: Interpretation: Holding sex and age constant (e.g., male and 10 years), the estimated mean difference between groups is -5.15 taps/10 seconds, with a 95% CI of (-8.23, -2.06).

Other Regression Models: Logistic regression: used when the dependent variable is binary; very common in public health/medical studies (e.g., disease vs. no disease). Poisson regression: used when the dependent variable is a count. (Cox) proportional hazards (PH) regression: used when the dependent variable is an "event time" with censoring.

4. ANOVA: *Note: If the categorical variable is ordinal, rank correlation methods are applicable.
Analysis of Variance: Hypothesis test for a difference in the means of k groups. H0: μ1 = μ2 = μ3 = … = μk. HA: at least one pair is not equal. We assess differences in means using VARIANCES: within-group and between-group variability. If there is no difference in means, then the two types of variability should be equal (assuming within-group variability is constant across groups). Note that if k = 2, this is the same as the two-sample t-test. We only need k-1 "dummy" variables.

Analysis of Variance: Parallels the methods used for regression when we had one continuous and one categorical variable (with k levels). In constructing least-squares lines, we evaluated how much variability in our response could be explained by our explanatory (predictor) variables vs. left unexplained (residual error).

Analysis of Variance: The total error (SSY) was split into two portions: variability explained by the regression (SSR) and residual variability (SSE).

Analysis of Variance: Similarly, we can think of this as the variability WITHIN and BETWEEN each level of the predictor.

Analysis of Variance: Box plots for five levels of an explanatory variable: the size of the boxes (Q1-Q3) reflects "within-group" variability; the placement of the boxes along the y-axis reflects "between-group" variability.

Analysis of Variance: Box plots for five levels of an explanatory variable and the total (combined): the total and a y-bar line are added, so we can see where the groups lie relative to the mean. How much overlap?
[Figure: box plots of y for groups A, B, C, D, E and TOTAL.]

Analysis of Variance Table: k = # of groups (levels of the categorical variable). Remember, using parallel regression methods, we only needed k-1 variables for k groups, so now we have k-1 and n-k degrees of freedom. The between-group sum of squares was formerly known as SSR; the within-group sum of squares was formerly known as SSE.

Analysis of Variance Table: The same test is conducted as we saw with regression, testing the ratio of between-group to within-group variability (each sum of squares divided by its degrees of freedom). The larger the between is relative to the within, the more likely we are to reject the null hypothesis.

Analysis of Variance: If we do reject the null hypothesis of all group means being equal (based on the F-test), then we only know that at least one pair differs. We still need to find where those differences lie, using post-hoc tests (aka multiple comparisons), e.g., Tukey, Bonferroni. These perform two-sample tests while adjusting to maintain the overall α level.
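The split of total variability into between-group and within-group pieces, and the F ratio built from them, can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's own computation: the function `one_way_anova` is hypothetical and the three groups are made-up numbers.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ss_between, ss_within, f_stat, df_between, df_within)."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean(v for g in groups for v in g)

    # Between-group: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group: spread of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat, df_between, df_within


# Made-up data for three groups (illustration only).
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
ss_b, ss_w, f_stat, df_b, df_w = one_way_anova(groups)
print(f"SS between = {ss_b:.2f} on {df_b} df")
print(f"SS within  = {ss_w:.2f} on {df_w} df")
print(f"F = {f_stat:.2f}")
```

The F statistic would then be compared with an F distribution on k-1 and n-k degrees of freedom. A large value, as here where the group means are far apart relative to the spread inside each group, points toward rejecting the null hypothesis of equal means.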