Chap1and2


ST 311: Introduction to Inference: 

ST 311: Introduction to Inference Spencer Muse www.stat.ncsu.edu/~muse muse@stat.ncsu.edu

Cover Handout: 

Cover Handout

Course Goals: 

Course Goals
- Understand the logic of statistical thinking
- Learn basic statistical methods for studying means and percentages
- Learn the basics of least squares regression
- Learn basic probability concepts
- Learn to implement methods using StatCrunch
- Recognize the role of statistics in day-to-day life

Introduction: 

Introduction Statistics is a collection of procedures and principles for gathering data and analyzing information to help people make decisions when faced with uncertainty (randomness).

Individuals and Variables: 

Individuals and Variables A variable is a characteristic of an individual, and takes different values for different individuals. In the previous slide, Manufacturer and Max Price were examples of variables. For individual 4 (the Audi 100), the value of Manufacturer was Audi, while the value of Max Price was 44.6 (thousand). Individuals are objects described by data. They are often people or things. In the previous slide, the individuals were car models.

Categorical and Quantitative Variables: 

Categorical and Quantitative Variables A categorical variable (sometimes called a qualitative variable) places an individual into one of several groups or categories. A quantitative variable takes meaningful numerical values that describe an individual.

Samples and Populations: 

Samples and Populations We usually collect data from a sample of individuals that we will use to make inferences about a larger population. Numerical values computed from a sample are called statistics; values computed from an entire population are called parameters.
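The statistic/parameter distinction can be illustrated with a short Python sketch. The population here is simulated with made-up numbers (it is not course data), and the names `population`, `sample`, `parameter`, and `statistic` are illustrative assumptions.

```python
import random
from statistics import mean

random.seed(311)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 made-up measurements (not real data).
population = [random.gauss(100, 20) for _ in range(1000)]

# A simple random sample of 50 individuals drawn from that population.
sample = random.sample(population, 50)

parameter = mean(population)  # computed from the entire population
statistic = mean(sample)      # computed from the sample only

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The two numbers are usually close but not equal; inference is about using the statistic to say something about the parameter.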

Distributions of Variables: 

Distributions of Variables The distribution of a variable describes the possible values that variable can take, and how often each of those values occurs (their relative frequencies). Example: If I roll a fair 6-sided die, the possible values are the integers 1 through 6, and each occurs 1/6 of the time.
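As a rough illustration of the die example, the sketch below compares the theoretical distribution of a fair die with relative frequencies from simulated rolls. The number of rolls (6,000) and the seed are arbitrary choices made for this sketch.

```python
import random
from collections import Counter
from fractions import Fraction

# Theoretical distribution of a fair 6-sided die: each face has probability 1/6.
theoretical = {face: Fraction(1, 6) for face in range(1, 7)}

random.seed(1)  # fixed seed so the sketch is repeatable
rolls = [random.randint(1, 6) for _ in range(6000)]

# Relative frequency = count of a value / total number of observations.
counts = Counter(rolls)
relative_freq = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face in range(1, 7):
    print(face, round(float(theoretical[face]), 3), round(relative_freq[face], 3))
```

The simulated relative frequencies should each land near 1/6, though they will not match it exactly.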

Distributions of Variables: 

Distributions of Variables

Graphs for Categorical Variables: 

Graphs for Categorical Variables Graphs are good starting points for analyzing data. We use bar graphs and pie charts to display categorical variables.

Graphs for Quantitative Variables: 

Graphs for Quantitative Variables We use stemplots and histograms to display quantitative variables.
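A stemplot can be built by splitting each value into a stem and a leaf. The sketch below is one possible way to do that in Python, assuming two-digit whole numbers with the tens digit as the stem. The function name `stemplot` is illustrative, and the data are the highway mileage values listed on the "Computing the median" slide of this transcript.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a simple stemplot (stem = tens digit, leaf = ones digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

highway = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]
for line in stemplot(highway):
    print(line)
```

Each row reads as a set of observations: the row with stem 3 and leaves 1 and 2 stands for 31 and 32.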

Examining a Distribution: 

Examining a Distribution When examining a distribution, we describe:
- The overall pattern, consisting of:
  - Shape (How many peaks, or modes? Skewed or symmetric?)
  - Center/Location (For now, eyeball the midpoint)
  - Spread/Variability (For now, give the smallest and largest values)
- Deviations from that pattern (outliers)

Numerical Summaries of Quantitative Variables: 

Numerical Summaries of Quantitative Variables In addition to the general descriptions of distributions discussed previously, it is also useful to have specific numerical descriptions of distributional properties. Such numerical measures provide “standards” that are widely recognized, and they also allow for specific comparisons among different distributions or studies. The description of a distribution should include its shape, along with numerical measures of its center and spread.

Measuring Center: 

Measuring Center Measures of center describe the “typical value” of the variable under study. We will consider two different concepts of “typical”: The “average value” of a variable is best described by the mean, which you are familiar with as the arithmetic average. The “middle value” of a variable is best described by the median, which is the center value in a numerically sorted list.

Measuring center: the Mean: 

Measuring center: the Mean The mean, $\bar{x}$, of a set of numbers (also called the sample mean) is found by taking the sum of all the numbers and dividing by the total number of observations in the data set: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Slide22: 

Fuel economy of minicompact cars.

Computing the mean: 

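Computing the mean For city mileage: $\bar{x} = 239/13 \approx 18.4$ mpg. For highway mileage: $\bar{x} = 338/13 = 26.0$ mpg. (Both are computed from the 13 minicompact values listed on the "Computing the median" slide.)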
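The mean calculation can be checked with a few lines of Python. This is a minimal sketch: the function name `sample_mean` is illustrative, and the data are the city and highway mileage values listed on the "Computing the median" slide.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

city = [12, 14, 16, 16, 18, 18, 18, 19, 19, 20, 21, 23, 25]
highway = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]

print(round(sample_mean(city), 2))     # 239 / 13
print(round(sample_mean(highway), 2))  # 338 / 13
```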

Measuring center: the Median: 

Measuring center: the Median The median, M, of a set of numbers is found by first sorting all the numbers. The median is the middle value in the sorted list (if n is odd), or the average of the two middle values (if n is even).

Computing the median: 

Computing the median For city mileage, the sorted values are: 12, 14, 16, 16, 18, 18, 18, 19, 19, 20, 21, 23, 25. With n = 13, the median is the 7th value: M = 18. For highway mileage, the sorted values are: 19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32. Again the median is the 7th value: M = 26.
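The sort-then-pick-the-middle rule can be written out directly. The sketch below is one straightforward way to code it; the function name `median` is illustrative (Python's standard library also offers `statistics.median`).

```python
def median(values):
    """Middle value of the sorted list, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # n odd: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: average the two middle values

city = [12, 14, 16, 16, 18, 18, 18, 19, 19, 20, 21, 23, 25]
highway = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]

print(median(city))     # 7th of 13 sorted values
print(median(highway))  # 7th of 13 sorted values
```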

Computing numerical measures with StatCrunch: 

Computing numerical measures with StatCrunch
1. Data / Load data / from file
2. Follow prompts to select your data file
3. Stat / Summary Stats / Columns
4. Select the column (variable) of interest
5. Calculate (to get a set of numerical measures)

Which measure, mean or median?: 

Which measure, mean or median? The mean and median of (roughly) symmetric distributions are similar. In this case, we prefer to use the mean to measure center. For skewed distributions, or in the presence of outliers, the mean and median can be very different. In this case, we usually prefer to use the median. Why? Because the median is a resistant measure of center: a few extreme observations change it very little, whereas they can pull the mean a long way.
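To see resistance in action, the sketch below replaces the largest highway value with a made-up extreme value and recomputes both measures. The value 320 is a hypothetical outlier invented for this illustration; it does not come from the course data.

```python
from statistics import mean, median

highway = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]

# Hypothetical data-entry error: the largest value, 32, is recorded as 320.
with_outlier = highway[:-1] + [320]

print("mean:  ", mean(highway), "->", round(mean(with_outlier), 2))
print("median:", median(highway), "->", median(with_outlier))
```

The median stays at 26 while the mean is pulled well above every typical value in the list.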

Measuring spread: 

Measuring spread Measures of spread tell us something about the range of values observed in a distribution, or its variability. For roughly symmetric distributions, we will measure spread using the standard deviation. For distributions that are skewed or have outliers, we can use the range or interquartile range (IQR). These are often incorporated into a description called the 5-number summary.

The 5-number summary: 

The 5-number summary The 5-number summary consists of the 5 values:
- Minimum: the smallest observation
- Q1: the first (lower) quartile
- M: the median
- Q3: the third (upper) quartile
- Maximum: the largest observation
Q1 is the median of the observations to the left of the location of the median; Q3 is the median of the observations to the right of the location of the median.

Calculating the 5-number summary: 

Calculating the 5-number summary Minicompact highway mpg: 19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32
- Minimum = 19
- Q1 = median of 19, 22, 23, 23, 23, 26 = 23
- M = 26
- Q3 = median of 27, 28, 29, 29, 31, 32 = 29
- Maximum = 32

Calculating the 5-number summary: 

Calculating the 5-number summary Suppose that the Minicompact highway mpg had been: 19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31 (the same data without the largest value, so n = 12 is even)
- Minimum = 19
- Q1 = median of 19, 22, 23, 23, 23, 26 = 23
- M = (26 + 26) / 2 = 26
- Q3 = median of 26, 27, 28, 29, 29, 31 = 28.5
- Maximum = 31
The range is defined as Maximum - Minimum. The IQR is defined as Q3 - Q1.
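The quartile rule used on these slides (Q1 and Q3 are medians of the observations on either side of the median's location) can be sketched as follows. The function names are illustrative, and the 12-value list is the 13-value highway data without 32, which is what the slide's Q3 = 28.5 and Maximum = 31 imply. Other software may use different quartile conventions and give slightly different answers.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max); assumes at least two observations."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # observations left of the median's location
    upper = ordered[half + n % 2:]  # observations right of it (skips the median when n is odd)
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

highway_13 = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]
highway_12 = highway_13[:-1]  # same data without the largest value

for data in (highway_13, highway_12):
    low, q1, m, q3, high = five_number_summary(data)
    print((low, q1, m, q3, high), "range =", high - low, "IQR =", q3 - q1)
```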

Displaying the 5-number Summary: Boxplots: 

Displaying the 5-number Summary: Boxplots

The IQR and Outlier Identification: 

The IQR and Outlier Identification Recall that the interquartile range, or IQR, is IQR = Q3 - Q1. Values more than 1.5 × IQR above Q3 or 1.5 × IQR below Q1 are suspected outliers.

Slide36: 

Be cautious: the 1.5 × IQR rule works best for symmetric, bell-shaped distributions.
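The 1.5 × IQR rule amounts to computing two cutoffs (often called fences) and flagging anything outside them. The sketch below applies it to the 13 highway mileage values, using Q1 = 23 and Q3 = 29 from the 5-number summary slide; the function name `outlier_fences` is illustrative.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

highway = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]
q1, q3 = 23, 29  # quartiles taken from the 5-number summary slide

low, high = outlier_fences(q1, q3)
suspects = [x for x in highway if x < low or x > high]

print("fences:", low, high)
print("suspected outliers:", suspects)
```

For these data the fences are 14 and 38, so no observation is flagged.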

Calculating the standard deviation, s: 

Calculating the standard deviation, s Whenever there is no strong skewness or outlier in a data set, the preferred measure of spread is the standard deviation, s. $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is called the variance. $s = \sqrt{s^2}$ is the standard deviation.
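The variance and standard deviation can be computed step by step. The sketch below uses the usual sample formulas with n - 1 in the denominator and applies them to the 13 highway mileage values from the earlier slides; the function names are illustrative.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance."""
    return sqrt(sample_variance(values))

highway = [19, 22, 23, 23, 23, 26, 26, 27, 28, 29, 29, 31, 32]

print(round(sample_variance(highway), 2))  # 176 / 12
print(round(sample_sd(highway), 2))
```

Here the mean is 26 and the squared deviations sum to 176, so the variance is 176/12 (about 14.67) and s is about 3.83 mpg.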

Choosing a summary: 

Choosing a summary Because of many useful properties that we will study later in the course, we want to use the mean and standard deviation to describe distributions whenever possible. Therefore, if the distribution is strongly skewed or has outliers, use the 5-number summary; otherwise, use $\bar{x}$ and s.

Bell-Shaped Distributions: 

Bell-Shaped Distributions It is very common for quantitative variables to have distributions that have the appearance of a bell-shaped curve. Later in the course, we will work a lot with a particular bell-shaped curve called the normal distribution.

The Empirical Rule: 

The Empirical Rule For bell-shaped distributions, the empirical rule states that approximately:
- 68% of the values fall within 1 standard deviation of the mean
- 95% of the values fall within 2 standard deviations of the mean
- 99.7% of the values fall within 3 standard deviations of the mean

Slide41: 

[Figure: The empirical rule, with the 68%, 95%, and 99.7% regions marked on a bell-shaped curve]
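The empirical rule can be checked approximately by simulation. The sketch below draws values from a normal distribution, which is one particular bell-shaped curve, and counts how many fall within 1, 2, and 3 standard deviations of the mean. The mean of 100, standard deviation of 20, sample size, and seed are arbitrary assumptions made for this illustration.

```python
import random

random.seed(311)     # fixed seed so the sketch is repeatable
mu, sigma = 100, 20  # hypothetical mean and standard deviation
values = [random.gauss(mu, sigma) for _ in range(100_000)]

def share_within(k):
    """Proportion of simulated values within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in values) / len(values)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.3f}")
```

The printed proportions should be close to 0.68, 0.95, and 0.997, with small differences due to random variation and rounding in the rule itself.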

Standardized z-scores: 

Standardized z-scores We can standardize observations from bell-shaped distributions to allow for easy comparisons: $z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$. Example: What is the z-score of an observed value of 75 if it came from a bell-shaped distribution with mean 100 and standard deviation 20?
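The example on this slide can be worked in one line of arithmetic: subtract the mean from the observed value and divide by the standard deviation. A minimal sketch, with the illustrative function name `z_score`:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(75, mean=100, sd=20)
print(z)  # (75 - 100) / 20 = -1.25
```

A z-score of -1.25 says the observation lies 1.25 standard deviations below the mean.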

The Empirical Rule, restated: 

The Empirical Rule, restated For bell-shaped distributions, the empirical rule states that approximately:
- 68% of the values have z-scores between -1 and +1
- 95% of the values have z-scores between -2 and +2
- 99.7% of the values have z-scores between -3 and +3