Introduction to statistics I :
Sophia King
Rm. P24 HWB
sk219@le.ac.uk

Descriptive Statistics :
Statistical procedures used to summarise, organise, and simplify data. This should be done in a way that reflects the overall findings
Raw data is made more manageable
Raw data is presented in a logical form
Patterns can be seen from organised data
Frequency tables
Graphical techniques
Measures of Central Tendency
Measures of Spread (variability)

Plotting Data: describing spread of data :
A researcher is investigating short-term memory capacity. The number of symbols remembered is recorded for each of 20 participants:
4, 6, 3, 7, 5, 7, 8, 4, 5, 10
10, 6, 8, 9, 3, 5, 6, 4, 11, 6
We can describe our data by using a Frequency Distribution. This can be presented as a table or a graph. It always presents:
The set of categories that made up the original measurement scale
The frequency of each score/category
A distribution has three important characteristics: shape, central tendency, and variability

Frequency Distribution Tables :
Highest score is placed at the top
All observed scores are listed
Gives information about distribution, variability, and centrality
X = score value
f = frequency
fX = total value contributed by that score (frequency × score)
Σf = N
ΣX = ΣfX
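As a rough illustration, the frequency table for the memory scores can be built as below. This is a minimal sketch using Python's standard library; the variable names are illustrative and not part of the original slides. It also checks the two identities Σf = N and ΣX = ΣfX.

```python
from collections import Counter

# Memory scores for the 20 participants (from the example in these notes).
scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

# f for each observed X, listed with the highest score at the top.
freq = Counter(scores)
table = [(x, freq[x], freq[x] * x) for x in sorted(freq, reverse=True)]

print(f"{'X':>3} {'f':>3} {'fX':>4}")
for x, f, fx in table:
    print(f"{x:>3} {f:>3} {fx:>4}")

sum_f = sum(f for _, f, _ in table)     # Σf, should equal N
sum_fx = sum(fx for _, _, fx in table)  # ΣfX, should equal ΣX
print("Σf =", sum_f, " ΣfX =", sum_fx)
```

For these scores Σf is 20 (the number of participants) and ΣfX is 127, the same as adding up the raw scores directly.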

Grouped Frequency Distribution Tables :
Sometimes the spread of data is too wide to list every score
Grouped tables present scores as class intervals
Aim for about 10 intervals
The interval width should be a simple round number (2, 5, 10, etc.), and all intervals should have the same width
The bottom score of each interval should be a multiple of the width
Class intervals treat X as a continuous variable:
E.g. 51 is bounded by real limits of 50.5-51.5
If X is 8 and f is 3, this does not mean that all three scores are exactly the same: each fell somewhere between 7.5 and 8.5
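The grouping rules above can be sketched in code. The memory scores only run from 3 to 11, so they do not really need grouping; the interval width of 2 is an assumption chosen purely to illustrate class intervals and their real limits.

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

width = 2  # assumed interval width, for illustration only

# The bottom score of each interval is a multiple of the width.
counts = Counter((x // width) * width for x in scores)

grouped = []
for low in sorted(counts, reverse=True):        # highest interval at the top
    high = low + width - 1                      # top score of the interval
    real_limits = (low - 0.5, high + 0.5)       # real limits of the interval
    grouped.append((low, high, real_limits, counts[low]))
    print(f"{low:>2}-{high:<2}  real limits {real_limits[0]}-{real_limits[1]}  f = {counts[low]}")
```

With this width the scores fall into five intervals (10-11, 8-9, 6-7, 4-5, 2-3), and the frequencies still add up to 20.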

Percentiles and Percentile Ranks :
X values are raw scores, which mean little without context
Percentile rank = the percentage of the sample with scores at or below the particular value
This can be represented by a cumulative frequency (cf) column
The cumulative percentage is obtained by:
c% = (cf / N) × 100
This gives information about relative position in the data distribution
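A short sketch of the cumulative frequency and cumulative percentage columns for the memory scores follows. The column names (cf, c%) come from the notes; the variable names are illustrative.

```python
from collections import Counter

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]
n = len(scores)
freq = Counter(scores)

# Accumulate frequencies from the lowest score upward.
running = 0
rows = []
for x in sorted(freq):
    running += freq[x]
    rows.append((x, freq[x], running, 100 * running / n))  # c% = (cf / N) × 100

print(f"{'X':>3} {'f':>3} {'cf':>3} {'c%':>7}")
for x, f, cf_x, c_pct in reversed(rows):  # highest score at the top
    print(f"{x:>3} {f:>3} {cf_x:>3} {c_pct:>6.1f}%")
```

For example, a score of 6 has cf = 12, so 60% of the sample scored 6 or below.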

Representing data as graphs :
A frequency distribution graph presents all the information available in a frequency table (it can also be drawn from a grouped frequency table)
Uses Histograms
Bar width corresponds to real limits of intervals
Histograms can be modified so that each individual score is shown as a block

Frequency Distribution Polygons :
Show the same information with lines, tracing the ‘shape’ of the distribution
Both histograms and polygons represent continuous data
For non-numerical data, the frequency distribution can be represented by bar graphs
Bar graphs have spaces between adjacent bars to represent distinct categories

Frequencies of Populations and Samples :
Population
All the individuals of interest to the study
Sample
The particular group of participants you are testing: selected from the population
Graphs of population distributions are possible but, unlike graphs of sample distributions, they cannot normally show exact frequencies. However, you can:
Display graphs of relative frequencies (categorical data)
Use smooth curves to indicate relative frequencies (interval or ratio data)

Frequency Distribution: the Normal Distribution :
Bell-shaped: specific shape that can be defined as an equation
Symmetrical around the midpoint, where the greatest frequency of scores occurs
The tails of the perfect curve are asymptotic: they never quite meet the horizontal axis
Normal distribution is an assumption of parametric testing
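For reference, the equation usually given for the normal curve is shown below. It is a standard result and does not come from the original slides. Here μ is the population mean and σ is the population standard deviation, both of which are used later in these notes.

```latex
f(X) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(X - \mu)^2}{2\sigma^2} \right)
```

The curve peaks at X = μ, is symmetrical about that point, and approaches but never reaches zero as X moves away from μ in either direction.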

Measures of Central Tendency :
A way of summarising the data using a single value that is in some way representative of the entire data set
No single procedure always produces a representative central value: the best choice depends on the shape of the distribution
Mode
Most frequent value
Does not take into account exact scores
Unaffected by extreme scores
Not useful when there are several values that occur equally often in a set
Median
The value that falls exactly at the midpoint of a ranked distribution
Does not take into account exact scores
Unaffected by extreme scores
In a small set it can be unrepresentative
Mean (Arithmetic average)
Sample mean: M = ΣX / n
Population mean: μ = ΣX / N
Takes into account all values
Easily distorted by extreme values
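The difference between the mean and the median when there is an extreme score can be seen in a small experiment. The sketch below uses the memory scores from these notes and a hypothetical variant in which the top score of 11 is replaced by 50; that variant is invented for illustration only.

```python
import statistics

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

# Hypothetical variant: the single score of 11 becomes an extreme score of 50.
with_outlier = [50 if x == 11 else x for x in scores]

for label, data in (("original", scores), ("with extreme score", with_outlier)):
    print(f"{label}: mean = {statistics.mean(data):.2f}, "
          f"median = {statistics.median(data)}")
```

The mean moves from 6.35 to 8.30, while the median stays at 6, because the median depends only on the rank order of the scores.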
For our set of memory scores:
4, 6, 3, 7, 5, 7, 8, 4, 5, 10
10, 6, 8, 9, 3, 5, 6, 4, 11, 6
Mode = 6; Median = 6; Mean = 6.35
The mean is the preferred measure of central tendency, except when:
There are extreme scores or skewed distributions
The data are not measured on an interval scale
The variable is discrete
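The three values quoted above for the memory scores can be checked with Python's built-in statistics module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

mode = statistics.mode(scores)      # most frequent value
median = statistics.median(scores)  # midpoint of the ranked scores
mean = statistics.mean(scores)      # ΣX / n = 127 / 20

print("Mode =", mode, " Median =", median, " Mean =", mean)
```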

Central Tendencies and Distribution Shape :

Describing Variability :
Variability describes, as an exact quantitative measure, how spread out or clustered together the scores are
Variability is usually defined in terms of distance
How far apart scores are from each other
How far apart scores are from the mean
How representative a score is of the data set as a whole

Describing Variability: the Range :
Simplest and most obvious way of describing variability
Range = X(highest) − X(lowest)
The range only takes into account the two extreme scores and ignores any values in between. To counter this, the distribution can be divided into quarters (quartiles): Q1 = 25%, Q2 = 50%, Q3 = 75%
The interquartile range: the distance between the first and third quartiles (Q3 − Q1)
The semi-interquartile range: one half of the interquartile range
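The range and quartile measures can be worked out for the memory scores as sketched below. There are several conventions for locating quartiles. This sketch assumes the simple one in which Q1 and Q3 are the medians of the lower and upper halves of the ranked scores. Other textbooks and software may give slightly different values.

```python
import statistics

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

ranked = sorted(scores)
n = len(ranked)

value_range = ranked[-1] - ranked[0]   # highest score minus lowest score

# Assumed convention: split the ranked scores into two halves (n is even here)
# and take the median of each half as Q1 and Q3.
lower, upper = ranked[: n // 2], ranked[n // 2:]
q1 = statistics.median(lower)
q2 = statistics.median(ranked)
q3 = statistics.median(upper)

iqr = q3 - q1        # interquartile range
semi_iqr = iqr / 2   # semi-interquartile range

print("Range =", value_range)
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3)
print("IQR =", iqr, " Semi-IQR =", semi_iqr)
```

Under this convention the range is 8, the interquartile range is 3.5, and the semi-interquartile range is 1.75.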

Describing Variability: Deviation :
A more sophisticated measure of variability is one that shows how scores cluster around the mean
Deviation is the distance of a score from the mean
X − μ, e.g. 11 − 6.35 = 4.65, 3 − 6.35 = −3.35
A measure representative of the variability of all the scores would be the mean of the deviation scores
Σ(X − μ) / n: add all the deviations and divide by n
However, the deviation scores always add up to zero (the mean serves as the balance point for the scores)
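The reason the deviations always sum to zero follows directly from the definition of the mean. A short derivation, using the same symbols as above (μ for the mean, n for the number of scores):

```latex
\sum (X - \mu) \;=\; \sum X - n\mu \;=\; \sum X - n \cdot \frac{\sum X}{n} \;=\; 0
% Memory scores: \sum X = 127, n = 20, \mu = 6.35, so 127 - 20(6.35) = 0
```

So the mean deviation is zero for every data set and cannot be used as a measure of variability.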

Describing Variability: Variance :
To remove the +/- signs we simply square each deviation before finding the average. This is called the Variance:
Σ(X − μ)² / n = 106.55 / 20 = 5.33
The numerator is referred to as the Sum of Squares (SS), as it is the sum of the squared deviations around the mean
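The calculation above can be reproduced step by step. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

n = len(scores)
mean = sum(scores) / n                       # 127 / 20 = 6.35

# Square each deviation from the mean, then add them up to get SS.
squared_deviations = [(x - mean) ** 2 for x in scores]
ss = sum(squared_deviations)                 # Sum of Squares
variance = ss / n                            # mean squared deviation

print(f"SS = {ss:.2f}, variance = {variance:.2f}")
```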

Describing Variability: Population Variance :
Population variance is designated by σ²
σ² = Σ(X − μ)² / N = SS / N
Sample variance is designated by s²
Samples tend to be less variable than the populations they are drawn from: they therefore give biased estimates of population variability
Degrees of Freedom (df): the number of independent (free to vary) scores. In a sample, the sample mean must be known before the variance can be calculated, so once n − 1 scores are known the final score is fixed: df = n − 1
s² = Σ(X − M)² / (n − 1) = SS / (n − 1) = 106.55 / (20 − 1) = 5.61

Describing Variability: the Standard Deviation :
Variance is a measure based on squared distances
To get back to the original units of measurement, we take the square root of the variance, which gives us the standard deviation
Population (σ) and sample (s) standard deviation:
σ = √( Σ(X − μ)² / N )
s = √( Σ(X − M)² / (n − 1) )
So for our memory score example we simply take the square root of the sample variance:
s = √5.61 = 2.37
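Python's statistics module provides separate functions for the population (divide by N) and sample (divide by n − 1) versions, which makes the distinction easy to check. A brief sketch using the memory scores:

```python
import statistics

scores = [4, 6, 3, 7, 5, 7, 8, 4, 5, 10,
          10, 6, 8, 9, 3, 5, 6, 4, 11, 6]

pop_var = statistics.pvariance(scores)   # SS / N
samp_var = statistics.variance(scores)   # SS / (n - 1)
pop_sd = statistics.pstdev(scores)       # square root of SS / N
samp_sd = statistics.stdev(scores)       # square root of SS / (n - 1)

print(f"Population: variance = {pop_var:.2f}, SD = {pop_sd:.2f}")
print(f"Sample:     variance = {samp_var:.2f}, SD = {samp_sd:.2f}")
```

The sample values (5.61 and 2.37) match the worked example. The population versions are slightly smaller because they divide by 20 rather than 19.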

Describing Variability :
The standard deviation is the most common measure of variability, but the others can be used. A good measure of variability must:
Be stable and reliable: not greatly affected by minor details in the data, such as:
Extreme scores
Multiple sampling from the same population
Open-ended distributions
Both the variance and SD are related to other statistical techniques

Descriptive statistics :
A researcher is investigating short-term memory capacity. The number of symbols remembered is recorded for each of 20 participants:
4, 6, 3, 7, 5, 7, 8, 4, 5, 10
10, 6, 8, 9, 3, 5, 6, 4, 11, 6
What statistics can we display about this data, and what do they mean?
Frequency table: shows how often different scores occur
Frequency graph: information about the shape of the distribution
Measures of central tendency and variability

References and Further Reading :
Gravetter & Wallnau
Chapter 2
Chapter 3
Chapter 4