Statistics for Economics,
Accounting and Business Studies
The Power of Practice
With your purchase of a new copy of this textbook, you received a Student Access Kit for getting
started with statistics using MathXL. Follow the instructions on the card to register successfully
and start making the most of the resources.
Don’t throw it away
MathXL is an online study and testing resource that puts you in control of your study, providing
extensive practice exactly where and when you need it.
MathXL gives you unrivalled resources:
● Sample tests for each chapter to see how much you have learned and where you still need
practice.
● A personalised study plan which constantly adapts to your strengths and weaknesses, taking
you to exercises you can practise over and over, with different variables every time.
● 'Help me solve this' provides guided solutions which break the problem into its component steps
and guide you through with hints.
● Audio animations guide you step-by-step through the key statistical techniques.
● Click on the E-book textbook icon to read the relevant part of your textbook again.
See pages xiv–xv for more details.
To activate your registration go to www.pearsoned.co.uk/barrow and follow the instructions
on-screen to register as a new user.
STFE_A01.qxd 26/02/2009 09:01 Page i

We work with leading authors to develop the strongest
educational materials in Accounting, bringing cutting-edge
thinking and best learning practice to a global market.
Under a range of well-known imprints, including
Financial Times Prentice Hall, we craft high-quality print
and electronic publications which help readers to
understand and apply their content, whether studying
or at work.
To ﬁnd out more about the complete range of our
publishing please visit us on the World Wide Web at:
www.pearsoned.co.uk
Michael Barrow
University of Sussex
Statistics for Economics
Accounting and Business Studies
Fifth Edition
For Patricia, Caroline and Nicolas
Contents
Guided tour of the book xii
Getting started with statistics using MathXL xiv
Preface to the ﬁfth edition xvii
Introduction 1
1 Descriptive statistics 7
Learning outcomes 8
Introduction 8
Summarising data using graphical techniques 10
Looking at cross-section data: wealth in the UK in 2003 16
Summarising data using numerical techniques 24
The box and whiskers diagram 44
Time-series data: investment expenditures 1973–2005 45
Graphing bivariate data: the scatter diagram 58
Data transformations 60
Guidance to the student: how to measure your progress 62
Summary 63
Key terms and concepts 64
Reference 64
Problems 65
Answers to exercises 71
Appendix 1A: Σ notation 75
Problems on Σ notation 76
Appendix 1B: E and V operators 77
Appendix 1C: Using logarithms 78
Problems on logarithms 79
2 Probability 80
Learning outcomes 80
Probability theory and statistical inference 81
The deﬁnition of probability 81
Probability theory: the building blocks 84
Bayes’ theorem 91
Decision analysis 93
Summary 98
Key terms and concepts 98
Problems 99
Answers to exercises 105
3 Probability distributions 108
Learning outcomes 108
Introduction 109
Random variables 110
The Binomial distribution 111
The Normal distribution 117
The sample mean as a Normally distributed variable 125
The relationship between the Binomial and
Normal distributions 131
The Poisson distribution 132
Summary 135
Key terms and concepts 136
Problems 137
Answers to exercises 142
4 Estimation and conﬁdence intervals 144
Learning outcomes 144
Introduction 145
Point and interval estimation 145
Rules and criteria for ﬁnding estimates 146
Estimation with large samples 149
Precisely what is a confidence interval? 153
Estimation with small samples: the t distribution 160
Summary 165
Key terms and concepts 165
Problems 166
Answers to exercises 169
Appendix: Derivations of sampling distributions 170
5 Hypothesis testing 172
Learning outcomes 172
Introduction 173
The concepts of hypothesis testing 173
The Prob-value approach 180
Significance, effect size and power 181
Further hypothesis tests 183
Hypothesis tests with small samples 187
Are the test procedures valid? 189
Hypothesis tests and conﬁdence intervals 190
Independent and dependent samples 191
Discussion of hypothesis testing 194
Summary 195
Key terms and concepts 196
Reference 196
Problems 197
Answers to exercises 201
6 The χ² and F distributions 204
Learning outcomes 204
Introduction 205
The χ² distribution 205
The F distribution 220
Analysis of variance 222
Summary 229
Key terms and concepts 230
Problems 231
Answers to exercises 234
Appendix: Use of χ² and F distribution tables 236
7 Correlation and regression 237
Learning outcomes 237
Introduction 238
What determines the birth rate in developing countries? 238
Correlation 240
Regression analysis 251
Inference in the regression model 257
Summary 271
Key terms and concepts 272
References 272
Problems 273
Answers to exercises 276
8 Multiple regression 279
Learning outcomes 279
Introduction 280
Principles of multiple regression 281
What determines imports into the UK? 282
Finding the right model 300
Summary 307
Key terms and concepts 308
Reference 308
Problems 309
Answers to exercises 313
9 Data collection and sampling methods 318
Learning outcomes 318
Introduction 319
Using secondary data sources 319
Using electronic sources of data 321
Collecting primary data 323
The meaning of random sampling 324
Calculating the required sample size 333
Collecting the sample 335
Case study: the UK Expenditure and Food Survey 338
Summary 339
Key terms and concepts 340
References 340
Problems 341
10 Index numbers 342
Learning outcomes 343
Introduction 343
A simple index number 344
A price index with more than one commodity 345
Using expenditures as weights 353
Quantity and expenditure indices 355
The Retail Price Index 360
Inequality indices 366
The Lorenz curve 367
The Gini coefﬁcient 370
Concentration ratios 374
Summary 376
Key terms and concepts 376
References 376
Problems 377
Answers to exercises 382
Appendix: Deriving the expenditure share form of
the Laspeyres price index 385
11 Seasonal adjustment of time-series data 386
Learning outcomes 386
Introduction 387
The components of a time series 387
Forecasting 399
Further issues 400
Summary 401
Key terms and concepts 401
Problems 402
Answers to exercises 404
Important formulae used in this book 408
Appendix: Tables 412
Table A1 Random number table 412
Table A2 The standard Normal distribution 414
Table A3 Percentage points of the t distribution 415
Table A4 Critical values of the χ² distribution 416
Table A5a Critical values of the F distribution (upper 5% points) 418
Table A5b Critical values of the F distribution (upper 2.5% points) 420
Table A5c Critical values of the F distribution (upper 1% points) 422
Table A5d Critical values of the F distribution (upper 0.5% points) 424
Table A6 Critical values of Spearman's rank correlation coefficient 426
Table A7 Critical values for the Durbin–Watson test at 5% significance level 427
Answers to problems 428
Index 449
Setting the scene
Practising and testing your understanding
3 Probability distributions
Contents
Learning outcomes 108
Introduction 109
Random variables 110
The Binomial distribution 111
The mean and variance of the Binomial distribution 115
The Normal distribution 117
The sample mean as a Normally distributed variable 125
Sampling from a non-Normal population 129
The relationship between the Binomial and Normal distributions 131
Binomial distribution method 131
Normal distribution method 132
The Poisson distribution 132
Summary 135
Key terms and concepts 136
Problems 137
Answers to exercises 142
Learning outcomes
By the end of this chapter you should be able to:
● recognise that the result of most probability experiments (e.g. the score on a die) can be described as a random variable
● appreciate how the behaviour of a random variable can often be summarised by a probability distribution (a mathematical formula)
● recognise the most common probability distributions and be aware of their uses
● solve a range of probability problems using the appropriate probability distribution.
Complete your diagnostic test for Chapter 3 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL with
additional supporting resources.
Introduction
In this chapter the probability concepts introduced in Chapter 2 are generalised
by using the idea of a probability distribution. A probability distribution lists,
in some form, all the possible outcomes of a probability experiment and the
probability associated with each one. For example, the simplest experiment
is tossing a coin, for which the possible outcomes are heads or tails, each with
probability one-half. The probability distribution can be expressed in a variety
of ways: in words, or in a graphical or mathematical form. For tossing a coin the
graphical form is shown in Figure 3.1 and the mathematical form is

Pr(H) = 1/2
Pr(T) = 1/2

The different forms of presentation are equivalent, but one might be more
suited to a particular purpose.
Some probability distributions occur often and so are well known. Because of
this they have names, so we can refer to them easily, for example the Binomial
distribution or the Normal distribution. In fact each constitutes a family of
distributions. A single toss of a coin gives rise to one member of the Binomial
distribution family; two tosses would give rise to another member of that family.
These two distributions differ in the number of tosses. If a biased coin were
tossed, this would lead to yet another Binomial distribution, but it would differ
from the previous two because of the different probability of heads.
Members of the Binomial family of distributions are distinguished either by
the number of tosses or by the probability of the event occurring. These are the
two parameters of the distribution and tell us all we need to know about the
distribution. Other distributions might have different numbers of parameters, with
different meanings. Some distributions, for example, have only one parameter.
We will come across examples of different types of distribution throughout the
rest of this book.
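The idea that the number of tosses n and the probability of heads p are the two parameters of the Binomial family can be sketched in a few lines of Python (an illustration, not the book's own code; the function name is ours):

```python
from math import comb

def binomial_pmf(k, n, p):
    """Pr(k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A single toss of a fair coin: n = 1, p = 0.5.
print(binomial_pmf(1, 1, 0.5))   # 0.5
# Two tosses of a fair coin: another member of the Binomial family.
print(binomial_pmf(1, 2, 0.5))   # 0.5
# Two tosses of a biased coin (p = 0.6): yet another member.
print(binomial_pmf(1, 2, 0.6))
```

Changing either parameter selects a different member of the family, just as described above.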
In order to understand fully the idea of a probability distribution, a new
concept is first introduced: that of a random variable. As will be seen later in the
chapter, an important random variable is the sample mean, and to understand
Figure 3.1 The probability distribution for the toss of a coin
Chapter 4 • Estimation and confidence intervals

−14.05 ≤ μ₁ − μ₂ ≤ −1.95

The estimate is that school 2's average mark is between 1.95 and 14.05 percentage
points above that of school 1. Notice that the confidence interval does
not include the value zero, which would imply equality of the two schools'
marks. Equality of the two schools can thus be ruled out with 95% confidence.
Worked example 4.3
A survey of holidaymakers found that on average women spent 3 hours
per day sunbathing; men spent 2 hours. The sample sizes were 36 in each
case and the standard deviations were 1.1 hours and 1.2 hours respectively.
Estimate the true difference between men and women in sunbathing habits.
Use the 99% confidence level.
The point estimate is simply one hour, the difference of sample means. For
the confidence interval we have

0.30 ≤ μ₁ − μ₂ ≤ 1.70

This evidence suggests women do spend more time sunbathing than men (zero
is not in the confidence interval). Note that we might worry that the samples
might not be independent here – it could represent 36 couples. If so, the
evidence is likely to underestimate the true difference, if anything, as couples
are likely to spend time sunbathing together.
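The worked example's interval can be checked with a short Python sketch (illustrative, not the book's code; it uses the large-sample Normal critical value 2.576 for 99% confidence):

```python
from math import sqrt

# 99% confidence interval for the difference of two means (large samples).
xbar_w, s_w, n_w = 3.0, 1.1, 36   # women: mean hours, s.d., sample size
xbar_m, s_m, n_m = 2.0, 1.2, 36   # men

z = 2.576                                 # Normal critical value, 99% confidence
diff = xbar_w - xbar_m                    # point estimate: 1 hour
se = sqrt(s_w**2 / n_w + s_m**2 / n_m)    # standard error of the difference
lower, upper = diff - z * se, diff + z * se
print(f"{lower:.2f} {upper:.2f}")  # 0.30 1.70
```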
Estimating the difference between two proportions
We move again from means to proportions. We use a simple example to illustrate
the analysis of this type of problem. Suppose that a survey of 80 Britons showed
that 60 owned personal computers. A similar survey of 50 Swedes showed 30
with computers. Are personal computers more widespread in Britain than in Sweden?
Here the aim is to estimate π₁ − π₂, the difference between the two population
proportions, so the probability distribution of p₁ − p₂ is needed, the difference
of the sample proportions. The derivation of this follows similar lines to those
set out above for the difference of two sample means, so is not repeated. The
probability distribution is
p₁ − p₂ ~ N( π₁ − π₂, π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂ )    (4.14)
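As a sketch (not the book's code), a 95% interval for π₁ − π₂ can be computed in Python from this distribution, with the sample proportions standing in for the unknown population proportions in the standard error — a standard large-sample assumption:

```python
from math import sqrt

# Difference of two sample proportions: Britain vs Sweden PC ownership.
p1, n1 = 60 / 80, 80   # Britain: 60 of 80 own a PC
p2, n2 = 30 / 50, 50   # Sweden: 30 of 50 own a PC

# Large-sample standard error of p1 - p2.
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 95% confidence interval for pi1 - pi2 (z = 1.96).
lower = (p1 - p2) - 1.96 * se
upper = (p1 - p2) + 1.96 * se
print(f"{lower:.3f} {upper:.3f}")  # -0.016 0.316
```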
(3 − 2) − 2.57√(1.1²/36 + 1.2²/36) ≤ μ₁ − μ₂ ≤ (3 − 2) + 2.57√(1.1²/36 + 1.2²/36)

(X̄₁ − X̄₂) − 2.57√(s₁²/n₁ + s₂²/n₂) ≤ μ₁ − μ₂ ≤ (X̄₁ − X̄₂) + 2.57√(s₁²/n₁ + s₂²/n₂)

(62 − 70) − 1.96√(18²/60 + 12²/35) ≤ μ₁ − μ₂ ≤ (62 − 70) + 1.96√(18²/60 + 12²/35)

Chapter contents guide
you through the chapter, highlighting key topics and showing you where
to find them.
Learning outcomes
summarise what you
should have learned by
the end of the chapter.
Worked examples break down
statistical techniques step-by-step
and illustrate how to apply an
understanding of statistical
techniques to real life.
Chapter introductions set the scene for
learning and link the chapters together.
Guided tour of the book
Reinforcing your understanding
Chapter 2 • Probability
98
Summary
● The theory of probability forms the basis of statistical inference: the drawing
of inferences on the basis of a random sample of data. The reason for this is
the probability basis of random sampling.
● A convenient definition of the probability of an event is the number of times
the event occurs divided by the number of trials (occasions when the event
could occur).
● For more complex events their probabilities can be calculated by combining
probabilities using the addition and multiplication rules.
● The probability of events A or B occurring is calculated according to the
addition rule.
● The probability of A and B occurring is given by the multiplication rule.
● If A and B are not independent, then Pr(A and B) = Pr(A) × Pr(B|A), where
Pr(B|A) is the probability of B occurring given that A has occurred (the
conditional probability).
● Tree diagrams are a useful technique for enumerating all the possible paths in
a series of probability trials, but for large numbers of trials the huge number of
possibilities makes the technique impractical.
● For experiments with a large number of trials (e.g. obtaining 20 heads in 50
tosses of a coin) the formulae for combinations and permutations can be used.
● The combinatorial formula nCr gives the number of ways of combining r
similar objects among n objects, e.g. the number of orderings of three girls
(and hence implicitly two boys also) in five children.
● The permutation formula nPr gives the number of orderings of r distinct
objects among n, e.g. three named girls among five children.
● Bayes' theorem provides a formula for calculating a conditional probability, e.g.
the probability of someone being a smoker, given they have been diagnosed
with cancer. It forms the basis of Bayesian statistics, allowing us to calculate
the probability of a hypothesis being true, based on the sample evidence and
prior beliefs. Classical statistics disputes this approach.
● Probabilities can also be used as the basis for decision making in conditions of
uncertainty, using as decision criteria expected value maximisation, maximin,
maximax or minimax regret.
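The nCr and nPr examples in these bullets can be checked directly with Python's standard library (a sketch, not part of the book):

```python
from math import comb, perm

# nCr: number of orderings of three girls (and implicitly two boys)
# among five children.
print(comb(5, 3))   # 10

# nPr: number of orderings of three named girls among five children.
print(perm(5, 3))   # 60
```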
Key terms and concepts
addition rule
Bayes' theorem
combinations
complement
compound event
conditional probability
exhaustive
expected value of perfect information
frequentist approach
independent events
maximin
minimax
minimax regret
multiplication rule
mutually exclusive
outcome or event
permutations
probability experiment
probability of an event
sample space
subjective approach
tree diagram
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
2.1 Given a standard pack of cards, calculate the following probabilities:
(a) drawing an ace
(b) drawing a court card (i.e. jack, queen or king)
(c) drawing a red card
(d) drawing three aces without replacement
(e) drawing three aces with replacement.
2.2 The following data give duration of unemployment by age in July 1986. The
duration columns are percentage figures; the last two columns are in thousands.

Age    |  Duration of unemployment (weeks)  | Total  | Economically active
       |   <8    8–26   26–52    >52        | (000s) | (000s)
16–19  |  27.2   29.8    24.0    19.0       | 273.4  | 1270
20–24  |  24.2   20.7    18.3    36.8       | 442.5  | 2000
25–34  |  14.8   18.8    17.2    49.2       | 531.4  | 3600
35–49  |  12.2   16.6    15.1    56.2       | 521.2  | 4900
50–59  |   8.9   14.4    15.6    61.2       | 388.1  | 2560
60+    |  18.5   29.7    30.7    21.4       |  74.8  | 1110

The 'economically active' column gives the total of employed (not shown) plus unemployed
in each age category.
(a) In what sense may these figures be regarded as probabilities? What does the figure
27.2 (top-left cell) mean, following this interpretation?
(b) Assuming the validity of the probability interpretation, which of the following statements
are true?
(i) The probability of an economically active adult aged 25–34, drawn at random,
being unemployed is 531.4/3600.
(ii) If someone who has been unemployed for over one year is drawn at random, the
probability that they are aged 16–19 is 19%.
(iii) For those aged 35–49 who became unemployed before July 1985, the probability
of their still being unemployed is 56.2%.
(iv) If someone aged 50–59 is drawn at random from the economically active population,
the probability of their being unemployed for eight weeks or less is 8.9%.
(v) The probability of someone aged 35–49, drawn at random from the economically
active population, being unemployed for between 8 and 26 weeks is 0.166 ×
521.2/4900.
(c) A person is drawn at random from the population and found to have been unemployed
for over one year. What is the probability that they are aged between 16 and 19?
Problems
STATISTICS IN PRACTICE
Are women better at multi-tasking?
The conventional wisdom is 'yes'. However, the concept of multi-tasking originated
in computing, and in that domain it appears men are more likely to multi-task.
Oxford Internet Surveys (http://www.oii.ox.ac.uk/microsites/oxis/) asked a
sample of 1578 people if they multi-tasked while on-line (e.g. listening to music,
using the phone). 69% of men said they did, compared to 57% of women. Is this
difference statistically significant?
The published survey does not give precise numbers of men and women
respondents for this question, so we will assume equal numbers (the answer is
not very sensitive to this assumption). We therefore have the test statistic
(0.63 is the overall proportion of multi-taskers). The evidence is significant and
clearly suggests this is a genuine difference: men are the multi-taskers!

Exercise 5.6
A survey of 80 voters finds that 65 are in favour of a particular policy. Test the
hypothesis that the true proportion is 50% against the alternative that a majority is
in favour.

Exercise 5.7
A survey of 50 teenage girls found that on average they spent 3.6 hours per week
chatting with friends over the internet. The standard deviation was 1.2 hours. A similar
survey of 90 teenage boys found an average of 3.9 hours, with standard deviation
2.1 hours. Test if there is any difference between boys' and girls' behaviour.

Exercise 5.8
One gambler on horse racing won on 23 of his 75 bets. Another won on 34 out of 95.
Is the second person a better judge of horses, or just luckier?
Hypothesis tests with small samples
As with estimation, slightly different methods have to be employed when the
sample size is small (n < 25) and the population variance is unknown. When
both of these conditions are satisfied, the t distribution must be used rather than
the Normal, so a t test is conducted rather than a z test. This means consulting
tables of the t distribution to obtain the critical value of a test, but otherwise the
methods are similar. These methods will be applied to hypotheses about sample
means only, since they are inappropriate for tests of a sample proportion, as was
the case in estimation.
Testing the sample mean
A large chain of supermarkets sells 5000 packets of cereal in each of its stores
each month. It decides to test-market a different brand of cereal in 15 of its
stores. After a month the 15 stores have sold an average of 5200 packets each
z = (0.69 − 0.57 − 0) / √( 0.63 × (1 − 0.63)/789 + 0.63 × (1 − 0.63)/789 ) = 4.94
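The multi-tasking calculation can be reproduced in a short Python sketch (illustrative, not the book's code; the equal group sizes of 789 follow the text's assumption):

```python
from math import sqrt

# Two-sample z test for the difference of proportions.
p1, n1 = 0.69, 789   # men who multi-task
p2, n2 = 0.57, 789   # women who multi-task

# Pooled overall proportion of multi-taskers (0.63 with equal group sizes).
p_pooled = (p1 * n1 + p2 * n2) / (n1 + n2)

se = sqrt(p_pooled * (1 - p_pooled) / n1 + p_pooled * (1 - p_pooled) / n2)
z = (p1 - p2 - 0) / se   # null hypothesis: no difference
print(f"{z:.2f}")  # 4.94
```

A z value this far above 1.96 is significant at the 5% level, matching the text's conclusion.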
Summarising data using graphical techniques
Figure 1.6 Educational qualifications of the unemployed

contrasted with Figure 1.6, which shows a similar chart for the unemployed (the
second row of Table 1.1).
The 'other qualification' category is a little larger in this case, but the 'no
qualification' group now accounts for 20% of the unemployed, a big increase.
Further, the proportion with a degree approximately halves, from 32% to 15%.

STATISTICS IN PRACTICE
Producing charts using Microsoft Excel
Most of the charts in this book were produced using Excel's charting facility. Without
wishing to dictate a precise style, you should aim for a similar uncluttered
look. Some tips you might find useful are:
● Make the grid lines dashed in a light grey colour (they are not actually part of
the chart, hence should be discreet) or eliminate them altogether.
● Get rid of the background fill (grey by default; alter to 'No fill'). It does not look
great when printed.
● On the x-axis make the labels horizontal or vertical, not slanted – otherwise it is
difficult to see which point they refer to. If they are slanted, double click on the
x-axis, then click the alignment tab.
● Colour charts look great on-screen but are unclear if printed in black and white.
Change the style or type of the lines or markers (e.g. make some dashed) to
distinguish them on paper.
● Both axes start at zero by default. If all your observations are large numbers,
this may result in the data points being crowded into one corner of the graph.
Alter the scale on the axes to fix this: set the minimum value on the axis to be
slightly less than the minimum observation.
Otherwise Excel's default options will usually give a good result.

Exercise 1.1
The following table shows the total numbers (in millions) of tourists visiting each
country and the numbers of English tourists visiting each country:

                  France   Germany   Italy   Spain
All tourists       12.4      3.2      7.5     9.8
English tourists    2.7      0.2      1.0     3.6

(a) Draw a bar chart showing the total numbers visiting each country.
(b) Draw a stacked bar chart which shows English and non-English tourists making
up the total visitors to each country.
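For part (b), the stacked bar's second segment is simply the difference between the two rows; a Python sketch of that data preparation (illustrative only, not part of the exercise answer):

```python
# Non-English visitors = total visitors minus English visitors (millions).
totals  = {"France": 12.4, "Germany": 3.2, "Italy": 7.5, "Spain": 9.8}
english = {"France": 2.7, "Germany": 0.2, "Italy": 1.0, "Spain": 3.6}

non_english = {c: round(totals[c] - english[c], 1) for c in totals}
print(non_english)  # {'France': 9.7, 'Germany': 3.0, 'Italy': 6.5, 'Spain': 6.2}
```

Each country's bar then stacks the English figure on top of the non-English one to reach the total.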
Statistics in practice boxes provide real and interesting applications of statistical
techniques in business practice. They also provide helpful hints on how to use
different software packages, such as Excel, and calculators to solve statistical
problems and help you manipulate data.
Exercises throughout the chapter allow you to stop and check your
understanding of the topic you have just learnt. You can check the
answers at the end of each chapter. Exercises with an icon have
a corresponding exercise in MathXL to practise.
Chapter summaries
recap all the important
topics covered in the
chapter.
Key terms and concepts
are highlighted when
they first appear in the
text and are brought
together at the end of
each chapter.
Problems at the end of each chapter range in difficulty to
provide a more in-depth practice of topics.
Getting started with statistics using MathXL
This fifth edition of Statistics for Economics, Accounting and Business Studies comes with a new computer
package called MathXL, a personalised and innovative online study and testing resource providing
extensive practice questions exactly where you need them most. In addition to the exercises interspersed in the
text, when you see this icon you should log on to this online tool and practise further.
To get started, take out the access kit included inside this book to register online.
Registration and log in
Go to www.pearsoned.co.uk/barrow and follow the
instructions on-screen using the code inside your access
kit which will look like this:
The login screen will look like this:
Now you should be registered with your own password ready to log directly into your own course.
When you log in to your course for the first time the course home page will look like this:
Now follow these steps for the chapter you are studying.
Step 1 Take a sample test
Sample tests (two for each chapter) enable you to test
yourself to see how much you already know about a
particular topic and identify the areas in which you need
more practice. Click on the Study Plan button in the
menu and take Sample test a for the chapter you are
studying. Once you have completed a chapter, go back
and take Sample test b and see how much you have
learned.
Step 2 Review your study plan
The results of the sample tests you have taken will be
incorporated into your study plan, showing you which
sections you have mastered and which sections you
need to study further, helping you make the most
efficient use of your self-study time.
Step 3 Have a go at an exercise
From the study plan, click on the section of the book
you are studying and have a go at the series of interactive
Exercises. When required, use the maths panel
on the left-hand side to select the maths functions you
need. Click on 'more' to see the full range of functions
available. Additional study tools such as 'Help me solve
this' and 'View an example' break the question down
step-by-step, helping you to complete the exercises
successfully. You can try the same exercises over and
over again, and each time the values will change, giving
you unlimited practice.
Step 4 Use the E-book and additional
multimedia tools to help you
If you are struggling with a question, you can click on
the textbook icon to read the relevant part of your
textbook again.
You can also click on the animation icon to help you
visualise and improve your understanding of key
concepts.
Good luck getting started with MathXL.
For an online tour go to www.mathxl.com. For any help and advice, contact the 24-hour online support at
www.mathxl.com and click on 'student support'.
Preface to the ﬁfth edition
This text is aimed at students of economics and the closely related disciplines of
accountancy and business, and provides examples and problems relevant to
those subjects, using real data where possible. The book is at an elementary level
and requires no prior knowledge of statistics, nor advanced mathematics. For
those with a weak mathematical background and in need of some revision,
some recommended texts are given at the end of this preface.
This is not a cookbook of statistical recipes: it covers all the relevant concepts
so that an understanding of why a particular statistical technique should be used
is gained. These concepts are introduced naturally in the course of the text as they
are required, rather than having sections to themselves. The book can form the
basis of a one- or two-term course, depending upon the intensity of the teaching.
As well as explaining statistical concepts and methods, the different schools
of thought about statistical methodology are discussed, giving the reader some
insight into some of the debates that have taken place in the subject. The book
uses the methods of classical statistical analysis, for which some justification is
given in Chapter 5, as well as presenting criticisms that have been made of these
methods.
Changes in this edition
There have been changes to this edition in the light of my own experience and
comments from students and reviewers. The main changes are:
● The chapter on seasonal adjustment, which was dropped from the previous
edition, has been reinstated as Chapter 11. Although it was available on the
web, this was inconvenient, and referees suggested restoring it.
● Where appropriate, the examples used in the text have been updated using
more recent data.
● Accompanying the text is a new website, MathXL, accessed at
www.pearsoned.co.uk/barrow, which will help students to get started with statistics. For this
edition the website contains:
For lecturers
❍ PowerPoint slides for lecturers to use; these contain most of the key tables,
formulae and diagrams but omit the text. Lecturers can adapt these for
their own use.
❍ Answers to even-numbered problems.
❍ An instructor's manual giving hints and guidance on some of the teaching
issues, including those that come up in response to some of the problems.
For students
❍ Sets of interactive exercises with guided solutions, which students may
use to test their learning. The values within the questions are randomised
so the test can be taken several times if desired, and different students
will have different calculations to perform. Answers are provided once the
question has been attempted, and guided solutions are also available.
Mathematics requirements and texts
No more than elementary algebra is assumed in this text, any extensions being
covered as they are needed in the book. It is helpful if students are comfortable
manipulating equations, so if some revision is required I recommend one of
the following books:
I. Jacques, Mathematics for Economics and Business, 5th edn, Prentice Hall, 2009.
G. Renshaw, Maths for Economics, 2nd edn, Oxford University Press, 2008.
Acknowledgements
I would like to thank the anonymous reviewers who made suggestions for this
new edition, and the many colleagues and students who have passed on
comments or pointed out errors or omissions in previous editions. I would like
to thank all those at Pearson Education who have encouraged me, responded to
my various queries and reminded me of impending deadlines. Finally, I would
like to thank my family for giving me encouragement and the time to complete
this new edition.
Pearson Education would like to thank the following reviewers for their
feedback for this new edition:
Andrew Dickerson, University of Sheffield
Robert Watkins, London
Julie Litchfield, University of Sussex
Joel Clovis, University of East Anglia
The publishers are grateful to the following for permission to reproduce
copyright material: Blackwell Publishers for information from the Economic
Journal and the Economic History Review; the Office of National Statistics for
data extracted and adapted from the Statbase database, the General Household
Survey 1991, the Expenditure and Food Survey 2003, Economic Trends and its
Annual Supplement, and the Family Resources Survey 2002–3; HMSO for data from
Inland Revenue Statistics 1981, 1993, 2003, Education and Training Statistics for the
U.K. 2003, Treasury Briefing February 1994 and Employment Gazette February 1995;
Oxford University Press for extracts from World Development Report 1997 by the
World Bank; and Pearson Education for information from Todaro, M. (1992)
Economic Development for a Developing World, 3rd edn.
Although every effort has been made to trace the owners of copyright material,
in a few cases this has proved impossible and the publishers take this opportunity
to apologise to any copyright holders whose rights have been unwittingly
infringed.
Custom publishing
Custom publishing allows academics to pick and choose content from one or more textbooks for their course and combine it into a definitive course text.
Here are some common examples of custom solutions which have helped over 800 courses across Europe:
● different chapters from across our publishing imprints combined into one book
● lecturers’ own material combined with textbook chapters, or published in a separate booklet
● third-party cases and articles that you are keen for your students to read as part of the course
● any combination of the above.
The Pearson Education custom text published for your course is professionally produced and bound – just as you would expect from a normal Pearson Education text. Since many of our titles have online resources accompanying them, we can even build a Custom website that matches your course text.

If you are teaching an introductory statistics course for economics and business students, do you also teach an introductory mathematics course for economics and business students? If you do, you might find chapters from Mathematics for Economics and Business, Sixth Edition, by Ian Jacques useful for your course. If you are teaching a year-long course, you may wish to recommend both texts. Some adopters have found, however, that they require just one or two extra chapters from one text, or would like to select a range of chapters from both texts.

Custom publishing has allowed these adopters to provide access to additional chapters for their students, both online and in print. You can also customise the online resources.

If, once you have had time to review this title, you feel Custom publishing might benefit you and your course, please do get in contact. However minor or major the change, we can help you out. For more details on how to make your chapter selection for your course, please go to:
www.pearsoned.co.uk/barrow
You can contact us at www.pearsoncustom.co.uk or via your local representative at:
www.pearsoned.co.uk/replocator


Introduction
Statistics is a subject which can be, and is, applied to every aspect of our lives. A glance at the annual Guide to Official Statistics, published by the UK Office for National Statistics, for example, gives some idea of the range of material available. Under the letter ‘S’, for example, one finds entries for such disparate subjects as salaries, schools, semolina, shipbuilding, short-time working, spoons and social surveys. It seems clear that, whatever subject you wish to investigate, there are data available to illuminate your study. However, it is a sad fact that many people do not understand the use of statistics, do not know how to draw proper inferences (conclusions) from them, or misrepresent them. Even (especially?) politicians are not immune from this – for example, it sometimes appears they will not be happy until all school pupils and students are above average in ability and achievement.
People’s intuition is often not very good when it comes to statistics – we did not need this ability to evolve. A majority of people will still believe crime is on the increase, even when statistics show unequivocally that it is decreasing. We often take more notice of the single, shocking story than of statistics which count all such events (and find them rare). People also have great difficulty with probability, which is the basis for statistical inference, and hence make erroneous judgements (e.g. of how much it is worth investing to improve safety). Once you have studied statistics, you should be less prone to this kind of error.
Two types of statistics
The subject of statistics can usefully be divided into two parts: descriptive statistics (covered in Chapters 1, 10 and 11 of this book) and inferential statistics (Chapters 4–8), which are based upon the theory of probability (Chapters 2 and 3). Descriptive statistics are used to summarise information which would otherwise be too complex to take in, by means of techniques such as averages and graphs. The graph shown in Figure I.1 is an example, summarising drinking habits in the UK.
Figure I.1 Alcohol consumption in the UK
The graph reveals, for instance, that about 43% of men and 57% of women drink between 1 and 10 units of alcohol per week (a unit is roughly equivalent to one glass of wine or half a pint of beer). The graph also shows that men tend to drink more than women (this is probably not surprising), with higher proportions drinking 11–20 units and over 21 units per week. This simple graph has summarised a vast amount of information, the consumption levels of about 45 million adults.

Even so, it is not perfect and much information is hidden. It is not obvious from the graph that the average consumption of men is 16 units per week, of women only 6 units. From the graph you would probably have expected the averages to be closer together. This shows that graphical and numerical summary measures can complement each other. Graphs can give a very useful visual summary of the information but are not very precise. For example, it is difficult to convey in words the content of a graph: you have to see it. Numerical measures such as the average are more precise and are easier to convey to others. Imagine you had data for student alcohol consumption: how do you think this would compare to the graph? It would be easy to tell someone whether the average is higher or lower, but comparing the graphs is difficult without actually viewing them.
Statistical inference, the second type of statistics covered, concerns the relationship between a sample of data and the population (in the statistical sense, not necessarily human) from which it is drawn. In particular, it asks what inferences can be validly drawn about the population from the sample. Sometimes the sample is not representative of the population (either due to bad sampling procedures or simply due to bad luck) and does not give us a true picture of reality.

The graph was presented as fact, but it is actually based on a sample of individuals, since it would obviously be impossible to ask everyone about their drinking habits. Does it therefore provide a true picture of drinking habits? We can be reasonably confident that it does, for two reasons. First, the government statisticians who collected the data designed the survey carefully, ensuring that all age groups are fairly represented, and did not conduct all the interviews in pubs, for example. Second, the sample is a large one (about 10 000 households), so there is little possibility of getting an unrepresentative sample. It would be very unlucky if the sample consisted entirely of teetotallers, for example. We can be reasonably sure, therefore, that the graph is a fair reflection of reality and that the average woman drinks around 6 units of alcohol per week. However, we must remember that there is some uncertainty about this estimate. Statistical inference provides the tools to measure that uncertainty.
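The effect of sample size on the reliability of an estimate can be illustrated with a short simulation. This is only a sketch: the ‘population’ of drinking habits below is invented purely for illustration, not taken from the survey discussed above.

```python
import random

random.seed(1)

# Hypothetical population of one million adults' weekly alcohol
# consumption (units). The distribution is invented for illustration.
population = [random.expovariate(1 / 6.0) for _ in range(1_000_000)]
population_mean = sum(population) / len(population)

# A survey draws one large sample and uses its mean as the estimate.
sample = random.sample(population, 10_000)
sample_mean = sum(sample) / len(sample)

print(round(population_mean, 2))
print(round(sample_mean, 2))
# With n = 10 000 the two means differ only slightly; repeat with
# n = 50 and the sample mean becomes far less reliable.
```

Re-running the last few lines with a much smaller sample shows why large, carefully designed samples give trustworthy estimates.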
The scatter diagram in Figure I.2 (considered in more detail in Chapter 7) shows the relationship between economic growth and the birth rate in 12 developing countries. It illustrates a negative relationship – higher economic growth appears to be associated with lower birth rates.

Once again, we actually have a sample of data, drawn from the population of all countries. What can we infer from the sample? Is it likely that the ‘true’ relationship (what we would observe if we had all the data) is similar, or do we have an unrepresentative sample? In this case the sample size is quite small and the sampling method is not known, so we might be cautious in our conclusions.
Statistics and you
By the time you have finished this book you will have encountered and, I hope, mastered a range of statistical techniques. However, becoming a competent statistician is about more than learning the techniques; it comes with time and practice. You could go on to learn about the subject at a deeper level and learn some of the many other techniques that are available. However, I believe you can go a long way with the simple methods you learn here, and gain insight into a wide range of problems. A nice example of this is contained in the article ‘Error Correction Models: Specification, Interpretation, Estimation’ by G. Alogoskoufis and R. Smith in the Journal of Economic Surveys (1991, vol. 5, pp. 27–128), examining the relationship between wages, prices and other variables. After 19 pages analysing the data using techniques far more advanced than those presented in this book, they state that ‘the range of statistical techniques utilised have not provided us with anything more than we would have got by taking the . . . variables and looking at their graphs’. Sometimes advanced techniques are needed, but never underestimate the power of the humble graph.

Beyond a technical mastery of the material, being a statistician encompasses a range of more informal skills which you should endeavour to acquire. I hope that you will learn some of these from reading this book. For example, you should be able to spot errors in analyses presented to you, because your statistical ‘intuition’ rings a warning bell telling you something is wrong. For example, the Guardian newspaper on its front page once provided a list of the ‘best’ schools in England, based on the fact that in each school every one of its pupils passed a national exam – a 100% success rate. Curiously, all of the schools were relatively small, so perhaps this implies that small schools achieve better results than large ones? Once you can think statistically you can spot the fallacy in this argument. Try it. The answer is at the end of this introduction.
Here is another example. The UK Department of Health released the following figures about health spending, showing how planned expenditure (in £m) was to increase.

                       1998–99   1999–00   2000–01   2001–02   Total increase over 3-year period
Health spending         37 169    40 228    43 129    45 985    17 835
Figure I.2 Birthrate vs growth rate
The total increase in the final column seems implausibly large, especially when compared to the level of spending. The increase is about 45% of the level. This should set off the warning bell, once you have a ‘feel’ for statistics and perhaps a certain degree of cynicism about politics. The ‘total increase’ is the result of counting the increase from 98–99 to 99–00 three times, the increase from 99–00 to 00–01 twice, plus the increase from 00–01 to 01–02. It therefore measures the cumulative extra resources to health care over the whole period, but not the year-on-year increase, which is what many people would interpret it to be.
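The triple counting is easy to verify with the figures from the table; a minimal sketch:

```python
# Planned health spending (£m) from the table above.
spending = {"1998-99": 37_169, "1999-00": 40_228,
            "2000-01": 43_129, "2001-02": 45_985}

levels = list(spending.values())
increases = [b - a for a, b in zip(levels, levels[1:])]  # year-on-year rises

# The headline figure counts the first rise three times, the second
# twice and the third once: cumulative extra resources over the period.
headline_total = 3 * increases[0] + 2 * increases[1] + 1 * increases[2]

print(increases)                      # [3059, 2901, 2856]
print(headline_total)                 # 17835
print(levels[-1] - levels[0])         # 8816, the simple rise over the period
```

The simple rise of £8816m is roughly half the headline ‘total increase’ of £17 835m, which is exactly the sleight of hand the text describes.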
You will also become aware that data cannot be examined without their context. The context might determine the methods you use to analyse the data, or influence the manner in which the data are collected. For example, the exchange rate and the unemployment rate are two economic variables which behave very differently. The former can change substantially, even on a daily basis, and its movements tend to be unpredictable. Unemployment changes only slowly, and if the level is high this month it is likely to be high again next month. There would be little point in calculating the unemployment rate on a daily basis, yet this makes some sense for the exchange rate. Economic theory tells us quite a lot about these variables, even before we begin to look at the data. We should therefore learn to be guided by an appropriate theory when looking at the data – it will usually be a much more effective way to proceed.

Another useful skill is the ability to present and explain statistical concepts and results to others. If you really understand something, you should be able to explain it to someone else – this is often a good test of your own knowledge. Below are two examples of a verbal explanation of the variance (covered in Chapter 1) to illustrate.
Good explanation
The variance of a set of observations expresses how spread out are the numbers. A low value of the variance indicates that the observations are of similar size; a high value indicates that they are widely spread around the average.

Bad explanation
The variance is a formula for the deviations, which are squared and added up. The differences are from the mean, and divided by n or sometimes by n − 1.
The bad explanation is a failed attempt to explain the formula for the variance and gives no insight into what it really is. The good explanation tries to convey the meaning of the variance without worrying about the formula, which is best written down. For a statistically unsophisticated audience the explanation is quite useful, and might then be supplemented by a few examples.
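For readers who like to see a formula in executable form, here is a small sketch of the variance calculation the verbal explanations describe (the observations are made up):

```python
def variance(xs, sample=False):
    """Average squared deviation from the mean.

    With sample=True, divide by n - 1 rather than n (the sample
    variance, discussed in Chapter 1).
    """
    n = len(xs)
    mean = sum(xs) / n
    squared_devs = [(x - mean) ** 2 for x in xs]
    return sum(squared_devs) / (n - 1 if sample else n)

# Similar-sized observations give a low variance; spread-out ones a high one.
print(variance([10, 11, 9, 10]))   # 0.5
print(variance([1, 19, 2, 18]))    # 72.5
```

Both sets of numbers have the same mean (10), yet very different variances – which is precisely the point the good explanation conveys.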
Statistics can also be written well or badly. Two examples follow, concerning a confidence interval, which is explained in Chapter 4. Do not worry if you do not understand the statistics now.
Good explanation
The 95% confidence interval is given by
X̄ ± 1.96 × √(s²/n)
Inserting the sample values X̄ = 400, s² = 1600 and n = 30 into the formula, we obtain
400 ± 1.96 × √(1600/30)
yielding the interval [385.7, 414.3].

Bad explanation
95% interval = X̄ − 1.96√(s²/n), X̄ + 1.96√(s²/n) = 0.95
400 − 1.96 and 400 + 1.96
so we have 385.7, 414.3

In good statistical writing there is a logical flow to the argument, like a written sentence. It is also concise and precise, without too much extraneous material. The good explanation exhibits these characteristics, whereas the bad explanation is simply wrong and incomprehensible, even though the final answer is correct. You should therefore try to note the way the statistical arguments are laid out in this book, as well as take in their content.

When you do the exercises at the end of each chapter, ask another student to read your work through. If they cannot understand the flow or logic of your work, then you have not succeeded in presenting your work sufficiently accurately.

Answer to the ‘best’ schools problem
A high proportion of small schools appear in the list simply because they are lucky. Consider one school of 20 pupils, another with 1000, where the average ability is similar in both. The large school is highly unlikely to obtain a 100% pass rate, simply because there are so many pupils and at least one of them will probably perform badly. With 20 pupils you have a much better chance of getting them all through. This is just a reflection of the fact that there tends to be greater variability in smaller samples. The schools themselves, and the pupils, are of similar quality.
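The calculation in the good explanation can be checked in a couple of lines; a sketch using the same sample values (mean 400, s² = 1600, n = 30):

```python
import math

def conf_interval_95(mean, s2, n):
    """95% confidence interval for a mean: mean ± 1.96 × sqrt(s²/n)."""
    half_width = 1.96 * math.sqrt(s2 / n)
    return mean - half_width, mean + half_width

lo, hi = conf_interval_95(mean=400, s2=1600, n=30)
print(f"[{lo:.1f}, {hi:.1f}]")  # [385.7, 414.3]
```

The interval matches the one derived above, which is reassuring: the code is simply the formula written out step by step.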

1 Descriptive statistics

Contents
Learning outcomes
Introduction
Summarising data using graphical techniques
Education and employment, or, after all this, will you get a job?
The bar chart
The pie chart
Looking at cross-section data: wealth in the UK in 2003
Frequency tables and histograms
The histogram
Relative frequency and cumulative frequency distributions
Summarising data using numerical techniques
Measures of location: the mean
The mean as the expected value
The sample mean and the population mean
The weighted average
The median
The mode
Measures of dispersion
The variance
The standard deviation
The variance and standard deviation of a sample
Alternative formulae for calculating the variance and standard deviation
The coefficient of variation
Independence of units of measurement
The standard deviation of the logarithm
Measuring deviations from the mean: z scores
Chebyshev’s inequality
Measuring skewness
Comparison of the 2003 and 1979 distributions of wealth
The box and whiskers diagram
Time-series data: investment expenditures 1973–2005
Graphing multiple series
Numerical summary statistics
The mean of a time series
The geometric mean
Another approximate way of obtaining the average growth rate
The variance of a time series
Graphing bivariate data: the scatter diagram
Data transformations
Rounding
Grouping
Dividing/multiplying by a constant
Differencing
Taking logarithms
Taking the reciprocal
Deflating
Guidance to the student: how to measure your progress
Summary
Key terms and concepts
Reference
Problems
Answers to exercises
Appendix 1A: Σ notation
Problems on Σ notation
Appendix 1B: E and V operators
Appendix 1C: Using logarithms
Problems on logarithms
Learning outcomes

By the end of this chapter you should be able to:
● recognise different types of data and use appropriate methods to summarise and analyse them
● use graphical techniques to provide a visual summary of one or more data series
● use numerical techniques, such as an average, to summarise data series
● recognise the strengths and limitations of such methods
● recognise the usefulness of data transformations to gain additional insight into a set of data.
Introduction
The aim of descriptive statistical methods is simple: to present information in a clear, concise and accurate manner. The difficulty in analysing many phenomena, be they economic, social or otherwise, is that there is simply too much information for the mind to assimilate. The task of descriptive methods is therefore to summarise all this information and draw out the main features, without distorting the picture.
Complete your diagnostic test for Chapter 1 now to create your personal study plan. Exercises with an icon are also available for practice in MathXL, with additional supporting resources.

Consider, for example, the problem of presenting information about the wealth of British citizens, which follows later in this chapter. There are about 17 million adults for whom data are available: to present the data in raw form (i.e. the wealth holdings of each and every person) would be neither useful nor informative – it would take about 30 000 pages of a book, for example. It would be more useful to have much less information, but information that was still representative of the original data. In doing this, much of the original information would be deliberately lost; in fact, descriptive statistics might be described as the art of constructively throwing away much of the data!

There are many ways of summarising data and there are few hard and fast rules about how you should proceed. Newspapers and magazines often provide innovative (although not always successful) ways of presenting data. There are, however, a number of techniques that are tried and tested, and these are the subject of this chapter. These are successful because (a) they tell us something useful about the underlying data, and (b) they are reasonably familiar to many people, so we can all talk in a common language. For example, the average tells us about the location of the data and is a familiar concept to most people (my son talks of his day at school being ‘average’, for example).
The appropriate method of analysing the data will depend on a number of factors: the type of data under consideration, the sophistication of the audience and the ‘message’ that it is intended to convey. One would use different methods to persuade academics of the validity of one’s theory about inflation than one would use to persuade consumers that Brand X powder washes whiter than Brand Y. To illustrate the use of the various methods, three different topics are covered in this chapter. First, we look at the relationship between educational attainment and employment prospects. Do higher qualifications improve your employment chances? The data come from people surveyed in 2004/5, so we have a sample of cross-section data, giving a picture of the situation at one point in time. We look at the distribution of educational attainments amongst those surveyed, as well as the relationship to employment outcomes. In this example we simply count the numbers of people in different categories (e.g. the number of people with a degree qualification who are employed).

Second, we examine the distribution of wealth in the UK in 2003. The data are again cross-section, but this time we can use more sophisticated methods, since wealth is measured on a ratio scale. Someone with £200 000 of wealth is twice as wealthy as someone with £100 000, for example, and there is a meaning to this ratio. In the case of education, one cannot say with any precision that one person is twice as educated as another (hence the perennial debate about educational standards). The educational categories may be ordered (so one person can be more educated than another, although even that may be ambiguous) but we cannot measure the ‘distance’ between them. We refer to this as education being measured on an ordinal scale. In contrast, there is not an obvious, natural ordering to the three employment categories (employed, unemployed, inactive), so this is measured on a nominal scale.

Third, we look at national spending on investment over the period 1973 to 2005. This is time-series data, as we have a number of observations on the variable measured at different points in time. Here it is important to take account of the time dimension of the data: things would look different if the observations were in the order 1973, 1983, 1977, . . . rather than in correct time order.
¹ This is now an internet-only publication, available at http://www.dcsf.gov.uk/rsgateway/DB/VOL/v000696/Vweb03-2006V1.pdf
Table 1.1 Economic status and educational qualifications, 2006 (numbers in 000s)

                 Higher education   A levels   Other qualification   No qualification    Total
In work                      8541       5501                10 702               2260   27 004
Unemployed                    232        247                   758                309     1546
Inactive                     1024       1418                  3150               2284     7876
Total                        9797       7166                14 610               4853   36 426
We also look at the relationship between two variables – investment and output – over that period of time and find appropriate methods of presenting it.

In all three cases we make use of both graphical and numerical methods of summarising the data. Although there are some differences between the methods used in the three cases, these are not watertight compartments: the methods used in one case might also be suitable in another, perhaps with slight modification. Part of the skill of the statistician is to know which methods of analysis and presentation are best suited to each particular problem.

Summarising data using graphical techniques

Education and employment, or, after all this, will you get a job?

We begin by looking at a question which should be of interest to you: how does education affect your chances of getting a job? It is now clear that education improves one’s life chances in various ways, one of the possible benefits being that it reduces the chances of being out of work. But by how much does it reduce those chances? We shall use a variety of graphical techniques to explore the question.
The raw data for this investigation come from Education and Training Statistics for the U.K. 2006.¹ Some of these data are presented in Table 1.1 and show the numbers of people by employment status (either in work, unemployed or inactive, i.e. not seeking work) and by educational qualification (higher education, A levels, other qualification or no qualification). The table gives a cross-tabulation of employment status by educational qualification and is simply a count (the frequency) of the number of people falling into each of the 12 cells of the table. For example, there were 8 541 000 people in work who had experience of higher education. This is part of a total of just over 36 million people of working age. Note that the numbers in the table are in thousands, for the sake of clarity.
The bar chart
The first graphical technique we shall use is the bar chart, shown in Figure 1.1. This summarises the educational qualifications of those in work, i.e. the data in the first row of the table. The four educational categories are arranged along the horizontal (x) axis, while the frequencies are measured on the vertical (y) axis. The height of each bar represents the numbers in work for that category.

The biggest group is seen to be those with ‘other qualifications’, although this is now not much bigger than the ‘higher education’ category (the numbers entering higher education have been increasing substantially in the UK over time, although this is not evident in this chart, which uses cross-section data). The ‘no qualifications’ category is the smallest, although it does make up a substantial fraction of those in work.
It would be interesting to compare this distribution with those for the unemployed and inactive. This is done in Figure 1.2, which adds bars for these other two categories. This multiple bar chart shows that, as for the ‘in work’ category, among the inactive and unemployed the largest group consists of those with ‘other’ qualifications (which are typically vocational qualifications). These findings simply reflect the fact that ‘other qualifications’ is the largest category. We can also begin to see whether more education increases your chance of having a job. For example, compare the height of the ‘in work’ bar to the ‘inactive’ bar. It is relatively much higher for those with higher education than for those with no qualifications. In other words, the likelihood of being inactive rather than employed is lower for graduates. However, we are having to make judgements about the relative heights of different bars simply by eye, and it is easy to make a mistake. It would be better if we could draw charts that would better highlight the differences. Figure 1.3 shows an alternative method of presentation: the stacked bar chart. In this case the bars are stacked one on top of another instead of being placed side by side. This is perhaps slightly better
Figure 1.1 Educational qualifications of people in work in the UK, 2006
Note: The height of each bar is determined by the associated frequency. The first bar is 8541 units high, the second is 5501 units high, and so on. The ordering of the bars could be reversed (‘no qualifications’ becoming the first category) without altering the message.
and the different overall sizes of the categories are clearly brought out. However, we are still having to make tricky visual judgements about proportions.

A clearer picture emerges if the data are transformed to column percentages, i.e. the columns are expressed as percentages of the column totals (e.g. the proportion of graduates who are in work, rather than the number). This makes it easier directly to compare the different educational categories. These figures are shown in Table 1.2.

Having done this, it is easier to make a direct comparison of the different education categories (columns). This is shown in Figure 1.4, where all the bars
Figure 1.2 Educational qualifications by employment category
Note: The bars for the unemployed and inactive categories are constructed in the same way as for those in work: the height of the bar is determined by the frequency.

Figure 1.3 Stacked bar chart of educational qualifications and employment status
Note: The overall height of each bar is determined by the sum of the frequencies of the category, given in the final row of Table 1.1.
are of the same height (representing 100%) and the components of each bar now show the proportions of people in each educational category either in work, unemployed or inactive.

It is now clear how economic status differs according to education, and the result is quite dramatic. In particular:
● The probability of unemployment increases rapidly with lower educational attainment (this interprets proportions as probabilities, i.e. if 10% are out of work, then the probability that a person picked at random is unemployed is 10%).
● The biggest difference is between the ‘no qualifications’ category and the other three, which have relatively smaller differences between them. In particular, A levels and other qualifications show a similar pattern.

Notice that we have looked at the data in different ways, drawing different charts for the purpose. You need to consider which type of chart is most suitable for the data you have and the questions you want to ask. There is no one graph that is ideal for all circumstances.
Table 1.2 Economic status and educational qualifications: column percentages

                 Higher education   A levels   Other qualification   No qualification    All
In work                        87         77                    73                 47     74
Unemployed                      2          3                     5                  6      4
Inactive                       10         20                    22                 47     22
Totals                         99        100                   100                100    100

Note: The column percentages are obtained by dividing each frequency by the column total (and multiplying by 100). For example, 87% is 8541 divided by 9797, 77% is 5501 divided by 7166, and so on. Columns may not sum to 100 due to rounding.
Figure 1.4 Percentages in each employment category, by educational qualification
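The column percentages in Table 1.2 can be reproduced directly from the frequencies in Table 1.1; a minimal sketch:

```python
# Frequencies from Table 1.1 (thousands); columns are education categories.
table = {
    "Higher education":    {"In work": 8541,  "Unemployed": 232, "Inactive": 1024},
    "A levels":            {"In work": 5501,  "Unemployed": 247, "Inactive": 1418},
    "Other qualification": {"In work": 10702, "Unemployed": 758, "Inactive": 3150},
    "No qualification":    {"In work": 2260,  "Unemployed": 309, "Inactive": 2284},
}

# Divide each cell by its column total and express as a rounded percentage.
col_pcts = {}
for col, cells in table.items():
    total = sum(cells.values())
    col_pcts[col] = {row: round(100 * f / total) for row, f in cells.items()}

print(col_pcts["Higher education"])  # {'In work': 87, 'Unemployed': 2, 'Inactive': 10}
print(col_pcts["No qualification"])  # {'In work': 47, 'Unemployed': 6, 'Inactive': 47}
```

Because of rounding, a column may not sum to exactly 100, just as the note to Table 1.2 warns.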
Can we safely conclude, therefore, that the probability of your being unemployed is significantly reduced by education? Could we go further and argue that the route to lower unemployment generally is through investment in education? The answer may be ‘yes’ to both questions, but we have not proved it. Two important considerations are as follows:
● Innate ability has been ignored. Those with higher ability are more likely to be employed and are more likely to receive more education. Ideally we would like to compare individuals of similar ability but with different amounts of education.
● Even if additional education does reduce a person’s probability of becoming unemployed, this may be at the expense of someone else, who loses their job to the more educated individual. In other words, additional education does not reduce total unemployment but only shifts it around among the labour force. Of course, it is still rational for individuals to invest in education if they do not take account of this externality.
The pie chart
Another useful way of presenting information graphically is the pie chart which
is particularly good at describing how a variable is distributed between different
categories. For example from Table 1.1 we have the distribution of people by
educational qualiﬁcation the ﬁrst row of the table. This can be shown in a pie
chart as in Figure 1.5.
The area of each slice is proportional to the respective frequency and the
pie chart is an alternative means of presentation to the bar chart shown in
Figure 1.1. The percentages falling into each education category have been
added around the chart but this is not essential. For presentational purposes it
is best not to have too many slices in the chart: beyond about six the chart tends
to look crowded. It might be worth amalgamating less important categories to
make a chart look clearer.
The chart reveals that 40 of those employed fall into the ‘other
qualiﬁcation’ category and that just 8 have no qualiﬁcations. This may be
Figure 1.5 Educational qualifications of those in work
Note: If you have to draw a pie chart by hand, the angle of each slice can be calculated as follows:
angle = (frequency ÷ total frequency) × 360.
The angle of the first slice, for example, is (8541 ÷ 27 004) × 360 = 113.9°.
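Applying the angle formula in the note to the whole ‘in work’ row of Table 1.1 gives every slice of the pie; a quick sketch:

```python
# 'In work' frequencies from Table 1.1 (thousands).
freqs = {"Higher education": 8541, "A levels": 5501,
         "Other qualification": 10702, "No qualification": 2260}

total = sum(freqs.values())  # 27 004

# angle = (frequency / total frequency) * 360
angles = {cat: round(360 * f / total, 1) for cat, f in freqs.items()}

print(angles["Higher education"])  # 113.9, as in the note above
print(sum(angles.values()))        # close to 360
```

The four angles necessarily sum to (approximately, after rounding) 360°, a handy check if you are constructing the chart by hand.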
contrasted with Figure 1.6, which shows a similar chart for the unemployed (the second row of Table 1.1).

The ‘other qualification’ category is a little larger in this case, but the ‘no qualification’ group now accounts for 20% of the unemployed, a big increase. Further, the proportion with a degree approximately halves, from 32% to 15%.

Figure 1.6 Educational qualifications of the unemployed
Producing charts using Microsoft Excel
Most of the charts in this book were produced using Excel’s charting facility. Without wishing to dictate a precise style, you should aim for a similar uncluttered look. Some tips you might find useful are:
● Make the grid lines dashed in a light grey colour (they are not actually part of the chart, hence should be discreet), or eliminate them altogether.
● Get rid of the background fill (grey by default; alter to ‘No fill’). It does not look great when printed.
● On the x-axis make the labels horizontal or vertical, not slanted – otherwise it is difficult to see which point they refer to. If they are slanted, double click on the x-axis, then click the alignment tab.
● Colour charts look great on-screen but are unclear if printed in black and white. Change the style type of the lines or markers (e.g. make some dashed) to distinguish them on paper.
● Both axes start at zero by default. If all your observations are large numbers, this may result in the data points being crowded into one corner of the graph. Alter the scale on the axes to fix this: set the minimum value on the axis to be slightly less than the minimum observation.
Otherwise, Excel’s default options will usually give a good result.
The following table shows the total numbers (in millions) of tourists visiting each
country and the numbers of English tourists visiting each country:

                 France  Germany  Italy  Spain
All tourists      12.4     3.2     7.5    9.8
English tourists   2.7     0.2     1.0    3.6

(a) Draw a bar chart showing the total numbers visiting each country.
(b) Draw a stacked bar chart which shows English and non-English tourists making
up the total visitors to each country.
(c) Draw a pie chart showing the distribution of all tourists between the four
destination countries.
(d) Do the same for English tourists and compare results.
Looking at cross-section data: wealth in the UK in 2003
Frequency tables and histograms
We now move on to examine data in a different form. The data on employment
and education consisted simply of frequencies, where a characteristic (such as
higher education) was either present or absent for a particular individual. We
now look at the distribution of wealth – a variable that can be measured on a
ratio scale, so that a different value is associated with each individual. For
example, one person might have £1000 of wealth, another might have £1 million.
Different presentational techniques will be used to analyse this type of data. We
use these techniques to investigate questions such as how much wealth the
average person has, and whether wealth is evenly distributed or not.
The data are given in Table 1.3, which shows the distribution of wealth in the
UK for the year 2003 (the latest available at the time of writing), available at
http://www.hmrc.gov.uk/stats/personal_wealth/menu.htm. This is an example
of a frequency table. Wealth is difficult to define and to measure; the data shown
here refer to marketable wealth (i.e. items such as the right to a pension, which
cannot be sold, are excluded) and are estimates for the population of adults as
a whole, based on taxation data.
Wealth is divided into 14 class intervals: £0 up to (but not including)
£10 000; £10 000 up to £24 999; and so on. The number, or frequency, of
Table 1.3 The distribution of wealth UK 2003
Class interval £ Numbers thousands
0–9999 2448
10 000–24 999 1823
25 000–39 999 1375
40 000–49 999 480
50 000–59 999 665
60 000–79 999 1315
80 000–99 999 1640
100 000–149 999 2151
150 000–199 999 2215
200 000–299 999 1856
300 000–499 999 1057
500 000–999 999 439
1 000 000–1 999 999 122
2 000 000 or more 50
Total 17 636
Note: It would be impossible to show the wealth of all 18 million individuals so it has been
summarised in this frequency table.
individuals within each class interval is shown. Note that the widths of the
intervals (the class widths) vary up the wealth scale: the first is £10 000, the
second £15 000 (= 25 000 − 10 000), the third £15 000 also, and so on. This will
prove an important factor when it comes to graphical presentation of the data.
This table has been constructed from the original 17 636 000 observations
on individuals’ wealth, so it is already a summary of the original data (note that
all the frequencies have been expressed in thousands in the table) and much of
the original information is lost. The first decision to make, if one had to draw up
such a frequency table from the raw data, is how many class intervals to have
and how wide they should be. It simplifies matters if they are all of the same
width, but in this case that is not feasible: if £10 000 were chosen as the standard
width there would be many intervals between 500 000 and 1 000 000 (50 of them,
in fact), most of which would have a zero or very low frequency. If £100 000
were the standard width, there would be only a few intervals, and the first
(0–100 000) would contain 9746 observations (55% of all observations), so
almost all the interesting detail would be lost. A compromise between these
extremes has to be found.
A useful rule of thumb is that the number of class intervals should equal the
square root of the total frequency, subject to a maximum of about 12 intervals.
Thus, for example, a total of 25 observations should be allocated to five intervals;
100 observations should be grouped into 10 intervals; and 17 636 should
be grouped into about 12 (14 are used here). The class widths should be equal
in so far as this is feasible, but should increase when the frequencies become
very small.
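As a quick sketch, the rule of thumb can be written out in code (the cap of 12 intervals and the rounding choice here are one reasonable reading of the text, not a precise prescription):

```python
import math

def suggested_intervals(total_frequency, max_intervals=12):
    """Rule of thumb: number of class intervals is roughly the square root
    of the total frequency, capped at about 12."""
    return min(round(math.sqrt(total_frequency)), max_intervals)

print(suggested_intervals(25))     # 5
print(suggested_intervals(100))    # 10
print(suggested_intervals(17636))  # 12 (sqrt is about 133, so the cap binds)
```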
To present these data graphically one could draw a bar chart, as in the case of
education above, and this is presented in Figure 1.7. Before reading on, spend
some time looking at it and ask yourself what is wrong with it.

The answer is that the figure gives a completely misleading picture of the
data! (Incidentally, this is the picture that you will get using a spreadsheet
program, as I have done here. All the standard packages appear to do
this, so beware.) One wonders how many decisions have been influenced by data
presented in this incorrect manner.
Figure 1.7
Bar chart of the
distribution of wealth
in the UK 2003
Why is the figure wrong? Consider the following argument. The diagram
appears to show that there are few individuals around £40 000 to £60 000 (the
frequency is at a low of 480 thousand) but many around £150 000. But this is just
the result of the difference in the class width at these points (10 000 at £40 000
and 50 000 at £150 000). Suppose that we divide up the £150 000–£200 000
class into two: £150 000 to £175 000 and £175 000 to £200 000. We divide the
frequency of 2215 equally between the two (this is an arbitrary decision but
illustrates the point). The graph now looks like Figure 1.8.

Comparing Figures 1.7 and 1.8 reveals a difference: the hump around
£150 000 has now disappeared, replaced by a small crater. But this is disturbing –
it means that the shape of the distribution can be altered simply by altering the
class widths. If so, how can we rely upon visual inspection of the distribution?
What does the ‘real’ distribution look like? A better method would make the
shape of the distribution independent of how the class intervals are arranged.
This can be done by drawing a histogram.
The histogram
A histogram is similar to a bar chart except that it corrects for differences in class
widths. If all the class widths are identical then there is no difference between
a bar chart and a histogram. The calculations required to produce the histogram
are shown in Table 1.4.
The new column in the table shows the frequency density, which measures
the frequency per unit of class width. Hence it allows a direct comparison of
different class intervals, i.e. accounting for the difference in class widths.
The frequency density is defined as follows:

frequency density = frequency / class width    (1.1)

Using this formula corrects the figures for differing class widths. Thus 0.2448
(= 2448/10 000) is the first frequency density, 0.1215 (= 1823/15 000) is the second,
Figure 1.8
The wealth distribution
with alternative class
intervals
etc. Above £200 000 the class widths are very large and the frequencies small
(too small to be visible on the histogram), so these classes have been combined.
The width of the final interval is unknown, so it has to be estimated in order
to calculate the frequency density. It is likely to be extremely wide, since the
wealthiest person may well have assets valued at several £m (or even £bn); the
value we assume will affect the calculation of the frequency density and therefore
the shape of the histogram. Fortunately it is in the tail of the distribution
and only affects a small number of observations. Here we assume, arbitrarily, a
width of £3.8m to be a ‘reasonable’ figure, giving an upper class boundary of £4m.
The frequency density is then plotted on the vertical axis against wealth on
the horizontal axis to give the histogram. One further point needs to be made:
the scale on the wealth axis should be linear as far as possible e.g. £50 000
should be twice as far from the origin as £25 000. However it is difﬁcult to ﬁt
all the values onto the horizontal axis without squeezing the graph excessively
at lower levels of wealth where most observations are located. Therefore the
classes above £100 000 have been squeezed and the reader’s attention is drawn
to this. The result is shown in Figure 1.9.
The effect of taking frequency densities is to make the area of each block in
the histogram represent the frequency, rather than the height (which now
shows the density). This has the effect of giving an accurate picture of the shape
of the distribution.

Having done all this, what does the histogram show?

● The histogram is heavily skewed to the right, i.e. the long tail is to the right.
● The modal class interval is £0–£10 000, i.e. it has the greatest density: no other
£10 000 interval has more individuals in it.
● A little under half of all people (45.9% in fact) have less than £80 000 of
marketable wealth.
● About 20% of people have more than £200 000 of wealth.²
Table 1.4 Calculation of frequency densities
Range Number or frequency Class width Frequency density
0– 2448 10 000 0.2448
10 000– 1823 15 000 0.1215
25 000– 1375 15 000 0.0917
40 000– 480 10 000 0.0480
50 000– 665 10 000 0.0665
60 000– 1315 20 000 0.0658
80 000– 1640 20 000 0.0820
100 000– 2151 50 000 0.0430
150 000– 2215 50 000 0.0443
200 000– 3524 3 800 000 0.0009
Note: As an alternative to the frequency density, one could calculate the frequency per
‘standard’ class width, with the standard width chosen to be 10 000 (the narrowest class).
The values in column 4 would then be 2448, 1215.3 (= 1823 ÷ 1.5), 916.7 (= 1375 ÷ 1.5), etc.
This would lead to the same shape of histogram as using the frequency density.
² Due to the compressing of some class widths it is difficult to see this accurately on the
histogram. There are limitations to graphical presentation.
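The densities in Table 1.4 can be reproduced in a few lines of code (frequencies in thousands and class widths in £, as in the table; the final 3 800 000 width is the assumed one discussed above):

```python
# Frequency density = frequency / class width (formula 1.1).
# Frequencies in thousands, class widths in pounds, as in Table 1.4.
frequencies = [2448, 1823, 1375, 480, 665, 1315, 1640, 2151, 2215, 3524]
class_widths = [10_000, 15_000, 15_000, 10_000, 10_000,
                20_000, 20_000, 50_000, 50_000, 3_800_000]

densities = [round(f / w, 4) for f, w in zip(frequencies, class_widths)]
print(densities[:3])  # [0.2448, 0.1215, 0.0917]
print(densities[-1])  # 0.0009 (the combined classes above 200 000)
```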
³ If you are unfamiliar with the Σ notation, then read Appendix 1A to this chapter before
continuing.
The figure shows quite a high degree of inequality in the wealth distribution.
Whether this is acceptable (or even desirable) is a value judgement. It should be
noted that part of the inequality is due to differences in age: younger people
have not yet had enough time to acquire much wealth and therefore appear
worse off, although in life-time terms this may not be the case. To obtain a
better picture of the distribution of wealth would require some analysis of the
acquisition of wealth over the life-cycle, or a comparison of individuals of a similar
age. In fact, correcting for age differences does not make a big difference to the
pattern of wealth distribution (on this point, and on inequality in wealth in
general, see Atkinson (1983), Chapters 7 and 8).
Relative frequency and cumulative frequency distributions
An alternative way of illustrating the wealth distribution uses the relative and
cumulative frequencies of the data. The relative frequencies show the proportion
of observations that fall into each class interval; so, for example, 2.72% of
individuals have wealth holdings between £40 000 and £50 000 (480 000 out
of 17 636 000 individuals). Relative frequencies are shown in the third column
of Table 1.5, using the following formula:³

relative frequency = f / ∑f = frequency / sum of frequencies    (1.2)
Figure 1.9
Histogram of the
distribution of wealth
in the UK 2003
Note: A frequency polygon would be the result if instead of drawing blocks for
the histogram lines were drawn connecting the centres of the top of each block.
The diagram is better drawn with blocks in general.
Table 1.5 Calculation of relative and cumulative frequencies
Range Frequency Relative frequency (%) Cumulative frequency
0– 2448 13.9 2448
10 000– 1823 10.3 4271
25 000– 1375 7.8 5646
40 000– 480 2.7 6126
50 000– 665 3.8 6791
60 000– 1315 7.5 8106
80 000– 1640 9.3 9746
100 000– 2151 12.2 11 897
150 000– 2215 12.6 14 112
200 000– 1856 10.5 15 968
300 000– 1057 6.0 17 025
500 000– 439 2.5 17 464
1 000 000– 122 0.7 17 586
2 000 000– 50 0.3 17 636
Total 17 636 100.00
Note: Relative frequencies are calculated in the same way as the column percentages
in Table 1.2. Thus, for example, 13.9% is 2448 divided by 17 636. Cumulative frequencies
are obtained by cumulating, or successively adding, the frequencies. For example,
4271 is 2448 + 1823; 5646 is 4271 + 1375; etc.
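The third and fourth columns of Table 1.5 can be generated mechanically (frequencies in thousands, taken from the table; `itertools.accumulate` does the successive adding):

```python
from itertools import accumulate

# Frequencies (thousands) for the 14 class intervals of Table 1.5.
frequencies = [2448, 1823, 1375, 480, 665, 1315, 1640,
               2151, 2215, 1856, 1057, 439, 122, 50]
total = sum(frequencies)  # 17 636

# Relative frequencies as percentages (formula 1.2), to 1 decimal place.
relative = [round(100 * f / total, 1) for f in frequencies]

# Cumulative frequencies: successively add the frequencies.
cumulative = list(accumulate(frequencies))

print(relative[:3])    # [13.9, 10.3, 7.8]
print(cumulative[:3])  # [2448, 4271, 5646]
```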
STATISTICS IN PRACTICE
The AIDS epidemic
To show how descriptive statistics can be helpful in presenting information, we
show below the ‘population pyramid’ for Botswana, one of the countries most
seriously affected by AIDS, projected for the year 2020. This is essentially two bar
charts (one for men, one for women) laid on their sides, showing the frequencies
in each age category (rather than wealth categories). The inner pyramid (in the
darker colour) shows the projected population given the existence of AIDS; the
outer pyramid assumes no deaths from AIDS.
Original source of data: US Census Bureau, World Population Profile 2000. Graph adapted from the
UNAIDS web site at http://www.unaids.org/epidemic_update/report/Epi_report.htm
Figure 1.10
The relative frequency
distribution of
wealth in the UK 2003
One can immediately see the huge effect of AIDS, especially on the 40–60 age
group (currently aged 20–40), for both men and women. These people would
normally be in the most productive phase of their lives but, with AIDS, the country
will suffer enormously, with many old and young people dependent on a small
working population. The severity of the future problems is brought out vividly in
this simple graphic, based on the bar chart.
The sum of the relative frequencies has to be 100%, and this acts as a check on
the calculations.

The cumulative frequencies, shown in the fourth column, are obtained by
cumulating (successively adding) the frequencies. The cumulative frequencies
show the total number of individuals with wealth up to a given amount; for
example, about 10 million people have less than £100 000 of wealth.

Both relative and cumulative frequency distributions can be drawn in a similar
way to the histogram. In fact, the relative frequency distribution has exactly
the same shape as the frequency distribution. This is shown in Figure 1.10. This
time we have written the relative frequencies above the appropriate column,
although this is not essential.

The cumulative frequency distribution is shown in Figure 1.11, where the
blocks increase in height as wealth increases. The simplest way to draw this is to
cumulate the frequency densities (shown in the final column of Table 1.4) and
to use these values as the y-axis coordinates.
Figure 1.11
The cumulative
frequency distribution of
wealth in the UK 2003
Note: The y-axis coordinates are obtained by cumulating the frequency densities in Table 1.4
above. For example, the first two y coordinates are 0.2448 and 0.3663 (= 0.2448 + 0.1215).
Worked example 1.1
There is a mass of detail in the sections above so this worked example
is intended to focus on the essential calculations required to produce the
summary graphs. Simple artiﬁcial data are deliberately used to avoid the
distraction of a lengthy interpretation of the results and their meaning. The
data on the variable X and its frequencies f are shown in the following table
with the calculations required:
X Frequency f Relative frequency Cumulative frequency F
10 6 0.17 6
11 8 0.23 14
12 15 0.43 29
13 5 0.14 34
14 1 0.03 35
Total 35 1.00
Notes:
The X values are unique but could be considered the mid-point of a range, as earlier.
The relative frequencies are calculated as 0.17 = 6/35, 0.23 = 8/35, etc.
The cumulative frequencies are calculated as 14 = 6 + 8, 29 = 6 + 8 + 15, etc.
The symbol F usually denotes the cumulative frequency in statistical work.
Exercise 1.2 Given the following data:
Range Frequency
0–10 20
11–30 40
31–60 30
60–100 20
(a) Draw both a bar chart and a histogram of the data and compare them.
(b) Calculate cumulative frequencies and draw a cumulative frequency diagram.
Summarising data using numerical techniques
Graphical methods are an excellent means of obtaining a quick overview of the
data, but they are not particularly precise, nor do they lend themselves to further
analysis. For this we must turn to numerical measures, such as the average.
There are a number of different ways in which we may describe a distribution
such as that for wealth. If we think of trying to describe the histogram it is
useful to have:
● A measure of location, giving an idea of whether people own a lot of wealth
or a little. An example is the average, which gives some idea of where the
distribution is located along the x-axis. In fact, we will encounter three different
measures of the ‘average’:
❍ the mean
❍ the median
❍ the mode.
● A measure of dispersion, showing how wealth is dispersed around (usually)
the average: whether it is concentrated close to the average or is generally far
away from it. An example here is the standard deviation.
● A measure of skewness, showing how symmetric or not the distribution is, i.e.
whether the left half of the distribution is a mirror image of the right half or
not. This is obviously not the case for the wealth distribution.
We consider each type of measure in turn.
Measures of location: the mean
The arithmetic mean, commonly called the average, is the most familiar measure
of location, and is obtained simply by adding all the observations and dividing
by the number of observations. If we denote the wealth of the ith household by
xᵢ (so that the index i runs from 1 to N, where N is the number of observations;
as an example, x₃ would be the wealth of the third household), then the mean is
given by the following formula:

μ = ∑ xᵢ / N, the sum running from i = 1 to N    (1.3)

where μ (the Greek letter mu, pronounced ‘myu’⁴) denotes the mean and ∑ xᵢ
(read ‘sigma x i, from i = 1 to N’, Σ being the Greek capital letter sigma) means
the sum of the x values. We may simplify this to

μ = ∑x / N    (1.4)

when it is obvious which x values are being summed (usually all the available
observations). This latter form is more easily readable, and we will generally use it.

Worked example 1.2

We will find the mean of the values 17, 25, 28, 20, 35. The total of these five
numbers is 125, so we have N = 5 and ∑x = 125. Therefore the mean is

μ = ∑x / N = 125 / 5 = 25

Formula 1.3 can only be used when all the individual x values are known. The
frequency table for wealth does not show all 17 million observations, however,

⁴ Well, mathematicians pronounce it like this, but modern Greeks do not. For them it is ‘mi’.
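Formula 1.4 and worked example 1.2 translate directly into code (a minimal sketch):

```python
# Mean: sum the observations and divide by how many there are (formula 1.4).
def mean(values):
    return sum(values) / len(values)

x = [17, 25, 28, 20, 35]  # the data of worked example 1.2
print(mean(x))  # 25.0
```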
but only the range of values for each class interval and the associated frequency.
In this case of grouped data, the following equivalent formula may be used:

μ = ∑ fᵢxᵢ / ∑ fᵢ, both sums running from i = 1 to C    (1.5)

or, more simply,

μ = ∑fx / ∑f    (1.6)

In this formula:

● x denotes the mid-point of each class interval, since the individual x values are
unknown. The mid-point is used as the representative x value for each class.
In the first class interval, for example, we do not know precisely where each
of the 2448 observations lies. Hence we assume they all lie at the mid-point,
£5000. This will cause a slight inaccuracy: because the distribution is so
skewed, there are more households below the mid-point than above it in every
class interval (except perhaps the first). We ignore this problem here, and it is
less of a problem for most distributions, which are less skewed than this one.
● The summation runs from 1 to C, the number of class intervals (or distinct x
values). f times x gives the total wealth in each class interval. If we sum over
the 14 class intervals, we obtain the total wealth of all individuals.
● ∑fᵢ = N gives the total number of observations, the sum of the individual
frequencies.

The calculation of the mean μ for the wealth data is shown in Table 1.6.
Table 1.6 The calculation of average wealth
Range x f fx
0– 5.0 2448 12 240
10 000– 17.5 1823 31 902
25 000– 32.5 1375 44 687
40 000– 45.0 480 21 600
50 000– 55.0 665 36 575
60 000– 70.0 1315 92 050
80 000– 90.0 1640 147 600
100 000– 125.0 2151 268 875
150 000– 175.0 2215 387 625
200 000– 250.0 1856 464 000
300 000– 400.0 1057 422 800
500 000– 750.0 439 329 250
1 000 000– 1500.0 122 183 000
2 000 000– 3000.0 50 150 000
Total 17 636 2 592 205
Note: The fx column gives the product of the values in the f and x columns, so, for example,
5.0 × 2448 = 12 240, which is the total wealth held by those in the first class interval. The
sum of the fx values gives total wealth.
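Table 1.6 and formula 1.6 can be checked in a few lines (mid-points in £000 and frequencies in thousands, as in the table):

```python
# Grouped-data mean (formula 1.6): sum(f * x) / sum(f).
# Mid-points x in thousands of pounds, frequencies f in thousands (Table 1.6).
midpoints = [5.0, 17.5, 32.5, 45.0, 55.0, 70.0, 90.0,
             125.0, 175.0, 250.0, 400.0, 750.0, 1500.0, 3000.0]
frequencies = [2448, 1823, 1375, 480, 665, 1315, 1640,
               2151, 2215, 1856, 1057, 439, 122, 50]

total_wealth = sum(f * x for f, x in zip(frequencies, midpoints))
average = total_wealth / sum(frequencies)
print(total_wealth)       # 2592205.0
print(round(average, 3))  # 146.984 (i.e. about 147 thousand pounds)
```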
From this we obtain:

μ = 2 592 205 / 17 636 = 146.984

Note that the x values are expressed in £000, so we must remember that the
mean will also be in £000; the average wealth holding is therefore £146 984.
Note that the frequencies have also been divided by 1000, but this has no
effect upon the calculation of the mean, since f appears in both numerator and
denominator of the formula.
The mean tells us that, if the total wealth were divided up equally between all
individuals, each would have £146 984. This value may seem surprising, since
the histogram clearly shows most people have wealth below this point (approximately
65% of individuals are below the mean, in fact). The mean does not
seem to be typical of the wealth that most people have. The reason the mean
has such a high value is that there are some individuals whose wealth is way
above the figure of £146 984 – up into the £millions, in fact. The mean is the
‘balancing point’ of the distribution: if the histogram were a physical model, it
would balance on a fulcrum placed at 146 984. The few very high wealth levels
exert a lot of leverage and counter-balance the more numerous individuals
below the mean.
Worked example 1.3
Suppose we have 10 families with a single television in their homes, 12 families
with two televisions each and 3 families with three. You can probably
work out in your head that there are 43 televisions in total (10 + 24 + 9),
owned by the 25 families (10 + 12 + 3). The average number of televisions per
family is therefore 43/25 = 1.72.

Setting this out formally, we have (as for the wealth distribution, but simpler):

x f fx
1 10 10
2 12 24
3 3 9
Totals 25 43

This gives our resulting mean as 43/25 = 1.72. Note that our data are discrete values
in this case, and we have the actual values, not a broad class interval.
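Worked example 1.3 can be set out in code as the same ∑fx/∑f calculation, with discrete x values:

```python
# Televisions per family: x = number of TVs, f = number of families owning x.
tv_counts = {1: 10, 2: 12, 3: 3}

total_tvs = sum(x * f for x, f in tv_counts.items())  # the sum of fx
total_families = sum(tv_counts.values())              # the sum of f
print(total_tvs, total_families)                      # 43 25
print(total_tvs / total_families)                     # 1.72
```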
The mean as the expected value
We also refer to the mean as the expected value of x, and write:

E(x) = μ = 146 984    (1.7)

E(x) is read ‘E of x’ or ‘the expected value of x’. The mean is the expected value
in the sense that, if we selected a household at random from the population,
we would ‘expect’ its wealth to be £146 984. It is important to note that this
is a statistical expectation, rather than the everyday use of the term. Most of the
random individuals we encounter have wealth substantially below this value.
Most people might therefore ‘expect’ a lower value, because that is their everyday
experience; but statisticians are different – they always expect the mean value.
The expected value notation is particularly useful in keeping track of the
effects upon the mean of certain data transformations (e.g. dividing wealth by
1000 also divides the mean by 1000; Appendix 1B provides a detailed explanation).
Use is also made of the E operator in inferential statistics, to describe the
properties of estimators (see Chapter 4).
The sample mean and the population mean
Very often we have only a sample of data, as in the worked example above, and
it is important to distinguish this case from the one where we have all the possible
observations. For this reason, the sample mean is given by:

X̄ = ∑x / n,  or  X̄ = ∑fx / ∑f for grouped data    (1.8)

Note the distinctions between μ, the population mean, and X̄, the sample
mean, and between N, the size of the population, and n, the sample size.
Otherwise the calculations are identical. It is a convention to use Greek letters,
such as μ, to refer to the population and Roman letters, such as X̄, to refer to
a sample.
The weighted average
Sometimes observations have to be given different weightings in calculating the
average, as in the following example. Consider the problem of calculating the average
spending per pupil by an education authority. Some figures for spending
on primary (ages 5 to 11), secondary (11 to 16) and post-16 pupils are given in
Table 1.7.

Clearly, significantly more is spent on secondary and post-16 pupils (a general
pattern throughout England and most other countries) and the overall average
should lie somewhere between 1750 and 3820. However, taking a simple
average of these values would give the wrong answer, because there are different
numbers of children in the three age ranges. The numbers and proportions of
children in each age group are given in Table 1.8.
Table 1.7 Cost per pupil in different types of school £ p.a.
Primary Secondary Post-16
Unit cost 1750 3100 3820
Table 1.8 Numbers and proportions of pupils in each age range
Primary Secondary Post-16 Total
Numbers 8000 7000 3000 18 000
Proportion 44% 39% 17% 100%
STATISTICS IN PRACTICE
As there are relatively more primary school children than secondary, and
relatively fewer post-16 pupils, the primary unit cost should be given greatest
weight in the averaging process and the post-16 unit cost the least. The weighted
average is obtained by multiplying each unit cost figure by the proportion of
children in each category and summing. The weighted average is therefore:

0.44 × 1750 + 0.39 × 3100 + 0.17 × 3820 = 2628    (1.9)

The weighted average gives an answer closer to the primary unit cost than
does the simple average of the three figures (2890 in this case), which would be
misleading. The formula for the weighted average is:
X̄w = ∑ wᵢxᵢ    (1.10)

where w represents the weights, which must sum to one, i.e.

∑ wᵢ = 1    (1.11)

and x represents the unit cost figures.
Notice that what we have done is equivalent to multiplying each unit cost
by its frequency (8000, etc.) and then dividing the sum by the grand total of
18 000. This is the same as the procedure we used for the wealth calculation.
The difference with weights is that we first divide 8000 by 18 000, 7000 by
18 000, etc. to obtain the weights (which must then sum to one) and use these
weights in formula 1.10.
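The weighted-average calculation of formula 1.10 can be sketched as follows (unit costs from Table 1.7, pupil numbers from Table 1.8; the weights here are derived exactly rather than pre-rounded, so the result differs slightly from the 2628 in the text, which uses weights rounded to 0.44, 0.39 and 0.17):

```python
# Weighted average (formula 1.10): sum of w_i * x_i, weights summing to 1.
unit_costs = [1750, 3100, 3820]  # primary, secondary, post-16 (Table 1.7)
pupils = [8000, 7000, 3000]      # numbers in each age range (Table 1.8)

weights = [n / sum(pupils) for n in pupils]
assert abs(sum(weights) - 1) < 1e-12  # weights must sum to one (formula 1.11)

weighted_avg = sum(w * x for w, x in zip(weights, unit_costs))
print(round(weighted_avg, 2))  # 2620.0 (2628 in the text, with rounded weights)
```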
Calculating your degree result
If you are a university student, your final degree result will probably be calculated
as a weighted average of your marks on the individual courses. The weights may
be based on the credits associated with each course or on some other factors. For
example, in my university the average mark for a year is a weighted average of the
marks on each course, the weights being the credit values of each course.

The grand mean G, on which classification is based, is then a weighted average
of the averages for the different years, as follows:

G = (0 × Year 1 + 40 × Year 2 + 60 × Year 3) / 100

i.e. the year 3 mark has a weight of 60%, year 2 is weighted 40% and the first year
is not counted at all.

For students taking a year abroad, the formula is slightly different:

G = (0 × Year 1 + 40 × Year 2 + 25 × Year abroad + 60 × Year 3) / 125

Note that, to accommodate the year abroad mark, the weights on years 2 and
3 are reduced to 40/125 = 32% and 60/125 = 48% respectively.

The median

Returning to the study of wealth, the unrepresentative result for the mean suggests
that we may prefer a measure of location which is not so strongly affected
by outliers (extreme observations) and skewness.
The median is a measure of location which is more robust to such extreme
values; it may be defined by the following procedure. Imagine everyone in a line
from poorest to wealthiest. Go to the individual located halfway along the
line and ask what their wealth is. Their answer is the median. The median is clearly
unaffected by extreme values, unlike the mean: if the wealth of the richest person
were doubled (with no reduction in anyone else’s wealth), there would be no
effect upon the median. The calculation of the median is not so straightforward
as for the mean, especially for grouped data. The following worked example
shows how to calculate the median for ungrouped data.
Worked example 1.4 The median
Calculate the median of the following values: 45, 12, 33, 80, 77.

First, we put them into ascending order: 12, 33, 45, 77, 80. It is then easy to see
that the middle value is 45. This is the median. Note that if the value
of the largest observation changes to, say, 150, the value of the median is
unchanged. This is not the case for the mean, which would change from 49.4
to 63.4.

If there is an even number of observations, then there is no middle observation.
The solution is to take the average of the two middle observations. For example:

Find the median of 12, 33, 45, 63, 77, 80.

Note the new observation, 63, making six observations. The median value is
halfway between the third and fourth observations, i.e. (45 + 63)/2 = 54.
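The procedure of worked example 1.4 can be sketched as follows (sort, then take the middle value, or the average of the middle two; Python's built-in `statistics.median` does the same job):

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average middle two

print(median([45, 12, 33, 80, 77]))      # 45
print(median([12, 33, 45, 63, 77, 80]))  # 54.0
```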
For grouped data there are two stages to the calculation: first we must
identify the class interval which contains the median person, then we must
calculate where in that interval the person lies.

1 To find the appropriate class interval: since there are 17 636 000 observations,
we need the wealth of the person who is 8 818 000 in rank order. The
table of cumulative frequencies (see Table 1.5 above) is the most suitable
for this. There are 8 106 000 individuals with wealth of less than £80 000,
and 9 746 000 with wealth of less than £100 000. The middle person therefore
falls into the £80 000–100 000 class. Furthermore, given that 8 818 000
falls roughly half way between 8 106 000 and 9 746 000, it follows that the
median is close to the middle of the class interval. We now go on to make
this statement more precise.
2 To find the position in the class interval, we can now use formula 1.12:

median = x_L + (x_U − x_L) × ( (N + 1)/2 − F ) / f    (1.12)

where
x_L = the lower limit of the class interval containing the median
x_U = the upper limit of this class interval
N = the number of observations (using N + 1 rather than N in the formula
is only important when N is relatively small)
STATISTICS IN PRACTICE
F = the cumulative frequency of the class intervals up to (but not including) the one containing the median
f = the frequency for the class interval containing the median.
For the wealth distribution we have

median = 80 000 + (100 000 − 80 000) × (17 636 000/2 − 8 106 000)/1 640 000 = £90 829
This alternative measure of location gives a very different impression: it is less
than two-thirds of the mean. Nevertheless it is equally valid despite having a
different meaning. It demonstrates that the person ‘in the middle’ has wealth of
£90 829 and in this sense is typical of the UK population. Before going on to
compare these measures further we examine a third: the mode.
Generalising the median – quantiles
The idea of the median as the middle of the distribution can be extended: quartiles divide the distribution into four equal parts, quintiles into five, deciles into ten and, finally, percentiles divide the distribution into 100 equal parts. Generically they are known as quantiles. We shall illustrate the idea by examining deciles (quartiles are covered below).
The first decile occurs one-tenth of the way along the line of people ranked from poorest to wealthiest. This means we require the wealth of the person ranked 1 763 600 (= N/10) in the distribution. From the table of cumulative frequencies this person lies in the first class interval. Adapting formula 1.12, we obtain

first decile = 0 + (10 000 − 0) × (1 763 600 − 0)/2 448 000 = £7203

Thus we estimate that any household with less than £7203 of wealth falls into the bottom 10% of the wealth distribution. In a similar fashion the ninth decile can be found by calculating the wealth of the household ranked 15 872 400 (= N × 9/10) in the distribution.
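The interpolation in formula 1.12 generalises directly to any quantile, which makes it easy to automate. Below is a minimal sketch; the class limits and frequencies are invented for illustration and are not the wealth data (and N rather than N + 1 is used, which matters little for large N):

```python
def grouped_quantile(classes, p):
    """classes: list of (lower, upper, frequency) in ascending order.
    Returns the value at proportion p by locating the class that
    contains the required rank and interpolating within it."""
    n = sum(f for _, _, f in classes)
    rank = p * n              # position of the required observation
    cum = 0                   # cumulative frequency F so far
    for lower, upper, f in classes:
        if cum + f >= rank:
            return lower + (upper - lower) * (rank - cum) / f
        cum += f
    raise ValueError("p out of range")

# illustrative data: 10 observations split across two classes
example = [(0, 10, 4), (10, 20, 6)]
print(grouped_quantile(example, 0.5))   # median: 10 + 10*(5 - 4)/6
print(grouped_quantile(example, 0.1))   # first decile: 0 + 10*(1 - 0)/4 = 2.5
```

Setting p = 0.5 gives the median, p = 0.1 the first decile, p = 0.9 the ninth decile, and so on.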
The mode
The mode is deﬁned as that level of wealth which occurs with the greatest
frequency in other words the value that occurs most often. It is most useful and
easiest to calculate when one has all the data and there are relatively few distinct
observations. This is the case in the simple example below.
Suppose we have the following data on sales of dresses by a shop, according to size:

Size    Sales
8       3
10      25
12      36
14      11
16      3
18      1
5
6
7
1 763 600 − 0
2 448 000
1
2
3
5
4
6
4
7
17 636 000
− 8 106 000
2
1 640 000
1
4
2
4
3
Chapter 1 • Descriptive statistics
The modal size is 12. There are more women buying dresses of this size than
any other. This may be the most useful form of average as far as the shop is
concerned. Although it needs to stock a range of sizes it knows it needs to
order more dresses in size 12 than in any other size. The mean would not be so helpful in this case (it is X̄ = 11.7) as it is not an actual dress size.
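With the raw frequency table, the mode can be picked out directly; a short sketch using the dress-sales data from the example:

```python
# sales by dress size, as in the example in the text
sales = {8: 3, 10: 25, 12: 36, 14: 11, 16: 3, 18: 1}

# the mode is the size with the highest frequency
mode = max(sales, key=sales.get)
print(mode)  # 12

# the mean dress size, by contrast, is not an actual size
mean = sum(size * f for size, f in sales.items()) / sum(sales.values())
print(round(mean, 1))  # 11.7
```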
In the case of grouped data matters are more complicated. It is the modal class interval which is required, once the intervals have been corrected for width (otherwise a wider class interval is unfairly compared with a narrower one). For
this we can again make use of the frequency densities. From Table 1.4 it can be
seen that it is the ﬁrst interval from £0 to £10 000 which has the highest
frequency density. It is ‘typical’ of the distribution because it is the one which
occurs most often using the frequency densities not frequencies. The wealth
distribution is most concentrated at this level and more people are like this in
terms of wealth than anything else. Once again it is notable how different it is
from both the median and the mean.
The three measures of location give different messages because of the skewness
of the distribution: if it were symmetric they would all give approximately
the same answer. Here we have a rather extreme case of skewness but it does
serve to illustrate how the different measures of location compare. When the distribution is skewed to the right, as here, they will be in the order mode, median, mean; if skewed to the left, the ordering is reversed. If the distribution has more than one peak then this rule for orderings may not apply.
Which of the measures is ‘correct’ or most useful? In this particular case the mean is not very useful: it is heavily influenced by extreme values. The median
is therefore often used when discussing wealth and income distributions.
Where inequality is even more pronounced as in some less developed countries
then the mean is even less informative. The mode is also quite useful in telling
us about a large section of the population although it can be sensitive to how
the class intervals are arranged. If the data were arranged such that there was
a class interval of £5000 to £15 000 then this might well be the modal class
conveying a slightly different impression.
The three different measures of location are marked on the histogram in
Figure 1.12. This brings out the substantial difference between the measures for
a skewed distribution such as for wealth.
Exercise 1.3
(a) For the data in Exercise 2, calculate the mean, median and mode of the data.
(b) Mark these values on the histogram you drew for Exercise 2.

Measures of dispersion
Two different distributions (e.g. wealth in two different countries) might have the same mean yet look very different, as shown in Figure 1.13 (the distributions have been drawn using smooth curves rather than bars to improve clarity). In one country everyone might have a similar level of wealth (curve B). In another, although the average is the same, there might be extremes of great wealth and poverty (curve A). A measure of dispersion is a number which allows us to distinguish between these two situations.

The simplest measure of dispersion is the range, which is the difference between the smallest and largest observations. It is impossible to calculate accurately from the table of wealth holdings since the largest observation is not available. In any case it is not a very useful figure, since it relies on two extreme values and ignores the rest of the distribution. In simpler cases it might be more informative. For example, in an exam the marks may range from a low of 28 to a high of 74. In this case the range is 74 − 28 = 46, and this tells us something useful.
An improvement is the inter-quartile range (IQR), which is the difference between the first and third quartiles. It therefore defines the limits of wealth of the middle half of the distribution and ignores the very extremes of the distribution.
Figure 1.12 The histogram with mean, median and mode marked

Figure 1.13 Two distributions with different degrees of dispersion
Note: Distribution A has a greater degree of dispersion than B, where everyone has a similar level of wealth.
To calculate the first quartile, which we label Q1, we have to go one-quarter of the way along the line of wealth holders, ranked from poorest to wealthiest, and ask the person in that position what their wealth is. Their answer is the first quartile. The calculation is as follows:
● one-quarter of 17 636 is 4409
● the person ranked 4409 is in the £25 000–40 000 class
● adapting formula 1.12:

Q1 = 25 000 + (40 000 − 25 000) × (4409 − 4271)/1375 = 26 505.5   (1.13)
The third quartile is calculated in similar fashion:
● three-quarters of 17 636 is 13 227
● the person ranked 13 227 is in the £150 000–200 000 class
● again using formula 1.12:

Q3 = 150 000 + (200 000 − 150 000) × (13 227 − 11 897)/2215 = 180 022.6

and therefore the inter-quartile range is Q3 − Q1 = 180 022 − 26 505 = 153 517. This might be reasonably rounded to £150 000, given the approximations in our calculation, and is a much more memorable figure.
This gives one summary measure of the dispersion of the distribution: the higher the value, the more spread-out is the distribution. Two different wealth distributions might therefore be compared according to their inter-quartile ranges, with the country having the larger figure exhibiting greater inequality. Note that the figures would have to be expressed in a common unit of currency for this comparison to be valid.
Worked example 1.5 The range and inter-quartile range
Suppose 110 children take a test with the following results:

Mark X    Frequency f    Cumulative frequency F
13        5              5
14        13             18
15        29             47
16        33             80
17        17             97
18        8              105
19        4              109
20        1              110
Total     110
The range is simply 20 − 13 = 7. The inter-quartile range requires calculation of the quartiles. Q1 is given by the value of the 27.5th observation (110/4 = 27.5), which is 15. Q3 is the value of the 82.5th observation (110 × 0.75 = 82.5), which is 17. The IQR is therefore 17 − 15 = 2 marks. Half the students achieve marks within this range.
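The rank lookup used here can be written as a short routine; a sketch using the marks data from this example:

```python
def value_at_rank(freq_table, rank):
    """Return the value of the observation at the given rank,
    walking up the cumulative frequencies."""
    cum = 0
    for value, f in freq_table:
        cum += f
        if cum >= rank:
            return value
    raise ValueError("rank beyond the data")

# marks of 110 children: (mark, frequency), in ascending order
marks = [(13, 5), (14, 13), (15, 29), (16, 33),
         (17, 17), (18, 8), (19, 4), (20, 1)]

n = sum(f for _, f in marks)
q1 = value_at_rank(marks, n / 4)       # 27.5th observation -> 15
q3 = value_at_rank(marks, n * 0.75)    # 82.5th observation -> 17
print(q3 - q1)                         # IQR = 2 marks
print(marks[-1][0] - marks[0][0])      # range = 7
```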
Notice that a slight change in the data (three more students getting 16 rather than 17 marks) would alter the IQR to 1 mark (16 − 15). The result should be treated with some caution, therefore. This is a common problem when there are few distinct values of the variable (eight in this example). It is often worth considering whether a few small changes to the data could alter the calculation considerably. In such a case the original result might not be very robust.
The variance
A more useful measure of dispersion is the variance, which makes use of all of the information available, rather than trimming the extremes of the distribution. The variance is denoted by the symbol σ². σ is the Greek lower-case letter sigma, so σ² is read ‘sigma squared’. It has a completely different meaning from ∑ (capital sigma) used before. Its formula is

σ² = ∑(x − μ)²/N   (1.14)

In this formula, (x − μ) measures the distance from each observation to the mean. Squaring these makes all the deviations positive, whether above or below the mean. We then take the average of all the squared deviations from the mean. A more dispersed distribution, such as A in Figure 1.13, will tend to have larger deviations from the mean and hence a larger variance. In comparing two distributions with similar means, therefore, we could examine their variances to see which of the two has the greater degree of dispersion. With grouped data the formula becomes

σ² = ∑f(x − μ)²/∑f   (1.15)
The calculation of the variance is shown in Table 1.9, and from this we obtain

σ² = 1 001 772 261.83/17 636 = 56 802.69

This calculated value is before translating back into the original units of measurement, as was done for the mean by multiplying by 1000. In the case of the variance, however, we must multiply by 1 000 000, which is the square of 1000. The variance is therefore 56 802 690 000. Multiplying by the square of 1000 is a consequence of using squared deviations in the variance formula (see Appendix 1B on E and V operators for more details of this).
One needs to be a little careful about the units of measurement, therefore. If the mean is reported as 146.984 then it is appropriate to report the variance as 56 802.69. If the mean is reported as 146 984 then the variance should be reported as 56 802 690 000. Note that it is only the presentation that changes: the underlying facts are the same.
The standard deviation
In what units is the variance measured? As we have used a squaring procedure in the calculation, we end up with something like ‘squared £s’, which is not very
convenient. Because of this we define the square root of the variance to be the standard deviation, which is therefore back in £s. The standard deviation is therefore given by

σ = √(∑(x − μ)²/N)   (1.16)

or for grouped data

σ = √(∑f(x − μ)²/∑f)   (1.17)

These are simply the square roots of equations 1.14 and 1.15. The standard deviation of wealth is therefore √56 802.69 = 238.333. This is in £000, so the standard deviation is actually £238 333 (note that this is the square root of 56 802 690 000, as it should be). On its own the standard deviation (and the variance) is not easy to interpret, since it is not something we have an intuitive feel for, unlike the mean. It is more useful when used in a comparative setting. This will be illustrated later on.
The variance and standard deviation of a sample
As with the mean, a different symbol is used to distinguish a variance calculated from the population and one calculated from a sample. In addition, the sample variance is calculated using a slightly different formula from the one for the population variance. The sample variance is denoted by s² and its formula is given by equations 1.18 and 1.19 below:

s² = ∑(x − X̄)²/(n − 1)   (1.18)
Table 1.9 The calculation of the variance of wealth

Range         Mid-point x (£000)   Frequency f   Deviation (x − μ)   (x − μ)²        f(x − μ)²
0–            5.0                  2448          −142.0              20 159.38       49 350 158.77
10 000–       17.5                 1823          −129.5              16 766.04       30 564 482.57
25 000–       32.5                 1375          −114.5              13 106.52       18 021 469.99
40 000–       45.0                 480           −102.0              10 400.68       4 992 326.62
50 000–       55.0                 665           −92.0               8461.01         5 626 568.95
60 000–       70.0                 1315          −77.0               5926.49         7 793 339.80
80 000–       90.0                 1640          −57.0               3247.15         5 325 317.93
100 000–      125.0                2151          −22.0               483.28          1 039 544.38
150 000–      175.0                2215          28.0                784.91          1 738 579.16
200 000–      250.0                1856          103.0               10 612.35       19 696 526.45
300 000–      400.0                1057          253.0               64 017.23       67 666 217.05
500 000–      750.0                439           603.0               363 628.63      159 632 966.88
1 000 000–    1500.0               122           1353.0              1 830 653.04    223 339 670.45
2 000 000–    3000.0               50            2853.0              8 139 701.86    406 985 092.85
Total                              17 636                                            1 001 772 261.83
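The whole of Table 1.9 can be reproduced in a few lines of code. The sketch below (with the mid-points and frequencies transcribed from the table; wealth in £000, frequencies in thousands) applies the grouped formulae for the mean and for the variance (formula 1.15):

```python
# mid-points x (wealth, £000) and frequencies f (thousands), from Table 1.9
x = [5.0, 17.5, 32.5, 45.0, 55.0, 70.0, 90.0,
     125.0, 175.0, 250.0, 400.0, 750.0, 1500.0, 3000.0]
f = [2448, 1823, 1375, 480, 665, 1315, 1640,
     2151, 2215, 1856, 1057, 439, 122, 50]

n = sum(f)                                                     # 17 636
mean = sum(fi * xi for fi, xi in zip(f, x)) / n                # ∑fx / ∑f
var = sum(fi * (xi - mean) ** 2 for fi, xi in zip(f, x)) / n   # formula 1.15
sd = var ** 0.5

print(round(mean, 3))   # 146.984
print(round(var, 2))    # 56802.69
print(round(sd, 3))     # 238.333
```

The results agree with those quoted in the text (in £000 units).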
and for grouped data

s² = ∑f(x − X̄)²/(n − 1)   (1.19)

where n is the sample size. The reason n − 1 is used in the denominator, rather than n as one might expect, is the following. Our real interest is in the population variance, and the sample variance is an estimate of it. The former is measured by the dispersion around μ, and the sample variance should ideally be measured around μ also. However, μ is unknown, so X̄ is used instead. But the variation of the sample observations around X̄ tends to be smaller than that around μ. Using n − 1 rather than n in the formula compensates for this, and the result is an unbiased⁵ (i.e. correct on average) estimate of the population variance.
Using the correct formula is more important the smaller is the sample size, as the proportionate difference between n − 1 and n increases. For example, if n = 10 the adjustment amounts to 10% of the variance; when n = 100 the adjustment is only 1%.
The sample standard deviation is given by the square root of equation 1.18 or 1.19.
Worked example 1.6 The variance and standard deviation
We continue with the previous worked example, relating to students’ marks. The variance and standard deviation can be calculated as:

X       f     fx      x − μ    (x − μ)²   f(x − μ)²
13      5     65      −2.81    7.89       39.45
14      13    182     −1.81    3.27       42.55
15      29    435     −0.81    0.65       18.98
16      33    528     0.19     0.04       1.20
17      17    289     1.19     1.42       24.11
18      8     144     2.19     4.80       38.40
19      4     76      3.19     10.18      40.73
20      1     20      4.19     17.56      17.56
Totals  110   1739                        222.99
The mean is calculated as 1739/110 = 15.81 and from this the deviations column (x − μ) is calculated, so −2.81 = 13 − 15.81, etc.
The variance is calculated as ∑f(x − μ)²/(n − 1) = 222.99/109 = 2.05. The standard deviation is therefore 1.43, the square root of 2.05. Calculations are shown to two decimal places but have been calculated using exact values.
For distributions which are approximately symmetric and bell-shaped (i.e. the observations are clustered around the mean) there is an approximate relationship between the standard deviation and the inter-quartile range. This rule of thumb is that the IQR is 1.3 times the standard deviation. In this case 1.3 × 1.43 = 1.86, close to the value calculated earlier (2).
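As a check, the worked example's figures can be reproduced directly from formula 1.19, and the shortcut formula (given later in the chapter as equation 1.23) returns the same answer:

```python
# marks data from the worked example: (mark, frequency)
marks = [(13, 5), (14, 13), (15, 29), (16, 33),
         (17, 17), (18, 8), (19, 4), (20, 1)]

n = sum(f for _, f in marks)                 # 110
mean = sum(f * x for x, f in marks) / n      # 1739/110 ≈ 15.81

# sample variance, formula 1.19: divide by n - 1, not n
s2 = sum(f * (x - mean) ** 2 for x, f in marks) / (n - 1)

# shortcut form (∑fx² − nX̄²)/(n − 1) gives the same result
s2_alt = (sum(f * x * x for x, f in marks) - n * mean ** 2) / (n - 1)

print(round(s2, 2))            # 2.05
print(round(s2 ** 0.5, 2))     # 1.43
print(abs(s2 - s2_alt) < 1e-9) # True
```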
⁵ The concept of bias is treated in more detail in Chapter 4.

Alternative formulae for calculating the variance and standard deviation
The following formulae give the same answers as equations 1.14 to 1.17 but are simpler to calculate, either by hand or using a spreadsheet. For the population variance one can use

σ² = ∑x²/N − μ²   (1.20)

or for grouped data

σ² = ∑fx²/∑f − μ²   (1.21)

The calculation of the variance using equation 1.21 is shown in Figure 1.14.

Figure 1.14 Descriptive statistics calculated using Excel
The sample variance can be calculated using

s² = (∑x² − nX̄²)/(n − 1)   (1.22)

or for grouped data

s² = (∑fx² − nX̄²)/(n − 1)   (1.23)

The standard deviation may of course be obtained as the square root of these formulae.
Using a calculator or computer for calculation
Electronic calculators and, particularly, computers have simplified the calculation of the mean etc. Figure 1.14 shows how to set out the above calculations in a spreadsheet (Microsoft Excel in this case), including some of the appropriate cell formulae.

The variance in this case is calculated using the formula σ² = ∑fx²/∑f − μ², which is the formula given in equation 1.21 above. Note that it gives the same result as that calculated in the text.
The following formulae are contained in the cells:
D5: =C5*B5, to calculate f times x
E5: =D5*B5, to calculate f times x²
C20: =SUM(C5:C18), to sum the frequencies
H6: =D20/C20, calculates ∑fx/∑f
H7: =E20/C20 − H6^2, calculates ∑fx²/∑f − μ²
H8: =SQRT(H7), calculates σ
H9: =H8/H6, calculates σ/μ
The coefficient of variation
The measures of dispersion examined so far are all measures of absolute dispersion and, in particular, their values depend upon the units in which the variable is measured. It is therefore difficult to compare the degrees of dispersion of two variables which are measured in different units. For example, one could not compare wealth in the UK with that in Germany if the former uses £s and the latter euros for measurement. Nor could one compare the wealth distribution in one country between two points in time, because inflation alters the value of the currency over time. The solution is to use a measure of relative dispersion, which is independent of the units of measurement. One such measure is the coefficient of variation, defined as

coefficient of variation = σ/μ   (1.24)

i.e. the standard deviation divided by the mean. Whenever the units of measurement are changed, the effect upon the mean and the standard deviation is the same, hence the coefficient of variation is unchanged. For the wealth distribution its value is 238.333/146.984 = 1.621, i.e. the standard deviation is 162% of the mean. This may be compared directly with the coefficient of variation of a different wealth distribution to see which exhibits a greater relative degree of dispersion.
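This unit-independence is easy to demonstrate numerically; a quick sketch (the income figures and exchange rate below are invented for illustration):

```python
from statistics import mean, pstdev

def coeff_of_variation(data):
    """σ/μ: relative dispersion, free of the units of measurement."""
    return pstdev(data) / mean(data)

incomes_gbp = [12_000, 18_500, 23_000, 31_000, 75_000]
incomes_eur = [x * 1.15 for x in incomes_gbp]   # same incomes in euros

print(round(coeff_of_variation(incomes_gbp), 4))
print(round(coeff_of_variation(incomes_eur), 4))  # same value: units cancel
```

Rescaling every observation multiplies both σ and μ by the same factor, so the ratio is unchanged.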
Independence of units of measurement
It is worth devoting a little attention to this idea that some summary measures are independent of the units of measurement and some are not, as it occurs quite often in statistics and is not often appreciated at first. A statistic that is independent of the units of measurement is one which is unchanged even when the units of measurement are changed. It is therefore more useful, in general, than a statistic which is not independent, since one can use it to make comparisons or judgements without worrying about how it was measured.
The mean is not independent of the units of measurement. If we are told
the average income in the UK is 20 000 for example we need to know whether
it is measured in pounds sterling euros or even dollars. The underlying level of
income is the same of course but it is measured differently. By contrast the rate

of growth (described in detail shortly) is independent of the units of measurement. If we are told it is 3% per annum, it would be the same whether it were calculated in pounds, euros or dollars. If told that the rate of growth in the US is 2% per annum, we can immediately conclude that the UK is growing faster; no further information is needed.
Most measures we have encountered so far such as the mean and variance
do depend on units of measurement. The coefﬁcient of variation is one that
does not. We now go on to describe another means of measuring dispersion that
avoids the units of measurement problem.
The standard deviation of the logarithm
Another solution to the problem of different units of measurement is to use the logarithm⁶ of wealth rather than the actual value. The reason why this works can best be illustrated by an example. Suppose that between 1997 and 2003 each individual’s wealth doubled, so that X_i,2003 = 2 × X_i,1997, where X_i,t indicates the wealth of individual i in year t. It follows that the standard deviation of wealth in 2003, X_2003, is therefore exactly twice that of 1997, X_1997. Taking logs, we have ln X_i,2003 = ln 2 + ln X_i,1997, so it follows that the distribution of ln X_2003 is the same as that of ln X_1997, except that it is shifted to the right by ln 2 units. The variances (and hence standard deviations) of the two logarithmic distributions must therefore be the same, indicating no change in the relative dispersion of the two wealth distributions.
The standard deviation of the logarithm of wealth is calculated from the data in Table 1.10. The variance turns out to be

⁶ See Appendix 1C if you are unfamiliar with logarithms. Note that we use the natural logarithm here, but the effect would be the same using logs to base 10.
Table 1.10 The calculation of the standard deviation of the logarithm of wealth

Range         Mid-point (£000)   x = ln(mid-point)   Frequency f   fx         fx²
0–            5.0                1.609               2448          3939.9     6341.0
10 000–       17.5               2.862               1823          5217.8     14 934.4
25 000–       32.5               3.481               1375          4786.7     16 663.7
40 000–       45.0               3.807               480           1827.2     6955.5
50 000–       55.0               4.007               665           2664.9     10 679.0
60 000–       70.0               4.248               1315          5586.8     23 735.4
80 000–       90.0               4.500               1640          7379.7     33 207.2
100 000–      125.0              4.828               2151          10 385.7   50 145.4
150 000–      175.0              5.165               2215          11 440.0   59 085.2
200 000–      250.0              5.521               1856          10 247.8   56 583.0
300 000–      400.0              5.991               1057          6333.0     37 943.8
500 000–      750.0              6.620               439           2906.2     19 239.3
1 000 000–    1500.0             7.313               122           892.2      6524.9
2 000 000–    3000.0             8.006               50            400.3      3205.1
Totals                                               17 636        74 008.2   345 243.0

Note: Use the ‘ln’ key on your calculator (or the LN function in a spreadsheet) to obtain natural logarithms of the data. You should obtain ln 5 = 1.609, ln 17.5 = 2.862, etc.
σ² = 345 243.0/17 636 − (74 008.2/17 636)² = 1.966

and the standard deviation σ = 1.402.
For comparison, the standard deviation of log income in 1979 (discussed in more detail later on) is 1.31, so there appears to have been a slight increase in relative dispersion over this time period.
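The invariance property described above is easy to verify numerically. In the sketch below (illustrative wealth figures, not the table's data) every value is doubled, yet the standard deviation of the logs is unchanged:

```python
from math import log
from statistics import pstdev

wealth_1997 = [5_000, 20_000, 45_000, 90_000, 250_000, 1_000_000]
wealth_2003 = [2 * w for w in wealth_1997]   # everyone's wealth doubles

sd_log_1997 = pstdev([log(w) for w in wealth_1997])
sd_log_2003 = pstdev([log(w) for w in wealth_2003])

# ln(2w) = ln 2 + ln w: the whole log distribution shifts by ln 2,
# so its standard deviation is unchanged
print(abs(sd_log_1997 - sd_log_2003) < 1e-9)   # True
```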
Measuring deviations from the mean: z-scores
Imagine the following problem. A man and a woman are arguing over their career records. The man says he earns more than she does, so is more successful. The woman replies that women are discriminated against and that, relative to women, she is doing better than the man is relative to other men. Can the argument be resolved?
Suppose the data are as follows: the average male salary is £19 500 the aver-
age female salary £16 800. The standard deviation of male salaries is £4750 for
women it is £3800. The man’s salary is £31 375 while the woman’s is £26 800.
The man is therefore £11 875 above the mean the woman £10 000. However
women’s salaries are less dispersed than men’s so the woman has done well to
reach £26 800.
One way to resolve the problem is to calculate the z-score, which gives the salary in terms of the number of standard deviations from the mean. Thus for the man, the z-score is

z = (X − μ)/σ = (31 375 − 19 500)/4750 = 2.50   (1.25)

Thus the man is 2.5 standard deviations above the male mean salary. For the woman the calculation is

z = (26 800 − 16 800)/3800 = 2.632   (1.26)
The woman is 2.632 standard deviations above her mean and therefore wins the argument – she is nearer the top of her distribution than is the man and so is more of an outlier. Actually, this probably will not end the argument, but it is the best the statistician can do! The z-score is an important concept which will be used again later in the book, when we cover hypothesis testing (Chapter 5).
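The comparison can be sketched in code using the salary figures from the example:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

z_man = z_score(31_375, 19_500, 4_750)
z_woman = z_score(26_800, 16_800, 3_800)

print(z_man)              # 2.5
print(round(z_woman, 3))  # 2.632
print(z_woman > z_man)    # True: the woman is further up her distribution
```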
Chebyshev’s inequality
Use of the z-score leads on naturally to Chebyshev’s inequality, which tells us about the proportion of observations that fall into the tails of any distribution, regardless of its shape. The theorem is expressed as follows:

At least 1 − 1/k² of the observations in any distribution lie within k standard deviations of the mean   (1.27)

If we take the female wage distribution given above, we can ask what proportion of women lie beyond 2.632 standard deviations from the mean (in both tails of the distribution). Setting k = 2.632, then 1 − 1/k² = 1 − 1/2.632² = 0.8556.

So at least 85% of women have salaries within ±2.632 standard deviations of the mean, i.e. between £6 800 (= 16 800 − 2.632 × 3800) and £26 800 (= 16 800 + 2.632 × 3800). 15% of women therefore lie outside this range.
Chebyshev’s inequality is a very conservative rule, since it applies to any distribution; if we know more about the shape of a particular distribution (for example, men’s heights follow a Normal distribution – see Chapter 3) then we can make a more precise statement. In the case of the Normal distribution, over 99% of men are within 2.632 standard deviations of the average height, because there is a concentration of observations near the centre of the distribution.
We can also use Chebyshev’s inequality to investigate the inter-quartile range. Formula 1.27 implies that 50% of observations lie within √2 = 1.41 standard deviations of the mean, a more conservative value than our previous 1.3.
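Formula 1.27 is easily wrapped in a small function (a sketch; note the bound is uninformative for k ≤ 1):

```python
from math import sqrt

def chebyshev_lower_bound(k):
    """At least this proportion of any distribution lies within
    k standard deviations of the mean (formula 1.27)."""
    if k <= 1:
        return 0.0  # the bound tells us nothing for k <= 1
    return 1 - 1 / k**2

print(round(chebyshev_lower_bound(2.632), 4))  # 0.8556
print(chebyshev_lower_bound(sqrt(2)))          # 0.5 (approximately)
```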
Exercise 1.4
(a) For the data in Exercise 2, calculate the inter-quartile range, the variance and the standard deviation.
(b) Calculate the coefficient of variation.
(c) Check if the relationship between the IQR and the standard deviation stated in the text is approximately true for this distribution.
(d) Approximately how much of the distribution lies within one standard deviation either side of the mean? How does this compare with the prediction from Chebyshev’s inequality?
Measuring skewness
The skewness of a distribution is the third characteristic that was mentioned earlier, in addition to location and dispersion. The wealth distribution is heavily skewed to the right, or positively skewed: it has its long tail in the right-hand end of the distribution. A measure of skewness gives a numerical indication of how asymmetric the distribution is.
One measure of skewness, known as the coefficient of skewness, is

∑f(x − μ)³/(Nσ³)   (1.28)

and it is based upon cubed deviations from the mean. The result of applying formula 1.28 is positive for a right-skewed distribution (such as wealth), zero for a symmetric one, and negative for a left-skewed one. Table 1.11 shows the calculation for the wealth data (some rows are omitted for brevity). From this we obtain

∑f(x − μ)³/N = 1 563 796 357 499/17 636 = 88 670 693.89

and dividing by σ³ (= 13 537 964) gives 6.550, which is positive, as expected.
The measure of skewness is much less useful in practical work than measures of location and dispersion, and even knowing the value of the coefficient does not always give much idea of the shape of the distribution: two quite different distributions can share the same coefficient. In descriptive work it is probably better to draw the histogram itself.
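Using the mid-points and frequencies from the wealth table, formula 1.28 can be evaluated in a few lines (a sketch; wealth in £000 and frequencies in thousands, as before):

```python
# mid-points (£000) and frequencies (thousands) from the wealth table
x = [5.0, 17.5, 32.5, 45.0, 55.0, 70.0, 90.0,
     125.0, 175.0, 250.0, 400.0, 750.0, 1500.0, 3000.0]
f = [2448, 1823, 1375, 480, 665, 1315, 1640,
     2151, 2215, 1856, 1057, 439, 122, 50]

n = sum(f)
mean = sum(fi * xi for fi, xi in zip(f, x)) / n
var = sum(fi * (xi - mean) ** 2 for fi, xi in zip(f, x)) / n
sd = var ** 0.5

# coefficient of skewness, formula 1.28: ∑f(x − μ)³ / (Nσ³)
skew = sum(fi * (xi - mean) ** 3 for fi, xi in zip(f, x)) / (n * sd ** 3)
print(round(skew, 2))   # 6.55: strongly right-skewed
```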

Comparison of the 2003 and 1979 distributions of wealth
Some useful lessons may be learned by comparing the 2003 distribution with its counterpart from 1979. This covers the period of Conservative government starting with Mrs Thatcher in 1979, up until the first six years of Labour administration. This shows how useful the various summary statistics are when it comes to comparing two different distributions. The wealth data for 1979 are given in Problem 1.5 below, where you are asked to confirm the following calculations.
Average wealth in 1979 was £16 399, about one-ninth of its 2003 value. The average increased substantially therefore, at about 10% per annum on average, but some of this was due to inflation rather than a real increase in the quantity of assets held. In fact, between 1979 and 2003 the retail price index rose from 52.0 to 181.3, i.e. it increased approximately three and a half times. Thus the nominal⁷ increase (i.e. in cash terms, before any adjustment for rising prices) in wealth is made up of two parts: (i) an inflationary part, which more than tripled measured wealth, and (ii) a real part, consisting of a 2.5-fold increase (thus 3.5 × 2.5 ≈ 9, approximately). Price indexes are covered in Chapter 10, where it is shown more formally how to divide a nominal increase into price and real quantity components. It is likely that the extent of the real increase in wealth is overstated here, due to the use of the retail price index rather than an index of asset prices. A substantial part of the increase in asset values over the period is probably due to the very rapid rise in house prices (houses form a significant part of the wealth of many households).
The standard deviation is similarly affected by inflation. The 1979 value is 25 552, compared to 2003’s 238 333, which is about nine times larger. The spread of the distribution appears to have increased, therefore, even if we take account of the general price effect. Looking at the coefficient of variation, however, shows that it has increased from 1.56 to 1.62, which is a modest difference. The spread of the distribution relative to its mean has not changed by much. This is confirmed by calculating the standard deviation of the logarithm: for 1979 this gives a figure of 1.31, slightly smaller than the 2003 figure of 1.40.
Table 1.11 Calculation of the skewness of the wealth data

Range         Mid-point x (£000)   Frequency f   Deviation (x − μ)   (x − μ)³          f(x − μ)³
0–            5.0                  2448          −142.0              −2 862 304        −7 006 919 444
10 000–       17.5                 1823          −129.5              −2 170 929        −3 957 603 101
:             :                    :             :                   :                 :
1 000 000–    1500.0               122           1353.0              2 476 903 349     302 182 208 638
2 000 000–    3000.0               50            2853.0              23 222 701 860    1 161 135 092 991
Totals                             17 636        4457.2              25 927 167 232    1 563 796 357 499
⁷ This is a different meaning of the term ‘nominal’ from that used earlier to denote data measured on a nominal scale (i.e. data grouped into categories without an obvious ordering). Unfortunately both meanings of the word are in common statistical usage, although it should be obvious from the context which use is meant.

Figure 1.15 Box plot of the wealth distribution

The measure of skewness for the 1979 data comes out as 5.723, smaller than the 2003 figure of 6.550. This suggests that the 1979 distribution is less skewed than the 2003 one. Again, these two figures can be directly compared, because they do not depend upon the units in which wealth is measured. However, the relatively small difference is difficult to interpret in terms of how the shape of the distribution has changed.
The box and whiskers diagram
Having calculated these various summary statistics, we can now return to a useful graphical method of presentation. This is the box and whiskers diagram, sometimes called a box plot, which shows the median, quartiles and other aspects of a distribution on a single diagram. Figure 1.15 shows the box plot for the wealth data.
Wealth is measured on the vertical axis. The rectangular box stretches vertically from the first to the third quartile and therefore encompasses the middle half of the distribution. The horizontal line through it is at the median and lies less than halfway up the box. This tells us that there is a degree of skewness even within the central half of the distribution, although it does not appear very severe. The two ‘whiskers’ extend above and below the box as far as the highest and lowest observations, excluding outliers. An outlier is defined to be any observation which is more than 1.5 times the inter-quartile range (which is the same as the height of the box) above or below the box. Earlier we found the IQR to be 153 517 and the upper quartile to be 180 022, so an upper outlier lies beyond
180 022 + 1.5 × 153 517 = 410 298. There are no outliers below the box, as wealth
cannot fall below zero. The top whisker is thus substantially longer than the
bottom one and indicates the extent of dispersion towards the tails of the
distribution. The crosses indicate the outliers and in reality extend far beyond
those shown in the diagram.
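The outlier rule just described can be sketched in a few lines of code. The rule itself (1.5 times the IQR beyond the quartiles) is standard; the function name is ours, and the quartile figures are those quoted in the text, with the lower quartile implied by the IQR.

```python
# Box-plot outlier fences: any point beyond Q3 + 1.5*IQR (or below
# Q1 - 1.5*IQR) is flagged as an outlier.

def outlier_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return (lower, upper) fences beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

q3 = 180_022          # upper quartile of wealth, from the text
iqr = 153_517         # inter-quartile range, from the text
q1 = q3 - iqr         # implied lower quartile
low, high = outlier_fences(q1, q3)
print(high)           # 410297.5, i.e. roughly the 410 298 quoted in the text
```

For the wealth data the lower fence is negative, which is why no lower outliers appear: wealth cannot fall below zero.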
A simple diagram thus reveals a lot of information about the distribution.
Other boxes and whiskers could be placed alongside in the same diagram, perhaps
representing other countries, making comparisons straightforward. Some
statistical software packages, such as SPSS and STATA, can generate box plots
from the original data without the need for the user to calculate the median,
etc. However, spreadsheet packages do not yet have this useful facility.
Time-series data: investment expenditures 1973–2005
The data on the wealth distribution give a snapshot of the situation at
particular points in time, and comparisons can be made between the 1979 and
2003 snapshots. Often, however, we wish to focus on the time-path of a variable
and therefore we use time-series data. The techniques of presentation and
summarising are slightly different than for cross-section data. As an example we use
data on investment in the UK for the period 1973–2005. These data were taken
from Statbase (http://www.statistics.gov.uk/statbase/), although you can find the
data in Economic Trends Annual Supplement. Investment expenditure is important
to the economy because it is one of the primary determinants of growth. Until
recent years the UK economy’s growth record had been poor by international
standards, and lack of investment may have been a cause. The variable studied
here is total gross (i.e. before depreciation is deducted) domestic fixed capital
formation, measured in £m. The data are shown in Table 1.12.

It should be remembered that the data are in current prices, so the figures
reflect price increases as well as changes in the volume of physical investment.
The series in Table 1.12 thus shows the actual amount of cash that was
Table 1.12 UK investment 1973–2005
Year Investment Year Investment Year Investment
1973 15 227 1984 58 589 1995 118 031
1974 18 134 1985 64 400 1996 126 593
1975 21 856 1986 68 546 1997 133 620
1976 25 516 1987 78 996 1998 151 083
1977 28 201 1988 96 243 1999 156 344
1978 32 208 1989 111 324 2000 161 468
1979 38 211 1990 114 300 2001 165 472
1980 43 238 1991 105 179 2002 173 525
1981 43 331 1992 101 111 2003 178 751
1982 47 394 1993 101 153 2004 194 491
1983 51 490 1994 108 534 2005 205 843
Note: Time-series data consist of observations on one or more variables over several time
periods. The observations can be daily, weekly, monthly, quarterly or, as here, annual.
spent each year on investment. The techniques used below for summarising the
investment data could equally well be applied to a series showing the volume of
investment.

First of all, we can use graphical techniques to gain an insight into the
characteristics of investment. Figure 1.16 shows a time-series graph of investment. The
graph plots the time periods on the horizontal axis and the investment variable
on the vertical.

Plotting the data in this way brings out clearly some key features of the series:

● The trend in investment is upwards, with only a few years in which there was
either no increase or a decrease.
● There is a ‘hump’ in the data in the late 1980s/early 1990s, before the series
returns to its trend. Something unusual must have happened around that
time. If we want to know what factors determine investment, or the effect of
investment upon other economic magnitudes, we should get some useful
insights from this period of the data.
● The trend is slightly non-linear – it follows an increasingly steep curve over
time. This is essentially because investment grows by a percentage or proportionate
amount each year. As we shall see shortly, it grows by about 8.5% each
year. Therefore, as the level of investment increases each year, so does the
increase in the level, giving a non-linear graph.
● Successive values of the investment variable are similar in magnitude, i.e. the
value in year t is similar to that in t − 1. Investment does not change from
£40bn in one year to £10bn the next, then back to £50bn, for instance. In
fact, the value in one year appears to be based on the value in the previous
year plus, in general, 8.5% or so. We refer to this phenomenon as serial
correlation and it is one of the aspects of the data that we might wish to
investigate. The ordering of the data matters, unlike the case with cross-section data
where the ordering is usually irrelevant. In deciding how to model investment
behaviour, we might focus on changes in investment from year to year.
Figure 1.16
Time-series graph of
investment in the UK
1973–2005
Note: The (x, y) coordinates are the values (year, investment); the first data point has the
coordinates (1973, 15 227), for example.
● The series seems ‘smoother’ in the earlier years, up to perhaps 1986, and
exhibits greater volatility later on. In other words, there are greater fluctuations
around the trend in the later years. We could express this more formally
by saying that the variance of investment around its trend appears to change
(increase) over time. This is known as heteroscedasticity (a constant variance
is termed homoscedasticity).
We may gain further insight into how investment evolves over time by focusing
on the change in investment from year to year. If we denote investment in
year t by I_t, then the change in investment, ΔI_t, is given by

ΔI_t = I_t − I_t−1

Table 1.13 shows the changes in investment each year and Figure 1.17 provides
a time-series graph.
The series is made up of mainly positive values, indicating that investment
increases over time. It also shows that the increase grows each year, with perhaps
some greater volatility of the increase towards the end of the period. The graph
also shows dramatically the change that occurred around 1990.
Figure 1.17
Time-series graph of the
change in investment
Table 1.13 The change in investment
Year Δ Investment Year Δ Investment Year Δ Investment
1973 2880 1984 7099 1995 9497
1974 2907 1985 5811 1996 8562
1975 3722 1986 4146 1997 7027
1976 3660 1987 10 450 1998 17 463
1977 2685 1988 17 247 1999 5261
1978 4007 1989 15 081 2000 5124
1979 6003 1990 2976 2001 4004
1980 5027 1991 −9121 2002 8053
1981 93 1992 −4068 2003 5226
1982 4063 1993 42 2004 15 740
1983 4096 1994 7381 2005 11 352
Note: The change in investment is obtained by taking the difference between successive
observations. For example, 2907 is the difference between 18 134 and 15 227.
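The differencing in Table 1.13 can be sketched as follows; the investment figures are the 1973–76 values from Table 1.12, and the function name is ours.

```python
# First differences of a time series: delta_t = x_t - x_(t-1).
# Each element of the result is one year's change in the series.

def first_differences(series: list) -> list:
    """Return the year-on-year changes of a series."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

investment = [15_227, 18_134, 21_856, 25_516]   # UK investment, 1973-76 (£m)
print(first_differences(investment))            # [2907, 3722, 3660]
```

The output matches the 1974–76 entries of Table 1.13; note the differenced series is one observation shorter than the original.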
Table 1.14 The logarithm of investment and the change in the logarithm
Year ln Investment Δ ln Investment Year ln Investment Δ ln Investment Year ln Investment Δ ln Investment
1973 9.631 0.210 1984 10.978 0.129 1995 11.679 0.084
1974 9.806 0.175 1985 11.073 0.095 1996 11.749 0.070
1975 9.992 0.187 1986 11.135 0.062 1997 11.803 0.054
1976 10.147 0.155 1987 11.277 0.142 1998 11.926 0.123
1977 10.247 0.100 1988 11.475 0.197 1999 11.960 0.034
1978 10.380 0.133 1989 11.620 0.146 2000 11.992 0.032
1979 10.551 0.171 1990 11.647 0.026 2001 12.017 0.024
1980 10.674 0.124 1991 11.563 −0.083 2002 12.064 0.048
1981 10.677 0.002 1992 11.524 −0.039 2003 12.094 0.030
1982 10.766 0.090 1993 11.524 0.000 2004 12.178 0.084
1983 10.849 0.083 1994 11.595 0.070 2005 12.235 0.057
Note: For 1973, 9.631 is the natural logarithm of 15 227, i.e. ln 15 227 = 9.631.
Statistics in practice: Outliers
Graphing data also allows you to see outliers (unusual observations). Outliers
might be due to an error in inputting the data (e.g. typing 97 instead of 970) or
because something unusual happened (e.g. the investment figure for 1991). Either
of these should be apparent from an appropriate graph. For example, the graph
of the change in investment highlights the 1991 figure. In the case of a
straightforward error you should obviously correct it. If you are satisfied that the outlier
is not simply a typo, you might want to think about the possible reasons for its
existence and whether it distorts the descriptive picture you are trying to paint.
Another useful way of examining the data is to look at the logarithm of
investment. This transformation has the effect of straightening out the non-linear
investment series. Table 1.14 shows the transformed values and Figure 1.18
graphs the series. In this case we use the natural (base e) logarithm.
Figure 1.18
Time-series graph of the
logarithm of investment
expenditures
This new series is much smoother than the original one (as is usually the case
when taking logs) and is helpful in showing the long-run trend, though it tends
to mask some of the volatility of investment. The slope of the graph gives a close
approximation to the average rate of growth of investment over the period,
expressed as a decimal. This is calculated as follows:

slope = change in ln investment / number of years = (12.235 − 9.631)/32 = 0.081    (1.29)

i.e. 8.1% per annum. Note that although there are 33 observations, there are only
32 years of growth. A word of warning: you must use natural (base e) logarithms,
not logarithms to the base 10, for this calculation to work. Remember also that
the growth of the volume of investment will be less than 8.1% per annum
because part of it is due to price increases.
The logarithmic presentation is useful when comparing two different data
series: when graphed in logs, it is easy to see which is growing faster – just see
which series has the steeper slope.

A corollary of equation (1.29) is that the change in the natural logarithm of
investment from one year to the next represents the percentage change in the
data over that year. For example, the natural logarithm of investment in 1973 is
9.631, while in 1974 it is 9.806. The difference is 0.175, so the rate of growth is
17.5%. Remember that this is an approximation and the result of a quick and
easy calculation. It is reasonably accurate up to a figure of about 20%.
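Both log-based calculations above can be checked in a few lines; the investment figures are the 1973, 1974 and 2005 values from Table 1.12.

```python
import math

# Growth from logarithms: the slope of the log series over the whole
# period approximates the average growth rate, and a one-year change
# in ln(x) approximates that year's proportionate growth.

inv_1973, inv_1974, inv_2005 = 15_227, 18_134, 205_843   # £m, Table 1.12

# Average growth: 32 years of growth between 33 observations
slope = (math.log(inv_2005) - math.log(inv_1973)) / 32
print(round(slope, 3))       # 0.081, i.e. about 8.1% per annum

# One-year growth approximation for 1973-74
one_year = math.log(inv_1974) - math.log(inv_1973)
print(round(one_year, 3))    # 0.175, i.e. about 17.5%
```

As the text warns, these are approximations; the exact growth rate for 1973–74 is 18 134/15 227 − 1 ≈ 19.1%, so the log approximation is already straining at this size of change.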
Finally, we can graph the difference of the logarithm, as we graphed the
difference of the level. This is shown in Figure 1.19 (the calculations are in Table 1.14).

This is quite revealing. It shows the series fluctuating about the value of
approximately 0.08 (the average calculated in equation (1.29) above), with a
slight downwards trend. Furthermore, the series does not seem to show increasing
volatility over time as the others did. The graph therefore demonstrates that
in proportionate terms there is no increasing volatility (the variance of the series
around 0.08 does not change much over time), although 1991 still seems to be
an ‘unusual’ observation.
Figure 1.19
Time-series graph of
the difference of the
logarithmic series
Graphing multiple series
Investment is made up of different categories: the table in Problem 1.14 presents
investment data under four different headings: dwellings; transport and machinery;
intangible fixed assets; and other buildings. Together they make up total
investment. It is often useful to show all of the series together on one graph. Figure 1.20
shows a multiple time-series graph of the investment data.

Construction of this type of graph is straightforward: it is just an extension of
the technique for presenting a single series. The chart shows that all investment
categories have increased over time in a fairly similar way, including the hump
then fall around 1990. It is noticeable, however, that investment in machinery
fell significantly around 2000 while other categories, particularly dwellings,
continued to increase. It is difficult from the graph to tell which categories have
increased most rapidly over time: the 1973 values are relatively small and hard
to distinguish. In fact it is the ‘intangible fixed assets’ category, the smallest
one, that has increased fastest in proportionate terms. This is easier to observe
with a few numerical calculations (covered later in this chapter) rather than
trying to read a cramped graph.
One could also produce a multiple series graph of the logarithms of the
variables, and also of the changes, as was done for the total investment series. Since
the log transformation tends to squeeze the values on the y-axis closer together
(compare Figures 1.16 and 1.18), it might be easier to see the relative rates of
growth of the series using this method. This is left as an exercise for the reader.

Another complication arises when the series are of different orders of magnitude
and it is difficult to make all the series visible on the chart. In this case you
can chart some of the series against a second vertical scale, on the right-hand
axis. An example is shown in Figure 1.21, plotting the total investment data
with the interest rate, which has much smaller numerical values. If the same axis
were used for both series, the interest rate would appear as a horizontal line
coinciding with the x-axis. This would reveal no useful information to the
viewer.

It would usually be inappropriate to use this technique on data such as the
investment categories graphed in Figure 1.20. Those are directly comparable to
each other, and to magnify one of the series by plotting it on a separate axis risks
Figure 1.20
A multiple time-series
graph of investment
STFE_C01.qxd 26/02/2009 09:04 Page 50

slide 68:

Time-series data: investment expenditures 1973–2005
51
Figure 1.21
Time-series graph using two vertical scales: investment (LH scale)
and the interest rate (RH scale), 1985–2005
distorting the message for the reader. However, investment and interest rates are
measured in inherently different ways, and one cannot directly compare their
sizes, hence it is acceptable to use separate axes. The graph allows one to observe
the movements of the series together and hence perhaps infer something about
the relationship between them. The rising investment and falling interest rate
possibly suggest an inverse relationship between them.
Statistics in practice: Overlapping the ranges of the data series

The graph below, taken from the Treasury Briefing, February 1994, provides a nice
example of how to plot multiple time series and compare them. The aim is to
compare the recessions and recoveries of 1974–78, 1979–83 and 1990–93. Instead
of plotting time on the horizontal axis, the number of quarters since the start of
each recession is used, so that the series overlap. This makes it easy to see the
depth of the last recession and the long time before recovery commenced. By
contrast, the 1974–78 recession ended quite quickly and recovery was quite rapid.
Figure 1.22
Area graph of
investment categories
1973–2005
Figure 1.23
Over-the-top graph of
investment
The investment categories may also be illustrated by means of an area graph,
which plots the four series stacked one on top of the other, as illustrated in
Figure 1.22.

This shows, for example, that the ‘dwellings’ and ‘machinery’ categories each take
up about one quarter of total investment. This is easier to see from the area
graph than from the multiple series graph in Figure 1.20.
Statistics in practice: ‘Chart junk’

With modern computer software it is easy to get carried away and produce a chart
that actually hides more than it reveals. There is a great temptation to add some
3D effects, liven it up with a bit of colour, rotate and tilt the viewpoint, etc. This
sort of stuff is generally known as ‘chart junk’. As an example, look at Figure 1.23,
which is an alternative to the area graph in Figure 1.22 above. It was fun to
create, but it does not get the message across at all! Taste is, of course, personal,
but moderation is usually an essential part of it.
Exercise 1.5

Given the following data:

	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999
Profit	50	60	25	−10	10	45	60	50	20	40
Sales	300	290	280	255	260	285	300	310	300	330

(a) Draw a multiple time-series graph of the two variables. Label both axes
appropriately and provide a title for the graph.
(b) Adjust the graph by using the right-hand axis to measure profits, the left-hand
axis sales. What difference does this make?
Numerical summary statistics
The graphs have revealed quite a lot about the data already, but we can also
calculate numerical descriptive statistics, as we did for the cross-section data.
First we consider the mean, then the variance and standard deviation.

The mean of a time series

We could calculate the mean of investment itself, but would this be helpful?
Because the series is trended, it passes through the mean at some point between
1973 and 2005 but never returns to it. The mean of the series is actually
£95.103bn, which is not very informative since it tells us nothing about its value
today, for instance. The problem is that the variable is trended, so that the mean
is not typical of the series. The annual increase in investment is also trended, so
is subject to the same criticism (see Figure 1.17).

It is better in this case to calculate the average growth rate, as this is more
likely to be representative of the whole time period. It seems more reasonable to
say that a series is growing at, for example, 8% per annum than that it is growing
at £5000m per annum. The average growth rate was calculated in equation (1.29)
as 8.1% per annum, by measuring the slope of the graph of the log investment
series. That was stated to be an approximate answer. We can obtain an accurate
value in the following way:
1. Calculate the overall growth factor of the series, i.e. x_T/x_1, where x_T is the
final observation and x_1 is the initial observation. This is

x_T/x_1 = 205 843/15 227 = 13.518

i.e. investment expenditure is 13.5 times larger in 2005 than in 1973.

2. Take the T − 1 root of the growth factor. Since T = 33, we calculate
³²√13.518 = 1.085. This can be performed on a scientific calculator by raising 13.518
to the power 1/32, i.e. 13.518^(1/32) = 1.085.

3. Subtract 1 from the result in the previous step, giving the growth rate as a
decimal. In this case we have 1.085 − 1 = 0.085.

Thus the average growth rate of investment is 8.5% per annum, rather than the
8.1% calculated earlier.
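The three steps above can be sketched as a small function; `average_growth_rate` is our name for it, and the figures are the 1973 and 2005 investment values from Table 1.12.

```python
# Accurate average growth rate from the first and last observations:
# take the (T-1)th root of the overall growth factor, then subtract 1.

def average_growth_rate(first: float, last: float, n_obs: int) -> float:
    """Average per-period growth rate over n_obs observations (n_obs - 1 periods)."""
    growth_factor = last / first                  # step 1: overall growth factor
    root = growth_factor ** (1 / (n_obs - 1))     # step 2: (T-1)th root
    return root - 1                               # step 3: subtract 1

g = average_growth_rate(15_227, 205_843, 33)      # UK investment, 1973-2005
print(round(g, 3))                                # 0.085, i.e. 8.5% per annum
```

Note the divisor is `n_obs - 1`: 33 observations span only 32 years of growth, the same point made for equation (1.29).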
Statistics in practice: The power of compound growth

The Economist magazine provided some amusing and interesting examples of
how a $1 investment can grow over time. They assumed that an investor (they
named her Felicity Foresight, for reasons that become obvious) started with $1 in
1900 and had the foresight (or luck) to invest each year in the best performing
asset of the year. Sometimes she invested in equities, some years in gold, and so
on. By the end of the century she had amassed $9.6 quintillion ($9.6 × 10^18), more
than world gross domestic product (GDP), so highly unrealistic. This is equivalent
to an average annual growth rate of 55%. In contrast, Henry Hindsight did the
same but invested in the previous year’s best asset. This might be thought more
realistic. Unfortunately, his $1 turned into only $783, a still respectable annual
growth rate of 6.9%. This, however, is beaten by the strategy of investing in
the previous year’s worst performing asset (what goes down must come up . . .).
This turned $1 into $1730, a return of 7.7%. Food for thought!

Source: The Economist, 12 February 2000, p. 111.
Note that we could also obtain the accurate answer from our earlier calculation
as follows:

● the slope of the graph is 0.0814 (from equation (1.29) above, but to four
decimal places for accuracy);
● calculate the anti-log (e^x) of this: e^0.0814 = 1.085;
● subtract 1, giving a growth rate of 1.085 − 1 = 0.085 (= 8.5% p.a.).

Note that as the calculated growth rate is based only upon the initial and
final observations, it could be unreliable if either of these two values is an outlier.
With a sufficient span of time, however, this is unlikely to be a serious problem.
The geometric mean

In calculating the average growth rate of investment, we have implicitly
calculated the geometric mean of a series. If we have a series of n values, then their
geometric mean is calculated as the nth root of the product of the values, i.e.

geometric mean = ⁿ√(∏ᵢ₌₁ⁿ xᵢ)    (1.30)

The x values in this case are the growth factors in each year, as in Table 1.15 (the
values in intermediate years are omitted). The ‘∏’ symbol is similar to the use of
Σ but means ‘multiply together’ rather than ‘add up’.

The product of the 32 growth factors is 13.518 (the same as is obtained by
dividing the final observation by the initial one – why?) and the 32nd root of
this is 1.085. This latter figure, 1.085, is the geometric mean of the growth factors,
and from it we can derive the growth rate of 8.5% p.a. by subtracting 1.
Whenever one is dealing with growth data, or any series that is based on a
multiplicative process, one should use the geometric mean rather than the
arithmetic mean to get the answer. However, using the arithmetic mean in this
case generally gives only a small error, as is indicated below.
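Equation (1.30) can be sketched as a small function (the name `geometric_mean` is ours; the growth factors below are illustrative, not the investment series):

```python
import math

# Geometric mean: the nth root of the product of n values (equation 1.30).
# For growth factors, subtracting 1 from the result gives the growth rate.

def geometric_mean(values: list) -> float:
    product = math.prod(values)           # the capital-pi product
    return product ** (1 / len(values))

factors = [1.5, 2.0, 3.0]                 # illustrative growth factors
print(round(geometric_mean(factors), 3))  # cube root of 9.0, about 2.08
```

Because the product of growth factors telescopes to (final value)/(initial value), the geometric mean of the factors is exactly the (T − 1)th-root calculation used earlier.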
Another approximate way of obtaining the average growth rate

We have seen that when calculating rates of growth one should use the
geometric mean, but if the growth rate is reasonably small, then taking the
arithmetic mean of the growth factors will give approximately the right answer.
The arithmetic mean of the growth factors is

(1.191 + 1.205 + . . . + 1.088 + 1.058)/32 = 1.087

giving an estimate of the growth rate of 1.087 − 1 = 0.087 = 8.7% p.a. – close to
the correct value. Note also that one could equivalently take the average of the
annual growth rates (0.191, 0.205, etc.), giving 0.087, to obtain the same result.

Use of the arithmetic mean is justified in this context if one needs only an
approximation to the right answer and annual growth rates are reasonably
small. It is usually quicker and easier to calculate the arithmetic rather than the
geometric mean, especially if one does not have a computer to hand.
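The closeness of the two means for small growth rates is easy to check; the factors below are illustrative, not the full investment series.

```python
import math

# For growth factors close to 1, the arithmetic mean of the factors
# is a good approximation to the geometric mean.

factors = [1.05, 1.08, 1.10, 1.02, 1.07]     # illustrative annual growth factors

arith = sum(factors) / len(factors)
geom = math.prod(factors) ** (1 / len(factors))
print(round(arith, 4), round(geom, 4))       # 1.064 1.0636 - they differ by about 0.0004
```

The arithmetic mean always slightly overstates the geometric mean (a consequence of the AM–GM inequality), and the gap grows as the factors become more variable.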
By now you might be feeling a little overwhelmed by the various methods we
have used, all to get an idea of the average – methods which give similar but not
always identical answers. Let us summarise the findings:

(a) measuring the slope of the log graph: gives approximately the right answer;
(b) transforming the slope using the formula e^b − 1 (b is the measured slope):
gives the precise answer;
(c) calculating ᵀ⁻¹√(x_T/x_1) − 1: gives the precise answer, as in (b);
(d) calculating the geometric mean of the growth factors: gives the precise
answer;
(e) calculating the arithmetic mean of the growth factors: gives approximately
the right answer, although not the same approximation as (a) above.

Remember also that the ‘precise’ answer could be slightly misleading if either the
initial or final value is an outlier.
Table 1.15 Calculation of the geometric mean – annual growth factors

Year	Investment	Growth factor
1973	15 227	–
1974	18 134	1.191 (= 18 134/15 227)
1975	21 856	1.205 (= 21 856/18 134)
1976	25 516	1.167 (etc.)
…	…	…
2002	173 525	1.049
2003	178 751	1.030
2004	194 491	1.088
2005	205 843	1.058

Note: Each growth factor simply shows the ratio of that year’s investment to the
previous year’s.
Statistics in practice: Compound interest

The calculations we have performed relating to growth rates are analogous to
computing compound interest. If we invest £100 at a rate of interest of 10% per
annum, then the investment will grow at 10% p.a. (assuming all the interest is
reinvested). Thus after one year the total will have grown to £100 × 1.1 = £110, after
two years to £100 × 1.1² = £121, and after t years to £100 × 1.1^t. The general formula
for the terminal value, S_t, of a sum S_0 invested for t years at a rate of interest r is

S_t = S_0(1 + r)^t    (1.31)

where r is expressed as a decimal. Rearranging (1.31) to make r the subject yields

r = (S_t/S_0)^(1/t) − 1    (1.32)

which is precisely the formula for the average growth rate. To give a further
example: suppose an investment fund turns an initial deposit of £8000 into
£13 500 over 12 years. What is the average rate of return on the investment? Setting
S_0 = 8, S_t = 13.5, t = 12 and using equation (1.32) we obtain

r = (13.5/8)^(1/12) − 1 = 0.045

or 4.5% per annum.

Formula (1.32) can also be used to calculate the depreciation rate and the
amount of annual depreciation on a firm’s assets. In this case S_0 represents the
initial value of the asset, S_t represents the final or scrap value, and the annual rate
of depreciation (as a negative number) is given by r from equation (1.32).
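Equations (1.31) and (1.32) can be sketched directly; the £100 and £8000 → £13 500 examples come from the box above, and the function names are ours.

```python
# Compound growth: terminal value (1.31) and implied rate of return (1.32).

def terminal_value(s0: float, r: float, t: int) -> float:
    """S_t = S_0 * (1 + r)^t."""
    return s0 * (1 + r) ** t

def implied_rate(s0: float, st: float, t: int) -> float:
    """r = (S_t / S_0)^(1/t) - 1."""
    return (st / s0) ** (1 / t) - 1

print(round(terminal_value(100, 0.10, 2), 2))     # 121.0: £100 at 10% for 2 years
print(round(implied_rate(8_000, 13_500, 12), 3))  # 0.045, i.e. 4.5% per annum
```

For depreciation, calling `implied_rate` with a scrap value smaller than the initial value simply returns a negative r, as the text notes.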
The variance of a time series

How should we describe the variance of a time series? The variance of the
investment data can be calculated, but it would be uninformative, in the same way as
the mean. As the series is trended (and this is likely to continue in the longer
run), the variance is in principle equal to infinity. The calculated variance would
be closely tied to the sample size: the larger it is, the larger the variance. Again,
it makes more sense to calculate the variance of the growth rate, which has
little trend in the long run.

This variance can be calculated from the formula

s² = Σ(x − X̄)²/(n − 1) = (Σx² − nX̄²)/(n − 1)    (1.33)

where X̄ is the average rate of growth. The calculation is set out in Table 1.16,
using the right-hand formula in equation (1.33).

The variance is therefore

s² = (0.3990 − 32 × 0.087²)/31 = 0.0051

and the standard deviation is 0.071, the square root of the variance. The
coefficient of variation is
Table 1.16 Calculation of the variance of the growth rate

Year	Investment	Growth rate x	x²
1974	18 134	0.191	0.036
1975	21 856	0.205	0.042
1976	25 516	0.167	0.028
…	…	…	…
2002	173 525	0.049	0.002
2003	178 751	0.030	0.001
2004	194 491	0.088	0.008
2005	205 843	0.058	0.003
Totals		2.7856	0.3990

cv = 0.071/0.087 = 0.816

i.e. the standard deviation of the growth rate is about 80% of the mean.
Note three things about this calculation: first, we have used the arithmetic
mean (using the geometric mean makes very little difference); second, we have
used the formula for the sample variance, since the period 1974–2005 constitutes
a sample of all the possible data we could collect; and third, we could have
equally used the growth factors for the calculation of the variance (why?).
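Both forms of equation (1.33) can be sketched and checked against each other; the growth rates below are a few illustrative values, not the full series of Table 1.16.

```python
# Sample variance of growth rates, using both forms of equation 1.33:
# sum((x - mean)^2)/(n-1) and (sum(x^2) - n*mean^2)/(n-1).

def sample_variance(xs: list) -> float:
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_alt(xs: list) -> float:
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

rates = [0.191, 0.205, 0.167, 0.049, 0.030, 0.088, 0.058]   # illustrative
v1, v2 = sample_variance(rates), sample_variance_alt(rates)
print(abs(v1 - v2) < 1e-12)    # True: the two forms are algebraically identical
```

Dividing by n − 1 rather than n is what makes this the *sample* variance, matching the second note above.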
Worked example 1.7

Given the following data:

Year 1999 2000 2001 2002 2003
Price of a laptop PC 1100 900 800 750 700

we can work out the average rate of price growth per annum as follows. The
overall growth factor is 700/1100 = 0.6363. The fact that this number is less than
one simply reflects the fact that the price has fallen over time. It has fallen to
64% of its original value. To find the annual rate we take the fourth root of
0.6363 (four years of growth). Hence we obtain ⁴√0.6363 = 0.893, i.e. each
year the price falls to 89% of its value the previous year. This implies the price is
falling at 0.893 − 1 = −0.107, or approximately an 11% fall each year.

We can see if the fall is more or less the same each year by calculating each year’s
growth factor. These are:

Year 1999 2000 2001 2002 2003
Laptop price 1100 900 800 750 700
Growth factor – 0.818 0.889 0.9375 0.933
Price fall (%) – −19 −11 −6 −7

The price fall was larger in the earlier years, in percentage as well as absolute
terms. Calculating the standard deviation of the values in the final row
provides a measure of the variability from year to year. The variance is
given by

s² = [(19 − 11)² + (11 − 11)² + (6 − 11)² + (7 − 11)²]/3 = 30.7

and the standard deviation is then 5.54. (The calculations are shown
rounded, but the answer is accurate.)

Exercise 1.6

(a) Using the data in Exercise 1.5, calculate the average level of profit over the time
period and the average growth rate of profit over the period. Which appears more
useful?
(b) Calculate the variance of profit and compare it to the variance of sales.
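Worked example 1.7 above can be checked numerically; using the unrounded percentage falls reproduces the "accurate" variance the example mentions.

```python
# Average annual price change and year-to-year variability for the
# laptop-price series in Worked example 1.7.

prices = [1100, 900, 800, 750, 700]             # 1999-2003

# Average annual growth factor: fourth root of the overall factor
annual = (prices[-1] / prices[0]) ** (1 / (len(prices) - 1))
print(round(annual, 3))                          # 0.893, about an 11% fall per year

# Percentage change each year, unrounded
falls = [100 * (curr / prev - 1) for prev, curr in zip(prices, prices[1:])]
mean = sum(falls) / len(falls)
var = sum((f - mean) ** 2 for f in falls) / (len(falls) - 1)
print(round(var, 1))                             # 30.7, matching the worked example
```

Note that plugging the rounded falls (−19, −11, −6, −7) into the same formula gives 35, not 30.7; the displayed figures are rounded but the quoted answer uses the exact values.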
Graphing bivariate data: the scatter diagram
The analysis of investment is an example of the use of univariate methods: only
a single variable is involved. However, we often wish to examine the relationship
between two (or sometimes more) variables, and we have to use bivariate
or multivariate methods. To illustrate the methods involved, we shall examine
the relationship between investment expenditures and gross domestic product
(GDP). Economics tells us to expect a positive relationship between these
variables: higher GDP is usually associated with higher investment. Table 1.17
provides data on GDP for the UK.

A scatter diagram (also called an XY chart) plots one variable (in this case
investment) on the y axis, the other (GDP) on the x axis, and therefore shows
the relationship between them. For example, one can see whether high values
of one variable tend to be associated with high values of the other. Figure 1.24
shows the relationship for investment and GDP.

The chart shows a strong linear relationship between the two variables, apart
from a curious dip in the middle. This reflects the sharp fall in investment after
1990, which is not matched by a fall in GDP (if it were, the XY chart would show
Table 1.17 GDP data
Year GDP Year GDP Year GDP
1973 74 020 1984 324 633 1995 719 747
1974 83 793 1985 355 269 1996 765 152
1975 105 864 1986 381 782 1997 811 194
1976 125 203 1987 420 211 1998 860 796
1977 145 663 1988 469 035 1999 906 567
1978 167 905 1989 514 921 2000 953 227
1979 197 438 1990 558 160 2001 996 987
1980 230 800 1991 587 080 2002 1 048 767
1981 253 154 1992 611 974 2003 1 110 296
1982 277 198 1993 642 656 2004 1 176 527
1983 302 973 1994 680 978 2005 1 224 715
a linear relationship without the dip). It is important to recognise the difference
between the time-series plot and the XY chart. Because of inflation, later observations
tend to be towards the top right of the XY chart (both investment and
GDP are increasing over time), but this does not have to happen: if both variables
fluctuated up and down, later observations could be at the bottom left, or
centre, or anywhere. By contrast, in a time-series plot later observations are
always further to the right.

Note that both variables are in nominal terms, i.e. they make no correction
for inflation over the time period. This may be seen algebraically: investment
expenditure is made up of the volume of investment, I, times its price, P_I.
Similarly, nominal GDP is real GDP, Y, times its price, P_Y. Thus the scatter
diagram actually charts P_I × I against P_Y × Y. It is likely that the two prices follow a
similar trend over time and that this dominates the movements in real investment
and GDP. The chart then shows the relationship between a mixture of
prices and quantities, when the more interesting relationship is between the
quantities of investment and output.
Figure 1.25 shows the relationship between the quantities of investment and
output, i.e. after the strongly trending price effects have been removed. It is not
so straightforward as the nominal graph. There is now a ‘knot’ of points in the
centre where perhaps both real investment and GDP fluctuated up and down.
Overall, it is clear that something ‘interesting’ happened around 1990 that merits
additional investigation.

Chapter 10, on index numbers, explains in detail how to derive real variables
from nominal ones, as we have done here, and generally describes how to
correct for the effects of inflation on economic magnitudes.
Exercise 1.7
(a) Once again using the data from Exercise 1.5, draw an XY chart with profits on the vertical axis, sales on the horizontal axis. Choose the scale of the axes appropriately.
(b) If using Excel to produce graphs: right-click on the graph, choose 'Add trendline' and choose a linear trend. This gives the 'line of best fit' (covered in detail in Chapter 7). What does this appear to show?

Figure 1.24
Scatter diagram of investment (vertical axis) against GDP (horizontal axis), nominal values
Note: The (x, y) coordinates of each point are given by the values of investment and GDP respectively. Thus the first (1973) data point is drawn 15 227 units above the horizontal axis and 74 020 units from the vertical one.
Chapter 1 • Descriptive statistics
Figure 1.25
The relationship between real investment and real output
Data transformations
In analysing employment and investment data in the examples above, we have
often changed the variables in some way in order to bring out the important
characteristics. In statistics one usually works with data that have been transformed
in some way rather than using the original numbers. It is therefore worth
summarising the main data transformations available, providing justifications
for their use and exploring the implications of such adjustments to the original
data. We briefly deal with the following transformations:
● rounding
● grouping
● dividing or multiplying by a constant
● differencing
● taking logarithms
● taking the reciprocal
● deﬂating.
Rounding
Rounding improves readability. Too much detail can confuse the message, so
rounding the answer makes it more memorable. To give an example, the average
wealth holding calculated earlier in this chapter is actually £146 983.726 (to
three decimal places). It would be absurd to present it in this form, however. We
do not know for certain that this figure is accurate; in fact, it almost certainly is
not. There is a spurious degree of precision which might mislead the reader.
How much should this be rounded for presentational purposes, therefore?
Remember that the figures have already been effectively rounded by allocation
to classes of width £10 000 or more: all observations have been rounded to the
mid-point of the interval. However, much of this rounding is offsetting (i.e.
numbers rounded up offset those rounded down), so the mean is reasonably
accurate. Rounding to £147 000 makes the figure much easier to remember and
is only a change of 0.01% (147 000/146 984 = 1.000 111), so is a reasonable
compromise. In the text above the answer was not rounded to such an extent,
since the purpose was to highlight the methods of calculation.

STATISTICS IN PRACTICE
Inflation in Zimbabwe
'Zimbabwe's rate of inflation surged to 3731.9%, driven by higher energy and food costs and
amplified by a drop in its currency, official figures show.'
(BBC news online, 17 May 2007)
Whether official or not, it is impossible that the rate of inflation is known with such
accuracy, to one decimal place, especially when prices are rising so fast. It would
be more reasonable to report a figure of 3700% in this case. Sad to say, inflation
rose even further in subsequent months.
Rounding is a 'trap door' function: you cannot obtain the original value
from the transformed (rounded) value. Therefore, if you are going to need the
original value in further calculations, you should not round your answer.
Furthermore, small rounding errors can cumulate, leading to a large error in the
final answer. Therefore, you should never round an intermediate answer, only
the final one. Even if you only round the intermediate answer by a small
amount, the final answer could be grossly inaccurate. Try the following:
calculate 60.29 × 30.37 − 1831, both before and after rounding the first two
numbers to integers. In the first case you obtain 0.0073, in the second −31.
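The text's rounding experiment can be verified directly:

```python
# Rounding an intermediate value wrecks the final answer in this example.
exact = 60.29 * 30.37 - 1831                        # full precision kept
rounded_first = round(60.29) * round(30.37) - 1831  # i.e. 60 * 30 - 1831

print(round(exact, 4))    # 0.0073
print(rounded_first)      # -31
```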
Grouping
When there is too much data to present easily, grouping solves the problem,
although at the cost of hiding some of the information. The examples relating
to education and unemployment, and to wealth, used grouped data. Using the
raw data would have given us far too much information, so grouping is a first
stage in data analysis. Grouping is another trap door transformation: once it is
done, you cannot recover the original information.
Dividing/multiplying by a constant
This transformation is carried out to make numbers more readable or to make
calculation simpler by removing trailing zeros. The data on wealth were divided
by 1000 to ease calculation; otherwise the fx² column would have contained
extremely large values. Some summary statistics (e.g. the mean) will be affected
by the transformation, but not all (e.g. the coefficient of variation). Try to
remember which are affected (the E and V operators, see Appendix 1B, can help). The
transformation is easy to reverse.
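A quick sketch of which statistics are affected, using Python's statistics module (the wealth-like figures are invented):

```python
# Dividing every observation by 1000 divides the mean by 1000
# but leaves the coefficient of variation unchanged.
import statistics as st

x = [12_000, 15_000, 9_000, 18_000, 21_000]   # invented data
k = 1 / 1000
kx = [k * v for v in x]                       # the transformed data

def cv(data):
    # coefficient of variation = standard deviation / mean
    return st.pstdev(data) / st.mean(data)

print(st.mean(x), round(st.mean(kx), 9))    # 15000 15.0
print(round(cv(x), 9) == round(cv(kx), 9))  # True
```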
Differencing
In time-series data there may be a trend, and it is better to describe the features
of the data relative to the trend. The result may also be more economically
meaningful: for example, governments are often more concerned about the growth
of output than about its level. Differencing is one way of eliminating the trend
(see Chapter 11 for other methods of detrending data). Differencing was used
for the investment data for both of these reasons. One of the implications of
differencing is that information about the level of the variable is lost.
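A minimal sketch of differencing, with an invented trending series:

```python
# First differences remove the trend but lose the level of the variable.
y = [100, 103, 105, 109, 112, 114]          # invented, trending upwards
diffs = [b - a for a, b in zip(y, y[1:])]   # year-on-year changes

print(diffs)   # [3, 2, 4, 3, 2] -- fluctuations remain, trend and level gone
```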
Taking logarithms
Taking logarithms is used to linearise a non-linear series, in particular one that
is growing at a fairly constant rate. It is often easier to see the important features
of such a series if the logarithm is graphed rather than the raw data. The
logarithmic transformation is also useful in regression (see Chapter 9) because it
yields estimates of elasticities (e.g. of demand). Taking the logarithm of the
investment data linearised the series and tended to smooth it. The inverses of
the logarithmic transformations are 10^x for common logarithms and e^x for
natural logarithms, so one can recover the original data.
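A sketch of why logs linearise constant growth, using a hypothetical series growing at 5% per period:

```python
import math

# A series growing at a constant 5% per period...
y = [100 * 1.05 ** t for t in range(6)]
# ...is linear in logs: successive log differences are all ln(1.05).
log_y = [math.log(v) for v in y]
log_diffs = [round(b - a, 6) for a, b in zip(log_y, log_y[1:])]

print(log_diffs)                     # every entry is 0.04879 (= ln 1.05)
print(round(math.exp(log_y[0]), 6))  # 100.0 -- e^x recovers the raw data
```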
Taking the reciprocal
The reciprocal of a variable might have a useful interpretation and provide a
more intuitive explanation of a phenomenon. The reciprocal transformation
will also turn a linear series into a non-linear one. The reciprocal of turnover
in the labour market (i.e. the number leaving unemployment divided by the
number unemployed) gives an idea of the duration of unemployment. If a half of
those unemployed find work each year (turnover = 0.5), then the average
duration of unemployment is 2 years (= 1/0.5). If a graph of turnover shows a linear
decline over time, then the average duration of unemployment will be rising at
a faster and faster rate. Repeating the reciprocal transformation recovers the
original data.
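The turnover/duration example can be sketched as follows (turnover figures invented, declining linearly as in the text):

```python
# duration = 1/turnover rises at a faster and faster rate as turnover
# declines linearly.
turnover = [0.5, 0.4, 0.3, 0.2, 0.1]
duration = [round(1 / t, 2) for t in turnover]

print(duration)                         # [2.0, 2.5, 3.33, 5.0, 10.0]
print(1 / duration[0] == turnover[0])   # True: the reciprocal recovers the data
```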
Deflating
Deflating turns a nominal series into a real one, i.e. one that reflects changes
in quantities without the contamination of price changes. This is dealt with in
more detail in Chapter 10. It is often more meaningful in economic terms to talk
about a real variable than a nominal one. Consumers are more concerned about
their real income than about their money income, for example.
Confusing real and nominal variables is dangerous! For example, someone's
nominal (money) income may be rising, yet their real income falling, if prices
are rising faster than money income. It is important to know which series you
are dealing with; this is a common failing among students new to statistics and
economics. An income series that is growing at 2–3% per annum is probably a
real series; one that is growing at 10% per annum or more is likely to be nominal.
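A minimal sketch of deflating, with invented figures in which money income grows about 10% a year but prices grow faster:

```python
# Deflating: real = nominal / price index * 100 (base year index = 100).
nominal = [20_000, 22_000, 24_200]   # invented money income
price_index = [100, 115, 132]        # invented price index

real = [round(n / p * 100) for n, p in zip(nominal, price_index)]
print(real)   # [20000, 19130, 18333] -- falling despite rising money income
```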
Guidance to the student: how to measure your progress
Now you have reached the end of the chapter, your work is not yet over! It is
very unlikely that you have fully understood everything after one read through.
What you should do now is:
● Check back over the learning outcomes at the start of the chapter. Do
you feel you have achieved them? For example, can you list the various
different data types you should be able to recognise (the first learning
outcome)?
● Read the chapter summary below to help put things in context. You should
recognise each topic and be aware of the main issues, techniques, etc. within
them. There should be no surprises or gaps!
● Read the list of key terms. You should be able to give a brief and precise
definition or description of each one. Do not worry if you cannot remember
all the formulae, although you should try to memorise simple ones, such as
that for the mean.
● Try out the problems (most important!). Answers to odd-numbered problems
are at the back of the book, so you can check your answers. There is more
detail for some of the answers on the book's web site.
From all of this you should be able to work out whether you have really
mastered the chapter. Do not be surprised if you have not – it will take more than
one reading. Go back over those parts where you feel unsure of your knowledge.
Use these same learning techniques for each chapter of the book.
Summary
● Descriptive statistics are useful for summarising large amounts of information,
highlighting the main features but omitting the detail.
● Different techniques are suited to different types of data, e.g. bar charts for
cross-section data and rates of growth for time series.
● Graphical methods, such as the bar chart, provide a picture of the data. These
give an informal summary, but they are unsuitable as a basis for further
analysis.
● Important graphical techniques include the bar chart, frequency distribution,
relative and cumulative frequency distributions, histogram and pie chart. For
time-series data a time-series chart of the data is informative.
● Numerical techniques are more precise as summaries. Measures of location
(such as the mean), of dispersion (the variance) and of skewness form the
basis of these techniques.
● Important numerical summary statistics include the mean, median and
mode; variance, standard deviation and coefficient of variation; coefficient of
skewness.
● For bivariate data the scatter diagram (or XY graph) is a useful way of
illustrating the data.
● Data are often transformed in some way before analysis, for example by
taking logs. Transformations often make it easier to see key features of the
data in graphs and sometimes make summary statistics easier to interpret.
For example, with time-series data the average rate of growth may be more
appropriate than the mean of the series.
Reference
Atkinson, A. B., The Economics of Inequality, 2nd edn., Oxford University Press, 1983.

Key terms and concepts
bar chart
box and whiskers plot
coefficient of variation
compound growth
cross-section data
cross-tabulation
data transformation
frequencies
frequency table
histogram
mean
median
mode
outliers
pie chart
quantiles
relative and cumulative frequencies
scatter diagram (XY chart)
skewness
standard deviation
time-series data
variance
z-score
Problems
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
1.1 The following data show the education and employment status of women aged 20–29
from the General Household Survey:
              Higher      A levels   Other           No              Total
              education              qualification   qualification
In work       209         182        577             92              1060
Unemployed    12          9          68              32              121
Inactive      17          34         235             136             422
Sample        238         225        880             260             1603
(a) Draw a bar chart of the numbers in work in each education category. Can this be easily compared with the similar diagram in Figure 1.1?
(b) Draw a stacked bar chart using all the employment states, similar to Figure 1.3. Comment upon any similarities and differences from the diagram in the text.
(c) Convert the table into column percentages and produce a stacked bar chart similar to Figure 1.4. Comment upon any similarities and differences.
(d) Draw a pie chart showing the distribution of educational qualifications of those in work and compare it to Figure 1.5 in the text.
1.2 The data below show the median weekly earnings (in £s) of those in full-time employment in Great Britain in 1992, by category of education.

            Degree   Other higher   A level   GCSE A–C   GCSE D–G   None
                     education
Males       433      310            277       242        226        220
Females     346      278            201       183        173        146
(a) In what fundamental way do the data in this table differ from those in Problem 1.1?
(b) Construct a bar chart showing male and female earnings by education category. What does it show?
(c) Why would it be inappropriate to construct a stacked bar chart of the data? How should one graphically present the combined data for males and females? What extra information is necessary for you to do this?
1.3 Using the data from Problem 1.1:
(a) Which education category has the highest proportion of women in work? What is the proportion?
(b) Which category of employment status has the highest proportion of women with a degree? What is the proportion?
1.4 Using the data from Problem 1.2:
(a) What is the premium, in terms of median earnings, of a degree over A levels? Does this differ between men and women?
(b) Would you expect mean earnings to show a similar picture? What differences, if any, might you expect?
1.5 The distribution of marketable wealth in 1979 in the UK is shown in the table below (taken from Inland Revenue Statistics 1981, p. 105):

Range       Number (000s)   Amount (£m)
0–          1606            148
1000–       2927            5985
3000–       2562            10 090
5000–       3483            25 464
10 000–     2876            35 656
15 000–     1916            33 134
20 000–     3425            104 829
50 000–     621             46 483
100 000–    170             25 763
200 000–    59              30 581
Draw a bar chart and histogram of the data (assume the final class interval has a width of 200 000). Comment on the differences between the two. Comment on any differences between this histogram and the one for 1994 given in the text.
1.6 The data below show the number of manufacturing plants in the UK in 1991/92, arranged according to employment:

Number of employees   Number of firms
1–                    95 409
10–                   15 961
20–                   16 688
50–                   7229
100–                  4504
200–                  2949
500–                  790
1000–                 332
Draw a bar chart and histogram of the data (assume the mid-point of the last class interval is 2000). What are the major features apparent in each, and what are the differences?
1.7 Using the data from Problem 1.5:
(a) Calculate the mean, median and mode of the distribution. Why do they differ?
(b) Calculate the inter-quartile range, variance, standard deviation and coefficient of variation of the data.
(c) Calculate the skewness of the distribution.
(d) From what you have calculated, and the data in the chapter, can you draw any conclusions about the degree of inequality in wealth holdings, and how this has changed?
(e) What would be the effect upon the mean of assuming the final class width to be £10m? What would be the effects upon the median and mode?
1.8 Using the data from Problem 1.6:
(a) Calculate the mean, median and mode of the distribution. Why do they differ?
(b) Calculate the inter-quartile range, variance, standard deviation and coefficient of variation of the data.
(c) Calculate the coefficient of skewness of the distribution.
1.9 A motorist keeps a record of petrol purchases on a long journey, as follows:

Petrol station     1      2      3
Litres purchased   33     40     25
Price per litre    55.7   59.6   57.0

Calculate the average petrol price for the journey.
1.10 Demonstrate that the weighted average calculation given in equation 1.9 is equivalent to
ﬁnding the total expenditure on education divided by the total number of pupils.
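Problem 1.10's claim can be checked numerically before attempting the algebra; the figures below are invented for illustration:

```python
# Three hypothetical schools: the weighted average of per-pupil spending
# (weights = pupil numbers) equals total expenditure / total pupils.
pupils = [200, 300, 500]               # the weights
spend_per_pupil = [4000, 5000, 6000]

total_expenditure = sum(w * x for w, x in zip(pupils, spend_per_pupil))
weighted_avg = total_expenditure / sum(pupils)

print(weighted_avg)   # 5300.0, i.e. 5 300 000 / 1000
```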
1.11 On a test taken by 100 students, the average mark is 65, with variance 144. Student A scores 83, student B scores 47.
(a) Calculate the z-scores for these two students.
(b) What is the maximum number of students with a score either better than A's or worse than B's?
(c) What is the maximum number of students with a score better than A's?
1.12 The average income of a group of people is £8000, and 80% of the group have incomes within the range £6000–10 000. What is the minimum value of the standard deviation of the distribution?
1.13 The following data show car registrations in the UK during 1970–91 (source: ETAS, 1993, p. 57):
Year Registrations Year Registrations Year Registrations
1970 91.4 1978 131.6 1986 156.9
1971 108.5 1979 142.1 1987 168.0
1972 177.6 1980 126.6 1988 184.2
1973 137.3 1981 124.5 1989 192.1
1974 102.8 1982 132.1 1990 167.1
1975 98.6 1983 150.5 1991 133.3
1976 106.5 1984 146.6 – –
1977 109.4 1985 153.5 – –
(a) Draw a time-series graph of car registrations. Comment upon the main features of the series.
(b) Draw time-series graphs of the change in registrations, the natural log of registrations, and the change in the ln. Comment upon the results.
1.14 The table below shows the different categories of investment, 1986–2005.

Year   Dwellings   Transport   Machinery   Intangible     Other buildings
                                           fixed assets
1986 14 140 6527 25 218 2184 20 477
1987 16 548 7872 28 225 2082 24 269
1988 21 097 9227 32 614 2592 30 713
1989 22 771 10 624 38 417 2823 36 689
1990 21 048 10 571 37 776 3571 41 334
1991 18 339 9051 35 094 4063 38 632
1992 18 826 8420 35 426 3782 34 657
1993 19 886 9315 35 316 3648 32 988
1994 21 155 11 395 38 426 3613 33 945
1995 22 448 11 036 45 012 3939 35 596
1996 22 516 12 519 50 102 4136 37 320
1997 23 928 12 580 51 465 4249 41 398
1998 25 222 16 113 58 915 4547 46 286
1999 25 700 14 683 60 670 4645 50 646
2000 27 394 13 577 63 535 4966 51 996
2001 29 806 14 656 60 929 5016 55 065
2002 34 499 16 314 57 152 5588 59 972
2003 38 462 15 592 54 441 5901 64 355
2004 44 299 14 939 57 053 6395 71 805
2005 48 534 15 351 57 295 6757 77 906
Use appropriate graphical techniques to analyse the properties of any one of the investment series. Comment upon the results.
1.15 Using the data from Problem 1.13:
(a) Calculate the average rate of growth of the series.
(b) Calculate the standard deviation around the average growth rate.
(c) Does the series appear to be more or less volatile than the investment figures used in the chapter? Suggest reasons.
1.16 Using the data from Problem 1.14:
(a) Calculate the average rate of growth of the series for dwellings.
(b) Calculate the standard deviation around the average growth rate.
(c) Does the series appear to be more or less volatile than the investment figures used in the chapter? Suggest reasons.
1.17 How would you expect the following time-series variables to look when graphed? (e.g. trended? linear trend? trended up or down? stationary? homoscedastic? autocorrelated? cyclical? anything else?)
(a) Nominal national income.
(b) Real national income.
(c) The nominal interest rate.
1.18 How would you expect the following time-series variables to look when graphed?
(a) The price level.
(b) The inflation rate.
(c) The £/ exchange rate.
1.19 (a) A government bond is issued, promising to pay the bearer £1000 in five years' time. The prevailing market rate of interest is 7%. What price would you expect to pay now for the bond? What would its price be after two years? If, after two years, the market interest rate jumped to 10%, what would the price of the bond be?
(b) A bond is issued which promises to pay £200 per annum over the next five years. If the prevailing market interest rate is 7%, how much would you be prepared to pay for the bond? Why does the answer differ from the previous question? (Assume interest is paid at the end of each year.)
1.20 A firm purchases for £30 000 a machine that is expected to last for 10 years, after which it will be sold for its scrap value of £3000. Calculate the average rate of depreciation per annum, and calculate the written-down value of the machine after one, two and five years.
1.21 Depreciation of BMW and Mercedes cars is given in the following table:

Age       BMW 525i   Mercedes 200E
Current   22 275     21 900
1 year    18 600     19 700
2 years   15 200     16 625
3 years   12 600     13 950
4 years   9750       11 600
5 years   8300       10 300
(a) Calculate the average rate of depreciation of each type of car.
(b) Use the calculated depreciation rates to estimate the value of each car after 1, 2 (etc.) years of age. How does this match the actual values?
(c) Graph the values and estimated values for each car.
1.22 A bond is issued which promises to pay £400 per annum in perpetuity. How much is the bond worth now, if the interest rate is 5%? (Hint: the sum of an infinite series of the form
1/(1 + r) + 1/(1 + r)² + 1/(1 + r)³ + ...
is 1/r, as long as r > 0.)
1.23 Demonstrate, using Σ notation, that E(x + k) = E(x) + k.
1.24 Demonstrate, using Σ notation, that V(kx) = k²V(x).
1.25 Criticise the following statistical reasoning: 'The average price of a dwelling is £54 150. The average mortgage advance is £32 760. So purchasers have to find £21 390, that is, about 40% of the purchase price. On any basis that is an enormous outlay which young couples, in particular those who are buying a house for the first time, would find incredibly difficult, if not impossible, to raise.'
1.26 Criticise the following statistical reasoning: 'Among arts graduates, 10% fail to find employment. Among science graduates, only 8% remain out of work. Therefore science graduates are better than arts graduates.' (Hint: imagine there are two types of job, popular and unpopular. Arts graduates tend to apply for the former, scientists for the latter.)
1.27 Project 1: Is it true that the Conservative government in the UK (1979–1997) lowered taxes, while the Labour government (1997–2007) raised them?
You should gather data that you think are appropriate to the task, summarise them as necessary and write a brief report of your findings. You might like to consider the following points:
● Should one consider tax revenue, or revenue as a proportion of gross national product (GNP)?
● Should one distinguish between tax rates and the tax base (i.e. what is taxed)?
● Has the balance between direct and indirect taxation changed?
● Have different sections of the population fared differently?
You might like to consider other points, and do the problem for a different country. Suitable data sources for the UK are: Inland Revenue Statistics, UK National Accounts, Annual Abstract of Statistics or Financial Statistics.
1.28 Project 2: Is the employment and unemployment experience of the UK economy worse than that of its competitors? Write a report on this topic in a similar manner to the project above. You might consider rates of unemployment in the UK and other countries; trends in unemployment in each of the countries; the growth in employment in each country; the structure of employment (e.g. full-time/part-time) and unemployment (e.g. long-term/short-term).
You might use data for a number of countries, or concentrate on two in more depth. Suitable data sources are: OECD Main Economic Indicators; European Economy (published by the European Commission); Employment Gazette.
Answers to exercises

Exercise 1.1
It is clear the English are more likely to visit Spain than are other nationalities.
Exercise 1.2
a Bar chart
Histogram
Exercise 1.3
(a)
Range    Midpoint x   Frequency f   fx
0–10     5            20            100
11–30    20           40            800
31–60    45           30            1350
60–100   80           20            1600
Total                 110           3850

Hence the mean = 3850/110 = 35.
The median is contained in the 11–30 group and is 35/40 of the way through
the interval (20 + 35 moves us to observation 55). Hence the median is 11 + (35/40)
× 19 = 27.625.
The mode is anywhere in the 0–30 range; the frequency density is the same
throughout this range.
b
Exercise 1.4
(a) Q1 relates to observation 27.5 (= 110/4). This observation lies in the 11–30 range.
There are 20 observations in the first class interval, so Q1 will relate to observation
7.5 in the second interval. Hence we need to go 7.5/40 of the way through
the interval. This gives 11 + (7.5/40) × 19 = 14.6. Similarly, Q3 is 22.5/30 of the
way through the third interval, yielding Q3 = 31 + (22.5/30) × 29 = 52.8. The IQR
is therefore 38, approximately. For the variance we obtain ∑fx = 3850 and ∑fx² =
205 250. The variance is therefore σ² = 205 250/110 − 35² = 640.9 and the standard
deviation 25.3.
(b) CV = 25.3/35 = 0.72.
(c) 1.3 × 25.3 = 32.9, not far from the IQR value of 38.
(d) 1 standard deviation either side of the mean takes us from 9.7 up to 60.3. This
contains all 70 observations in the second and third intervals, plus perhaps one
from the first interval. Thus we obtain approximately 71 observations within this
range. Chebyshev's inequality does not help us here, as it is not defined for k = 1.
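The grouped-data calculations in Exercises 1.3 and 1.4 can be reproduced in a few lines:

```python
# Mean and variance from grouped data: midpoints, frequencies, then
# sum f*x and sum f*x^2, as in the answers above.
mid = [5, 20, 45, 80]
f = [20, 40, 30, 20]

n = sum(f)                                        # 110
fx = sum(fi * xi for fi, xi in zip(f, mid))       # 3850
fx2 = sum(fi * xi ** 2 for fi, xi in zip(f, mid)) # 205 250

mean = fx / n                                     # 35.0
variance = fx2 / n - mean ** 2                    # 640.9, as in the text
print(mean, round(variance, 1), round(variance ** 0.5, 1))  # 35.0 640.9 25.3
```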
Exercise 1.5
a
b
Using the second axis brings out the variability of proﬁts relative to sales.
Exercise 1.6
(a) The average profit is 35. The average rate of growth is calculated by comparing
the end values (50 and 40) over the 10-year period. The ratio is 0.8. Taking the
ninth root of this (nine years of growth) gives 0.8^(1/9) = 0.976, so the annual rate of
growth is 0.976 − 1 = −0.024, i.e. −2.4%.
(b) The variances are (using the sample variance formula): for profits, ∑(x − μ)² = 4800
and dividing by 9 gives 533.3. For sales, the mean is 291 and ∑(x − μ)² = 4540.
The variance is therefore 4540/9 = 504.4. This is similar in absolute size to the
variance of profits, but relative to the mean it is much smaller.

Exercise 1.7
(a)/(b)
The trend line seems to show a positive relationship between the variables:
higher profits are associated with higher sales.
Appendix 1A: Σ notation

The Greek symbol Σ (capital sigma) means 'add up' and is a shorthand way of
writing what would otherwise be long algebraic expressions. Instead of writing
out each term in the series, we provide a template, or typical term of the series,
with instructions about how many terms there are.

For example, given the following observations on x:

x1   x2   x3   x4   x5
3    5    6    4    8

then

Σ (i=1 to 5) xi = x1 + x2 + x3 + x4 + x5 = 3 + 5 + 6 + 4 + 8 = 26

The template is simply x in this case, representing a number to be added in
the series. To expand the sigma expression, the subscript i is replaced by successive
integers, beginning with the one below the Σ sign and ending with the one
above it (1 to 5 in the example above). Hence the instruction is to add the terms
x1 to x5. Similarly

Σ (i=2 to 4) xi = x2 + x3 + x4 = 5 + 6 + 4 = 15

The instruction tells us to add up only the second, third and fourth terms of
the series. When it is clear what range of values i takes (usually when we are to
add all available values), the formula can be simplified to Σxi or Σx.

When frequencies are associated with each of the observations, as in the data
below:

i    1   2   3   4   5
xi   3   5   6   4   8
fi   2   2   4   3   1

then

Σ fixi = f1x1 + ... + f5x5 = 2 × 3 + 2 × 5 + ... + 1 × 8 = 60

And also

Σ fi = 2 + 2 + 4 + 3 + 1 = 12

Thus the sum of the 12 observations is 60 and the mean is Σfx/Σf = 60/12 = 5.
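The sums above can be written out in code with the same data:

```python
# The appendix's Σ examples, computed directly.
x = [3, 5, 6, 4, 8]
f = [2, 2, 4, 3, 1]

print(sum(x))                                  # 26, i.e. x1 + ... + x5
print(sum(x[1:4]))                             # 15, i.e. x2 + x3 + x4
fx = sum(fi * xi for fi, xi in zip(f, x))
print(fx)                                      # 60, i.e. Σfx
print(sum(f), fx / sum(f))                     # 12 5.0, i.e. Σf and the mean
```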
We are not limited just to adding the x values. For example, we might wish
to square each observation before adding them together. This is expressed as

Σx² = x1² + x2² + ... + x5² = 150

Note that this is different from

(Σx)² = (x1 + x2 + ... + x5)² = 676

Part of the formula for the variance calls for the following calculation:

Σfx² = f1x1² + f2x2² + ... + f5x5² = 2 × 3² + 2 × 5² + ... + 1 × 8² = 324

Using Σ notation we can see the effect of transforming x by dividing by 1000,
as was done in calculating the average level of wealth. Instead of working with
x we used kx, where k = 1/1000. In finding the mean we calculated

Σkx/N = (kx1 + kx2 + ...)/N = k(x1 + x2 + ...)/N = kΣx/N    (1.34)

So to find the mean of the original variable x, we had to divide by k again, i.e.
multiply by 1000. In general, whenever each observation in a sum is multiplied
by a constant, the constant can be taken outside the summation operator, as in
equation (1.34) above.
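The same data illustrate the distinction between Σx² and (Σx)², and the variance ingredient Σfx²:

```python
# Σx², (Σx)² and Σfx², with the appendix's data.
x = [3, 5, 6, 4, 8]
f = [2, 2, 4, 3, 1]

print(sum(v ** 2 for v in x))                     # 150, i.e. Σx²
print(sum(x) ** 2)                                # 676, i.e. (Σx)²
print(sum(fi * xi ** 2 for fi, xi in zip(f, x)))  # 324, i.e. Σfx²
```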
Problems on Σ notation

1A.1 Given the following data on xi: 4, 6, 3, 2, 5, evaluate:
∑xi, ∑xi², (∑xi)², ∑(xi − 3), ∑xi − 3

1A.2 Given the following data on xi: 8, 12, 6, 4, 10, evaluate:
∑xi, ∑xi², (∑xi)², ∑(xi − 3), ∑xi − 3

1A.3 Given the following frequencies, fi, associated with the x values in Problem 1A.1: 5, 3, 3, 8, 5, evaluate:
∑fx, ∑fx², ∑f(x − 3), ∑fx − 3

1A.4 Given the following frequencies, fi, associated with the x values in Problem 1A.2: 10, 6, 6, 16, 10, evaluate:
∑fx, ∑fx², ∑f(x − 3), ∑fx − 3

1A.5 Given the pairs of observations on x and y

x   4   3   6   8   12
y   3   9   1   4   3

evaluate: ∑xy, ∑x(y − 3), ∑(x + 2)(y − 1)
1A.6 Given the pairs of observations on x and y

x   3   7   4   1   9
y   1   2   5   1   2

evaluate: ∑xy, ∑x(y − 2), ∑(x − 2)(y + 1)

1A.7 Demonstrate that

∑f(x − k)/∑f = ∑fx/∑f − k

where k is a constant.

1A.8 Demonstrate that

∑f(x − μ)²/∑f = ∑fx²/∑f − μ²
Appendix 1B: E and V operators

These operators are an extremely useful form of notation that we shall make use
of later in the book. It is quite easy to keep track of the effects of data transformations
using them. There are a few simple rules for manipulating them that
allow some problems to be solved quickly and elegantly.

E(x) is the mean of a distribution and V(x) is its variance. We showed above,
in equation (1.34), that multiplying each observation by a constant k multiplies
the mean by k. Thus we have

E(kx) = kE(x)    (1.35)

Similarly, if a constant is added to every observation, the effect is to add that
constant to the mean (see Problem 1.23):

E(x + a) = E(x) + a    (1.36)

Graphically, the whole distribution is shifted a units to the right and hence
so is the mean. Combining equations (1.35) and (1.36):

E(kx + a) = kE(x) + a    (1.37)

Similarly, for the variance operator it can be shown that

V(x + k) = V(x)    (1.38)

Proof:

V(x + k) = ∑((x + k) − (μ + k))²/N = ∑(x − μ + k − k)²/N = ∑(x − μ)²/N = V(x)

A shift of the whole distribution leaves the variance unchanged. Also

V(kx) = k²V(x)    (1.39)
(see Problem 1.24 above). This is why, when the wealth figures were divided
by 1000, the variance became divided by 1000². Applying (1.38) and (1.39):

V(kx + a) = k²V(x)    (1.40)

Finally, we should note that V itself can be expressed in terms of E:

V(x) = E(x − E(x))²    (1.41)
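Rules (1.35)–(1.40) can be checked numerically on an arbitrary (invented) data set, using population mean and variance:

```python
import statistics as st

x = [2, 4, 4, 4, 5, 5, 7, 9]     # invented data
k, a = 3, 10
y = [k * v + a for v in x]       # the transformed data, kx + a

print(st.mean(x) == 5, st.pvariance(x) == 4)        # True True
print(st.mean(y) == k * st.mean(x) + a)             # True: E(kx+a) = kE(x)+a
print(st.pvariance(y) == k ** 2 * st.pvariance(x))  # True: V(kx+a) = k²V(x)
```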
Appendix 1C Using logarithms
Logarithms are less often used now that cheap electronic calculators are avail-
able. Formerly logarithms were an indispensable aid to calculation. However
the logarithmic transformation is useful in other contexts in statistics and eco-
nomics so its use is brieﬂy set out here.
The logarithm to the base 10 of a number x is deﬁned as the power to which
10 must be raised to give x. For example 10
2
100 so the log of 100 is 2 and we
write log
10
100 2 or simply log 100 2.
Similarly the log of 1000 is 3 1000 10
3
of 10 000 it is 4 etc. We are not re-
stricted to integer whole number powers of 10 so for example 10
2.5
316.227766
try this if you have a scientiﬁc calculator so the log of 316.227766 is 2.5. Every
number x can therefore be represented by its logarithm.
Multiplication of two numbers
We can use logarithms to multiply two numbers x and y based on the property
8
log xy log x + log y
For example to multiply 316.227766 by 10
log316.227766 × 10 log 316.227766 + log 10
2.5 + 1
3.5
The anti-log of 3.5 is given by 10
3.5
3162.27766 which is the answer.
Taking the anti-log i.e. 10 raised to a power is the inverse of the log transforma-
tion. Schematically we have
x → take logarithms → a log x → raise 10 to the power a → x
Division

To divide one number by another we subtract the logs. For example, to divide 316.227766 by 100:

log(316.227766/100) = log 316.227766 − log 100 = 2.5 − 2 = 0.5

and 10^0.5 = 3.16227766.
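The multiplication and division rules can be confirmed with a short Python sketch (a numerical check, not part of the original text):

```python
import math

x = 316.227766

# Multiplication: log(xy) = log x + log y
log_product = math.log10(x) + math.log10(10)   # 2.5 + 1 = 3.5
product = 10 ** log_product                    # anti-log recovers x * 10
assert math.isclose(product, x * 10)           # approximately 3162.27766

# Division: log(x/y) = log x - log y
log_quotient = math.log10(x) - math.log10(100)  # 2.5 - 2 = 0.5
quotient = 10 ** log_quotient
assert math.isclose(quotient, x / 100)          # approximately 3.16227766
```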
STFE_C01.qxd 26/02/2009 09:05 Page 78

slide 96:

79
Appendix 1C: Using logarithms
Powers and roots

Logarithms simplify the process of raising a number to a power. To find the square of a number, multiply the logarithm by 2, e.g. to find 316.227766²:

log(316.227766²) = 2 log 316.227766 = 5

and 10^5 = 100 000.

To find the square root of a number (equivalent to raising it to the power 1/2) divide the log by 2. To find the nth root, divide the log by n. For example, in the text we have to find the 32nd root of 13.518:

log 13.518/32 = 1.1309/32 = 0.0353

and 10^0.0353 = 1.085.
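As a numerical check of the 32nd-root example (a Python sketch, not part of the original text):

```python
import math

# 32nd root of 13.518: divide the common log by 32, then take the anti-log
log_root = math.log10(13.518) / 32      # approximately 1.1309 / 32 = 0.0353
root = 10 ** log_root

# Same answer as raising 13.518 to the power 1/32 directly
assert math.isclose(root, 13.518 ** (1 / 32))
assert round(root, 3) == 1.085
```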
Common and natural logarithms
Logarithms to the base 10 are known as common logarithms, but one can use any number as the base. Natural logarithms are based on the number e (= 2.71828 . . .) and we write ln x instead of log x to distinguish them from common logarithms. So, for example,

ln 316.227766 = 5.756462732

since e^5.756462732 = 316.227766. Natural logarithms can be used in the same way as common logarithms and have similar properties. Use the ‘ln’ key on your calculator just as you would the ‘log’ key, but remember that the inverse transformation is e^x rather than 10^x.
Problems on logarithms
1C.1 Find the common logarithms of: 0.15, 1.5, 15, 150, 1500, 83.7225, 9.15, −12.
1C.2 Find the log of the following values: 0.8, 8, 80, 4, 16, −37.
1C.3 Find the natural logarithms of: 0.15, 1.5, 15, 225, −4.
1C.4 Find the ln of the following values: 0.3, e, 3, 33, −1.
1C.5 Find the anti-log of the following values: −0.823909, 1.1, 2.1, 3.1, 12.
1C.6 Find the anti-log of the following values: −0.09691, 2.3, 3.3, 6.3.
1C.7 Find the anti-ln of the following values: 2.70805, 3.70805, 1, 10.
1C.8 Find the anti-ln of the following values: 3.496508, 14, 15, −1.
1C.9 Evaluate: 4^(1/4), 12^(−3), 25^(−3/2).
1C.10 Evaluate: 8^(1/4), 15^0, 12^0, 3^(−1/3).
2 Probability
Contents
Learning outcomes 80
Probability theory and statistical inference 81
The deﬁnition of probability 81
The frequentist view 82
The subjective view 83
Probability theory: the building blocks 84
Compound events 85
The addition rule 85
The multiplication rule 86
Combining the addition and multiplication rules 88
Tree diagrams 88
Combinations and permutations 89
Bayes’ theorem 91
Decision analysis 93
Decision criteria: maximising the expected value 95
Maximin, maximax and minimax regret 96
The expected value of perfect information 97
Summary 98
Key terms and concepts 98
Problems 99
Answers to exercises 105
Learning outcomes

By the end of this chapter you should be able to:
● understand the essential concept of the probability of an event occurring
● appreciate that the probability of a combination of events occurring can be calculated using simple arithmetic rules (the addition and multiplication rules)
● understand that a probability can depend upon the outcome of other events (conditional probability)
● know how to make use of probability theory to help make decisions in situations of uncertainty.

Complete your diagnostic test for Chapter 2 now to create your personal study plan. Exercises with an icon are also available for practice in MathXL with additional supporting resources.
Probability theory and statistical inference
In October 1985, Mrs Evelyn Adams of New Jersey, USA, won $3.9m in the State lottery at odds of 1 in 3 200 000. In February 1986 she won again, although this time only $1.4m, at odds of 1 in 5 200 000. The odds against both these wins were calculated at about 1 in 17 300 bn. Mrs Adams is quoted as saying ‘They say good things come in threes, so . . .’.
The above story illustrates the principles of probability at work. The same principles underlie the theory of statistical inference, which is the task of drawing conclusions (inferences) about a population from a sample of data drawn from that population. For example, we might have a survey which shows that 30% of a sample of 100 families intend to take a holiday abroad next year. What can we conclude from this about all families? The techniques set out in this and subsequent chapters show how to accomplish this.
Why is knowledge of probability necessary for the study of statistical inference? In order to be able to say something about a population on the basis of some sample evidence, we must first examine how the sample data are collected. In many cases the sample is a random one, i.e. the observations making up the sample are chosen at random from the population. If a second sample were selected, it would almost certainly be different from the first. Each member of the population has a particular probability of being in the sample; in simple random sampling the probability is the same for all members of the population. To understand sampling procedures, and the implications for statistical inference, we must therefore first examine the theory of probability.
As an illustration of this, suppose we wish to know if a coin is fair, i.e. equally likely to fall heads or tails. The coin is tossed 10 times and 10 heads are recorded. This constitutes a random sample of tosses of the coin. What can we infer about the coin? If it is fair, the probability of getting ten heads is 1 in 1024, so a fairly unlikely event seems to have happened. We might reasonably infer, therefore, that the coin is biased towards heads.
The deﬁnition of probability
The first task is to define precisely what is meant by probability. This is not as easy as one might imagine, and there are a number of different schools of thought on the subject. Consider the following questions:

● What is the probability of ‘heads’ occurring on the toss of a coin?
● What is the probability of a driver having an accident in a year of driving?
● What is the probability of a country such as Peru defaulting on its international loan repayments (as Mexico did in the 1980s)?

We shall use these questions as examples when examining the different schools of thought on probability.
The frequentist view
Considering the first question above, the frequentist view would be that the probability is equal to the proportion of heads obtained from a coin in the long run, i.e. if the coin were tossed many times. The first few results of such an experiment might be

H T T H H H T H T . . .

After a while, the proportion of heads settles down at some particular fraction and subsequent tosses will individually have an insignificant effect upon the
value. Figure 2.1 shows the result of tossing a coin 250 times and recording the proportion of heads (actually this was simulated on a computer: life is too short to do it for real).
This shows the proportion settling down at a value of about 0.50, which indicates an unbiased coin (or rather, an unbiased computer in this case). This value is the probability, according to the frequentist view. To be more precise, the probability is defined as the proportion of heads obtained as the number of tosses approaches infinity. In general, we can define Pr(H), the probability of event H (in this case heads) occurring, as

Pr(H) = number of occurrences of H / number of trials

as the number of trials approaches infinity. In this case each toss of the coin constitutes a trial.
This deﬁnition gets round the obvious question of how many trials are
needed before the probability emerges but means that the probability of an
event cannot strictly be obtained in ﬁnite time.
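The experiment in Figure 2.1 is easy to replicate. A minimal Python sketch (the seed and number of tosses are arbitrary choices) simulates tosses and tracks the running proportion of heads:

```python
import random

random.seed(1)  # fixed seed so the 'experiment' is reproducible

# Simulate 250 tosses of a fair coin; track the running proportion of heads
heads = 0
proportions = []
for n in range(1, 251):
    heads += random.random() < 0.5   # one 'trial' of the experiment
    proportions.append(heads / n)

# For a fair coin the proportion settles down near 0.5
print(proportions[-1])
```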
Although this approach appears attractive in theory, it does have its problems. One could not actually toss the coin an infinite number of times. Or what if one took a different coin: would the results from the first coin necessarily apply to the second?
Figure 2.1 The proportion of heads in 250 tosses of a fair coin

Perhaps more seriously, the definition is of less use for the second and third questions posed above. Calculating the probability of an accident is not too
problematic: it may be defined as the proportion of all drivers having an accident during the year. However, this may not be relevant for a particular driver, since drivers vary so much in their accident records. And how would you answer the third question? There is no long run that we can appeal to. We cannot re-run history over and over again to see in what proportion of cases the country defaults. Yet this is what lenders want to know and credit-rating agencies have to assess. Maybe another approach is needed.
The subjective view
According to the subjective view, probability is a degree of belief that someone holds about the likelihood of an event occurring. It is inevitably subjective, and therefore some argue that it should be the degree of belief that it is rational to hold, but this just shifts the argument to what is meant by ‘rational’. Some progress can be made by distinguishing between prior and posterior beliefs. The former are those held before any evidence is considered; the latter are the modified probabilities in the light of the evidence. For example, one might initially believe a coin to be fair (the prior probability of heads is one-half), but not after seeing only five heads in fifty tosses (the posterior probability would be less than a half).
Although it has its attractions, this approach, which is the basis of Bayesian statistics, also has its drawbacks. It is not always clear how one should arrive at the prior beliefs, particularly when one really has no prior information. Also, these methods often require the use of sophisticated mathematics, which may account for the limited use made of them. The development of more powerful computers and user-friendly software may increase the popularity of the Bayesian approach.
There is not universal agreement, therefore, as to the precise definition of probability. We do not have space here to explore the issue further, so we will ignore the problem! The probability of an event occurring will be defined as a certain value and we will not worry about the precise origin or meaning of that value. This is an axiomatic approach: we simply state what the probability is, without justifying it, and then examine the consequences.
Exercise 2.1
(a) Define the probability of an event according to the frequentist view.
(b) Define the probability of an event according to the subjective view.

Exercise 2.2
For the following events, suggest how their probability might be calculated. In each case consider whether you have used the frequentist or subjective view of probability (or possibly some mixture).
(a) The Republican party winning the next US election.
(b) The number 5 being the first ball drawn in next week’s lottery.
(c) A repetition of the 2004 Asian tsunami.
(d) Your train home being late.
Probability theory: the building blocks
We start with a few definitions, to establish a vocabulary that we will subsequently use.
● An experiment is an action, such as flipping a coin, which has a number of possible outcomes or events, such as heads or tails.
● A trial is a single performance of the experiment with a single outcome.
● The sample space consists of all the possible outcomes of the experiment. The outcomes for a single toss of a coin are {heads, tails}, for example, and these constitute the sample space for a toss of a coin. The outcomes in the sample
space are mutually exclusive which means that the occurrence of one rules
out all the others. One cannot have both heads and tails in a single toss of a
coin. As a further example if a single card is drawn at random from a pack
then the sample space may be drawn as in Figure 2.2. Each point represents
one card in the pack and there are 52 points altogether. The sample space
could be set out in alternative ways. For instance one could write a list of all
the cards: ace of spades king of spades... two of clubs. One can choose the
representation most suitable for the problem at hand.
● With each outcome in the sample space we can associate a probability, which is the chance of that outcome occurring. The probability of heads is one-half; the probability of drawing the ace of spades from a pack of cards is one in 52, etc.
There are restrictions upon the probabilities we can associate with the outcomes
in the sample space. These are needed to ensure that we do not come up with
self-contradictory results; for example, it would be odd to arrive at the conclusion that we could expect heads more than half the time and tails more than half the time. To ensure our results are always consistent, the following rules apply to probabilities:
● The probability of an event must lie between 0 and 1, i.e.

0 ≤ Pr(A) ≤ 1, for any event A (2.1)

The explanation is straightforward. If A is certain to occur, it occurs in 100% of all trials and so its probability is 1. If A is certain not to occur, then its probability is 0, since it never happens however many trials there are. As one cannot be more certain than certain, probabilities of less than 0 or more than 1 can never occur, and equation (2.1) follows.
● The sum of the probabilities associated with all the outcomes in the sample space is 1. Formally,

∑ Pi = 1 (2.2)
Figure 2.2 The sample space for drawing from a pack of cards
where Pi is the probability of event i occurring. This follows from the fact that one and only one of the outcomes must occur, since they are mutually exclusive and also exhaustive, i.e. they define all the possibilities.
● Following on from equation (2.2), we may define the complement of an event as everything in the sample space apart from that event. The complement of heads is tails, for example. If we write the complement of A as not-A, then it follows that Pr(A) + Pr(not-A) = 1 and hence

Pr(not-A) = 1 − Pr(A) (2.3)
Compound events
Most practical problems require the calculation of the probability of a set of outcomes rather than just a single one, or the probability of a series of outcomes in separate trials. For example, the probability of drawing a spade at random from a pack of cards encompasses 13 points in the sample space (one for each spade). This probability is 13 out of 52, or one-quarter, which is fairly obvious, but for more complex problems the answer is not immediately evident. We refer to such sets of outcomes as compound events. Some examples are getting a five or a six on a throw of a die, or drawing an ace and a queen to complete a ‘straight’ in a game of poker.
It is sometimes possible to calculate the probability of a compound event by examining the sample space, as in the case of drawing a spade above. However, in many cases this is not so, for the sample space is too complex or even impossible to write down. For example, the sample space for three draws of a card from a pack consists of over 140 000 points! A typical point might be, for example, the ten of spades, eight of hearts and three of diamonds. An alternative method is needed. Fortunately there are a few simple rules for manipulating probabilities which help us to calculate the probabilities of compound events.
If the previous examples are examined closely it can be seen that outcomes
are being compounded using the words ‘or’ and ‘and’: ‘. . . ﬁve or six on a single
throw . . .’ ‘. . . an ace and a queen . . .’. ‘And’ and ‘or’ act as operators and
compound events are made up of simple events compounded by these two
operators. The following rules for manipulating probabilities show how to use
these operators and thus how to calculate the probability of a compound event.
The addition rule
This rule is associated with ‘or’. When we want the probability of one outcome
or another occurring we add the probabilities of each. More formally the
probability of A or B occurring is given by

Pr(A or B) = Pr(A) + Pr(B) (2.4)

So, for example, the probability of a five or a six on a roll of a die is

Pr(5 or 6) = Pr(5) + Pr(6) = 1/6 + 1/6 = 1/3 (2.5)
This answer can be veriﬁed from the sample space as shown in Figure 2.3. Each
dot represents a simple event one to six. The compound event is made up of
two of the six points shaded in Figure 2.3 so the probability is 2/6 or 1/3.
However, equation (2.4) is not a general solution to this type of problem, i.e. it does not always work, as can be seen from the following example. What is the probability of a queen or a spade in a single draw from a pack of cards? Pr(Q) = 4/52 (four queens in the pack) and Pr(S) = 13/52 (13 spades), so applying equation (2.4) gives

Pr(Q or S) = Pr(Q) + Pr(S) = 4/52 + 13/52 = 17/52 (2.6)

However, if the sample space is examined, the correct answer is found to be 16/52, as in Figure 2.4. The problem is that one point in the sample space (the one representing the queen of spades) is double-counted, once as a queen and again as a spade. The event ‘drawing a queen and a spade’ is possible and gets double-counted. Equation (2.4) has to be modified by subtracting the probability of getting a queen and a spade, to eliminate this double counting. The correct answer is obtained from

Pr(Q or S) = Pr(Q) + Pr(S) − Pr(Q and S) (2.7)
= 4/52 + 13/52 − 1/52
= 16/52

The general rule is therefore

Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B) (2.8)
Rule (2.4) worked for the die example because Pr(5 and 6) = 0, since a five and a six cannot simultaneously occur. The double counting did not affect the calculation of the probability.
In general, therefore, one should use equation (2.8), but when two events are mutually exclusive the rule simplifies to equation (2.4).
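The queen-or-spade example can be verified by enumerating the sample space directly. A minimal Python sketch (the rank and suit labels are illustrative):

```python
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'hearts', 'diamonds', 'clubs']
pack = list(product(ranks, suits))   # the 52-point sample space

# Count the points that are a queen OR a spade; the queen of spades
# satisfies both conditions but is counted only once
queen_or_spade = [c for c in pack if c[0] == 'Q' or c[1] == 'spades']
print(len(queen_or_spade))  # 16, not 17, in agreement with (2.7)
```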
The multiplication rule
The multiplication rule is associated with use of the word ‘and’ to combine
events. Consider a mother with two children. What is the probability that they are both boys? This is really a compound event: a boy on the first birth and a boy on the second. Assume that in a single birth a boy or girl is equally likely, so Pr(boy) = Pr(girl) = 0.5. Denote by Pr(B1) the probability of a boy on the first birth and by Pr(B2) the probability of a boy on the second. Thus the question asks for Pr(B1 and B2), and this is given by
Figure 2.3 The sample space for rolling a die
Figure 2.4 The sample space for drawing a queen or a spade
Pr(B1 and B2) = Pr(B1) × Pr(B2) = 0.5 × 0.5 = 0.25 (2.9)

Intuitively, the multiplication rule can be understood as follows. One-half of mothers have a boy on their first birth and, of these, one-half will again have a boy on the second. Therefore a quarter (a half of one-half) of mothers have two boys.
Like the addition rule the multiplication rule requires slight modiﬁcation
before it can be applied generally and give the right answer in all circumstances.
The example assumes ﬁrst and second births to be independent events i.e. that
having a boy on the ﬁrst birth does not affect the probability of a boy on the
second. This assumption is not always valid.
Write Pr(B2|B1) to indicate the probability of the event B2 given that the event B1 has occurred. This is known as the conditional probability, more precisely the probability of B2 conditional upon B1. Let us drop the independence assumption and suppose the following:

Pr(B1) = Pr(G1) = 0.5 (2.10)

i.e. boys and girls are equally likely on the first birth, and

Pr(B2|B1) = Pr(G2|G1) = 0.6 (2.11)

i.e. a boy is more likely to be followed by another boy, and a girl by another girl. (It is easy to work out Pr(B2|G1) and Pr(G2|B1). What are they?)
Now, what is the probability of two boys? Half of all mothers have a boy first and, of these, 60% have another boy. Thus 30% (60% of 50%) of mothers have two boys. This is obtained from the rule

Pr(B1 and B2) = Pr(B1) × Pr(B2|B1) (2.12)
= 0.5 × 0.6
= 0.3
Thus, in general, we have

Pr(A and B) = Pr(A) × Pr(B|A) (2.13)

which simplifies to

Pr(A and B) = Pr(A) × Pr(B) (2.14)

if A and B are independent.
Independence may therefore be defined as follows: two events A and B are independent if the probability of one occurring is not influenced by the fact of the other having occurred. Formally, if A and B are independent, then

Pr(B|A) = Pr(B|not A) = Pr(B) (2.15)

and

Pr(A|B) = Pr(A|not B) = Pr(A) (2.16)
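The two-boys example can be computed directly from these rules. A minimal Python sketch using the probabilities assumed above:

```python
# Probabilities assumed in the text: Pr(B1) = 0.5, Pr(B2|B1) = 0.6
pr_b1 = 0.5
pr_b2_given_b1 = 0.6

# Multiplication rule with dependence, equation (2.13)
pr_two_boys = pr_b1 * pr_b2_given_b1   # 30% of mothers have two boys

# Under independence, equation (2.14), the answer would instead be 0.25
pr_two_boys_indep = 0.5 * 0.5

print(pr_two_boys, pr_two_boys_indep)
```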
The concept of independence is an important one in statistics, as it usually simplifies problems considerably. If two variables are known to be independent, then we can analyse the behaviour of one without worrying about what is happening to the other variable. For example, sales of computers are independent of temperature, so if one is trying to predict sales next month one does not need to
worry about the weather. In contrast, ice cream sales do depend on the weather, so predicting sales accurately requires one to forecast the weather first.

Statistics in practice
Intuition does not always work with probabilities
Counter-intuitive results frequently arise in probability, which is why it is wise to use the rules to calculate probabilities in tricky situations, rather than rely on intuition. Take the following questions:

● What is the probability of obtaining two heads (HH) in two tosses of a coin?
● What is the probability of obtaining tails followed by heads (TH)?
● If a coin is tossed until either HH or TH occurs, what are the probabilities of each sequence occurring first?
The answers to the first two are easy: 1/2 × 1/2 = 1/4 in each case. You might therefore conclude that each sequence is equally likely to be the first observed, but you would be wrong!
Unless HH occurs on the first two tosses, then TH must occur first. HH is therefore the first sequence only if it occurs on the first two tosses, which has a probability of 1/4. The probability that TH is first is therefore 3/4. The probabilities are unequal, a strange result. Now try the same thing with HHH and THH and three tosses of a coin.
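A simulation confirms the surprising answer. The following Python sketch (the seed and number of trials are arbitrary) races HH against TH:

```python
import random

random.seed(2)

def first_sequence():
    """Toss until either HH or TH appears; return which came first."""
    prev = 'H' if random.random() < 0.5 else 'T'
    while True:
        cur = 'H' if random.random() < 0.5 else 'T'
        if cur == 'H':
            return prev + 'H'   # either 'HH' or 'TH'
        prev = cur

trials = 100_000
hh_first = sum(first_sequence() == 'HH' for _ in range(trials))
print(hh_first / trials)   # close to 1/4; TH wins the remaining ~3/4
```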
Combining the addition and multiplication rules
More complex problems can be solved by suitable combinations of the addition
and multiplication formulae. For example, what is the probability of a mother having one child of each sex? This could occur in one of two ways: a girl followed by a boy, or a boy followed by a girl. It is important to note that these
are two different routes to the same outcome. Therefore we have (assuming non-independence, according to equation (2.11))

Pr(1 girl, 1 boy) = Pr((G1 and B2) or (B1 and G2))
= Pr(G1) × Pr(B2|G1) + Pr(B1) × Pr(G2|B1)
= 0.5 × 0.4 + 0.5 × 0.4
= 0.4
The answer can be checked if we remember equation (2.2), stating that probabilities must sum to 1. We have calculated the probability of two boys (0.3) and of a child of each sex (0.4). The only other possibility is of two girls. This probability must be 0.3, the same as two boys, since boys and girls are treated symmetrically in this problem, even with the non-independence assumption. The sum of the three possibilities (two boys, one of each, or two girls) is therefore 0.3 + 0.4 + 0.3 = 1, as it should be. This is often a useful check to make, especially if one is unsure that one’s calculations are correct.
Note that the problem would have been different if we had asked for the
probability of the mother having one girl with a younger brother.
Tree diagrams
The preceding problem can be illustrated using a tree diagram which often
helps to clarify a problem. A tree diagram is an alternative way of enumerating
all possible outcomes in the sample space with the associated probabilities.
The diagram for two children is shown in Figure 2.5.
The diagram begins at the left, and the first node shows the possible alternatives (boy, girl) at that point and the associated probabilities (0.5, 0.5). The next two nodes show the alternatives and probabilities for the second birth, given the sex of the first child. The final four nodes show the possible results: {boy, boy}, {boy, girl}, {girl, boy} and {girl, girl}.
To find the probability of two girls using the tree diagram, follow the lowest path, multiplying the probabilities along it to give 0.5 × 0.6 = 0.3. To find the probability of one child of each sex it is necessary to follow all the routes which lead to such an outcome. There are two in this case: leading to {boy, girl} and to {girl, boy}. Each of these has a probability of 0.2, obtained by multiplying the probabilities along that branch of the tree. Adding these together (since either one or the other leads to the desired outcome) yields the answer, giving 0.2 + 0.2 = 0.4.
This provides a graphical alternative to the formulae used above and may help
comprehension.
The tree diagram can obviously be extended to cover third and subsequent children, although the number of branches rapidly increases, in geometric progression. The difficulty then becomes not just the calculation of the probability attached to each outcome, but sorting out which branches should be taken into account in the calculation. Suppose we consider a family of five children, of whom three are girls. (To simplify matters we again assume independence of probabilities.) The appropriate tree diagram has 2⁵ = 32 end-points, each with probability 1/32. How many of these relate to families with three girls and two boys, for example? One can draw the diagram and count them, yielding the answer 10, but it takes considerable time and is prone to error. Far better would be to use a formula. To develop this we use the ideas of combinations and permutations.
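Before turning to the formula, the brute-force count for five children can be done by machine rather than by hand. A minimal Python sketch enumerating all 2⁵ outcomes:

```python
from itertools import product

# All 2^5 = 32 equally likely birth orders for five children
outcomes = list(product('GB', repeat=5))
three_girls = [o for o in outcomes if o.count('G') == 3]

print(len(outcomes))                     # 32 end-points
print(len(three_girls))                  # 10 orderings, e.g. GGGBB, GGBGB, ...
print(len(three_girls) / len(outcomes))  # probability 10/32
```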
Combinations and permutations
How can we establish the number of ways of having three girls and two boys in a family of five children? One way would be to write down all the possible orderings:
Figure 2.5 Tree diagram for a family with two children
GGGBB GGBGB GGBBG GBGGB GBGBG
GBBGG BGGGB BGGBG BGBGG BBGGG
This shows that there are 10 such orderings, so the probability of three girls and two boys in a family of five children is 10/32. In more complex problems this soon becomes difficult or impossible. The record number of children born to a British mother is 39 (of whom 32 were girls). The appropriate tree diagram has over five thousand billion ‘routes’ through it, and drawing one line (i.e. for one child) per second would imply 17 433 years to complete the task! Rather
than do this we use the combinatorial formula to find the answer. Suppose there are n children, r of them girls; then the number of orderings, denoted nCr, is obtained from¹

nCr = n!/(r!(n − r)!) (2.17)

In the above example, n = 5, r = 3, so the number of orderings is

5C3 = 5!/(3! × 2!) = (5 × 4 × 3 × 2 × 1)/((3 × 2 × 1) × (2 × 1)) = 10 (2.18)

If there were four girls out of five children, then the number of orderings, or combinations, would be

5C4 = 5!/(4! × 1!) = (5 × 4 × 3 × 2 × 1)/((4 × 3 × 2 × 1) × 1) = 5 (2.19)

This gives five possible orderings, i.e. the single boy could be the first, second, third, fourth or fifth born.
Why does this formula work? Consider five empty places to fill, corresponding to the five births in chronological order. Take the case of three girls (call them Amanda, Bridget and Caroline for convenience) who have to fill three of the five places. For Amanda there is a choice of five empty places. Having ‘chosen’ one, there remain four for Bridget, so there are 5 × 4 = 20 possibilities (i.e. ways in which these two could choose their places). Three remain for Caroline, so there are 60 (= 5 × 4 × 3) possible orderings in all (the two boys take the two remaining places). Sixty is the number of permutations of three named girls in five births. This is written 5P3 or, in general, nPr. Hence

5P3 = 5 × 4 × 3

or, in general,

nPr = n × (n − 1) × . . . × (n − r + 1) (2.20)

A simpler formula is obtained by multiplying and dividing by (n − r)!:

nPr = n × (n − 1) × . . . × (n − r + 1) × (n − r)!/(n − r)! = n!/(n − r)! (2.21)

¹ n! is read ‘n factorial’ and is defined as the product of all the integers up to and including n. Thus, for example, 3! = 3 × 2 × 1 = 6.
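Python's standard library computes these quantities directly, which gives a quick check of (2.17)–(2.21). A minimal sketch for n = 5, r = 3:

```python
import math

n, r = 5, 3

# Permutations of r named items in n places: nPr = n!/(n-r)!
npr = math.perm(n, r)           # 5 * 4 * 3 = 60
# Combinations ignore the ordering of the r items: nCr = nPr / r!
ncr = math.comb(n, r)           # 10

assert npr == math.factorial(n) // math.factorial(n - r)   # (2.21)
assert ncr == npr // math.factorial(r)                     # (2.17)
print(ncr / 2**5)               # Pr(three girls in five births) = 10/32
```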
What is the difference between nPr and nCr? The latter does not distinguish between the girls: the two cases (Amanda, Bridget, Caroline, boy, boy) and (Bridget, Amanda, Caroline, boy, boy) are effectively the same (three girls followed by two boys). So nPr is larger by a factor representing the number of ways of ordering the three girls. This factor is given by r! (= 3 × 2 × 1 = 6: any of the three girls could be first, either of the other two second, and then the final one). Thus to obtain nCr one must divide nPr by r!, giving (2.17).
Exercise 2.3
(a) A dart is thrown at a dartboard. What is the sample space for this experiment?
(b) An archer has a 30% chance of hitting the bull’s eye on the target. What is the complement to this event and what is its probability?
(c) What is the probability of two mutually exclusive events both occurring?
(d) A spectator reckons there is a 70% probability of an American rider winning the Tour de France and a 40% probability of a Frenchman winning. Comment.

Exercise 2.4
(a) For the archer in Exercise 2.3(b), what is the probability that she hits the target with one and only one of two arrows?
(b) What is the probability that she hits the target with both arrows?
(c) Explain the importance of the assumption of independence for the answers to both parts (a) and (b) of this exercise.
(d) If the archer becomes more confident after a successful shot (i.e. her probability of a shot on target rises to 50%) and less confident (probability falls to 20%) after a miss, how would this affect the answers to parts (a) and (b)?

Exercise 2.5
(a) Draw the tree diagrams associated with Exercise 2.4. You will need one for the case of independence of events, one for non-independence.
(b) Extend the diagram (assuming independence) to a third arrow. Use this to mark out the paths with two successful shots out of three. Calculate the probability of two hits out of three shots.
(c) Repeat part (b) for the case of non-independence. For this you may assume that a hit raises the probability of success with the next arrow to 50%. A miss lowers it to 20%.

Exercise 2.6
(a) Show how the answer to Exercise 2.5(b) may be arrived at using algebra, including the use of the combinatorial formula.
(b) Repeat part (a) for the non-independence case.
Bayes’ theorem
Bayes’ theorem is a factual statement about probabilities which is, in itself, uncontroversial. However, the use and interpretation of the result is at the heart of the difference between classical and Bayesian statistics. The theorem itself is easily derived from first principles. Equation (2.22) is similar to equation (2.13), covered earlier when discussing the multiplication rule:
Pr(A and B) = Pr(A|B) × Pr(B) (2.22)

hence

Pr(A|B) = Pr(A and B)/Pr(B) (2.23)

Expanding both top and bottom of the right-hand side,

Pr(A|B) = Pr(B|A) × Pr(A) / (Pr(B|A) × Pr(A) + Pr(B|not A) × Pr(not A)) (2.24)

Equation (2.24) is known as Bayes’ theorem and is a statement about the probability of the event A, conditional upon B having occurred. The following example demonstrates its use.
Two bags contain red and yellow balls. Bag A contains six red and four yellow balls, bag B has three red and seven yellow balls. A ball is drawn at random from one bag and turns out to be red. What is the probability that it came from bag A? Since bag A has relatively more red balls to yellow balls than does bag B, it seems bag A ought to be favoured. The probability should be more than 0.5. We can check if this is correct.
Denoting:

Pr(A) = 0.5 (the probability of choosing bag A at random), Pr(B) = 0.5
Pr(R|A) = 0.6 (the probability of selecting a red ball from bag A), etc.

we have

Pr(A|R) = Pr(R|A) × Pr(A) / (Pr(R|A) × Pr(A) + Pr(R|B) × Pr(B)) (2.25)

using Bayes’ theorem. Evaluating this gives

Pr(A|R) = (0.6 × 0.5)/(0.6 × 0.5 + 0.3 × 0.5) = 2/3 (2.26)

(You can check that Pr(B|R) = 1/3, so that the sum of the probabilities is 1.) As expected, this result is greater than 0.5.
Bayes’ theorem can be extended to cover more than two bags: if there are five bags, for example, labelled A to E, then

Pr(A|R) = Pr(R|A) × Pr(A) / (Pr(R|A) × Pr(A) + Pr(R|B) × Pr(B) + . . . + Pr(R|E) × Pr(E)) (2.27)

In Bayesian language, Pr(A), Pr(B), etc. are known as the prior (to the drawing of the ball) probabilities, Pr(R|A), Pr(R|B), etc. are the likelihoods, and Pr(A|R), Pr(B|R), etc. are the posterior probabilities. Bayes’ theorem can alternatively be expressed as

posterior probability = (likelihood × prior probability) / ∑(likelihoods × prior probabilities) (2.28)

This is illustrated below by reworking the above example.
        Prior probabilities   Likelihoods   Prior × likelihood   Posterior probabilities
A       0.5                   0.6           0.30                 0.30/0.45 = 2/3
B       0.5                   0.3           0.15                 0.15/0.45 = 1/3
Total                                       0.45
The general version of Bayes’ theorem may be stated as follows. If there are n
events, labelled E1, . . . , En, then the probability of the event Ei occurring, given
the sample evidence S, is
Pr(Ei|S) = [Pr(S|Ei) × Pr(Ei)] / ∑[Pr(S|Ei) × Pr(Ei)]   (2.29)
As stated earlier, dispute arises over the interpretation of Bayes’ theorem.
In the above example there is no difficulty, because the probability statements
can be interpreted as relative frequencies. If the experiment of selecting a bag at
random and choosing a ball from it were repeated many times, then of those
occasions when a red ball is selected, in two-thirds of them bag A will have been
chosen. However, consider an alternative interpretation of the symbols:
A: a coin is fair
B: a coin is unfair
R: the result of a toss is a head.
Then, given a toss (or series of tosses) of a coin, this evidence can be used to
calculate the probability of the coin being fair. But this makes no sense according
to the frequentist school: either the coin is fair or not; it is not a question of
probability. The calculated value must be interpreted as a degree of belief and be
given a subjective interpretation.
Exercise 2.7
(a) Repeat the ‘balls in the bag’ exercise from the text, but with bag A containing five
red and three yellow balls and bag B containing one red and two yellow balls. The
single ball drawn is red. Before doing the calculation, predict which bag is more
likely to be the source of the drawn ball. Explain why.
(b) Bag A now contains 10 red and six yellow balls (i.e. twice as many as before, but
in the same proportion). Does this alter the answer you obtained in part (a)?
(c) Set out your answer to part (b) in the form of prior probabilities and likelihoods,
in order to obtain the posterior probability.
Decision analysis
The study of probability naturally leads on to the analysis of decision making
where risk is involved. This is the realistic situation facing most ﬁrms and the
use of probability can help to illuminate the problem. To illustrate the topic we
use the example of a ﬁrm facing a choice of three different investment projects.
The uncertainty that the ﬁrm faces concerns the interest rate at which to
discount the future ﬂows of income. If the interest/discount rate is high then
projects which have income far in the future become less attractive relative to
Table 2.1 Data for decision analysis: present values of three investment projects at
different interest rates (£000)

Project        Future interest rate
               4%      5%      6%      7%
A              1475    1363    1200    1115
B              1500    1380    1148    1048
C              1650    1440    1200    810
Probability    0.1     0.4     0.4     0.1
projects with more immediate returns. A low rate reverses this conclusion. The
question is: which project should the firm select? As we shall see, there is no
unique right answer to the question, but using probability theory we can see
why the answer might vary.
Table 2.1 provides the data required for the problem. The three projects are
imaginatively labelled A, B and C. There are four possible states of the world (i.e.
future scenarios), each with a different interest rate, as shown across the top of
the table. This is the only source of uncertainty; otherwise the states of the
world are identical. The figures in the body of the table show the present value
of each income stream at the given discount rate.
Present value
The present value of future income is its value today, and is obtained using the
interest rate. For example, if the interest rate is 10%, the present value (i.e. today)
of £110 received in one year’s time is £100. In other words, one could invest
£100 today at 10% and have £110 in one year’s time: £100 today and £110 next year
are equivalent.
The present value of £110 received in two years’ time is smaller, since one has
to wait longer to receive it. It is calculated as £110/1.1² = £90.91. Again, £90.91
invested at 10% per annum will yield £110 in two years’ time. After one year it is
worth £90.91 × 1.1 = £100, and after a second year that £100 becomes £110. Notice
that if the interest rate rises, the present value falls. For example, if the interest
rate is 20%, £110 next year is worth only £110/1.2 = £91.67 today.
The present value of £110 in one year’s time and another £110 in two years’
time is £110/1.1 + £110/1.1² = £190.91. The present value of more complicated
streams of income can be calculated by extension of this principle. In the example
used in the text you do not need to worry about how the present value is arrived
at. Before reading on, you may wish to do Exercise 2.8 to practise calculation of
present value.
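For readers who prefer to check such calculations by computer, the discounting arithmetic can be sketched in a few lines of Python (the function and its name are illustrative, not part of the text):

```python
# Present value of a stream of future payments at interest rate r:
# PV = sum of payment_t / (1 + r)^t, where t is the year of receipt.
def present_value(payments, r):
    """payments: list of (amount, year) pairs; r: interest rate, e.g. 0.1 for 10%."""
    return sum(amount / (1 + r) ** year for amount, year in payments)

print(round(present_value([(110, 1)], 0.1), 2))            # 100.0
print(round(present_value([(110, 2)], 0.1), 2))            # 90.91
print(round(present_value([(110, 1), (110, 2)], 0.1), 2))  # 190.91
```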
Thus, for example, if the interest rate turns out to be 4%, then project A has a
present value of £1 475 000, while B’s is £1 500 000. If the discount rate turns
out to be 5%, the PV for A is £1 363 000, while for B it has changed to £1 380 000.
Obviously, as the discount rate rises, the present value of the return falls.
Alternatively, we could assume that a higher interest rate increases the cost of
borrowing to finance the project, which reduces its profitability. We assume
that each project requires a certain initial outlay of £1 100 000, with which the
PV should be compared.
The final row of the table shows the probabilities which the firm attaches to
each interest rate. These are obviously someone’s subjective probabilities, and are
symmetric around a central value of 5.5%.
Exercise 2.8
(a) At an interest (or discount) rate of 10%, what is the present value of £1200 received
in one year’s time?
(b) If the interest rate rises to 15%, how is the present value altered? The interest
rate has risen by 50% (from 10% to 15%): how has the present value changed?
(c) At an interest rate of 10%, what is the present value of £1200 received in (i) two
years’ time and (ii) five years’ time?
(d) An income of £500 is received at the end of years one, two and three (i.e. £1500
in total). What is its present value? Assume r = 10%.
(e) Project A provides an income of £300 after one year and another £600 after two
years. Project B provides £400 and £488 at the same times. At a discount rate of
10%, which project has the higher present value? What happens if the discount
rate rises to 20%?
Decision criteria: maximising the expected value
We need to decide how a decision is to be made on the basis of these data. The
first criterion involves the expected value of each project. Because of the uncertainty
about the interest rate, there is no single present value for each project.
We therefore calculate the expected value, using the E operator which was
introduced in Chapter 1. In other words, we find the expected present value of each
project by taking a weighted average of the PV figures, the weights being the
probabilities. The project with the highest expected return is chosen.
The expected values are calculated in Table 2.2. The highest expected present
value is £1 302 000, associated with project C. On this criterion, therefore, C is
chosen. Is this a wise choice? If the business always uses this rule to evaluate
many projects then, in the long run, it will earn the maximum profits. However,
you may notice that if the interest rate turns out to be 7%, then C would be the
worst project to choose, and the firm would make a substantial loss
in such circumstances. Project C is the most sensitive to the discount rate (it has
the greatest variance of PV values of the three projects) and therefore the firm
faces more risk by opting for C. There is a trade-off between risk and return.
Table 2.2 Expected values of the three projects
Project Expected value
A 1284.2
B 1266.0
C 1302.0
Note: 1284.2 is calculated as 1475 × 0.1 + 1363 × 0.4 + 1200 × 0.4 + 1115 × 0.1. This is the
weighted average of the four PV values. A similar calculation is performed for the other
projects.
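The weighted-average calculation in the note to Table 2.2 can be reproduced in a few lines of Python (an illustrative sketch using the Table 2.1 figures; the variable names are ours):

```python
# Expected present value of each project: a weighted average of the PVs
# in Table 2.1, the weights being the interest-rate probabilities.
probs = [0.1, 0.4, 0.4, 0.1]                  # Pr(interest rate = 4%, 5%, 6%, 7%)
pvs = {"A": [1475, 1363, 1200, 1115],
       "B": [1500, 1380, 1148, 1048],
       "C": [1650, 1440, 1200, 810]}

expected = {proj: sum(p * v for p, v in zip(probs, row)) for proj, row in pvs.items()}
print({p: round(v, 1) for p, v in expected.items()})
# {'A': 1284.2, 'B': 1266.0, 'C': 1302.0} - C has the highest expected value
```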
Perhaps some alternative criteria should be considered. We look at these next: in
particular, the maximin, maximax and minimax regret strategies.
Maximin, maximax and minimax regret
The maximin criterion looks at the worst-case scenario for each project and then
selects the project which does best in these circumstances. It is, therefore, an
inevitably pessimistic or cautious view. Table 2.3 illustrates the calculation.
This time we observe that project A is preferred. In the worst case, which occurs
when r = 7% for all projects, A does best, with a PV of £1 115 000 and
therefore a slight profit. The maximin criterion may be a good one in business,
where managers tend to over-optimism. Calculating the maximin may be
a salutary exercise, even if it is not the ultimate deciding factor.
The opposite criterion is the optimistic one, where the maximax criterion
is used. In this case one looks at the best circumstances for each project and
chooses the best-performing project. Each project does best when the interest
rate is at its lowest level, 4%. Examining the first column of Table 2.1 shows that
project C (PV = 1650) performs best and is therefore chosen. Given the earlier
warning about over-optimistic managers, this may not be suitable as the sole
criterion for making investment decisions.
A final criterion is that of minimax regret. If project B were chosen but the
interest rate turns out to be 7%, then we would regret not having chosen A, the
best project under these circumstances. Our regret would be the extent of the
difference between the two, a matter of 1115 − 1048 = 67. Similarly, the regret
if we had chosen C would be 1115 − 810 = 305. We can calculate these regrets
at the other interest rates too, always comparing the PV of a project with the
best PV given that interest rate. This gives us Table 2.4.
The ﬁnal column of the table shows the maximum regret for each project.
The minimax regret criterion is to choose the minimum of these ﬁgures. This is
Table 2.3 The maximin criterion
Project Minimum
A 1115
B 1048
C 810
Maximum 1115
Table 2.4 The costs of taking the wrong decision

Project     4%     5%     6%     7%     Maximum
A           175    77     0      0      175
B           150    60     52     67     150
C           0      0      0      305    305
Minimum                                 150
given at the bottom of the final column: it is 150, which is associated with project
B. A justification for using this criterion might be that you do not want to
fall too far behind your competitors. If other firms are facing similar investment
decisions, then the regret table shows the difference in PV (and hence profits)
if they choose the best project while you do not. Choosing the minimax regret
solution ensures that you will not fall too far behind. During the internet bubble
of the 1990s it was important to gain market share and keep up with, or surpass,
your competitors. The minimax regret strategy might be a useful tool during
such times.
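All three criteria can be applied mechanically to the Table 2.1 figures. The following Python sketch (variable names are ours, for illustration) reproduces the choices derived above:

```python
# Maximin, maximax and minimax regret for the Table 2.1 projects.
pvs = {"A": [1475, 1363, 1200, 1115],
       "B": [1500, 1380, 1148, 1048],
       "C": [1650, 1440, 1200, 810]}

maximin = max(pvs, key=lambda p: min(pvs[p]))   # best worst-case PV
maximax = max(pvs, key=lambda p: max(pvs[p]))   # best best-case PV

# Regret = best PV in that state of the world minus the project's PV.
best = [max(col) for col in zip(*pvs.values())]
regret = {p: [b - v for b, v in zip(best, row)] for p, row in pvs.items()}
minimax_regret = min(regret, key=lambda p: max(regret[p]))

print(maximin, maximax, minimax_regret)  # A C B
```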
You will probably have noticed that we have managed to find a justification
for choosing all three projects! No one project comes out best on all criteria.
Nevertheless, the analysis might be of some help: if the investment project is
one of many small, independent investments the firm is making, then this
would justify use of the expected value criterion. On the other hand, if this is a
big, one-off project which could possibly bankrupt the firm if it goes wrong,
then the maximin criterion would be appropriate.
The expected value of perfect information
Often a firm can improve its knowledge about future possibilities via research,
which costs money. This effectively means buying information about the future
state of the world. The question arises: how much should a firm pay for such
information? Perfect information would reveal the future state of the world with
certainty – in this case, the future interest rate. In that case you could be sure of
choosing the right project given each state of the world. If interest rates turn out
to be 4%, the firm would invest in C; if 7%, in A; and so on.
In such circumstances the firm would expect to earn
0.1 × 1650 + 0.4 × 1440 + 0.4 × 1200 + 0.1 × 1115 = 1332.5
i.e. the probability of each state of the world is multiplied by the PV of the best
project for that state. This gives a figure which is greater than the expected value
calculated earlier without perfect information (1302). The expected value of perfect
information is therefore the difference between these two: 30.5. This sets a
maximum to the value of information, for it is unlikely in the real world that any
information about the future is going to be perfect.
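The same figures give the expected value of perfect information directly; a Python sketch (illustrative only, with our own variable names):

```python
# EVPI = expected PV with perfect foresight minus the best expected PV without it.
probs = [0.1, 0.4, 0.4, 0.1]
pvs = {"A": [1475, 1363, 1200, 1115],
       "B": [1500, 1380, 1148, 1048],
       "C": [1650, 1440, 1200, 810]}

best_each_state = [max(col) for col in zip(*pvs.values())]        # 1650, 1440, 1200, 1115
with_info = sum(p * v for p, v in zip(probs, best_each_state))    # 1332.5
without_info = max(sum(p * v for p, v in zip(probs, row)) for row in pvs.values())
print(round(with_info - without_info, 1))  # 30.5
```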
Exercise 2.9
(a) Evaluate the three projects detailed in the table below, using the criteria of
expected value, maximin, maximax and minimax regret. The probability of a 4%
interest rate is 0.3, of 6% is 0.4 and of 8% is 0.3.

Project   4%    6%    8%
A         100   80    70
B         90    85    75
C         120   60    40

(b) What would be the value of perfect information about the interest rate?
Summary
● The theory of probability forms the basis of statistical inference: the drawing
of inferences on the basis of a random sample of data. The reason for this is
the probability basis of random sampling.
● A convenient definition of the probability of an event is the number of times
the event occurs divided by the number of trials (occasions when the event
could occur).
● For more complex events, their probabilities can be calculated by combining
probabilities, using the addition and multiplication rules.
● The probability of events A or B occurring is calculated according to the
addition rule.
● The probability of A and B occurring is given by the multiplication rule.
● If A and B are not independent, then Pr(A and B) = Pr(A) × Pr(B|A), where
Pr(B|A) is the probability of B occurring given that A has occurred (the
conditional probability).
● Tree diagrams are a useful technique for enumerating all the possible paths in
a series of probability trials, but for large numbers of trials the huge number of
possibilities makes the technique impractical.
● For experiments with a large number of trials (e.g. obtaining 20 heads in 50
tosses of a coin) the formulae for combinations and permutations can be used.
● The combinatorial formula nCr gives the number of ways of combining r
similar objects among n objects, e.g. the number of orderings of three girls
(and hence, implicitly, two boys also) in five children.
● The permutation formula nPr gives the number of orderings of r distinct
objects among n, e.g. three named girls among five children.
● Bayes’ theorem provides a formula for calculating a conditional probability, e.g.
the probability of someone being a smoker, given they have been diagnosed
with cancer. It forms the basis of Bayesian statistics, allowing us to calculate
the probability of a hypothesis being true, based on the sample evidence and
prior beliefs. Classical statistics disputes this approach.
● Probabilities can also be used as the basis for decision making in conditions of
uncertainty, using as decision criteria expected value maximisation, maximin,
maximax or minimax regret.
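For reference, the combinatorial and permutation formulae summarised above are available directly in Python’s standard library:

```python
from math import comb, perm

# nCr: orderings of three girls (and hence implicitly two boys) among five children.
print(comb(5, 3))    # 10
# nPr: orderings of three named girls among five children.
print(perm(5, 3))    # 60
# e.g. the number of ways of obtaining 20 heads in 50 tosses of a coin:
print(comb(50, 20))
```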
Key terms and concepts
addition rule
Bayes’ theorem
combinations
complement
compound event
conditional probability
exhaustive
expected value of perfect information
frequentist approach
independent events
maximin
minimax
minimax regret
multiplication rule
mutually exclusive
outcome (or event)
permutations
probability experiment
probability of an event
sample space
subjective approach
tree diagram
Problems
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
2.1 Given a standard pack of cards, calculate the following probabilities:
(a) drawing an ace;
(b) drawing a court card (i.e. jack, queen or king);
(c) drawing a red card;
(d) drawing three aces without replacement;
(e) drawing three aces with replacement.
2.2 The following data give duration of unemployment by age, in July 1986.

Age      Duration of unemployment (weeks)          Total    Economically active
         ≤8      8–26    26–52    >52              (000s)   (000s)
         (percentage figures)
16–19    27.2    29.8    24.0     19.0             273.4    1270
20–24    24.2    20.7    18.3     36.8             442.5    2000
25–34    14.8    18.8    17.2     49.2             531.4    3600
35–49    12.2    16.6    15.1     56.2             521.2    4900
50–59    8.9     14.4    15.6     61.2             388.1    2560
60+      18.5    29.7    30.7     21.4             74.8     1110

The ‘economically active’ column gives the total of employed (not shown) plus unemployed
in each age category.
(a) In what sense may these figures be regarded as probabilities? What does the figure
27.2 (top-left cell) mean, following this interpretation?
(b) Assuming the validity of the probability interpretation, which of the following
statements are true?
(i) The probability of an economically active adult aged 25–34, drawn at random,
being unemployed is 531.4/3600.
(ii) If someone who has been unemployed for over one year is drawn at random, the
probability that they are aged 16–19 is 19%.
(iii) For those aged 35–49 who became unemployed before July 1985, the probability
of their still being unemployed is 56.2%.
(iv) If someone aged 50–59 is drawn at random from the economically active population,
the probability of their being unemployed for eight weeks or less is 8.9%.
(v) The probability of someone aged 35–49, drawn at random from the economically
active population, being unemployed for between 8 and 26 weeks is 0.166 ×
521.2/4900.
(c) A person is drawn at random from the population and found to have been unemployed
for over one year. What is the probability that they are aged between 16 and 19?
2.3 ‘Odds’ in horserace betting are defined as follows: 3/1 (three-to-one against) means a
horse is expected to win once for every three times it loses; 3/2 means two wins out of five
races; 4/5 (five to four on) means five wins for every four defeats; etc.
(a) Translate the above odds into ‘probabilities’ of victory.
(b) In a three-horse race, the odds quoted are 2/1, 6/4 and 1/1. What makes the odds
different from probabilities? Why are they different?
(c) Discuss how much the bookmaker would expect to win in the long run at such odds,
assuming each horse is backed equally.
2.4 (a) Translate the following odds to ‘probabilities’: 13/8, 2/1 on, 100/30.
(b) In the 2.45 race at Plumpton on 18/10/94 the odds for the five runners were:
Philips Woody 1/1
Gallant Effort 5/2
Satin Noir 11/2
Victory Anthem 9/1
Common Rambler 16/1
Calculate the ‘probabilities’ and their sum.
(c) Should the bookmaker base his odds on the true probabilities of each horse winning
or on the amount bet on each horse?
2.5 How might you estimate the probability of Peru defaulting on its debt repayments next year?
2.6 How might you estimate the probability of a corporation reneging on its bond payments?
2.7 Judy is 33, unmarried and assertive. She is a graduate in political science, and involved in
union activities and anti-discrimination movements. Which of the following statements do
you think is more probable?
(a) Judy is a bank clerk.
(b) Judy is a bank clerk, active in the feminist movement.
2.8 In March 1994 a news item revealed that a London ‘gender’ clinic (which reportedly
enables you to choose the sex of your child) had just set up in business. Of its first six
births, two were of the ‘wrong’ sex. Assess this from a probability point of view.
2.9 A newspaper advertisement reads ‘The sex of your child predicted, or your money back’.
Discuss this advertisement from the point of view of (a) the advertiser and (b) the client.
2.10 ‘Roll six sixes to win a Mercedes’ is the announcement at a fair. You have to roll six dice.
If you get six sixes you win the car, valued at £20 000. The entry ticket costs £1. What
is your expected gain or loss on this game? The organisers of the fair have to take out
insurance against the car being won. This costs £250 for the day. Does this seem a fair
premium? If not, why not?
2.11 At another stall you have to toss a coin numerous times. If a head does not appear in
20 tosses you win £1 bn. The entry fee for the game is £100.
(a) What are your expected winnings?
(b) Would you play?
2.12 A four-engine plane can fly as long as at least two of its engines work. A two-engine plane
flies as long as at least one engine works. The probability of an individual engine failure
is 1 in 1000.
(a) Would you feel safer in a four- or two-engine plane, and why? Calculate the
probabilities of an accident for each type.
(b) How much safer is one type than the other?
(c) What crucial assumption are you making in your calculation? Do you think it is
valid?
2.13 Which of the following events are independent?
(a) Two flips of a fair coin.
(b) Two flips of a biased coin.
(c) Rainfall on two successive days.
(d) Rainfall on St Swithin’s day and rain one month later.
2.14 Which of the following events are independent?
(a) A student getting the first two questions correct in a multiple-choice exam.
(b) A driver having an accident in successive years.
(c) IBM and Dell earning positive profits next year.
(d) Arsenal Football Club winning on successive weekends.
How is the answer to (b) reflected in car insurance premiums?
2.15 Manchester United beat Liverpool 4–2 at soccer, but you do not know the order in which
the goals were scored. Draw a tree diagram to display all the possibilities and use it to
find (a) the probability that the goals were scored in the order L, MU, MU, MU, L, MU, and
(b) the probability that the score was 2–2 at some stage.
2.16 An important numerical calculation on a spacecraft is carried out independently by
three computers. If all arrive at the same answer, it is deemed correct. If one
disagrees, it is overruled. If there is no agreement then a fourth computer does the
calculation and, if its answer agrees with any of the others, it is deemed correct. The
probability of an individual computer getting the answer right is 99%. Use a tree diagram
to find:
(a) the probability that the first three computers get the right answer;
(b) the probability of getting the right answer;
(c) the probability of getting no answer;
(d) the probability of getting the wrong answer.
2.17 The French national lottery works as follows. Six numbers from the range 0 to 49 are
chosen at random. If you have correctly guessed all six, you win the first prize. What
are your chances of winning if you are only allowed to choose six numbers? (A single entry
like this costs €1.) For €210 you can choose 10 numbers, and you win if the six selected
numbers are among them. Is this better value than the single entry?
2.18 The UK national lottery works as follows. You choose six different numbers in the range
1 to 49. If all six come up in the draw (in any order) you win the first prize, expected to be
around £2m (which could be shared, if someone else chooses the six winning numbers).
(a) What is your chance of winning with a single ticket?
(b) You win a second prize if you get five out of six right, and your final chosen number
matches the ‘bonus’ number in the draw (also in the range 1 to 49). What is the
probability of winning a second prize?
(c) Calculate the probabilities of winning a third, fourth or fifth prize, where a third prize
is won by matching five out of the six numbers, a fourth prize by matching four out of
six and a fifth prize by matching three out of six.
(d) What is the probability of winning a prize?
(e) The prizes are as follows:
Prize     Value
First     £2 m (expected, possibly shared)
Second    £100 000 (expected, for each winner)
Third     £1500 (expected, for each winner)
Fourth    £65 (expected, for each winner)
Fifth     £10 (guaranteed, for each winner)
Comment upon the distribution of the fund between first, second, etc. prizes.
(f) Why is the fifth prize guaranteed, whereas the others are not?
(g) In the first week of the lottery, 49 million tickets were sold. There were 1 150 000
winners, of which 7 won a share of the jackpot, 39 won a second prize, 2139 won a
third prize and 76 731 a fourth prize. Are you surprised by these results or are they
as you would expect?
2.19 A coin is either fair or has two heads. You initially assign probabilities of 0.5 to each
possibility. The coin is then tossed twice with two heads appearing. Use Bayes’ theorem
to work out the posterior probabilities of each possible outcome.
2.20 A test for AIDS is 99% successful, i.e. if you are HIV+ it will detect it in 99% of all tests, and
if you are not, it will again be right 99% of the time. Assume that about 1% of the
population are HIV+. You take part in a random testing procedure, which gives a positive result.
What is the probability that you are HIV+? What implications does your result have for
AIDS testing?
2.21 (a) Your initial belief is that a defendant in a court case is guilty with probability 0.5.
A witness comes forward claiming he saw the defendant commit the crime. You know
the witness is not totally reliable and tells the truth with probability p. Use Bayes’
theorem to calculate the posterior probability that the defendant is guilty, based on
the witness’s evidence.
(b) A second witness, equally unreliable, comes forward and claims she saw the defendant
commit the crime. Assuming the witnesses are not colluding, what is your posterior
probability of guilt?
(c) If p = 0.5, compare the answers to (a) and (b). How do you account for this curious
result?
2.22 A man is mugged and claims that the mugger had red hair. In police investigations of such
cases, the victim was able correctly to identify the assailant’s hair colour 80% of the time.
Assuming that 10% of the population have red hair, what is the probability that the
assailant in this case did in fact have red hair? Guess the answer first, then find the
right answer using Bayes’ theorem. What are the implications of your results for juries’
interpretation of evidence in court, particularly in relation to racial minorities?
2.23 A firm has a choice of three projects, with profits as indicated below, dependent upon the
state of demand.

Project       Demand
              Low     Middle   High
A             100     140      180
B             130     145      170
C             110     130      200
Probability   0.25    0.45     0.3

(a) Which project should be chosen on the expected value criterion?
(b) Which project should be chosen on the maximin and maximax criteria?
(c) Which project should be chosen on the minimax regret criterion?
(d) What is the expected value of perfect information to the firm?
2.24 A firm can build a small, medium or large factory, with anticipated profits from each
dependent upon the state of demand, as in the table below.

Factory       Demand
              Low     Middle   High
Small         300     320      330
Medium        270     400      420
Large         50      250      600
Probability   0.3     0.5      0.2

(a) Which project should be chosen on the expected value criterion?
(b) Which project should be chosen on the maximin and maximax criteria?
(c) Which project should be chosen on the minimax regret criterion?
(d) What is the expected value of perfect information to the firm?
2.25 There are 25 people at a party. What is the probability that there are at least two with a
birthday in common?
(Hint: the complement is much easier to calculate.)
2.26 This problem is tricky, but amusing. Three gunmen, A, B and C, are shooting at each
other. The probabilities that each will hit what they aim at are, respectively, 1, 0.75, 0.5.
They take it in turns to shoot (in alphabetical order) and continue until only one is left
alive. Calculate the probabilities of each winning the contest. (Assume they draw lots for
the right to shoot first.)
Hint 1: Start with one-on-one gunfights, e.g. the probability of A beating B, or of B beating C.
Hint 2: You’ll need the formula for the sum of an infinite series, given in Chapter 1.
2.27 The BMAT test (see http://www.ucl.ac.uk/lapt/bmat/) is an on-line test for prospective
medical students. It uses ‘certainty based marking’. After choosing your answer from the
alternatives available, you then have to give your level of confidence that your answer is
correct: low, medium or high. If you choose low, you get one mark for the correct answer,
zero if it is wrong. For medium confidence you get +2 or −2 marks for correct or incorrect
answers. If you choose high, you get +3 or −6.
(a) If you are 60% confident your answer is correct (i.e. you think there is a 60% probability
you are right), which certainty level should you choose?
(b) Over what range of probabilities is ‘medium’ the best choice?
(c) If you were 85% confident, how many marks would you expect to lose by opting for one
of the wrong choices?
2.28 A multiple choice test involves 20 questions, with four choices for each answer.
(a) If you guessed the answers to all questions at random, what mark out of 20 would you
expect to get?
(b) If you know the correct answer to eight of the questions, what is your expected score
out of 20?
(c) The examiner wishes to correct the bias due to students guessing answers. They
decide to award a negative mark for incorrect answers, with 1 for a correct answer
and 0 for no answer given. What negative mark would ensure that the overall mark
out of 20 is a true reflection of the student’s ability?
Answers to exercises
Exercise 2.1
Answer in text.
Exercise 2.2
a A subjective view would have to be taken informed by such things as opinion polls.
b 1/49 a frequentist view. Some people do add their own subjective evaluations
e.g. that 5 must come up as it has not been drawn for several weeks but these
are often unwarranted according to the frequentist approach.
c A mixture of objective and subjective criteria might be used here. Historical data on
the occurrence of tsunamis might give a frequentist baseline ﬁgure to which might
be added subjective considerations such as the amount of recent seismic activity.
d A mixture again. Historical data give a benchmark possibly of little relevance
while immediate factors such as the weather might alter one’s subjective judge-
ment. As I write it is snowing outside which seems to have a huge impact on
British trains
Exercise 2.3
a 1 2 3... 20 21 a triple seven 22 double eleven 24 25 outer bull 26 27 28
30 32 33 34 36 38 39 40 42 45 48 50 51 54 57 60. Or it could miss altogether
b The complement is missing the target with probability 1 − 0.3 70.
c Zero it is impossible.
d Impossible the probabilities sum to more than one.
Exercise 2.4
a 0.3 × 0.7 + 0.7 × 0.3 0.42. This is a hit followed by a miss or a miss followed by
a hit.
b 0.3 × 0.3 0.09.
c It is assumed that the probability of the second arrow hitting the target is the
same as the ﬁrst. Altering this assumption would affect both answers.
d Part a becomes 0.3 × 1 − 0.5 + 0.7 × 0.2 0.29. Part b becomes 0.3 × 0.5
0.15.
Exercise 2.5
(a) Independent case:
[tree diagram not reproduced]
Dependent case:
[tree diagram not reproduced]
Exercise 2.6
(a) Pr(2 hits) = Pr(H and H and M) × 3C2 = 0.3 × 0.3 × 0.7 × 3 = 0.189.
(b) This cannot be done using the combinatorial formula, because of the
non-independence of probabilities. Instead, one has to calculate Pr(H and H and M)
+ Pr(H and M and H) + Pr(M and H and H), yielding the answer 0.175.
(b) and (c): [tree diagrams not reproduced]
Exercise 2.7
(a) Bag A has proportionately more red balls than bag B, hence should be the favoured
bag from which the single red ball was drawn. Performing the calculation:
Pr(A|R) = [Pr(R|A) × Pr(A)] / [Pr(R|A) × Pr(A) + Pr(R|B) × Pr(B)]
        = (0.625 × 0.5) / (0.625 × 0.5 + 0.5 × 0.5) = 0.556
(b) The result is the same, as Pr(R|A) = 0.625 as before. The number of balls does not
enter the calculation.
(c)
        Prior probabilities   Likelihoods   Prior × likelihood   Posterior probabilities
A       0.5                   0.625         0.3125               0.3125/0.5625 = 0.556
B       0.5                   0.5           0.25                 0.25/0.5625 = 0.444
Total                                       0.5625
Exercise 2.8
(a) 1200/1.1 = 1090.91.
(b) 1200/1.15 = 1043.48. The PV has only changed by 4.3%. This is calculated as 1.1/1.15 − 1 = −0.043.
(c) 1200/1.1^2 = 991.74; 1200/1.1^5 = 745.11.
(d) PV = 500/1.1 + 500/1.1^2 + 500/1.1^3 = 1243.43.
(e) At 10%: project A yields a PV of 300/1.1 + 600/1.1^2 = 768.6. Project B yields 400/1.1 + 488/1.1^2 = 766.9. At 20% the PVs are 666.7 and 672.2, reversing the rankings. A's large benefits in year 2 are penalised by the higher discount rate.
Exercise 2.9
(a)
Project   Expected value                           Minimum   Maximum
A         0.3 × 100 + 0.4 × 80 + 0.3 × 70 = 83     70        100
B         0.3 × 90 + 0.4 × 85 + 0.3 × 75 = 83.5    75        90
C         0.3 × 120 + 0.4 × 60 + 0.3 × 40 = 72     40        120
The maximin is 75, associated with project B, and the maximax is 120, associated with project C. The regret values are given by
        4      6      8      Max
A       20     5      5      20
B       30     0      0      30
C       0      25     35     35
Min                          20
The minimax regret is 20, associated with project A.
(b) With perfect information the firm could earn 0.3 × 120 + 0.4 × 85 + 0.3 × 75 = 92.5. The highest expected value is 83.5, so the value of perfect information is 92.5 − 83.5 = 9.
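The decision criteria and the value of perfect information can be reproduced programmatically; a Python sketch (payoffs and probabilities from the exercise, variable names ours):

```python
# Decision criteria for the three projects.
probs = (0.3, 0.4, 0.3)
payoffs = {"A": (100, 80, 70), "B": (90, 85, 75), "C": (120, 60, 40)}

expected = {k: round(sum(p * x for p, x in zip(probs, v)), 1)
            for k, v in payoffs.items()}

maximin = max(payoffs, key=lambda k: min(payoffs[k]))  # best worst case
maximax = max(payoffs, key=lambda k: max(payoffs[k]))  # best best case

# Regret = shortfall from the best payoff in each state of the world
best = [max(v[i] for v in payoffs.values()) for i in range(3)]
regret = {k: [b - x for b, x in zip(best, v)] for k, v in payoffs.items()}
minimax_regret = min(regret, key=lambda k: max(regret[k]))

print(expected)                          # {'A': 83.0, 'B': 83.5, 'C': 72.0}
print(maximin, maximax, minimax_regret)  # B C A

# Value of perfect information: expected payoff choosing the best project
# in each state, minus the best unconditional expected value
evpi = sum(p * b for p, b in zip(probs, best)) - max(expected.values())
print(round(evpi, 1))  # 9.0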
3 Probability distributions
Contents
Learning outcomes 108
Introduction 109
Random variables 110
The Binomial distribution 111
The mean and variance of the Binomial distribution 115
The Normal distribution 117
The sample mean as a Normally distributed variable 125
Sampling from a non-Normal population 129
The relationship between the Binomial and Normal distributions 131
Binomial distribution method 131
Normal distribution method 132
The Poisson distribution 132
Summary 135
Key terms and concepts 136
Problems 137
Answers to exercises 142
Learning outcomes
By the end of this chapter you should be able to:
● recognise that the result of most probability experiments (e.g. the score on a die) can be described as a random variable;
● appreciate how the behaviour of a random variable can often be summarised by a probability distribution (a mathematical formula);
● recognise the most common probability distributions and be aware of their uses;
● solve a range of probability problems using the appropriate probability distribution.
Complete your diagnostic test for Chapter 3 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL with
additional supporting resources.
Introduction
In this chapter the probability concepts introduced in Chapter 2 are generalised
by using the idea of a probability distribution. A probability distribution lists
in some form all the possible outcomes of a probability experiment and the
probability associated with each one. For example the simplest experiment
is tossing a coin for which the possible outcomes are heads or tails each with
probability one-half. The probability distribution can be expressed in a variety
of ways: in words or in a graphical or mathematical form. For tossing a coin the
graphical form is shown in Figure 3.1 and the mathematical form is
Pr(H) = 1/2
Pr(T) = 1/2
The different forms of presentation are equivalent, but one might be more suited to a particular purpose.
Some probability distributions occur often and so are well known. Because of
this they have names so we can refer to them easily, for example the Binomial distribution or the Normal distribution. In fact each constitutes a family of distributions. A single toss of a coin gives rise to one member of the Binomial distribution family; two tosses would give rise to another member of that family. These two distributions differ in the number of tosses. If a biased coin were
from the previous two because of the different probability of heads.
Members of the Binomial family of distributions are distinguished either by
the number of tosses or by the probability of the event occurring. These are the
two parameters of the distribution and tell us all we need to know about the
distribution. Other distributions might have different numbers of parameters with
different meanings. Some distributions for example have only one parameter.
We will come across examples of different types of distribution throughout the
rest of this book.
In order to understand fully the idea of a probability distribution a new
concept is ﬁrst introduced that of a random variable. As will be seen later in the
chapter an important random variable is the sample mean and to understand
Figure 3.1 The probability distribution for the toss of a coin
how to draw inferences from the sample mean it is important to recognise it as
a random variable.
Random variables
Examples of random variables have already been encountered in Chapter 2 for
example the result of the toss of a coin or the number of boys in a family of
ﬁve children. A random variable is one whose outcome or value is the result of
chance and is therefore unpredictable although the range of possible outcomes
and the probability of each outcome may be known. It is impossible to know
in advance the outcome of a toss of a coin for example but it must be either
heads or tails each with probability one-half. The number of heads in 250 tosses
is another random variable which can take any value between zero and 250
although values near 125 are the most likely. (You are very unlikely to get 250 heads from tossing a fair coin!)
Intuitively most people would ‘expect’ to get 125 heads from 250 tosses of
the coin since heads comes up half the time on average. This suggests we could
use the expected value notation introduced in Chapter 1 and write E(X) = 125,
where X represents the number of heads obtained from 250 tosses. This usage
is indeed valid and we will explore this further below. It is a very convenient
shorthand notation.
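The claim that E(X) = 125 can be illustrated by simulation; a Python sketch (the number of repetitions and the seed are arbitrary choices of ours):

```python
import random

# Simulate 250 tosses of a fair coin, repeated many times; the average
# number of heads should be close to E(X) = nP = 125.
random.seed(1)  # arbitrary seed, for reproducibility

def heads_in_250_tosses():
    return sum(random.random() < 0.5 for _ in range(250))

trials = [heads_in_250_tosses() for _ in range(2000)]
mean_heads = sum(trials) / len(trials)
print(mean_heads)  # close to 125 (not exactly, because of sampling variation)
```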
The time of departure of a train is another example of a random variable.
It may be timetabled to depart at 11.15, but it probably (almost certainly!) will not leave at exactly that time. If a sample of ten basketball players were taken
and their average height calculated this would be a random variable. In this
latter case it is the process of taking a sample that introduces the variability
which makes the resulting average a random variable. If the experiment were
repeated a different sample and a different value of the random variable would
be obtained.
The above examples can be contrasted with some things which are not
random variables. If one were to take all basketball players and calculate their
average height the result would not be a random variable. This time there is no
sampling procedure to introduce variability into the result. If the experiment
were repeated the same result would be obtained since the same people would
be measured the second time (this assumes that the population does not change, of course). Just because the value of something is unknown does not mean it
qualiﬁes as a random variable. This is an important distinction to bear in mind
since it is legitimate to make probability statements about random variables ('the probability that the average height of a sample of basketball players is over 195 cm is 60%') but not about parameters ('the probability that the Pope is over six feet is 60%'). Here again there is a difference of opinion between frequentist and subjective schools of thought. The latter group would argue that it
is possible to make probability statements about the Pope’s height. It is a way of
expressing lack of knowledge about the true value. The frequentists would say
the Pope’s height is a fact that we do not happen to know that does not make
it a random variable.
The Binomial distribution
One of the simplest distributions which a random variable can have is the
Binomial. The Binomial distribution arises whenever the underlying probability
experiment has just two possible outcomes for example heads or tails from the
toss of a coin. Even if the coin is tossed many times, so one could end up with one, two, three, ... etc. heads in total, the underlying experiment has only two outcomes, so the Binomial distribution should be used. A counter-example would be the rolling of a die, which has six possible outcomes; in this case the Multinomial distribution (not covered in this book) would be used. Note
however that if we were interested only in rolling a six or not we could use the
Binomial by deﬁning the two possible outcomes as ‘six’ and ‘not-six’. It is often
the case in statistics that by suitable transformation of the data we can use
different distributions to tackle the same problem. We will see more of this later
in the chapter.
The Binomial distribution can therefore be applied to the type of problem
encountered in the previous chapter concerning the sex of children. It provides
a general formula for calculating the probability of r boys in n births or in more
general terms the probability of r 'successes' in n trials.¹ We shall use it to calculate the probabilities of 0, 1, ..., 5 boys in five births.
For the Binomial distribution to apply we first need to assume independence of successive events, and we shall assume that for any birth
Pr(boy) = P = 1/2
It follows that
Pr(girl) = 1 − Pr(boy) = 1 − P = 1/2
Although we have P = 1/2 in this example, the Binomial distribution can be applied for any value of P between 0 and 1.
First we consider the case of r = 5, n = 5, i.e. five boys in five births. This probability is found using the multiplication rule:
Pr(r = 5) = P × P × P × P × P = P^5 = (1/2)^5 = 1/32
The probability of four boys (and then, implicitly, one girl) is
Pr(r = 4) = P × P × P × P × (1 − P) = 1/32
But this gives only one possible ordering of the four boys and one girl. Our
original statement of the problem did not specify a particular ordering of the
children. There are five possible orderings (the single girl could be in any of five positions in rank order). Recall that we can use the combinatorial formula nCr to calculate the number of orderings, giving 5C4 = 5. Hence the probability of four boys and one girl in any order is 5/32. Summarising, the formula for four
boys and one girl is
Pr(r = 4) = 5C4 × P^4 × (1 − P)^1
1. The identification of a boy with 'success' is a purely formal one and is not meant to be pejorative!
For three boys and two girls we obtain
Pr(r = 3) = 5C3 × P^3 × (1 − P)^2 = 10 × 1/8 × 1/4 = 10/32
In a similar manner,
Pr(r = 2) = 5C2 × P^2 × (1 − P)^3 = 10/32
Pr(r = 1) = 5C1 × P^1 × (1 − P)^4 = 5/32
Pr(r = 0) = 5C0 × P^0 × (1 − P)^5 = 1/32
As a check on our calculations we may note that the sum of the probabilities equals 1, as it should, since we have enumerated all possibilities.
A fairly clear pattern emerges. The probability of r boys in n births is given by
Pr(r) = nCr × P^r × (1 − P)^(n−r)
and this is known as the Binomial formula or distribution. The Binomial distribu-
tion is appropriate for analysing problems with the following characteristics:
● There is a number n of trials.
● Each trial has only two possible outcomes, 'success' (with probability P) and 'failure' (probability 1 − P), and the outcomes are independent between trials.
● The probability P does not change between trials.
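The Binomial formula is easily sketched in Python and applied to the five-births example (`binomial_pmf` is our name for the function, not a standard one):

```python
from math import comb

# The Binomial formula: Pr(r) = nCr * P^r * (1 - P)^(n - r)
def binomial_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p) ** (n - r)

# Number of boys in five births, P = 1/2
probs = [binomial_pmf(r, 5, 0.5) for r in range(6)]
print([round(32 * pr) for pr in probs])  # [1, 5, 10, 10, 5, 1] (in 32nds)
print(sum(probs))                        # 1.0 -- the probabilities sum to one
```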
The probabilities calculated by the Binomial formula may be illustrated in a
diagram as shown in Figure 3.2. This is very similar to the relative frequency
distribution which was introduced in Chapter 1. That distribution was based on
empirical data to do with wealth while the Binomial probability distribution
is a theoretical construction built up from the basic principles of probability
theory.
As stated earlier the Binomial is in fact a family of distributions and
each member of this family is distinguished by two parameters n and P. The
Binomial is thus a distribution with two parameters and once their values are
known, the distribution is completely determined, i.e. Pr(r) can be calculated for all values of r. To illustrate the difference between members of the family of the Binomial distribution, Figure 3.3 presents three other Binomial distributions, for different values of P and n. It can be seen that for the value of P = 1/2 the
Figure 3.2 Probability distribution of the number of boys in five children
Figure 3.3 Binomial distributions with different parameter values
distribution is symmetric, while for all other values it is skewed to either the left or the right. Part (b) of the figure illustrates the distribution relating to the worked example of rolling a die described below.
Since the Binomial distribution depends only upon the two values n and P a
shorthand notation can be used rather than using the formula itself. A random
variable r which has a Binomial distribution with the parameters n and P can
be written in general terms as
r ~ B(n, P)    (3.1)
Thus for the previous example of children, where r represents the number of boys,
r ~ B(5, 1/2)
This is simply a brief and convenient way of writing down the information available; it involves no new problems of a conceptual nature. Writing
r ~ B(n, P)
is just a shorthand for
Pr(r) = nCr × P^r × (1 − P)^(n−r)
Teenage weapons
This is a nice example of how knowledge of the Binomial distribution can help our
interpretation of events in the news.
'One in five teens carry weapon' (link on main BBC news web site, 23 July 2007).
Following the link to the text of the story we read:
'One in five young teenagers say that their friends are carrying knives and weapons, says a major annual survey of schoolchildren's health and wellbeing.'
With concerns about knife crime among teenagers, this survey shows that a fifth of youngsters are 'fairly sure' or 'certain' that their male friends are carrying a weapon.
Notice, incidentally, how the story subtly changes. The headline suggests 20% of teenagers carry a weapon. The text then says this is what young teenagers report of their friends. It then reveals that some are only 'fairly sure' and that it applies to boys, not girls. By now our suspicions should be aroused. What is the truth?
Note that you are more likely to know someone who carries a weapon than to carry one yourself. Let p be the proportion who truly carry a weapon. Assume also that each person has 10 friends. What is the probability that a person selected at random has no friends who carry a weapon? Assuming independence, this is given by (1 − p)^10. Hence the probability of at least one friend with a weapon is 1 − (1 − p)^10. This is the proportion of people who will report having at least one friend with a weapon. How does this vary with p? This is set out in the following table:
p (%)    1 − (1 − p)^10 (%)
0.0      0
0.5      5
1.0      10
1.5      14
2.0      18
2.5      22
3.0      26
3.5      30
4.0      34
Thus a true proportion of just over 2% carrying weapons will generate a report suggesting 20% know someone carrying a weapon! This is much less alarming, and less newsworthy, than the original story.
You might like to test the assumptions. What happens if there are more than 10 friends assumed? What happens if events are not independent, i.e. having one friend with a weapon increases the probability of another friend with a weapon?
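The table's calculation is easy to reproduce and experiment with; a Python sketch (10 friends and independence are the text's assumptions; the function name is ours):

```python
# Probability of at least one friend carrying a weapon, given the true
# proportion p of carriers; independence and 10 friends are assumed.
def at_least_one(p, friends=10):
    return 1 - (1 - p) ** friends

for p in (0.005, 0.01, 0.02, 0.04):
    print(f"p = {p:.1%}  ->  report rate = {at_least_one(p):.0%}")
# p = 0.5%  ->  report rate = 5%
# p = 1.0%  ->  report rate = 10%
# p = 2.0%  ->  report rate = 18%
# p = 4.0%  ->  report rate = 34%

# Sensitivity check: doubling the number of friends raises the report rate
print(f"{at_least_one(0.02, friends=20):.0%}")  # 33%
```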
The mean and variance of the Binomial distribution
In Chapter 1 we calculated the mean and variance of a set of data, the distribution of wealth. The picture of that distribution (Figure 1.9) looks not too dissimilar to one of the Binomial distributions shown in Figure 3.3 above. This
suggests that we can calculate the mean and variance of a Binomial distribution
just as we did for the empirical distribution of wealth. Calculating the mean
would provide the answer to a question such as 'If we have a family with five children, how many do we expect to be boys?'. Intuitively the answer seems clear: 2.5, even though such a family could not exist. The Binomial formula
allows us to conﬁrm this intuition.
The mean and variance are most easily calculated by drawing up a relative frequency table based on the Binomial frequencies. This is shown in Table 3.1 for the values n = 5 and P = 1/2. Note that r is equivalent to x in our usual notation and Pr(r), the relative frequency, is equivalent to f(x)/∑f(x). The mean of this distribution is given by
E(r) = ∑r × Pr(r) / ∑Pr(r) = (80/32)/(32/32) = 2.5    (3.2)
Table 3.1 Calculating the mean and variance of the Binomial distribution
r        Pr(r)    r × Pr(r)    r^2 × Pr(r)
0        1/32     0            0
1        5/32     5/32         5/32
2        10/32    20/32        40/32
3        10/32    30/32        90/32
4        5/32     20/32        80/32
5        1/32     5/32         25/32
Totals   32/32    80/32        240/32
and the variance is given by
V(r) = ∑r^2 × Pr(r) / ∑Pr(r) − μ^2 = (240/32)/(32/32) − 2.5^2 = 7.5 − 6.25 = 1.25    (3.3)
The mean value tells us that in a family of ﬁve children we would expect on
average two and a half boys. Obviously no single family can be like this; it is the average over all such families. The variance is more difficult to interpret
intuitively but it tells us something about how the number of boys in different
families will be spread around the average of 2.5.
There is a quicker way to calculate the mean and variance of the Binomial
distribution. It can be shown that the mean can be calculated as nP i.e. the
number of trials times the probability of success. For example in a family with
ﬁve children and an equal probability that each child is a boy or a girl then we
expect nP = 5 × 1/2 = 2.5 to be boys.
The variance can be calculated as nP(1 − P). This gives 5 × 1/2 × 1/2 = 1.25, as found above by extensive calculation.
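A quick check that the shortcut formulas nP and nP(1 − P) agree with the full enumeration of Table 3.1; a Python sketch:

```python
from math import comb

# Check the shortcut formulas nP and nP(1 - P) against full enumeration.
n, p = 5, 0.5
pmf = [comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))                  # E(r)
variance = sum(r**2 * pr for r, pr in enumerate(pmf)) - mean**2  # V(r)

print(mean, n * p)                # 2.5 2.5
print(variance, n * p * (1 - p))  # 1.25 1.25
```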
Worked example 3.1 Rolling a die
If a die is thrown four times, what is the probability of getting two or more sixes? This is a problem involving repeated experiments (rolling the die) with but two types of outcome for each roll: success (a six) or failure (anything but a six). Note that we combine several possibilities (scores of 1, 2, 3, 4 or 5) together and represent them all as failure. The probability of success (one-sixth) does not vary from one experiment to another and so use of the Binomial distribution is appropriate. The values of the parameters are n = 4 and P = 1/6. Denoting by r the random variable 'the number of sixes in four rolls of the die', then
r ~ B(4, 1/6)
Hence
Pr(r) = nCr × P^r × (1 − P)^(n−r)
where P = 1/6 and n = 4. The probabilities of two, three and four sixes are then given by
Pr(r = 2) = 4C2 × (1/6)^2 × (5/6)^2 = 0.116
Pr(r = 3) = 4C3 × (1/6)^3 × (5/6)^1 = 0.015
Pr(r = 4) = 4C4 × (1/6)^4 × (5/6)^0 = 0.00077
Since these events are mutually exclusive, the probabilities can simply be added together to achieve the desired result, which is 0.132 or 13.2%. This is the probability of two or more sixes in four rolls of a die.
This result can be illustrated diagrammatically as part of the area under the
appropriate Binomial distribution shown in Figure 3.4.
The shaded areas represent the probabilities of two or more sixes and together their area represents 13.2% of the whole distribution. This illustrates
an important principle: that probabilities can be represented by areas under
an appropriate probability distribution. We shall see more of this later.
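The worked example can be confirmed numerically; a Python sketch (also showing the equivalent calculation via the complement):

```python
from math import comb

# Two or more sixes in four rolls: sum the mutually exclusive Binomial terms.
n, p = 4, 1 / 6

def pr(r):
    return comb(n, r) * p**r * (1 - p) ** (n - r)

two_or_more = pr(2) + pr(3) + pr(4)
print(round(two_or_more, 3))        # 0.132
print(round(1 - pr(0) - pr(1), 3))  # 0.132 -- the complement gives the same
```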
Figure 3.4 Probability of two or more sixes in four rolls of a die
Exercise 3.1
(a) The probability of a randomly drawn individual having blue eyes is 0.6. What is the probability that four people drawn at random all have blue eyes?
(b) What is the probability that two of the sample of four have blue eyes?
(c) For this particular example, write down the Binomial formula for the probability of r blue-eyed individuals, for r = 0, ..., 4. Confirm that the probabilities sum to one.
Exercise 3.2
(a) Calculate the mean and variance of the number of blue-eyed individuals in the previous exercise.
(b) Draw a graph of this Binomial distribution and on it mark the mean value and the mean value ± one standard deviation.
Having introduced the concept of probability distributions using the Binomial
we now move on to the most important of all probability distributions – the
Normal.
The Normal distribution
The Binomial distribution applies when there are two possible outcomes to an
experiment but not all problems fall into this category. For instance the
random arrival time of a train is a continuous variable and cannot be analysed
using the Binomial. There are many probability distributions in statistics devel-
oped to analyse different types of problem. Several of them are covered in this
book, and the most important of them is the Normal distribution, which we now turn to. It was discovered by the German mathematician Gauss in the nineteenth century, in the course of his work on regression (see Chapter 7); hence it is also known as the Gaussian distribution.
Many random variables turn out to be Normally distributed. Men’s or
women's heights are Normally distributed. IQ, the measure of intelligence, is also Normally distributed. Another example is a machine producing, say, bolts with a nominal length of 5 cm: it will actually produce bolts of slightly varying length (the differences would probably be extremely small), due to
factors such as wear in the machinery, slight variations in the pressure of the lubricant, etc. These would result in bolts whose length varies in accordance with the Normal distribution. This sort of process is extremely common, with
the result that the Normal distribution often occurs in everyday situations.
The Normal distribution tends to arise when a random variable is the result
of many independent random influences added together, none of which dominates the others. A man's height is the result of many genetic influences, plus environmental factors such as diet, etc. As a result, height is Normally distributed. If one takes the height of men and women together, the result is not
a Normal distribution however. This is because there is one inﬂuence which
dominates the others: gender. Men are on average taller than women. Many
variables familiar in economics are not Normal, however – incomes, for example – although the logarithm of income is approximately Normal. We shall learn
techniques to deal with such circumstances in due course.
Having introduced the idea of the Normal distribution, what does it look like? It is presented below in graphical and then mathematical forms. Unlike the
Binomial the Normal distribution applies to continuous random variables such
as height and a typical Normal distribution is illustrated in Figure 3.5. Since the
Normal distribution is a continuous one it can be evaluated for all values of x
not just for integers. The ﬁgure illustrates the main features of the distribution:
● It is unimodal having a single central peak. If this were men’s heights it
would illustrate the fact that most men are clustered around the average
height with a few very tall and a few very short people.
● It is symmetric the left and right halves being mirror images of each other.
● It is bell-shaped.
● It extends continuously over all the values of x from minus infinity to plus infinity, although the value of f(x) becomes extremely small as these values are approached (the pages of this book being of only finite width, this last characteristic is not faithfully reproduced!). This also demonstrates that most empirical distributions, such as men's heights, can only be an approximation to the theoretical ideal, although the approximation is close and good enough for practical purposes.
Note that we have labelled the y-axis 'f(x)' rather than 'Pr(x)' as we did for the Binomial distribution. This is because it is areas under the curve that represent probabilities, not the heights. With the Binomial, which is a discrete distribution, one can legitimately represent probabilities by the heights of the bars. For the Normal, although f(x) does not give the probability per se, it does give an
Figure 3.5 The Normal distribution
indication: you are more likely to encounter values from the middle of the distribution, where f(x) is greater, than from the extremes.
In mathematical terms the formula for the Normal distribution is (x is the random variable):
f(x) = (1/(σ√(2π))) × e^(−(1/2)((x−μ)/σ)^2)    (3.4)
The mathematical formulation is not as formidable as it appears. μ and σ are the parameters of the distribution (like n and P for the Binomial, though they have different meanings); π is 3.1416 and e is 2.7183. If the formula is evaluated using different values of x, the values of f(x) obtained will map out a Normal distribution. Fortunately, as we shall see, we do not need to use the mathematical formula in most practical problems.
Like the Binomial the Normal is a family of distributions differing from one
another only in the values of the parameters μ and σ. Several Normal distribu-
tions are drawn in Figure 3.6 for different values of the parameters.
Whatever value of μ is chosen turns out to be the centre of the distribution.
As the distribution is symmetric, μ is its mean. The effect of varying σ is to narrow (small σ) or widen (large σ) the distribution; σ turns out to be the standard deviation of the distribution. The Normal is another two-parameter family
of distributions, like the Binomial, and once the mean μ and the standard deviation σ (or equivalently the variance σ^2) are known, the whole of the distribution can be drawn.
Figure 3.6a The Normal distribution, μ = 20, σ = 5
Figure 3.6b The Normal distribution, μ = 15, σ = 2
Figure 3.6c The Normal distribution, μ = 0, σ = 4
Figure 3.7 Illustration of men's height distribution
The shorthand notation for a Normal distribution is
x ~ N(μ, σ^2)    (3.5)
meaning 'the variable x is Normally distributed with mean μ and variance σ^2'.
This is similar in form to the expression for the Binomial distribution though
the meanings of the parameters are different.
Use of the Normal distribution can be illustrated using a simple example. The height of adult males is Normally distributed with mean height μ = 174 cm and standard deviation σ = 9.6 cm. Let x represent the height of adult males; then
x ~ N(174, 92.16)    (3.6)
and this is illustrated in Figure 3.7. Note that equation (3.6) contains the variance rather than the standard deviation.
What is the probability that a randomly selected man is taller than 180 cm? If all men are equally likely to be selected, this is equivalent to asking what proportion of men are over 180 cm in height. This is given by the area under the Normal distribution to the right of x = 180, i.e. the shaded area in Figure 3.7.
The further from the mean of 174, the smaller the area in the tail of the distribution. One way to find this area would be to make use of equation (3.4), but this requires the use of sophisticated mathematics.
Since this is a frequently encountered problem the answers have been set out
in the tables of the standard Normal distribution. We can simply look up the
solution. However, since there is an infinite number of Normal distributions (one for every combination of μ and σ^2), it would be an impossible task to tabulate
them all. The standard Normal distribution which has a mean of zero and
variance of one is therefore used to represent all Normal distributions. Before
the table can be consulted therefore the data have to be transformed so that
they accord with the standard Normal distribution.
The required transformation is the z score, which was introduced in Chapter 1. This measures the distance between the value of interest (180) and the mean, measured in terms of standard deviations. Therefore we calculate
z = (x − μ)/σ    (3.7)
and z is a Normally distributed random variable with mean 0 and variance 1, i.e. z ~ N(0, 1). This transformation shifts the original distribution μ units to the left
and then adjusts the dispersion by dividing through by σ resulting in a mean
of 0 and variance 1. z is Normally distributed because x is Normally distributed.
The transformation in equation 3.7 retains the Normal distribution shape
despite the changes to mean and variance. If x followed some other distribution
then z would not be Normal either.
It is easy to verify the mean and variance of z using the rules for E and V operators encountered in Chapter 1:
E(z) = E((x − μ)/σ) = (1/σ)(E(x) − μ) = 0, since E(x) = μ
V(z) = V((x − μ)/σ) = (1/σ^2)V(x) = σ^2/σ^2 = 1
Evaluating the z score from our data we obtain
z = (180 − 174)/9.6 = 0.63    (3.8)
This shows that 180 is 0.63 standard deviations above the mean 174 of the
distribution. This is a measure of how far 180 is from 174 and allows us to look
up the answer in tables. The task now is to ﬁnd the area under the standard
Normal distribution to the right of 0.63 standard deviations above the mean.
This answer can be read off directly from the table of the standard Normal dis-
tribution included as Table A2 in the appendix to this book. An excerpt from
Table A2 see page 414 is presented in Table 3.2.
The left-hand column gives the z score to one place of decimals. The appropriate row of the table to consult is the one for z = 0.6, which is shaded. For the second place of decimals (0.03) we consult the appropriate column (also shaded). At their intersection we find the value 0.2643, which is the desired area and
Table 3.2 Areas of the standard Normal distribution excerpt from Table A2
z 0.00 0.01 0.02 0.03 . . . 0.09
0.0 0.5000 0.4960 0.4920 0.4880 . . . 0.4641
0.1 0.4602 0.4562 0.4522 0.4483 . . . 0.4247
...
0.5 0.3085 0.3050 0.3015 0.2981 . . . 0.2776
0.6 0.2743 0.2709 0.2676 0.2643 . . . 0.2451
0.7 0.2420 0.2389 0.2358 0.2327 . . . 0.2148
therefore probability: i.e. 26.43% of the distribution lies to the right of 0.63 standard deviations above the mean. Therefore 26.43% of men are over 180 cm in height.
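Where tables are unavailable, the tail area can be computed directly, since Pr(Z > z) = ½ · erfc(z/√2); a Python sketch (note that the text rounds z = 0.625 to 0.63 before consulting the table, which accounts for a small difference):

```python
from math import erfc, sqrt

# Right-tail area of the standard Normal, Pr(Z > z), via the complementary
# error function; equivalent to the body of Table A2.
def normal_tail(z):
    return 0.5 * erfc(z / sqrt(2))

z = (180 - 174) / 9.6               # = 0.625; the text rounds this to 0.63
print(round(normal_tail(z), 4))     # 0.266, using the unrounded z
print(round(normal_tail(0.63), 4))  # 0.2643, matching Table A2
```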
Use of the standard Normal table is possible because, although there is an infinite number of Normal distributions, they are all fundamentally the same, so
that the area to the right of 0.63 standard deviations above the mean is the same
for all of them. As long as we measure the distance in terms of standard devia-
tions then we can use the standard Normal table. The process of standardisation
turns all Normal distributions into a standard Normal distribution with a mean
of zero and a variance of one. This process is illustrated in Figure 3.8.
The area in the right-hand tail is the same for both distributions. It is the
standard Normal distribution in Figure 3.8b which is tabulated in Table A2. To
demonstrate how standardisation turns all Normal distributions into the standard
Normal the earlier problem is repeated but taking all measurements in inches. The
answer should obviously be the same. Taking 1 inch = 2.54 cm, the figures are
x = 70.87, σ = 3.78, μ = 68.50
What proportion of men are over 70.87 inches in height? The appropriate Normal distribution is now
x ~ N(68.50, 3.78^2)    (3.9)
The z score is
z = (70.87 − 68.50)/3.78 = 0.63    (3.10)
which is the same z score as before and therefore gives the same probability.
Figure 3.8a The Normal distribution
Figure 3.8b The standard Normal distribution corresponding to Figure 3.8a
Worked example 3.2
Packets of cereal have a nominal weight of 750 grams but there is some
variation around this as the machines ﬁlling the packets are imperfect. Let
us assume that the weights follow a Normal distribution. Suppose that the
standard deviation around the mean of 750 is 5 grams. What proportion of
packets weigh more than 760 grams
Summarising our information, we have x ~ N(750, 25), where x represents the weight. We wish to find Pr(x > 760). To be able to look up the answer we need to measure the distance between 760 and 750 in terms of standard deviations. This is
z = (760 − 750)/5 = 2.0
Looking up z 2.0 in Table A2 reveals an area of 0.0228 in the tail of the
distribution. Thus 2.28 of packets weigh more than 760 grams.
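The table lookup can also be reproduced in software. A small sketch using only Python's standard library (the function name `upper_tail` is my own, not from the text):

```python
from math import erfc, sqrt

def upper_tail(x, mu, sigma):
    """Pr(X > x) for X ~ N(mu, sigma^2), via the complementary error function."""
    z = (x - mu) / sigma
    return 0.5 * erfc(z / sqrt(2))

# Worked example 3.2: Pr(weight > 760) for weights ~ N(750, 25)
print(round(upper_tail(760, 750, 5), 4))   # 0.0228, i.e. 2.28% of packets
```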
Since a great deal of use is made of the standard Normal tables, it is worth working through a couple more examples to reinforce the method. We have so far calculated that Pr(z > 0.63) = 0.2643. Since the total area under the graph equals one (i.e. the sum of probabilities must be one), the area to the left of z = 0.63 must equal 0.7357, i.e. 73.57% of men are under 180 cm. It is fairly easy to manipulate areas under the graph to arrive at any required area. For example, what proportion of men are between 174 and 180 cm in height? It is helpful to refer to Figure 3.9 at this point.
The size of area A is required. Area B has already been calculated as 0.2643. Since the distribution is symmetric, the area A + B must equal 0.5, since 174 is at the centre (mean) of the distribution. Area A is therefore 0.5 − 0.2643 = 0.2357, so 23.57% is the desired result.
Figure 3.9: The proportion of men between 174 cm and 180 cm in height.
Chapter 3 • Probability distributions
As a final exercise, consider the question of what proportion of men are between 166 and 178 cm tall. As shown in Figure 3.10, area C + D is wanted. The only way to find this is to calculate the two areas separately and then add them together. For area D, the z score associated with 178 is

z_D = (178 − 174)/9.6 = 0.42   (3.11)

Table A2 (see page 414) indicates that the area in the right-hand tail, beyond z = 0.42, is 0.3372, so area D = 0.5 − 0.3372 = 0.1628. For C, the z score is

z_C = (166 − 174)/9.6 = −0.83   (3.12)

The minus sign indicates that it is the left-hand tail of the distribution, below the mean, which is being considered. Since the distribution is symmetric, it is the same as if it were the right-hand tail, so the minus sign may be ignored when consulting the table. Looking up z = 0.83 in Table A2 gives an area of 0.2033 in the tail, so area C is therefore 0.5 − 0.2033 = 0.2967. Adding areas C and D gives 0.1628 + 0.2967 = 0.4595. So nearly half of all men are between 166 and 178 cm in height.

Figure 3.10: The proportion of men between 166 cm and 178 cm in height.

An alternative interpretation of the results obtained above is that if a man is drawn at random from the adult population, the probability that he is over 180 cm tall is 26.43%. This is in line with the frequentist school of thought. Since 26.43% of the population is over 180 cm in height, that is the probability of a man over 180 cm being drawn at random.

Exercise 3.3
(a) The random variable x is distributed Normally, with x ~ N(40, 36). Find the probability that x > 50.
(b) Find Pr(x < 45).
(c) Find Pr(36 < x < 44).

Statistics in practice: Using software to find areas under the standard Normal distribution

If you use a spreadsheet program, you can look up the z distribution directly and hence dispense with tables. In Excel, for example, the function '=NORMSDIST(0.63)' gives the answer 0.7357, i.e. the area to the left of the z score. The area in the right-hand tail is then obtained by subtracting this value from 1, i.e. 1 − 0.7357 = 0.2643. Entering the formula '=1 − NORMSDIST(0.63)' in a cell will give the area in the right-hand tail directly.
Exercise 3.4
The mean ± 0.67 standard deviations cuts off 25% in each tail of the Normal distribution. Hence the middle 50% of the distribution lies within ± 0.67 standard deviations of the mean. Use this fact to calculate the inter-quartile range for the distribution x ~ N(200, 256).

Exercise 3.5
As suggested in the text, the logarithm of income is approximately Normally distributed. Suppose the log (to the base 10) of income has the distribution x ~ N(4.18, 2.56). Calculate the inter-quartile range for x and then take anti-logs to find the inter-quartile range of income.
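The inter-quartile range calculation in Exercise 3.4 can be sketched numerically as follows (0.67 is the tabulated quartile point used in the text; the more precise value is 0.6745):

```python
# Inter-quartile range of x ~ N(200, 256): the quartiles lie 0.67
# standard deviations either side of the mean.
mu, variance = 200, 256
sigma = variance ** 0.5          # 16
z_quartile = 0.67                # tabulated value; more precisely 0.6745

q1 = mu - z_quartile * sigma
q3 = mu + z_quartile * sigma
print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))   # 189.28 210.72 21.44
```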
The sample mean as a Normally distributed variable
One of the most important concepts in statistical inference is the probability distribution of the mean of a random sample, since we often use the sample mean to tell us something about an associated population. Suppose that, from the population of adult males, a random sample of size n = 36 is taken, their heights measured and the mean height of the sample calculated. What can we infer from this about the true average height of the population? To do this we need to know about the statistical properties of the sample mean. The sample mean is a random variable, because of the chance element of random sampling: different samples would yield different values of the sample mean. Since the sample mean is a random variable, it must have associated with it a probability distribution.
We therefore need to know, first, what is the appropriate distribution and, second, what are its parameters. From the definition of the sample mean we have

X̄ = (x1 + x2 + … + xn)/n   (3.13)

where each observation xi is itself a Normally distributed random variable, with xi ~ N(μ, σ²), because each comes from the parent distribution with such characteristics. We stated earlier that men's heights are Normally distributed.
We now make use of the following theorem to demonstrate that X̄ is Normally distributed:

Theorem: Any linear combination of independent Normally distributed random variables is itself Normally distributed.

A linear combination of two variables x1 and x2 is of the form w1x1 + w2x2, where w1 and w2 are constants. This can be generalised to any number of x values. It is clear that the sample mean satisfies these conditions and is a linear combination of the individual x values, with the weight on each observation equal to 1/n. As long as the observations are independently drawn, therefore, the sample mean is Normally distributed.
We now need the parameters (mean and variance) of the distribution. For this we use the E and V operators once again:

E(X̄) = (1/n)E(x1) + (1/n)E(x2) + … + (1/n)E(xn)
     = (1/n)(μ + μ + … + μ) = (1/n)nμ = μ   (3.14)
V(X̄) = V((1/n)x1 + (1/n)x2 + … + (1/n)xn)   (3.15)
     = (1/n²)V(x1) + (1/n²)V(x2) + … + (1/n²)V(xn)
     = (1/n²)(σ² + σ² + … + σ²) = (1/n²)nσ² = σ²/n

Putting all this together, we have²

X̄ ~ N(μ, σ²/n)   (3.16)
This we may summarise in the following theorem:

Theorem: The sample mean X̄, drawn from a population which has a Normal distribution with mean μ and variance σ², has a sampling distribution which is Normal, with mean μ and variance σ²/n, where n is the sample size.
The meaning of this theorem is as follows. First of all, it is assumed that the population from which the samples are to be drawn is itself Normally distributed (this assumption will be relaxed in a moment), with mean μ and variance σ². From this population many samples are drawn, each of sample size n, and the mean of each sample is calculated. The samples are independent, meaning that the observations selected for one sample do not influence the selection of observations in the other samples. This gives many sample means X̄1, X̄2, etc. If these sample means are treated as a new set of observations, then the probability distribution of these observations can be derived. The theorem states that this distribution is Normal, with the sample means centred around μ, the population mean, and with variance σ²/n. The argument is set out diagrammatically in Figure 3.11.
Figure 3.11: The parent distribution and the distribution of sample means. Note: the distribution of X̄ is drawn for a sample size of n = 9. A larger sample size would narrow the X̄ distribution; a smaller sample size would widen it.

² Don't worry if you didn't follow the derivation of this formula, just accept that it is correct.

Intuitively, this theorem can be understood as follows. If the height of adult males is a Normally distributed random variable with mean μ = 174 cm and
variance σ² = 92.16, then it would be expected that a random sample of, say, nine males would yield a sample mean height of around 174 cm, perhaps a little more, perhaps a little less. In other words, the sample mean is centred around 174 cm, or the mean of the distribution of sample means is 174 cm.
The larger the size of the individual samples (i.e. the larger n), the closer the sample mean would tend to be to 174 cm. For example, if the sample size is only two, a sample of two very tall people is quite possible, with a high sample mean as a result, well over 174 cm (e.g. 182 cm). But if the sample size were 20, it is very unlikely that 20 very tall males would be selected, and the sample mean is likely to be much closer to 174. This is why the sample size n appears in the formula for the variance of the distribution of the sample mean, σ²/n.
Note that once again we have transformed one or more random variables (the xi values) with a particular probability distribution into another random variable (X̄) with a slightly different distribution. This is common practice in statistics: transforming a variable will often put it into a more useful form, for example one whose probability distribution is well known.
The above theorem can be used to solve a range of statistical problems. For example, what is the probability that a random sample of nine men will have a mean height greater than 180 cm? The height of all men is known to be Normally distributed with mean μ = 174 cm and variance σ² = 92.16. The theorem can be used to derive the probability distribution of the sample mean. For the population we have

x ~ N(μ, σ²), i.e. x ~ N(174, 92.16)

Hence, for the sample mean,

X̄ ~ N(μ, σ²/n), i.e. X̄ ~ N(174, 92.16/9)

This is shown diagrammatically in Figure 3.12.
To answer the question posed, the area to the right of 180, shaded in Figure 3.12, has to be found. This should by now be a familiar procedure. First the z score is calculated:

z = (X̄ − μ)/√(σ²/n) = (180 − 174)/√(92.16/9) = 1.88   (3.17)
Figure 3.12: The proportion of sample means greater than X̄ = 180.
Note that the z score formula is subtly different, because we are dealing with the sample mean X̄ rather than x itself. In the numerator we use X̄ rather than x, and in the denominator we use σ²/n, not σ². This is because X̄ has a variance σ²/n, not σ², which is the population variance. √(σ²/n) is known as the standard error, to distinguish it from σ, the standard deviation of the population. The principle behind the z score is the same, however: it measures how far a sample mean of 180 is from the population mean of 174, measured in terms of standard deviations.
Looking up the value of z = 1.88 in Table A2 gives an area of 0.0311 in the right-hand tail of the Normal distribution. Thus 3.11% of sample means will be greater than or equal to 180 cm when the sample size is nine. The desired probability is therefore 3.11%.
As this probability is quite small, we might consider the reasons for this. There are two possibilities:
(a) through bad luck, the sample collected is not very representative of the population as a whole;
(b) the sample is representative of the population, but the population mean is not 174 cm after all.
Only one of these two possibilities can be correct. How to decide between them will be taken up later on, in Chapter 5 on hypothesis testing.
It is interesting to examine the difference between the answer for a sample size of nine (3.11%) and the one obtained earlier for a single individual (26.43%). The latter may be considered as a sample of size one from the population. The examples illustrate the fact that the larger the sample size, the closer the sample mean is likely to be to the population mean. Thus larger samples tend to give better estimates of the population mean.
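The sample-mean calculation can be sketched in Python, again via the standard library's error function (`mean_upper_tail` is an illustrative name, not from the text):

```python
from math import erfc, sqrt

def mean_upper_tail(xbar, mu, sigma2, n):
    """Pr(sample mean > xbar) when the population is N(mu, sigma2), sample size n."""
    se = sqrt(sigma2 / n)           # standard error of the mean
    z = (xbar - mu) / se            # here (180 - 174)/3.2 = 1.875
    return 0.5 * erfc(z / sqrt(2))

print(round(mean_upper_tail(180, 174, 92.16, 9), 3))   # ≈ 0.030
```

The text's 3.11% comes from rounding z to 1.88 before consulting the table; working with the unrounded z = 1.875 gives a marginally smaller tail area.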
Statistics in practice: Oil reserves

An interesting application of probability distributions is to the estimation of oil reserves. The quantity of oil in an oil field is not known for certain, but is subject to uncertainty. The proven oil reserve of a field is the amount recoverable with a probability of 90% (known as P90 in the oil industry). One can then add up the proven oil reserves around the world to get a total of proven reserves.
However, using probability theory, we can see this might be misleading. Suppose we have 50 fields where the recoverable quantity of oil is distributed as x ~ N(100, 81) in each. From tables we note that the mean − 1.28 standard deviations cuts off the bottom 10% of the Normal distribution, 88.48 in this case. This is the proven reserve for a field. Summing across the 50 fields gives 4424 as total reserves. But is there a 90% probability of recovering at least this amount?
Using the first theorem above, the total quantity of oil y is distributed Normally, with mean E(y) = E(x1) + … + E(x50) = 5000 and variance V(y) = V(x1) + … + V(x50) = 4050 (assuming independence of the oil fields). Hence we have y ~ N(5000, 4050). Again, the bottom 10% is cut off by the mean − 1.28 standard deviations, which is 4919. This is 11% larger than the 4424 calculated above. Adding up the proven reserves of each field individually underestimates the true total proven reserves. In fact, the probability of total proven reserves being greater than 4424 is almost 100%.
Note that the numbers given here are for illustration purposes and don't reflect the actual state of affairs. The principle of the calculation is correct, however.
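The arithmetic in this box can be reproduced directly (a sketch of the illustrative numbers only):

```python
from math import sqrt

# 50 independent fields, each with recoverable oil x ~ N(100, 81)
n_fields, mu, variance = 50, 100.0, 81.0
z10 = 1.28                                  # cuts off the bottom 10% of a Normal

p90_per_field = mu - z10 * sqrt(variance)   # 88.48, the P90 of one field
sum_of_p90s = n_fields * p90_per_field      # 4424

# Total over all fields: y ~ N(5000, 4050); variances add under independence
total_mu, total_var = n_fields * mu, n_fields * variance
p90_total = total_mu - z10 * sqrt(total_var)

print(round(sum_of_p90s), round(p90_total))   # 4424 and 4919
```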
Sampling from a non-Normal population
The previous theorem and examples relied upon the fact that the population followed a Normal distribution. But what happens if it is not Normal? After all, it is not known for certain that the heights of all adult males are exactly Normally distributed, and there are many populations which are not Normal (e.g. wealth, as shown in Chapter 1). What can be done in these circumstances?
The answer is to use another theorem about the distribution of sample means (presented without proof). This is known as the Central Limit Theorem:

Theorem: The sample mean X̄, drawn from a population with mean μ and variance σ², has a sampling distribution which approaches a Normal distribution, with mean μ and variance σ²/n, as the sample size approaches infinity.

This is very useful, since it drops the assumption that the population is Normally distributed. Note that the distribution of sample means is only Normal as long as the sample size is infinite; for any finite sample size, the distribution is only approximately Normal. However, the approximation is close enough for practical purposes if the sample size is larger than 25 or so observations. If the population distribution is itself nearly Normal, then a smaller sample size would suffice. If the population distribution is particularly skewed, then more than 25 observations would be desirable. Twenty-five observations constitutes a rule of thumb that is adequate in most circumstances. This is another illustration of statistics as an inexact science. It does not provide absolutely clear-cut answers to questions but, used carefully, helps us to arrive at sensible conclusions.
As an example of the use of the Central Limit Theorem, we return to the wealth data of Chapter 1. Recall that the mean level of wealth was 146.984 (measured in £000) and the variance 56 803. Suppose that a sample of n = 50 people were drawn from this population. What is the probability that the sample mean is greater than 160, i.e. £160 000?
On this occasion we know that the parent distribution is highly skewed, so it is fortunate that we have 50 observations. This should be ample for us to justify applying the Central Limit Theorem. The distribution of X̄ is therefore

X̄ ~ N(μ, σ²/n)   (3.18)

and inserting the parameter values this gives³

X̄ ~ N(146.984, 56 803/50)   (3.19)

To find the area beyond a sample mean of 160, the z score is first calculated:

z = (160 − 146.984)/√(56 803/50) = 0.39   (3.20)

Referring to the standard Normal tables, the area in the tail is then found to be 34.83%. This is the desired probability. So there is a probability of 34.83% of finding a mean of £160 000 or greater with a sample of size 50.

³ Note that if we used 146 984 for the mean, we would have 56 803 000 000 as the variance. Using £000 keeps the numbers more manageable. The z score is the same in both cases.

This demonstrates
that there is quite a high probability of getting a sample mean which is a relatively long way from £146 984. This is a consequence of the high degree of dispersion in the distribution of wealth.
Extending this example, we can ask what is the probability of the sample mean lying within, say, £66 000 either side of the true mean of £146 984, i.e. between £80 984 and £212 984? Figure 3.13 illustrates the situation, with the desired area shaded. By symmetry, areas A and B must be equal, so we only need find one of them. For B we calculate the z score:

z = (212.984 − 146.984)/√(56 803/50) = 1.958   (3.21)

From the standard Normal table, this cuts off approximately 2.5% in the upper tail, so area B = 0.475. Areas A and B together therefore make up 95% of the distribution. There is thus a 95% probability of the sample mean falling within the range [80 984, 212 984], and we call this the 95% probability interval for the sample mean. We write this

Pr(80 984 < X̄ < 212 984) = 0.95   (3.22)

or, in terms of the formulae we have used,⁴

Pr(μ − 1.96√(σ²/n) < X̄ < μ + 1.96√(σ²/n)) = 0.95   (3.23)

⁴ 1.96 is the precise value cutting off 2.5% in each tail.

The 95% probability interval and the related concept of the 95% confidence interval (which will be introduced in Chapter 4) play important roles in statistical inference. We deliberately designed the example above to arrive at an answer of 95% for this reason.
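The 95% probability interval can be reproduced as follows (a sketch in £000, using the precise z = 1.96; the text's 80.984 to 212.984 was built from z = 1.958):

```python
from math import sqrt

# 95% probability interval for the sample mean of the wealth data (in £000)
mu, variance, n = 146.984, 56_803, 50
se = sqrt(variance / n)                 # standard error ≈ 33.7
lower = mu - 1.96 * se
upper = mu + 1.96 * se
print(round(lower, 1), round(upper, 1))   # ≈ 80.9 and 213.0
```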
Exercise 3.6
(a) If x is distributed as x ~ N(50, 64) and samples of size n = 25 are drawn, what is the distribution of the sample mean X̄?
(b) If the sample size doubles to 50, how is the standard error of X̄ altered?
(c) Using the sample size of 25: (i) what is the probability of X̄ > 51? (ii) What is Pr(X̄ < 48)? (iii) What is Pr(49 < X̄ < 50.5)?
Figure 3.13: The probability of X̄ lying within £66 000 either side of £146 984.
The relationship between the Binomial and Normal distributions
Many statistical distributions are related to one another in some way. This means that many problems can be solved by a variety of different methods (using different distributions), though usually one is more convenient or more accurate than the others. This point may be illustrated by looking at the relationship between the Binomial and Normal distributions.
Recall the experiment of tossing a coin repeatedly and noting the number of heads. We said earlier that this can be analysed via the Binomial distribution. But note that the number of heads (a random variable) is influenced by many independent random events (the individual tosses) added together. Furthermore, each toss counts equally; none dominates. These are just the conditions under which a Normal distribution arises, so it looks like there is a connection between the two distributions.
This idea is correct. Recall that if a random variable r follows a Binomial distribution, then

r ~ B(n, P)

and the mean of the distribution is nP and the variance nP(1 − P). It turns out that, as n increases, the Binomial distribution becomes approximately the same as a Normal distribution with mean nP and variance nP(1 − P). This approximation is sufficiently accurate as long as nP > 5 and n(1 − P) > 5, so the approximation may not be very good (even for large values of n) if P is very close to zero or one. For the coin-tossing experiment, where P = 0.5, 10 tosses should be sufficient. Note that this approximation is good enough with only 10 observations even though the underlying probability distribution is nothing like a Normal distribution.
To demonstrate, the following problem is solved using both the Binomial and Normal distributions. Forty students take an exam in statistics which is simply graded pass/fail. If the probability P of any individual student passing is 60%, what is the probability of at least 30 students passing the exam?
The sample data are:

P = 0.6
1 − P = 0.4
n = 40
Binomial distribution method
To solve the problem using the Binomial distribution, it is necessary to find the probability of exactly 30 students passing, plus the probability of 31 passing, plus the probability of 32 passing, etc., up to the probability of 40 passing (the fact that the events are mutually exclusive allows this). The probability of 30 passing is

Pr(r = 30) = nCr × P^r × (1 − P)^(n−r)
           = 40C30 × 0.6^30 × 0.4^10 = 0.020

(Note: this calculation assumes that the probabilities are independent, i.e. no copying!) This by itself is quite a tedious calculation, but Pr(31), Pr(32), etc. still
have to be calculated. Calculating these and summing them gives the result of 3.52% as the probability of at least 30 passing. It would be a useful exercise for you to do, if only to appreciate how long it takes.
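The tedious summation is a one-liner in software; a sketch using Python's standard library:

```python
from math import comb

# Exact Binomial: Pr(at least 30 of 40 students pass), with P(pass) = 0.6
p, n = 0.6, 40
prob = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(30, n + 1))
print(round(prob, 4))   # 0.0352, the 3.52% quoted in the text
```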
Normal distribution method
As stated above, the Binomial distribution can be approximated by a Normal distribution with mean nP and variance nP(1 − P). nP in this case is 24 (= 40 × 0.6) and n(1 − P) is 16, both greater than 5, so the approximation can be safely used. Thus

r ~ N(nP, nP(1 − P))

and inserting the parameter values gives

r ~ N(24, 9.6)

The usual methods are then used to find the appropriate area under the distribution. However, before doing so, there is one adjustment to be made (this only applies when approximating the Binomial distribution by the Normal). The Normal distribution is a continuous one, while the Binomial is discrete. Thus 30 in the Binomial distribution is represented by the area under the Normal distribution between 29.5 and 30.5, 31 is represented by 30.5 to 31.5, etc. Thus it is the area under the Normal distribution to the right of 29.5, not 30, which must be calculated. This is known as the continuity correction. Calculating the z score gives

z = (29.5 − 24)/√9.6 = 1.78   (3.24)

This gives an area of 3.75%, not far off the correct answer as calculated by the Binomial distribution. The time saved and ease of calculation would seem to be worth the slight loss in accuracy.
Other examples can be constructed to test this method, using different values of P and n. Small values of n, or values of nP or n(1 − P) less than 5, will give poor results, i.e. the Normal approximation to the Binomial will not be very good.
Exercise 3.7
(a) A coin is tossed 20 times. What is the probability of more than 14 heads? Perform the calculation using both the Binomial and Normal distributions, and compare the results.
(b) A biased coin, for which Pr(H) = 0.7, is tossed 6 times. What is the probability of more than 4 heads? Compare the Binomial and Normal methods in this case. How accurate is the Normal approximation?
(c) Repeat part (b), but for more than 5 heads.
The Poisson distribution
The section above showed how the Binomial distribution could be approximated
by a Normal distribution under certain circumstances. The approximation does
not work particularly well for very small values of P when nP is less than 5. In
these circumstances the Binomial may be approximated instead by the Poisson distribution, which is given by the formula

Pr(x) = μ^x e^(−μ) / x!   (3.25)

where μ is the mean of the distribution (similar to μ for the Normal distribution and nP for the Binomial). Like the Binomial, but unlike the Normal, the Poisson is a discrete probability distribution, so that equation (3.25) is only defined for integer values of x. Furthermore, it is applicable to a series of trials which are independent, as in the Binomial case.
The use of the Poisson distribution is appropriate when the probability of 'success' is very small and the number of trials large. Its use is illustrated by the following example. A manufacturer gives a two-year guarantee on the TV screens it makes. From past experience it knows that 0.5% of its screens will be faulty and fail within the guarantee period. What is the probability that, of a consignment of 500 screens, (a) none will be faulty, (b) more than three are faulty?
The mean of the Poisson distribution in this case is μ = 2.5 (0.5% of 500). Therefore

Pr(x = 0) = 2.5^0 e^(−2.5) / 0! = 0.082   (3.26)

giving a probability of 8.2% of no failures. The answer to this problem via the Binomial method is

Pr(r = 0) = 0.995^500 = 0.0816

Thus the Poisson method gives a reasonably accurate answer. The Poisson approximation to the Binomial is satisfactory if nP is less than about 7.
The probability of more than three screens expiring is calculated as

Pr(x > 3) = 1 − Pr(x = 0) − Pr(x = 1) − Pr(x = 2) − Pr(x = 3)

where

Pr(x = 1) = 2.5^1 e^(−2.5) / 1! = 0.205
Pr(x = 2) = 2.5^2 e^(−2.5) / 2! = 0.256
Pr(x = 3) = 2.5^3 e^(−2.5) / 3! = 0.214

So

Pr(x > 3) = 1 − 0.082 − 0.205 − 0.256 − 0.214 = 0.242

Thus there is a probability of about 24% of more than three failures. The Binomial calculation is much more tedious, but gives an answer of 24.2% also.
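Equation (3.25) and the screen example can be sketched directly:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Pr(x) = mu^x * e^(-mu) / x!  — equation (3.25)."""
    return mu**x * exp(-mu) / factorial(x)

mu = 2.5                     # 0.5% of 500 screens
p_none = poisson_pmf(0, mu)
p_more_than_3 = 1 - sum(poisson_pmf(x, mu) for x in range(4))
print(round(p_none, 3), round(p_more_than_3, 3))   # 0.082 0.242
```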
The Poisson distribution is also used in problems where events occur over time, such as goals scored in a football match (see Problem 3.25) or queuing-type problems (e.g. arrivals at a bank cash machine). In these problems there is no natural 'number' of trials, but it is clear that, if we take a short interval
of time, the probability of an event occurring is small. We can then consider the number of trials to be the number of time intervals. This is illustrated by the following example. A football team scores, on average, two goals every game (you can vary the example by using your own favourite team, plus their scoring record). What is the probability of the team scoring zero or one goal during a game?
The mean of the distribution is 2, so we have, using the Poisson distribution:

Pr(x = 0) = 2^0 e^(−2) / 0! = 0.135
Pr(x = 1) = 2^1 e^(−2) / 1! = 0.271

You should continue to calculate the probabilities of 2 or more goals and verify that the probabilities sum to 1.
A queuing-type problem is the following. If a shop receives, on average, 20 customers per hour, what is the probability of no customers within a five-minute period, while the owner takes a coffee break?
The average number of customers per five-minute period is 20 × 5/60 = 1.67. The probability of a free five-minute spell is therefore

Pr(x = 0) = 1.67^0 e^(−1.67) / 0! = 0.189

a probability of about 19%. Note that this problem cannot be solved by the Binomial method, since n and P are not known separately, only their product.
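The time-interval trick — converting a rate per hour into a mean for the interval of interest — looks like this in code (a sketch of the shop example):

```python
from math import exp

# 20 customers per hour; mean number arriving in a 5-minute window
rate_per_hour = 20
mu = rate_per_hour * 5 / 60      # ≈ 1.67 customers per 5 minutes
p_no_customers = exp(-mu)        # Poisson Pr(x = 0)
print(round(p_no_customers, 3))  # ≈ 0.189
```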
Exercise 3.8
(a) The probability of winning a prize in a lottery is 1 in 50. If you buy 50 tickets, what is the probability that (i) 0 tickets win, (ii) 1 ticket wins, (iii) 2 tickets win? (iv) What is the probability of winning at least one prize?
(b) On average, a person buys a lottery ticket in a supermarket every 5 minutes. What is the probability that 10 minutes will pass with no buyers?
Statistics in practice: Railway accidents

Andrew Evans of University College London used the Poisson distribution to examine the numbers of fatal railway accidents in Britain between 1967 and 1997. Since railway accidents are, fortunately, rare, the probability of an accident in any time period is very small, and so use of the Poisson distribution is appropriate. He found that the average number of accidents had been falling over time and by 1997 had reached 1.25 per annum. This figure is therefore used as the mean μ of the Poisson distribution, and we can calculate the probabilities of 0, 1, 2, etc. accidents each year. Using μ = 1.25 and inserting this into equation (3.25), we obtain the following distribution, which can also be graphed:

Number of accidents:  0      1      2      3      4      5      6
Probability:          0.287  0.358  0.224  0.093  0.029  0.007  0.002
Thus the most likely outcome is one fatal accident per year, and anything over four is extremely unlikely. In fact, Evans found that the Poisson was not a perfect fit to the data: the actual variation was less than that predicted by the model.
Source: A. W. Evans, 'Fatal train accidents on Britain's mainline railways', J. Royal Statistical Society, Series A, 2000, 163 (1), 99–119.
Summary
● The behaviour of many random variables (e.g. the result of the toss of a coin) can be described by a probability distribution (in this case, the Binomial distribution).
● The Binomial distribution is appropriate for problems where there are only two possible outcomes of a chance event (e.g. heads/tails, success/failure) and the probability of success is the same each time the experiment is conducted.
● The Normal distribution is appropriate for problems where the random variable has the familiar bell-shaped distribution. This often occurs when the variable is influenced by many independent factors, none of which dominates the others. An example is men's heights, which are Normally distributed.
● The Poisson distribution is used in circumstances where there is a very low probability of 'success' and a high number of trials.
● Each of these distributions is actually a family of distributions, differing in the parameters of the distribution. Both the Binomial and Normal distributions have two parameters: n and P in the former case, μ and σ² in the latter. The Poisson distribution has one parameter, its mean μ.
● The mean of a random sample follows a Normal distribution because it is influenced by many independent factors (the sample observations), none of which dominates in the calculation of the mean. This statement is always true if the population from which the sample is drawn follows a Normal distribution.
● If the population is not Normally distributed, then the Central Limit Theorem states that the sample mean is Normally distributed in large samples. In this case, 'large' means a sample of about 25 or more.
Key terms and concepts
Binomial distribution · Central Limit Theorem · Normal distribution · parameters of a distribution · Poisson distribution · probability distribution · random variable · standard error · standard Normal distribution
Problems
Some of the more challenging problems are indicated by highlighting the problem number in colour.
3.1 Two dice are thrown and the sum of the two scores is recorded. Draw a graph of the resulting probability distribution of the sum and calculate its mean and variance. What is the probability that the sum is 9 or greater?
3.2 Two dice are thrown and the absolute difference of the two scores recorded. Graph the resulting probability distribution and calculate its mean and variance. What is the probability that the absolute difference is 4 or more?
3.3 Sketch the probability distribution for the likely time of departure of a train. Locate the timetabled departure time on your chart.
3.4 A train departs every half hour. You arrive at the station at a completely random moment. Sketch the probability distribution of your waiting time. What is your expected waiting time?
3.5 Sketch the probability distribution for the number of accidents on a stretch of road in one day.
3.6 Sketch the probability distribution for the number of accidents on the same stretch of road in one year. How and why does this differ from your previous answer?
3.7 Six dice are rolled and the number of sixes is noted. Calculate the probabilities of 0, 1, . . . , 6 sixes and graph the probability distribution.
3.8 If the probability of a boy in a single birth is 1/2 and is independent of the sex of previous babies, then the number of boys in a family of 10 children follows a Binomial distribution with mean 5 and variance 2.5. In each of the following instances, describe how the distribution of the number of boys differs from the Binomial described above.
(a) The probability of a boy is 6/10.
(b) The probability of a boy is 1/2, but births are not independent. The birth of a boy makes it more than an even chance that the next child is a boy.
(c) As (b) above, except that the birth of a boy makes it less than an even chance that the next child will be a boy.
(d) The probability of a boy is 6/10 on the first birth. The birth of a boy makes it a more than even chance that the next baby will be a boy.
Chapter 3 • Probability distributions
3.9 A firm receives components from a supplier in large batches, for use in its production
process. Production is uneconomic if a batch containing 10% or more defective
components is used. The firm checks the quality of each incoming batch by taking
a sample of 15 and rejecting the whole batch if more than one defective component
is found.
(a) If a batch containing 10% defectives is delivered, what is the probability of its being
accepted?
(b) How could the firm reduce this probability of erroneously accepting bad batches?
(c) If the supplier produces a batch with 3% defective, what is the probability of the firm
sending back the batch?
(d) What role does the assumption of a 'large' batch play in the calculation?
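A sketch of the calculation behind Problem 3.9, assuming the 'large batch' condition lets us treat each sampled component as independently defective with probability p (the helper names are my own):

```python
from math import comb

def binomial_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

def prob_accept(p, n=15):
    # The batch is accepted if the sample contains at most one defective
    return binomial_pmf(0, n, p) + binomial_pmf(1, n, p)

accept_bad = prob_accept(0.10)      # part (a): 10% defective batch
reject_ok = 1 - prob_accept(0.03)   # part (c): 3% defective batch
print(accept_bad, reject_ok)
```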
3.10 The UK record for the number of children born to a mother is 39, 32 of them girls.
Assuming the probability of a girl in a single birth is 0.5 and that this probability is
independent of previous births:
(a) find the probability of 32 girls in 39 births (you'll need a scientific calculator or a
computer to help with this);
(b) does this result cast doubt on the assumptions?
3.11 Using equation 3.5 describing the Normal distribution, and setting μ = 0 and σ² = 1, graph
the distribution for the values x = −2, −1.5, −1, −0.5, 0, 0.5, 1, 1.5, 2.
3.12 Repeat the previous Problem for the values μ = 2 and σ² = 3. Use values of x from −2 to
+6 in increments of 1.
3.13 For the standard Normal variable z, find
(a) Pr(z > 1.64)
(b) Pr(z > 0.5)
(c) Pr(z > −1.5)
(d) Pr(−2 < z < 1.5)
(e) Pr(z < −0.75).
For (a) and (d), shade in the relevant areas on the graph you drew for Problem 3.11.
3.14 Find the values of z which cut off
(a) the top 10%
(b) the bottom 15%
(c) the middle 50%
of the standard Normal distribution.
3.15 If x ~ N(10, 9), find
(a) Pr(x > 12)
(b) Pr(x < 7)
(c) Pr(8 < x < 15)
(d) Pr(x > 10).
3.16 IQ (the intelligence quotient) is Normally distributed with mean 100 and standard
deviation 16.
(a) What proportion of the population has an IQ above 120?
(b) What proportion of the population has an IQ between 90 and 110?
(c) In the past, about 10% of the population went to university. Now the proportion is
about 30%. What was the IQ of the 'marginal' student in the past? What is it now?
3.17 Ten adults are selected at random from the population and their IQ measured. Assume
a population mean of 100 and s.d. of 16, as in Problem 3.16.
(a) What is the probability distribution of the sample average IQ?
(b) What is the probability that the average IQ of the sample is over 110?
(c) If many such samples were taken, in what proportion would you expect the average IQ
to be over 110?
(d) What is the probability that the average IQ lies within the range 90 to 110? How
does this answer compare to the answer to part (b) of Problem 3.16? Account for the
difference.
(e) What is the probability that a random sample of ten university students has an average
IQ greater than 110?
(f) The first adult sampled has an IQ of 150. What do you expect the average IQ of the
sample to be?
3.18 The average income of a country is known to be £10 000 with standard deviation £2500.
A sample of 40 individuals is taken and their average income calculated.
(a) What is the probability distribution of this sample mean?
(b) What is the probability of the sample mean being over £10 500?
(c) What is the probability of the sample mean being below £8000?
(d) If the sample size were 10, why could you not use the same methods to find the
answers to (a)–(c)?
3.19 A coin is tossed 10 times. Write down the distribution of the number of heads:
(a) exactly, using the Binomial distribution;
(b) approximately, using the Normal distribution;
(c) find the probability of four or more heads using both methods. How accurate is the
Normal method, with and without the continuity correction?
3.20 A machine producing electronic circuits has an average failure rate of 15% (they're
difficult to make). The cost of making a batch of 500 circuits is £8400 and the good ones
sell for £20 each. What is the probability of the firm making a loss on any one batch?
3.21 An experienced invoice clerk makes an error once in every 100 invoices, on average.
(a) What is the probability of finding a batch of 100 invoices without error?
(b) What is the probability of finding such a batch with more than two errors?
Calculate the answers using both the Binomial and Poisson distributions. If you try to
solve the problem using the Normal method, how accurate is your answer?
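The comparison asked for in Problem 3.21 can be sketched as follows (standard library only; the function names are my own), using the Poisson with μ = nP = 1 as the approximation to the Binomial:

```python
from math import comb, exp, factorial

n, p = 100, 1 / 100  # one error per 100 invoices on average

def binom_pmf(r):
    return comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(r, mu=n * p):
    return mu**r * exp(-mu) / factorial(r)

no_error_binom = binom_pmf(0)
no_error_pois = poisson_pmf(0)
over_two_binom = 1 - sum(binom_pmf(r) for r in range(3))
over_two_pois = 1 - sum(poisson_pmf(r) for r in range(3))
print(no_error_binom, no_error_pois, over_two_binom, over_two_pois)
```

The two distributions agree closely here because n is large and P small.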
3.22 A firm employing 100 workers has an average absenteeism rate of 4%. On a given day,
what is the probability of (a) no workers, (b) one worker, (c) more than six workers being
absent?
3.23 (Computer project) This problem demonstrates the Central Limit Theorem at work. In
your spreadsheet, use the RAND() function to generate a random sample of 25 observations
(I suggest entering this function in cells A4:A28, for example). Copy these cells
across 100 columns to generate 100 samples. In row 29 calculate the mean of each
sample. Now examine the distribution of these sample means.
(Hint: you will find the RAND() function recalculates automatically every time you perform
an operation in the spreadsheet. This makes it difficult to complete the analysis. The solution
is to copy and then use 'Edit, Paste Special, Values' to create a copy of the values of
the sample means. These will remain stable.)
(a) What distribution would you expect them to have?
(b) What is the parent distribution from which the samples are drawn?
(c) What are the parameters of the parent distribution and of the sample means?
(d) Do your results accord with what you would expect?
(e) Draw up a frequency table of the sample means and graph it. Does it look as you
expected?
(f) Experiment with different sample sizes and with different parent distributions to see
the effect that these have.
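The same experiment can be run outside a spreadsheet. This is a rough Python analogue (uniform random numbers standing in for RAND(); the seed is an arbitrary choice):

```python
import random
import statistics

random.seed(1)

# 100 samples of 25 observations each, drawn from a uniform [0, 1) parent
samples = [[random.random() for _ in range(25)] for _ in range(100)]
sample_means = [statistics.mean(s) for s in samples]

# Parent distribution: mean 0.5, variance 1/12.
# By the CLT the sample means should be approximately N(0.5, (1/12)/25).
print(statistics.mean(sample_means), statistics.variance(sample_means))
```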
3.24 (Project) An extremely numerate newsagent (with a spreadsheet program, as you will
need) is trying to work out how many copies of a newspaper he should order. The cost to
him per copy is 15p, which he then sells at 45p. Sales are distributed Normally with an
average daily sale of 250 and variance 625. Unsold copies cannot be returned for credit or
refund; he has to throw them away, losing 15p per copy.
(a) What do you think the seller's objective should be?
(b) How many copies should he order?
(c) What happens to the variance of profit as he orders more copies?
(d) Calculate the probability of selling more than X copies. (Create an extra column in the
spreadsheet for this.) What is the value of this probability at the optimum number of
copies ordered?
(e) What would the price–cost ratio have to be to justify the seller ordering X copies?
(f) The wholesaler offers a sale or return deal, but the cost per copy is 16p. Should the
seller take up this new offer?
(g) Are there other considerations which might influence the seller's decision?
Hints:
Set up your spreadsheet as follows:
Col. A: (cells A10:A160) 175, 176, . . . up to 325 in unit increments (to represent
sales levels).
Col. B: (cells B10:B160) the probability of sales falling between 175 and 176,
between 176 and 177, etc., up to 325–326. Excel has the NORMDIST
function to do this – see the help facility.
Col. C: (cells C10:C160) total cost = 0.15 × number ordered. (Put the latter in cell
F3, so you can reference it and change its value.)
Col. D: (cells D10:D160) total revenue = MIN(sales, number ordered) × 0.45.
Col. E: profit = revenue − cost.
Col. F: profit × probability, i.e. col. E × col. B.
Cell F161: the sum of F10:F160 (this is the expected profit).
Now vary the number ordered (cell F3) to find the maximum value in F161. You can also
calculate the variance of profit fairly simply, using an extra column.
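The spreadsheet in the hints can also be mirrored in Python, which makes it easy to search over order quantities. This is a sketch under the same discretisation (unit sales bands from 175 to 325); the Normal CDF is built from math.erf and all names are my own:

```python
from math import erf, sqrt

MU, SD = 250, sqrt(625)      # daily sales ~ N(250, 625)
PRICE, COST = 0.45, 0.15

def norm_cdf(x):
    return 0.5 * (1 + erf((x - MU) / (SD * sqrt(2))))

def expected_profit(ordered):
    # Probability of sales in each unit band times profit at that level,
    # as in columns B-F of the suggested spreadsheet
    total = 0.0
    for sales in range(175, 326):
        prob = norm_cdf(sales + 1) - norm_cdf(sales)
        revenue = PRICE * min(sales, ordered)
        total += prob * (revenue - COST * ordered)
    return total

best = max(range(175, 326), key=expected_profit)
print(best, expected_profit(best))
```

The classic newsvendor argument suggests the optimum is roughly where the probability of selling more than the quantity ordered equals the cost-to-price ratio; the search above should agree to within a copy or two.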
3.25 (Project) Using a weekend's football results from the Premier (or other) league, see if
the number of goals per game can be adequately modelled by a Poisson process. First
calculate the average number of goals per game for the whole league, then derive the
distribution of goals per game using the Poisson distribution. Do the actual numbers of
goals per game follow this distribution? You might want to take several weeks' results to
obtain more reliable results.
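A sketch of the fitting procedure, with invented scores standing in for a real weekend's results (replace the goals list with actual data):

```python
from math import exp, factorial

# Hypothetical goals-per-game figures for ten matches (illustrative only)
goals = [2, 3, 1, 0, 4, 2, 2, 1, 3, 0]
mu = sum(goals) / len(goals)  # league average goals per game

def poisson_pmf(r, mu):
    return mu**r * exp(-mu) / factorial(r)

# Expected number of games with r goals under the fitted Poisson model,
# to be compared with the actual frequency count
expected = {r: len(goals) * poisson_pmf(r, mu) for r in range(7)}
actual = {r: goals.count(r) for r in range(7)}
print(expected, actual)
```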
Answers to exercises
Exercise 3.1
(a) 0.6^4 = 0.1296, or 12.96%.
(b) 0.6^2 × 0.4^2 × 4C2 = 0.3456.
(c) Pr(r) = 0.6^r × 0.4^(4−r) × 4Cr. The probabilities of r = 0, . . . , 4 are respectively 0.0256,
0.1536, 0.3456, 0.3456, 0.1296, which sum to one.
Exercise 3.2
(a)
r        Pr(r)     r × Pr(r)   r² × Pr(r)
0        0.0256    0           0
1        0.1536    0.1536      0.1536
2        0.3456    0.6912      1.3824
3        0.3456    1.0368      3.1104
4        0.1296    0.5184      2.0736
Totals   1         2.4         6.72

The mean = 2.4/1 = 2.4 and the variance = 6.72/1 − 2.4² = 0.96. Note that these
are equal to nP and nP(1 − P).
(b) [bar chart of the distribution]
Exercise 3.3
(a) z = (50 − 40)/√36 = 1.67 and the area beyond z = 1.67 is 4.75%.
(b) z = −0.83, so the area is 20.33%.
(c) This is symmetric around the mean, z = ±0.67, and the area within these two
bounds is 49.72%.
Exercise 3.4
To obtain the IQR we need to go 0.67 s.d.s above and below the mean, giving
200 ± 0.67 × 16 = [189.28, 210.72].
Exercise 3.5
The IQR in logs is within 4.18 ± 0.67 × √2.56 = [3.11, 5.25]. Translated out of logs
(using 10^x) this yields [1288.2, 177 827.9].
Exercise 3.6
(a) X ~ N(50, 64/25).
(b) The s.e. gets smaller. It is 1/√2 times its previous value.
(c) (i) z = (51 − 50)/√(64/25) = 0.625, hence the area in the tail = 26.5%. (ii) z = −1.25,
hence the area = 10.56%. (iii) The z values are −0.625 and +0.3125, giving tail areas of
26.5% and 37.8%, totalling 64.3%. The area between the limits is therefore 35.7%.
Exercise 3.7
(a) Binomial method: Pr(r) = 0.5^r × 0.5^(20−r) × 20Cr. This gives probabilities of 15,
16, etc. heads of 0.0148, 0.0046, etc., which total 0.0207 or 2.1%. By the Normal
approximation, r ~ N(10, 5) and z = (14.5 − 10)/√5 = 2.01. The area in the tail
is then 2.22%, not far off the correct value (about a 10% error). Note that nP = 10
= n(1 − P).
(b) Binomial method: Pr(5 or 6 heads) = 0.302 + 0.118 = 0.420, or 42%. By the
Normal, r ~ N(4.2, 1.26), z = 0.267 and the area is 39.36%, still reasonably close
to the correct answer despite the fact that n(1 − P) = 1.8.
(c) By similar methods the answers are 11.8% (Binomial) and 12.3% (Normal).
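These Binomial-versus-Normal figures are easy to verify numerically. A sketch for part (a), using only the standard library (math.erf supplies the Normal CDF):

```python
from math import comb, erf, sqrt

# Exact: Pr(15 or more heads) in 20 tosses of a fair coin
exact = sum(comb(20, r) for r in range(15, 21)) * 0.5**20

# Normal approximation N(10, 5) with continuity correction at 14.5
z = (14.5 - 10) / sqrt(5)
approx = 1 - 0.5 * (1 + erf(z / sqrt(2)))
print(exact, approx)
```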
Exercise 3.8
(a) (i) μ = 1 in this case (1/50 × 50), so Pr(x = 0) = 1^0 e^(−1)/0! = 0.368. (ii) Pr(x = 1)
= 1^1 e^(−1)/1! = 0.368. (iii) 1^2 e^(−1)/2! = 0.184. (iv) 1 − 0.368 = 0.632.
(b) The average number of customers per 10 minutes is 2 (= 10/5). Hence Pr(x = 0)
= 2^0 e^(−2)/0! = 0.135.
Chapter 4
Estimation and confidence intervals
Contents
Learning outcomes
Introduction
Point and interval estimation
Rules and criteria for finding estimates
  Bias
  Precision
  The trade-off between bias and precision: the Bill Gates effect
Estimation with large samples
  Estimating a mean
  Precisely what is a confidence interval?
  Estimating a proportion
  Estimating the difference between two means
  Estimating the difference between two proportions
Estimation with small samples: the t distribution
  Estimating a mean
  Estimating the difference between two means
  Estimating proportions
Summary
Key terms and concepts
Problems
Answers to exercises
Appendix: Derivations of sampling distributions
Learning outcomes
By the end of this chapter you should be able to:
● recognise the importance of probability theory in drawing valid inferences (or
deriving estimates) from a sample of data
● understand the criteria for constructing a good estimate
● construct estimates of parameters of interest from sample data in a variety of
different circumstances
● appreciate that there is uncertainty about the accuracy of any such estimate
● provide measures of the uncertainty associated with an estimate
● recognise the relationship between the size of a sample and the precision of an
estimate derived from it.
Complete your diagnostic test for Chapter 4 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL with
additional supporting resources.
Introduction
We now come to the heart of the subject of statistical inference. Until now, the
following type of question has been examined: given the population parameters
μ and σ², what is the probability of the sample mean X, from a sample of size n,
being greater than some specified value, or within some range of values? The
parameters μ and σ² are assumed to be known and the objective is to try to form
some conclusions about possible values of X. However, in practice it is usually
the sample values X and s² that are known, while the population parameters μ
and σ² are not. Thus a more interesting question to ask is: given the values of X
and s², what can be said about μ and σ²? Sometimes the population variance is
known, and inferences have to be made about μ alone. For example, if a sample
of 50 British families finds an average weekly expenditure on food (X) of £37.50
with a standard deviation (s) of £6.00, what can be said about the average
expenditure (μ) of all British families?
Schematically this type of problem is shown as follows:

Sample information                     Population parameters
X, s²    ── inferences about ──→       μ, σ²

This chapter covers the estimation of population parameters such as μ and σ²,
while Chapter 5 describes testing hypotheses about these parameters. The two
procedures are very closely related.
Point and interval estimation
There are basically two ways in which an estimate of a parameter can be
presented. The ﬁrst of these is a point estimate i.e. a single value which is the
best estimate of the parameter of interest. The point estimate is the one which
is most prevalent in everyday usage for example the average Briton surfs the
internet for 30 minutes per day. Although this is presented as a fact it is actu-
ally an estimate obtained from a survey of people’s use of personal computers.
Since it is obtained from a sample there must be some doubt about its accuracy:
the sample will probably not exactly represent the whole population. For this
reason interval estimates are also used which give some idea of the likely
accuracy of the estimate. If the sample size is small for example then it is quite
possible that the estimate is not very close to the true value and this would be
reﬂected in a wide interval estimate for example that the average Briton spends
between 5 and 55 minutes surﬁng the net per day. A larger sample or a better
method of estimation would allow a narrower interval to be derived and thus
a more precise estimate of the parameter to be obtained such as an average
surﬁng time of between 20 and 40 minutes. Interval estimates are better for the
consumer of the statistics since they not only show the estimate of the parameter
but also give an idea of the conﬁdence which the researcher has in that estimate.
The following sections describe how to construct both types of estimate.
Rules and criteria for ﬁnding estimates
In order to estimate a parameter such as the population mean a rule or set of
rules is required which describes how to derive the estimate of the parameter
from the sample data. Such a rule is known as an estimator. An example of an
estimator for the population mean is ‘use the sample mean’. It is important to
distinguish between an estimator a rule and an estimate which is the value
derived as a result of applying the rule to the data.
There are many possible estimators for any parameter so it is important to be
able to distinguish between good and bad estimators. The following examples
provide some possible estimators of the population mean:
1 the sample mean
2 the smallest sample observation
3 the ﬁrst sample observation.
A set of criteria is needed for discriminating between good and bad estimators.
Which of the above three estimators is 'best'? Two important criteria by which
to judge estimators are bias and precision.
Bias
It is impossible to know if a single estimate of a parameter derived by applying
a particular estimator to the sample data gives a correct estimate of the para-
meter or not. The estimate might be too low or too high and since the parameter
is unknown it is impossible to check this. What is possible however is to say
whether an estimator gives the correct answer on average. An estimator which
gives the correct answer on average is said to be unbiased. Another way of ex-
pressing this is to say that an unbiased estimator does not systematically mislead
the researcher away from the correct value of the parameter. It is however
important to remember that even using an unbiased estimator does not
guarantee that a single use of the estimator will yield a correct estimate of the
parameter. Bias or the lack of it is a theoretical property.
Formally, an estimator is unbiased if its expected value is equal to the parameter
being estimated. Consider trying to estimate the population mean using
the three estimators suggested above. Taking the sample mean first, we have
already learned (see equation 3.15) that its expected value is μ, i.e.

E(X) = μ

which immediately shows that the sample mean is an unbiased estimator.
The second estimator, the smallest observation in the sample, can easily be
shown to be biased, using the result derived above. Since the smallest sample
observation must be less than the sample mean, its expected value must be less
than μ. Denote the smallest observation by xs; then

E(xs) < μ

so this estimator is biased downwards. It underestimates the population mean.
The size of the bias is simply the difference between the expected value of the
estimator and the value of the parameter, so the bias in this case is
Bias = E(xs) − μ   (4.1)

For the sample mean X the bias is obviously zero.
Turning to the third rule the ﬁrst sample observation this can be shown to
be another unbiased estimator. Choosing the ﬁrst observation from the sample
is equivalent to taking a random sample of size one from the population in the
ﬁrst place. Thus the single observation may be considered as the sample mean
from a random sample of size one. Since it is a sample mean it is unbiased as
demonstrated earlier.
Precision
Two of the estimators above were found to be unbiased and in fact there
are many unbiased estimators the sample median is another. Some way of
choosing between the set of all unbiased estimators is therefore required which
is where the criterion of precision helps. Unlike bias precision is a relative
concept comparing one estimator to another. Given two estimators A and B
A is more precise than B if the estimates it yields from all possible samples are
less spread out than those of estimator B. A precise estimator will tend to give
similar estimates for all possible samples.
Consider the two unbiased estimators found above: how do they compare on
the criteria of precision It turns out that the sample mean is the more precise
of the two and it is not difﬁcult to understand why. Taking just a single sample
observation means that it is quite likely to be unrepresentative of the population
as a whole and thus leads to a poor estimate of the population mean. The
sample mean on the other hand is based on all the sample observations and it
is unlikely that all of them are unrepresentative of the population. The sample
mean is therefore a good estimator of the population mean being more precise
than the single observation estimator.
Just as bias was related to the expected value of the estimator, so precision can
be defined in terms of the variance. One estimator is more precise than another
if it has a smaller variance. Recall that the probability distribution of the sample
mean is

X ~ N(μ, σ²/n)   (4.2)

in large samples, so the variance of the sample mean is

V(X) = σ²/n
As the sample size n becomes larger the variance of the sample mean
becomes smaller so the estimator becomes more precise. For this reason large
samples give better estimates than small samples and so the sample mean is
a better estimator than taking just one observation from the sample. The two
estimators can be compared in a diagram see Figure 4.1 which draws the
probability distributions of the two estimators.
It is easily seen that the sample mean yields estimates which are on average
closer to the population mean.
A related concept is that of efﬁciency. The efﬁciency of one unbiased estimator
relative to another is given by the ratio of their sampling variances. Thus the
efﬁciency of the ﬁrst observation estimator relative to the sample mean is given by
Efficiency = var(X)/var(x₁) = (σ²/n)/σ² = 1/n   (4.3)

Thus the efficiency is determined by the relative sample sizes in this case. Other
things being equal, a more efficient estimator is to be preferred.
Similarly, the variance of the median can be shown to be (for a Normal
distribution) (π/2) × σ²/n. The efficiency of the median is therefore 2/π ≈ 64%.
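The 2/π result can be checked by simulation; the sketch below compares the sampling variances of the mean and the median across repeated Normal samples (the asymptotic ratio is 2/π ≈ 0.64, and moderate sample sizes come close):

```python
import random
import statistics

random.seed(0)
N, TRIALS = 25, 4000
means, medians = [], []
for _ in range(TRIALS):
    sample = [random.gauss(0, 1) for _ in range(N)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# Efficiency of the median = var(mean) / var(median)
ratio = statistics.variance(means) / statistics.variance(medians)
print(ratio)
```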
The trade-off between bias and precision: the Bill Gates effect
It should be noted that just because an estimator is biased does not necessarily
mean that it is imprecise. Sometimes there is a trade-off between an unbiased
but imprecise estimator and a biased but precise one. Figure 4.2 illustrates this.
Although estimator A is biased, it will nearly always yield an estimate which
is fairly close to the true value; even though the estimate is expected to be wrong,
it is not likely to be far wrong. Estimator B, although unbiased, can give
estimates which are far away from the true value, so that A might be the preferred
estimator.
Figure 4.1
The sampling
distribution of two
estimators
Figure 4.2
The trade-off between
bias and precision
Note: Curve A shows the distribution of sample means which is the more precise estimator.
B shows the distribution of estimates using a single observation.
As an example of this suppose we are trying to estimate the average wealth
of the US population. Consider the following two estimators:
1 use the mean wealth of a random sample of Americans
2 use the mean wealth of a random sample of Americans but if Bill Gates is
in the sample omit him from the calculation.
Bill Gates is the Chairman of Microsoft and one of the world's richest men.
Because of this he is a dollar billionaire ($50bn or so according to recent
reports – it varies with the stock market). His presence in a sample of, say,
30 observations would swamp the sample and give a highly misleading result.
Assuming Bill Gates has $50bn and the others each have $200 000 of wealth, the
average wealth would be estimated at about $1.6bn, which is surely wrong.
The first rule could therefore give us a wildly incorrect answer, although the
rule is unbiased. The second rule is clearly biased, but does rule out the possibility
of such an unlucky sample. We can work out the approximate bias. It is
the difference between the average wealth of all Americans and the average
wealth of all Americans except Bill Gates. If the true average of all 250 million
Americans is $200 000, then total wealth is $50 000bn. Subtracting Bill's $50bn
leaves $49 950bn shared among the rest, giving $199 800 each, a difference of
0.1%. This is what we would expect the bias to be.
It might seem worthwhile, therefore, to accept this degree of bias in order to
improve the precision of the estimate. Furthermore, if we did use the biased rule,
we could always adjust the sample mean upwards by 0.1% to get an approximately
unbiased estimate.
Of course, this point applies to any exceptionally rich person, not just Bill
Gates. It points to the need to ensure that the rich are neither over- nor under-
represented in the sample. Chapter 9 on sampling methods investigates this
point in more detail. In the rest of this book only unbiased estimators are
considered, the most important being the sample mean.
Estimation with large samples
For the type of problem encountered in this chapter the method of estimation
differs according to the size of the sample. ‘Large’ samples by which is meant
sample sizes of 25 or more are dealt with ﬁrst using the Normal distribution.
Small samples are considered in a later section where the t distribution is used
instead of the Normal. The differences are relatively minor in practical terms and
there is a close theoretical relationship between the t and Normal distributions.
With large samples there are three types of estimation problem we will
consider.
1 The estimation of a mean from a sample of data.
2 The estimation of a proportion on the basis of sample evidence. This would
consider a problem such as estimating the proportion of the population
intending to buy an iPhone based on a sample of individuals. Each person
in the sample would simply indicate whether they have bought or intend
to buy an iPhone. The principles of estimation are the same as in the ﬁrst
case but the formulae used for calculation are slightly different.
3 The estimation of the difference of two means or proportions for example
a problem such as estimating the difference between men and women’s
expenditure on clothes. Once again the principles are the same the formulae
different.
Estimating a mean
To demonstrate the principles and practice of estimating the population mean
we shall take the example of estimating the average wealth of the UK population
(the full data for which were given in Chapter 1). Suppose that we did not
have this information but were required to estimate the average wealth from
a sample of data. In particular, let us suppose that the sample size is n = 100, the
sample mean is X = 130 (in £000) and the sample variance is s² = 50 000.
Obviously this sample has got fairly close to the true values (see Chapter 1), but
we could not know that from the sample alone. What can we infer about the
population mean μ from the sample data alone?
For the point estimate of μ the sample mean is a good candidate since it is
unbiased and it is more precise than other sample statistics such as the median.
The point estimate of μ is simply £130 000 therefore.
The point estimate does not give an idea of the uncertainty associated with the
estimate. We are not absolutely sure that the mean is £130 000 in fact it isn’t –
it is £146 984. The interval estimate gives some idea of the uncertainty. It is centred
on the sample mean but gives a range of values to express the uncertainty.
To obtain the interval estimate we first require the probability distribution of
X, first established in Chapter 3 (equation 3.18):

X ~ N(μ, σ²/n)   (4.4)
From this it was calculated that there is a 95% probability of the sample mean
lying within 1.96 standard errors of μ,¹ i.e.

Pr(μ − 1.96√(σ²/n) ≤ X ≤ μ + 1.96√(σ²/n)) = 0.95

We can manipulate each of the inequalities within the brackets to make μ the
subject of the expression:

μ − 1.96√(σ²/n) ≤ X implies μ ≤ X + 1.96√(σ²/n)

Similarly

X ≤ μ + 1.96√(σ²/n) implies X − 1.96√(σ²/n) ≤ μ

Combining these two new expressions, we obtain

X − 1.96√(σ²/n) ≤ μ ≤ X + 1.96√(σ²/n)   (4.5)

We have transformed the probability interval. Instead of saying X lies within
1.96 standard errors of μ, we now say μ lies within 1.96 standard errors of X.
Figure 4.3 illustrates this manipulation. Figure 4.3(a) shows μ at the centre of a
probability interval for X. Figure 4.3(b) shows a sample mean X at the centre of
an interval relating to the possible positions of μ.
1 See equation 3.23 in Chapter 3 to remind yourself of this. Remember that ±1.96 is the
z score which cuts off 2.5% in each tail of the Normal distribution.
The interval shown in equation (4.5) is called the 95% confidence interval and
this is the interval estimate for μ. In this example the value of σ² is unknown,
but in large samples (n ≥ 25) it can be replaced by s² from the sample. s² is here
used as an estimate of σ², which is unbiased and sufficiently precise in large
(n ≥ 25 or so) samples. The 95% confidence interval is therefore

X − 1.96√(s²/n) ≤ μ ≤ X + 1.96√(s²/n)
130 − 1.96√(50 000/100) ≤ μ ≤ 130 + 1.96√(50 000/100)
[86.2, 173.8]   (4.6)

Thus we are 95% confident that the true average level of wealth lies between
£86 200 and £173 800. It should be noted that £130 000 lies exactly at the
centre of the interval² because of the symmetry of the Normal distribution.
By examining equation (4.6) one can see that the confidence interval is wider:
● the smaller the sample size;
● the greater the standard deviation of the sample.
Figure 4.3(a)
The 95% probability interval for X around the population mean μ
Figure 4.3(b)
The 95% confidence interval for μ around the sample mean X
2
The two values are the lower and upper limits of the interval separated by a comma.
This is the standard way of writing a conﬁdence interval.
The greater uncertainty which is associated with smaller sample sizes is
manifested in a wider conﬁdence interval estimate of the population mean. This
occurs because a smaller sample has more chance of being unrepresentative just
because of an unlucky sample.
Greater variation in the sample data also leads to greater uncertainty about
the population mean and a wider conﬁdence interval. Greater sample variation
suggests greater variation in the population so again a given sample could
include observations which are a long way off the mean. Note that in this example
there is great variation of wealth in the population, and hence in the sample
also. This means that a sample of 100 is not very informative (the confidence
interval is quite wide). We would need a substantially larger sample to obtain a
more precise estimate.
Note that the width of the confidence interval does not depend upon the
population size – a sample of 100 observations reveals as much about a population
of 10 000 as it does about a population of 10 000 000. This point will be
discussed in more detail in Chapter 9 on sampling methods. This is a result that
often surprises people, who generally believe that a larger sample is required if
the population is larger.
Worked example 4.1
A sample of 50 school students found that they spent 45 minutes doing
homework each evening with a standard deviation of 15 minutes. Estimate
the average time spent on homework by all students.
The sample data are: X = 45, s = 15 and n = 50. If we can assume the sample
is representative, we may use X as an unbiased estimate of μ, the population
mean. The point estimate is therefore 45 minutes.
The 95% confidence interval is given by equation (4.6):

X − 1.96√(s²/n) ≤ μ ≤ X + 1.96√(s²/n)
45 − 1.96√(15²/50) ≤ μ ≤ 45 + 1.96√(15²/50)
[40.8, 49.2]

We are 95% confident the true answer lies between 40.8 and 49.2 minutes.
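The arithmetic of equation (4.6) and the worked example can be wrapped in a small function (a sketch; the name is my own and the 1.96 multiplier assumes a large sample):

```python
from math import sqrt

def conf_interval_95(xbar, s, n):
    # Large-sample 95% confidence interval: xbar +/- 1.96 * sqrt(s^2/n)
    se = sqrt(s**2 / n)
    return xbar - 1.96 * se, xbar + 1.96 * se

wealth = conf_interval_95(130, sqrt(50_000), 100)  # wealth example, in £000
homework = conf_interval_95(45, 15, 50)            # Worked example 4.1
print(wealth, homework)
```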
Exercise 4.1
(a) A sample of 100 is drawn from a population. The sample mean is 25 and the
sample standard deviation is 50. Calculate the point and 95% confidence interval
estimates for the population mean.
(b) If the sample size were 64, how would this alter the point and interval estimates?

Exercise 4.2
A sample of size 40 is drawn with sample mean 50 and standard deviation 30. Is it
likely that the true population mean is 60?
Precisely what is a confidence interval?
There is often confusion over what a confidence interval actually means. This
is not really surprising, since the obvious interpretation turns out to be wrong.
It does not mean that there is a 95% chance that the true mean lies within
the interval. We cannot make such a probability statement, because of our
definition of probability, based on the frequentist view. That view states that
one can make a probability statement about a random variable, such as X̄, but
not about a parameter, such as μ. μ either lies within the interval or it does not
– it cannot lie 95% within it. Unfortunately, we just do not know what the
truth is.
It is for this reason that we use the term 'confidence interval' rather than
'probability interval'. Unfortunately, words are not as precise as numbers or
algebra, and so most people fail to recognise the distinction. A precise explanation of the 95% confidence interval runs as follows. If we took many samples, all
the same size, from a population with mean μ, and calculated a confidence interval from each, we would find that μ lies within 95% of the calculated intervals.
Of course, in practice we do not take many samples, usually just one. We do not
know (and cannot know) whether our one sample is one of the 95% or one of the 5%
that miss the mean.
Figure 4.4 illustrates the point. It shows 95% confidence intervals calculated
from 20 samples drawn from a population with a mean of 5. As expected, we see
that 19 of these intervals contain the true mean, while the interval calculated
from sample 18 does not. This is the expected result, but it
is not guaranteed. You might obtain all 20 intervals containing the true mean,
or fewer than 19. In the long run (with lots of estimates) we would expect 95%
of the calculated intervals to contain the true mean.
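This 'long run' interpretation can be demonstrated by simulation. The sketch below is illustrative only: the population mean of 5 echoes Figure 4.4, but the standard deviation, sample size and number of replications are invented for the demonstration.

```python
import math
import random
import statistics

random.seed(1)                         # make the demonstration reproducible
MU, SIGMA, N, REPS = 5.0, 2.0, 50, 1000

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)
    half_width = 1.96 * math.sqrt(s ** 2 / N)
    # does this sample's interval contain the true mean?
    if xbar - half_width <= MU <= xbar + half_width:
        covered += 1

print(covered / REPS)                  # close to 0.95
```

Roughly 95% of the 1000 calculated intervals cover the true mean; any individual interval either does or does not.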
A second question is why we use a probability (and hence a confidence level)
of 95%. In fact, one can choose any confidence level, and thus confidence
Figure 4.4: Confidence intervals calculated from 20 samples
Chapter 4 • Estimation and confidence intervals
interval. The 90% confidence interval can be obtained by finding the z score
which cuts off 10% of the Normal distribution (5% in each tail). From Table A2
(see page 414) this is z = 1.64, so the 90% confidence interval is

X̄ − 1.64√(s²/n) ≤ μ ≤ X̄ + 1.64√(s²/n)    (4.7)

130 − 1.64√(50 000/100) ≤ μ ≤ 130 + 1.64√(50 000/100)

93.3 ≤ μ ≤ 166.7
Notice that this is narrower than the 95% confidence interval. The greater the
degree of confidence required, the wider the interval has to be. Any confidence
level may be chosen, and by careful choice of this level the confidence interval
can be made as wide or as narrow as wished. This would seem to undermine
the purpose of calculating the confidence interval, which is to obtain some idea
of the uncertainty attached to the estimate. This is not the case, however,
because the reader of the results can interpret them appropriately, as long as
the confidence level is made clear. To simplify matters, the 95% and 99% confidence levels are the most commonly used and serve as conventions. Beware of
the researcher who calculates the 76% confidence interval – this may have been
chosen in order to obtain the desired answer rather than in the spirit of scientific
enquiry! The general formula for the (100 − α)% confidence interval is

X̄ − z_α√(s²/n) ≤ μ ≤ X̄ + z_α√(s²/n)    (4.8)

where z_α is the z score which cuts off the extreme α% of the Normal distribution.
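The value z_α need not come from tables: Python's standard library, for example, exposes the inverse Normal c.d.f. (a general-purpose sketch; the function name is ours):

```python
from statistics import NormalDist

def z_alpha(conf_level_pct):
    """z score cutting off the extreme alpha% of the Normal distribution
    (alpha/2 in each tail), for a conf_level_pct% confidence interval."""
    tail = (1 - conf_level_pct / 100) / 2
    return NormalDist().inv_cdf(1 - tail)

print(round(z_alpha(95), 2))   # 1.96
print(round(z_alpha(90), 2))   # 1.64
print(round(z_alpha(99), 2))   # 2.58
```

The suspicious researcher's 76% interval would use z ≈ 1.17.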
Estimating a proportion
It is often the case that we wish to estimate the proportion of the population
that has a particular characteristic (e.g. is unemployed), rather than wanting
an average. Given what we have already learned, this is fairly straightforward
and is based on similar principles. Suppose that, following Chapter 1, we wish
to estimate the proportion of educated men who are unemployed. We have a
random sample of 200 men, of whom 15 are unemployed. What can we infer?

The sample data are

n = 200 and p = 0.075 (= 15/200)

where p is the sample proportion unemployed. We denote the population proportion by the Greek letter π, and it is this that we are trying to estimate using
data from the sample.
The key to solving this problem is recognising p as a random variable just
like the sample mean. This is because its value depends upon the sample drawn
and will vary from sample to sample. Once the probability distribution of this
random variable is established the problem is quite easy to solve using the same
methods as were used for the mean. The sampling distribution of p is³

p ~ N(π, π(1 − π)/n)    (4.9)
³ See the Appendix to this chapter (page 170) for the derivation of this formula.
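Equation (4.9) is easy to verify by simulation. In this sketch the true proportion π = 0.3, the sample size and the number of replications are all invented for illustration:

```python
import random
import statistics

random.seed(42)
PI, N, REPS = 0.3, 100, 2000    # true proportion, sample size, replications

# each sample proportion is (number of successes in N Bernoulli trials) / N
props = [sum(random.random() < PI for _ in range(N)) / N for _ in range(REPS)]

print(round(statistics.mean(props), 2))        # close to pi = 0.3
print(round(statistics.pvariance(props), 4))   # close to pi(1 - pi)/N = 0.0021
```

The simulated sample proportions are centred on π with variance close to π(1 − π)/n, as equation (4.9) states.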
⁴ As usual, the 95% confidence interval limits are given by the point estimate plus and minus 1.96 standard errors.
This tells us that the sample proportion p is centred around the true value π but
varies around it from sample to sample. This variation is expressed by
the variance of p, whose formula is π(1 − π)/n. Having derived the probability
distribution of p, the same methods of estimation can be used as for the sample
mean. Since the expected value of p is π, the sample proportion is an unbiased
estimate of the population parameter. The point estimate of π is therefore simply p.
Thus it is estimated that 7.5% of all educated men are unemployed.
Given the sampling distribution for p in equation (4.9) above, the formula for
the 95% confidence interval⁴ for π can immediately be written down as

p − 1.96√(π(1 − π)/n) ≤ π ≤ p + 1.96√(π(1 − π)/n)    (4.10)

As the value of π is unknown, the confidence interval cannot yet be calculated, so the sample value of 0.075 has to be used instead of the unknown π.
Like the substitution of s² for σ² in the case of the sample mean above, this is
acceptable in large samples. Thus the 95% confidence interval becomes

0.075 − 1.96√(0.075(1 − 0.075)/200) ≤ π ≤ 0.075 + 1.96√(0.075(1 − 0.075)/200)    (4.11)

0.075 − 0.037 ≤ π ≤ 0.075 + 0.037

0.038 ≤ π ≤ 0.112
We say that we are 95% confident that the true proportion of unemployed
educated men lies between 3.8% and 11.2%.
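The calculation is the same arithmetic as before with a different standard error; here is a Python sketch of equations (4.10)–(4.11), using only the figures above (the function name is ours):

```python
import math

def proportion_ci(p, n, z=1.96):
    """Large-sample confidence interval for a proportion:
    p +/- z * sqrt(p(1 - p)/n), with p substituted for the unknown pi."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

lo, hi = proportion_ci(15 / 200, 200)   # 15 unemployed out of 200 men
print(round(lo, 3), round(hi, 3))       # 0.038 0.112
```

The same function applied to the survey figures in Worked example 4.2 below, `proportion_ci(0.36, 1946)`, gives the 0.36 ± 0.021 interval quoted there.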
It can be seen that these two cases apply a common method. The 95% confidence interval is given by the point estimate plus or minus 1.96 standard errors.
For a different confidence level, 1.96 would be replaced by the appropriate value
from the standard Normal distribution.

With this knowledge, two further cases can be swiftly dealt with.
Worked example 4.2 Music down the phone

Do you get angry when you try to phone an organisation and you get an
automated reply followed by music while you hang on? Well, you are not
alone. Mintel, a consumer survey company, asked 1946 adults what they
thought of music played to them while they were trying to get through on
the phone. 36% reported feeling angered by the music and more than one
in four were annoyed by the automated voice response.

With these data we can calculate a confidence interval for the true proportion of people who dislike the music. First, we assume that the sample
is a truly random one. This is probably not strictly true, so our calculated
confidence interval will only be an approximate one. With p = 0.36 and
n = 1946 we obtain the following 95% interval:
0.36 − 1.96√(0.36(1 − 0.36)/1946) ≤ π ≤ 0.36 + 1.96√(0.36(1 − 0.36)/1946)

0.36 − 0.021 ≤ π ≤ 0.36 + 0.021

0.339 ≤ π ≤ 0.381
Mintel further estimated that 2800 million calls were made by customers to
call centres per year, so we can be (approximately) 95% confident that between
949 million and 1067 million of those calls have an unhappy customer on
the line!

Source: The Times, 10 July 2000.
Estimating the difference between two means
We now move on to estimating differences. In this case we have two samples
and want to know whether there is a difference between their respective populations. One sample might be of men, the other of women, or we could be comparing two different countries, etc. A point estimate of the difference is easy to
obtain, but once again there is some uncertainty around this figure, because it is
based on samples. Hence we measure that uncertainty via a confidence interval.
All we require are the appropriate formulae. Consider the following example.
Sixty pupils from school 1 scored an average mark of 62 in an exam, with
a standard deviation of 18; 35 pupils from school 2 scored an average of 70,
with standard deviation 12. Estimate the true difference between the two
schools in the average mark obtained.
This is a more complicated problem than those previously treated, since it
involves two samples rather than one. An estimate has to be found for μ₁ − μ₂,
the true difference in the mean marks of the schools, in the form of both point
and interval estimates. The pupils taking the exams may be thought of as
samples of all pupils in the schools who could potentially take the exams.
Notice that this is a problem about sample means, not proportions, even
though the question deals in percentages. The point is that each observation in
the sample (i.e. each student's mark) can take a value between 0 and 100, and
one can calculate the standard deviation of the marks. For this to be a problem
of sample proportions, the mark for each pupil would have to be of the
pass/fail type, so that one could only calculate the proportion who passed.
It might be thought that the way to approach this problem is to derive
one confidence interval for each sample (along the lines set out above) and
then somehow combine them; for example, the degree of overlap of the two
confidence intervals could be assessed. This is not the best approach, however.
It is sometimes a good strategy, when faced with an unfamiliar problem, to
translate it into a more familiar problem and then solve it using known methods.
This is the procedure which will be followed here. The essential point is to keep
in mind the concept of a random variable and its probability distribution.
Problems involving a single random variable have already been dealt with
above. The current problem deals with two samples and therefore there are two
random variables to consider, i.e. the two sample means X̄₁ and X̄₂. Since the aim
is to estimate μ₁ − μ₂, an obvious candidate for an estimator is the difference
between the two sample means, X̄₁ − X̄₂. We can think of this as a single random
variable, even though two means are involved, and use the methods we have
already learned. We therefore need to establish the sampling distribution of
X̄₁ − X̄₂. This is derived in the Appendix to this chapter (see page 170) and
results in equation (4.12):

X̄₁ − X̄₂ ~ N(μ₁ − μ₂, σ₁²/n₁ + σ₂²/n₂)    (4.12)
This equation states that the difference in sample means will be centred on the
difference in the two population means with some variation around this as
measured by the variance. One assumption behind the derivation of equation
(4.12) is that the two samples are independently drawn. This is likely in this
example; it is difficult to see how the samples from the two schools could be
connected. However, one must always bear this possibility in mind when comparing samples. For example, if one were comparing men's and women's
heights, it would be dangerous to take samples of men and their wives, as they
are unlikely to be independent. People tend to marry partners of a similar height
to themselves, so this might bias the results.
The distribution of X̄₁ − X̄₂ is illustrated in Figure 4.5. Equation (4.12) shows
that X̄₁ − X̄₂ is an unbiased estimator of μ₁ − μ₂. The difference between the
sample means will therefore be used as the point estimate of μ₁ − μ₂. Thus the
point estimate of the true difference between the schools is

X̄₁ − X̄₂ = 62 − 70 = −8
The 95% confidence interval estimate is derived in the same manner as
before, making use of the standard error of the random variable. The formula is⁵

(X̄₁ − X̄₂) − 1.96√(s₁²/n₁ + s₂²/n₂) ≤ μ₁ − μ₂ ≤ (X̄₁ − X̄₂) + 1.96√(s₁²/n₁ + s₂²/n₂)    (4.13)
As the values of σ₁² and σ₂² are unknown, they have been replaced in equation (4.13)
by their sample values. As in the single sample case, this is acceptable in large
samples. The 95% confidence interval for μ₁ − μ₂ is therefore

62 − 70 − 1.96√(18²/60 + 12²/35) ≤ μ₁ − μ₂ ≤ 62 − 70 + 1.96√(18²/60 + 12²/35)
Figure 4.5: The distribution of X̄₁ − X̄₂

⁵ The term under the square root sign in equation (4.13) is the standard error of X̄₁ − X̄₂.
−14.05 ≤ μ₁ − μ₂ ≤ −1.95

The estimate is that school 2's average mark is between 1.95 and 14.05 percentage points above that of school 1. Notice that the confidence interval does
not include the value zero, which would imply equality of the two schools'
marks. Equality of the two schools can thus be ruled out with 95% confidence.
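A sketch of the same calculation in Python (the function name is ours; the figures are those of the two schools):

```python
import math

def diff_means_ci(x1, s1, n1, x2, s2, n2, z=1.96):
    """Large-sample confidence interval for mu1 - mu2, equation (4.13):
    (x1 - x2) +/- z * sqrt(s1^2/n1 + s2^2/n2)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)  # standard error of the difference
    diff = x1 - x2
    return diff - z * se, diff + z * se

# school 1: mean 62, s.d. 18, n = 60; school 2: mean 70, s.d. 12, n = 35
lo, hi = diff_means_ci(62, 18, 60, 70, 12, 35)
print(round(lo, 2), round(hi, 2))   # -14.05 -1.95
```

Since the whole interval lies below zero, equality of the two schools can be ruled out at the 95% confidence level.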
Worked example 4.3
A survey of holidaymakers found that on average women spent 3 hours
per day sunbathing while men spent 2 hours. The sample sizes were 36 in each
case and the standard deviations were 1.1 hours and 1.2 hours respectively.
Estimate the true difference between men and women in sunbathing habits.
Use the 99% confidence level.

The point estimate is simply one hour, the difference of the sample means. For
the confidence interval we have

3 − 2 − 2.57√(1.1²/36 + 1.2²/36) ≤ μ₁ − μ₂ ≤ 3 − 2 + 2.57√(1.1²/36 + 1.2²/36)

0.30 ≤ μ₁ − μ₂ ≤ 1.70

This evidence suggests women do spend more time sunbathing than men (zero
is not in the confidence interval). Note that we might worry that the samples
are not independent here – they could represent 36 couples. If so, the
evidence is likely to underestimate the true difference, if anything, as couples
are likely to spend time sunbathing together.
Estimating the difference between two proportions
We move again from means to proportions. We use a simple example to illustrate
the analysis of this type of problem. Suppose that a survey of 80 Britons showed
that 60 owned personal computers. A similar survey of 50 Swedes showed 30
with computers. Are personal computers more widespread in Britain than Sweden?
Here the aim is to estimate π₁ − π₂, the difference between the two population
proportions, so the probability distribution of p₁ − p₂ is needed, the difference
of the sample proportions. The derivation of this follows similar lines to those
set out above for the difference of two sample means, so is not repeated. The
probability distribution is

p₁ − p₂ ~ N(π₁ − π₂, π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂)    (4.14)
Again, the two samples must be independently drawn for this to be correct; it is
difficult to see how they could not be in this case.
Since the difference between the sample proportions is an unbiased estimate
of the true difference, this will be used for the point estimate. The point estimate
is therefore

p₁ − p₂ = 60/80 − 30/50 = 0.15
or 15%. The 95% confidence interval is given by

(p₁ − p₂) − 1.96√(π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂) ≤ π₁ − π₂ ≤ (p₁ − p₂) + 1.96√(π₁(1 − π₁)/n₁ + π₂(1 − π₂)/n₂)    (4.15)

π₁ and π₂ are unknown, so have to be replaced by p₁ and p₂ for purposes of
calculation, and the interval becomes

0.75 − 0.60 − 1.96√(0.75 × 0.25/80 + 0.60 × 0.40/50) ≤ π₁ − π₂ ≤ 0.75 − 0.60 + 1.96√(0.75 × 0.25/80 + 0.60 × 0.40/50)    (4.16)

−0.016 ≤ π₁ − π₂ ≤ 0.316

The result is a fairly wide confidence interval, due to the relatively small sample
sizes. The interval does include zero, so we cannot even be 95% confident
that there is a difference between the two countries.
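A Python sketch of equations (4.15)–(4.16), using the computer-ownership figures (the function name is ours):

```python
import math

def diff_props_ci(p1, n1, p2, n2, z=1.96):
    """Large-sample confidence interval for pi1 - pi2, equation (4.15),
    with the sample proportions substituted for pi1 and pi2."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Britain: 60 of 80 own a PC; Sweden: 30 of 50
lo, hi = diff_props_ci(60 / 80, 80, 30 / 50, 50)
print(round(lo, 3), round(hi, 3))   # -0.016 0.316
```

Note that zero lies just inside this interval.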
Exercise 4.3

(a) Seven people out of a sample of 50 are left-handed. Estimate the true proportion of left-handed people in the population, finding both point and interval estimates.

(b) Repeat part (a) but find the 90% confidence interval. How does the 90% interval compare with the 95% interval?

(c) Calculate the 99% interval and compare it to the others.

Exercise 4.4

Given the following data from two samples, estimate the true difference between the means. Use the 95% confidence level.

X̄₁ = 25   X̄₂ = 30
s₁ = 18   s₂ = 25
n₁ = 36   n₂ = 49

Exercise 4.5

A survey of 50 16-year-old girls revealed that 40 had a boyfriend. A survey of 100 16-year-old boys revealed 20 with a girlfriend. Estimate the true difference in proportions between the sexes.
Estimation with small samples: the t distribution
So far only large samples (defined as sample sizes in excess of 25) have been dealt
with, which means that, by the Central Limit Theorem, the sampling distribution of X̄ follows a Normal distribution whatever the distribution of the parent
population. Remember, from the two theorems of Chapter 3, that:

● if the population follows a Normal distribution, X̄ is also Normally distributed, and
● if the population is not Normally distributed, X̄ is approximately Normally
distributed in large samples (n ≥ 25).
In both cases confidence intervals can be constructed based on the fact that

(X̄ − μ)/√(σ²/n) ~ N(0, 1)    (4.17)

and so the standard Normal distribution is used to find the values which cut
off the extreme 5% of the distribution (z = ±1.96). In practical examples we
had to replace σ by its estimate, s. Thus the confidence interval was based on
the fact that

(X̄ − μ)/√(s²/n) ~ N(0, 1)    (4.18)

in large samples. For small sample sizes, equation (4.18) is no longer true.
Instead, the relevant distribution is the t distribution, and we have⁶

(X̄ − μ)/√(s²/n) ~ t_{n−1}    (4.19)
The random variable defined in equation (4.19) has a t distribution with n − 1
degrees of freedom. As the sample size increases, the t distribution approaches
the standard Normal, so the latter can be used for large samples.

The t distribution was derived by W. S. Gosset in 1908 while conducting tests
on the average strength of Guinness beer (who says statistics has no impact on
the real world?). He published his work under the pseudonym 'Student', since
the company did not allow its employees to publish under their own names, so
the distribution is sometimes also known as the Student distribution.
The t distribution is in many ways similar to the standard Normal insofar as
it is:
● unimodal
● symmetric
● centred on zero
● bell-shaped
● extends from minus inﬁnity to plus inﬁnity.
⁶ We also require the assumption that the parent population is Normally distributed for equation (4.19) to be true.
The differences are that it is more spread out (has a larger variance) than the
standard Normal distribution, and that it has only one parameter, rather than two:
the degrees of freedom, denoted by the Greek letter ν (pronounced 'nu')⁷. In
problems involving the estimation of a sample mean, the degrees of freedom
are given by the sample size minus one, i.e. ν = n − 1.

The t distribution is drawn in Figure 4.6 for various values of the parameter
ν. Note that the fewer the degrees of freedom (the smaller the sample size), the more
dispersed is the distribution.
To summarise the argument so far: when

● the sample size is small, and
● the sample variance is used to estimate the population variance,

then the t distribution should be used for constructing confidence intervals, not
the standard Normal. This results in a slightly wider interval than would be obtained
using the standard Normal distribution, which reflects the slightly greater uncertainty involved when s² is used as an estimate of σ² and the sample size is small.

Apart from this, the methods are exactly as before and are illustrated by the
examples below. We look first at estimating a single mean, then at estimating
the difference of two means. The t distribution cannot be used for small-sample
proportions (explained below), so these cases are not considered.
Estimating a mean
The following would seem to be an appropriate example. A sample of 15 bottles
of beer showed an average specific gravity of 1035.6, with standard deviation
2.7. Estimate the true specific gravity of the brew.

The sample information may be summarised as

X̄ = 1035.6
s = 2.7
n = 15
⁷ Once again the Greeks pronounce this differently, as 'ni'. They also pronounce π as 'pee'
rather than 'pie', as in English. This makes statistics lectures in English hard for Greeks to
understand!
Figure 4.6: The t distribution drawn for different degrees of freedom
The sample mean is still an unbiased estimator of μ (this is true regardless of
the distribution of the population) and serves as point estimate of μ. The point
estimate of μ is therefore 1035.6.

Since σ is unknown, the sample size is small, and it can be assumed that the
specific gravity of all bottles of beer is Normally distributed (numerous small
random factors affect the specific gravity), we should use the t distribution. Thus

(X̄ − μ)/√(s²/n) ~ t_{n−1}    (4.20)
The 95% confidence interval estimate is given by

X̄ − t_{n−1}√(s²/n) ≤ μ ≤ X̄ + t_{n−1}√(s²/n)    (4.21)

where t_{n−1} is the value of the t distribution which cuts off the extreme 5%
(2.5% in each tail) of the t distribution with ν degrees of freedom. Table A3
(see page 415) gives percentage points of the t distribution, and part of it is
reproduced in Table 4.1.
The structure of the t distribution table is different from that of the standard
Normal table. The first column of the table gives the degrees of freedom. In this
example we want the row corresponding to ν = n − 1 = 14. The appropriate
column of the table is the one headed '0.025', which indicates the area cut off in
each tail. At the intersection of this row and column we find the appropriate
value, t₁₄ = 2.145. Therefore the confidence interval is given by

1035.6 − 2.145√(2.7²/15) ≤ μ ≤ 1035.6 + 2.145√(2.7²/15)

which when evaluated gives

1034.10 ≤ μ ≤ 1037.10

We can be 95% confident that the true specific gravity lies within this range.

If the Normal distribution had (incorrectly) been used for this problem then
the t value of 2.145 would have been replaced by a z score of 1.96, giving a
confidence interval of

1034.23 ≤ μ ≤ 1036.97
Table 4.1 Percentage points of the t distribution (excerpt from Table A3)

Area (α) in each tail
ν        0.4     0.25    0.10    0.05    0.025    0.01     0.005
1        0.325   1.000   3.078   6.314   12.706   31.821   63.656
2        0.289   0.816   1.886   2.920   4.303    6.965    9.925
⋮
13       0.259   0.694   1.350   1.771   2.160    2.650    3.012
14       0.258   0.692   1.345   1.761   2.145    2.624    2.977
15       0.258   0.691   1.341   1.753   2.131    2.602    2.947

Note: The appropriate t value for constructing the confidence interval is found at the
intersection of the ν = 14 row and the 0.025 column.
This underestimates the width of the true confidence interval and gives the impression of a
more precise estimate than is actually the case. Use of the Normal distribution
leads to a confidence interval which is approximately 8.7% too narrow in this case.
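Reproducing the beer example in code, with the critical value 2.145 read from Table 4.1 (the comparison function is our own sketch):

```python
import math

def mean_ci(xbar, s, n, crit):
    """Confidence interval xbar +/- crit * sqrt(s^2/n), where crit is a
    z or t critical value supplied by the user (equation 4.21)."""
    half = crit * math.sqrt(s ** 2 / n)
    return xbar - half, xbar + half

# 15 bottles, mean 1035.6, s = 2.7; t with 14 d.f., 2.5% in each tail: 2.145
t_lo, t_hi = mean_ci(1035.6, 2.7, 15, 2.145)
z_lo, z_hi = mean_ci(1035.6, 2.7, 15, 1.96)   # the (incorrect) Normal version

print(round(t_lo, 2), round(t_hi, 2))   # 1034.1 1037.1
too_narrow = 1 - (z_hi - z_lo) / (t_hi - t_lo)
print(round(100 * too_narrow, 1))       # the Normal interval is ~9% too narrow
```

The ratio of the interval widths is simply 1.96/2.145, so the comparison does not depend on the data at all.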
Estimating the difference between two means
As in the case of a single mean, the t distribution needs to be used in small
samples when the population variances are unknown. Again, both parent
populations must be Normally distributed and, in addition, it must be assumed
that the population variances are equal, i.e. σ₁² = σ₂² (this is required in the
mathematical derivation of the t distribution). This latter assumption was not
required in the large-sample case using the Normal distribution. Consider the
following example as an illustration of the method.
A sample of 20 Labour-controlled local authorities shows that they spend
an average of £175 per taxpayer on administration, with a standard deviation of
£25. A similar survey of 15 Conservative-controlled authorities finds an average
figure of £158, with standard deviation £30. Estimate the true difference in
expenditure between Labour and Conservative authorities.
The sample information available is

X̄₁ = 175   X̄₂ = 158
s₁ = 25    s₂ = 30
n₁ = 20    n₂ = 15
We wish to estimate μ₁ − μ₂. The point estimate of this is X̄₁ − X̄₂, which is an
unbiased estimate. This gives 175 − 158 = 17 as the expected difference between
the two sets of authorities.
For the confidence interval, the t distribution has to be used, since the sample
sizes are small and the population variances unknown. It is assumed that the
populations are Normally distributed and that the samples have been independently drawn. We also assume that the population variances are equal, which
seems justified since s₁ and s₂ do not differ by much (this kind of assumption is
tested in Chapter 6). The confidence interval is given by the formula

(X̄₁ − X̄₂) − t_ν√(S²/n₁ + S²/n₂) ≤ μ₁ − μ₂ ≤ (X̄₁ − X̄₂) + t_ν√(S²/n₁ + S²/n₂)    (4.22)

where

S² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²)/(n₁ + n₂ − 2)    (4.23)

is known as the pooled variance and

ν = n₁ + n₂ − 2

gives the degrees of freedom associated with the t distribution.
S² is an estimate of the common value of the population variances. It would
be inappropriate to have the differing values s₁² and s₂² in the formula for this t
distribution, for this would be contrary to the assumption that σ₁² = σ₂², which is
essential for the use of the t distribution. The estimate of the common population variance is just the weighted average of the sample variances, using degrees
of freedom as weights. Each sample has n − 1 degrees of freedom, and the total
number of degrees of freedom for the problem is the sum of the degrees of
freedom in each sample. The degrees of freedom are thus ν = 20 + 15 − 2 = 33,
and hence the value t = 2.042 cuts off the extreme 5% of the distribution. The
t table in the Appendix does not give the value for ν = 33, so we have used ν = 30
instead, which will give a close approximation.
To evaluate the 95% confidence interval we first calculate S²:

S² = ((20 − 1) × 25² + (15 − 1) × 30²)/(20 + 15 − 2) = 741.6

Inserting this into equation (4.22) gives

17 − 2.042√(741.6/20 + 741.6/15) ≤ μ₁ − μ₂ ≤ 17 + 2.042√(741.6/20 + 741.6/15)

−1.99 ≤ μ₁ − μ₂ ≤ 35.99
Thus the true difference is quite uncertain, and the evidence is even consistent with Conservative authorities spending more than Labour authorities.
The large degree of uncertainty arises because of the small sample sizes and the
quite wide variation within each sample.

One should be careful about the conclusions drawn from this test. The greater
expenditure on administration could be because of inefficiency or because
of a higher level of services provided. To find out which is the case would require
further investigation. The statistical test carried out here examines the levels of
expenditure, but not whether they are productive.
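Equations (4.22) and (4.23) in code, with the local-authority figures and the t value of 2.042 used above (a sketch; the function name is ours):

```python
import math

def pooled_t_ci(x1, s1, n1, x2, s2, n2, t_crit):
    """Small-sample CI for mu1 - mu2 assuming equal population variances:
    pools the sample variances (equation 4.23), then applies equation (4.22)."""
    s2_pool = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    half = t_crit * math.sqrt(s2_pool / n1 + s2_pool / n2)
    diff = x1 - x2
    return s2_pool, diff - half, diff + half

# Labour: mean 175, s.d. 25, n = 20; Conservative: mean 158, s.d. 30, n = 15
S2, lo, hi = pooled_t_ci(175, 25, 20, 158, 30, 15, 2.042)
print(round(S2, 1))                 # about 741.7
print(round(lo, 2), round(hi, 2))   # -1.99 35.99
```

The interval straddles zero, matching the conclusion that the difference between the two sets of authorities is quite uncertain.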
Estimating proportions
Estimating proportions when the sample size is small cannot be done with
the t distribution. Recall that the distribution of the sample proportion p was
derived from the distribution of r (the number of successes in n trials), which
follows a Binomial distribution (see the Appendix to this chapter, page 170).
In large samples the distribution of r is approximately Normal, thus giving
a Normally distributed sample proportion. In small samples it is inappropriate
to approximate the Binomial distribution with the t distribution, and doing so is
in any case unnecessary, since the Binomial itself can be used. Small-sample methods for
the sample proportion should therefore be based on the Binomial distribution,
as set out in Chapter 3. These methods are thus not discussed further here.
Exercise 4.6

A sample of size n = 16 is drawn from a population which is known to be Normally distributed. The sample mean and variance are calculated as 74 and 121. Find the 99% confidence interval estimate for the true mean.

Exercise 4.7

Samples are drawn from two populations to see if they share a common mean. The sample data are:

X̄₁ = 45   X̄₂ = 55
s₁ = 18   s₂ = 21
n₁ = 15   n₂ = 20

Find the 95% confidence interval estimate of the difference between the two population means.
Summary
● Estimation is the process of using sample information to make good estimates
of the value of population parameters, for example using the sample mean to
estimate the mean of a population.
● There are several criteria for finding a good estimate. Two important ones are
the lack of bias and the precision of the estimator. Sometimes there is a trade-off
between these two criteria – one estimator might have a smaller bias but
be less precise than another.
● An estimator is unbiased if it gives a correct estimate of the true value on
average, i.e. its expected value is equal to the true value.
● The precision of an estimator can be measured by its sampling variance (e.g.
s²/n for the mean of a sample).
● Estimates can be in the form of a single value (point estimate) or a range of
values (confidence interval estimate). A confidence interval estimate gives
some idea of how reliable the estimate is likely to be.
● For unbiased estimators, the value of the sample statistic (e.g. X̄) is used as the
point estimate.
● In large samples the 95% confidence interval is given by the point estimate
plus or minus 1.96 standard errors (e.g. X̄ ± 1.96√(s²/n) for the mean).
● For small samples the t distribution should be used instead of the Normal (i.e.
replace 1.96 by the critical value of the t distribution) to construct confidence
intervals of the mean.
Key terms and concepts

bias
confidence level and interval
efficiency
estimator
inference
interval estimate
maximum likelihood
point estimate
precision
Problems

Some of the more challenging problems are indicated by highlighting the problem
number in colour.

4.1 (a) Why is an interval estimate better than a point estimate?
    (b) What factors determine the width of a confidence interval?

4.2 Is the 95% confidence interval (a) twice as wide, (b) more than twice as wide, or (c) less than
twice as wide as the 47.5% interval? Explain your reasoning.

4.3 Explain the difference between an estimate and an estimator. Is it true that a good
estimator always leads to a good estimate?

4.4 Explain why an unbiased estimator is not always to be preferred to a biased one.

4.5 A random sample of two observations, x₁ and x₂, is drawn from a population. Prove that
w₁x₁ + w₂x₂ gives an unbiased estimate of the population mean as long as w₁ + w₂ = 1.
(Hint: Prove that E(w₁x₁ + w₂x₂) = μ.)

4.6 Following the previous question, prove that the most precise unbiased estimate is
obtained by setting w₁ = w₂ = ½.
(Hint: Minimise V(w₁x₁ + w₂x₂) with respect to w₁ after substituting w₂ = 1 − w₁. You will need
a knowledge of calculus to solve this.)

4.7 Given the sample data

X̄ = 40, s = 10, n = 36

calculate the 99% confidence interval estimate of the true mean. If the sample size were
20, how would the method of calculation and width of the interval be altered?

4.8 A random sample of 100 record shops found that the average weekly sale of a particular
CD was 260 copies, with standard deviation of 96. Find the 95% confidence interval to
estimate the true average sale for all shops. To compile the CD chart it is necessary to
know the correct average weekly sale to within 5% of its true value. How large a sample
size is required?

4.9 Given the sample data p = 0.4, n = 50, calculate the 99% confidence interval estimate of
the true proportion.

4.10 A political opinion poll questions 1000 people. Some 464 declare they will vote
Conservative. Find the 95% confidence interval estimate for the Conservative share of
the vote.
4.11 Given the sample data

X̄₁ = 25   X̄₂ = 22
s₁ = 12   s₂ = 18
n₁ = 80   n₂ = 100

estimate the true difference between the means with 95% confidence.
4.12 (a) A sample of 200 women from the labour force found an average wage of £6000 p.a.
with standard deviation £2500. A sample of 100 men found an average wage of £8000
with standard deviation £1500. Estimate the true difference in wages between men
and women.

(b) A different survey, of men and women doing similar jobs, obtained the following
results:

X̄_W = £7200   X̄_M = £7600
s_W = £1225   s_M = £750
n_W = 75      n_M = 50

Estimate the difference between male and female wages using these new data. What
can be concluded from the results of the two surveys?
4.13 67 out of 150 pupils from school A passed an exam; 62 of 120 pupils at school B
passed. Estimate the 99% confidence interval for the true difference between the
proportions passing the exam.
4.14 (a) A sample of 954 adults in early 1987 found that 23% of them held shares. Given
a UK adult population of 41 million, and assuming a proper random sample was
taken, find the 95% confidence interval estimate for the number of shareholders in
the UK.
(b) A 'similar' survey the previous year had found a total of 7 million shareholders.
Assuming 'similar' means the same sample size, find the 95% confidence interval
estimate of the increase in shareholders between the two years.
4.15 A sample of 16 observations from a Normally distributed population yields a sample
mean of 30 with standard deviation 5. Find the 95% confidence interval estimate of the
population mean.
4.16 A sample of 12 families in a town reveals an average income of £15 000 with standard
deviation £6000. Why might you be hesitant about constructing a 95% confidence interval
for the average income in the town?
4.17 Two samples were drawn, each from a Normally distributed population, with the
following results:

    X̄₁ = 45    s₁ = 8    n₁ = 12
    X̄₂ = 52    s₂ = 5    n₂ = 18

Estimate the difference between the population means using the 95% confidence level.
Chapter 4 • Estimation and conﬁdence intervals
4.18 The heights of 10 men and 15 women were recorded, with the following results:

             Mean     Variance
    Men      173.5    80
    Women    162      65

Estimate the true difference between men's and women's heights. Use the 95%
confidence level.
4.19 (Project) Estimate the average weekly expenditure upon alcohol by students. Ask a
reasonably random sample of your fellow students for their weekly expenditure on
alcohol. From this, calculate the 95% confidence interval estimate of such spending by
all students.
Answers to exercises
Exercise 4.1
(a) The point estimate is 25 and the 95% confidence interval is 25 ± 1.96 × 50/√100
= 25 ± 9.8 = [15.2, 34.8].
(b) The CI becomes larger as the sample size reduces. In this case we would have
25 ± 1.96 × 50/√64 = 25 ± 12.25 = [12.75, 37.25]. Note that the width of the
CI is inversely proportional to the square root of the sample size.
Exercise 4.2
The 95% CI is 50 ± 1.96 × 30/√40 = 50 ± 9.30 = [40.70, 59.30]. The value of 60 lies
just outside this CI, so is unlikely to be the true mean.
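The interval arithmetic in Exercises 4.1 and 4.2 is easy to check with a few lines of code. A minimal Python sketch (the function name is ours, not the book's):

```python
import math

def mean_ci(xbar, s, n, z=1.96):
    """Large-sample confidence interval for a mean: xbar +/- z * s/sqrt(n)."""
    half_width = z * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Exercise 4.1(a): 25 +/- 1.96 x 50/sqrt(100)
lo, hi = mean_ci(25, 50, 100)
print(round(lo, 1), round(hi, 1))     # 15.2 34.8

# Exercise 4.2: 50 +/- 1.96 x 30/sqrt(40)
lo2, hi2 = mean_ci(50, 30, 40)
print(round(lo2, 2), round(hi2, 2))   # about 40.70 and 59.30
```

Changing `z` (1.64 for 90%, 2.58 for 99%) reproduces the other confidence levels used in this chapter.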
Exercise 4.3
(a) The point estimate is 14% (= 7/50). The 95% CI is given by
0.14 ± 1.96 × √(0.14 × (1 − 0.14)/50) = 0.14 ± 0.096.
(b) Use 1.64 instead of 1.96, giving 0.14 ± 0.080.
(c) 0.14 ± 0.126.
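The proportion interval in Exercise 4.3 can be verified the same way (a sketch; `prop_ci` is our name):

```python
import math

def prop_ci(p, n, z=1.96):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

lo, hi = prop_ci(0.14, 50)               # part (a): 95%, z = 1.96
print(round(hi - 0.14, 3))               # about 0.096
lo90, hi90 = prop_ci(0.14, 50, z=1.64)   # part (b): z = 1.64
print(round(hi90 - 0.14, 3))             # about 0.080
```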
Exercise 4.4
X̄₁ − X̄₂ = 25 − 30 = −5 is the point estimate. The interval estimate is given by

    (X̄₁ − X̄₂) ± 1.96 × √(s₁²/n₁ + s₂²/n₂) = −5 ± 1.96 × √(18²/36 + 25²/49)
        = −5 ± 9.14 = [−14.14, 4.14]
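The difference-of-means interval generalises directly; a Python sketch of the formula just used (function name ours):

```python
import math

def diff_means_ci(x1, s1, n1, x2, s2, n2, z=1.96):
    """Large-sample CI for the difference of two independent means."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (x1 - x2) - z * se, (x1 - x2) + z * se

# Exercise 4.4: x1 = 25, s1 = 18, n1 = 36; x2 = 30, s2 = 25, n2 = 49
lo, hi = diff_means_ci(25, 18, 36, 30, 25, 49)
print(round(lo, 2), round(hi, 2))   # about -14.14 and 4.14
```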
Exercise 4.5
The point estimate is 0.4 − 0.2 = 0.2, i.e. 20 percentage points. The interval estimate is

    0.2 ± 1.96 × √(0.4 × 0.6/50 + 0.2 × 0.8/100) = 0.2 ± 0.157 = [0.043, 0.357]
Exercise 4.6
The 99% CI is given by 74 ± t × s/√n = 74 ± 2.947 × 2.75 = 74 ± 8.10 = [65.90, 82.10].
Exercise 4.7
The pooled variance is given by

    S² = ((15 − 1) × 18² + (20 − 1) × 21²)/(15 + 20 − 2) = 391.36

The 95% CI is therefore

    (45 − 55) ± 2.042 × √(391.36/15 + 391.36/20) = −10 ± 13.80 = [−23.80, 3.80]
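The two-step pooled-variance calculation in Exercise 4.7 can be wrapped into one small function (a sketch; the critical t value is passed in from tables rather than computed):

```python
import math

def pooled_t_ci(x1, s1, n1, x2, s2, n2, t_crit):
    """Small-sample CI for a difference of means using the pooled variance
    S^2 = ((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2)."""
    s2_pool = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(s2_pool / n1 + s2_pool / n2)
    diff = x1 - x2
    return s2_pool, (diff - t_crit * se, diff + t_crit * se)

# Exercise 4.7: t = 2.042 with 15 + 20 - 2 = 33 degrees of freedom
s2_pool, (lo, hi) = pooled_t_ci(45, 18, 15, 55, 21, 20, t_crit=2.042)
print(round(s2_pool, 2))            # 391.36
print(round(lo, 1), round(hi, 1))   # about -23.8 and 3.8
```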
Appendix: Derivations of sampling distributions

Derivation of the sampling distribution of p

The sampling distribution of p is fairly straightforward to derive, given what
we have already learned. The sampling distribution of p can be easily derived
from the distribution of r, the number of successes in n trials of an experiment,
since p = r/n. The distribution of r for large n is approximately Normal (from
Chapter 3):

    r ~ N(nP, nP(1 − P))    (4.24)

Knowing the distribution of r, is it possible to find that of p? Since p is
simply r multiplied by a constant, 1/n, it is also Normally distributed. The mean
and variance of the distribution can be derived using the E and V operators. The
expected value of p is

    E(p) = E(r/n) = (1/n)E(r) = (1/n)nP = P = π    (4.25)

The expected value of the sample proportion is equal to the population proportion
(note that the probability P and the population proportion π are the same
thing and may be used interchangeably). The sample proportion therefore gives
an unbiased estimate of the population proportion.
For the variance:

    V(p) = V(r/n) = (1/n²)V(r) = (1/n²)nP(1 − P) = π(1 − π)/n    (4.26)

Hence the distribution of p is given by

    p ~ N(π, π(1 − π)/n)    (4.27)
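Equations 4.25–4.27 can also be checked by simulation. A sketch (the values of π, n and the number of replications are arbitrary choices for the demonstration):

```python
import random

# Simulate the sampling distribution of p = r/n and compare its mean and
# variance with the theory: E(p) = pi and V(p) = pi(1 - pi)/n.
random.seed(1)
pi_true, n, reps = 0.3, 50, 20000
ps = []
for _ in range(reps):
    r = sum(1 for _ in range(n) if random.random() < pi_true)  # successes
    ps.append(r / n)
mean_p = sum(ps) / reps
var_p = sum((p - mean_p) ** 2 for p in ps) / reps
print(round(mean_p, 2))                               # close to pi = 0.3
print(round(var_p, 4), pi_true * (1 - pi_true) / n)   # both about 0.0042
```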
Derivation of the sampling distribution of X̄₁ − X̄₂

This is the difference between two random variables, so is itself a random
variable. Since any linear combination of Normally distributed, independent
random variables is itself Normally distributed, the difference of sample means
follows a Normal distribution. The mean and variance of the distribution can be
found using the E and V operators. Letting

    E(X̄₁) = μ₁,  V(X̄₁) = σ₁²/n₁

and

    E(X̄₂) = μ₂,  V(X̄₂) = σ₂²/n₂

then

    E(X̄₁ − X̄₂) = E(X̄₁) − E(X̄₂) = μ₁ − μ₂    (4.28)
and

    V(X̄₁ − X̄₂) = V(X̄₁) + V(X̄₂) = σ₁²/n₁ + σ₂²/n₂    (4.29)

Equation 4.29 assumes X̄₁ and X̄₂ are independent random variables. The
probability distribution of X̄₁ − X̄₂ can therefore be summarised as

    X̄₁ − X̄₂ ~ N(μ₁ − μ₂, σ₁²/n₁ + σ₂²/n₂)    (4.30)

This is equation 4.12 in the text.
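Equation 4.30 can likewise be verified by simulation. A sketch (all the parameter values are arbitrary choices for the demonstration):

```python
import random
import statistics

# Simulate X1bar - X2bar: it should be centred on mu1 - mu2 with
# variance sigma1^2/n1 + sigma2^2/n2, as in equation 4.30.
random.seed(42)
mu1, sigma1, n1 = 10, 3, 20
mu2, sigma2, n2 = 7, 2, 25
diffs = []
for _ in range(20000):
    x1 = [random.gauss(mu1, sigma1) for _ in range(n1)]
    x2 = [random.gauss(mu2, sigma2) for _ in range(n2)]
    diffs.append(statistics.fmean(x1) - statistics.fmean(x2))
print(round(statistics.fmean(diffs), 1))         # close to mu1 - mu2 = 3
print(round(statistics.pvariance(diffs), 2),     # close to the theoretical
      sigma1**2 / n1 + sigma2**2 / n2)           # value 0.61
```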


Hypothesis testing
5
Contents
Learning outcomes 172
Introduction 173
The concepts of hypothesis testing 173
One-tail and two-tail tests 176
The choice of signiﬁcance level 178
The Prob-value approach 180
Significance, effect size and power 181
Further hypothesis tests 183
Testing a proportion 183
Testing the difference of two means 184
Testing the difference of two proportions 185
Hypothesis tests with small samples 187
Testing the sample mean 187
Testing the difference of two means 188
Are the test procedures valid? 189
Hypothesis tests and conﬁdence intervals 190
Independent and dependent samples 191
Two independent samples 191
Paired samples 192
Discussion of hypothesis testing 194
Summary 195
Key terms and concepts 196
Reference 196
Problems 197
Answers to exercises 201
Learning outcomes

By the end of this chapter you should be able to:
● understand the philosophy and scientific principles underlying hypothesis testing
● appreciate that hypothesis testing is about deciding whether a hypothesis is true
or false on the basis of a sample of data
● recognise the type of evidence which leads to a decision that the hypothesis is false
● carry out hypothesis tests for a variety of statistical problems
● recognise the relationship between hypothesis testing and a confidence interval
● recognise the shortcomings of hypothesis testing.

Complete your diagnostic test for Chapter 5 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL with
additional supporting resources.
STFE_C05.qxd 26/02/2009 09:10 Page 172

Introduction

This chapter deals with issues very similar to those of the previous chapter on
estimation, but examines them in a different way. The estimation of population
parameters and the testing of hypotheses about those parameters are similar
techniques (indeed, they are formally equivalent in a number of respects), but
there are important differences in the interpretation of the results arising from
each method. The process of estimation is appropriate when measurement is
involved, such as measuring the true average expenditure on food; hypothesis
testing is better when decision making is involved, such as whether to accept
that a supplier's products are up to a specified standard. Hypothesis testing is
also used to make decisions about the truth or otherwise of different theories,
such as whether rising prices are caused by rising wages, and it is here that the
issues become contentious. It is sometimes difficult to interpret correctly the
results of hypothesis tests in these circumstances. This is discussed further later
in this chapter.
The concepts of hypothesis testing
In many ways hypothesis testing is analogous to a criminal trial. In a trial there
is a defendant who is initially presumed innocent. The evidence against the
defendant is then presented and, if the jury finds this convincing beyond all
reasonable doubt, he is found guilty: the presumption of innocence is overturned. Of
course, mistakes are sometimes made: an innocent person is convicted or a
guilty person set free. Both of these errors involve costs (not only in the monetary
sense), either to the defendant or to society in general, and the errors
should be avoided if at all possible. The laws under which the trial is held may
help avoid such errors. The rule that the jury must be convinced 'beyond all
reasonable doubt' helps to avoid convicting the innocent, for instance.
The situation in hypothesis testing is similar. First there is a maintained or
null hypothesis, which is initially presumed to be true. The empirical evidence,
usually data from a random sample, is then gathered and assessed. If the
evidence seems inconsistent with the null hypothesis, i.e. it has a low probability
of occurring if the hypothesis were true, then the null hypothesis is rejected in
favour of an alternative. Once again there are two types of error one can make:
either rejecting the null hypothesis when it is really true, or not rejecting it
when in fact it is false. Ideally one would like to avoid both types of error.
An example helps to clarify the issues and the analogy. Suppose that you
are thinking of taking over a small business franchise. The current owner claims
the weekly turnover of each existing franchise is £5000 and, at this level, you are
willing to take on a franchise. You would be more cautious if the turnover is less
than this figure. You examine the books of 26 franchises chosen at random and
find that the average turnover was £4900 with standard deviation £280. What
do you do?
The null hypothesis in this case is that average weekly turnover is £5000 (or
more; that would be even more to your advantage). The alternative hypothesis
is that turnover is strictly less than £5000 per week. We may write these more
succinctly as follows:

    H₀: μ ≥ 5000
    H₁: μ < 5000

H₀ is conventionally used to denote the null hypothesis, H₁ the alternative.
Initially H₀ is presumed to be true and this presumption will be tested using
the sample evidence. Note that the sample evidence is not used as part of the
hypothesis.
You have to decide whether the owner's claim is correct (H₀) or not (H₁). The
two types of error you could make are as follows:
● Type I error – reject H₀ when it is in fact true. This would mean missing a
good business opportunity.
● Type II error – not rejecting H₀ when it is in fact false. You would go ahead
and buy the business and then find out that it is not as attractive as claimed.
You would have overpaid for the business.
The situation is set out in Figure 5.1.
Obviously a good decision rule would give a good chance of making a correct
decision and rule out errors as far as possible. Unfortunately, it is impossible
completely to eliminate the possibility of errors. As the decision rule is changed
to reduce the probability of a Type I error, the probability of making a Type II
error inevitably increases. The skill comes in balancing these two types of error.
Again a diagram is useful in illustrating this. Assuming that the null hypothesis
is true, then the sample observations are drawn from a population with
mean 5000 and some variance, which we shall assume is accurately measured by
the sample variance. The distribution of X̄ is then given by

    X̄ ~ N(μ, σ²/n)    (5.1)

i.e.

    X̄ ~ N(5000, 280²/26)

Under the alternative hypothesis the distribution of X̄ would be the same,
except that it would be centred on a value less than 5000. These two situations
are illustrated in Figure 5.2. The distribution of X̄ under H₁ is shown by a dashed
curve to signify that its exact position is unknown; only that it lies to the left of
the distribution under H₀.
A decision rule amounts to choosing a point, or dividing line, on the horizontal
axis in Figure 5.2. If the sample mean lies to the left of this point then H₀ is
rejected (the sample mean is too far away from H₀ for it to be credible) in favour
of H₁, and you do not buy the firm. If X̄ lies above this decision point then H₀
is not rejected and you go ahead with the purchase. Such a decision point is

Figure 5.1: The two different types of error
shown in Figure 5.2, denoted by X_D. To the left of X_D lies the rejection region
for H₀; to the right lies the non-rejection region.
Based on this point we can see the probabilities of Type I and Type II errors.
The area under the H₀ distribution to the left of X_D, labelled I, shows the
probability of rejecting H₀ given that it is in fact true: a Type I error. The area under
the H₁ distribution to the right of X_D, labelled II, shows the probability of a
Type II error: not rejecting H₀ when it is in fact false (and H₁ is true).
Shifting the decision line to the right or left alters the balance of these
probabilities. Moving the line to the right increases the probability of a Type I error
but reduces the probability of a Type II error. Moving the line to the left has the
opposite effect.
The Type I error probability can be calculated for any value of X_D. Suppose
we set X_D to a value of 4950. Using the distribution of X̄ given in equation 5.1
above, the area under the distribution to the left of 4950 is obtained using the
z score:

    z = (X_D − μ)/(s/√n) = (4950 − 5000)/(280/√26) = −0.91    (5.2)

From the tables of the standard Normal distribution we find that the
probability of a Type I error is 18.1%. Unfortunately, the Type II error probability
cannot be established because the exact position of the distribution under H₁
is unknown. Therefore we cannot decide on the appropriate position of X_D by
some balance of the two error probabilities.
The convention therefore is to set the position of X_D by using a Type I error
probability of 5%, known as the significance level¹ of the test. In other words,
we are prepared to accept a 5% probability of rejecting H₀ when it is in fact
true. This allows us to establish the position of X_D. From Table A2 (see page 414)
we find that z = −1.64 cuts off the bottom 5% of the distribution, so the decision
line should be 1.64 standard errors below 5000. The value −1.64 is known as the
critical value of the test. We therefore obtain

    X_D = 5000 − 1.64 × 280/√26 = 4910    (5.3)
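The numbers in equations 5.2 and 5.3 can be reproduced using the standard Normal cdf, which Python's standard library gives via `math.erf`. A sketch:

```python
import math

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

se = 280 / math.sqrt(26)          # standard error of the sample mean

# Equation 5.2: Type I error probability with the decision line at 4950
z = (4950 - 5000) / se
print(round(z, 2), round(phi(z), 3))   # -0.91 and about 0.181

# Equation 5.3: decision line giving a 5% Type I error probability
x_d = 5000 - 1.64 * se
print(round(x_d))                      # 4910
```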
Figure 5.2: The sampling distributions of X̄ under H₀ and H₁
¹ The term size of the test is also used (not to be confused with the sample size). We use
the term 'significance level' in this text.
Since the sample mean of 4900 lies below 4910, we reject H₀ at the 5%
significance level or, equivalently, we reject with 95% confidence. The significance
level is generally denoted by the symbol α and the complement of this, given
by 1 − α, is known as the confidence level (as used in the confidence interval).
An equivalent procedure would be to calculate the z score associated with the
sample mean, known as the test statistic, and then compare this to the critical
value of the test. This allows the hypothesis testing procedure to be broken
down into five neat steps.
1 Write down the null and alternative hypotheses:

    H₀: μ ≥ 5000
    H₁: μ < 5000

2 Choose the significance level of the test, conventionally α = 0.05 or 5%.
3 Look up the critical value of the test from statistical tables, based on the
chosen significance level. z* = 1.64 is the critical value in this case.
4 Calculate the test statistic:

    z = (X̄ − μ)/(s/√n) = (4900 − 5000)/(280/√26) = −1.82    (5.4)

5 Decision rule. Compare the test statistic with the critical value: if z < −z*,
reject H₀ in favour of H₁. Since −1.82 < −1.64, H₀ is rejected with 95%
confidence. Note that we use −z* here rather than +z* because we are
dealing with the left-hand tail of the distribution.
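The five steps can be collapsed into a short function (a sketch; the function name is ours, and the critical value is passed in rather than looked up from tables):

```python
import math

def one_tail_z_test(xbar, mu0, s, n, z_crit=1.64, tail="left"):
    """One-tail z test about a mean. Returns the test statistic and
    whether H0 is rejected at the level implied by z_crit."""
    z = (xbar - mu0) / (s / math.sqrt(n))
    reject = z < -z_crit if tail == "left" else z > z_crit
    return z, reject

# The franchise example: xbar = 4900, s = 280, n = 26, H1: mu < 5000
z, reject = one_tail_z_test(4900, 5000, 280, 26)
print(round(z, 2), reject)   # -1.82 True
```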
Worked example 5.1

A sample of 100 workers found the average overtime hours worked in the
previous week was 7.8, with standard deviation 4.1 hours. Test the
hypothesis that the average for all workers is 5 hours or less.
We can set out the five steps of the answer as follows:
1 H₀: μ ≤ 5
  H₁: μ > 5
2 Significance level: α = 5%.
3 Critical value: z* = 1.64.
4 Test statistic:

    z = (X̄ − μ)/(s/√n) = (7.8 − 5)/(4.1/√100) = 6.8

5 Decision rule: 6.8 > 1.64, so we reject H₀ in favour of H₁. Note that in this
case we are dealing with the right-hand tail of the distribution (positive
values of z and z*). Only high values of X̄ reject H₀.
One-tail and two-tail tests

In the above example the rejection region for the test consisted of one tail of the
distribution of X̄, since the buyer was only concerned about turnover being less

than claimed. For this reason it is known as a one-tail test. Suppose now that
an accountant is engaged to sell the franchise and wants to check the claim
about turnover before advertising the business for sale. In this case she would
be concerned about turnover being either below or above £5000.
This would now become a two-tail test, with the null and alternative
hypotheses being

    H₀: μ = 5000
    H₁: μ ≠ 5000

Now there are two rejection regions for the test. Either a very low sample mean
or a very high one will serve to reject the null hypothesis. The situation is
presented graphically in Figure 5.3.
The distribution of X̄ under H₀ is the same as before, but under the alternative
hypothesis the distribution could be shifted either to the left or to the right, as
depicted. If the significance level is still chosen to be 5%, then the complete
rejection region consists of the two extremes of the distribution under H₀,
containing 2.5% in each tail (hence 5% in total). This gives a Type I error
probability of 5%, as before.
The critical value of the test therefore becomes z* = 1.96, the value which
cuts off 2.5% in each tail of the standard Normal distribution. Only if the test
statistic falls into one of the rejection regions, beyond 1.96 standard errors from
the mean, is H₀ rejected.
Using data from the previous example, the test statistic remains z = −1.82, so
the null hypothesis cannot be rejected in this case, as −1.82 does not fall
beyond −1.96. To recap, the five steps of the test are:
1 H₀: μ = 5000
  H₁: μ ≠ 5000
2 Choose the significance level: α = 0.05.
3 Look up the critical value: z* = 1.96.
4 Evaluate the test statistic:

    z = (4900 − 5000)/(280/√26) = −1.82

5 Compare test statistic and critical values: if z < −z* or z > z*, reject H₀ in
favour of H₁. In this case −1.82 > −1.96, so H₀ cannot be rejected with 95%
confidence.
Figure 5.3: A two-tail hypothesis test
One- and two-tail tests therefore differ only at steps 1 and 3. Note that we
have come to different conclusions according to whether a one- or two-tail test
was used, with the same sample evidence. There is nothing wrong with this,
however, for there are different interpretations of the two results. If the investor
always uses his rule, he will miss out on 5% of good investment opportunities,
when sales are by chance low. He will never miss out on a good opportunity
because the investment appears too good (i.e. sales by chance are very high). For
the accountant, 5% of the firms with sales averaging £5000 will not be advertised
as such, either because sales appear too low or because they appear too high.
It is tempting on occasion to use a one-tail test because of the sample
evidence. For example, the accountant might look at the sample evidence above
and decide that the franchise operation can only have true sales less than or
equal to £5000. Therefore a one-tail test is used. This is a dangerous practice, since
the sample evidence is being used to help formulate the hypothesis, which is
then tested on that same evidence. This is going round in circles; the hypothesis
should be chosen independently of the evidence, which is then used to test
it. Presumably the accountant would also use a one-tail test, with H₁: μ > 5000 as
the alternative hypothesis, if it was noticed that the sample mean were above the
hypothesised value. In effect, therefore, the 10% significance level would be used,
not the 5% level, since there would be 5% in each tail of the distribution. A Type I
error would be made on 10% of all occasions rather than 5%.
It is acceptable to use a one-tail test when you have independent information
about what the alternative hypothesis should be, or you are not concerned
about one side of the distribution (such as the investor) and can effectively add
that into the null hypothesis. Otherwise it is safer to use a two-tail test.
Exercise 5.1
(a) Two political parties are debating crime figures. One party says that crime has
increased compared to the previous year. The other party says it has not. Write
down the null and alternative hypotheses.
(b) Explain the two types of error that could be made in this example and the possible
costs of each type of error.

Exercise 5.2
(a) We test the hypothesis H₀: μ = 100 against H₁: μ > 100 by rejecting H₀ if our sample
mean is greater than 108. If in fact X̄ ~ N(100, 900/25), what is the probability of
making a Type I error?
(b) If we wanted a 5% Type I error probability, what decision rule should we adopt?
(c) If we knew that μ could only take on the values 100 (under H₀) or 112 (under H₁),
what would be the Type II error probability using the decision rule in part (a)?

Exercise 5.3
Test the hypothesis H₀: μ = 500 versus H₁: μ ≠ 500, using the evidence X̄ = 530, s = 90
from a sample of size n = 30.

The choice of significance level

We justified the choice of the 5% significance level by reference to convention.
This is usually a poor argument for anything, but it does have some justification.
In an ideal world we would have precisely specified null and alternative
hypotheses (e.g. we would test H₀: μ = 5000 against H₁: μ = 4500), these being the

only possibilities. Then we could calculate the probabilities of both Type I and
Type II errors for any given decision rule. We could then choose the optimal
decision rule, which gives the best compromise between the two types of error.
This is reflected in a court of law. In criminal cases the jury must be convinced
of the prosecution's case beyond reasonable doubt, because of the cost of
committing a Type I error. In a civil case (libel, for example) the jury need only be
convinced on the balance of probabilities. In a civil case the costs of Type I and
Type II error are more evenly balanced and so the burden of proof is lessened.
However, in practice we usually do not have the luxury of two well-specified
hypotheses. As in the example, the null hypothesis is precisely specified (it
has to be, or the test could not be carried out) but the alternative hypothesis
is imprecise (sometimes called a composite hypothesis because it encompasses
a range of values). Statistical inference is often used not so much as an aid to
decision making but to provide evidence for or against a particular theory, to
alter one's degree of belief in the truth of the theory. For example, an economic
theory might assert that rising prices are caused by rising wages (the cost–push
theory of inflation). The null and alternative hypotheses would be:

H₀: there is no connection between rising wages and rising prices
H₁: there is some connection between rising wages and rising prices.
Note that the null has 'no connection', since this is a precise statement.
'Some connection' is too vague to be the null hypothesis. Data could be
gathered to test this hypothesis (the appropriate methods will be discussed in
the chapters on correlation and regression). But what decision rests upon the
result of this test? It could be thought that government might make a decision
to impose a prices and incomes policy, but if every academic study of inflation led
to the imposition or abandonment of a prices and incomes policy, there would
have been an awful lot of policies! (In fact there were a lot of such policies, but
not as many as the number of studies of inflation.) No single study is decisive
('more research is needed' is a very common phrase) but each does influence
the climate of opinion, which may eventually lead to a policy decision. But if
a hypothesis test is designed to influence opinion, how is the significance level
to be chosen?
It is difficult to trade off the costs of Type I and Type II errors and the
probability of making those errors. A Type I error in this case means concluding that
rising wages do cause rising prices when in fact they do not. So what would be
the cost of this error, i.e. imposing a prices and incomes policy when in fact it
is not needed? It is extremely difficult, if not impossible, to put a figure on it. It
would depend on what type of prices and incomes policy were imposed: would
wages be frozen or allowed to rise with productivity? How fast would prices be
allowed to rise? Would company dividends be frozen? The costs of the Type II
error would also be problematic (not imposing a needed prices and incomes
policy), for they would depend, among other things, on what alternative policies
might be adopted.
The 5% significance level really does depend upon convention, therefore; it
cannot be justified by reference to the relative costs of Type I and Type II errors
(it is too much to believe that everyone does consider these costs and
independently arrives at the conclusion that 5% is the appropriate significance level).
However, the 5% convention does impose some sort of discipline upon research;

it sets some kind of standard which all theories (hypotheses) should be
measured against. Beware the researcher who reports that a particular hypothesis is
rejected at the 8% significance level; it is likely that the significance level was
chosen so that the hypothesis could be rejected, which is what the researcher
was hoping for in the first place!
The Prob-value approach

Suppose a result is significant at the 4.95% level, i.e. it just meets the 5%
convention, and the null hypothesis is rejected. A very slight change in the sample
data could have meant the result being significant at only the 5.05% level,
and the null hypothesis not being rejected. Would we really be happy to alter
our belief completely on such fragile results? Most researchers (but not all)
would be cautious if their results were only just significant, or fell just short of
significance.
This suggests an alternative approach: the significance level of the test statistic
could be reported and the reader could make his own judgements about it. This
is known as the Prob-value approach, the Prob-value being the significance
level of the calculated test statistic. For example, the calculated test statistic for
the investor problem was z = −1.82 and the associated Prob-value is obtained
from Table A2 (see page 414) as 3.44%, i.e. −1.82 cuts off 3.44% in one tail of
the standard Normal distribution. This means that the null hypothesis can be
rejected at the 3.44% significance level or, alternatively expressed, with 96.56%
confidence.
Notice that Table A2 gives the Prob-value for a one-tail test; for a two-tail test
the Prob-value should be doubled. Thus for the accountant, using the two-tail
test, the significance level is 6.88% and this is the level at which the null
hypothesis can be rejected. Alternatively, we could say we reject the null with 93.12%
confidence. This does not meet the standard 5% criterion for the significance
level which is most often used, so would result in non-rejection of the null.
An advantage of using the Prob-value approach is that many statistical
software programs routinely provide the Prob-value of a calculated test statistic.
If one understands the use of Prob-values then one does not have to look up
tables (this applies to any distribution, not just the Normal), which can save a
lot of time.
To summarise, one rejects the null hypothesis if either:
● Method 1 – the test statistic is greater than the critical value, i.e. z > z*, or
● Method 2 – the Prob-value associated with the test statistic is less than the
significance level, i.e. P < 0.05, if the 5% significance level is used.
I have found that many students initially find this confusing, because of the
opposing inequality in the two versions (greater than and less than). For example,
a program might calculate a hypothesis test and report the result as 'z = 1.4,
P value = 0.162'. The first point to note is that most software programs report
the Prob-value for a two-tail test by default. Hence, assuming a 5% significance
level, in this case we cannot reject H₀ because z = 1.4 < 1.96 or, equivalently,
because 0.162 > 0.05, against a two-tailed alternative (i.e. H₁ contains ≠).
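The Prob-values quoted above are easily computed from the Normal cdf. A sketch (the function names are ours):

```python
import math

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_value(z, tails=2):
    """Prob-value (p-value) of a z statistic, one- or two-tailed."""
    return tails * (1 - phi(abs(z)))

# The investor's test statistic, z = -1.82
print(round(prob_value(-1.82, tails=1), 4))   # about 0.0344 (one tail)
print(round(prob_value(-1.82, tails=2), 4))   # about 0.0688 (two tails)
```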
If you wish to conduct a one-tailed test, you have to halve the reported
Prob-value, becoming 0.081 in this example. This is again greater than 5%, so the
hypothesis is still accepted, even against a one-sided alternative (H₁ contains
> or <). Equivalently, one could compare 1.4 with the one-tail critical value,
1.64, showing non-rejection of the null, but one has to look up the standard Normal
table with this method. Computers cannot guess whether a one- or two-sided
test is wanted, so take the conservative option and report the two-sided value.
The correction for a one-sided test has to be done manually.
Significance, effect size and power

Researchers usually look for 'significant' results. Academic papers report that
'the results are significant' or that 'the coefficient is significantly different from
zero at the 5% significance level'. It is vital to realise that the word 'significant' is
used here in the statistical sense, and not in its everyday sense of being important.
Something can be statistically significant yet still unimportant.
Suppose that we have some more data about the business examined earlier.
Data for 100 franchises have been uncovered, revealing an average weekly
turnover of £4975 with standard deviation £143. Can we reject the hypothesis
that the average weekly turnover is £5000? The test statistic is

    z = (4975 − 5000)/(143/√100) = −1.75

Since this is less than −z* = −1.64, the null is rejected with 95% confidence.
True average weekly turnover is less than £5000. However, the difference is
only £25 per week, which is 0.5% of £5000. Common sense would suggest that
the difference may be unimportant, even if it is significant in the statistical
sense. One should not interpret statistical results in terms of significance alone,
therefore; one should also look at the size of the difference (sometimes known
as the effect size) and ask whether it is important or not. This is a mistake made
by even experienced researchers: a review of articles in the prestigious American
Economic Review reported that 82% of them confused statistical significance for
economic significance in some way (McCloskey and Ziliak, 2004).
This problem with hypothesis testing paradoxically grows worse as the sample
size increases. For example, if 250 observations reveal average sales of 4985 with
standard deviation 143, the null would just be rejected at 5% significance.
In fact, given a large enough sample size we can virtually guarantee to reject the
null hypothesis even before we have gathered the data. This can be seen from
equation 5.4 for the z score test statistic: as n grows larger, the test statistic also
inevitably increases.
A good way to remember this point is to appreciate that it is the evidence
which is significant, not the size of the effect. Strictly, it is better to say '. . . there
is significant evidence of difference between . . .' than '. . . there is a significant
difference between . . .'.
A related way of considering the effect of increasing sample size is via the
concept of the power of a test. This is deﬁned as
Power of a test = 1 − Pr(Type II error) = 1 − β (5.5)
where β is the symbol conventionally used to indicate the probability of a Type II error. As a Type II error is defined as not rejecting H0 when false (equivalent to rejecting H1 when true), power is the probability of rejecting H0 when false (if H0 is false it must be either accepted or rejected, hence these probabilities sum to one). This is one of the correct decisions identified earlier, associated with the lower right-hand box in Figure 5.1: that of correctly rejecting a false null hypothesis. The power of a test is therefore given by the area under the H1 distribution to the left of the decision line, as illustrated (shaded) in Figure 5.4 for a one-tail test.
It is generally desirable to maximise the power of a test, as long as the probability of a Type I error is not raised in the process. There are essentially three ways of doing this.
● Avoid situations where the null and alternative hypotheses are very similar, i.e. the hypothesised means are not far apart (a small effect size).
● Use a large sample size. This reduces the sampling variance of X̄ under both H0 and H1, so the two distributions become more distinct.
● Use good sampling methods which have small sampling variances. This has a similar effect to increasing the sample size.
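The effect of sample size on power can be sketched numerically. The function below assumes an upper one-tail z test with known σ; the particular figures (a true mean of 5100, σ = 500) are hypothetical, chosen only to show power rising with n:

```python
from math import sqrt
from statistics import NormalDist

def power_one_tail(mu0, mu1, sigma, n, alpha=0.05):
    """Power = Pr(reject H0: mu = mu0 | the true mean is mu1 > mu0)."""
    se = sigma / sqrt(n)
    decision_line = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Area under the H1 distribution beyond the decision line
    return 1 - NormalDist(mu1, se).cdf(decision_line)

print(power_one_tail(5000, 5100, 500, 25))    # modest power
print(power_one_tail(5000, 5100, 500, 100))   # larger n, higher power
```

Increasing n shrinks the standard error, pulling the H0 and H1 distributions apart and raising the power, exactly as the second bullet point above describes.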
Unfortunately, in economics and business the data are very often given in advance and there is little or no control possible over the sampling procedures. This leads to a neglect of consideration of power, unlike in psychology, for example, where the experiment can often be designed by the researcher. The gathering of sample data will be covered in detail in Chapter 9.
Exercise 5.4
If a researcher believes the cost of making a Type I error is much greater than the cost of a Type II error, should they choose a 5% or 1% significance level? Explain why.
Exercise 5.5
(a) A researcher uses Excel to analyse data and test a hypothesis. The program reports a test statistic of z = 1.77 (P value = 0.077). Would you reject the null hypothesis if carrying out (i) a one-tailed test, (ii) a two-tailed test? Use the 5% significance level.
(b) Repeat part (a) using a 1% significance level.
[Figure 5.4: The power of a test]
Further hypothesis tests
We now proceed to consider a number of different types of hypothesis test, all involving the same principles but differing in details of their implementation. This is similar to the exposition in the last chapter, covering in turn tests of a proportion, tests of the difference of two means and proportions, and finally problems involving small sample sizes.
Testing a proportion
A car manufacturer claims that no more than 10% of its cars should need repairs in the first three years of their life (the warranty period). A random sample of 50 three-year-old cars found that 8 had required attention. Does this contradict the maker’s claim?
This problem can be handled in a very similar way to the methods used for a
mean. The key once again is to recognise the sample proportion as a random
variable with an associated probability distribution. From Chapter 4 (equation 4.9), the sampling distribution of the sample proportion in large samples is given by

p ~ N(π, π(1 − π)/n) (5.6)

In this case π = 0.10 under the null hypothesis (the maker’s claim). The sample data are

p = 8/50 = 0.16
n = 50
Thus 16% of the sample required attention within the warranty period. This is substantially higher than the claimed 10%, but is this just because of a bad sample, or does it reflect the reality that the cars are badly built? The hypothesis test is set out along the same lines as for a sample mean.
1 H0: π = 0.10
H1: π > 0.10
The only concern is the manufacturer not matching its claim.
2 Significance level: α = 0.05.
3 The critical value of the one-tail test at the 5% significance level is z* = 1.64, obtained from the standard Normal table.
4 The test statistic is

z = (0.16 − 0.10)/√(0.10 × 0.90/50) = 1.41

5 Since the test statistic is less than the critical value, it falls into the non-rejection region. The null hypothesis is not rejected by the data. The manufacturer’s claim is not unreasonable.
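The calculation in step 4 can be reproduced directly; a quick Python check using equation 5.6:

```python
from math import sqrt

def z_prop(p, pi0, n):
    """z statistic for a single proportion: (p - pi0) / sqrt(pi0(1 - pi0)/n)."""
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

z = z_prop(8 / 50, 0.10, 50)
print(round(z, 2))  # 1.41, below the one-tail 5% critical value of 1.64
```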
STATISTICS IN PRACTICE
Note that for this problem the rejection region lies in the upper tail of the distribution, because of the ‘greater than’ inequality in the alternative hypothesis. The null hypothesis is therefore rejected in this case if z > z*.
Do children prefer branded goods only because of the name?
Researchers at Johns Hopkins Bloomberg School of Public Health in Maryland found young children were influenced by the packaging of foods. 63 children were offered two identical meals, save that one was still in its original packaging from McDonald’s. 76% of the children preferred the branded French fries.
Is this evidence significant? The null hypothesis is H0: π = 0.5 versus H1: π > 0.5. The test statistic for this hypothesis test is

z = (0.76 − 0.50)/√(0.50 × 0.50/63) = 4.12

which is greater than the critical value of z* = 1.64. Hence we conclude that children are influenced by the packaging or brand name.
Source: New Scientist 11 August 2007.
Testing the difference of two means
Suppose a car company wishes to compare the performance of its two factories producing an identical model of car. The factories are equipped with the same machinery, but their outputs might differ due to managerial ability, labour relations, etc. Senior management wishes to know if there is any difference between the two factories. Output is monitored for 30 days, chosen at random, with the following results:
Factory 1 Factory 2
Average daily output 420 408
Standard deviation of daily output 25 20
Does this produce sufficient evidence of a real difference between the factories, or does the difference between the samples simply reflect random differences such as minor breakdowns of machinery? The information at our disposal may be summarised as

X̄1 = 420  X̄2 = 408
s1 = 25  s2 = 20
n1 = 30  n2 = 30
The hypothesis test to be conducted concerns the difference between the factories’ outputs, so the appropriate random variable to examine is X̄1 − X̄2. From Chapter 4 (equation 4.12), this has the following distribution in large samples:

X̄1 − X̄2 ~ N(μ1 − μ2, σ1²/n1 + σ2²/n2) (5.7)
The population variances σ1² and σ2² may be replaced by their sample estimates s1² and s2² if the former are unknown, as here. The hypothesis test is therefore as follows.
1 H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0
The null hypothesis posits no real difference between the factories. This is a two-tail test, since there is no a priori reason to believe one factory is better than the other, apart from the sample evidence.
2 Significance level: α = 1%. This is chosen since the management does not want to interfere unless it is really confident of some difference between the factories. In order to favour the null hypothesis, a lower significance level than the conventional 5% is set.
3 The critical value of the test is z* = 2.57. This cuts off 0.5% in each tail of the standard Normal distribution.
4 The test statistic is

z = ((420 − 408) − 0)/√(25²/30 + 20²/30) = 2.05

Note that this is of the same form as in the single-sample cases. The hypothesised value of the difference (zero in this case) is subtracted from the sample difference, and this is divided by the standard error of the random variable.
5 Decision rule: z < z*, so the test statistic falls into the non-rejection region. There does not appear to be a significant difference between the two factories.
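The steps above can be sketched in a few lines of Python, with the figures from the example:

```python
from math import sqrt

def z_two_means(x1, x2, s1, s2, n1, n2):
    """z statistic for the difference of two means (large samples,
    equation 5.7), with sample variances replacing the population ones."""
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

z = z_two_means(420, 408, 25, 20, 30, 30)
print(round(z, 2))  # 2.05, within the +/-2.57 bounds at the 1% level
```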
A number of remarks about this example should be made. First, it should be noted that it is not necessary for the two sample sizes to be equal (although they are in the example). For example, 45 days’ output from factory 1 and 35 days’ from factory 2 could have been sampled. Second, the values of s1² and s2² do not have to be equal. They are respectively estimates of σ1² and σ2² and, although the null hypothesis asserts that μ1 = μ2, it does not assert that the variances are equal. Management wants to know if the average levels of output are the same; it is not concerned about daily fluctuations in output. A test of the hypothesis of equal variances is set out in Chapter 6.
The final point to consider is whether all the necessary conditions for the correct application of this test have been met. The example noted that the 30 days were chosen at random. If the 30 days sampled were consecutive, we might doubt whether the observations were truly independent. Low output on one day (e.g. due to a mechanical breakdown) might influence the following day’s output (e.g. if a special effort were made to catch up on lost production).
Testing the difference of two proportions
The general method should by now be familiar so we will proceed by example
for this case. Suppose that in a comparison of two holiday companies’ customers
of the 75 who went with Happy Days Tours, 45 said they were satisfied, while 48 of the 90 who went with Fly by Night Holidays were satisfied. Is there a significant difference between the companies?
This problem can be handled by a hypothesis test on the difference of two sample proportions. The procedure is as follows. The sample evidence is

p1 = 45/75 = 0.60  n1 = 75
p2 = 48/90 = 0.533  n2 = 90
The hypothesis test is carried out as follows:
1 H0: π1 − π2 = 0
H1: π1 − π2 ≠ 0
2 Significance level: α = 5%.
3 Critical value: z* = 1.96.
4 Test statistic: The distribution of p1 − p2 is

p1 − p2 ~ N(π1 − π2, π1(1 − π1)/n1 + π2(1 − π2)/n2)

so the test statistic is

z = ((p1 − p2) − (π1 − π2))/√(π1(1 − π1)/n1 + π2(1 − π2)/n2) (5.8)
However, π1 and π2 in the denominator of equation (5.8) have to be replaced by estimates from the samples. They cannot simply be replaced by p1 and p2 because these are unequal; to do so would contradict the null hypothesis that they are equal. Since the null hypothesis is assumed to be true (for the moment), it doesn’t make sense to use a test statistic which explicitly supposes the null hypothesis to be false. Therefore π1 and π2 are replaced by an estimate of their common value, which is denoted p̂ and whose formula is

p̂ = (n1p1 + n2p2)/(n1 + n2) (5.9)

i.e. a weighted average of the two sample proportions. This yields

p̂ = (75 × 0.6 + 90 × 0.533)/(75 + 90) = 0.564

This, in fact, is just the proportion of all customers who were satisfied, 93 out of 165. The test statistic therefore becomes

z = ((0.6 − 0.533) − 0)/√(0.564 × (1 − 0.564)/75 + 0.564 × (1 − 0.564)/90) = 0.86
5 The test statistic is less than the critical value, so the null hypothesis cannot be rejected with 95% confidence. There is not sufficient evidence to demonstrate a difference between the two companies’ performance.
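The pooled calculation is easy to get wrong by hand, so a short check may help; a Python sketch of equations 5.8 and 5.9:

```python
from math import sqrt

def z_two_props(x1, n1, x2, n2):
    """z statistic for the difference of two proportions, using the
    pooled estimate of their common value under H0 (equation 5.9)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # 93/165 = 0.564 in the example
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = z_two_props(45, 75, 48, 90)
print(round(z, 2))  # 0.86, well inside the +/-1.96 bounds
```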
STATISTICS IN PRACTICE
Are women better at multi-tasking?
The conventional wisdom is ‘yes’. However, the concept of multi-tasking originated in computing, and in that domain it appears men are more likely to multi-task. Oxford Internet Surveys (http://www.oii.ox.ac.uk/microsites/oxis/) asked a sample of 1578 people if they multi-tasked while on-line (e.g. listening to music, using the phone). 69% of men said they did, compared to 57% of women. Is this difference statistically significant?
The published survey does not give precise numbers of men and women respondents for this question, so we will assume equal numbers (the answer is not very sensitive to this assumption). We therefore have the test statistic

z = ((0.69 − 0.57) − 0)/√(0.63 × (1 − 0.63)/789 + 0.63 × (1 − 0.63)/789) = 4.94

where 0.63 is the overall proportion of multi-taskers. The evidence is significant and clearly suggests this is a genuine difference: men are the multi-taskers.
Exercise 5.6
A survey of 80 voters finds that 65 are in favour of a particular policy. Test the hypothesis that the true proportion is 50%, against the alternative that a majority is in favour.
Exercise 5.7
A survey of 50 teenage girls found that on average they spent 3.6 hours per week chatting with friends over the internet. The standard deviation was 1.2 hours. A similar survey of 90 teenage boys found an average of 3.9 hours, with standard deviation 2.1 hours. Test if there is any difference between boys’ and girls’ behaviour.
Exercise 5.8
One gambler on horse racing won on 23 of his 75 bets. Another won on 34 out of 95. Is the second person a better judge of horses, or just luckier?
Hypothesis tests with small samples
As with estimation, slightly different methods have to be employed when the sample size is small (n < 25) and the population variance is unknown. When both of these conditions are satisfied, the t distribution must be used rather than the Normal, so a t test is conducted rather than a z test. This means consulting tables of the t distribution to obtain the critical value of a test, but otherwise the methods are similar. These methods will be applied to hypotheses about sample means only, since they are inappropriate for tests of a sample proportion (as was the case in estimation).
Testing the sample mean
A large chain of supermarkets sells 5000 packets of cereal in each of its stores
each month. It decides to test-market a different brand of cereal in 15 of its
stores. After a month the 15 stores have sold an average of 5200 packets each
with a standard deviation of 500 packets. Should all supermarkets switch to selling the new brand?
The sample information is

X̄ = 5200  s = 500  n = 15

From Chapter 4, the distribution of the sample mean from a small sample, when the population variance is unknown, is based upon

(X̄ − μ)/(s/√n) ~ tν (5.10)

with ν = n − 1 degrees of freedom. The hypothesis test is based on this formula and is conducted as follows:
1 H0: μ = 5000
H1: μ > 5000
Only an improvement in sales is relevant.
2 Significance level: α = 1%, chosen because the cost of changing brands is high.
3 The critical value of the t distribution for a one-tail test at the 1% significance level, with ν = n − 1 = 14 degrees of freedom, is t* = 2.62.
4 The test statistic is

t = (5200 − 5000)/(500/√15) = 1.55

5 The null hypothesis is not rejected, since the test statistic (1.55) is less than the critical value (2.62). It would probably be unwise to switch over to the new brand of cereals.
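The same statistic in code (a sketch; only the critical value needs a t table, the statistic itself is equation 5.10):

```python
from math import sqrt

def t_stat(xbar, mu0, s, n):
    """t statistic for a small-sample test of a mean; df = n - 1."""
    return (xbar - mu0) / (s / sqrt(n))

t = t_stat(5200, 5000, 500, 15)
print(round(t, 2))  # 1.55, below the one-tail 1% critical value t*(14) = 2.62
```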
Testing the difference of two means
A survey of 20 British companies found an average annual expenditure on research and development of £3.7m, with a standard deviation of £0.6m. A survey of 15 similar German companies found an average expenditure on research and development of £4.2m, with standard deviation £0.9m. Does this evidence lend support to the view, often expressed, that Britain does not invest enough in research and development?
This is a hypothesis about the difference of two means, based on small sample sizes. The test statistic is again based on the t distribution, i.e.

t = ((X̄1 − X̄2) − (μ1 − μ2))/√(S²/n1 + S²/n2) ~ tν (5.11)

where S² is the pooled variance (as given in equation 4.23) and the degrees of freedom are given by ν = n1 + n2 − 2.
The hypothesis test procedure is as follows:
1 H0: μ1 − μ2 = 0
H1: μ1 − μ2 < 0
2 Significance level: α = 5%.
3 The critical value of the t distribution at the 5% significance level, for a one-tail test with ν = n1 + n2 − 2 = 33 degrees of freedom, is approximately t* = 1.70.
4 The test statistic is based on equation (5.11):

t = ((3.7 − 4.2) − 0)/√(0.55/20 + 0.55/15) = −1.97

where S² is the pooled variance, calculated by

S² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2) = (19 × 0.6² + 14 × 0.9²)/33 = 0.55

5 The test statistic falls in the rejection region (t < −t*), so the null hypothesis is rejected. The data do support the view that Britain spends less on R&D than Germany.
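The pooled-variance calculation and the test statistic can be checked together (Python, figures from the example):

```python
from math import sqrt

def pooled_t(x1, s1, n1, x2, s2, n2):
    """Pooled-variance t statistic (equation 5.11); df = n1 + n2 - 2.
    Returns the statistic and the pooled variance S^2."""
    s2_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (x1 - x2) / sqrt(s2_pooled / n1 + s2_pooled / n2)
    return t, s2_pooled

t, s2p = pooled_t(3.7, 0.6, 20, 4.2, 0.9, 15)
print(round(s2p, 2), round(t, 2))  # 0.55 -1.97
```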
Exercise 5.9
It is asserted that parents spend on average £540 per annum on toys for each child. A survey of 24 parents finds expenditure of £490, with standard deviation £150. Does this evidence contradict the assertion?
Exercise 5.10
A sample of 15 final-year students were found to spend on average 15 hours per week in the university library, with standard deviation 3 hours. A sample of 20 freshers found they spend on average 9 hours per week in the library, standard deviation 5 hours. Is this sufficient evidence to conclude that finalists spend more time in the library?
Are the test procedures valid?
A variety of assumptions underlie each of the tests which we have applied above, and it is worth considering in a little more detail whether these assumptions are justified. This will demonstrate that one should not rely upon the statistical tests alone; it is important to retain one’s sense of judgement.
The first test concerned the weekly turnover of a series of franchise operations. To justify the use of the Normal distribution underlying the test, the sample observations must be independently drawn. The random errors around the true mean turnover figure should be independent of each other. This might not be the case if, for example, similar events could affect the turnover figures of all franchises.
If one were using time-series data, as in the car factory comparison, similar issues arise. Do the 30 days represent independent observations, or might there be an autocorrelation problem (e.g. if the sample days were close together in time)? Suppose that factory 2 suffered a breakdown of some kind which took three days to fix. Output would be reduced on three successive days, and factory 2 would almost inevitably appear less efficient than factory 1. A look at the individual sample observations might be worthwhile, therefore, to see if there are
unusual patterns. It would have been altogether better if the samples had been
collected on randomly chosen days over a longer time period to reduce the
danger of this type of problem.
If the two factories both obtain their supplies from a common but limited
source then the output of one factory might not be independent of the output
of the other. A high output of one factory would tend to be associated with a
low output from the other which has little to do with their relative efﬁciencies.
This might leave the average difference in output unchanged but might increase the variance substantially: either a very high positive value of X̄1 − X̄2 or a very high negative value is obtained. This would lead to a low value of the test statistic and the conclusion of no difference in output. Any real difference in efficiency is masked by the common supplier problem. If the two samples are not independent, then the distribution of X̄1 − X̄2 may not be Normal.
Hypothesis tests and conﬁdence intervals
Formally, two-tail hypothesis tests and confidence intervals are equivalent. Any value that lies within the 95% confidence interval around the sample mean cannot be rejected as the ‘true’ value using the 5% significance level in a hypothesis test using the same sample data. For example, our by now familiar accountant could construct a confidence interval for the firm’s sales. This yields the 95% confidence interval

4792 ≤ μ ≤ 5008 (5.12)

Notice that the hypothesised value of 5000 is within this interval and that it was not rejected by the hypothesis test carried out earlier. As long as the same confidence level is used for both procedures, they are equivalent.
Having said this their interpretation is different. The hypothesis test forces
us into the reject/do not reject dichotomy which is rather a stark choice. We
have seen how it becomes more likely that the null hypothesis is rejected as
the sample size increases. This problem does not occur with estimation. As the
sample size increases the conﬁdence interval becomes narrower around the
unbiased point estimate which is entirely beneﬁcial. The estimation approach
also tends to emphasise importance over signiﬁcance in most people’s minds.
With a hypothesis test one might know that turnover is signiﬁcantly different
from 5000 without knowing how far from 5000 it actually is.
On some occasions a confidence interval is inferior to a hypothesis test, however. Consider the following case. In the UK, only 17 out of 465 judges are women (3.7%). (This figure is somewhat out of date now, but it is still a useful example.) The Equal Opportunities Commission commented that, since the appointment system is so secretive, it is impossible to tell if there is discrimination or not. What can the statistician say about this? No discrimination, in its broadest sense, would mean half of all judges would be women. Thus the hypotheses are
STFE_C05.qxd 26/02/2009 09:10 Page 190

slide 208:

Independent and dependent samples
191
H0: π = 0.5 (no discrimination)
H1: π < 0.5 (discrimination against women)
The sample data are p = 0.037, n = 465. The z score is

z = (0.037 − 0.5)/√(0.5 × 0.5/465) = −19.97
This is clearly significant, and 3.7% is a long way from 50%, so the null hypothesis is rejected. There is some form of discrimination somewhere against women (unless women choose not to be judges). But a confidence interval estimate of the ‘true’ proportion of female judges would be meaningless. To what population is this ‘true’ proportion related?
The lesson from all this is that there exist differences between conﬁdence
intervals and hypothesis tests despite their formal similarity. Which technique
is more appropriate is a matter of judgement for the researcher. With hypothesis
testing the rejection of the null hypothesis at some signiﬁcance level might
actually mean a small and unimportant deviation from the hypothesised
value. It should be remembered that the rejection of the null hypothesis based
on a large sample of data is also consistent with the true value and hypothesised
value possibly being quite close together.
Independent and dependent samples
The following example illustrates the differences between independent samples
as encountered so far and dependent samples where slightly different methods
of analysis are required. The example also illustrates how a particular problem
can often be analysed by a variety of statistical methods.
A company introduces a training programme to raise the productivity of its
clerical workers which is measured by the number of invoices processed per
day. The company wants to know if the training programme is effective. How
should it evaluate the programme There is a variety of ways of going about the
task as follows:
● Take two random samples of workers one trained and one not trained and
compare their productivity.
● Take a sample of workers and compare their productivity before and after
training.
● Take two samples of workers one to be trained and the other not. Compare
the improvement of the trained workers with any change in the other group’s
performance over the same time period.
We shall go through each method in turn pointing out any possible difﬁculties.
Two independent samples
Suppose a group of 10 workers is trained and compared to a group of 10 non-
trained workers with the following data being relevant
X̄T = 25.5  X̄N = 21.0
sT = 2.55  sN = 2.91
nT = 10  nN = 10
Thus trained workers process 25.5 invoices per day, compared to only 21 by non-trained workers. The question is whether this is significant, given that the sample sizes are quite small.
The appropriate test here is a t test of the difference of two sample means, as follows:

H0: μT − μN = 0
H1: μT − μN > 0

t = (25.5 − 21.0)/√(7.49/10 + 7.49/10) = 3.68

where 7.49 is S², the pooled variance. The t statistic leads to rejection of the null hypothesis: the training programme does seem to be effective.
One problem with this test is that the two samples might not be truly random and thus not properly reflect the effect of the training programme. Poor workers might have been reluctant and thus refused to take part in training; departmental managers might have selected better workers for training as some kind of reward; or simply better workers may have volunteered. In a well-designed experiment this should not be allowed to happen, of course, but we do not rule out the possibility. There is also the 5% (significance level) chance of unrepresentative samples being selected and a Type I error occurring.
Paired samples
This is the situation where a sample of workers is tested before and after training. The sample data are as follows:
Worker 1 2 3 4 5 6 7 8 9 10
Before 21 24 23 25 28 17 24 22 24 27
After 23 27 24 28 29 21 24 25 26 28
In this case the observations in the two samples are paired, and this has implications for the method of analysis. One could proceed by assuming these are two independent samples and conduct a t test. The summary data and results are

X̄B = 23.50  X̄A = 25.50
sB = 3.10  sA = 2.55
nB = 10  nA = 10

The resulting test statistic is t(18) = 1.58, which is not significant at the 5% level.
There are two problems with this test and its result. First, the two samples are not truly independent, since the before and after measurements refer to the same group of workers. Second, nine out of 10 workers in the sample have shown an improvement, which is odd in view of the result found above of no significant improvement. If the training programme really has no effect, then
the probability of a single worker showing an improvement is 1/2. The probability of nine or more workers showing an improvement is, by the Binomial method, (1/2)^10 × 10C9 + (1/2)^10, which is about one in a hundred. A very unlikely event seems to have occurred.
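The Binomial probability quoted here can be checked directly:

```python
from math import comb

# Pr(9 or more of 10 workers improve) when improvement is a 50:50 event
p = (comb(10, 9) + comb(10, 10)) * 0.5 ** 10
print(round(p, 4))  # 0.0107, about one in a hundred
```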
The t test used above is inappropriate because it does not make full use of the information in the sample. It does not reflect the fact, for example, that the before and after scores, 21 and 23, relate to the same worker. The Binomial calculation above does reflect this fact. A re-ordering of the data would not affect the t test result but would affect the Binomial, since a different number of workers would now show an improvement. Of course, the Binomial does not use all the sample information either – it dispenses with the actual productivity data for each worker and replaces it with ‘improvement’ or ‘no improvement’. It disregards the amount of improvement for each worker.
The best use of the sample data comes by measuring the improvement for each worker, as follows (if a worker had deteriorated, this would be reflected by a negative number):
Worker 1 2 3 4 5 6 7 8 9 10
Improvement 2 3 1 3 1 4 0 3 2 1
These new data can be treated by single sample methods, and account is taken both of the actual data values and of the fact that the original samples were dependent (re-ordering of the data would produce different improvement figures). The summary statistics of the new data are as follows:

X̄ = 2.00  s = 1.247  n = 10

The null hypothesis of no improvement can now be tested as follows:

H0: μ = 0
H1: μ > 0

t = (2.0 − 0)/(1.247/√10) = 5.07

This is significant at the 5% level, so the null hypothesis of no improvement is rejected. The correct analysis of the sample data has thus reversed the previous conclusion. It is perhaps surprising that treating the same data in different ways leads to such a difference in the results. It does illustrate the importance of using the appropriate method.
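The whole paired calculation can be sketched from the raw data:

```python
from math import sqrt
from statistics import mean, stdev

before = [21, 24, 23, 25, 28, 17, 24, 22, 24, 27]
after = [23, 27, 24, 28, 29, 21, 24, 25, 26, 28]

# Work with the per-worker improvements, then apply a one-sample t test
diffs = [a - b for a, b in zip(after, before)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(round(t, 2))  # 5.07
```

Re-ordering one of the lists would change `diffs` and hence the result, which is exactly why the pairing matters.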
Matters do not end here, however. Although we have discovered an improvement, this might be due to other factors apart from the training programme. For example, if the before and after measurements were taken on different days of the week (that Monday morning feeling . . .), or if one of the days were sunnier, making people feel happier and therefore more productive, this would bias the results. These may seem trivial examples, but such effects do exist, for example the ‘Friday afternoon car’, which has more faults than the average.
The way to solve this problem is to use a control group so called because
extraneous factors are controlled for in order to isolate the effects of the factor
under investigation. In this case the productivity of the control group would be
measured twice, at the same times as that of the training group, though no training would be given to them. Ideally, the control group would be matched on other factors (e.g. age) to the treatment group, to avoid other factors influencing the results. Suppose that the average improvement of the control group were 0.5 invoices per day, with standard deviation 1.0, again for a group of 10. This can be compared with the improvement of the training group via the two-sample t test, giving

t = (2.0 − 0.5)/√(1.13²/10 + 1.13²/10) = 2.97

where 1.13² is the pooled variance. This confirms the finding that the training programme is of value.
Exercise 5.11
A group of students’ marks on two tests, before and after instruction, were as follows:
Student 1 2 3 4 5 6 7 8 9 10 11 12
Before 14 16 11 8 20 19 6 11 13 16 9 13
After 15 18 15 11 19 18 9 12 16 16 12 13
Test the hypothesis that the instruction had no effect, using both the independent sample and paired sample methods. Compare the two results.
Discussion of hypothesis testing
The above exposition has served to illustrate how to carry out a hypothesis test
and the rationale behind it. However the methodology has been subject to criti-
cism and it is important to understand this since it gives a greater insight into
the meaning of the results of a hypothesis test.
In the previous examples the problem has often been posed as a decision-making one, yet we noted that in many instances no decision is actually taken, and therefore it is difficult to justify a particular significance level. Bayesian statisticians would argue that their methods do not suffer from this problem, since the result of their analysis (termed a posterior probability) gives the degree of belief which the researcher has in the truth of the null hypothesis. However, this posterior probability does in part depend upon the prior probability (i.e. before the statistical analysis) that the researcher attaches to the null hypothesis. As noted in Chapter 2, the derivation of the prior probabilities can be difficult.
In practice, most people do not regard the results of a hypothesis test as all-or-nothing proof, but interpret the result on the basis of the quality of the data, the care the researcher has taken in analysing the data, personal experience and a multitude of other factors. Both schools of thought, classical and Bayesian, introduce subjectivity into the analysis and interpretation of data: classical statisticians in the choice of the significance level (and choice of one- or two-tail test), Bayesians in their choice of prior probabilities. It is not clear which method is superior, but classical methods have the advantage of being simpler.
Another criticism of hypothesis testing is that it is based on weak methodological foundations. The philosopher Karl Popper argued that theories should be rigorously tested against the evidence and that strenuous efforts should be made to try to falsify the theory or hypothesis. This methodology is not strictly followed in hypothesis testing, where the researcher’s favoured hypothesis is usually the alternative. A conclusion in favour of the alternative hypothesis is arrived at by default, because of the failure of the null hypothesis to survive the evidence.
Consider the researcher who believes that health standards have changed in
the last decade. This may be tested by gathering data on health and testing the
null hypothesis of no change in health standards against the alternative hypo-
thesis of some change. The researcher’s theory thus becomes the alternative
hypothesis and is never actually tested against the data. No attempt is made to
falsify the alternative hypothesis it is accepted by default if the null hypo-
thesis falls. Only the null hypothesis is ever tested.
A further problem is the asymmetry between the null and alternative hypotheses. The null hypothesis is that there is exactly no change in health standards, whereas the alternative hypothesis contains all other possibilities, from a large deterioration to a large improvement. The dice seem loaded against the null hypothesis. Indeed, as noted earlier, if a large enough sample is taken the null hypothesis is almost certain to be rejected, because there is bound to have been some change, however small. The large sample size leads to a small standard error, √(σ²/n), and thus a large z score. This suggests that the significance level of a test should decrease as the sample size increases.
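The point can be made concrete with a short numerical sketch (the numbers here are illustrative, not taken from the text): hold a tiny true change fixed and let the sample size grow, and the z score eventually exceeds any critical value.

```python
import math

# Illustrative numbers (not from the text): the true mean has shifted
# by a negligible 0.5 units, from 100 to 100.5, with sigma = 10.
mu0, true_mu, sigma = 100.0, 100.5, 10.0

for n in (25, 400, 10_000, 1_000_000):
    se = sigma / math.sqrt(n)        # standard error of the sample mean
    z = (true_mu - mu0) / se         # z score when X-bar equals the true mean
    verdict = "reject H0" if z > 1.96 else "do not reject"
    print(f"n = {n:>9}: z = {z:6.2f}  ({verdict})")
```

With n = 25 the change is invisible (z = 0.25); with a million observations it is overwhelmingly 'significant' (z = 50), which is why one might argue the significance level should fall as n rises.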
These particular problems are avoided by the technique of estimation, which measures the size of the change and focuses attention upon that, rather than upon some accept/reject decision. As the sample size increases, the confidence interval narrows and an improved measure of the true change in health standards is obtained. Zero (i.e. no change in health standards) might be in the confidence interval or it might not; it is not the central issue. We might say that an estimate tells us what the value of a population parameter is, while a hypothesis test tells us what it is not. Thus the techniques of estimation and hypothesis testing put different emphasis upon interpretation of the results, even though they are formally identical.
Summary
● Hypothesis testing is the set of procedures for deciding whether a hypothesis is true or false. When conducting the test we presume the hypothesis, termed the null hypothesis, is true until it is proved false on the basis of some sample evidence.
● If the null is proved false, it is rejected in favour of the alternative hypothesis. The procedure is conceptually similar to a court case, where the defendant is presumed innocent until the evidence proves otherwise.
● Not all decisions turn out to be correct and there are two types of error that can be made. A Type I error is to reject the null hypothesis when it is in fact true. A Type II error is not to reject the null when it is false.
Chapter 5 • Hypothesis testing
196
● Choosing the appropriate decision rule for rejecting the null hypothesis is a question of trading off Type I and Type II errors. Because the alternative hypothesis is imprecisely specified, the probability of a Type II error usually cannot be specified.
● The rejection region for a test is therefore chosen to give a 5% probability of making a Type I error (sometimes a 1% probability is chosen). The critical value of the test statistic (sometimes referred to as the critical value of the test) is the value which separates the acceptance and rejection regions.
● The decision is based upon the value of a test statistic, which is calculated from the sample evidence and from information in the null hypothesis.
● The null hypothesis is rejected if the test statistic falls into the rejection region for the test, i.e. it exceeds the critical value.
● For a two-tail test there are two rejection regions, corresponding to very high and very low values of the test statistic.
● Instead of comparing the test statistic to the critical value, an equivalent procedure is to compare the Prob-value of the test statistic with the significance level. The null is rejected if the Prob-value is less than the significance level.
● The power of a test is the probability of a test correctly rejecting the null hypothesis. Some tests have low power (e.g. when the sample size is small) and therefore are not very useful.
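The Prob-value rule in the points above can be sketched in a couple of lines (the z value here is illustrative):

```python
from statistics import NormalDist

z = 1.83                               # illustrative test statistic
alpha = 0.05                           # 5% significance level
prob_value = 1 - NormalDist().cdf(z)   # one-tail Prob-value

# Reject H0 exactly when the Prob-value is below the significance level;
# this is equivalent to comparing z = 1.83 with the critical value 1.645.
print(round(prob_value, 4))            # about 0.0336
print(prob_value < alpha)              # True: reject at the 5% level
```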
Key terms and concepts
alternative hypothesis, critical value, effect size, independent samples, null (or maintained) hypothesis, one- and two-tail tests, paired samples, power, Prob-value, rejection region, significance level, Type I and Type II errors

Reference
D. McCloskey and S. Ziliak, 'Size matters: the standard error of regressions in the American Economic Review', Journal of Socio-Economics, 2004, 33, 527–546.
Problems
Some of the more challenging problems are indicated by highlighting the problem number in colour.
5.1 Answer true or false, with reasons if necessary.
(a) There is no way of reducing the probability of a Type I error without simultaneously increasing the probability of a Type II error.
(b) The probability of a Type I error is associated with an area under the distribution of X̄, assuming the null hypothesis to be true.
(c) It is always desirable to minimise the probability of a Type I error.
(d) A larger sample, ceteris paribus, will increase the power of a test.
(e) The significance level is the probability of a Type II error.
(f) The confidence level is the probability of a Type II error.
5.2 Consider the investor in the text, seeking out companies with weekly turnover of at least £5000. He applies a one-tail hypothesis test to each firm, using the 5% significance level. State whether each of the following statements is true or false (or not known) and explain why.
(a) 5% of his investments are in companies with less than £5000 turnover.
(b) 5% of the companies he fails to invest in have turnover greater than £5000 per week.
(c) He invests in 95% of all companies with turnover of £5000 or over.
5.3 A coin, which is either fair or has two heads, is to be tossed twice. You decide on the following decision rule: if two heads occur you will conclude it is a two-headed coin; otherwise you will presume it is fair. Write down the null and alternative hypotheses and calculate the probabilities of Type I and Type II errors.
5.4 In comparing two medical treatments for a disease, the null hypothesis is that the two treatments are equally effective. Why does making a Type I error not matter? What significance level for the test should be set as a result?
5.5 A firm receives components from a supplier, which it uses in its own production. The components are delivered in batches of 2000. The supplier claims that there are only 1% defective components on average from its production. However, production occasionally becomes out of control and a batch is produced with 10% defective components. The firm wishes to intercept these low-quality batches, so a sample of size 50 is taken from each batch and tested. If two or more defectives are found in the sample then the batch is rejected.
(a) Describe the two types of error the firm might make in assessing batches of components.
(b) Calculate the probability of each type of error given the data above.
(c) If instead samples of size 30 were taken and the batch rejected if one or more rejects were found, how would the error probabilities be altered?
(d) The firm can alter the two error probabilities by choice of sample size and rejection criteria. How should it set the relative sizes of the error probabilities
(i) if the product might affect consumer safety?
(ii) if there are many competitive suppliers of components?
(iii) if the costs of replacement under guarantee are high?
5.6 Computer diskettes which do not meet the quality required for high-density (1.44 Mb) diskettes are sold as double-density diskettes (720 kb) for 80p each. High-density diskettes are sold for £1.20 each. A firm samples 30 diskettes from each batch of 1000 and, if any fail the quality test, the whole batch is sold as double-density diskettes. What are the types of error possible and what is the cost to the firm of a Type I error?
5.7 Testing the null hypothesis that μ = 10 against μ > 10, a researcher obtains a sample mean of 12 with standard deviation 6 from a sample of 30 observations. Calculate the z score and the associated Prob-value for this test.
5.8 Given the sample data X̄ = 45, s = 16, n = 50, at what level of confidence can you reject H0: μ = 40 against a two-sided alternative?
5.9 What is the power of the test carried out in Problem 5.3?
5.10 Given the two hypotheses
H0: μ = 400
H1: μ = 415
and σ² = 1000 for both hypotheses:
(a) Draw the distribution of X̄ under both hypotheses.
(b) If the decision rule is chosen to be: reject H0 if X̄ ≥ 410, from a sample of size 40, find the probability of a Type II error and the power of the test.
(c) What happens to these answers as the sample size is increased? Draw a diagram to illustrate.
5.11 Given the following sample data:
X̄ = 15, s² = 270, n = 30
test the null hypothesis that the true mean is equal to 12, against a two-sided alternative hypothesis. Draw the distribution of X̄ under the null hypothesis and indicate the rejection regions for this test.
5.12 From experience it is known that a certain brand of tyre lasts, on average, 15 000 miles with standard deviation 1250. A new compound is tried and a sample of 120 tyres yields an average life of 15 150 miles. Are the new tyres an improvement? Use the 5% significance level.
5.13 Test H0: π = 0.5 against H1: π ≠ 0.5, using p = 0.45 from a sample of size n = 35.
5.14 Test the hypothesis that 10% of your class or lecture group are left-handed.
5.15 Given the following data from two independent samples:
X̄1 = 115, X̄2 = 105
s1 = 21, s2 = 23
n1 = 49, n2 = 63
test the hypothesis of no difference between the population means, against the alternative that the mean of population 1 is greater than the mean of population 2.
5.16 A transport company wants to compare the fuel efficiencies of the two types of lorry it operates. It obtains data from samples of the two types of lorry, with the following results:

Type    Average mpg    Std devn    Sample size
A       31.0           7.6         33
B       32.2           5.8         40

Test the hypothesis that there is no difference in fuel efficiency, using the 99% confidence level.
5.17 A random sample of 180 men who took the driving test found that 103 passed. A similar sample of 225 women found that 105 passed. Test whether pass rates are the same for men and women.
5.18 (a) A pharmaceutical company testing a new type of pain reliever administered the drug to 30 volunteers experiencing pain. Sixteen of them said that it eased their pain. Does this evidence support the claim that the drug is effective in combating pain?
(b) A second group of 40 volunteers were given a placebo instead of the drug. Thirteen of them reported a reduction in pain. Does this new evidence cast doubt upon your previous conclusion?
5.19 (a) A random sample of 20 observations yielded a mean of 40 and standard deviation 10. Test the hypothesis that μ = 45 against the alternative that it is not. Use the 5% significance level.
(b) What assumption are you implicitly making in carrying out this test?
5.20 A photo processing company sets a quality standard of no more than 10 complaints per week on average. A random sample of 8 weeks showed an average of 13.6 complaints, with standard deviation 5.3. Is the firm achieving its quality objective?
5.21 Two samples are drawn. The first has a mean of 150, variance 50 and sample size 12. The second has mean 130, variance 30 and sample size 15. Test the hypothesis that they are drawn from populations with the same mean.
5.22 (a) A consumer organisation is testing two different brands of battery. A sample of 15 of brand A shows an average useful life of 410 hours, with a standard deviation of 20 hours. For brand B, a sample of 20 gave an average useful life of 391 hours, with standard deviation 26 hours. Test whether there is any significant difference in battery life.
(b) What assumptions are being made about the populations in carrying out this test?
5.23 The output of a group of 11 workers before and after an improvement in the lighting in their factory is as follows:

Before: 52 60 58 58 53 51 52 59 60 53 55
After:  56 62 63 50 55 56 55 59 61 58 56

Test whether there is a significant improvement in performance:
(a) assuming these are independent samples;
(b) assuming they are dependent.
5.24 Another group of workers were tested at the same times as those in Problem 5.23, although their department also introduced rest breaks into the working day.

Before: 51 59 51 53 58 58 52 55 61 54 55
After:  54 63 55 57 63 63 58 60 66 57 59

Does the introduction of rest breaks alone appear to improve performance?
5.25 Discuss in general terms how you might 'test' the following:
(a) astrology;
(b) extra-sensory perception;
(c) the proposition that company takeovers increase profits.
5.26 (Project) Can your class tell the difference between tap water and bottled water? Set up an experiment as follows: fill r glasses with tap water and n − r glasses with bottled water. The subject has to guess which is which. If they get more than p correct you conclude they can tell the difference. Write up a report of the experiment, including:
(a) a description of the experimental procedure;
(b) your choice of n, r and p, with reasons;
(c) the power of your test;
(d) your conclusions.
5.27 (Computer project) Use the RAND function in your spreadsheet to create 100 samples of size 25, which are effectively all from the same population. Compute the mean and standard deviation of each sample. Calculate the z score for each sample, using a hypothesised mean of 0.5 (since the RAND function chooses a random number in the range 0 to 1).
(a) How many of the z scores would you expect to exceed 1.96 in absolute value? Explain why.
(b) How many do exceed this? Is this in line with your prediction?
(c) Graph the sample means and comment upon the shape of the distribution. Shade in the area of the graph beyond z = ±1.96.
Answers to exercises
Exercise 5.1
(a) H0: crime is the same as last year; H1: crime has increased.
(b) Type I error – concluding crime has risen when in fact it has not. Type II error – concluding it has not risen when in fact it has. The cost of the former might be employing more police officers which are not in fact warranted; of the latter, not employing more police to counter the rising crime level. The Economist magazine (19 July 2003) reported that 33% of respondents to a survey in the UK felt that crime had risen in the previous two years; only 4% thought that it had fallen. In fact crime had fallen slightly, by about 2%. A lot of people were making a Type I error, therefore.
Exercise 5.2
(a) z = (108 − 100)/√36 = 1.33. The area in the tail beyond 1.33 is 9.18%, which is the probability of a Type I error.
(b) z = 1.64 cuts off 5% in the upper tail of the distribution, hence we need the decision rule to be at X̄ = 100 + 1.64 × s/√n = 100 + 1.64 × √36 = 109.84.
(c) Under H1: μ = 112 we can write X̄ ~ N(112, 900/25). (We assume the same variance under both H0 and H1 in this case.) Hence z = (108 − 112)/√36 = −0.67. This gives an area in the tail of 25.14%, which is the Type II error probability. Usually, however, we do not have a precise statement of the value of μ under H1, so cannot do this kind of calculation.
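These calculations are easy to verify in code (numbers as in the exercise; the tiny differences from the figures above come from using exact z values rather than the rounded 1.33 and 0.67):

```python
import math
from statistics import NormalDist

norm = NormalDist()
se = math.sqrt(900 / 25)              # standard error of the mean = 6

# (a) Type I error: P(X-bar > 108) when H0 (mu = 100) is true
alpha = 1 - norm.cdf((108 - 100) / se)
# (b) decision rule giving a 5% Type I error probability
cutoff = 100 + 1.64 * se
# (c) Type II error: P(X-bar < 108) when H1 (mu = 112) is true
beta = norm.cdf((108 - 112) / se)

print(round(alpha, 4))    # about 0.0912
print(round(cutoff, 2))   # 109.84
print(round(beta, 4))     # about 0.2525
```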
Exercise 5.3
α = 0.05 (significance level chosen), hence the critical value is z* = 1.96. The test statistic is z = (530 − 500)/(90/√30) = 1.83 < 1.96, so H0 is not rejected at the 5% significance level.
Exercise 5.4
One wants to avoid making a Type I error if possible i.e. rejecting H
0
when true. Hence
set a low signiﬁcance level 1 so that H
0
is only rejected by very strong evidence.
Exercise 5.5
(a) (i) Reject. The Prob-value should be halved, to 0.0385, which is less than 5%. Alternatively, 1.77 > 1.64. (ii) Do not reject; the Prob-value is greater than 5%. Equivalently, 1.77 < 1.96.
(b) In this case the null is not rejected in both cases. In the one-tailed case 0.0385 > 0.01, so the null is not rejected.
Exercise 5.6
z = (0.65 − 0.5)/√(0.5 × 0.5/80) = 2.68, hence the null is decisively rejected.
Exercise 5.7
We have the data: X̄1 = 3.6, s1 = 1.2, n1 = 50; X̄2 = 3.9, s2 = 2.1, n2 = 90. The null hypothesis is H0: μ1 = μ2 versus H1: μ1 ≠ μ2. The test statistic is
z = (X̄1 − X̄2 − (μ1 − μ2))/√(s1²/n1 + s2²/n2) = (3.6 − 3.9 − 0)/√(1.2²/50 + 2.1²/90) = −1.08
This is less than 1.96 in absolute value, so the null is not rejected at the 5% significance level.
Exercise 5.8
The evidence is p1 = 23/75 = 0.3067, n1 = 75; p2 = 34/95 = 0.3579, n2 = 95. The hypothesis to be tested is H0: π1 − π2 = 0 versus H1: π1 − π2 < 0. Before calculating the test statistic we must calculate the pooled proportion
p̂ = (n1p1 + n2p2)/(n1 + n2) = (75 × 0.3067 + 95 × 0.3579)/(75 + 95) = 0.3353
The test statistic is then
z = (0.3067 − 0.3579 − 0)/√(0.3353 × (1 − 0.3353)/75 + 0.3353 × (1 − 0.3353)/95) = −0.70
This is less in absolute magnitude than 1.64, the critical value of a one-tailed test, so the null is not rejected. The second gambler is just luckier than the first, we conclude. We have to be careful about our interpretation, however: one of the gamblers might prefer longer-odds bets, so wins less often but gets more money each time. Hence this may not be a fair comparison.
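The same two-proportion calculation can be reproduced in a few lines (data as in the exercise):

```python
import math

# 23 wins from 75 bets versus 34 wins from 95 bets
p1, n1 = 23 / 75, 75
p2, n2 = 34 / 95, 95

# pooled proportion under H0: pi1 = pi2
p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

print(round(p_pool, 4))   # 0.3353
print(round(z, 2))        # -0.7 (the -0.70 above)
```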
Exercise 5.9
We shall treat this as a two-tailed test, although a one-tailed test might be justified if there were other evidence that spending had fallen. The hypothesis is H0: μ = 540 versus H1: μ ≠ 540. Given the sample evidence, the test statistic is
t = (X̄ − μ)/(s/√n) = (490 − 540)/(150/√24) = −1.63
The critical value of the t distribution for 23 degrees of freedom is 2.069, so the null is not rejected.
Exercise 5.10
The hypothesis to test is H0: μF − μN = 0 versus H1: μF − μN > 0 (F indexes finalists, N the new students). The pooled variance is calculated as
S² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2) = (15 × 3² + 20 × 5²)/35 = 18.14
The test statistic is
t = (X̄1 − X̄2 − (μ1 − μ2))/√(S²/n1 + S²/n2) = (15 − 9 − 0)/√(18.14/15 + 18.14/20) = 4.12
The critical value of the t distribution with 15 + 20 − 2 = 33 degrees of freedom is approximately 1.69 (5% significance level for a one-tailed test). Thus the null is decisively rejected and we conclude finalists do spend more time in the library.
Exercise 5.11
By the method of independent samples we obtain X̄1 = 13, X̄2 = 14.5, s1 = 4.29, s2 = 3.12, with n = 12 in both cases. The pooled variance is
S² = (11 × 4.29² + 11 × 3.12²)/22 = 14.05
and the test statistic is therefore
t = (13 − 14.5 − 0)/√(14.05/12 + 14.05/12) = −0.98
The null of no effect is therefore accepted. By the method of paired samples we have a set of improvements as follows:

Student:      1  2  3  4  5  6  7  8  9 10 11 12
Improvement:  1  2  4  3 −1 −1  3  1  3  0  3  0

The mean of these is 1.5 and the variance is 3. The t statistic is therefore
t = (1.5 − 0)/√(3/12) = 3.0
This now conclusively rejects the null hypothesis (critical value 1.80), in stark contrast to the former method. The difference arises because 10 out of 12 students have improved or done as well as before; only two have fallen back slightly. The gain in marks is modest but applies consistently to nearly all candidates.
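The paired calculation can be checked directly from the improvements listed above:

```python
import math
from statistics import mean, stdev

# improvements for the 12 students, as in the answer above
d = [1, 2, 4, 3, -1, -1, 3, 1, 3, 0, 3, 0]

d_bar = mean(d)                          # 1.5
s_d = stdev(d)                           # sample standard deviation, sqrt(3)
t = (d_bar - 0) / (s_d / math.sqrt(len(d)))

print(d_bar)               # 1.5
print(round(s_d ** 2, 1))  # 3.0 (the variance)
print(round(t, 1))         # 3.0
```

Pairing removes the student-to-student variation in marks, which is why the same data give t = 3.0 here but only −0.98 by the independent-samples method.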
6
The χ² and F distributions

Contents
Learning outcomes 204
Introduction 205
The χ² distribution 205
Estimating a variance 206
Comparing actual and expected values of a variable 208
Contingency tables 215
Constructing the expected values 216
Calculation of the test statistic 218
The F distribution 220
Testing the equality of two variances 220
Analysis of variance 222
The result of the hypothesis test 226
The analysis of variance table 227
Summary 229
Key terms and concepts 230
Problems 231
Answers to exercises 234
Appendix: Use of χ² and F distribution tables 236

Learning outcomes
By the end of this chapter you should be able to:
● understand the uses of two new probability distributions: χ² and F
● construct confidence interval estimates for a variance
● perform hypothesis tests concerning variances
● analyse and draw inferences from data contained in contingency tables
● construct a simple analysis of variance table and interpret the results.

Complete your diagnostic test for Chapter 6 now to create your personal study plan. Exercises with an icon are also available for practice in MathXL, with additional supporting resources.
[Figure 6.1: The χ² distribution with different degrees of freedom]
Introduction
The final two distributions to be studied are the χ² (pronounced 'kye-squared') and F distributions. Both of these distributions have a variety of uses, the most common of which are illustrated in this chapter. These distributions allow us to extend some of the estimation and testing procedures covered in Chapters 4 and 5. The χ² distribution allows us to establish confidence interval estimates for a variance, just as the Normal and t distributions were used in the case of a mean. Further, just as the Binomial distribution was used to examine situations where the result of an experiment could be either 'success' or 'failure', the χ² distribution allows us to analyse situations where there are more than two categories of outcome. The F distribution enables us to conduct hypothesis tests regarding the equality of two variances and also to make comparisons between the means of multiple samples, not just two. The F distribution is also used in Chapters 7 and 8 on regression analysis.
The χ² distribution
The χ² distribution has a number of uses. In this chapter we make use of it in three ways:
● To calculate a confidence interval estimate of the population variance.
● To compare actual observations on a variable with the theoretically expected values.
● To test for association between two variables in a contingency table.
The use of the distribution is in many ways similar to the Normal and t distributions already encountered. Once again, it is actually a family of distributions depending upon one parameter, the degrees of freedom, similar to the t distribution. The number of degrees of freedom can have slightly different interpretations, depending upon the particular problem, but is often related to sample size in some way. Some typical χ² distributions are drawn in Figure 6.1 for different values of the parameter. Note the distribution has the following characteristics:
● it is always non-negative
● it is skewed to the right
● it becomes more symmetric as the number of degrees of freedom increases.
Using the χ² distribution to construct confidence intervals is done in the usual way, by using the critical values of the distribution (given in Table A4, see page 416) which cut off an area α/2 in each tail of the distribution. For hypothesis tests, a rejection region is defined which cuts off an area α in either one or both tails of the distribution, whichever is appropriate. These principles should be familiar from previous chapters, so are not repeated in detail. The following examples show how this works for the χ² distribution.
Estimating a variance
The sample variance is also a random variable; like the mean, it takes on different values from sample to sample. We can therefore ask the usual question: given a sample variance, what can we infer about the true value?
To give an example, we use the data on spending by Labour boroughs in the example in Chapter 4 (see page 163). In that sample of 20 boroughs, the average spending on administration was £175 per taxpayer, with standard deviation 25 and hence variance of 625. What can we say about the true variance and standard deviation?
We work in terms of variances (this is more convenient when using the χ² distribution), taking the square root when we need to refer to the standard deviation. First of all, the sample variance is an unbiased estimator of the population variance,¹ E(s²) = σ², so we may use this as our point estimate, which is therefore 625. To construct the confidence interval around this we need to know about the distribution of s². Unfortunately, this does not have a convenient probability distribution, so we transform it to

(n − 1)s²/σ²    (6.1)

which does have a χ² distribution, with ν = n − 1 degrees of freedom. Again, we state this without a formal mathematical proof.
To construct the 95% confidence interval around the point estimate we proceed in a similar fashion to the Normal or t distribution. First we find the critical values of the χ² distribution which cut off 2.5% in each tail. These are no longer symmetric around zero, as was the case with the standard Normal and t distributions. Table 6.1 shows an excerpt from the χ² table which is given in full in Table A4 in the Appendix at the end of the book (see page 416).
Like the t distribution, the first column gives the degrees of freedom, so we require the row corresponding to ν = n − 1 = 19.
● For the left-hand critical value (cutting off 2.5% in the left-hand tail) we look at the column headed '0.975', representing 97.5% in the right-hand tail. This critical value is 8.91.
● For the right-hand critical value we look up the column headed '0.025' (2.5% in the right-hand tail), giving 32.85.
¹ This was stated without proof in Chapter 1 (see page 36).
We can therefore be 95% confident that (n − 1)s²/σ² lies between these two values, i.e.

8.91 ≤ (n − 1)s²/σ² ≤ 32.85    (6.2)

We actually want an interval estimate for σ², so we need to rearrange equation (6.2) so that σ² lies between the two inequality signs. Rearranging yields

(n − 1)s²/32.85 ≤ σ² ≤ (n − 1)s²/8.91    (6.3)

and evaluating this expression leads to the 95% confidence interval for σ², which is

19 × 625/32.85 ≤ σ² ≤ 19 × 625/8.91
361.5 ≤ σ² ≤ 1332.8

Note that the point estimate, 625, is no longer at the centre of the interval but is closer to the lower limit. This is a consequence of the skewness of the χ² distribution.
Worked example 6.1
Given a sample of size n = 51 yielding a sample variance s² = 81, we may calculate the 95% confidence interval for the population variance as follows.
Since we are using the 95% confidence level, the critical values cutting off the extreme 5% of the distribution are 32.36 and 71.42 (from Table A4). We can therefore use equation (6.3) to find the interval
(n − 1) × s²/71.42 ≤ σ² ≤ (n − 1) × s²/32.36
Substituting in the values gives
(51 − 1) × 81/71.42 ≤ σ² ≤ (51 − 1) × 81/32.36
yielding a confidence interval of [56.71, 125.15].
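The arithmetic of the worked example is easily checked in code (critical values taken from Table A4, as in the text):

```python
# Worked example 6.1: n = 51, sample variance 81
n, s2 = 51, 81
chi_lo, chi_hi = 32.36, 71.42    # Table A4 critical values for nu = 50

lower = (n - 1) * s2 / chi_hi    # divide by the larger critical value
upper = (n - 1) * s2 / chi_lo    # divide by the smaller critical value

print(round(lower, 2), round(upper, 2))   # 56.71 125.15
```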
Table 6.1 Excerpt from Table A4 – the χ² distribution

ν     0.99     0.975    ...   0.10      0.05      0.025     0.01
1     0.0002   0.0010   ...   2.7055    3.8415    5.0239    6.6349
2     0.0201   0.0506   ...   4.6052    5.9915    7.3778    9.2104
...
18    7.0149   8.2307   ...   25.9894   28.8693   31.5264   34.8052
19    7.6327   8.9065   ...   27.2036   30.1435   32.8523   36.1908
20    8.2604   9.5908   ...   28.4120   31.4104   34.1696   37.5663

Note: The two critical values are found at the intersections of the shaded row and columns. Alternatively you can use Excel. The formula =CHIINV(0.975, 19) gives the left-hand critical value, 8.91; similarly =CHIINV(0.025, 19) gives the answer 32.85, the right-hand critical value.
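If neither the printed table nor Excel is to hand, χ² critical values can be approximated from standard Normal ones using the Wilson–Hilferty formula (a standard approximation, not part of the text):

```python
from statistics import NormalDist

def chi2_critical(upper_tail_area: float, nu: int) -> float:
    """Wilson-Hilferty approximation to a chi-squared critical value."""
    z = NormalDist().inv_cdf(1 - upper_tail_area)   # Normal critical value
    c = 2 / (9 * nu)
    return nu * (1 - c + z * c ** 0.5) ** 3

# compare with Table 6.1 for nu = 19 (true values 8.91 and 32.85)
print(round(chi2_critical(0.975, 19), 2))   # about 8.89
print(round(chi2_critical(0.025, 19), 2))   # about 32.85
```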
Note that if we wished to find a 95% confidence interval for the standard deviation, we can simply take the square root of the result, to obtain [7.53, 11.19].
The 99% CI for the variance can be obtained by altering the critical values. The values cutting off 0.5% in each tail of the distribution are (again from Table A4) 27.99 and 79.49. Using these critical values results in an interval [50.95, 144.69]. Note that, as expected, the 99% CI is wider than the 95% interval.

Exercise 6.1
(a) Given a sample variance of 65 from a sample of size n = 30, calculate the 95% confidence interval for the variance of the population from which the sample was drawn.
(b) Calculate the 95% CI for the standard deviation.
(c) Calculate the 99% interval estimate of the variance.
Comparing actual and expected values of a variable
A second use of the χ² distribution provides a hypothesis test, allowing us to compare a set of observed values to expected values, the latter calculated on the basis of some null hypothesis to be tested. If the observed and expected values differ significantly, as judged by the χ² test (the test statistic falls into the rejection region of the χ² distribution), then the null hypothesis is rejected. Again, this is similar in principle to hypothesis testing using the Normal or t distributions, but allows a slightly different type of problem to be handled.
This can be illustrated with a very simple example. Suppose that throwing a die 72 times yields the following data:

Score on die:  1   2   3   4   5   6
Frequency:     6  15  15   7  15  14

Are these data consistent with the die being unbiased? Previously we might have investigated this problem by testing whether the proportion of (say) sixes is more or less than expected, using the Binomial distribution. One could still do this, but it does not make full use of the information in the sample; it only compares sixes against all other values together. The χ² test allows one to see if there is any bias in the die, for or against a particular number. It therefore answers a slightly different and more general question than if we made use of the Binomial distribution.
A crude examination of the data suggests a slight bias against 1 and 4, but is this truly bias or just a random fluctuation, quite common in this type of experiment? First the null and alternative hypotheses are set up:
H0: the die is unbiased
H1: the die is biased
Note that the null hypothesis should be constructed in such a way as to permit the calculation of the expected outcomes of the experiment. Thus the null and alternative hypotheses could not be reversed in this case, since 'the die is biased'

slide 226:

The χ
2
distribution
209
Table 6.2 Calculation of the χ² statistic for the die problem

Score   Observed frequency O   Expected frequency E   O − E   (O − E)²   (O − E)²/E
1       6                      12                     −6      36         3.00
2       15                     12                     3       9          0.75
3       15                     12                     3       9          0.75
4       7                      12                     −5      25         2.08
5       15                     12                     3       9          0.75
6       14                     12                     2       4          0.33
Totals  72                     72                     0       –          7.66
is a vague statement (exactly how biased, for example?) and would not permit the calculation of the expected outcomes of the experiment.
On the basis of the null hypothesis, the expected values are based on the uniform distribution, i.e. each number should come up an equal number of times. The expected values are therefore 12 (= 72/6) for each number on the die. This gives the data shown in Table 6.2, with observed and expected frequencies in columns two and three respectively (ignore columns 4–6 for the moment).
The χ² test statistic is now constructed using the formula

χ² = Σ (O − E)²/E    (6.4)

which has a χ² distribution with ν = k − 1 degrees of freedom (k is the number of different outcomes, here 6).² O represents the observed frequencies and E the expected. If the value of this test statistic falls into the rejection region, i.e. the tail of the χ² distribution, then we conclude the die is biased, rejecting the null. The calculation of the test statistic is shown in columns 4–6 of Table 6.2 and is straightforward, yielding a value of the test statistic of χ² = 7.66, to be compared to the critical value of the distribution for 6 − 1 = 5 degrees of freedom.
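Columns 4–6 of Table 6.2 and formula (6.4) can be reproduced in a few lines (Python here, rather than the spreadsheet used elsewhere in the book):

```python
observed = [6, 15, 15, 7, 15, 14]      # die frequencies from the text
expected = [sum(observed) / 6] * 6     # 12 for each face under H0

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(chi2, 2))   # 7.67 (the text's 7.66 rounds each term separately)
print(chi2 < 11.1)      # True: below the critical value, do not reject H0
```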
Trap!
In my experience many students misinterpret formula (6.4) and use instead
χ² = (Σ(O − E))²/ΣE
This is not the same as the correct formula and gives the wrong answer! Check that you recognise the difference between the two and that you always use the correct version.

Looking up the critical value for this test takes a little care, as one needs first to consider if it is a one- or two-tailed test. Looking at the alternative hypothesis suggests a two-sided test, since the error could be in either direction. However, this intuition is wrong, for the following reason. Looking closely at equation (6.4)
2
Note that on this occasion the degrees of freedom are not based on the sample size.
reveals that large discrepancies between observed and expected values, however
occurring, can only lead to large values of the test statistic. Conversely, small
values of the test statistic must mean that differences between O and E are small,
so the die must be unbiased. Thus the null is only rejected by large values of the
χ² statistic; in other words, the rejection region is in the right-hand tail only
of the χ² distribution. It is a one-tailed test. This is illustrated in Figure 6.2.
The critical value of the χ² distribution in this case (ν = 5, 5% significance
level) is 11.1, found from Table A4. Note that we require 5% of the distribution
in the right-hand tail to establish the rejection region. Since the test statistic
is less than the critical value (7.66 < 11.1), the null hypothesis is not rejected.
The differences between scores are due to sampling error rather than to bias in
the die.
An important point to note is that the value of the test statistic is sensitive to
the total frequency (72 in this case). Therefore the test should not be carried out
on the proportion of occasions on which each number comes up (the expected
values would all be 12/72 = 0.167 and the observed values 6/72, 15/72, etc.),
since information about the 'sample size' (number of rolls of the die) would be
lost. As with all sampling experiments, the inferences that can be drawn depend
upon the sample size, with larger sample sizes giving more reliable results, so
care must be taken to retain information about sample size in the calculations.
If the test had been (incorrectly) conducted in terms of proportions, all O and E
values would have been divided by 72, and this would have reduced the test
statistic by a factor of 72 (check the formula to confirm this), reducing it to
about 0.11 – nowhere near significance. It would be surprising if any data would
yield significance given this degree of mistreatment! See the 'Oops!' box later in
this chapter.
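The factor-of-72 point can be seen directly in code: running the (incorrect) test on proportions shrinks the statistic by exactly the number of rolls. This is our own illustrative sketch, using the die data from Table 6.2.

```python
# Why the test must use raw frequencies: dividing all O and E by the number
# of rolls divides the chi-squared statistic by that same factor.

def chi2_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [6, 15, 15, 7, 15, 14]
expected = [12] * 6
n = sum(observed)                      # 72 rolls in total

stat_counts = chi2_stat(observed, expected)                 # correct: about 7.67
stat_props = chi2_stat([o / n for o in observed],
                       [e / n for e in expected])           # incorrect: about 0.11

print(round(stat_counts / stat_props))                      # 72
```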
A second, more realistic, example will now be examined to reinforce the message
about the use of the χ² distribution and to show how the expected values
might be generated in different ways. This example looks at road accident
figures to see if there is any variation through the year. One might reasonably
expect more accidents in the winter months, due to weather conditions, poorer
light, etc. Quarterly data on the number of people killed on British roads are
used, and the null hypothesis is that the number does not vary seasonally.
H0: there is no difference in fatal accidents between quarters
H1: there is some difference in fatal accidents between quarters

[Figure 6.2: The rejection region for the χ² test]
Such a study might be carried out by government, for example, to try to find the
best means of reducing road accidents.
Table 6.3 shows data on road fatalities in 2006 by quarter in Great Britain,
adapted from data taken from the UK government's Road Casualties Great Britain
2006. There does appear some evidence of more accidents in the final two quarters
of the year, but is this convincing evidence or just random variation? Under
the null hypothesis the total number of fatalities (3172) would be evenly split
between the four quarters, yielding Table 6.4 and the χ² calculation that follows.
The calculated value of the test statistic is 30.19, given at the foot of the final
column. The number of degrees of freedom is ν = k − 1 = 3, so the critical value
at the 5% significance level is 7.82. Since the test statistic exceeds this, the null
hypothesis is rejected: there is a difference between seasons in the accident rate.
The fourth edition of this book used similar data, for 2002. Although the total
number of accidents was larger (3431), the seasonal pattern was almost identical
and the χ² statistic was 31.24. The similarity of patterns from two different years
strengthens our belief about seasonal differences.
The reason for this difference might be the increased hours of darkness
during winter months, leading to more accidents. This particular hypothesis can
be tested using the same data, but combining quarters I and IV to represent
winter, and quarters II and III summer. The null hypothesis is of no difference
between summer and winter, and the calculation is set out in Table 6.5. The
χ² test statistic is now extremely small and falls below the new critical value
(ν = 1, 5% significance level) of 3.84, so the null hypothesis is not rejected.
Table 6.3 Road casualties in Great Britain, 2006

Quarter      I      II     III    IV     Total
Casualties   697    743    838    894    3172

Table 6.4 Calculation of the χ² statistic for road fatalities

Quarter   Observed   Expected   O − E   (O − E)²   (O − E)²/E
I         697        793        −96     9216       11.62
II        743        793        −50     2500       3.15
III       838        793        45      2025       2.55
IV        894        793        101     10 201     12.86
Totals    3172       3172       –       –          30.19

Table 6.5 Seasonal variation in road casualties

Season    Observed   Expected   O − E   (O − E)²   (O − E)²/E
Summer    1581       1586       −5      25         0.016
Winter    1591       1586       5       25         0.016
Totals    3172       3172       0       –          0.032
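The two road-casualty tests above can be reproduced in a few lines. The data and critical values (7.82 and 3.84) are from the text; the helper function is our own sketch.

```python
# Reproducing Tables 6.4 and 6.5: quarterly test, then summer/winter test.

def chi2_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

quarters = [697, 743, 838, 894]        # casualties in quarters I-IV (Table 6.3)
total = sum(quarters)                  # 3172

# H0: equal casualties in each quarter, so E = 3172/4 = 793
quarterly = chi2_stat(quarters, [total / 4] * 4)

# H0: no summer/winter difference; summer = II + III, winter = I + IV
summer = quarters[1] + quarters[2]     # 1581
winter = quarters[0] + quarters[3]     # 1591
seasonal = chi2_stat([summer, winter], [total / 2] * 2)

print(round(quarterly, 2))             # 30.19 > 7.82, reject H0
print(round(seasonal, 3))              # 0.032 < 3.84, do not reject H0
```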
Thus the variation between quarters does not appear to be a straightforward
summer/winter effect, providing of course that combining quarters I and IV to
represent winter, and II and III to represent summer, is a valid way of combining
the quarters.³

³ An earlier edition of this book, using data from 1993, did find a significant difference
between summer and winter, so either things have changed or there are still some puzzles
to resolve.
Another point which the example brings out is that the data can be examined
in a number of ways using the χ² technique. Some of the classes were combined
to test a slightly different hypothesis from the original one. This is a quite
acceptable technique, but should be used with caution. In any set of data, even
totally random data, there is bound to be some way of dividing it up such that
there are significant differences between the divisions. The point, however,
is whether there is any meaning to the division. In the above example, the
amalgamation of the quarters into summer and winter has some intuitive meaning,
and we have good reason to believe that there might be differences between
them. Driving during the hours of darkness might be more dangerous, and might
have had some relevance to accident prevention policy (e.g. an advertising
campaign to persuade people to check that their lights work correctly). The
hypothesis is led by some prior theorising and is worth testing.
Road accidents and darkness

The question of the effect of darkness on road accidents has been extensively
studied, particularly in relation to putting the clocks forwards in spring and
back in autumn. A study by H. Green in 1980 reported the following numbers of
accidents involving death or serious injury on the five weekday evenings before
and after the clocks changed:

         Spring            Autumn
Year     Before   After    Before   After
1975     19       11       20       31
1976     14       9        23       36
1977     22       8        12       29

It is noticeable that accidents fell in spring after the hour change (when it
becomes lighter) but increased in autumn (when it becomes darker). This is a
better test than simply combining quarterly figures as in our example, so casts
doubt upon our result. Evidence from other countries also supports the view that
the light level has an important influence on accidents.

Source: H. Green, Some effects on accidents of changes in light conditions at the
beginning and end of British Summer Time, Supplementary Report 587, Transport
and Road Research Laboratory, 1980. For an update on research, see J. Broughton
et al., Influence of light level on the incidence of road casualties, J. Royal
Statistical Society, Series A, 1999, 162(2), 137–175.
It is dangerous, however, to look at the data and then formulate a hypothesis.
From Table 6.4 there appears to be a large difference between the first and
second halves of the year. If quarters I and II were combined, and III and IV
combined, the χ² test statistic might be significant (in fact, it is χ² = 26.9), but
does this signify anything? It is extremely easy to look for a big difference somewhere
in any set of data and then pronounce it 'significant' according to some
test. The probability of making a Type I error (rejecting a correct null) is much
greater than 5% in such a case. The point, as usual, is that it is no good looking
at data in a vacuum and simply hoping that they will 'tell you something'.
A related warning is that we should be wary of testing one hypothesis and,
on the basis of that result, formulating another hypothesis and testing it (as
we have done by going on to compare summer and winter). Once again, we
are indirectly using the data to help formulate the hypothesis, and the true
significance level of the test is likely to be different from 5%, even though we use
the 5% critical value. We have therefore sinned, but it is difficult to do research
without sometimes resorting to this kind of behaviour. There are formal methods
for dealing with such situations, but they are beyond the scope of this book.
There is one further point to make about carrying out a χ² test, and this
involves circumstances where classes must be combined. The theoretical χ² distribution,
from which the critical value is obtained, is a continuous distribution,
yet the calculation of the test statistic comes from data which are divided
up into a discrete number of classes. The calculated test statistic is therefore only
an approximation to a true χ² variable, but this approximation is good enough
as long as each expected (not observed) value is greater than or equal to five.
It does not matter what the observed values are. In other circumstances, the
class (or classes) with expected values less than five must be combined with
other classes until all expected values are at least five. An example of this will
be given below.
In all cases of χ² testing, the most important part of the analysis is the calculation
of the expected values (the rest of the analysis is mechanical!). Therefore,
it is always worth devoting most of the time to this part of the problem. The
expected values are, of course, calculated on the basis of the null hypothesis
being true, so different null hypotheses will give different expected values.
Consider again the case of road fatalities. Although the null hypothesis 'no
differences in accidents between quarters' seems clear enough, it could mean
different things. Here it was taken to mean an equal number in each quarter,
but another interpretation is an equal number of casualties per car-kilometre
travelled in each quarter; in other words, accidents might be higher in a given
quarter simply because there are more journeys in that quarter (during holiday
periods, for example). Table 6.6 gives an index of average daily traffic flows on
British roads in each quarter of the year.
The pattern of accidents might follow the pattern of road usage – the first
quarter of the year has the fewest casualties and also the least amount of travel.
This may be tested by basing the expected values on the average traffic flow: the
3172 total casualties are allocated to the four quarters in the ratios 95:102:105:98.
This is shown in Table 6.7, along with the calculation of the χ² statistic.
The χ² test statistic is 27.18, well in excess of the critical value (7.82). This
indicates that there are significant differences between the quarters, even after
accounting for different amounts of traffic. In fact, the statistic is little changed
from before, suggesting either that traffic flows do not affect accident probabilities
much or that the flows do not actually vary very much. It is evident that
the variation in traffic flows is much less than the variation in casualties. One
possible explanation is that increased traffic means lower speed and hence a
lower severity of accidents.

Table 6.6 Index of road traffic, 2002–2006

          Q1    Q2     Q3     Q4    Total
Index     95    102    105    98    400

Table 6.7 Calculation with alternative pattern of expected values

Quarter   Observed   Expected   O − E    (O − E)²    (O − E)²/E
I         697        753.4      −56.4    3175.32     4.21
II        743        808.9      −65.9    4337.54     5.36
III       838        832.7      5.4      28.62       0.03
IV        894        777.1      116.9    13 656.26   17.57
Totals    3172       3172       –        –           27.18

Note: The first expected value is calculated as 3172 × 95 ÷ 400 = 753.4, the second as
3172 × 102 ÷ 400 = 808.9, and so on.
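The traffic-weighted test follows the same pattern as before; only the allocation rule for the expected values changes. A sketch, using the data from Tables 6.3 and 6.6:

```python
# Expected values proportional to the traffic index of Table 6.6.

def chi2_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [697, 743, 838, 894]        # casualties by quarter
traffic = [95, 102, 105, 98]           # traffic index, summing to 400
total = sum(observed)                  # 3172

# Allocate the total in the ratios 95:102:105:98, e.g. 3172 * 95/400 = 753.35
expected = [total * t / sum(traffic) for t in traffic]

stat = chi2_stat(observed, expected)
print(round(stat, 2))                  # 27.18, still well above 7.82
```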
Worked example 6.2

One hundred volunteers each toss a coin twice and note the numbers of
heads. The results of the experiment are as follows:

Heads       0     1     2     Total
Frequency   15    55    30    100

Can we reject the hypothesis that a fair coin (or, strictly, coins) was used
for the experiment?
On the basis of the Binomial distribution, the probability of no heads is
0.25 (= 1/2 × 1/2), of one head is 0.5, and of two heads is again 0.25, as explained
in Chapter 2. The expected frequencies are therefore 25, 50 and 25. The
calculation of the test statistic is set out below:

Number of heads   O     E     O − E   (O − E)²   (O − E)²/E
0                 15    25    −10     100        4
1                 55    50    5       25         0.5
2                 30    25    5       25         1
Totals            100   100   –       –          5.5

The test statistic of 5.5 compares to a critical value of 5.99 (ν = 2), so we
cannot reject the null hypothesis of a fair coin being used.
Note that we could test the hypothesis using a z test, using the methods of
Chapter 5. There have been a total of 200 tosses, of which 115 (= 55 + 2 × 30)
were heads, i.e. a ratio of 0.575, against the expected 0.5. We can therefore test
H0: π = 0.5 against H1: π ≠ 0.5 using the evidence n = 200 and p = 0.575. This
yields the test statistic

z = (0.575 − 0.5) / √(0.5 × 0.5/200) = 2.12

Interestingly, we now reject the null, as the test statistic is greater than the
critical value of 1.96. How can we reconcile these conflicting results?
Note that both results are close to the critical values, so narrowly reject
or accept the null. The χ² and z distributions are both continuous ones and,
in this case, are approximations to the underlying Binomial experiment.
This is the cause of the problem. If we alter the data very slightly, to 16, 55,
29 (observed frequencies of no heads, one head and two heads), then both
methods accept the null hypothesis. Similarly, for frequencies 14, 55, 31, both
methods reject the null.
The lesson of this example is to be cautious when the test statistic is close
to the critical value. We cannot say decisively that the null has been accepted
or rejected.
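The conflict in the worked example can be replayed numerically. Both statistics below use only the data given in the example; the χ² value sits just under its critical value (5.99) while the z value sits just over its own (1.96).

```python
# Worked example 6.2 in code: chi-squared test vs z test on the same coins.
from math import sqrt

def chi2_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

freq = [15, 55, 30]                    # volunteers seeing 0, 1, 2 heads
chi2 = chi2_stat(freq, [25, 50, 25])   # Binomial expectations for 100 volunteers

tosses = 2 * sum(freq)                 # 200 tosses in all
heads = freq[1] + 2 * freq[2]          # 115 heads
p = heads / tosses                     # sample proportion, 0.575
z = (p - 0.5) / sqrt(0.5 * 0.5 / tosses)

print(round(chi2, 2), round(z, 2))     # 5.5 2.12
```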
Exercise 6.2

The following data show the observed and expected frequencies of an experiment
with four possible outcomes, A–D.

Outcome    A     B     C     D
Observed   40    60    75    90
Expected   35    55    75    100

Test the hypothesis that the results are in line with expectations, using the 5%
significance level.

Exercise 6.3

(a) Verify the claim in the worked example above that both χ² and z statistic methods
give the same qualitative (accept or reject) result when the observed frequencies
are 16, 55, 29 and when they are 14, 55, 31.
(b) In each case, look up or calculate (using Excel) the Prob-values for the χ² and z
test statistics and compare.
Contingency tables

Data are often presented in the form of a two-way classification, as shown in
Table 6.8, known as a contingency table, and this is another situation where the
χ² distribution is useful. It provides a test of whether or not there is an association
between the two variables represented in the table.
The table shows the voting intentions of a sample of 200 voters, cross-classified
by social class. The interesting question that arises from these data is
whether there is any association between people's voting behaviour and their
social class. Are manual workers (social class C in the table) more likely to vote
for the Labour party than for the Conservative party? The table would appear to
indicate support for this view, but is this truly the case for the whole population,
or is the evidence insufficient to draw this conclusion?
This sort of problem is amenable to analysis by a χ² test. The data presented
in the table represent the observed values, so expected values need to be calculated
and then compared to them using a χ² test statistic. The first task is to
formulate a null hypothesis, on which to base the calculation of the expected
values, and an alternative hypothesis. These are

H0: there is no association between social class and voting behaviour
H1: there is some association between social class and voting behaviour

As always, the null hypothesis has to be precise, so that expected values can be
calculated. In this case, it is the precise statement that there is no association
between the two variables; they are independent.
Constructing the expected values

If H0 is true and there is no association, we would expect the proportions voting
Labour, Conservative and Liberal Democrat to be the same in each social class.
Further, the parties would be identical in the proportions of their support coming
from social classes A, B and C. This means that, since the whole sample of
200 splits 80:70:50 for the Labour, Conservative and Liberal Democrat parties
(see the bottom row of the table), each social class should split the same way.
Thus of the 40 people of class A, 80/200 of them should vote Labour, 70/200
Conservative and 50/200 Liberal Democrat. This yields:

Split of social class A:
Labour 40 × 80/200 = 16
Conservative 40 × 70/200 = 14
Liberal Democrat 40 × 50/200 = 10

For class B:
Labour 100 × 80/200 = 40
Conservative 100 × 70/200 = 35
Liberal Democrat 100 × 50/200 = 25

And for C, the 60 votes are split Labour 24, Conservative 21 and Liberal
Democrat 15.
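The class-by-class splits above are all instances of one rule — row total × column total ÷ grand total — which a short helper (our own sketch, not from the text) applies to every cell at once:

```python
# Expected cell counts under the null hypothesis of independence.

def expected_table(observed):
    """Apply row total * column total / grand total to each cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

votes = [[10, 15, 15],                 # class A: Labour, Conservative, Lib Dem
         [40, 35, 25],                 # class B
         [30, 20, 10]]                 # class C

for row in expected_table(votes):
    print([round(x) for x in row])
# [16, 14, 10]
# [40, 35, 25]
# [24, 21, 15]
```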
Table 6.8 Data on voting intentions by social class

Social class   Labour   Conservative   Liberal Democrat   Total
A              10       15             15                 40
B              40       35             25                 100
C              30       20             10                 60
Totals         80       70             50                 200

Both observed and expected values are presented in Table 6.9 (expected
values are in brackets). Notice that both the observed and expected values sum
to the appropriate row and column totals. It can be seen that, compared with
the 'no association' position, Labour receives too few votes from class A and the
Liberal Democrats too many. However, Labour receives disproportionately
many class C votes, the Liberal Democrats too few. The Conservatives' observed
and expected values are identical, indicating that the propensities to vote
Conservative are the same in all social classes.
A quick way to calculate the expected value in any cell is to multiply the
appropriate row total by column total and divide through by the grand total
(200). For example, to obtain the expected value for the class A/Labour cell

expected value = (row total × column total) / grand total = (40 × 80) / 200 = 16

In carrying out the analysis, care should again be taken to ensure that information
is retained about the sample size, i.e. the numbers in the table should be
actual numbers and not percentages or proportions. This can be checked by
ensuring that the grand total is always the same as the sample size.
As was the case before, the χ² test is only valid if the expected value in each
cell is not less than five. In the event of one of the expected values being less
than five, some of the rows or columns have to be combined. How to do this
is a matter of choice and depends upon the aims of the research. Suppose,
for example, that the expected number of class C voting Liberal Democrat were
less than five. There are four options open:

1 Combine the Liberal Democrat column with the Labour column.
2 Combine the Liberal Democrat column with the Conservative column.
3 Combine the class C row with the class A row.
4 Combine the class C row with the class B row.

Whether rows or columns are combined depends upon whether interest
centres more upon differences between parties or differences between classes.
If the main interest is the difference between class A and the others, option 4
should be chosen. If it is felt that the Liberal Democrat and Conservative parties
are similar, option 2 would be preferred, and so on. If there are several expected
values less than five, rows and columns must be combined until all are eliminated.
The χ² test on a contingency table is similar to the one carried out before, the
formula being the same:

χ² = ∑ (O − E)²/E    (6.5)

with the number of degrees of freedom given by ν = (r − 1) × (c − 1), where r
is the number of rows in the table and c is the number of columns. In this case,
r = 3 and c = 3, so

ν = (3 − 1) × (3 − 1) = 4
Table 6.9 Observed and expected values (latter in brackets)

Social class   Labour     Conservative   Liberal Democrat   Total
A              10 (16)    15 (14)        15 (10)            40
B              40 (40)    35 (35)        25 (25)            100
C              30 (24)    20 (21)        10 (15)            60
Totals         80         70             50                 200
The reason why there are only four degrees of freedom is that, once any four
cells of the contingency table have been filled, the other five are constrained by
the row and column totals. The number of 'free' cells can always be calculated as
the number of rows less one, times the number of columns less one, as given above.
Calculation of the test statistic

The evaluation of the test statistic then proceeds as follows (cell by cell):

χ² = (10 − 16)²/16 + (15 − 14)²/14 + (15 − 10)²/10
   + (40 − 40)²/40 + (35 − 35)²/35 + (25 − 25)²/25
   + (30 − 24)²/24 + (20 − 21)²/21 + (10 − 15)²/15
  = 2.25 + 0.07 + 2.50 + 0 + 0 + 0 + 1.50 + 0.05 + 1.67
  = 8.04

This must be compared with the critical value from the χ² distribution with four
degrees of freedom. At the 5% significance level this is 9.50 (from Table A4).
Since 8.04 < 9.50, the test statistic is smaller than the critical value, so the null
hypothesis cannot be rejected. The evidence is not strong enough to support an
association between social class and voting intention. We cannot reject the null
of the lack of any association with 95% confidence. Note, however, that the test
statistic is fairly close to the critical value, so there is some weak evidence of an
association, but not enough to satisfy conventional statistical criteria.
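The whole contingency-table test can be collected into one function (a sketch under the same conventions as before): expected values from the margins, then equation (6.5) summed over all nine cells.

```python
# Chi-squared test of association for the voting-intentions table.

def contingency_chi2(observed):
    """Expected values from the margins, then sum of (O - E)^2 / E."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    stat = 0.0
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r * c / grand
            stat += (observed[i][j] - e) ** 2 / e
    return stat

votes = [[10, 15, 15],
         [40, 35, 25],
         [30, 20, 10]]

stat = contingency_chi2(votes)
print(round(stat, 2))                  # 8.04 < 9.50, so no significant association
```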
Oops!

A leading firm of chartered accountants produced a report for the UK government
on education funding. One question it asked of schools was: is the school budget
sufficient to provide help to pupils with special needs? This produced the following
table:

                Primary schools   Secondary schools
Yes             34                45
No              63                50
No response     3                 5
Totals          100               100
n =             137               159
χ² = 3.50 (n.s.)

Their analysis produces the conclusion that there is no significant difference
between primary and secondary schools. But the χ² statistic is based on the percentage
figures! Using frequencies (which can be calculated from the sample size
figures) gives a correct χ² figure of 5.05. Fortunately for the accountants, this is
still not significant.
Cohabitation

J. Ermisch and M. Francesconi examined the rise in cohabitation in the UK and
asked whether it led on to marriage or not. One of their tables shows the relation
between employment status and the outcome of living together. Their results,
including the calculation of the χ² statistic for association between the variables,
are shown in the following figure.
There were 694 cohabiting women in the sample. Of the 531 who were
employed, 105 of them went on to marry their partner, 46 split up and 380 continued
living together. Similar figures are shown for unemployed women and
for students. The expected values for the contingency table then appear, based on
the null hypothesis of no association, followed by the calculation of the χ² test
statistic. You can see the formula for one of the elements of the calculation in the
formula bar.
The test statistic is significant at the 5% level (critical value 9.49 for four
degrees of freedom), so there is an association. The biggest contribution to the
test statistic comes from the bottom right-hand cell, where the actual value is
much higher than the expected. It appears that, unfortunately, those student
romances often do not turn out to be permanent.

Source: J. Ermisch and M. Francesconi, Cohabitation: not for long but here to stay,
J. Royal Statistical Society, Series A, 2000, 163(2), 153–171.
Exercise 6.4

Suppose that the data on educational achievement and employment status in
Chapter 1 were obtained from a sample of 1002 people, as follows:

             Higher education   A-levels   Other qualification   No qualification   Total
In work      222                153        302                   70                 747
Unemployed   6                  6          19                    8                  39
Inactive     26                 37         84                    69                 216
Total        254                196        405                   147                1002

Test whether there is an association between education and employment status,
using the 5% significance level for the test.

[Figure 6.3: The F distribution for different values of ν1 (ν2 = 25)]
The F distribution

The second distribution we encounter in this chapter is the F distribution. It has
a variety of uses in statistics; in this section we look at two of these: testing for
the equality of two variances and conducting an analysis of variance (ANOVA)
test. Both of these are variants on the hypothesis test procedures which should
by now be familiar. The F distribution will also be encountered in later chapters
on regression analysis.
The F family of distributions resembles the χ² distribution in shape: it is
always non-negative and is skewed to the right. It has two sets of degrees of
freedom (these are its parameters, labelled ν1 and ν2), and these determine its
precise shape. Typical F distributions are shown in Figure 6.3. As usual for a
hypothesis test, we define an area in one or both tails of the distribution to be
the rejection region. If a test statistic falls into the rejection region, then the
null hypothesis upon which the test statistic was based is rejected. Once again,
examples will clarify the principles.
Testing the equality of two variances

Just as one can conduct a hypothesis test on a mean, so it is possible to test the
variance. It is unusual to want to conduct a test of a specific value of a variance,
since we usually have little intuitive idea what the variance should be in
most circumstances. A more likely circumstance is a test of the equality of two
variances, across two samples. In Chapter 5, two car factories were tested for the
equality of average daily output levels. One can also test whether the variance of

slide 238:

The F distribution
221
output differs or not. A more consistent output (lower variance) from a factory
might be beneficial to the firm, e.g. dealers can be reassured that they are more
likely to be able to obtain models when they require them. In the example
in Chapter 5, one factory had a standard deviation of daily output of 25, the
second of 20, both from samples of size 30 (i.e. 30 days' output was sampled at
each factory). We can now test whether the difference between these figures is
significant or not.
Such a test is set up as follows. It is known as a variance ratio test, for reasons
which will become apparent.
The null and alternative hypotheses are

H0: σ1² = σ2²
H1: σ1² ≠ σ2²

or, equivalently,

H0: σ1²/σ2² = 1
H1: σ1²/σ2² ≠ 1    (6.6)

It is appropriate to write the hypotheses in the form shown in equation (6.6),
since the random variable and test statistic we shall use is in the form of the
ratio of sample variances, s1²/s2². This is a random variable which follows an
F distribution with ν1 = n1 − 1, ν2 = n2 − 1 degrees of freedom. We require the
assumption that the two samples are independent for the variance ratio to follow
an F distribution. Thus we write

s1²/s2² ~ F(n1 − 1, n2 − 1)    (6.7)
The F distribution thus has two parameters, the two sets of degrees of freedom,
one (ν1) associated with the numerator, the other (ν2) associated with the
denominator of the formula. In each case the degrees of freedom are given by
the relevant sample size minus one.
Note that s2²/s1² is also an F distribution (i.e. it doesn't matter which variance
goes into the numerator), but with the degrees of freedom reversed, ν1 = n2 − 1,
ν2 = n1 − 1.
The sample data are

s1 = 25    s2 = 20
n1 = 30    n2 = 30

The test statistic is simply the ratio of sample variances. In testing, it is less
confusing if the larger of the two variances is made the numerator of the test
statistic (you will see why soon). Therefore we have the following test statistic:

F = s1²/s2² = 25²/20² = 1.5625    (6.8)

This must be compared to the critical value of the F distribution with ν1 = 29,
ν2 = 29 degrees of freedom.
The rejection regions for the test are the two tails of the distribution, cutting
off 2.5% in each tail. Since we have placed the larger variance in the numerator,
only large values of F reject the null hypothesis, so we need only consult
the upper critical value of the F distribution, i.e. that value which cuts off the
top 2.5% of the distribution. This is the advantage of putting the larger variance
in the numerator of the test statistic.
Table 6.10 shows an excerpt from the F distribution. The degrees of freedom
for the test are given along the top row (ν1) and down the first column (ν2). The
numbers in the table give the critical values cutting off the top 2.5% of the
distribution. The critical value in this case is 2.09, at the intersection of the row
corresponding to ν2 = 29 and the column corresponding to ν1 = 30 (ν1 = 29 is not
given, so 30 is used instead; this gives a very close approximation to the correct
critical value). Since the test statistic does not exceed the critical value, the null
hypothesis of equal variances cannot be rejected with 95% confidence.
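The variance-ratio calculation is short enough to sketch in full; the critical value 2.09 is taken from Table 6.10 rather than computed.

```python
# Variance-ratio (F) test for the two car factories.

s1, s2 = 25, 20            # sample standard deviations of daily output
n1, n2 = 30, 30            # sample sizes (so 29 degrees of freedom each)

# Larger variance in the numerator, so only the upper tail is needed
f_stat = max(s1, s2) ** 2 / min(s1, s2) ** 2
critical = 2.09            # upper 2.5% point of F(29, 29), from Table 6.10

print(f_stat, f_stat > critical)   # 1.5625 False: cannot reject equal variances
```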
Exercise 6.5

Samples of 3-volt batteries from two manufacturers yielded the following outputs,
measured in volts:

Brand A: 3.1  3.2  2.9  3.3  2.8  3.1  3.2
Brand B: 3.0  3.0  3.2  3.4  2.7  2.8

Test whether there is any difference in the variance of output voltage of batteries
from the two companies. Why might the variance be an important consideration
for the manufacturer or for customers?
Analysis of variance

In Chapter 5 we learned how to test the hypothesis that the means of two
samples are the same, using a z or t test depending upon the sample size. This
type of hypothesis test can be generalised to more than two samples, using
a technique called analysis of variance (ANOVA), based on the F distribution.
Although it is called analysis of variance, it actually tests differences in means.
The reason for this will be explained below. Using this technique we can test the
Table 6.10 Excerpt from the F distribution: upper 2.5% points

ν2 \ ν1   1          2          3          ...   20         24         30          40
1         647.7931   799.4822   864.1509   ...   993.0809   997.2719   1001.4046   1005.5955
2         38.5062    39.0000    39.1656    ...   39.4475    39.4566    39.4648     39.4730
3         17.4434    16.0442    15.4391    ...   14.1674    14.1242    14.0806     14.0365
...       ...        ...        ...        ...   ...        ...        ...         ...
28        5.6096     4.2205     3.6264     ...   2.2324     2.1735     2.1121      2.0477
29        5.5878     4.2006     3.6072     ...   2.2131     2.1540     2.0923      2.0276
30        5.5675     4.1821     3.5893     ...   2.1952     2.1359     2.0739      2.0089
40        5.4239     4.0510     3.4633     ...   2.0677     2.0069     1.9429      1.8752

Note: The critical value lies at the intersection of the shaded row and column. Alternatively,
use Excel or another computer package to give the answer. In Excel, the formula =FINV(0.025,
29, 29) will give the answer 2.09, the upper 2.5% critical value of the F distribution with
ν1 = 29, ν2 = 29 degrees of freedom.
hypothesis that the means of all the samples are equal versus the alternative
hypothesis that at least one of them is different from the others. To illustrate
the technique we shall extend the example in Chapter 5 where two different car
factories’ outputs were compared.
The assumptions underlying the analysis of variance technique are essentially
the same as those used in the t test when comparing two different means. We
assume that the samples are randomly and independently drawn from Normally
distributed populations which have equal variances.
Suppose there are three factories whose outputs have been sampled, with
the results shown in Table 6.11. We wish to answer the question of whether this
is evidence of different outputs from the three factories or simply random
variation around a common average output level. The null and alternative
hypotheses are therefore

H₀: μ₁ = μ₂ = μ₃
H₁: at least one mean is different from the others
This is the simplest type of ANOVA known as one-way analysis of variance. In
this case there is only one factor which affects output – the factory. The factor
which may affect output is also known as the independent variable. In more
complex designs there can be two or more factors which inﬂuence output. The
output from the factories is the dependent or response variable in this case.
Figure 6.4 presents a chart of the output from the three factories, which
shows the greatest apparent difference to be between factories 2 and 3. Their ranges
scarcely overlap, which does suggest some genuine difference between them, but
Table 6.11 Samples of output from three factories
Observation Factory 1 Factory 2 Factory 3
1 415 385 408
2 430 410 415
3 395 409 418
4 399 403 440
5 408 405 425
6 418 400
7 399
Figure 6.4 Chart of factory output on sample days
as yet we cannot be sure that this is not just due to sampling variation. Factory
1 appears to be mid-way between the other two and this must also be included
in the analysis.
To decide whether or not to reject H₀, we compare the variance of output
within factories to the variance of output between the means of the factories.
Both methods provide estimates of the overall true variance of output and,
under the null hypothesis that factories make no difference, should provide
similar estimates. The ratio of the variances should be approximately unity. If
the null is false, however, the between-samples estimate will tend to be larger
than the within-samples estimate and their ratio will exceed unity. This ratio
has an F distribution and so, if it is sufficiently large that it falls into the upper
tail of the distribution, then H₀ is rejected.
To summarise: if there appears little variation between different days' outputs
but they differ substantially between factories, then we would reject H₀;
at the other extreme, if each factory's output varied substantially from one day
to another but the average levels of output were similar, it would be clear that
there is no difference between them. In this case we would conclude that the
variations were due to other random factors, not the factories. Figure 6.5
provides an illustration. Because the calculations are quite complex and invariably
done by computer nowadays, it is worth keeping in mind this illustration of the
principle of ANOVA.
To test the hypothesis formally, we break down the total variance of all the
observations into:

1. the variance due to differences between factories;
2. the variance due to differences within factories (also known as the error
   variance).

Initially we work with sums of squares rather than variances. Recall from
Chapter 1 that the sample variance is given by

s² = Σ(x − X̄)² / (n − 1)    (6.9)
Figure 6.5 Illustration of when to reject H₀
The numerator of the right-hand side of this expression, Σ(x − X̄)², gives the sum
of squares, i.e. the sum of squared deviations from the mean.
Accordingly we have to work with three sums of squares:
● The total sum of squares measures squared deviations from the overall or
grand average using all the 18 observations. It ignores the existence of the
different factors.
● The between sum of squares measures how the three individual factor means
vary around the grand average.
● The within sum of squares is based on squared deviations of observations
from their own factor mean.
It can be shown that there is a relationship between these sums of squares, i.e.

Total sum of squares = Between sum of squares + Within sum of squares    (6.10)

which is often helpful for calculation. The larger the between sum of squares
relative to the within sum of squares, the more likely it is that the null is false.
Because we have to sum over factors and over observations within those
factors, the formulae look somewhat complicated, involving double summation
signs. It is therefore important to follow the example showing how the
calculations are actually done.
The total sum of squares is given by the formula

Total sum of squares = Σ_i Σ_j (x_ij − X̄)²    (6.11)

where x_ij is the output from factory i on day j and X̄ is the grand average. The
index i runs from 1 to 3 in this case (there are three classes or groups for this
factor) and the index j, indexing the observations, goes from 1 to 6, 7 or 5 for
factories 1, 2 and 3 respectively.

Although this looks complex, it simply means that you calculate the sum of
squared deviations from the overall mean. The overall mean of the 18 values is
410.11 and the total sum of squares may be calculated as

Total sum of squares = (415 − 410.11)² + (430 − 410.11)² + . . . + (440 − 410.11)²
                       + (425 − 410.11)² = 2977.778
An alternative formula for the total sum of squares is

Total sum of squares = Σ_i Σ_j x_ij² − nX̄²    (6.12)

where n is the total number of observations. The sum of the squares of all the
observations, Σ_i Σ_j x_ij², is 415² + 430² + . . . + 425² = 3 030 418, and the total sum of
squares is then given by

Σ_i Σ_j x_ij² − nX̄² = 3 030 418 − 18 × 410.11² = 2977.778    (6.13)

as before.
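The equivalence of formulae (6.11) and (6.12) can be verified numerically. The sketch below uses Python rather than the Excel approach of the text (the variable names are our own) and computes the total sum of squares both ways from the Table 6.11 data:

```python
# Factory output data from Table 6.11
factory1 = [415, 430, 395, 399, 408, 418]
factory2 = [385, 410, 409, 403, 405, 400, 399]
factory3 = [408, 415, 418, 440, 425]

x = factory1 + factory2 + factory3           # pool all 18 observations
n = len(x)
grand_mean = sum(x) / n                      # 410.11 to 2 d.p.

# Formula (6.11): sum of squared deviations from the grand mean
tss_deviations = sum((xi - grand_mean) ** 2 for xi in x)

# Formula (6.12): sum of squares minus n times the squared mean
tss_shortcut = sum(xi ** 2 for xi in x) - n * grand_mean ** 2

print(round(grand_mean, 2))      # 410.11
print(round(tss_deviations, 3))  # 2977.778
print(round(tss_shortcut, 3))    # 2977.778
```

Both routes give 2977.778, matching the hand calculation in the text.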
The between sum of squares is calculated using the formula

Between sum of squares = Σ_i Σ_j (X̄_i − X̄)²    (6.14)
where X̄_i denotes the mean output of factor i. This part of the calculation
effectively ignores the differences that exist within factors and compares the
differences between them. It does this by replacing the observations within each
factor by the mean for that factor. Hence all the factor 1 observations are
replaced by 410.83, for factor 2 they are replaced by the mean 401.57, and for
factor 3 by 421.2. We then calculate the sum of squared deviations of these
values from the grand mean. Hence we obtain

Between sum of squares = 6 × (410.83 − 410.11)² + 7 × (401.57 − 410.11)²
                          + 5 × (421.2 − 410.11)² = 1128.43
Note that we take account of the number of observations within each factor in
this calculation.
Once again there is an alternative formula, which may be simpler for
calculation purposes:

Between sum of squares = Σ_i n_i X̄_i² − nX̄²    (6.15)

Evaluating this results in the same answer as above:

Σ_i n_i X̄_i² − nX̄² = 6 × 410.83² + 7 × 401.57² + 5 × 421.2² − 18 × 410.11²
                    = 1128.43    (6.16)
We have arrived at the result that 37% (= 1128.43/2977.78) of the total variation
(sum of squared deviations) is due to differences between factories, and the
remaining 63% is therefore due to variation (day to day) within factories. We
can therefore immediately calculate the within sum of squares as

Within sum of squares = 2977.778 − 1128.430 = 1849.348
For completeness, the formula for the within sum of squares is

Within sum of squares = Σ_i Σ_j (x_ij − X̄_i)²    (6.17)

The term x_ij − X̄_i measures the deviations of the observations from the factor
mean and so the within sum of squares gives a measure of dispersion within the
classes. Hence it can be calculated as:

Within sum of squares = (415 − 410.83)² + . . . + (418 − 410.83)²
                        + (385 − 401.57)² + . . . + (399 − 401.57)²
                        + (408 − 421.2)² + . . . + (425 − 421.2)²
                      = 1849.348
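The same data allow a quick numerical check of identity (6.10). The following Python sketch (our own illustration, not part of the text) computes the between and within sums of squares directly from their definitions:

```python
groups = [
    [415, 430, 395, 399, 408, 418],       # factory 1
    [385, 410, 409, 403, 405, 400, 399],  # factory 2
    [408, 415, 418, 440, 425],            # factory 3
]

all_obs = [x for g in groups for x in g]
n = len(all_obs)
grand_mean = sum(all_obs) / n

# Between sum of squares: each observation replaced by its factor mean
bss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within sum of squares: deviations from each factor's own mean
wss = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

tss = sum((x - grand_mean) ** 2 for x in all_obs)
print(round(bss, 2), round(wss, 2), round(tss, 2))  # 1128.43 1849.35 2977.78
```

The printed values confirm that the between and within sums of squares add up to the total sum of squares.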
The result of the hypothesis test
The F statistic is based upon a comparison of the between and within sums of squares
(BSS and WSS), but we must also take account of the degrees of freedom for the
test. The degrees of freedom adjust for the number of observations and for the
number of factors. Formally, the test statistic is

F = (BSS/(k − 1)) / (WSS/(n − k))
which has k − 1 and n − k degrees of freedom; k is the number of factors (3 in
this case) and n the overall number of observations (18). We thus have

F = (1128.43/(3 − 1)) / (1849.348/(18 − 3)) = 4.576

The critical value of F for 2 and 15 degrees of freedom at the 5% significance
level is 3.682. As the test statistic exceeds the critical value we reject the null
hypothesis of no difference between factories.
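The whole calculation can be collected into a small function. This is a hedged sketch in Python (the text itself uses Excel for these calculations; the function name is our own); it reproduces the test statistic F = 4.576 for the factory data:

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_obs = [x for g in groups for x in g]
    n, k = len(all_obs), len(groups)
    grand_mean = sum(all_obs) / n
    # Between and within sums of squares, as in formulae (6.14) and (6.17)
    bss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    wss = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (bss / (k - 1)) / (wss / (n - k))

factories = [
    [415, 430, 395, 399, 408, 418],
    [385, 410, 409, 403, 405, 400, 399],
    [408, 415, 418, 440, 425],
]
f_stat = one_way_anova_f(factories)
print(round(f_stat, 3))  # 4.576, which exceeds the 5% critical value of 3.682
```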
The analysis of variance table
ANOVA calculations are conventionally summarised in an analysis of variance
table. Figure 6.6 shows such a table as produced by Excel. Excel can produce the
table automatically from data presented in the form shown in Table 6.11 and
there is no need to do any of the calculations by hand. In Excel you need to
install the Analysis ToolPak in order to perform ANOVA. Other software
packages, such as SPSS, also have routines to perform ANOVA.
The ﬁrst part of the table summarises the information for each factory in the
form of means and variances. Note that the means were used in the calculation
of the between sum of squares. The ANOVA section of the output then follows
giving sums of squares and other information.
The column of the ANOVA table headed 'SS' gives the sums of squares which
we calculated above. It can be seen that the between-group sum of squares
makes up about 37% of the total, suggesting that the differences between
factories (referred to as 'groups' by Excel) do make a substantial contribution to
the total variation in output.
The 'df' column gives the degrees of freedom associated with each sum of
squares. These degrees of freedom are given by:
Figure 6.6 One-way analysis of variance: Excel output

Note: Excel, like many other statistical packages, performs all the ANOVA calculations
automatically, based on the data in the spreadsheet. There is no need to evaluate any
formulae, so you can concentrate on the interpretation of the results.
Between sum of squares: k − 1
Within sum of squares: n − k
Total sum of squares: n − 1
The 'MS' ('mean square') column divides the sums of squares by their degrees
of freedom, and the F column gives the F statistic, which is the ratio of the two
values in the MS column, i.e. 4.576 = 564.215/123.290. This is the test statistic
for the hypothesis test which we calculated above. Excel helpfully gives the
critical value of the test at the 5% significance level in the final column, 3.682.
The Prob-value (labelled 'P value') is given in the penultimate column and
reveals that only 2.8% of the F distribution lies beyond the test statistic value
of 4.576.

The test has found that the between sum of squares is 'large' relative to the
within sum of squares, too large to be due simply to random variation, and this
is why the null hypothesis of equal outputs is rejected. The rejection region for
the test consists of the upper tail only of the F distribution; small values of
the test statistic would indicate small differences between factories and hence
non-rejection of H₀.
This simple example involves only three groups but the extension to four
or more follows the same principles with different values of k in the formulae
and is fairly straightforward. Also we have covered only the simplest type
of ANOVA with a one-way classiﬁcation. More complex experimental designs
are possible with a two-way classiﬁcation for example where there are two
independent factors affecting the dependent variable. This is not covered in this
book although Chapter 8 on the subject of multiple regression does examine
a method of modelling situations where two or more explanatory variables
inﬂuence a dependent variable.
Worked example 6.3
ANOVA calculations are quite complex and are most easily handled by software,
which calculates all the results directly from the initial data. However, this
is a kind of 'black box' approach to learning, so this example shows all the
calculations mechanically.
Suppose we have six observations on each of three factors as follows:
A     B     C
44    41    48
35    36    37
60    58    61
28    32    37
43    40    44
55    59    61
These might be, for example, scores of different groups of pupils in a test.
We wish to examine whether there is a signiﬁcant difference between the
different groups. We need to see how the differences between the groups
compare to those within groups.
First we calculate the total sum of squares by ignoring the groupings and
treating all 18 observations together. The overall mean is 45.5, so the squared
deviations are (44 − 45.5)², (41 − 45.5)², etc. Summing these gives 2020.5 as
the TSS.

For the between sum of squares we first calculate the means of each factor.
These are 44.17, 44.33 and 48. We compare these to the grand average. The
squared deviations are therefore (44.17 − 45.5)², (44.33 − 45.5)² and (48 − 45.5)².
Rather than sum these, we must take account of the number of observations
in each group, which in this case is 6. Hence we obtain

Between sum of squares = 6 × (44.17 − 45.5)² + 6 × (44.33 − 45.5)²
                          + 6 × (48 − 45.5)² = 56.33
The within sum of squares can be explicitly calculated as follows. For group
A the squared deviations from the group mean are (44 − 44.17)², (35 − 44.17)²,
etc. Summing these for group A gives 714.8. Similar calculations give 653.3
and 596 for groups B and C. These sum to 1964.2, which is the within sum
of squares. As a check we note

2020.5 = 56.3 + 1964.2

The degrees of freedom are k − 1 = 3 − 1 = 2 for the between sum of squares,
n − k = 18 − 3 = 15 for the within sum of squares, and n − 1 = 18 − 1 = 17 in
total. The test statistic is therefore

F = (56.33/2) / (1964.2/15) = 0.22

The critical value at the 5% significance level is 3.68, so we cannot reject the
null of no difference between the factors.
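A useful check on the arithmetic is that each group's contribution to the within sum of squares equals (n − 1) times that group's sample variance, so a statistics library can do most of the work. A Python sketch (our own code, not the book's):

```python
from statistics import mean, variance

groups = {
    "A": [44, 35, 60, 28, 43, 55],
    "B": [41, 36, 58, 32, 40, 59],
    "C": [48, 37, 61, 37, 44, 61],
}

all_obs = [x for g in groups.values() for x in g]
n, k = len(all_obs), len(groups)
grand_mean = mean(all_obs)  # 45.5

# Between SS from the group means; within SS as (n_i - 1) * s_i^2 per group
bss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
wss = sum((len(g) - 1) * variance(g) for g in groups.values())

f_stat = (bss / (k - 1)) / (wss / (n - k))
print(round(bss, 1), round(wss, 1), round(f_stat, 2))  # 56.3 1964.2 0.22
```

The F statistic is well below 1, consistent with not rejecting the null.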
Exercise 6.6

The reaction times of three groups of sportsmen were measured on a particular
task, with the following results (times in milliseconds):
Racing drivers 31 28 39 42 36 30
Tennis players 41 35 41 48 44 39 38
Boxers 44 47 35 38 51
Test whether there is a difference in reaction times between the three groups.
Summary
● The χ² and F distributions play important roles in statistics, particularly in
problems relating to the goodness of fit of the data to that predicted by a null
hypothesis.
● A random variable based on the sample variance, (n − 1)s²/σ², has a χ² distribution
with n − 1 degrees of freedom. Based on this fact, the χ² distribution may
be used to construct confidence interval estimates for the variance σ². Since
the χ² is not a symmetric distribution, the confidence interval is not symmetric
around the unbiased point estimate s².
● The χ² distribution may also be used to compare actual and expected values
of a variable, and hence to test the hypothesis upon which the expected
values were constructed.

● A two-way classification of observations is known as a contingency table. The
independence or otherwise of the two variables may be tested using the χ²
distribution, by comparing observed values with those expected under the
null hypothesis of independence.
● The F distribution is used to test a hypothesis of the equality of two variances.
The test statistic is the ratio of two sample variances, which under the null
hypothesis has an F distribution with n₁ − 1, n₂ − 1 degrees of freedom.
● The F distribution may also be used in an analysis of variance which tests for
the equality of means across several samples. The results are set out in an
analysis of variance table which compares the variation of the observations
within each sample to the variation between samples.
Key terms and concepts

actual and expected values
analysis of variance
ANOVA table
between sum of squares
classes or groups
contingency table
dependent or response variable
grand average
independent variable
total sum of squares
variance ratio test
within sum of squares
Problems
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
6.1 A sample of 40 observations has a standard deviation of 20. Estimate the 95% confidence
interval for the standard deviation of the population.
6.2 Using the data n = 70, s = 15, construct a 99% confidence interval for the true standard
deviation.
6.3 Use the data in Table 6.3 to see if there is a signiﬁcant difference between road casualties
in quarters I and III on the one hand and quarters II and IV on the other.
6.4 A survey of 64 families with ﬁve children found the following gender distribution:
Number of boys 0 1 2 3 4 5
Number of families 1 8 28 19 4 4
Test whether the distribution can be adequately modelled by the Binomial distribution.
6.5 Four different holiday ﬁrms which all carried equal numbers of holidaymakers reported
the following numbers who expressed satisfaction with their holiday:
Firm A B C D
Number satisﬁed 576 558 580 546
Is there any significant difference between the firms? If told that the four firms carried
600 holidaymakers each, would you modify your conclusion? What do you conclude about
your first answer?
6.6 A company wishes to see whether there are any differences between its departments in
staff turnover. Looking at their records for the past year the company ﬁnds the following
data:
Department Personnel Marketing Admin. Accounts
Number in post at start of year 23 16 108 57
Number leaving 3 4 20 13
Do the data provide evidence of a difference in staff turnover between the various
departments?
6.7 A survey of 100 ﬁrms found the following evidence regarding proﬁtability and market
share:
Profitability              Market share
              <15%      15–30%      >30%
Low            18          7          8
Medium         13         11          8
High            8         12         15

Is there evidence that market share and profitability are associated?
6.8 The following data show the percentages of ﬁrms using computers in different aspects of
their business:
Firm size Computers used in Total numbers of ﬁrms
Admin. Design Manufacture
Small 60 24 20 450
Medium 65 30 28 140
Large 90 44 50 45
Is there an association between the size of firm and its use of computers?
6.9 (a) Do the accountants' job properly for them (see the Oops! box in the text, page 218).
    (b) It might be justifiable to omit the 'no responses' entirely from the calculation. What
        happens if you do this?
6.10 A roadside survey of the roadworthiness of vehicles obtained the following results:
Roadworthy Not roadworthy
Private cars 114 30
Company cars 84 24
Vans 36 12
Lorries 44 20
Buses 36 12
Is there any association between the type of vehicle and the likelihood of it being unfit for
the road?
6.11 Given the following data on two sample variances, test whether there is any significant
difference. Use the 1% significance level.

s₁² = 55    s₂² = 48
n₁ = 25     n₂ = 30
6.12 An example in Chapter 4 compared R&D expenditure in Britain and Germany. The sample
data were

x̄₁ = 3.7    x̄₂ = 4.2
s₁ = 0.6    s₂ = 0.9
n₁ = 20     n₂ = 15

Is there evidence, at the 5% significance level, of a difference in the variances of R&D
expenditure between the two countries? What are the implications, if any, for the test
carried out on the difference of the two means in Chapter 4?
6.13 Groups of children from four different classes in a school were randomly selected and sat
a test, with the following test scores:

Class    Pupil
          1     2     3     4     5     6     7
A        42    63    73    55    66    48    59
B        39    47    47    61    44    50    52
C        71    65    33    49    61
D        49    51    62    48    63    54

(a) Test whether there is any difference between the classes, using the 95% confidence
    level for the test.
(b) How would you interpret a 'significant' result from such a test?
6.14 Lottery tickets are sold in different outlets: supermarkets, smaller shops and outdoor
kiosks. Sales were sampled from several of each of these, with the following results:
Supermarkets 355 251 408 302
Small shops 288 257 225 299
Kiosks 155 352 240
Does the evidence indicate a significant difference in sales? Use the 5% significance level.
6.15 (Project) Conduct a survey among fellow students to examine whether there is any
association between:

(a) gender and political preference;
(b) subject studied and political preference;
(c) star sign and personality (introvert/extrovert, self-assessed); I am told that
    Aries, Cancer, Capricorn, Gemini, Leo and Scorpio are associated with an extrovert
    personality; or
(d) any other two categories of interest.
6.16 (Computer project) Use your spreadsheet or other computer program to generate
100 random integers in the range 0 to 9. Draw up a frequency table and use a χ² test to
examine whether there is any bias towards any particular integer. Compare your results
with those of others in your class.
Answers to exercises
Exercise 6.1
The confidence interval for σ² yields the interval (a) [41.2, 117.4]; (b) [6.4, 10.8];
(c) [36.0, 143.7].
Exercise 6.2
The calculation of the test statistic is
Outcome Observed Expected O − E O − E
2
A 40 35 5 25 0.714
B 60 55 5 25 0.455
C75 75 0 0 0
D 90 100 −10 100 1
Total 2.169
This is smaller than the critical value of 7.81 so the null is not rejected.
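The statistic in this table is simply Σ(O − E)²/E, which can be confirmed in a couple of lines (a Python sketch, our own illustration):

```python
observed = [40, 60, 75, 90]
expected = [35, 55, 75, 100]

# Goodness-of-fit statistic: sum of (O - E)^2 / E over the four outcomes
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 3))  # 2.169, below the critical value of 7.81
```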
Exercise 6.3
The test statistics are: for (16, 55, 29), χ² = 4.38 (Prob-value = 0.112) and z = 1.84
(Prob-value = 0.066); and for (14, 55, 31), χ² = 6.78 (Prob-value = 0.038) and z = 2.40
(Prob-value = 0.016). The two methods agree on the results, although the Prob-values
are quite different.
Exercise 6.4
The expected values are:
Higher education A levels Other qualiﬁcations No qualiﬁcations Total
In work 189 146 302 110 747
Unemployed 10 8 16 6 39
Inactive 55 42 87 32 216
Totals 254 196 405 147 1002
These are calculated by multiplying row and column totals and dividing by the
grand total, e.g. 189 = 747 × 254/1002.

The test statistic is

5.6 + 0.3 + 0.0 + 14.3 + 1.5 + 0.3 + 0.7 + 0.9 + 15.1 + 0.7 + 0.1 + 43.9 = 83.5

This should be compared to a critical value of 12.59 (ν = (3 − 1) × (4 − 1) = 6), so the
null is rejected.
Exercise 6.5
The two variances are s²A = 0.031 and s²B = 0.066. We therefore form the ratio
F = 0.066/0.031 = 2.09, which has an F distribution with 5 and 6 degrees of freedom
(n − 1 for each sample, with the larger variance in the numerator). The 5% critical
value is therefore 4.39 and the null is not rejected. There appears to
be no differences between manufacturers. The variance is important because con-
sumers want a reliable product – they would not be happy if their MP3 player worked
with one battery but not another.
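The variances and the F ratio in this answer can be checked with a short calculation. A Python sketch (our own, not the book's; the larger variance is placed in the numerator so the test uses the upper tail):

```python
from statistics import variance

brand_a = [3.1, 3.2, 2.9, 3.3, 2.8, 3.1, 3.2]
brand_b = [3.0, 3.0, 3.2, 3.4, 2.7, 2.8]

var_a = variance(brand_a)  # sample variance, divisor n - 1
var_b = variance(brand_b)

# Put the larger variance in the numerator for an upper-tail test
f_ratio = max(var_a, var_b) / min(var_a, var_b)
print(round(var_a, 3), round(var_b, 3), round(f_ratio, 2))  # 0.031 0.066 2.09
```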
Exercise 6.6
The answer is summarised in this Excel table:
SUMMARY
Groups Count Sum Average Variance
Racing drivers 6 206 34.333 30.667
Tennis players 7 286 40.857 17.810
Boxers 5 215 43 42.5
ANOVA
Source of variation SS df MS F P-value F crit
Between groups 233.421 2 116.710 4.069 0.039 3.682
Within groups 430.190 15 28.679
Totals 663.611 17
The result shows that there is a difference between the three groups, with an F statistic
of 4.069 (P-value = 0.039, i.e. 3.9%). The difference appears to be largely between racing
drivers and the other two types.
Appendix: Use of χ² and F distribution tables

Tables of the χ² distribution
Table A4 (see page 416) presents critical values of the χ² distribution for a
selection of significance levels and for different degrees of freedom. As an example,
to find the critical value of the χ² distribution at the 5% significance level for
ν = 20 degrees of freedom, the cell entry in the column labelled '0.05' and the
row labelled '20' is consulted. The critical value is 31.4. A test statistic greater
than this value implies rejection of the null hypothesis at the 5% significance
level.
Tables of the F distribution
Table A5 (see page 418) presents critical values of the F distribution. Since there
are two sets of degrees of freedom to be taken into account, a separate table
is required for each significance level. Four sets of tables are provided, giving
critical values cutting off the top 5%, 2.5%, 1% and 0.5% of the distribution
(Tables A5(a), A5(b), A5(c) and A5(d) respectively). These allow both one- and
two-tail tests at the 5% and 1% significance levels to be conducted. Their use is
illustrated by example.
Two-tail test
To find the critical values of the F distribution at the 5% significance level
for degrees of freedom ν₁ (numerator) = 10, ν₂ = 20: the critical values in this
case cut off the extreme 2.5% of the distribution in each tail and are found in
Table A5(b):

● Right-hand critical value: this is found from the cell of the table corresponding
  to the column ν₁ = 10 and row ν₂ = 20. Its value is 2.77.
● Left-hand critical value: this cannot be obtained directly from the tables,
  which only give right-hand values. However, it is obtained indirectly as follows:
  (a) Find the right-hand critical value for ν₁ = 20, ν₂ = 10 (note the reversal of the
      degrees of freedom). This gives 3.42.
  (b) Take the reciprocal to obtain the desired left-hand critical value. This
      gives 1/3.42 = 0.29.

The rejection region thus consists of values of the test statistic less than 0.29 and
greater than 2.77.
One-tail test
To find the critical value at the 5% significance level for ν₁ = 15, ν₂ = 25: as long
as the test statistic has been calculated with the larger variance in the numerator,
the critical value is in the right-hand tail of the distribution and can be obtained
directly from Table A5(a). For ν₁ = 15, ν₂ = 25 the value is 2.09. The null
hypothesis is rejected, therefore, if the test statistic is greater than 2.09.
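The left-tail trick in step (b) of the two-tail procedure works because if F follows an F(ν₁, ν₂) distribution, then 1/F follows F(ν₂, ν₁). A two-line check using the table values quoted above (Python, our own illustration):

```python
# Right-hand 2.5% critical value for (nu1, nu2) = (20, 10), read from Table A5(b)
right_tail_reversed = 3.42

# Its reciprocal is the left-hand 2.5% critical value for (nu1, nu2) = (10, 20)
left_tail = 1 / right_tail_reversed
print(round(left_tail, 2))  # 0.29
```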
Chapter 7: Correlation and regression

Contents
Learning outcomes 237
Introduction 238
What determines the birth rate in developing countries? 238
Correlation 240
Correlation and causality 245
The coefﬁcient of rank correlation 246
A simpler formula 250
Regression analysis 251
Calculation of the regression line 252
Interpretation of the slope and intercept 254
Measuring the goodness of ﬁt of the regression line 255
Inference in the regression model 257
Analysis of the errors 258
Conﬁdence interval estimates of α and β 259
Testing hypotheses about the coefﬁcients 260
Testing the significance of R²: the F test 261
Interpreting computer output 262
Prediction 264
Units of measurement 267
How to avoid measurement problems: calculating the elasticity 268
Non-linear transformations 268
Summary 271
Key terms and concepts 272
References 272
Problems 273
Answers to exercises 276
Learning outcomes

By the end of this chapter you should be able to:
● understand the principles underlying correlation and regression
● calculate and interpret a correlation coefﬁcient and relate it to an XY graph of
the two variables
● calculate the line of best fit (regression line) and interpret the result
● recognise the statistical signiﬁcance of the results using conﬁdence intervals
and hypothesis tests
● recognise the importance of the units in which the variables are measured and
of transformations to the data
● use computer software (Excel) to derive the regression line and interpret the
  computer output.
Complete your diagnostic test for Chapter 7 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL with
additional supporting resources.
Introduction
Correlation and regression are techniques for investigating the statistical
relationship between two or more variables. In Chapter 1 we examined the
relationship between investment and gross domestic product (GDP) using
graphical methods (the XY chart). Although visually helpful, this did not
provide any precise measurement of the strength of the relationship. In Chapter 6
the χ² test did provide a test of the significance of the association between two
category-based variables, but this test cannot be applied to variables measured
on a ratio scale. Correlation and regression fill in these gaps: the strength of the
relationship between two or more ratio scale variables can be measured, and
its significance tested.
Correlation and regression are the techniques most often used by economists
and forecasters. They can be used to answer such questions as:

● Is there a link between the money supply and the price level?
● Do bigger firms produce at lower cost than smaller firms?
● Does instability in a country's export performance hinder its growth?
Each of these questions is about economics or business as much as about
statistics. The statistical analysis is part of a wider investigation into the problem;
it cannot provide a complete answer, but used sensibly it is a vital
input. Correlation and regression techniques may be applied to time-series or
cross-section data. The methods of analysis are similar in each case although
there are differences of approach and interpretation which are highlighted in
this chapter and the next.
This chapter begins with the topic of correlation and simple (i.e. two-variable)
regression, using as an example the determinants of the birth rate in developing
countries. In Chapter 8 multiple regression is examined, where a single dependent
variable is explained by more than one explanatory variable. This is illustrated
using time-series data pertaining to imports into the UK. This shows how a small
research project can be undertaken avoiding the many possible pitfalls along
the way. Finally a variety of useful additional techniques tips and traps is set
out to help you understand and overcome a number of problems that can arise
in regression analysis.
What determines the birth rate in developing countries?
This example follows the analysis in Michael Todaro's book Economic Development
in the Third World (3rd edn, pp. 197–200), where he tries to establish which
of three variables, gross national product (GNP) per capita, the growth rate per
capita or income inequality, is most important in determining a country's birth
rate. (This analysis has been dropped from later editions of Todaro's book.) The
analysis is instructive as an example of correlation and regression techniques
in a number of ways. First, the question is an important one; it was discussed at
the UN International Conference on Population and Development in Cairo in
1995. It is felt by many that reducing the birth rate is a vital factor in economic
development: birth rates in developed countries average around 12 per 1000
population, in developing countries around 30. Second, Todaro uses the
statistical analysis to arrive at an unjustified conclusion (it is always best to learn
from others' mistakes).
The data used by Todaro are shown in Table 7.1, using a sample of 12 developing
countries. Two points need to be made initially. First, the sample only
explanation of the birth rate. Different factors might be relevant to developed
countries for example. Second there is the important question of why these
particular countries were chosen as the sample and others ignored. The choice
of country was in fact limited by data availability and one should ask whether
countries with data available are likely to be representative of all countries. Data
were in fact available for more than 12 countries so Todaro was selective. You
are asked to explore the implications of this in some of the problems at the end
of the chapter.
The variables are defined as follows:
Birth rate: the number of births per 1000 population in 1981.
GNP per capita: 1981 gross national product p.c., in US dollars.
Growth rate: the growth rate of GNP p.c. per annum, 1961–1981.
Income ratio: the ratio of the income share of the richest 20% to that of the
poorest 40%. A higher value of this ratio indicates greater inequality.
We leave aside the concerns about the sample until later and concentrate
now on analysing the figures. The first thing it is useful to do is to graph the
variables to see if anything useful is revealed. XY graphs are the most suitable in
this case and they are shown in Figure 7.1. From these we see a reasonably tidy
relationship between the birth rate and the growth rate, with a negative slope;
there is a looser relationship with the income ratio, with a positive slope; and
there is little discernible pattern (apart from a flat line) in the graph of birth rate
against GNP. Todaro asserts that the best relationship is between the birth rate
and income inequality. He rejects the growth rate as an important determinant
of the birth rate because of the four countries at the top of the chart, which have
Table 7.1 Todaro's data on birth rate, GNP, growth and inequality

Country        Birth rate   1981 GNP p.c.   GNP growth   Income ratio
Brazil 30 2200 5.1 9.5
Colombia 29 1380 3.2 6.8
Costa Rica 30 1430 3.0 4.6
India 35 260 1.4 3.1
Mexico 36 2250 3.8 5.0
Peru 36 1170 1.0 8.7
Philippines 34 790 2.8 3.8
Senegal 48 430 −0.3 6.4
South Korea 24 1700 6.9 2.7
Sri Lanka 27 300 2.5 2.3
Taiwan 21 1170 6.2 3.8
Thailand 30 770 4.6 3.3
Source: Adapted from Todaro, M. (1992).
Chapter 7 • Correlation and regression
Figure 7.1 Graphs of the birth rate against (a) GNP, (b) growth and (c) income ratio
very different growth rates yet similar birth rates. In the following sections we
shall see whether Todaro’s conclusions are justiﬁed.
Correlation
The relationships graphed in Figure 7.1 can first be summarised numerically by
measuring the correlation coefficient between any pair of variables. We illustrate
this by calculating the correlation coefficient between the birth rate (B) and
growth (G), although we also present the results for the other cases. Just as the
mean is a number that summarises information about a single variable, so the
correlation coefficient is a number which summarises the relationship between
two variables.
The different types of possible relationship between any two variables X and
Y may be summarised as follows:
● High values of X tend to be associated with low values of Y and vice versa.
This is termed negative correlation and appears to be the case for B and G.
● High (low) values of X tend to be associated with high (low) values of Y. This
is positive correlation and reflects (rather weakly) the relationship between B
and the income ratio (IR).
● No relationship between X and Y exists. High (low) values of X are associated
about equally with high and low values of Y. This is zero, or the absence of,
correlation. There appears to be little correlation between the birth rate and
per capita GNP.
It should be noted that positive correlation does not mean that high values
of X are always associated with high values of Y, but that they usually are. It is
also the case that correlation only represents a linear relationship between the
two variables. As a counter-example, consider the backward-bending labour
supply curve suggested by economic theory: higher wages initially encourage
extra work effort, but above a certain point the benefit of higher wage rates is
taken in the form of more leisure. The relationship is non-linear and the
measured degree of correlation between wages and hours of work is likely to be
low, even though the former obviously influences the latter.
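The point about non-linearity can be illustrated numerically. The following Python sketch (not part of the original text; the data are invented for illustration, an inverted-U hours-of-work schedule peaking at a wage of 6) shows a variable that is entirely determined by another yet has zero measured correlation:

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient in its computational form."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = math.sqrt((n * sum(v * v for v in x) - sx ** 2) *
                    (n * sum(v * v for v in y) - sy ** 2))
    return num / den

# Hours of work first rise, then fall, as the wage increases (inverted U)
wage = list(range(1, 12))               # wages 1, 2, ..., 11
hours = [12 * w - w * w for w in wage]  # peaks at w = 6, then declines

print(pearson_r(wage, hours))  # 0.0: no *linear* association at all
```

Here hours depend exactly on the wage, but because the relationship is symmetric about its peak the linear correlation is precisely zero.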
The sample correlation coefficient, r, is a numerical statistic which distin-
guishes between the types of cases shown in Figure 7.1. It has the following
properties:
● It always lies between −1 and +1. This makes it relatively easy to judge the
strength of an association.
● A positive value of r indicates positive correlation, a higher value indicating
a stronger correlation between X and Y (i.e. the observations lie closer to a
straight line). r = 1 indicates perfect positive correlation and means that all the
observations lie precisely on a straight line with positive slope, as Figure 7.2
illustrates.
● A negative value of r indicates negative correlation. Similar to the above, a
larger negative value indicates stronger negative correlation and r = −1
signifies perfect negative correlation.
Figure 7.2 Perfect positive correlation
● A value of r = 0 (or close to it) indicates a lack of correlation between X and Y.
● The relationship is symmetric, i.e. the correlation between X and Y is the
same as between Y and X. It does not matter which variable is labelled Y and
which is labelled X.
The formula¹ for calculating the correlation coefficient is given in equation (7.1):

r = (n∑XY − ∑X∑Y) / √[(n∑X² − (∑X)²)(n∑Y² − (∑Y)²)]   (7.1)

The calculation of r for the relationship between the birth rate (Y) and growth
(X) is shown in Table 7.2 and equation (7.2). From the totals in Table 7.2 we
calculate

r = (12 × 1139.7 − 40.2 × 380) / √[(12 × 184.04 − 40.2²)(12 × 12 564 − 380²)] = −0.824   (7.2)

This result indicates a fairly strong negative correlation between the birth rate
and growth. Countries which have higher economic growth rates also tend to
have lower birth rates. The result of calculating the correlation coefficient for
the case of the birth rate and the income ratio is r = 0.35, which is positive,
as expected. Greater inequality (a higher IR) is associated with a higher birth rate,
though the degree of correlation is not particularly strong and less than the
correlation with the growth rate. Between the birth rate and GNP per capita the
value of r is only −0.26, indicating only a modest degree of correlation. All of this
begins to cast doubt upon Todaro's interpretation of the data.
Table 7.2 Calculation of the correlation coefficient, r

Country        Birth rate (Y)   GNP growth (X)   Y²       X²       XY
Brazil         30               5.1              900      26.01    153.0
Colombia       29               3.2              841      10.24    92.8
Costa Rica     30               3.0              900      9.00     90.0
India          35               1.4              1225     1.96     49.0
Mexico         36               3.8              1296     14.44    136.8
Peru           36               1.0              1296     1.00     36.0
Philippines    34               2.8              1156     7.84     95.2
Senegal        48               −0.3             2304     0.09     −14.4
South Korea    24               6.9              576      47.61    165.6
Sri Lanka      27               2.5              729      6.25     67.5
Taiwan         21               6.2              441      38.44    130.2
Thailand       30               4.6              900      21.16    138.0
Totals         380              40.2             12 564   184.04   1139.7

Note: In addition to the X and Y variables in the first two columns, three other columns are
needed, for the Y², X² and XY values.

¹ The formula for r can be written in a variety of different ways. The one given here is the
most convenient for calculation.
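The totals in Table 7.2 and the result in equation (7.2) can be checked with a few lines of code. This Python sketch (not part of the original text) applies the computational formula (7.1) directly to the data:

```python
import math

# Todaro's data from Table 7.1: birth rate (Y) and GNP growth (X)
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
X = [5.1, 3.2, 3.0, 1.4, 3.8, 1.0, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]

def pearson_r(x, y):
    """Computational form of equation (7.1)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = math.sqrt((n * sum(v * v for v in x) - sx ** 2) *
                    (n * sum(v * v for v in y) - sy ** 2))
    return num / den

print(round(pearson_r(X, Y), 3))  # -0.824, as in equation (7.2)
```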
Exercise 7.1
(a) Perform the required calculations to confirm that the correlation between the
birth rate and the income ratio is 0.35.
(b) In Excel, use the CORREL function to confirm your calculations in the previous
two exercises. For example, the function =CORREL(A1:A12, B1:B12) would cal-
culate the correlation between a variable X in cells A1:A12 and Y in cells B1:B12.
(c) Calculate the correlation coefficient between the birth rate and the growth rate
again, but expressing the birth rate per 100 population and the growth rate as a
decimal. In other words, divide Y by 10 and X by 100. Your calculation should
confirm that changing the units of measurement leaves the correlation
coefficient unchanged.
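The invariance claimed in part (c) is easy to verify numerically. The following Python sketch (an illustration, not a substitute for doing the exercise by hand) rescales the data and recomputes r:

```python
import math

def pearson_r(x, y):
    """Computational form of equation (7.1)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = math.sqrt((n * sum(v * v for v in x) - sx ** 2) *
                    (n * sum(v * v for v in y) - sy ** 2))
    return num / den

Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
X = [5.1, 3.2, 3.0, 1.4, 3.8, 1.0, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]

r1 = pearson_r(X, Y)
r2 = pearson_r([x / 100 for x in X], [y / 10 for y in Y])  # rescaled units
print(round(r1, 6) == round(r2, 6))  # True: r is unaffected by the units
```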
Are the results significant?
These results come from a small sample, one of many that could have been
collected. Once again we can ask the question: what can we infer about the
population of all developing countries from the sample? Assuming the sample
was drawn at random (which may not be justified), we can use the principles of
hypothesis testing introduced in Chapter 5. As usual, there are two possibilities:
1 The truth is that there is no correlation in the population and that our
sample exhibits such a large absolute value by chance.
2 There really is a correlation between the birth rate and the growth rate and
the sample correctly reflects this.
Denoting the true (but unknown) population correlation coefficient by ρ (the
Greek letter 'rho'), the possibilities can be expressed in terms of a hypothesis test:

H₀: ρ = 0
H₁: ρ ≠ 0
The test statistic in this case is not r itself but a transformation of it:

t = r√(n − 2) / √(1 − r²)   (7.3)

which has a t distribution with n − 2 degrees of freedom. The five steps of the
test procedure are therefore:
1 Write down the null and alternative hypotheses (shown above).
2 Choose the significance level of the test: 5% by convention.
3 Look up the critical value of the test for n − 2 = 10 degrees of freedom:
t₁₀ = 2.228 for a two-tail test.
4 Calculate the test statistic using equation (7.3):

t = −0.824 × √(12 − 2) / √(1 − 0.824²) = −4.59

5 Compare the test statistic with the critical value. In this case t < −t₁₀, so H₀ is
rejected. There is a less than 5% chance of the sample evidence occurring if
the null hypothesis were true, so the latter is rejected. There does appear to
be a genuine association between the birth rate and the growth rate.
Performing similar calculations (see Exercise 7.2 below) for the income ratio
and for GNP reveals that in both cases the null hypothesis cannot be rejected at
the 5% significance level. These observed associations could well have arisen by
chance.
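The test statistic for the birth rate–growth correlation can be reproduced with equation (7.3). A Python sketch (not part of the original text):

```python
import math

Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]          # birth rates
X = [5.1, 3.2, 3.0, 1.4, 3.8, 1.0, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]  # growth

# Correlation coefficient, computational form of equation (7.1)
n = len(Y)
sx, sy = sum(X), sum(Y)
r = (n * sum(a * b for a, b in zip(X, Y)) - sx * sy) / math.sqrt(
    (n * sum(v * v for v in X) - sx ** 2) * (n * sum(v * v for v in Y) - sy ** 2))

t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # equation (7.3)
print(round(t, 2))     # -4.59
print(abs(t) > 2.228)  # True: reject H0 at the 5% level (two-tail, 10 d.f.)
```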
Are significant results important?
Following the discussion in Chapter 5, we might ask if a certain value of the
correlation coefficient is economically important as well as being significant. We
saw earlier that 'significant' results need not be important. The difficulty in this
case is that we have little intuitive understanding of the correlation coefficient.
Is ρ = 0.5 important, for example? Would it make much difference if it were
only 0.4?
Our understanding may be helped if we look at some graphs of variables with
different correlation coefficients. Three are shown in Figure 7.3. Panel (a) of the
figure graphs two variables with a correlation coefficient of 0.2. Visually there
seems little association between the variables, yet the correlation coefficient
is just significant: t = 2.06 (n = 100) and the Prob-value is 0.046. This is a
significant result which does not impress much.

Figure 7.3 Variables with different correlations
In panel (b) the correlation coefficient is 0.5 and the association seems a little
stronger visually, though there is still a substantial scatter of the observations
around a straight line. Yet the t statistic in this case is 5.72, highly significant
(Prob-value = 0.000).
Finally, panel (c) shows an example where n = 1000. To the eye this looks
much like a random scatter with no discernible pattern. Yet the correlation
coefficient is 0.1 and the t statistic is 3.18, again highly significant (Prob-value
= 0.002).
The lessons from this seem fairly clear. What looks like a random scatter on
a chart may in fact reveal a relationship between variables which is statistically
significant, especially if there are a large number of observations. On the other
hand, a high t statistic and correlation coefficient can still mean there is a lot of
variation in the data, revealed by the chart. Panel (b) suggests, for example, that
we are unlikely to get a very reliable prediction of the value of y even if we know
the value of x.
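The t statistics quoted for panels (b) and (c) follow directly from r and n via equation (7.3), as this Python sketch (not part of the original text) shows:

```python
import math

def t_stat(r, n):
    """t = r*sqrt(n - 2) / sqrt(1 - r^2), equation (7.3)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Panel (b): r = 0.5 with n = 100 observations
print(round(t_stat(0.5, 100), 2))   # 5.72
# Panel (c): r = 0.1 with n = 1000 observations
print(round(t_stat(0.1, 1000), 2))  # 3.18
```

A modest correlation becomes 'highly significant' simply because n is large, which is precisely the lesson of the passage above.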
Exercise 7.2
(a) Test the hypothesis that there is no association between the birth rate and the
income ratio.
(b) Look up the Prob-value associated with the test statistic and confirm that it does
not reject the null hypothesis.
Correlation and causality
It is important to test the significance of any result because almost every pair of
variables will have a non-zero correlation coefficient, even if they are totally
unconnected (the chance of the sample correlation coefficient being exactly zero
is very, very small). Therefore it is important to distinguish between correlation
coefficients which are significant and those which are not, using the t test just
outlined. But even when the result is significant, one should beware of the dan-
ger of 'spurious' correlation. Many variables which clearly cannot be related turn
out to be 'significantly' correlated with each other. One now famous example is
between the price level and cumulative rainfall. Because they both rise year
after year, it is easy to see why they are correlated, yet it is hard to think of a
plausible reason why they should be causally related to each other.
Apart from spurious correlation, there are four possible reasons for a non-zero
value of r:
1 X inﬂuences Y.
2 Y inﬂuences X.
3 X and Y jointly inﬂuence each other.
4 Another variable Z inﬂuences both X and Y.
Correlation alone does not allow us to distinguish between these alternatives.
For example, wages (X) and prices (Y) are highly correlated. Some people believe
this is due to cost–push inflation, i.e. that wage rises lead to price rises. This is
case (1) above. Others believe that wages rise to keep up with the cost of living
(i.e. rising prices), which is case (2). Perhaps a more convincing explanation is (3),
a wage–price spiral, where each feeds upon the other. Others would suggest that
it is the growth of the money supply (Z) which allows both wages and prices
to rise. To distinguish between these alternatives is important for the control of
inflation, but correlation alone does not allow that distinction to be made.
Correlation is best used, therefore, as a suggestive and descriptive piece of
analysis rather than a technique which gives definitive answers. It is often a
preparatory piece of analysis, which gives some clues to what the data might
yield, to be followed by more sophisticated techniques such as regression.
The coefficient of rank correlation
On occasion it is inappropriate or impossible to calculate the correlation coeffi-
cient as described above and an alternative approach is required. Sometimes the
original data are unavailable but the ranks are. For example, schools may be
ranked in terms of their exam results, but the actual pass rates are not available.
Similarly, they may be ranked in terms of spending per pupil, with actual spend-
ing levels unavailable. Although the original data are missing, one can still
test for an association between spending and exam success by calculating the
correlation between the ranks. If extra spending improves exam performance,
schools ranked higher on spending should also be ranked higher on exam
success, leading to a positive correlation.
Second, even if the raw data are available, they may be highly skewed and
hence the correlation coefficient may be influenced heavily by a few outliers. In
this case the hypothesis test for correlation may be misleading, as it is based on
the assumption of underlying Normal distributions for the data. Instead we
could transform the values to ranks and calculate the correlation of the ranks.
In a similar manner to the median (described in Chapter 1), this can effectively
deal with heavily skewed distributions.
In these cases it is Spearman's coefficient of rank correlation that is calculated.
The 'standard' correlation coefficient described above is more fully known
as Pearson's product-moment correlation coefficient, to distinguish it. The
formula to be applied is the same as before, though there are a few tricks to be
learned about constructing the ranks, and the hypothesis test is conducted
in a different manner.
Using the ranks is generally less efficient than using the original data, because
one is effectively throwing away some of the information (e.g. by how much do
countries' growth rates differ?). However, there is a trade-off: the rank correlation
coefficient is more robust, i.e. it is less influenced by outliers or highly skewed
distributions. If one suspects this is a risk, it may be better to use the ranks. This
is similar to the situation where the median can prove superior to the mean as
a measure of central tendency.
We will calculate the rank correlation coefficient for the data on birth and
growth rates, to provide a comparison with the ordinary correlation coefficient
calculated earlier. It is unlikely that the distributions of birth or of growth rates
are particularly skewed (and we have too few observations to tell reliably), so the
Pearson measure might generally be preferred, but we calculate the Spearman
coefficient for comparison. Table 7.3 presents the data for birth and growth rates
in the form of ranks. Calculating the ranks is fairly straightforward, though
there are a couple of points to note.
The country with the highest birth rate has the rank of 1, the next highest 2,
and so on. Similarly, the country with the highest growth rate ranks 1, etc.
One could reverse a ranking, so the lowest birth rate ranks 1, for example; the
direction of ranking can be somewhat arbitrary. This would leave the rank cor-
relation coefficient unchanged in value, but the sign would change (e.g. 0.5 would
become −0.5). This could be confusing, as we would now have a 'negative'
correlation rather than a positive one (though the birth rate variable would
now have to be redefined). It is better to use the 'natural' order of ranking for
each variable.
Where two or more observations are the same (as are the birth rates of Mexico
and Peru), they are given the same rank, which is the average of the
relevant ranking values. For example, both countries are given the rank of 2.5,
Table 7.3 Calculation of Spearman's rank correlation coefficient

Country        Birth rate   Growth rate   Rank Y   Rank X   (Rank Y)²   (Rank X)²   Rank X × Rank Y
Brazil         30           5.1           7        3        49          9           21
Colombia       29           3.2           9        6        81          36          54
Costa Rica     30           3.0           7        7        49          49          49
India          35           1.4           4        10       16          100         40
Mexico         36           3.8           2.5      5        6.25        25          12.5
Peru           36           1.0           2.5      11       6.25        121         27.5
Philippines    34           2.8           5        8        25          64          40
Senegal        48           −0.3          1        12       1           144         12
South Korea    24           6.9           11       1        121         1           11
Sri Lanka      27           2.5           10       9        100         81          90
Taiwan         21           6.2           12       2        144         4           24
Thailand       30           4.6           7        4        49          16          28
Totals         –            –             78       78       647.5       650         409

Note: The country with the highest growth rate, South Korea, is ranked 1 for variable X;
Taiwan, the next fastest growing nation, is ranked 2, etc. For the birth rate, Senegal is
ranked 1, having the highest birth rate (48). Taiwan has the lowest birth rate and so is
ranked 12 for variable Y.
which is the average of 2 and 3. Similarly, Brazil, Costa Rica and Thailand are
all given the rank of 7, which is the average of 6, 7 and 8. The next country,
Colombia, is then given the rank of 9.
Excel warning
Microsoft Excel has a RANK function built in, which takes a variable and calcu-
lates a new variable consisting of the ranks, similar to the above table. However,
note that it deals with tied values in a different way. In the example above, Brazil,
Costa Rica and Thailand would all be given a rank of 6 by Excel, not 7. This then
gives a different correlation coefficient to that calculated here. Excel's method can
be shown to be problematic, since if the rankings are reversed (e.g. the
highest growth country is numbered 12 rather than 1) Excel gives a different
numerical result.
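The averaged-rank convention can be implemented in a few lines. The following Python sketch (the function name is ours, for illustration) ranks the birth-rate data and reproduces the tied ranks shown in Table 7.3:

```python
def average_ranks(values, descending=True):
    """Rank values (1 = highest, by default), giving tied values
    the average of the ranking positions they occupy."""
    order = sorted(values, reverse=descending)
    ranks = []
    for v in values:
        first = order.index(v) + 1             # first position occupied by v
        count = order.count(v)                 # number of tied observations
        ranks.append(first + (count - 1) / 2)  # average of the tied positions
    return ranks

births = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
r = average_ranks(births)
# Brazil, Costa Rica and Thailand (all 30) occupy positions 6, 7, 8 -> rank 7
print(r[0], r[2], r[11])  # 7.0 7.0 7.0
print(r[4], r[5])         # 2.5 2.5 (Mexico and Peru, both 36)
```

For reference, `scipy.stats.rankdata` uses the same averaging convention (its `'average'` method), though it ranks in ascending order by default.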
We now apply formula (7.1) to the ranked data, giving

rₛ = (12 × 409 − 78 × 78) / √[(12 × 650 − 78²)(12 × 647.5 − 78²)] = −0.691

This indicates a negative rank correlation between the two variables, as with the
standard correlation coefficient (r = −0.824), but with a slightly smaller absolute
value.
To test the significance of the result, a hypothesis test can be performed on
the value of ρₛ, the corresponding population parameter:

H₀: ρₛ = 0
H₁: ρₛ ≠ 0

This time the t distribution cannot be used (because we are no longer relying
on the parent distribution being Normal), but prepared tables of the critical
values for ρₛ itself may be consulted; these are given in Table A6 (see page 426)
and an excerpt is given in Table 7.4.
The critical value at the 5% significance level for n = 12 is 0.591. Hence the
null hypothesis is rejected if the rank correlation coefficient falls outside the
Table 7.4 Excerpt from Table A6: Critical values of the rank correlation coefficient

n      10%     5%      2%      1%
5      0.900
6      0.829   0.886   0.943
...
11     0.523   0.623   0.763   0.794
12     0.497   0.591   0.703   0.780
13     0.475   0.566   0.673   0.746

Note: The critical value is given at the intersection of the relevant row (n) and column
(significance level).
range [−0.591, 0.591], which it does in this case. Thus the null can be rejected
with 95% confidence; the data do support the hypothesis of a relationship
between the birth rate and growth. The critical value shown in the table is for
a two-tail test. For a one-tail test the significance level given in the top row of
the table should be halved.
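Spearman's coefficient is simply Pearson's formula applied to the ranks, so the value −0.691 can be checked directly. A Python sketch (not part of the original text) using the ranks from Table 7.3:

```python
import math

def pearson_r(x, y):
    """Computational form of equation (7.1)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    num = n * sum(a * b for a, b in zip(x, y)) - sx * sy
    den = math.sqrt((n * sum(v * v for v in x) - sx ** 2) *
                    (n * sum(v * v for v in y) - sy ** 2))
    return num / den

# Ranks from Table 7.3 (1 = highest birth rate / highest growth rate)
rank_Y = [7, 9, 7, 4, 2.5, 2.5, 5, 1, 11, 10, 12, 7]  # birth-rate ranks
rank_X = [3, 6, 7, 10, 5, 11, 8, 12, 1, 9, 2, 4]      # growth-rate ranks

r_s = pearson_r(rank_X, rank_Y)
print(round(r_s, 3))     # -0.691
print(abs(r_s) > 0.591)  # True: significant at the 5% level for n = 12
```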
Exercise 7.3
(a) Rank the observations for the income ratio across countries (highest = 1) and
calculate the coefficient of rank correlation with the birth rate.
(b) Test the hypothesis that ρₛ = 0.
(c) Reverse the rankings for both variables and confirm that this does not affect the
calculated test statistic.
Worked example 7.1
To illustrate all the calculations and bring them together without distracting
explanation, we work through a simple example with the following data on
X and Y:

Y: 17 18 19 20 27 18
X: 3 4 7 6 8 5

An XY graph of the data suggests positive correlation.
Note that one point appears to be something of an outlier. All the cal-
culations for correlation may be based on the following table:

Obs      Y     X    Y²     X²    XY    Rank Y   Rank X   (Rank Y)²   (Rank X)²   Rank X × Rank Y
1        17    3    289    9     51    6        6        36          36          36
2        18    4    324    16    72    4.5      5        20.25       25          22.5
3        19    7    361    49    133   3        2        9           4           6
4        20    6    400    36    120   2        3        4           9           6
5        27    8    729    64    216   1        1        1           1           1
6        18    5    324    25    90    4.5      4        20.25       16          18
Totals   119   33   2427   199   682   21       21       90.5        91          89.5
The Pearson correlation coefficient r is therefore:

r = (6 × 682 − 33 × 119) / √[(6 × 199 − 33²)(6 × 2427 − 119²)] = 0.804

The hypothesis H₀: ρ = 0 versus H₁: ρ ≠ 0 can be tested using the t test statistic:

t = 0.804 × √(6 − 2) / √(1 − 0.804²) = 2.70

which is compared to a critical value of 2.776, so the null hypothesis is
not rejected (narrowly). This is largely attributable to the small number of
observations, and anyway it may be unwise to use the t distribution on such
a small sample. The rank correlation coefficient is calculated as

rₛ = (6 × 89.5 − 21 × 21) / √[(6 × 91 − 21²)(6 × 90.5 − 21²)] = 0.928

The critical value at the 5% significance level is 0.886, so the rank correla-
tion coefficient is significant, in contrast to the previous result. Not too much
should be read into this, however; with few observations the ranking process
can easily alter the result substantially.

A simpler formula
When the ranks occur without any ties, equation (7.1) simplifies to the following
formula:

rₛ = 1 − 6∑d² / (n(n² − 1))   (7.4)

where d is the difference in the ranks. An example of the use of this formula is
given below, using the following data for the calculation:

Rank Y   Rank X   d     d²
1        5        −4    16
4        1        3     9
5        2        3     9
6        3        3     9
3        4        −1    1
2        6        −4    16
Total                   60

The differences d and their squared values are shown in the final columns of
the table, and from these we obtain

rₛ = 1 − (6 × 60) / (6 × (6² − 1)) = −0.714   (7.5)
Figure 7.4 The line of best fit
This is the same answer as would be obtained using the conventional formula
(7.1); the verification is left as an exercise. Remember, this formula can only be
used if there are no ties in either variable.
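The shortcut formula (7.4) is easily coded. This Python sketch (not part of the original text; the function name is ours) reproduces the value in equation (7.5) for the no-ties example above:

```python
def spearman_no_ties(rank_y, rank_x):
    """Equation (7.4): r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)).
    Valid only when neither variable contains tied ranks."""
    n = len(rank_y)
    d2 = sum((ry - rx) ** 2 for ry, rx in zip(rank_y, rank_x))
    return 1 - 6 * d2 / (n * (n * n - 1))

rank_Y = [1, 4, 5, 6, 3, 2]
rank_X = [5, 1, 2, 3, 4, 6]

print(round(spearman_no_ties(rank_Y, rank_X), 3))  # -0.714, as in (7.5)
```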
Regression analysis
Regression analysis is a more sophisticated way of examining the relationship
between two or more variables than is correlation. The major differences
between correlation and regression are the following:
● Regression can investigate the relationships between two or more variables.
● A direction of causality is asserted, from the explanatory variable (or variables)
to the dependent variable.
● The influence of each explanatory variable upon the dependent variable is
measured.
● The significance of each explanatory variable can be ascertained.
Thus regression permits answers to such questions as:
● Does the growth rate influence a country's birth rate?
● If the growth rate increases, by how much might a country's birth rate be
expected to fall?
● Are other variables important in determining the birth rate?
In this example we assert that the direction of causality is from the growth
rate (X) to the birth rate (Y), and not vice versa. The growth rate is therefore the
explanatory variable (also referred to as the independent or exogenous variable)
and the birth rate is the dependent variable (also called the explained or endo-
genous variable).
Regression analysis describes this causal relationship by fitting a straight line
drawn through the data which best summarises them. It is sometimes called
'the line of best fit' for this reason. This is illustrated in Figure 7.4 for the birth
rate and growth rate data. Note that, by convention, the explanatory variable
is placed on the horizontal axis, the explained on the vertical. This regression
line is downward sloping (its derivation will be explained shortly) for the same
reason that the correlation coefficient is negative, i.e. high values of Y are
generally associated with low values of X and vice versa.
Since the regression line summarises knowledge of the relationship between
X and Y, it can be used to predict the value of Y given any particular value of X.
In Figure 7.4, the value of X = 3 (the observation for Costa Rica) is related via the
regression line to a value of Y (denoted by Z) of 32.6. This predicted value is
close (but not identical) to the actual birth rate of 30. The difference reflects the
absence of perfect correlation between the two variables.
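The prediction for Costa Rica can be reproduced once the regression coefficients (derived later in the chapter: a = 40.71, b = −2.70) are known. A Python sketch, not part of the original text:

```python
# Fitted regression line for the birth rate data (coefficients from the
# calculation later in the chapter, rounded to two decimal places)
a, b = 40.71, -2.70  # intercept and slope

growth_costa_rica = 3.0
predicted = a + b * growth_costa_rica  # Z, the fitted birth rate
residual = 30 - predicted              # e = actual minus predicted

print(round(predicted, 2))  # 32.61
print(round(residual, 2))   # -2.61
```

The residual of about −2.61 is the error e shown in Figure 7.4.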
The difference between the actual value, Y, and the predicted value, Z,
is called the error or residual. It is labelled e in Figure 7.4. (Note: the italic e
denoting the error term should not be confused with the roman letter e used
as the base for natural logarithms; see Appendix 1C to Chapter 1, page 78.) Why
should such errors occur? The relationship is never going to be an exact one
for a variety of reasons. There are bound to be other factors besides growth
which affect the birth rate (e.g. the education of women) and these effects are
all subsumed into the error term. There might additionally be simple measure-
ment error (of Y), and of course people do act in a somewhat random fashion
rather than follow rigid rules of behaviour.
All of these factors fall into the error term and this means that the observations
lie around the regression line rather than on it. If there are many of these factors,
none of which is predominant, and they are independent of each other, then these
errors may be assumed to be Normally distributed about the regression line.
Why not include these factors explicitly? On the face of it this would seem
to be an improvement, making the model more realistic. However, the costs
of doing this are that the model becomes more complex, calculation becomes
more difficult (not so important now, with computers) and it is generally more
difficult for the reader or researcher to interpret what is going on. If the main
interest is the relationship between the birth rate and growth, why complicate
the model unduly? There is a virtue in simplicity, as long as the simplified model
still gives an undistorted view of the relationship. In Chapter 10 (on multiple
regression) the trade-off between simplicity and realism will be further discussed,
particularly with reference to the problems which can arise if relevant explanatory
variables are omitted from the analysis.
Calculation of the regression line
The equation of the sample regression line may be written

Zᵢ = a + bXᵢ   (7.6)

where
Zᵢ is the predicted value of Y for observation (country) i,
Xᵢ is the value of the explanatory variable for observation i, and
a, b are fixed coefficients to be estimated; a measures the intercept of the regres-
sion line on the Y axis, b measures its slope.

This is illustrated in Figure 7.5.
The first task of regression analysis is to find the values of a and b so that the
regression line may be drawn. To do this we proceed as follows. The difference
between the actual value, Yᵢ, and its predicted value, Zᵢ, is eᵢ, the error. Thus
Yᵢ = Zᵢ + eᵢ   (7.7)

Substituting equation (7.6) into equation (7.7), the regression equation can be
written

Yᵢ = a + bXᵢ + eᵢ   (7.8)

Equation (7.8) shows that observed birth rates are made up of two components:
1 that part explained by the growth rate, a + bXᵢ, and
2 an error component, eᵢ.

In a good model, part (1) should be large relative to part (2), and the regres-
sion line is based upon this principle. The line of best fit is therefore found by
finding the values of a and b which minimise the sum of squared errors, ∑eᵢ², from
the regression line. For this reason, this method is known as 'the method of least
squares', or simply 'ordinary least squares' (OLS). The use of this criterion will be
justified later on, but it can be said in passing that the sum of the errors is not
minimised because that would not lead to a unique answer for the values a and
b. In fact, there is an infinite number of possible regression lines which all yield
a sum of errors equal to zero. Minimising the sum of squared errors does yield a
unique answer.
The task is therefore to

minimise ∑eᵢ²   (7.9)

by choice of a and b.
Rearranging equation (7.8), the error is given by

eᵢ = Yᵢ − a − bXᵢ   (7.10)

so equation (7.9) becomes

minimise ∑(Yᵢ − a − bXᵢ)²   (7.11)

by choice of a and b.
Finding the solution to equation (7.11) requires the use of differential calculus
and is not presented here. The resulting formulae for a and b are

b = (n∑XY − ∑X∑Y) / (n∑X² − (∑X)²)   (7.12)

Figure 7.5 Intercept and slope of the regression line
and

a = Ȳ − bX̄   (7.13)

where X̄ and Ȳ are the mean values of X and Y respectively. The values neces-
sary to evaluate equations (7.12) and (7.13) can be obtained from Table 7.2,
which was used to calculate the correlation coefficient. These values are repeated
for convenience:

∑Y = 380      ∑Y² = 12 564
∑X = 40.2     ∑X² = 184.04
∑XY = 1139.70    n = 12

Using these values we obtain

b = (12 × 1139.70 − 40.2 × 380) / (12 × 184.04 − 40.2²) = −2.700

and

a = 380/12 − (−2.700) × 40.2/12 = 40.711

Thus the regression equation can be written (to two decimal places, for clar-
ity) as

Yᵢ = 40.71 − 2.70Xᵢ + eᵢ
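Equations (7.12) and (7.13) translate directly into code. This Python sketch (not part of the original text) recovers the slope and intercept just calculated:

```python
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]          # birth rate
X = [5.1, 3.2, 3.0, 1.4, 3.8, 1.0, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]  # growth

def ols(x, y):
    """Least-squares intercept and slope, equations (7.12) and (7.13)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    b = ((n * sum(a * v for a, v in zip(x, y)) - sx * sy) /
         (n * sum(v * v for v in x) - sx ** 2))
    a = sy / n - b * sx / n  # a = mean(Y) - b * mean(X)
    return a, b

a, b = ols(X, Y)
print(round(b, 3), round(a, 2))  # -2.7 40.71
```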
Interpretation of the slope and intercept
The most important part of the result is the slope coefficient b = −2.7, since it
measures the effect of X upon Y. This result implies that a unit increase in the
growth rate (e.g. from 2% to 3% p.a.) would lower the birth rate by 2.7, for example
from 30 births per 1000 population to 27.3. Given that the growth data refer to
a 20-year period (1961 to 1981), this increase in the growth rate would have to
be sustained over such a time, not an easy task. How big is the effect upon the
birth rate? The average birth rate in the sample is 31.67, so a reduction of 2.7 for
an average country would be a fall of 8.5% (= 2.7/31.67 × 100). This is reasonably
substantial (although not enough to bring the birth rate down to developed
country levels) but would need a considerable, sustained increase in the growth
rate to bring it about.
The value of a, the intercept, may be interpreted as the predicted birth rate of
a country with zero growth (since Zᵢ = a at X = 0). This value of 40.71 is fairly
close to that of Senegal, which actually had negative growth over the period
and whose birth rate (48) was a little higher than the intercept value. Although
a has a sensible interpretation in this case, this is not always so. For example, in
a regression of the demand for a good on its price, a would represent demand at
zero price, which is unlikely ever to be observed.
Exercise 7.4
(a) Calculate the regression line relating the birth rate to the income ratio.
(b) Interpret the coefficients of this equation.
Measuring the goodness of fit of the regression line

Having calculated the regression line, we now ask whether it provides a good fit for the data, i.e. do the observations tend to lie close to, or far away from, the line? If the fit is poor, perhaps the effect of X upon Y is not so strong after all. Note that even if X has no effect upon Y, we can still calculate a regression line and its slope coefficient b. Although b is likely to be small, it is unlikely to be exactly zero. Measuring the goodness of fit of the data to the line helps us to distinguish between good and bad regressions.
We proceed by comparing the three competing models explaining the birth rate. Which of them fits the data best? Using the income ratio and the GNP variable gives the following regressions (calculations not shown) to compare with our original model:

for the income ratio, IR: B = 26.44 + 1.045 × IR + e
for GNP: B = 34.72 − 0.003 × GNP + e
for growth: B = 40.71 − 2.70 × GROWTH + e
How can we decide which of these three is 'best' on the basis of the regression equations alone? From Figure 7.1 it is evident that some relationships appear stronger than others, yet this is not revealed by examining the regression equation alone. More information is needed. (You cannot choose the best equation simply by looking at the size of the coefficients. Try to think why.)

The goodness of fit is calculated by comparing two lines: the regression line and the 'mean line', i.e. a horizontal line drawn at the mean value of Y. The regression line must fit the data better (if the mean line were the best fit, that is also where the regression line would be), but the question is, how much better? This is illustrated in Figure 7.6, which demonstrates the principle behind the calculation of the coefficient of determination, denoted by R² and usually more simply referred to as 'R squared'.
The figure shows the mean value of Y, the calculated sample regression line and an arbitrarily chosen sample observation (Xᵢ, Yᵢ). The difference between Yᵢ and Ȳ (length Yᵢ − Ȳ) can be divided up into:

(1) that part 'explained' by the regression line, Ŷᵢ − Ȳ, i.e. explained by the value of Xᵢ;
(2) the error term eᵢ = Yᵢ − Ŷᵢ.

Figure 7.6: The calculation of R²
In algebraic terms

Yᵢ − Ȳ = (Yᵢ − Ŷᵢ) + (Ŷᵢ − Ȳ)   (7.14)

A good regression model should 'explain' a large part of the differences between the Yᵢ values and Ȳ, i.e. the length Ŷᵢ − Ȳ should be large relative to Yᵢ − Ȳ. A measure of fit would therefore be (Ŷᵢ − Ȳ)/(Yᵢ − Ȳ). We need to apply this to all observations, not just a single one. Hence we need to sum this expression over all the sample observations. A problem is that some of the terms would take a negative value and offset the positive terms. To get round this problem we square each of the terms in equation (7.14) to make them all positive, and then sum over the observations. This gives
∑(Yᵢ − Ȳ)², known as the total sum of squares (TSS)
∑(Ŷᵢ − Ȳ)², the regression sum of squares (RSS), and
∑(Yᵢ − Ŷᵢ)², the error sum of squares (ESS)
The measure of goodness of fit, R², is then defined as the ratio of the regression sum of squares to the total sum of squares, i.e.

R² = RSS/TSS   (7.15)

The better the divergences between Yᵢ and Ȳ are explained by the regression line, the better the goodness of fit, and the higher the calculated value of R². Further, it is true that

TSS = RSS + ESS   (7.16)

From equations (7.15) and (7.16) we can then see that R² must lie between 0 and 1 (note that, since each term in equation (7.16) is a sum of squares, none of them can be negative). Thus

0 ≤ R² ≤ 1
A value of R² = 1 indicates that all the sample observations lie exactly on the regression line (equivalent to perfect correlation). If R² = 0 then the regression line is of no use at all – X does not influence Y (linearly) at all, and to predict a value of Yᵢ one might as well use the mean Ȳ rather than the value Xᵢ inserted into the sample regression equation.
To calculate R², alternative formulae to those above make the task easier. Instead we use

TSS = ∑(Yᵢ − Ȳ)² = ∑Yᵢ² − nȲ² = 12 564 − 12 × 31.67² = 530.667
ESS = ∑(Yᵢ − Ŷᵢ)² = ∑Yᵢ² − a∑Yᵢ − b∑XᵢYᵢ = 12 564 − 40.711 × 380 − (−2.7) × 1139.70 = 170.754
RSS = TSS − ESS = 530.667 − 170.754 = 359.913

This gives the result

R² = RSS/TSS = 359.913/530.667 = 0.678
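The sums-of-squares decomposition can be checked numerically from the same summary statistics. A sketch in Python (illustrative, not the book's own code):

```python
# R-squared via the shortcut formulae (equations 7.15-7.16),
# using the summary sums quoted in the text.
n, sum_y, sum_y2 = 12, 380, 12564
sum_x, sum_x2, sum_xy = 40.2, 184.04, 1139.70

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

tss = sum_y2 - sum_y ** 2 / n          # total sum of squares
ess = sum_y2 - a * sum_y - b * sum_xy  # error sum of squares
rss = tss - ess                        # regression sum of squares
r2 = rss / tss

print(round(tss, 3), round(ess, 3), round(r2, 3))  # 530.667 170.754 0.678
```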
This is interpreted as follows. Countries' birth rates vary around the overall mean value of 31.67. 67.8% of this variation is explained by variation in countries' growth rates. This is quite a respectable figure to obtain, leaving only 32.2% of
the variation in Y left to be explained by other factors (or pure random variation). The regression seems to make a worthwhile contribution to explaining why birth rates differ.

It turns out that, in simple regression (i.e. where there is only one explanatory variable), R² is simply the square of the correlation coefficient between X and Y. Thus, for the income ratio and for GNP, we have

for IR: R² = 0.35² = 0.13
for GNP: R² = (−0.26)² = 0.07

This shows once again that these other variables are not terribly useful in explaining why birth rates differ. Each of them only explains a small proportion of the variation in Y.
It should be emphasised at this point that R² is not the only criterion (or even an adequate one in all cases) for judging the quality of a regression equation, and that other statistical measures, set out below, are also required.
Exercise 7.5
(a) Calculate the R² value for the regression of the birth rate on the income ratio, calculated in Exercise 7.4.
(b) Confirm that this result is the same as the square of the correlation coefficient between these two variables, calculated in Exercise 7.1.
Inference in the regression model

So far, regression has been used as a descriptive technique, to measure the relationship between the two variables. We now go on to draw inferences from the analysis about what the true regression line might look like. As with correlation, the estimated relationship is in fact a sample regression line, based upon data for 12 countries. The estimated coefficients a and b are random variables, since they would differ from sample to sample. What can be inferred about the true (but unknown) regression equation?

The question is best approached by first writing down a true or population regression equation, in a form similar to the sample regression equation:

Yᵢ = α + βXᵢ + εᵢ   (7.17)

As usual, Greek letters denote true or population values. Thus α and β are the population parameters, of which a and b are (point) estimates, using the method of least squares, and ε is the population error term. If we could observe the individual error terms εᵢ, then we would be able to get exact values of α and β (even from a sample), rather than just estimates.

Given that a and b are estimates, we can ask about their properties: whether they are unbiased and how precise they are, compared to alternative estimators. Under reasonable assumptions (e.g. see Maddala (2001), Chapter 3) it can be shown that the OLS estimates of the coefficients are unbiased. Thus OLS provides useful point estimates of the parameters (the true values α and β). This is one reason for using the least squares method. It can also be shown that, among the class of linear unbiased estimators, OLS has the minimum variance,
i.e. the method provides the most precise estimates. This is another powerful justification for the use of OLS. So, just as the sample mean provides a more precise estimate of the population mean than does a single observation, the least squares estimates of α and β are the most precise.
Analysis of the errors

To find confidence intervals for α and β we need to know which statistical distribution we should be using, i.e. the distributions of a and b. These can be derived, based on the assumptions that the error term ε in equation (7.17) above is Normally distributed and that the errors are statistically independent of each other. Since we are using cross-section data from countries which are different geographically, politically and socially, it seems reasonable to assume the errors are independent.

To check the Normality assumption we can graph the residuals calculated from the sample regression line. If the true errors are Normal, it seems likely that these residuals should be approximately Normal also. The residuals are calculated according to equation (7.10) above. For example, to calculate the residual for Brazil, we subtract the fitted value from the actual value. The fitted value is calculated by substituting the growth rate into the estimated regression equation, yielding Ŷ = 40.712 − 2.7 × 5.1 = 26.9. Subtracting this from the actual value gives Yᵢ − Ŷ = 30 − 26.9 = 3.1. Other countries' residuals are calculated in similar manner, yielding the results shown in Table 7.5.
These residuals may then be gathered together in a frequency table (as in Chapter 1) and graphed. This is shown in Figure 7.7.

Although the number of observations is small, and therefore the graph is not a smooth curve, the chart does have the greater weight of frequencies in the centre, as one would expect, with less weight as one moves into the tails of the distribution. The assumption that the true error term is Normally distributed does not seem unreasonable.

If the residuals from the sample regression equation appeared distinctly non-Normal (heavily skewed, for example) then one should be wary of constructing confidence intervals using the formulae below. Instead, one might consider transforming the data (see below) before continuing. There are more formal tests for Normality of the residuals, but they are beyond the scope of this book. Drawing a graph is an informal alternative, which can be useful, but remember that graphical methods can be misinterpreted.
Table 7.5 Calculation of residuals

              Actual birth rate   Fitted values   Residuals
Brazil               30               26.9            3.1
Colombia             29               32.1           −3.1
Costa Rica           30               32.6           −2.6
...
Sri Lanka            27               34.0           −7.0
Taiwan               21               24.0           −3.0
Thailand             30               28.3            1.7
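The residual calculation described above is mechanical and easy to script. A minimal Python sketch for the Brazil row (illustrative, using the fitted coefficients from the text):

```python
# Fitted value and residual for Brazil (growth rate 5.1% p.a., birth rate 30),
# using the estimated regression coefficients from the text.
a, b = 40.712, -2.7
growth, birth = 5.1, 30

fitted = a + b * growth    # substitute X into the regression equation
residual = birth - fitted  # actual minus fitted

print(round(fitted, 1), round(residual, 1))  # 26.9 3.1
```

Repeating this for each country reproduces the Residuals column of Table 7.5.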
If one were using time-series data, one should also check the residuals for autocorrelation at this point. This occurs when the error in period t is dependent in some way on the error in the previous period(s), and implies that the method of least squares may not be the best way of estimating the relationship. In this example we have cross-section data, so it is not appropriate to check for autocorrelation, since the ordering of the data does not matter. Chapter 8, on multiple regression, covers this topic.
Confidence interval estimates of α and β

Having checked that the residuals appear reasonably Normal, we can proceed with inference. This means finding interval estimates of the parameters α and β and, later on, conducting hypothesis tests. As usual, the 95% confidence interval is obtained by adding and subtracting (approximately) two standard errors from the point estimate. We therefore need to calculate the standard error of a and of b, and we also need to look up tables to find the precise number of standard errors to add and subtract. The principle is just the same as for the confidence interval estimate of the sample mean, covered in Chapter 4.
The estimated sampling variance of b, the slope coefficient, is given by

s_b² = s_e²/∑(Xᵢ − X̄)²   (7.18)

where

s_e² = ∑eᵢ²/(n − 2) = ESS/(n − 2)   (7.19)

is the estimated variance of the error term ε.

The sampling variance of b measures the uncertainty associated with the estimate. Note that the uncertainty is greater (i) the larger the error variance s_e² (i.e. the more scattered the points around the regression line) and (ii) the lower the dispersion of the X observations. When X does not vary much, it is more difficult to measure the effect of changes in X upon Y, and this is reflected in the formula.

Figure 7.7: Bar chart of residuals from the regression equation
First we need to calculate s_e². The value of this is

s_e² = ESS/(n − 2) = 170.754/10 = 17.0754   (7.20)

and so the estimated variance of b is

s_b² = 17.0754/49.37 = 0.346   (7.21)

(Use ∑(Xᵢ − X̄)² = ∑Xᵢ² − nX̄² = 49.37 in calculating (7.21) – it makes the calculation easier.) The estimated standard error of b is the square root of (7.21):

s_b = √0.346 = 0.588   (7.22)

To construct the confidence interval around the point estimate b = −2.7, the t distribution is used (in regression, this applies to all sample sizes, not just small ones). The 95% confidence interval is thus given by

[b − tᵥ × s_b, b + tᵥ × s_b]   (7.23)

where tᵥ is the (two-tail) critical value of the t distribution at the appropriate significance level (5% in this case), with v = n − 2 degrees of freedom. The critical value is 2.228. Thus the confidence interval evaluates to

[−2.7 − 2.228 × 0.588, −2.7 + 2.228 × 0.588] = [−4.01, −1.39]
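These steps can be sketched in Python from the quantities already computed in the text (an illustration, not the book's own code):

```python
import math

# Standard error of the slope and its 95% confidence interval
# (equations 7.18-7.23), using the chapter's numbers.
n = 12
ess = 170.754   # error sum of squares
sxx = 49.37     # sum of (X - mean(X))^2
b = -2.7        # slope estimate
t_crit = 2.228  # t critical value, 5% two-tail, 10 d.f.

s2e = ess / (n - 2)  # error variance, equation (7.19)
s2b = s2e / sxx      # variance of b, equation (7.18)
sb = math.sqrt(s2b)  # standard error of b

ci = (b - t_crit * sb, b + t_crit * sb)
print(round(sb, 3), round(ci[0], 2), round(ci[1], 2))  # 0.588 -4.01 -1.39
```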
Thus we can be 95% confident that the true value of β lies within this range. Note that the interval only includes negative values: we can rule out an upward-sloping regression line.

For the intercept a, the estimate of the variance is given by

s_a² = s_e² × (1/n + X̄²/∑(Xᵢ − X̄)²) = 17.0754 × (1/12 + 3.35²/49.37) = 5.304   (7.24)

and the estimated standard error of a is the square root of this, 2.303. The 95% confidence interval for α, again using the t distribution, is

[40.71 − 2.228 × 2.303, 40.71 + 2.228 × 2.303] = [35.57, 45.84]

The results so far can be summarised as follows:

Yᵢ = 40.711 − 2.70Xᵢ + eᵢ
s.e.   (2.30)   (0.59)
R² = 0.678   n = 12

This conveys at a glance all the necessary information to the reader, who can then draw the inferences deemed appropriate. Any desired confidence interval (not just the 95% one) can be quickly calculated with the aid of a set of t tables.
Testing hypotheses about the coefficients

As well as calculating confidence intervals, one can use hypothesis tests as the basis for statistical inference in the regression model. These tests are quickly and easily explained, given the information already assembled. Consider the following hypothesis:
H₀: β = 0
H₁: β ≠ 0

This null hypothesis is interesting because it implies no influence of X upon Y at all (i.e. the slope of the true regression line is flat) and Yᵢ can be equally well predicted by Ȳ. The alternative hypothesis asserts that X does in fact influence Y.
The procedure is, in principle, the same as in Chapter 5 on hypothesis testing. We measure how many standard deviations separate the observed value of b from the hypothesised value. If this is greater than an appropriate critical value, we reject the hypothesis. The test statistic is calculated using the formula

t = (b − β)/s_b = (−2.7 − 0)/0.588 = −4.59   (7.25)

Thus the sample coefficient b differs by 4.59 standard errors from its hypothesised value β = 0. This is compared to the critical value of the t distribution, with n − 2 degrees of freedom. Since t = −4.59 < −t₁₀ = −2.228 in this case, the null hypothesis is rejected with 95% confidence. X does have some influence on Y. Similar tests using the income ratio and GNP to attempt to explain the birth rate show that in neither case is the slope coefficient significantly different from zero, i.e. neither of these variables appears to influence the birth rate.
Rule of thumb for hypothesis tests

A quick and reasonably accurate method for establishing whether a coefficient is significantly different from zero is to see if it is at least twice its standard error. If so, it is significant. This works because the critical value (at the 95% confidence level) of the t distribution for reasonable sample sizes is about 2.
Sometimes regression results are presented with the t statistic (as calculated above), rather than the standard error, below each coefficient. This implicitly assumes that the hypothesis of interest is that the coefficient is zero. This is not always appropriate: in the consumption function, a test for the marginal propensity to consume being equal to 1 might be of greater relevance, for example. In a demand equation, one might want to test for unit elasticity. For this reason it is better to present the standard errors rather than the t statistics.

Note that the test statistic t = −4.59 is exactly the same result as in the case of testing the correlation coefficient. This is no accident, for the two tests are equivalent. A non-zero slope coefficient means there is a relationship between X and Y, which also means the correlation coefficient is non-zero. Both null hypotheses are rejected.
Testing the significance of R²: the F test

Another check of the quality of the regression equation is to test whether the R² value calculated earlier is significantly greater than zero. This is a test using the F distribution and turns out, once again, to be equivalent to the two previous tests (H₀: β = 0 and H₀: ρ = 0, conducted in previous sections using the t distribution).
The null hypothesis for the test is H₀: R² = 0, implying once again that X does not influence Y (hence it is equivalent to β = 0). The test statistic is

F = (R²/1)/((1 − R²)/(n − 2))   (7.26)

or equivalently

F = (RSS/1)/(ESS/(n − 2))   (7.27)

The F statistic is therefore the ratio of the regression sum of squares to the error sum of squares, each divided by their degrees of freedom (for the RSS there is one degree of freedom, because of the one explanatory variable; for the ESS there are n − 2 degrees of freedom). A high value of the F statistic rejects H₀ in favour of the alternative hypothesis, H₁: R² > 0. Evaluating (7.26) gives

F = (0.678/1)/((1 − 0.678)/10) = 21.078   (7.28)
The critical value of the F distribution at the 5% significance level, with v₁ = 1 and v₂ = 10, is F₁,₁₀ = 4.96. The test statistic exceeds this, so the regression as a whole is significant. It is better to use the regression model to explain the birth rate than to use the simpler model which assumes all countries have the same birth rate (the sample average).

As stated before, this test is equivalent to those carried out before using the t distribution. The F statistic is in fact the square of the t statistic calculated earlier ((−4.59)² = 21.078) and reflects the fact that, in general,

F₁,ₙ₋₂ = t²ₙ₋₂

The Prob-value associated with both statistics is the same (approximately 0.001 in this case), so both tests reject the null at the same level of significance. However, in multiple regression, with more than one explanatory variable, the relationship no longer holds and the tests fulfil different roles, as we shall see in the next chapter.
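The equivalence of the two tests in simple regression is easy to verify numerically. A Python sketch using the chapter's figures (an illustration, not code from the book):

```python
import math

# In simple regression the F statistic equals the square of the t statistic
# on the slope. Check with the chapter's (slightly rounded) numbers.
n, sxx = 12, 49.37
b = -2.70002               # unrounded slope estimate
tss, ess = 530.667, 170.754

sb = math.sqrt((ess / (n - 2)) / sxx)    # standard error of b
t = b / sb                               # test of H0: beta = 0
f = ((tss - ess) / 1) / (ess / (n - 2))  # equation (7.27)

print(round(t, 2), round(f, 2))  # -4.59 21.08
assert abs(f - t ** 2) < 0.01    # F = t^2, up to rounding of the inputs
```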
Exercise 7.6
(a) For the regression of the birth rate on the income ratio, calculate the standard errors of the coefficients and hence construct 95% confidence intervals for both.
(b) Test the hypothesis that the slope coefficient is zero against the alternative that it is not zero.
(c) Test the hypothesis H₀: R² = 0.
Interpreting computer output

Having shown how to use the appropriate formulae to derive estimates of the parameters and their standard errors, and to test hypotheses, we now present all these results as they would be generated by a computer software package, in this case Excel. This removes all the effort of calculation and allows us to concentrate on more important issues, such as the interpretation of the results. Table 7.6 shows the computer output.

The table presents all the results we have already derived, plus a few more.

● The regression coefficients, standard errors and t ratios are given at the bottom of the table, suitably labelled. The column headed 'P value' (this is how Excel refers to the Prob-value discussed in Chapter 5) gives some additional information – it shows the significance level of the t statistic. For example, the slope coefficient is significant at the level of 0.1%,² i.e. there is this probability of getting such a sample estimate by chance. This is much less than our usual 5% criterion, so we conclude that the sample evidence did not arise by chance.
● The program helpfully calculates the 95% confidence intervals for the coefficients also, which were derived above in equation (7.23).
● Moving up the table, there is a section headed ANOVA (Analysis of Variance). This is similar to the ANOVA covered in Chapter 6. This table provides the sums of squares values (RSS, ESS and TSS, in that order) and their associated degrees of freedom in the 'df' column. The 'MS' ('mean square') column calculates the sums of squares each divided by their degrees of freedom, whose ratio gives the F statistic in the next column. This is the value calculated in equation (7.28). The 'Significance F' value is similar to the P value discussed previously: it shows the level at which the F statistic is significant (0.1% in this case) and saves us looking up the F tables.
● At the top of the table are given the R² value and the standard error of the error term, s_e, labelled 'Standard Error', which we have already come across. 'Multiple R' is simply the square root of R². 'Adjusted R²' (sometimes called 'R-bar squared' and written R̄²) adjusts the R² value for the degrees of freedom. This is an alternative measure of fit, which is not affected by the number of explanatory variables, unlike R². See Maddala (2001), Chapter 4 for a more detailed explanation.
Table 7.6 Regression analysis output using Excel

² This is the area in both tails, so it is for a two-tail test.
Prediction

Earlier we showed that the regression line could be used for prediction, using the figures for Costa Rica. The point estimate of Costa Rica's birth rate is calculated simply by putting its growth rate into the regression equation and assuming a zero value for the error, i.e.

Ŷ = 40.711 − 2.7 × 3 + 0 = 32.6

This is a point estimate, which is unbiased, around which we can build a confidence interval. There are, in fact, two confidence intervals we can construct: the first for the position of the regression line at X = 3, the second for an individual observation (on Y) at X = 3. Using the 95% confidence level, the first interval is given by the formula

Ŷ ± tₙ₋₂ × s_e × √(1/n + (X_P − X̄)²/∑(X − X̄)²)   (7.29)

where X_P is the value of X for which the prediction is made. tₙ₋₂ denotes the critical value of the t distribution at the 5% significance level (two-tail test) with n − 2 degrees of freedom. With s_e = √17.0754 = 4.132, this evaluates to

32.6 ± 2.228 × 4.132 × √(1/12 + (3 − 3.35)²/49.37) = [29.90, 35.30]

This means that we predict with 95% confidence that the average birth rate of all countries growing at 3% p.a. is between 29.9 and 35.3.

The second type of interval, for the value of Y itself at X_P = 3, is somewhat wider, because there is an additional element of uncertainty: individual countries do not lie on the regression line, but around it. This is referred to as the 95% prediction interval. The formula for this interval is

Ŷ ± tₙ₋₂ × s_e × √(1 + 1/n + (X_P − X̄)²/∑(X − X̄)²)   (7.30)

Note the extra '1' inside the square root sign. When evaluated, this gives a 95% prediction interval of [23.01, 42.19]. Thus we are 95% confident that an individual country growing at 3% p.a. will have a birth rate within this range.
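Both intervals can be evaluated in a few lines of Python from the quantities in the text (an illustrative sketch, not the book's own code):

```python
import math

# 95% confidence interval (7.29) and prediction interval (7.30)
# for the birth rate at a growth rate of 3% p.a.
n, xbar, sxx = 12, 3.35, 49.37
se = math.sqrt(17.0754)  # standard error of the regression
t_crit = 2.228           # 5% two-tail critical value, 10 d.f.
x_p = 3
y_hat = 40.711 - 2.7 * x_p  # point prediction, about 32.6

# Half-width of the interval for the line (7.29)...
hw_ci = t_crit * se * math.sqrt(1 / n + (x_p - xbar) ** 2 / sxx)
# ...and for an individual observation (7.30): note the extra 1
hw_pi = t_crit * se * math.sqrt(1 + 1 / n + (x_p - xbar) ** 2 / sxx)

print(round(y_hat - hw_ci, 1), round(y_hat + hw_ci, 1))  # 29.9 35.3
print(round(y_hat - hw_pi, 1), round(y_hat + hw_pi, 1))  # 23.0 42.2
```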
The two intervals are illustrated in Figure 7.8. The smaller confidence interval is shown in a darker shade, with the wider prediction interval being about twice as big. Note from the formulae that the prediction is more precise (the interval is smaller):

● the closer the sample observations lie to the regression line (smaller s_e)
● the greater the spread of sample X values (larger ∑(X − X̄)²)
● the larger the sample size
● the closer to the mean of X the prediction is made (smaller X_P − X̄).

This last characteristic is evident in the diagram, where the intervals are narrower towards the centre of the diagram.

There is an additional danger in predicting far outside the range of sample X values, if the true regression line is not linear as we have assumed. The linear sample regression line might be close to the true line within the range of sample X values but diverge substantially outside it. Figure 7.9 illustrates this point. In the birth rate sample we have a fairly wide range of X values: few countries grow more slowly than Senegal or faster than Korea.
Exercise 7.7
Use Excel's regression tool to confirm your answers to Exercises 7.4 to 7.6.

Exercise 7.8
(a) Predict (point estimate) the birth rate for a country with an income ratio of 10.
(b) Find the 95% confidence interval prediction for a typical country with IR = 10.
(c) Find the 95% confidence interval prediction for an individual country with IR = 10.

Figure 7.8: Confidence and prediction intervals
Figure 7.9: The danger of prediction outside the range of sample data
Worked example 7.2

We continue the previous worked example, completing the calculations needed for regression. The previous table contains most of the preliminary calculations. To find the regression line we use

b = (n∑XY − ∑X∑Y)/(n∑X² − (∑X)²) = (6 × 682 − 33 × 119)/(6 × 199 − 33²) = 1.57

and

a = 19.83 − 1.57 × 5.5 = 11.19

Hence we obtain the equation

Yᵢ = 11.19 + 1.57Xᵢ + eᵢ
For inference we start with the sums of squares:

TSS = ∑(Yᵢ − Ȳ)² = ∑Yᵢ² − nȲ² = 2427 − 6 × 19.83² = 66.83
ESS = ∑(Yᵢ − Ŷᵢ)² = ∑Yᵢ² − a∑Yᵢ − b∑XᵢYᵢ = 2427 − 11.19 × 119 − 1.57 × 682 = 23.62
RSS = TSS − ESS = 66.83 − 23.62 = 43.21

We then obtain R² = RSS/TSS = 43.21/66.83 = 0.647, or 64.7% of the variation in Y explained by variation in X.
To obtain the standard errors of the coefficients, we first calculate the error variance as s_e² = ESS/(n − 2) = 23.62/4 = 5.905, and the estimated variance of the slope coefficient is

s_b² = s_e²/∑(X − X̄)² = 5.905/17.50 = 0.338

and the standard error of b is therefore √0.338 = 0.581.

Similarly for a we obtain

s_a² = s_e² × (1/n + X̄²/∑(X − X̄)²) = 5.905 × (1/6 + 5.5²/17.50) = 11.19

and the standard error of a is therefore 3.34.

Confidence intervals for a and b can be constructed using the critical value of the t distribution, 2.776 (5%, ν = 4), yielding 1.57 ± 2.776 × 0.581 = [−0.04, 3.16] for b and [1.90, 20.47] for a. Note that zero is inside the confidence interval for b. This is also reflected in the test of H₀: β = 0, which is

t = (1.57 − 0)/0.581 = 2.71

which falls short of the two-tailed critical value, 2.776. Hence H₀ cannot be rejected.
The F statistic to test H₀: R² = 0 is

F = (RSS/1)/(ESS/(n − 2)) = (43.21/1)/(23.62/(6 − 2)) = 7.32
which compares to a critical value of F₁,₄ of 7.71, so again the null cannot be rejected (remember that this test and the t test on the slope coefficient are equivalent in simple regression).

We shall predict the value of Y for a value of X = 10, yielding Ŷ = 11.19 + 1.57 × 10 = 26.90. The 95% confidence interval for this prediction is calculated using equation (7.29), which gives (with s_e = √5.905 = 2.43)

26.90 ± 2.776 × 2.43 × √(1/6 + (10 − 5.5)²/17.50) = [19.14, 34.66]

The 95% prediction interval for an actual observation at X = 10 is given by (7.30), resulting in

26.90 ± 2.776 × 2.43 × √(1 + 1/6 + (10 − 5.5)²/17.50) = [16.62, 37.18]
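The whole worked example can be reproduced from the summary sums in a short Python sketch (an illustration, not the book's own code):

```python
# Worked example 7.2 end-to-end, from the summary sums quoted in the text
# (n = 6, with sums of X, Y, X^2, Y^2 and XY).
n = 6
sx, sy = 33, 119
sx2, sy2, sxy = 199, 2427, 682

b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)  # slope
a = sy / n - b * sx / n                        # intercept

tss = sy2 - sy ** 2 / n          # total sum of squares
ess = sy2 - a * sy - b * sxy     # error sum of squares
rss = tss - ess                  # regression sum of squares
r2 = rss / tss
f = (rss / 1) / (ess / (n - 2))  # F statistic, equation (7.27)

print(round(b, 2), round(a, 2), round(r2, 3), round(f, 2))
# 1.57 11.19 0.647 7.32
```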
Units of measurement

The measurement and interpretation of the regression coefficients depend upon the units in which the variables are measured. For example, suppose we had measured the birth rate in births per hundred (not thousand) of population; what would be the implications? Obviously nothing fundamental is changed; we ought to obtain the same qualitative result, with the same interpretation. However, the regression coefficients cannot remain the same: if the slope coefficient remained b = −2.7, this would mean that an increase in the growth rate of one percentage point reduces the birth rate by 2.7 births per hundred, which is clearly wrong. The right answer should be 0.27 births per hundred (equivalent to 2.7 per thousand), so the coefficient should change to b = −0.27. Thus, in general, the sizes of the coefficients depend upon the units in which the variables are measured. This is why one cannot judge the importance of a regression equation from the size of the coefficients alone.
It is easiest to understand this in graphical terms. A graph of the data will look exactly the same, except that the scale on the Y-axis will change: it will be divided by 10. The intercept of the regression line will therefore change to a = 4.0711 and the slope to b = −0.27. Thus the regression equation becomes

Yᵢ = 4.0711 − 0.27Xᵢ + eᵢ′,  where eᵢ′ = eᵢ/10
Since nothing fundamental has altered, any hypothesis test must yield the same test statistic. Thus t and F statistics are unaltered by changes in the units of measurement, nor is R² altered. However, standard errors will be divided by 10 (they have to be, to preserve the t statistics; see equation (7.25), for example). Table 7.7 sets out the effects of changes in the units of measurement upon the
coefficients and standard errors. In the table it is assumed that the variables have been multiplied by a constant k (in the above case k = 1/10 was used).

It is important to be aware of the units in which the variables are measured. If not, it is impossible to know how large the effect of X upon Y is. It may be statistically significant, but you have no idea of how important it is. This may occur if, for instance, one of the variables is presented as an index number (see Chapter 10) rather than in the original units.
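The rescaling effect is easy to demonstrate numerically: dividing Y by 10 scales the sums involving Y by 10, and the slope follows suit. A sketch under that assumption (illustrative, not code from the book):

```python
# Effect of units of measurement: dividing Y by a factor of 10 divides the
# slope (and, likewise, the intercept and standard errors) by 10, leaving
# t, F and R-squared unchanged. Sums of Y and XY scale with Y.
n, sum_x, sum_x2 = 12, 40.2, 184.04
sum_y, sum_xy = 380, 1139.70

def slope(sy, sxy):
    return (n * sxy - sum_x * sy) / (n * sum_x2 - sum_x ** 2)

b_original = slope(sum_y, sum_xy)            # births per thousand
b_rescaled = slope(sum_y / 10, sum_xy / 10)  # births per hundred

print(round(b_original, 2), round(b_rescaled, 3))  # -2.7 -0.27
```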
How to avoid measurement problems: calculating the elasticity

A neat way of avoiding the problems of measurement is to calculate the elasticity, i.e. the proportionate change in Y divided by the proportionate change in X. The proportionate changes are the same whatever units the variables are measured in. The proportionate change in X is given by ΔX/X, where ΔX indicates the change in X. Thus if X changes from 100 to 110, the proportionate change is ΔX/X = 10/100 = 0.1, or 10%. The elasticity, η, is therefore given by

η = (ΔY/Y)/(ΔX/X) = (ΔY/ΔX) × (X/Y)   (7.31)

The second form of the equation is more useful, since ΔY/ΔX is simply the slope coefficient b. We simply need to multiply this by the ratio X/Y, therefore. But what values should be used for X and Y? The convention is to use the means, so we obtain the following formula for the elasticity from a linear regression equation:

η = b × X̄/Ȳ   (7.32)

This evaluates to −2.7 × 3.35/31.67 = −0.29. This is interpreted as follows: a 1% increase in the growth rate would lead to a 0.29% decrease in the birth rate. Equivalently, and perhaps a little more usefully, a 10% rise in growth (from, say, 3% to 3.3% p.a.) would lead to a 2.9% decline in the birth rate (e.g. from 30 to 29.13). This result is the same whatever units the variables X and Y are measured in. Note that this elasticity is measured at the means; it would have a different value at different points along the regression line. Later on we show an alternative method for estimating the elasticity, in this case the elasticity of demand, which is familiar in economics.
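Equation (7.32) is a one-line calculation. A minimal sketch using the sample means from the text:

```python
# Elasticity at the means (equation 7.32): eta = b * mean(X) / mean(Y).
b = -2.7
xbar, ybar = 3.35, 31.67  # sample means of growth rate and birth rate

eta = b * xbar / ybar
print(round(eta, 2))  # -0.29
```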
Table 7.7 The effects of data transformations

Factor k multiplying . . .      Effect upon
Y       X                       a and s_a            b and s_b
k       1                       Multiplied by k      Multiplied by k
1       k                       Unchanged            Divided by k
k       k                       Multiplied by k      Unchanged

Non-linear transformations

So far only linear regression has been dealt with, that is, fitting a straight line to the data. This can sometimes be restrictive, especially when there is good reason
to believe that the true relationship is non-linear, e.g. the labour supply curve.
Poor results would be obtained by fitting a straight line through the data in
Figure 7.10, yet the shape of the relationship seems clear at a glance.
Fortunately this problem can be solved by transforming the data so that, when
graphed, a linear relationship between the two variables appears. Then a
straight line can be fitted to these transformed data. This is equivalent to fitting
a curved line to the original data. All that is needed is to find a suitable
transformation to 'straighten out' the data. Given the data represented in Figure 7.10,
if Y were graphed against 1/X the relationship shown in Figure 7.11 would appear.
Thus if the regression line

Yi = a + b(1/Xi) + ei    (7.33)

were fitted, this would provide a good representation of the data in Figure 7.10.
The procedure is straightforward. First calculate the reciprocal of each of the X
values, and then use these, together with the original data for Y, using exactly
the same methods as before.

Figure 7.10 Graph of Y against X
Figure 7.11 Figure 7.10 transformed: Y against 1/X

This transformation appears inappropriate for the
birth rate data (see Figure 7.1) but serves as an illustration. The transformed X
values are 0.196 (= 1/5.1) for Brazil, 0.3125 (= 1/3.2) for Colombia, etc. The
resulting regression equation is

Yi = 31.92 − 3.96 × (1/Xi) + ei    (7.34)
       s.e.    (1.64)    (1.56)
       R² = 0.39, F = 6.44, n = 12
This appears worse than the original specification: the R² is low and the slope
coefficient is not significantly different from zero, so the transformation does
not appear to be a good one. Note also that it is difficult to calculate the effect
of X upon Y in this equation. We can see that a unit increase in 1/X reduces the
birth rate by 3.96, but we do not have an intuitive feel for the inverse of the
growth rate. This latest result also implies that a fall in the growth rate (hence
1/X rises) lowers the birth rate – the converse of our previous result. In the next
chapter we deal with a different example where a non-linear transformation
does improve matters.
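The mechanics of fitting equation 7.33 are simply: transform X to 1/X, then run an ordinary simple regression of Y on the transformed variable. A sketch in Python; the data here are made up for illustration (chosen to lie exactly on the curve Y = 2 + 3/X), not Todaro's:

```python
def ols(x, y):
    """Simple OLS slope and intercept via the standard formulae."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

x = [1.0, 2.0, 4.0, 5.0]
y = [2 + 3 / v for v in x]      # hypothetical data on the curve Y = 2 + 3/X

x_recip = [1 / v for v in x]    # the 'straightening' transformation
a, b = ols(x_recip, y)
print(round(a, 6), round(b, 6))  # recovers 2.0 and 3.0
```

The fitted line in (1/X, Y) space corresponds to a curve in the original (X, Y) space, which is the point of the transformation.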
Table 7.8 presents a number of possible shapes for data, with suggested data
transformations which will allow the relationship to be estimated using linear
regression. In each case, once the data have been transformed, the methods and
formulae used above can be applied.

It is sometimes difficult to know which transformation, if any, to apply.
A graph of the data is unlikely to be as tidy as the diagrams in Table 7.8.
Table 7.8 Data transformations

Name          Original relationship    Transformed relationship        Regression
Double log    Y = aX^b e               ln Y = ln a + b ln X + ln e     ln Y on ln X
Reciprocal    Y = a + b/X + e          Y = a + b(1/X) + e              Y on 1/X
Semi-log      e^Y = aX^b e             Y = ln a + b ln X + ln e        Y on ln X
Exponential   Y = e^(a+bX+e)           ln Y = a + bX + e               ln Y on X
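Each transformation in Table 7.8 is applied mechanically before running a standard regression. A sketch of the double log case in Python, on made-up data lying exactly on Y = 2X^0.5; note the guard against non-positive values, which is exactly why this transformation fails for data such as Senegal's negative growth rate:

```python
import math

def ols(x, y):
    """Simple OLS: return intercept and slope."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return sy / n - b * sx / n, b

x = [1.0, 2.0, 4.0, 8.0]
y = [2 * v ** 0.5 for v in x]   # hypothetical data on Y = 2 * X^0.5

if any(v <= 0 for v in x + y):
    raise ValueError("the log transformation needs strictly positive data")

ln_a, b = ols([math.log(v) for v in x], [math.log(v) for v in y])
print(round(b, 6), round(math.exp(ln_a), 6))  # recovers 0.5 and 2.0
```

Regressing ln Y on ln X recovers b directly, and the intercept is ln a, so a is recovered by exponentiating.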
Economic theory rarely suggests the form which a relationship should follow,
and there are no simple statistical tests for choosing between alternative formulations.
The choice can sometimes be made after visual inspection of the data, or on the
basis of convenience. The double log transformation is often used in economics
as it has some very convenient properties. Unfortunately it cannot be used with
the growth rate data here because Senegal's growth rate was negative: it is
impossible to take the logarithm of a negative number. We therefore postpone
the use of the log transformation in regression until the next chapter.

Exercise 7.9
a Calculate the elasticity of the birth rate with respect to the income ratio, using
the results of previous exercises.
b Give a brief interpretation of the meaning of this figure.

Exercise 7.10
Calculate a regression relating the birth rate to the inverse of the income ratio, 1/IR.
Summary
● Correlation refers to the extent of association between two variables. The
sample correlation coefficient is a measure of this association, extending
from r = −1 to r = +1.
● Positive correlation (r > 0) exists when high values of X tend to be associated
with high values of Y, and low X values with low Y values.
● Negative correlation (r < 0) exists when high values of X tend to be associated
with low values of Y, and vice versa.
● Values of r around 0 indicate an absence of correlation.
● As the sample correlation coefficient is a random variable, we can test for its
significance, i.e. test whether the true value is zero or not. This test is based
upon the t distribution.
● The existence of correlation, even if 'significant', does not necessarily imply
causality. There can be other reasons for the observed association.
● Regression analysis extends correlation by asserting a causality from X to Y
and then measuring the relationship between the variables via the regression
line, the 'line of best fit'.
● The regression line Y = a + bX is defined by the intercept a and slope
coefficient b. Their values are found by minimising the sum of squared errors
around the regression line.
● The slope coefficient b measures the responsiveness of Y to changes in X.
● A measure of how well the regression line fits the data is given by the
coefficient of determination, R², varying between 0 (very poor fit) and 1 (perfect fit).
● The coefficients a and b are unbiased point estimates of the true values of the
parameters. Confidence interval estimates can be obtained, based on the t
distribution. Hypothesis tests on the parameters can also be carried out using
the t distribution.
● A test of the hypothesis R² = 0 (implying the regression is no better at
predicting Y than simply using the mean of Y) can be carried out using the
F distribution.
● The regression line may be used to predict Y for any value of X by assuming
the residual to be zero for that observation.
● The measured response of Y to X, given by b, depends upon the units of
measurement of X and Y. A better measure is often the elasticity, which is the
proportionate response of Y to a proportionate change in X.
● Data are often transformed prior to regression (e.g. by taking logs) for a
variety of reasons (e.g. to fit a curve to the original data).
References

G. S. Maddala, Introduction to Econometrics, 3rd edn, Wiley, 2001.
M. P. Todaro, Economic Development for a Developing World, 3rd edn, Financial Times Prentice Hall, 1992.

Key terms and concepts

autocorrelation
correlation coefficient
coefficient of determination (R²)
coefficient of rank correlation
dependent (endogenous) variable
elasticity
error sum of squares
error term (or residual)
independent (exogenous) variable
intercept
prediction
regression line (or equation)
regression sum of squares
slope
standard error
t ratio
total sum of squares
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
7.1 The other data which Todaro might have used to analyse the birth rate were:
Country Birth rate GNP Growth Income ratio
Bangladesh 47 140 0.3 2.3
Tanzania 47 280 1.9 3.2
Sierra Leone 46 320 0.4 3.3
Sudan 47 380 −0.3 3.9
Kenya 55 420 2.9 6.8
Indonesia 35 530 4.1 3.4
Panama 30 1910 3.1 8.6
Chile 25 2560 0.7 3.8
Venezuela 35 4220 2.4 5.2
Turkey 33 1540 3.5 4.9
Malaysia 31 1840 4.3 5.0
Nepal 44 150 0.0 4.7
Malawi 56 200 2.7 2.4
Argentina 20 2560 1.9 3.6
(Use one of the three possible explanatory variables; in class, different groups could
examine each of the variables.)
a Draw an XY chart of the data above and comment upon the result.
b Would you expect a line of best fit to have a positive or negative slope? Roughly what
would you expect the slope to be?
c What would you expect the correlation coefficient to be?
d Calculate the correlation coefficient and comment.
e Test to see if the correlation coefficient is different from zero. Use the 95% confidence
level.
Analysis of this problem continues in Problem 7.5.
7.2 The data below show consumption of margarine (in ounces per person per week) and its
real price, for the UK.
Year Consumption Price Year Consumption Price
1970 2.86 125.6 1980 3.83 104.2
1971 3.15 132.9 1981 4.11 95.5
1972 3.52 126.0 1982 4.33 88.1
1973 3.03 119.6 1983 4.08 88.9
1974 2.60 138.8 1984 4.08 97.3
1975 2.60 141.0 1985 3.76 100.0
1976 3.06 122.3 1986 4.10 86.7
1977 3.48 132.7 1987 3.98 79.8
1978 3.54 126.7 1988 3.78 79.9
1979 3.63 115.7
Problems
a Draw an XY plot of the data and comment.
b From the chart, would you expect the line of best fit to slope up or down? In theory,
which way should it slope?
c What would you expect the correlation coefficient to be, approximately?
d Calculate the correlation coefficient between margarine consumption and its price.
e Is the coefficient significantly different from zero? What is the implication of the result?
The following totals will reduce the burden of calculation: ∑Y = 67.52, ∑X = 2101.70,
∑Y² = 245.055, ∑X² = 240 149.27, ∑XY = 7299.638 (Y is consumption, X is price). If you wish
you could calculate a logarithmic correlation. The relevant totals are: ∑y = 23.88, ∑x =
89.09, ∑y² = 30.45, ∑x² = 418.40, ∑xy = 111.50, where y = ln Y and x = ln X.
Analysis of this problem continues in Problem 7.6.
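The totals above are all that the correlation formula needs. A sketch of the calculation in Python (the value printed is simply the standard formula applied to the quoted sums, not a figure taken from the book's answers):

```python
import math

def r_from_sums(n, sx, sy, sxx, syy, sxy):
    """Correlation coefficient from precomputed sums:
    r = (n*SXY - SX*SY) / sqrt((n*SXX - SX^2) * (n*SYY - SY^2))."""
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return num / den

# Totals quoted in Problem 7.2 (n = 19 annual observations, 1970-1988)
r = r_from_sums(19, sx=2101.70, sy=67.52, sxx=240149.27, syy=245.055, sxy=7299.638)
print(round(r, 2))  # about -0.85: consumption and price are negatively correlated
```

The negative sign is what demand theory leads us to expect: consumption tends to be high when the real price is low.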
7.3 What would you expect to be the correlation coefficient between the following variables?
Should the variables be measured contemporaneously, or might there be a lag in the
effect of one upon the other?
a Nominal consumption and nominal income.
b GDP and the imports/GDP ratio.
c Investment and the interest rate.
7.4 As Problem 7.3, for:
a real consumption and real income
b individuals' alcohol and cigarette consumption
c UK and US interest rates.
7.5 Using the data from Problem 7.1, calculate the rank correlation coefficient between the
variables and test its significance. How does it compare with the ordinary correlation
coefficient?
7.6 Calculate the rank correlation coefficient between price and quantity for the data in
Problem 7.2. How does it compare with the ordinary correlation coefficient?
7.7 a For the data in Problem 7.1, find the estimated regression line and calculate the R²
statistic. Comment upon the result. How does it compare with Todaro's findings?
b Calculate the standard error of the estimate and the standard errors of the
coefficients. Is the slope coefficient significantly different from zero? Comment upon the result.
c Test the overall significance of the regression equation and comment.
d Taking your own results and Todaro's, how confident do you feel that you understand
the determinants of the birth rate?
e What do you think will be the result of estimating your equation using all 26 countries'
data? Try it! What do you conclude?
7.8 a For the data given in Problem 7.2, estimate the sample regression line and calculate
the R² statistic. Comment upon the results.
b Calculate the standard error of the estimate and the standard errors of the
coefficients. Is the slope coefficient significantly different from zero? Is demand inelastic?
c Test the overall significance of the regression and comment upon your result.
7.9 From your results for the birth rate model, predict the birth rate for a country with either
(a) GNP equal to 3000, (b) a growth rate of 3% p.a., or (c) an income ratio of 7. How does
your prediction compare with one using Todaro's results? Comment.
7.10 Predict margarine consumption given a price of 70. Use the 99% confidence level.
7.11 (Project) Update Todaro's study using more recent data.
7.12 Try to build a model of the determinants of infant mortality. You should use cross-section
data for 20 countries or more, and should include both developing and developed
countries in the sample.
Write up your findings in a report which includes the following sections: discussion of
the problem, data gathering and transformations, estimation of the model, interpretation
of results. Useful data may be found in the Human Development Report (use Google to
find it online).
Answers to exercises
Exercise 7.1
a The calculation is:

Country        Birth rate Y   Income ratio X      Y²        X²       XY
Brazil              30             9.5            900      90.25    285
Colombia            29             6.8            841      46.24    197.2
Costa Rica          30             4.6            900      21.16    138
India               35             3.1           1225       9.61    108.5
Mexico              36             5             1296      25       180
Peru                36             8.7           1296      75.69    313.2
Philippines         34             3.8           1156      14.44    129.2
Senegal             48             6.4           2304      40.96    307.2
South Korea         24             2.7            576       7.29     64.8
Sri Lanka           27             2.3            729       5.29     62.1
Taiwan              21             3.8            441      14.44     79.8
Thailand            30             3.3            900      10.89     99
Totals             380            60            12 564    361.26   1964

Hence

r = (12 × 1964 − 60 × 380)/√[(12 × 361.26 − 60²) × (12 × 12 564 − 380²)] = 0.355

c As for a, except ∑X = 0.6, ∑Y = 38, ∑X² = 0.036126, ∑Y² = 125.64, ∑XY = 1.964. Hence

r = (12 × 1.964 − 0.6 × 38)/√[(12 × 0.036126 − 0.6²) × (12 × 125.64 − 38²)] = 0.355

Exercise 7.2
a t = 0.355 × √(12 − 2)/√(1 − 0.355²) = 1.20
b The Prob-value for a two-tailed test is 0.257, or 25.7%, so we do not reject the null
of no correlation.
Exercise 7.3
a The calculation is:

Country        Y     X     Rank Y   Rank X   (Rank Y)²   (Rank X)²   Rank Y × Rank X
Brazil         30    9.5     7        1         49           1              7
Colombia       29    6.8     9        3         81           9             27
Costa Rica     30    4.6     7        6         49          36             42
India          35    3.1     4       10         16         100             40
Mexico         36    5       2.5      5          6.25       25             12.5
Peru           36    8.7     2.5      2          6.25        4              5
Philippines    34    3.8     5        7.5       25          56.25          37.5
Senegal        48    6.4     1        4          1          16              4
South Korea    24    2.7    11       11        121         121            121
Sri Lanka      27    2.3    10       12        100         144            120
Taiwan         21    3.8    12        7.5      144          56.25          90
Thailand       30    3.3     7        9         49          81             63
Totals                      78       78        647.5       649.5          569

Hence

r_s = (12 × 569 − 78 × 78)/√[(12 × 649.5 − 78²) × (12 × 647.5 − 78²)] = 0.438
b This is less than the critical value of 0.591 so the null of no rank correlation
cannot be rejected.
c Reversing the rankings should not alter the result of the calculation.
Exercise 7.4
a Using the data and calculations in the answer to Exercise 7.1 we obtain:
b = (12 × 1964 − 60 × 380)/(12 × 361.26 − 60²) = 1.045
a = 31.67 − 1.045 × 5 = 26.443
b A unit increase in the measure of inequality leads to approximately one additional
birth per 1000 mothers. The constant has no useful interpretation: the
income ratio cannot be zero (in fact, it cannot be less than 0.5).
Exercise 7.5
a TSS = ∑(Yi − Ȳ)² = ∑Yi² − nȲ² = 12 564 − 12 × 31.67² = 530.667
ESS = ∑(Yi − Ŷi)² = ∑Yi² − a∑Yi − b∑XiYi = 12 564 − 26.443 × 380 − 1.045 × 1964 = 463.804
RSS = TSS − ESS = 530.667 − 463.804 = 66.863
R² = RSS/TSS = 66.863/530.667 = 0.126.
b This is the square of the correlation coefficient calculated earlier (r = 0.355, and
0.355² = 0.126).
Exercise 7.6
a s_e² = ESS/(n − 2) = 463.804/10 = 46.3804

and so

s_b² = s_e²/(∑Xi² − nX̄²) = 46.3804/61.26 = 0.757

and

s_b = √0.757 = 0.870

For a the estimated variance is

s_a² = s_e² × (1/n + X̄²/(∑Xi² − nX̄²)) = 46.3804 × (1/12 + 5²/61.26) = 22.793

and hence s_a = 4.774. The 95% CIs are therefore 1.045 ± 2.228 × 0.870 = [−0.894,
2.983] for b and 26.443 ± 2.228 × 4.774 = [15.806, 37.081] for a.

b t = (1.045 − 0)/0.870 = 1.201. Not significant.

c F = (RSS/1)/(ESS/(n − 2)) = (66.863/1)/(463.804/10) = 1.44
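The whole chain of calculations in Exercises 7.4 to 7.6 can be checked in a few lines of Python. The data are the income ratio (X) and birth rate (Y) figures from the Exercise 7.1 table:

```python
import math

X = [9.5, 6.8, 4.6, 3.1, 5.0, 8.7, 3.8, 6.4, 2.7, 2.3, 3.8, 3.3]
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
n = len(X)

sx, sy = sum(X), sum(Y)
sxx = sum(v * v for v in X)
sxy = sum(u * v for u, v in zip(X, Y))

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)          # slope, approx 1.045
a = sy / n - b * sx / n                                # intercept, approx 26.443

ess = sum((y - a - b * x) ** 2 for x, y in zip(X, Y))  # error sum of squares
tss = sum((y - sy / n) ** 2 for y in Y)                # total sum of squares
r2 = 1 - ess / tss                                     # approx 0.126

se2 = ess / (n - 2)                                    # variance of the error term
sb = math.sqrt(se2 / (sxx - sx ** 2 / n))              # s.e. of b, approx 0.870
t = b / sb                                             # approx 1.20, not significant

print(round(b, 3), round(a, 3), round(r2, 3), round(sb, 3), round(t, 3))
```

Small differences from the printed answers can arise from rounding intermediate values; the code carries full precision throughout.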
Exercise 7.7
Excel should give the same answers.
Exercise 7.8
a B = 26.44 + 1.045 × 10 = 36.9.
b 36.9 ± 2.228 × 6.81 × √(1/12 + (10 − 5)²/61.26), giving the interval [26.3, 47.5].
c 36.9 ± 2.228 × 6.81 × √(1 + 1/12 + (10 − 5)²/61.26), giving the interval [18.4, 55.4].
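The two intervals in parts b and c differ only by the extra '1 +' under the square root (predicting an individual observation rather than the mean of Y at that X value). A quick check of the arithmetic, using the rounded figures from the text (t = 2.228, s_e = 6.81, X̄ = 5, n = 12); the function name is ours:

```python
import math

def interval(y_hat, t, se, n, x0, x_bar, ssx, individual):
    """Confidence interval for a prediction at X = x0.
    ssx is sum(X^2) - n * X-bar^2; individual=True adds the extra 1 under the root."""
    extra = 1.0 if individual else 0.0
    half = t * se * math.sqrt(extra + 1 / n + (x0 - x_bar) ** 2 / ssx)
    return y_hat - half, y_hat + half

lo_b, hi_b = interval(36.9, 2.228, 6.81, 12, x0=10, x_bar=5, ssx=61.26, individual=False)
print(round(lo_b, 1), round(hi_b, 1))   # 26.3 47.5
lo_c, hi_c = interval(36.9, 2.228, 6.81, 12, x0=10, x_bar=5, ssx=61.26, individual=True)
print(round(lo_c, 1), round(hi_c, 1))   # 18.4 55.4
```

Note how much wider the interval for an individual observation is: the error term e adds uncertainty beyond that of the fitted line itself.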
Exercise 7.9
a e = 1.045 × 5/31.67 = 0.165
b A 10% rise in the inequality measure (e.g. from 4 to 4.4) raises the birth rate by
1.65% (e.g. from 30 to 30.49).
Exercise 7.10
BR = 38.82 − 29.61 × (1/IR) + e
         s.e.   (19.0)
R² = 0.19, F(1,10) = 2.43.
The regression is rather poor and the F statistic is not significant.
8 Multiple regression

Contents
Learning outcomes 279
Introduction 280
Principles of multiple regression 281
What determines imports into the UK 282
Theoretical issues 282
Data 283
Data transformations 284
Estimation 288
The signiﬁcance of the regression as a whole 290
Are the results satisfactory 291
Improving the model – using logarithms 292
Testing the accuracy of the forecasts: the Chow test 295
Analysis of the errors 296
Finding the right model 300
Testing compound hypotheses 302
Omitted variable bias 303
Dummy variables and trends 304
Multicollinearity 306
Measurement error 306
Some ﬁnal advice on regression 307
Summary 307
Key terms and concepts 308
Reference 308
Problems 309
Answers to exercises 313
Learning outcomes

By the end of this chapter you should be able to:
● understand the extension of simple regression to multiple regression, with
more than one explanatory variable
● use computer software to calculate a multiple regression equation and interpret
its output
● recognise the role of economic theory in deriving an appropriate regression
equation
● interpret the effect of each explanatory variable on the dependent variable
● understand the statistical significance of the results
● judge the adequacy of the model and know how to improve it.

Complete your diagnostic test for Chapter 8 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL, with
additional supporting resources.
Introduction

Simple regression is rather limited, as it assumes that there is only one explanatory
factor affecting the dependent variable, which is unlikely to be true in most
situations. Price and income affect demand, for example. Multiple regression,
the subject of this chapter, overcomes this problem by allowing there to be several
explanatory variables (though still only one dependent variable) in a model. The
techniques are an extension of those used in simple, or bivariate, regression.
Multivariate regression allows more general and more helpful models to be
estimated, although this does involve new problems as well as advantages.
The regression relationship now becomes

Y = b0 + b1X1 + b2X2 + ... + bkXk + e    (8.1)

where there are now k explanatory variables. The principles used in multiple
regression are basically the same as in the two-variable case: the coefficients
b0, ..., bk are found by minimising the sum of squared errors; a standard error can
be calculated for each coefficient; R², t ratios, etc. can be calculated; and hypothesis
tests performed. However, there are a number of additional issues which
arise and these are dealt with in this chapter.
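Although in practice the coefficients of equation 8.1 are obtained from a software package, the underlying computation is just least squares: solve the normal equations (X′X)b = X′y. A bare-bones sketch for two explanatory variables, on made-up data generated from Y = 1 + 2X1 + 3X2 so the answer is known in advance:

```python
def solve(A, v):
    """Solve the linear system A b = v by Gauss-Jordan elimination with pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 exactly
X1 = [0.0, 1.0, 0.0, 1.0, 2.0]
X2 = [0.0, 0.0, 1.0, 1.0, 1.0]
Y = [1 + 2 * a + 3 * b for a, b in zip(X1, X2)]

rows = [[1.0, a, b] for a, b in zip(X1, X2)]   # design matrix with a constant column
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * y for r, y in zip(rows, Y)) for i in range(3)]

b0, b1, b2 = solve(XtX, Xty)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # recovers 1.0, 2.0, 3.0
```

With real data the fit is of course not exact, and a package also reports the standard errors, R² and diagnostic statistics discussed below.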
The formulae for calculating coefficients, standard errors, etc. become very
complicated in multiple regression and are time-consuming and error-prone
when done by hand. For this reason these calculations are invariably done by
computer nowadays. Therefore the formulae are not given in this book: instead
we present the results of computer calculations (which you can replicate) and
concentrate on understanding and interpreting the results. This is as it should
be: the calculations themselves are the means to an end, not the end in itself.
Using spreadsheet packages

Standard spreadsheet packages, such as Excel, can perform multiple regression
analysis and are sufficient for most straightforward tasks. A regression equation
can be calculated via menus and dialogue boxes and no knowledge of the formulae
is required. However, when problems such as autocorrelation (see below) are
present, specialised packages such as TSP, Stata or PCGIVE are much easier to
use and provide more comprehensive results.
We also introduce a new example in this section: estimating a demand equation
for imports into the UK over the period 1973–2005. There are a number of reasons
for this switch (we could have continued with the birth rate example; you
are asked to do this in the exercises). First, it allows us to work through a small
'research project' from beginning to end, including the gathering of data, data
transformations, interpretation of results, etc. Second, the example uses time-series
data and this allows us to bring out some of the particular issues that arise
in such cases. Time-series data do not generally constitute a random sample of
observations, such as we have dealt with in the rest of this book. This is because
the observations are constrained to follow one another in time rather than being
randomly chosen. The proper analysis of time-series data goes far beyond the scope
of this book; however, students often want or need to analyse such data using
elementary techniques. This chapter therefore also emphasises the checking of
the adequacy of the regression equation for such data. For a fuller treatment of
the issues, the reader should consult a more advanced text, such as Maddala (2001).

Figure 8.1 The regression plane in three dimensions
Principles of multiple regression
We illustrate some of the principles involved in multiple regression using two
explanatory variables, X1 and X2. Since we are using time-series data, we replace
the subscript i with a subscript t to denote the individual observations.
The sample regression equation now becomes

Yt = b0 + b1X1t + b2X2t + et,  t = 1, ..., T    (8.2)

with three coefficients, b0, b1 and b2, to be estimated. Note that b0 now signifies
the constant. Rather than fitting a line through the data, the task is now to fit a
plane to the data in three dimensions, as shown in Figure 8.1.

The plane is drawn sloping down in the direction of X1 and up in the direction
of X2. The observations are now points dotted about in three-dimensional
space, with coordinates X1t, X2t and Yt, and the task of regression analysis is to
find the equation of the plane so as to minimise the sum of squares of vertical
distances from each point to the plane. The principle is the same as in simple
regression and the regression plane is the one that best summarises the data.

The coefficient b0 gives the intercept on the Y axis, b1 is the slope of the plane
in the direction of the X1 axis and b2 is the slope in the direction of the X2 axis.
Thus b1 gives the effect upon Y of a unit change in X1, assuming X2 remains
constant. Similarly, b2 gives the response of Y to a unit change in X2, assuming no
change in X1. If X1 and X2 both change by 1, then the effect on Y is b1 + b2. b1 and
b2 are estimates of the true parameters β1 and β2, and so standard errors and
confidence intervals can be calculated, implying that we are not absolutely
certain about the true position of the plane. In general, the smaller these standard
errors the better, since this implies less uncertainty about the true relationship
between Y and the X variables.
When there are more than two explanatory variables, more than three
dimensions are needed to draw a picture of the data. The reader will understand
that this is a difficult (if not impossible) task; however, it is possible to estimate
such a model and interpret the results in a similar manner to that set out below
for the two explanatory variable case.
What determines imports into the UK
To illustrate multiple regression we suppose that we have the job of finding
out what determines the volume of imports into the UK and whether there are
any policy implications of the result. We are given this very open-ended task,
which we have to carry through from start to finish. We end up by estimating a
demand equation, so the analysis serves as a model for any demand estimation,
for example, a firm trying to find out about the demand for its product.

How should we set about this task? The project can be broken down into the
following steps:

1 Theoretical considerations: what can economic theory tell us about the
problem, and how will this affect our estimation procedures?
2 Data gathering: what data do we need? Are there any definitional problems,
for example?
3 Data transformation: are the data suitable for the task? We might want to
transform one or more variables before estimation.
4 Estimation: this is mainly done automatically by the computer.
5 Interpretation of the results: what do the results tell us? Do they appear
satisfactory? Do we need to improve the model? Are there any policy conclusions?

Although this appears reasonably clear-cut, in practice these steps are often
mixed up. Researchers might gather the data, estimate a model and then not be
happy with the results, realising that some factors have been overlooked. They
therefore go back and obtain some different data (perhaps some new variables)
or maybe try a different method of investigation, until 'satisfactory' results are
obtained. There is usually some element of data 'mining' or 'fishing' involved.
These methodological issues are examined in more detail later on.
Theoretical issues
What does economic theory tell us about imports? Like any market, the quantity
transacted depends upon supply and demand. Strictly, therefore, we should
estimate a simultaneous equation model of both the demand and supply
equations. Since this is beyond the scope of this book (see Maddala (2001), Chapter 9,
for analyses of such models), we simplify by assuming that, as the UK is a small
economy in the world market, we can buy any quantity of imports that we demand
at the prevailing price. In other words, supply is never a constraint and the UK's
demand never influences the world price. This assumption, which seems
reasonable, means that we can concentrate on estimating the demand equation alone.

Second, economic theory suggests that demand depends upon income and
relative prices, particularly the prices of close substitutes and complements.
Furthermore, rational consumers do not suffer from money illusion, so real
variables should be used throughout.

Economic theory does not tell us some things, however. It does not tell us
whether the relationship is linear or not. Nor does it tell us whether demand
responds immediately to price or income changes, or whether there is a lag. For
these questions the data are more likely to give us the answer.
Data
The raw data are presented in Table 8.1, obtained from official UK statistics.
Note that there is some slight rounding of the figures: imports are measured to
the nearest £0.1bn (£100m), so there is a possible rounding error of up to about
0.1%. This is unlikely to substantially affect our estimates.
The variables are defined as follows:
Table 8.1 Original data for study of imports
Year Imports GDP GDP deﬂator Price of imports RPI all items
1973 18.8 74.0 24.6 21.5 25.1
1974 27.0 83.8 28.7 31.3 29.1
1975 28.7 105.9 35.7 35.6 36.1
1976 36.5 125.2 41.4 43.6 42.1
1977 42.3 145.7 47.0 50.5 48.8
1978 45.2 167.9 52.5 52.4 52.8
1979 54.2 197.4 60.6 55.8 59.9
1980 57.4 230.8 71.5 65.5 70.7
1981 60.2 253.2 79.7 71.3 79.1
1982 67.6 277.2 85.8 77.3 85.9
1983 77.4 303.0 90.3 84.2 89.8
1984 92.6 324.6 94.9 91.8 94.3
1985 98.7 355.3 100.0 96.4 100.0
1986 100.9 381.8 103.8 91.9 103.4
1987 111.4 420.2 109.0 94.7 107.7
1988 124.7 469.0 116.3 93.7 113.0
1989 142.7 514.9 124.6 97.8 121.8
1990 148.3 558.2 134.1 100.0 133.3
1991 142.1 587.1 142.9 101.3 141.1
1992 151.7 612.0 148.5 102.1 146.4
1993 170.1 642.7 152.5 112.4 148.7
1994 185.4 681.0 155.3 116.1 152.4
1995 207.2 719.7 159.4 123.6 157.6
1996 227.7 765.2 164.6 123.4 161.4
1997 232.3 811.2 169.6 115.2 166.5
1998 239.2 860.8 174.1 109.3 172.2
1999 255.2 906.6 177.8 107.6 174.8
2000 287.0 953.2 180.6 111.2 180.0
2001 299.9 997.0 184.5 110.2 183.2
2002 307.4 1048.8 189.9 107.5 186.3
2003 314.8 1110.3 195.6 106.7 191.7
2004 333.7 1176.5 201.0 106.2 197.4
2005 366.5 1224.7 205.4 110.7 202.9
● Imports (variable M): imports of goods and services into the UK, at current
prices, in £bn.
● Income (GDP): UK gross domestic product (GDP) at factor cost, at current
prices, in £bn.
● The GDP deflator (PGDP): an index of the ratio of nominal to real GDP (1985
= 100). This is an index of general price increases and may be used to transform
nominal GDP to real GDP.
● The price of imports (PM): the unit value index of imports (1990 = 100).
● The price of competing products (P): the retail price index (RPI, 1985 = 100).
These variables were chosen from a wide range of possibilities. To take income
as an example, we could use personal disposable income or GDP. Since firms as
well as consumers import goods, the wider measure is used. Then there is the
question of whether to use GDP or GNP, and whether to measure them at factor
cost or market prices. Because there is little difference between these different
magnitudes, this is not an important decision in this case. However, in a research
project one might have to consider such issues in more detail.
Data transformations
Before calculating the regression equation we must transform the data in
Table 8.1. This is because the expenditures on imports and GDP have not
been adjusted for price changes (inflation). Part of the observed increase in the
imports series is due to prices increasing over time, not increased consumption
of imported goods. It is the latter we are trying to explain.

Since expenditure on any good (including imports) can be expressed as the
quantity purchased multiplied by the price, to obtain the quantity of imports ('real'
imports) we must divide the expenditure by the price of imports. In algebraic terms,
expenditure = price × quantity, hence

quantity = expenditure/price

We therefore adjust both imports and GDP for the effect of price changes in
this way. This process is covered in more detail in Chapter 10 on index numbers;
you may wish to read that before proceeding with this chapter, although it is
not essential.

We also need to adjust the import price series, which influences the demand
for imports. People make their spending decisions by looking at the price of an
imported good relative to prices generally. Hence we divide the price of imports
by the RPI to give the relative, or real, price of imports.

In summary, the transformed variables are derived as follows:
● Real imports (M/PM): this series is obtained by dividing the nominal series for
imports by the unit value index (i.e. the import price index). The series gives
imports at 1990 prices, in £bn. Note that the nominal and real series are
identical in 1990.
● Real income (GDP/PGDP): this is the nominal GDP series divided by the GDP
deflator, to give GDP at 1990 prices, in £bn.
● Real import prices (PM/P): the unit value index is divided by the RPI to give this
series. It is an index number series with its value set to 100 in 1990. It shows
the price of imports relative to the price of all goods. The higher this price ratio,
the less attractive imports would be relative to domestically produced goods.

The transformed variables are shown in Table 8.2. Do not worry if you have
not fully understood the process of transforming to real terms. You can simply
begin with the data in Table 8.2, recognising them as the quantity of imports
demanded, the level of real income or output, and the price of imports relative
to all goods.
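The step from Table 8.1 to Table 8.2 can be reproduced directly: divide each nominal series by its price index and rebase so that 1990 = 100 (i.e. multiply by the index's 1990 value, as 'at 1990 prices' implies). A sketch using the 1973 row of Table 8.1; the 1990 index values (deflator 134.1, RPI 133.3) are taken from the 1990 row of the same table:

```python
def real_value(nominal, index, index_1990=100.0):
    """Deflate a nominal figure by a price index, rebased so that 1990 = 100."""
    return nominal / index * index_1990

# 1973 row of Table 8.1: M = 18.8, GDP = 74.0, P_GDP = 24.6, P_M = 21.5, RPI = 25.1
real_imports = real_value(18.8, 21.5)                  # P_M already has 1990 = 100
real_gdp = real_value(74.0, 24.6, index_1990=134.1)    # deflator is 134.1 in 1990
real_import_price = real_value(21.5, 25.1, index_1990=133.3)  # P_M relative to RPI

print(round(real_imports, 1), round(real_gdp, 1), round(real_import_price, 1))
# matches the 1973 row of Table 8.2: 87.4, 403.4, 114.2
```

Applying the same function to every row reproduces the whole of Table 8.2, up to rounding.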
We should now 'eyeball' the data using appropriate graphical techniques. This
will give a broad overview of the characteristics of the data, and any unusual or
erroneous observations may be spotted. This is an important step in the analysis.

Figure 8.2 shows a time-series plot of the three variables. The graph shows
that both imports and GDP increase smoothly over the period and that there
appears to be a fairly close relationship between them. This is confirmed by the
XY plot of imports and GDP in Figure 8.3, which shows an approximately linear
Table 8.2 Transformed data
Year Real imports Real GDP Real import prices
1973 87.4 403.4 114.2
1974 86.3 391.6 143.4
1975 80.6 397.8 131.5
1976 83.7 405.5 138.0
1977 83.8 415.7 137.9
1978 86.3 428.9 132.3
1979 97.1 436.8 124.2
1980 87.6 432.9 123.5
1981 84.4 426.0 120.2
1982 87.5 433.2 120.0
1983 91.9 450.0 125.0
1984 100.9 458.7 129.8
1985 102.4 476.5 128.5
1986 109.8 493.3 118.5
1987 117.6 517.0 117.2
1988 133.1 540.8 110.5
1989 145.9 554.2 107.0
1990 148.3 558.2 100.0
1991 140.3 550.9 95.7
1992 148.6 552.7 93.0
1993 151.3 565.2 100.8
1994 159.7 588.0 101.5
1995 167.6 605.5 104.5
1996 184.5 623.4 101.9
1997 201.6 641.4 92.2
1998 218.8 663.0 84.6
1999 237.2 683.8 82.1
2000 258.1 707.8 82.3
2001 272.1 724.6 80.2
2002 286.0 740.6 76.9
2003 295.0 761.2 74.2
2004 314.2 784.9 71.7
2005 331.1 799.6 72.7
relationship. Care should be taken in interpreting this, however, since it shows only the partial relationship between two of the three variables. However, it does appear to be fairly strong.
The price of imports has declined by about 35% over the period (this is relative to all goods generally), so this might also have contributed to the rise in imports. Figure 8.4 provides an XY chart of these two variables. There appears to be a clear negative relationship between imports and their price. On the basis of the graphs we might expect a positive relation between imports and GDP and a negative one between imports and their price. Both of these expectations are in line with what economic theory would predict.
Note that one does not always or even often get such neat graphs in line
with expectations. In multivariate analyses the relationships between the
Figure 8.2 A time-series plot of imports, GDP (left-hand scale) and import prices (real terms). Note: This is a multiple time-series graph, as described in Chapter 1.
Figure 8.3 XY graph of imports against GDP

variables can be complex and are not revealed by simple bivariate graphs. One
needs to do a multiple regression to uncover the true relationship.
Exercise 8.1
For the exercises in this chapter we will be looking at the determinants of travel by car in the UK, which has obviously been increasing steadily and causes concern because of issues such as pollution and congestion. Data for these exercises are as follows:
Year Car travel (billions of passenger-kilometres) Real price of car travel Real price of rail travel Real price of bus travel Real personal disposable income
1980 388 107.0 76.2 78.9 54.2
1981 394 107.1 77.8 79.3 54.0
1982 406 104.2 82.3 84.6 53.8
1983 411 106.4 83.4 85.5 54.9
1984 432 103.8 79.8 83.3 57.0
1985 441 101.7 80.4 81.6 58.9
1986 465 97.4 82.7 87.1 61.3
1987 500 99.5 84.1 88.4 63.6
1988 536 98.4 85.4 88.4 67.0
1989 581 95.9 85.7 88.6 70.2
1990 588 93.3 86.3 88.9 72.6
1991 582 96.4 89.9 92.4 74.1
1992 583 98.3 93.5 94.7 76.2
1993 584 101.6 97.6 97.3 78.3
1994 591 101.3 99.2 99.2 79.4
1995 596 99.7 100.4 100.5 81.3
1996 606 101.4 101.1 103.1 83.3
1997 614 102.7 100.6 105.1 86.6
1998 618 102.1 101.5 106.6 86.9
1999 613 103.9 103.2 109.3 89.8
2000 618 103.7 102.2 110.0 95.3
2001 624 101.2 102.9 112.0 100.0
Figure 8.4 XY graph of imports against import prices

(a) Draw time-series graphs of car travel and its price and comment on the main features.
(b) Draw XY plots of car travel against (i) price and (ii) income. Comment upon the major features of the graphs.
(c) In a multiple regression of car travel on its price and on income, what would you expect the signs of the two slope coefficients to be? Explain your answer.
(d) If the prices of bus and rail travel are added as further explanatory variables, what would you expect the signs on their coefficients to be? Justify your answer.
Estimation
The model to be estimated is therefore

(M/P_M)_t = b0 + b1 (GDP/P_GDP)_t + b2 (P_M/P)_t + e_t    (8.3)

expressed in terms of the original variables. To simplify notation we rewrite this in terms of the transformed variables as

m_t = b0 + b1 gdp_t + b2 pm_t + e_t    (8.4)

The results of estimating this equation are shown in Table 8.3, which shows the output using Excel. We have used the data in years 1973–2003 for estimation purposes, ignoring the observations for 2004 and 2005. Later on we will use the results to predict imports in 2004 and 2005.
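Excel is not the only way to estimate equation (8.4); any least-squares routine gives the same coefficients. A minimal sketch, solving the normal equations (X'X)b = X'y in plain Python on a tiny made-up data set (not the import series of Table 8.2), just to show the mechanics:

```python
# Minimal OLS for a model like m = b0 + b1*gdp + b2*pm, via the
# normal equations (X'X)b = X'y. Data below are illustrative only.
def ols(X, y):
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Solve the k x k system by Gaussian elimination with partial pivoting
    A = [row[:] + [v] for row, v in zip(XtX, Xty)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    coef = [0.0] * k
    for i in reversed(range(k)):
        coef[i] = (A[i][k] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    return coef

# y is built exactly as 1 + 0.5*x1 - 0.2*x2, so OLS should recover these values
X = [[1, 10, 5], [1, 12, 4], [1, 15, 6], [1, 20, 3], [1, 25, 7]]
y = [1 + 0.5 * x1 - 0.2 * x2 for _, x1, x2 in X]
print([round(c, 4) for c in ols(X, y)])  # ≈ [1.0, 0.5, -0.2]
```

In practice one would use a statistics package or spreadsheet, as in the text; the sketch simply makes explicit what such software computes.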
Table 8.3 Regression results using Excel
SUMMARY OUTPUT
Regression statistics
Multiple R 0.98
R square 0.96
Adjusted R square 0.96
Standard error 13.24
Observations 31
ANOVA
df SS MS F Significance F
Regression 2 129 031.05 64 515.52 368.23 7.82E-21
Residual 28 4905.70 175.20
Total 30 133 936.75

Coefficients Standard error t Stat P-value Lower 95% Upper 95%
Intercept −172.61 73.33 −2.35 0.03 −322.83 −22.39
Real GDP 0.59 0.06 9.12 0.00 0.45 0.72
Real import prices 0.05 0.37 0.13 0.90 −0.70 0.79
The print-out gives all the results we need, which may be summarised as

m_t = −172.61 + 0.59 gdp_t + 0.05 pm_t + e_t    (8.5)
                 (0.06)       (0.37)
R² = 0.96, F(2,28) = 368.23, n = 31
How do we judge and interpret these results? As expected, we obtain a positive coefficient on income but, surprisingly, a positive one on price too. Note that it is difficult to give a sensible interpretation to the constant. The coefficients should be judged in two ways: in terms of their size and their significance.
Size
As noted earlier, the size of a coefficient depends upon the units of measurement. How 'big' is the coefficient 0.59 for income? This tells us that a rise in GDP (measured in 1990 prices) of £1bn would raise imports (also measured in 1990 prices) by £0.59bn. This is a bit cumbersome. It is better to interpret everything in proportionate terms and calculate the elasticity of imports with respect to income. This is the proportionate change in imports divided by the proportionate change in income

η_gdp = (Δm/m) / (Δgdp/gdp)    (8.6)

which can be evaluated (see equation (7.32)) as

η_gdp = b1 × (mean of gdp)/(mean of m) = 0.59 × 536.4/146.3 = 2.16    (8.7)
which shows that imports are highly responsive to income. A 3% rise in real GDP (a fairly typical annual figure) leads to an approximate 6% rise in imports, as long as prices do not change at the same time. Thus as income rises, imports rise substantially faster. More generally, we would interpret the result as showing that a 1% rise in GDP leads to a 2.16% rise in imports.
A similar calculation for the price variable yields

η_pm = b2 × (mean of pm)/(mean of m) = 0.05 × 109.4/146.3 = 0.04    (8.8)

This yields the 'wrong' sign for the elasticity: a 10% price rise (relative to domestic prices) would raise import demand by 0.4%. This is an extremely small effect and for practical purposes can be regarded as zero.
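The elasticity-at-the-means arithmetic of equations (8.7) and (8.8) is easy to check directly, using the coefficient and mean values quoted in the text:

```python
# Elasticity at the means for the linear model (equations 8.7 and 8.8),
# using the values quoted in the text.
b1, gdp_mean, m_mean = 0.59, 536.4, 146.3
eta_gdp = b1 * gdp_mean / m_mean
print(round(eta_gdp, 2))  # 2.16

b2, pm_mean = 0.05, 109.4
eta_pm = b2 * pm_mean / m_mean
print(round(eta_pm, 2))  # 0.04
```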
Signiﬁcance
We can test whether each coefficient is significantly different from zero (i.e. whether the variable truly affects imports or not) using a conventional hypothesis test. For income we have the test statistic

t = (0.59 − 0)/0.06 = 9.12

as shown in Table 8.3. This has a t distribution with n − k − 1 = 31 − 2 − 1 = 28 degrees of freedom (k is the number of explanatory variables excluding the constant, 2 in this case). The critical value for a one-tail test at the 95% confidence level is 1.701. Since the test statistic comfortably exceeds this, we reject H0: β1 = 0 in
favour of H1: β1 > 0. Hence income does indeed affect imports; the sample data are unlikely to have arisen purely by chance. Note that this t ratio is given on the Excel print-out.
For price the test statistic is

t = (0.05 − 0)/0.37 = 0.13

which is smaller than 1.701, so does not fall into the rejection region. H0: β2 = 0 cannot be rejected, therefore. So not only is the coefficient on price quantitatively small, it is insignificantly different from zero, i.e. there is a reasonable probability of this result arising simply by chance. The fact that we had a positive coefficient is thus revealed as unimportant: it was just a small random fluctuation around zero. This result arises despite the fact that the graph of imports against price seemed to show a strong negative relationship. That graph was in fact somewhat misleading. The regression tells us that the more important relationship is with income and, once that is accounted for, price provides little additional explanation of imports. Well, that is the story so far.
The signiﬁcance of the regression as a whole
We can test the overall significance via an F test, as we did for simple regression. This is a test of the hypothesis that all the slope coefficients are simultaneously zero (equivalent to the hypothesis that R² = 0):

H0: β1 = β2 = 0
H1: β1 ≠ 0 and/or β2 ≠ 0

This tests whether either income or price affects demand. Since we have already found that income is a significant explanatory variable (via the t test) it would be surprising if this null hypothesis were not rejected. The test statistic is similar to equation (7.27)

F = (RSS/k) / (ESS/(n − k − 1))    (8.9)

which has an F distribution with k and n − k − 1 degrees of freedom. Substituting in the appropriate values gives

F = (129 031.05/2) / (4905.70/(31 − 2 − 1)) = 368.23

which is in excess of the critical value of the F(2,28) distribution of 3.34 (at 5% significance), so the null hypothesis is rejected, as expected. The actual significance level is given by Excel as '7.82E−21', i.e. 7.82 × 10^−21, effectively zero and certainly less than 5%.
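The substitution into equation (8.9) can be reproduced directly from the ANOVA values in Table 8.3:

```python
# The F statistic of equation (8.9), from the ANOVA values in Table 8.3
rss, ess = 129031.05, 4905.70  # regression and residual sums of squares
n, k = 31, 2                   # observations; explanatory variables
F = (rss / k) / (ess / (n - k - 1))
print(round(F, 2))  # 368.23
```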
Statistics in practice
Does corruption harm investment?
The World Bank examined this question in its 1997 World Development Report
using regression methods. There is a concern that levels of corruption in many
countries harm investment and hence also economic growth.
The study looked at the relationship between investment (measured as a percentage of GDP) and the following variables: the level of corruption, the

predictability of corruption, the level of secondary school enrolment, GDP per capita and a measure of 'policy distortion'. Both the level and predictability of corruption were based upon replies to surveys of businesses in the 39 countries studied, which asked questions such as 'Do you have to make additional payments to get things done?' The policy distortion variable measures how badly economic policy is run, based on openness to trade, the exchange rate, etc. Higher values of the index indicate poorer economic management.
The regression obtained was

Inv/GDP = 19.5 − 5.8 CORR + 6.3 PRED_CORR + 2.0 SCHOOL − 1.1 GDP − 2.0 DISTORT
s.e.       (13.5)  (2.2)       (2.6)           (2.2)       (1.9)      (1.5)
R² = 0.24
Thus only the corruption variables prove significant at the 5% level. A rise in the level of corruption lowers investment (note the negative coefficient, −5.8), as expected, but a rise in the predictability of corruption raises it. This is presumably because people learn how to live with corruption. Unfortunately, units of measurement are not given, so it is impossible to tell just how important are the sizes of the coefficients and, in particular, to find the trade-off between corruption and its predictability.
Adapted from: World Development Report 1997.
Exercise 8.2
(a) Using the data from Exercise 8.1, calculate a regression explaining the level of car travel, using price and income as explanatory variables. Use only the observations from 1980 to 1999. As well as calculating the coefficients you should calculate standard errors and t ratios, R² and the F statistic.
(b) Interpret the results. You should evaluate the size of the effect of the explanatory variables as well as their significance, and evaluate the goodness of fit of the model.

Are the results satisfactory?
The results so far appear reasonably satisfactory: we have found one significant coefficient, the R² value is quite high at 96% (although R² values tend to be high in time-series regressions) and the result of the F test proves the regression is worthwhile. Nevertheless, it is perhaps surprising to find no effect from the price variable; we might as well drop it from the equation and just regress imports on GDP.
A more stringent test is to use the equation for forecasting, since this uses out-of-sample information for the test. So far, the diagnostic tests such as the F test are based on the same data that were used for estimation. A more suitable test might be to see if the equation can forecast imports to within, say, 4% of the correct value. Since real imports increased by about 4.1% p.a. on average between 1973 and 2003, a simple forecasting rule would be to increase the current year's figure by 4.1%. The regression model might be compared to this standard.

Forecasts for 2004 and 2005¹ are obtained by inserting the values of the explanatory variables for these years into the regression equation, giving

2004: m̂ = −172.61 + 0.59 × 784.9 + 0.05 × 71.7 = 290.0
2005: m̂ = −172.61 + 0.59 × 799.6 + 0.05 × 72.7 = 298.6

Table 8.4 summarises the actual and forecast values and the error between them. The percentage error is about 8% in 2004, 11% in 2005. This is not very good! Both years are under-predicted by a large amount. The simple growth rule would have given predictions of 295.0 × 1.04 = 306.8 and 295.0 × 1.04² = 319.1, which are much closer. More work needs to be done.
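The two forecasting approaches are easy to compare in code. A sketch using the figures quoted in the text; note that recomputing from the rounded coefficients (−172.61, 0.59, 0.05) gives fitted values a few £bn away from the book's 290.0 and 298.6, which were presumably computed with unrounded coefficients:

```python
# Linear-equation forecasts versus the naive 4% growth rule.
def forecast(gdp, pm):
    # Rounded coefficients as reported in Table 8.3
    return -172.61 + 0.59 * gdp + 0.05 * pm

lin_2004 = forecast(784.9, 71.7)
lin_2005 = forecast(799.6, 72.7)
print(round(lin_2004, 1), round(lin_2005, 1))  # near, not equal to, 290.0 and 298.6

growth_2004 = 295.0 * 1.04       # last actual value (2003) grown at 4% p.a.
growth_2005 = 295.0 * 1.04 ** 2
print(round(growth_2004, 1), round(growth_2005, 1))  # 306.8 319.1
```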
Improving the model – using logarithms
There are various ways in which we might improve our model. We might try to find additional variables to improve the fit (although since we already have R² = 0.96 this might be difficult), or we might try lagged variables (e.g. the previous year's price) as explanatory variables, on the grounds that the effects do not work through instantaneously. Alternatively, we might try a different functional form for the equation. We have presumed that the regression should be a straight line, although we made no justification for this. Indeed, the graph of imports against income showed some degree of curvature (see Figure 8.3 above). Hence we might try a non-linear transformation of the data, as briefly discussed at the end of Chapter 7.
We shall re-estimate the regression equation having transformed all the data using natural logarithms. Not only does this method fit a curve to the data, but it has the additional advantage of giving more direct estimates of the elasticities, as we shall see. Because of such advantages, estimating a regression equation in logs is extremely common in economics, and analysts often start with the logarithmic form in preference to the linear form.
We will therefore estimate the equation

ln m_t = b0 + b1 ln gdp_t + b2 ln pm_t + e_t

where ln m_t indicates the logarithm of imports in period t, etc. We therefore need to transform our three variables into logarithms, as shown in Table 8.5 (selected years only).
We now use the new data for the regression, with ln m as the dependent variable, ln gdp and ln pm as the explanatory variables. We also use exactly the same formulae as before, applied to this new data.
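The log transformation is a one-line operation in most software. A sketch in Python on the first few values of the real imports series (compare the first rows of Table 8.5):

```python
# Transforming a series to natural logs, as in Table 8.5.
import math

real_imports = [87.4, 86.3, 80.6]           # first rows of Table 8.2
ln_m = [math.log(x) for x in real_imports]  # natural log of each value
print([round(v, 2) for v in ln_m])  # [4.47, 4.46, 4.39]
```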
Table 8.4 Actual, forecast and error values

Year Actual Forecast Error
2004 314.2 290.0 24.2
2005 331.1 298.6 32.5

¹ Remember that data from 2004 and 2005 were not used to estimate the regression equation.
This gives the following results:
SUMMARY OUTPUT
Regression statistics
Multiple R 0.99
R square 0.98
Adjusted R square 0.98
Standard error 0.05
Observations 31
ANOVA
df SS MS F Significance F
Regression 2 5.31 2.65 901.43 3.83E-26
Residual 28 0.08 0.00
Total 30 5.39
Coefficients Standard error t Stat P-value Lower 95% Upper 95%
Intercept −3.60 1.65 −2.17 0.04 −6.98 −0.21
ln GDP 1.66 0.15 11.31 0.00 1.36 1.97
ln import prices −0.41 0.16 −2.56 0.02 −0.74 −0.08
The regression equation we have is therefore

ln m_t = −3.60 + 1.66 ln gdp_t − 0.41 ln pm_t
Because we have transformed the variables, the slope coefficients are very different from the values we had before, from the linear equation. However, the interpretation of the log regression equation is different. A big advantage of this formulation is that the coefficients give direct estimates of the elasticities: there is no need to multiply by the ratio of the means, as with the linear form (see equation (8.7)).
Hence the income elasticity of demand is estimated as 1.66 and the price elasticity is −0.41. These contrast with the values calculated from the linear
Table 8.5 Data in natural logarithm form
Year Real imports ln m Real GDP ln gdp Real import prices ln pm
1973 87.4 4.47 403.4 6.00 114.2 4.74
1974 86.3 4.46 391.6 5.97 143.4 4.97
1975 80.6 4.39 397.8 5.99 131.5 4.88
1976 83.7 4.43 405.5 6.01 138.0 4.93
…    …    …    …    …    …    …
2001 272.1 5.61 724.6 6.59 80.2 4.38
2002 286.0 5.66 740.6 6.61 76.9 4.34
2003 295.0 5.69 761.2 6.63 74.2 4.31
2004 314.2 5.75 784.9 6.67 71.7 4.27
2005 331.1 5.80 799.6 6.68 72.7 4.29
Note: You can obtain the natural logarithm by using the 'ln' key on your calculator or the 'ln' function in Excel or other software. Thus we have ln 87.4 = 4.47, etc.
equation of 2.16 and 0.04 respectively. The contrast with the previous estimate of the price elasticity is particularly stark. We have gone from an estimate which was positive (although very small and statistically insignificant) to one which is negative and significant.
It is difficult to say which is the 'right' answer; both are estimates of the unknown true values. One advantage of the log model is that the elasticity does not vary along the demand curve, as it does with the linear model. With the latter we had to calculate the elasticity at the means of the variables, but the value inevitably varies along the curve. For example, taking 2003 values for imports and income, we obtain an elasticity of

η_gdp = 0.59 × 761.2/295.0 = 1.52

This is quite different from the value at the mean, 2.16. A convenient mathematical property of the log formulation is that the elasticity does not change along the curve. Hence we can talk about 'the' elasticity, which is very convenient.
We can compare the linear and log models further, to judge which is preferable. The log model has a higher price elasticity and it is 'significant' (t = −2.56), so we can now reject the hypothesis that price has no effect upon import demand. This is more in line with what economic theory would predict. The R² value is also higher (0.98 versus 0.96), but this is a misleading comparison. R² tells us how much of the variation in the dependent variable is explained by the explanatory variables. However, we have a different dependent variable now: the log of imports rather than imports. Although they are both measuring imports, they are different variables, making direct comparison of R² invalid.
We can also compare the predictive abilities of the two models. For the log model we have the following predictions:

2004: ln m̂ = −3.60 + 1.66 × 6.67 − 0.41 × 4.27 = 5.73
2005: ln m̂ = −3.60 + 1.66 × 6.68 − 0.41 × 4.29 = 5.76

These are log values, so we need to take anti-logs to get back to the original units:

e^5.73 = 308.2 and e^5.76 = 316.0

These predictions are substantially better than from the linear equation, as we see below:

Year Actual Fitted Error % error
2004 314.2 308.2 6.0 1.9
2005 331.1 316.0 15.1 4.8
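The forecast-and-antilog step can be sketched as follows. Note that recomputing from the rounded coefficients gives log forecasts slightly below the book's 5.73 and 5.76 (and hence fitted values a little below 308.2 and 316.0), again a matter of rounding:

```python
# Log-model forecasts for 2004 and 2005, converted back with exp()
import math

def ln_forecast(ln_gdp, ln_pm):
    # Rounded coefficients from the log regression, as reported
    return -3.60 + 1.66 * ln_gdp - 0.41 * ln_pm

forecasts = {}
for year, lg, lp in [(2004, 6.67, 4.27), (2005, 6.68, 4.29)]:
    forecasts[year] = math.exp(ln_forecast(lg, lp))  # anti-log back to £bn

print({y: round(f, 1) for y, f in forecasts.items()})
```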
The errors are less than half the size they were in the linear formulation and, overall, the log regression is beginning to look the better.
Choosing between alternative models is a matter of judgement. The criteria are convenience, conformity with economic theory and the general statistical 'fit' of the model to the data. In this case the log model seems superior on all counts. It is more convenient, as we get direct estimates of the elasticities. It is more in accord with economic theory, as it suggests a significant price effect, and also because the variables are growing over time, which is usually better represented by the log transformation (see Chapter 1). Finally, the model seems to fit
the data better and, in particular, it gives better forecasts. There are more formal statistical methods for choosing between different models, but they are beyond the scope of this book.
The rest of this chapter goes on to look at more advanced topics relating to the regression model. These are not essential as far as estimation of the regression model goes, but are useful 'diagnostic tools' which allow us to check the quality of the estimates in more depth.
Testing the accuracy of the forecasts: the Chow test
There is a formal test of the accuracy of the forecasts, which can be applied to both linear and log forms of the equation, based on the F distribution. This is the Chow test, named after its inventor. The null hypothesis is that the true prediction errors are all equal to zero, so the errors we do observe are just random variation from the regression line. Alternatively, we can interpret the hypothesis as asserting that the same regression line applies to both estimation and prediction periods. If the predictions lie too far from the estimated regression line, then the null is rejected. The alternative hypothesis is that the model has changed in some way and that a different regression line should be applied to the prediction period.
The test procedure is as follows:
1 Use the first n_1 observations for estimation, the last n_2 observations for the forecast. In this case we have n_1 = 31, n_2 = 2.
2 Estimate the regression equation using the first n_1 observations, as above, and obtain the error sum of squares, ESS_1.
3 Re-estimate the equation using all n_1 + n_2 observations and obtain the pooled error sum of squares, ESS_P.
4 Calculate the F statistic

F = ((ESS_P − ESS_1)/n_2) / (ESS_1/(n_1 − k − 1))

We then compare this test statistic with the critical value of the F distribution with n_2, n_1 − k − 1 degrees of freedom. If the test statistic exceeds the critical value, the model fails the prediction test. A large value of the test statistic indicates a large divergence between ESS_P and ESS_1 (adjusted for the different sample sizes), suggesting that the model does not fit the two periods equally well. The bigger the prediction errors, the more ESS_P will exceed ESS_1, leading to a large F statistic.
Evaluating the test for the log regression, we have ESS_1 = 0.08246 (the Excel printout rounded this to 0.08). Estimating over the whole sample 1973–2005 gives

ln m_t = −3.54 + 1.67 ln gdp_t − 0.42 ln pm_t
R² = 0.99, F(2,30) = 1202.52, ESS_P = 0.08444

so the test statistic is

F = ((0.08444 − 0.08246)/2) / (0.08246/28) = 0.34
The critical value of the F distribution for (2, 28) degrees of freedom is 3.34, so the equation passes the test, i.e. the same regression line may be considered valid for both subperiods, and the errors in the forecasts are just random errors around the regression line.
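Step 4 of the procedure is a one-line calculation once the two error sums of squares are to hand, as the text's values show:

```python
# Chow predictive-failure test statistic (step 4), using the sums of
# squares quoted in the text for the log model.
ess1, essp = 0.08246, 0.08444  # estimation-period and pooled ESS
n1, n2, k = 31, 2, 2
F = ((essp - ess1) / n2) / (ess1 / (n1 - k - 1))
print(round(F, 2))  # 0.34
```

Since 0.34 is far below the critical value of 3.34, the null of equal regression lines is not rejected.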
It is noticeable that the predictions are always too low, for all the models: the errors in both years are positive. This suggests a slight 'boom' in imports relative to what one might expect, despite the result of the Chow test. Perhaps we have omitted an explanatory variable which has changed markedly in 2004–2005, or perhaps the errors are not truly random. Alternatively, we could still have the wrong functional form for the model. Since we already have an R² value of 0.98, we are unlikely to find another variable which adds significantly to the explanatory power of the model. We have already tried two functional forms. Therefore we shall examine the errors in the model to see if they appear random.
Exercise 8.3
(a) Use the regression equation from Exercise 8.2 to forecast the level of car travel in 2000 and 2001. How accurate are your forecasts? Is this a satisfactory result?
(b) Convert the variables to natural logarithms and repeat the regression calculation. Interpret your result and compare to the linear equation.
(c) Calculate price and income elasticities from the linear model and compare to those obtained from the log model.
(d) Forecast car travel in 2000 and 2001 using the log model and compare the results to those from the linear model. Use the function e^x to convert the forecasts in logs back to the original units.
(e) Use a Chow test to test whether the forecasts are accurate. Is there any difference between linear and log models?
Analysis of the errors
Why analyse the errors, as surely they are just random? In setting out our model (equation (8.2)) we asserted the error is random, but this does depend upon us formulating the correct model. Hence if we study the errors and find they are not random in some way, this suggests the model is not correct and hence could be improved. This is another important part of the checking procedure, to see if the model is adequate or whether it is mis-specified (e.g. has the wrong functional form or a missing explanatory variable). If the model is a good one then the error term should be random and, ideally, should be unpredictable. If there are any predictable elements to it, then we could use this information to improve our model and forecasts. Unlike forecasting, this is a within-sample procedure. Second, we expect the observed errors to be approximately Normally distributed, since this assumption underlies the t and F distributions used for inference. If the errors are not Normal, this would cast doubt on our use of t and F statistics for inference purposes.
A complete, formal treatment of these issues is beyond the scope of this book (see, for example, Maddala, Chapters 5, 6 and 12). Instead we give an outline of how to detect the problems and some simple procedures which might overcome them. At least if you are aware of the problem, you will know that you should consult a more advanced text.
First, we can quickly deal with the issue of Normality of the errors. In this example we only have 31 observations, which is not really sufficient to check for a Normal distribution. Drawing a histogram of the errors (left as an exercise) does not give a nice smooth distribution, because of the few observations, and it is hard to tell if it looks Normal or not. More formal methods also require more observations to be reliable, so we will have to take the assumption of Normality on trust in this case.
Second, we can examine the error term for evidence of autocorrelation. This was introduced briefly in Chapter 1. To recapitulate: autocorrelation occurs when one error observation is correlated with an earlier (often the previous) one. It only occurs with time-series data (in cross-section, the ordering of the observations does not matter, so there is not a natural 'preceding' observation). Autocorrelation often occurs in time-series data: if inflation is 'high' this month it is likely to be high next month also; if low, it is likely to be low next month also. Many economic variables are 'sticky' in this way. Imports are likely to behave this way too, as the factors affecting imports (mainly GDP) change slowly.
This characteristic has not been incorporated into our model. If it were, we might improve our forecasts: noting that the actual value of imports in 2003 is above the predicted value (a positive error), we might expect another positive error in 2004. However, our forecast was made by setting the error for 2004 to zero, i.e. using the fitted value from the regression line. In the light of this, perhaps we should not be surprised that the predicted value is below the actual value.
One should therefore check for this possibility, before making forecasts, by examining the errors up to 2003 for the presence of autocorrelation. Poor forecasting is not the only consequence of autocorrelation: the estimated standard errors can also be affected (often biased downwards in practice), leading to incorrect inferences being drawn.
Checking for autocorrelation
The errors to be examined are obtained by subtracting the fitted values from the actual observations. Using time-series data, we have

e_t = Y_t − Ŷ_t = Y_t − b0 − b1 X_1t − b2 X_2t    (8.10)
The errors obtained from the logarithmic model of import demand are shown in Table 8.6 and are graphed in Figure 8.5. The graph suggests a definite pattern: positive errors initially, followed by a series of negative errors, followed in turn by more positive errors. This is definitely not a random pattern: a positive error is likely to be followed by a positive error, a negative error by another negative error. From this graph we might reasonably predict that the two errors for 2004–2005 will be positive (as in fact they are). This means our regression equation is inadequate in some way: we are expecting it to under-predict. If so, we ought to be able to improve it.
The phenomenon we have uncovered (positive errors usually following positive, negative following negative) is known as positive autocorrelation. In other words, there appears to be a positive correlation between successive errors e_t and e_{t−1}. A truly random series would have a low or zero correlation. Less common in economic models is negative autocorrelation, where positive errors tend to follow negative ones, negative follow positive. We will concentrate on positive autocorrelation.
This non-randomness can be summarised, and tested numerically, by the Durbin–Watson (DW) statistic, named after its two inventors. This is routinely printed out by specialist software packages, but unfortunately not by spreadsheet programs. The statistic is a one-tail test of the null hypothesis of no autocorrelation against the alternative of positive (or of negative) autocorrelation. The test statistic always lies in the range 0–4 and is compared to critical values d_L and d_U given in Table A7 (see page 427). The decision rule is best presented graphically, as in Figure 8.6.
Low values of DW (below d_L) suggest positive autocorrelation; high values (above 4 − d_L) suggest negative autocorrelation; and a value near 2 (between d_U and 4 − d_U) suggests the problem is absent. There are also two regions where the test is, unfortunately, inconclusive: between the d_L and d_U values.
The test statistic can be calculated by the formula²

DW = Σ(e_t − e_{t−1})² / Σe_t²    (8.11)

where the sum in the numerator runs from t = 2 to n and the sum in the denominator from t = 1 to n.
Table 8.6 Calculation of residuals

Observation Actual Predicted Residuals
1973 4.47 4.43 0.04
1974 4.46 4.29 0.17
1975 4.39 4.35 0.04
1976 4.43 4.36 0.07
…    …    …    …
2000 5.55 5.50 0.05
2001 5.61 5.55 0.05
2002 5.66 5.61 0.05
2003 5.69 5.67 0.02

Note: In logs, the residual is approximately the percentage error. So, for example, the first residual (0.04) indicates the error is of the order of 4%.
Figure 8.5 Time-series graph of the errors from the import demand equation
² The DW statistic can also be approximated using the correlation coefficient r between e_t and e_{t−1}: DW ≈ 2 × (1 − r). The approximation gets closer the larger the sample size. It should be reasonably accurate if you have 20 observations or more.
This is relatively straightforward to calculate using a spreadsheet program. Table 8.7 shows part of the calculation. Hence we obtain

DW = 0.0705/0.0825 = 0.855

The result suggests positive autocorrelation³ of the errors. For n = 30 (close enough to n = 31) the critical values are d_L = 1.284 and d_U = 1.567 (using the 95% confidence level, see Table A7, page 427), so we clearly reject the null of no autocorrelation.
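Equation (8.11) translates directly into code. A sketch of the calculation on a short made-up residual series (not the actual errors of Table 8.7), chosen to be strongly positively autocorrelated:

```python
# Durbin-Watson statistic (equation 8.11) for a residual series.
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))  # t = 2..n
    den = sum(v ** 2 for v in e)                                 # t = 1..n
    return num / den

# Slowly-drifting residuals (made up) give a DW well below 2
errors = [0.4, 0.5, 0.45, 0.3, -0.2, -0.4, -0.35, -0.1, 0.2, 0.3]
print(round(durbin_watson(errors), 2))  # 0.42
```

A value this far below 2 would point to positive autocorrelation, just as the 0.855 obtained for the import demand errors does.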
Consequences of autocorrelation
The presence of autocorrelation in this example causes our forecasts to be too low. If we took account of the pattern of errors over time, we could improve the forecasting performance of the model. A second general consequence of autocorrelation is that the standard errors are often under-estimated, resulting in excessive t and F statistics. This leads us to think the estimates are ‘significant’ when they might not in fact be so. We may have what is sometimes known as a ‘spurious’ regression – it looks good but is misleading. The bias in the standard errors and t statistics can be large, and this is potentially a serious problem.
This danger occurs particularly when the variables used in the analysis are trended, as many economic variables are over time. Variables trending over time appear to be correlated with each other, but there may be no true underlying relationship. One now-famous study⁴ noted a strong correlation between cumulative rainfall and the price level: both increase over time but are unlikely
[Figure 8.6: The Durbin–Watson test statistic]
Table 8.7 Calculation of the DW statistic

Year      e_t       e_{t−1}    e_t − e_{t−1}    (e_t − e_{t−1})²    e_t²
1973      0.0396    –          –                –                   0.0016
1974      0.1703    0.0396     0.1308           0.0171              0.0290
1975      0.0401    0.1703     −0.1302          0.0170              0.0016
1976      0.0658    0.0401     0.0258           0.0007              0.0043
...       ...       ...        ...              ...                 ...
2000      0.0517    0.0236     0.0281           0.0008              0.0027
2001      0.0548    0.0517     0.0031           0.0000              0.0030
2002      0.0509    0.0548     −0.0039          0.0000              0.0026
2003      0.0215    0.0509     −0.0294          0.0009              0.0005
Totals    –         –          –                0.0705              0.0825
³ The correlation between e_t and e_{t−1} is in fact 0.494.
⁴ D. F. Hendry, ‘Econometrics – alchemy or science?’, Economica, 1980, 47, 387–406.
Chapter 8 • Multiple regression
to be related. It has been suggested that a low value of the DW statistic (typically less than the R² value) can be a symptom of such a problem. The fact that economic theory supports the idea of a causal relationship between demand, prices and income should make us a little more confident that we have found a valid economic relationship, rather than a spurious one, in this case.
This topic goes well beyond the scope of this book, but it is raised because it is important to be aware of the potential shortcomings of simple models. If you estimate a time-series regression equation, check the DW statistic to test for autocorrelation. If it is present, you may want to seek further advice rather than accept the results as they are, even if they appear to be good. The cause of the autocorrelation is often (although not always) the omission of lagged variables in the model, i.e. a failure to recognise that it may take time for the effect of the independent variables to work through to the dependent variable.
Exercise 8.4
(a) Using the log model explaining car travel, calculate the residuals from the regression equation and draw a line graph of them. Do they appear to be random, or is some time-dependence apparent?
(b) Calculate the Durbin–Watson statistic and interpret the result.
(c) If autocorrelation is present, what are the implications for your estimates?
Finding the right model
How do you know that you have found the ‘right’ model for the data? Can you be confident that another researcher, using the same data, would arrive at the same results? How can you be sure there isn’t a relevant explanatory variable out there that you have omitted from your model? Without trying them all, it is difficult to be sure. Good modelling is based on theoretical considerations (e.g. models that are consistent with economic or business principles) and statistical ones (e.g. significant t ratios). One can identify two different approaches to modelling.
● General to speciﬁc: this starts off with a comprehensive model including all
the likely explanatory variables then simpliﬁes it.
● Speciﬁc to general: this begins with a simple model that is easy to understand
then explanatory variables are added to improve the model’s explanatory power.
There is something to be said for both approaches, but it is not guaranteed that the two will end up with the same model. The former approach is usually favoured nowadays: it suffers less from the problem of omitted variable bias (discussed below) and the simplifying procedure is usually less ad hoc than that of generalising a simple model. A very general model will almost certainly initially contain a number of irrelevant explanatory variables. However, this is not much of a problem, and it is less serious than omitted variable bias: standard errors on the coefficients tend to be higher than otherwise, but this is remedied once the irrelevant variables are excluded.
It is rare for either of these approaches to be adopted in its pure, ideal form. For example, in the import demand equation we should have started out with
several lags on the price variable, since we cannot be sure how long imports take to adjust to price changes. Therefore we might have started with (assuming a maximum lag of one year is ‘reasonable’)
m_t = b_0 + b_1 gdp_t + b_2 gdp_{t−1} + b_3 pm_t + b_4 pm_{t−1} + b_5 m_{t−1} + e_t   (8.12)
If b_4 proved to be insignificantly different from zero, we would then re-estimate the equation without pm_{t−1} and obtain new coefficient estimates. If the new b_2 proved insignificant, we would omit gdp_{t−1} and re-estimate. This process would continue until all the remaining coefficients had significant t ratios. We would then have the final, simplified model. At each stage we would omit the variable with the least significant coefficient. Having found the right model, we could then test it on new data to see if it can explain the new observations.
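The general-to-specific simplification loop just described can be sketched in code. This is an illustrative sketch only: the `estimate` function stands in for actually re-running the regression after each deletion, and the t ratios below are made-up numbers for the candidate variables, not results from the text.

```python
# Sketch of general-to-specific model simplification: repeatedly drop the
# least significant variable and re-estimate, until all |t| ratios pass.

def simplify(variables, estimate, t_crit=2.0):
    """Drop the least significant variable until all |t| >= t_crit."""
    vars_left = list(variables)
    while vars_left:
        t = estimate(vars_left)                 # {name: |t ratio|}
        worst = min(vars_left, key=lambda v: t[v])
        if t[worst] >= t_crit:
            break                               # all remaining are significant
        vars_left.remove(worst)                 # omit and re-estimate
    return vars_left

# Hypothetical |t| ratios for the candidate regressors in equation 8.12:
fake_t = {"gdp": 6.1, "gdp_lag": 0.9, "pm": 3.2, "pm_lag": 0.4, "m_lag": 2.5}
kept = simplify(list(fake_t), lambda vs: {v: fake_t[v] for v in vs})
print(kept)
```

With these illustrative numbers, pm_lag (|t| = 0.4) is dropped first, then gdp_lag, leaving only significant coefficients, mirroring the procedure described above.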
Uncertainty regarding the correct model
The remarks about ﬁnding the right model apply to many of the other techniques
used in this book. For example we might employ the Poisson distribution to
model manufacturing faults in televisions but we are assuming this is the correct
distribution to use. In the example of railway accidents recounted in Chapter 4 it
was found that the Poisson distribution did not ﬁt the data precisely – the real
world betrayed less variation than predicted by the model.
Our estimates of parameters and the associated conﬁdence intervals are
based on the assumption that we are using the correct model. To our uncertainty
about the estimates we should ideally add the uncertainty about the correct
model but unfortunately this is difﬁcult to measure. It may be that if we used
a different model we would obtain a different conclusion. If possible therefore
it is a good idea to try out different models to see if the results are robust and
also to inform the reader about alternative methods that have been tried but
not reported.
In practice the procedure is not as mechanical (nor as pure) as this, and more judgement should be exercised. You may not want to exclude all the price variables from a demand equation even though the t ratios are small. A coefficient may be large in size even though it is not significant. ‘Not significant’ does not mean the same as ‘insignificant’; rather, it means there is a lot of uncertainty about its true value. In modelling imports, we used the 2004 and 2005 observations to test the model’s forecasts. When it failed, we revised the model and applied the forecast test again. But this is no longer a strictly independent test, since we used the 2004–2005 observations to decide upon the revision to the model.
Briefly, to sum up a complex and contentious debate, a good model should be:
● consistent with theory: an estimated demand curve should not slope upwards, for example
● statistically satisfactory: there should be good explanatory power (e.g. R², F statistics), the coefficients should be statistically significant (t ratios) and the errors should be random. It should also predict well using new data, i.e. data not used in the estimation procedure
● simple: although a very complicated model predicts better, it might be difficult for the reader to understand and interpret.
Sometimes these criteria conﬂict and then the researcher must use their judge-
ment and experience to decide between them.
Testing compound hypotheses
Simplifying a general model is largely based on hypothesis testing. Usually this means a hypothesis of the form H_0: β = 0, using a t test. Sometimes, however, the hypothesis is more complex, as in the following examples.
● You want to test the equality of two coefficients, H_0: β_1 = β_2.
● You want to test whether a group of coefficients are all zero, H_0: β_1 = β_2 = 0.
A general method for testing these compound hypotheses is to use an F test. We illustrate this by examining whether consumers suffer from money illusion in the import demand equation. We assumed, in line with economic theory, that only relative prices matter, and used P_M/P as an explanatory variable. But suppose consumers actually respond differently to changes in P_M and in P. In that case we should enter P_M and P as separate explanatory variables, and they would have different coefficients. In other words, we should estimate (using the log form)⁵
ln m_t = c_0 + c_1 ln gdp_t + c_2 ln P_Mt + c_3 ln P_t + e_t   (8.13)
rather than
ln m_t = b_0 + b_1 ln gdp_t + b_2 ln pm_t + e_t   (8.14)
where P_M is the nominal price of imports and P is the nominal price level. We would expect c_2 < 0 and c_3 > 0. Note that (8.14) is a restricted form of equation (8.13), with the restriction c_2 = −c_3 imposed. A lack of money illusion implies that this restriction should be valid and that equation (8.14) is the correct form of model. The hypothesis to test is therefore H_0: c_2 = −c_3 or, alternatively, H_0: c_2 + c_3 = 0.
How can we test this? If the restriction is valid, equations (8.13) and (8.14) should fit equally well and thus have similar error sums of squares. Conversely, if they have very different ESS values, then we would reject the validity of the restriction. To carry out the test we therefore do the following:
● Estimate the unrestricted model (8.13) and obtain the unrestricted ESS from it (ESS_U).
● Estimate the restricted model (8.14) and obtain the restricted ESS (ESS_R).
● Form the test statistic

F = [(ESS_R − ESS_U)/q] / [ESS_U/(n − k − 1)]   (8.15)

where q is the number of restrictions (1 in this case) and k is the number of explanatory variables in the unrestricted model.
● Compare the test statistic with the critical value of the F distribution with q and n − k − 1 degrees of freedom. If the test statistic exceeds the critical value, reject the restricted model in favour of the unrestricted one.
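The steps above reduce to a one-line calculation once the two error sums of squares are known. The sketch below (not from the text) implements equation 8.15 and then plugs in the money-illusion test values used later in this section.

```python
# Sketch: the restriction F test of equation 8.15.

def restriction_f(ess_r, ess_u, q, n, k):
    """F = ((ESS_R - ESS_U)/q) / (ESS_U/(n - k - 1))."""
    return ((ess_r - ess_u) / q) / (ess_u / (n - k - 1))

# Money-illusion test from the text: ESS_R = 0.08246, ESS_U = 0.02720,
# one restriction, n = 31 observations, k = 3 explanatory variables.
F = restriction_f(0.08246, 0.02720, q=1, n=31, k=3)
print(round(F, 2))  # about 54.85, far above the critical value of 4.21
```

Since the computed F greatly exceeds the critical value, the restriction would be rejected, as the text goes on to show.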
⁵ Note that it is much easier to test the restriction in log form, since P_M and P are entered additively. It would be much harder to do this in levels form.
We have already estimated the restricted model, equation (8.14), and from that we obtain ESS_R = 0.08246. Estimating the unrestricted model gives
ln m_t = −8.77 + 2.31 ln gdp_t − 0.20 ln P_Mt + 0.02 ln P_t + e_t   (8.16)

with ESS_U = 0.0272. The test statistic is therefore
F = [(0.08246 − 0.02720)/1] / [0.02720/(31 − 3 − 1)] = 54.85   (8.17)
The critical value at the 95% confidence level is 4.21, so the restriction is rejected. Consumers do not use relative prices alone in making decisions, but are somehow influenced by the general rate of inflation as well. This is contrary to what one would expect from economic theory. Interestingly, the equation using nominal prices does not suffer from autocorrelation, so imposing the restriction (estimating with the real price of imports) induces autocorrelation, another indication that the restriction is inappropriate.
To our earlier ﬁnding we might therefore add that consumers appear to take
account of nominal prices. We do not have space to investigate this issue in
more detail but further analysis of these nominal effects would be worthwhile.
There may be a theoretical reason for nominal prices to have an inﬂuence.
Alternatively there could be measurement problems with the data or inadequacies
in the model which mask the truth that it is after all relative prices that matter.
Whatever the results, this method of hypothesis testing is quite general: it is possible to test any number of linear restrictions by estimating the restricted and unrestricted forms of the equation and comparing how well they fit the data. If the restricted model fits almost as well as the unrestricted model, it is preferred on the grounds of simplicity. The F test is the criterion by which we compare the fit of the two models, using error sums of squares.
Omitted variable bias
Omitting a relevant explanatory variable from a regression equation can lead to serious problems. Not only is the model inadequate because there is no information about the effect of the omitted variable, but, in addition, the coefficients on the variables which are included are usually biased. This is called omitted variable bias (OVB).
We encountered an example of this in the model of import demand. Notice how the coefficient on income changed from 1.66 to 2.31 when nominal prices were included. This is a substantial change and shows that the original equation, with only the real price of imports included, may be misleading with respect to the effect of income upon imports. The coefficient on income was biased downwards.
The direction of OVB depends upon two things: the correlation between the omitted and included explanatory variables, and the sign of the coefficient on the omitted variable. Thus if you have to omit what you believe is a relevant explanatory variable (because the observations are unavailable, for example) you might be able to infer the direction of bias on the included variables. Table 8.8 summarises the possibilities, where the true model is Y = b_0 + b_1 X_1 + b_2 X_2 + e but the estimated model omits the X_2 variable. Table 8.8 only applies to a single omitted variable; when there are several, matters are more complicated (see Maddala, Chapter 4).
In addition to coefficients being biased, their standard errors are biased upwards as well, so that inferences and confidence intervals will be incorrect. The best advice, therefore, is to ensure you don’t omit a relevant variable.
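The direction of the bias can be illustrated numerically. The sketch below (not from the text, with made-up values) reproduces the first row of Table 8.8: the omitted variable has a positive coefficient and is positively correlated with the included one, so the estimate of b_1 is biased upwards.

```python
# Sketch: omitted variable bias. True model is y = b0 + b1*x1 + b2*x2 + e,
# but we regress y on x1 alone and watch the slope estimate drift upwards.

def ols_slope(x, y):
    """Simple OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

b0, b1, b2 = 1.0, 0.5, 0.8                 # true (made-up) coefficients
x1 = [float(i) for i in range(20)]
x2 = [0.6 * v + 1.0 for v in x1]           # x2 positively correlated with x1
y = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

b1_hat = ols_slope(x1, y)                  # regression omitting x2
print(b1_hat)                              # near 0.98 = b1 + b2*0.6: biased up
```

The estimated slope absorbs part of the omitted variable's effect (b_2 times the regression of X_2 on X_1), which is exactly the mechanism summarised in Table 8.8.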
Exercise 8.5
(a) Calculate the simple correlation coefficients between price, income, the price of rail travel and the price of bus travel.
(b) The prices of rail and bus travel may well influence the demand for car travel. If so, the models calculated in previous exercises are mis-specified. What are the possible consequences of this? How might the correlations calculated in part (a) help?
(c) Extend the regression equation to include these two extra prices. Estimate in logs, using 1980–1999. Does this change any of your conclusions?
(d) One might expect the bus and rail price variables to have similar coefficients, as they are both substitutes for car travel. Test the hypothesis H_0: β_rail − β_bus = 0 by comparing error sums of squares from restricted and unrestricted regressions.
Dummy variables and trends
These are types of artificial variable which can be very useful in regression. A dummy variable is one that takes on a restricted range of values, usually just 0 and 1. Despite this simplicity, it can be useful in a number of situations. For example, suppose we suspect that the UK’s import demand function shifted after the rise in the oil price in 1979. Ideally we might include oil prices in our model, but suppose these data are unavailable. How could we then explore this possibility empirically?
One answer is to construct a variable D_t which takes the value 0 for the years 1973–1979 and 1 thereafter (i.e. 0, 0, ..., 0, 1, 1, ..., 1), the switch occurring after 1979. We then estimate
ln m_t = b_0 + b_1 ln gdp_t + b_2 ln pm_t + b_3 D_t + e_t   (8.18)
The coefficient b_3 gives the size of the shift in 1979. The constant for the equation is equal to b_0 for 1973–1979 (when D_t = 0) and equal to b_0 + b_3 thereafter (when D_t = 1). The sign of b_3 shows the direction of any shift, and one can
Table 8.8 The effects of omitted variable bias

Sign of omitted      Correlation            Direction of    Example values of b_1
coefficient b_2      between X_1 and X_2    bias of b_1     True     Estimated
> 0                  > 0                    Upwards          0.5      0.9
                                                            −0.5     −0.1
> 0                  < 0                    Downwards        0.5      0.1
                                                            −0.5     −0.9
< 0                  > 0                    Downwards        0.5      0.1
                                                            −0.5     −0.9
< 0                  < 0                    Upwards          0.5      0.9
                                                            −0.5     −0.1
[Figure 8.7: The dummy variable effect]
also test its significance via the t ratio. If it turns out not to be significant, then there was probably no shift in the relationship.
Note that we do not use the log of D – this would be impossible, as ln 0 is not defined. In any case, a dummy variable only needs to have two different values; it does not matter what they are, although (0, 1) is convenient for interpretation. Note also that b_3 will give the change in ln m, which is approximately the percentage change in m.
Estimating equation (8.18) yields the following result:

ln m_t = −4.98 + 1.85 ln gdp_t − 0.35 ln pm_t − 0.11 D_t + e_t   (8.19)
s.e.             (0.12)          (0.12)        (0.02)
R² = 0.99, F(3, 27) = 1029.1, n = 31
We note that the dummy variable has a significant coefficient and that, after 1979, imports were 11% lower than before (after taking account of any price and income effects). We presume it is the oil shock that caused this, but in fact it could be due to anything that changed in 1979. Figure 8.7 shows the effect of introducing such a dummy variable: it shifts the regression line downwards for the years from 1979 onwards.
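Constructing the dummy itself is trivial in code. The sketch below (illustrative, not from the text) builds the 0/1 shift variable of equation 8.18 for the 1973–2003 annual sample, switching on after 1979.

```python
# Sketch: building the 0/1 shift dummy used in equation 8.18.

years = list(range(1973, 2004))            # 31 annual observations
D = [0 if y <= 1979 else 1 for y in years] # 0 for 1973-1979, 1 from 1980 on

print(D[:8])  # the switch occurs after 1979
```

In the regression, the intercept is b_0 wherever D is 0 and b_0 + b_3 wherever D is 1, which is how the single coefficient b_3 measures the size of the shift.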
Trap
There were in fact two oil shocks – in 1973 and 1979. With a longer series of data you might therefore be tempted to use a dummy variable 0, 0, ..., 0, 1, ..., 1, 2, ..., 2, with the first switch in 1973 and the second in 1979 (this assumes you have some pre-1973 observations). This is wrong! It implicitly assumes that the two shocks had the same effect upon the dependent variable. The correct technique is to use two dummies, both using only zeros and ones. The first dummy would switch from 0 to 1 in 1973, the second would switch in 1979. Their individual coefficients would then measure the size of each shock.
A time trend is another useful type of dummy variable, used with time-series data. It takes the values 1, 2, 3, 4, ..., T, where there are T observations. It is used as a proxy for a variable which we cannot measure and which we believe increases in a linear fashion. For example, suppose we are trying to model the petrol consumption of cars. Price and income would obviously be relevant explanatory variables but, in addition, technical progress has made cars more fuel-efficient over time. It is impossible to measure this accurately, so we use a time trend as
an additional regressor. In this case it should have a negative coefficient, which would measure the annual reduction in consumption due to more fuel-efficient cars. Remember also that if the dependent variable is in logs, the coefficient on the time trend shows the percentage change per annum (or per time period); for example, a coefficient of −0.05 would indicate a 5% per annum fall in the dependent variable, independent of movements in other explanatory variables.
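The trend variable and the interpretation of its coefficient can be sketched as follows. This is an illustration only (the sample size and coefficient are made up): with a log dependent variable, a trend coefficient of −0.05 corresponds to an exact proportional change of exp(−0.05) − 1 per period, roughly −5%.

```python
# Sketch: a linear time trend as a proxy for steady technical progress.
import math

T = 20
trend = list(range(1, T + 1))        # 1, 2, ..., T, one value per period

# With ln(y) as the dependent variable, a trend coefficient of -0.05 means
# ln(y) falls by 0.05 each period; the exact proportional change is:
coef = -0.05
annual_change = math.exp(coef) - 1   # about -0.0488, i.e. roughly -5% a year
print(round(annual_change, 4))
```

This is why the text's rule of thumb (coefficient ≈ percentage change) works well for small coefficients: exp(c) − 1 ≈ c when c is close to zero.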
Exercise 8.6
(a) The graph of car travel suggests a break in 1990 – the rise is slower after this point than before. Test whether this break is significant or not using a dummy variable with a value of 0 up to and including 1990, 1 thereafter. Estimate in logs, using all three prices and income, 1980–1999.
(b) The quality of cars has improved steadily over time, perhaps leading to increased travel by car. Add a time trend to the regression equation in part (a) and re-estimate. Is there evidence to support this idea?
Multicollinearity
Sometimes some or all of the explanatory variables are highly correlated in the sample data, which means that it is difficult to tell which of them is influencing the dependent variable. This is known as multicollinearity. Since all variables are correlated to some degree, multicollinearity is a problem of degree also. For example, if GDP and import prices both rise over time, it may be difficult to tell which of them influences imports. There has to be some independent movement of the explanatory variables for us to be able to disentangle their separate influences.
The symptoms of multicollinearity are:
● high correlation between two or more of the explanatory variables
● high standard errors of the coefﬁcients leading to low t ratios
● a high value of R² and a significant F statistic, in spite of the insignificance of the individual coefficients.
In this situation one might make the mistake of concluding that a variable is insignificant because of a large standard error, when in fact multicollinearity is to blame. It may be useful, therefore, to examine the correlations between all the explanatory variables to see if such a problem is apparent. For example, the correlation between nominal import prices and the retail price index is 0.97, so it may be difficult to disentangle their individual effects.
The best cure is to obtain more data, which might exhibit more independent variation of the explanatory variables. This is not always possible, however: for example, if a sample survey has already been completed. An alternative is to drop one of the correlated variables from the regression equation, though the choice of which to exclude is somewhat arbitrary. Another procedure is to obtain alternative estimates of the effects of one of the collinear variables (e.g. from another study). These effects can then be allowed for when estimates of the remaining coefficients are made.
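Screening for the problem is straightforward: compute the pairwise correlations of the explanatory variables before running the regression. The sketch below is illustrative only; the two price series are made-up numbers that trend together, standing in for nominal import prices and the retail price index.

```python
# Sketch: checking for multicollinearity via the pairwise correlation of
# two explanatory variables.

def corr(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

pm = [100, 104, 109, 115, 122, 130]   # hypothetical import price index
p = [100, 103, 110, 114, 123, 129]    # hypothetical retail price index
r = corr(pm, p)
print(round(r, 3))  # close to 1: their separate effects are hard to untangle
```

A correlation near 1, like the 0.97 reported in the text, is the warning sign: the regression can fit well overall while individual coefficients carry large standard errors.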
Measurement error
It is not always possible to measure the variables in a regression equation precisely, so the problem of measurement error arises. Either or both of the endogenous or
exogenous variables could be affected. This is more of a problem for estimation when the measurement error is systematic rather than random (in which case it may simply disappear into the error term), and it can result in biased estimates. If transport costs are left out of the measured price of imported goods, and these costs have declined over time, then there is systematic measurement error in the price variable and possible bias in the coefficient.
Exercise 8.7
We noted in Exercise 8.5 that rail and bus prices were highly correlated. This may be why they both appear to be ‘insignificant’ in the regression equation. It could be the case that either of them could be influencing demand, but we cannot tell which. We can examine this by testing the hypothesis H_0: β_rail = β_bus = 0. The restricted regression therefore excludes these two variables; the unrestricted regression includes them. One can then use equation (8.15), with q = 2 restrictions, to test the hypothesis. What is the result? (Do not include the dummy or trend in the equation.)
Some ﬁnal advice on regression
● As always, large samples are better than small. Reasonable results were obtained above with only 31 observations, but this is rather a small sample size on which to base solid conclusions.
● Check the data carefully before calculation. This is especially true if a computer is used to analyse the data. If the data are typed in incorrectly, every subsequent result will be wrong. A substantial part of any research project should be devoted to verifying the data, checking the definitions of variables, etc. The work is tedious, but important.
● Don’t go fishing. Otherwise known as data-mining, this is searching through the data hoping something will turn up. Some idea of what the data are expected to reveal, and why, allows the search to be conducted more effectively. It is easy to see imaginary patterns in data if an aimless search is being conducted. Try looking at the table of random numbers, Table A1 (see page 412), which will probably soon reveal something ‘significant’, like your telephone number or your credit card number.
● Don’t be afraid to start with fairly simple techniques. Draw a graph of demand against price to see what it looks like: whether it looks linear or log-linear, whether there are any outliers (a data error?), etc. This will give an overview of the problem, which can be kept in mind when more refined techniques are used.
Summary
● Multiple regression extends the principles of simple regression to models
using several explanatory variables to explain variation in Y.
● The multiple regression equation is derived by minimising the sum of squared residuals, as in simple regression. This principle leads to the formulae for slope coefficients, standard errors, etc.
● The signiﬁcance of the individual slope coefﬁcients can be tested using the
t distribution and the overall signiﬁcance of the model is based on the F
distribution.
● It is important to check the adequacy of the model. This can be done in various
ways including examining the accuracy of predictions and checking that the
residuals appear random.
● One important form of non-randomness is termed autocorrelation, where the error in one period is correlated with earlier errors (this can occur in time-series data). This can lead to incorrect inferences being drawn.
● The Durbin–Watson statistic is one diagnostic test for autocorrelation. If there
is a problem of autocorrelation it can often be eliminated by including lagged
regressors.
● A good model should be (i) consistent with economic (or some other) theory, (ii) statistically satisfactory, and (iii) simple. Sometimes there is a trade-off between these different criteria.
● Complex hypothesis tests can often be performed by comparing restricted
and unrestricted forms of the model. If the former ﬁts the data almost as
well as the latter then the simplifying restrictions speciﬁed in the null
hypothesis are accepted.
● Omitting relevant explanatory variables from the model is likely to cause bias
to the estimated coefﬁcients. This suggests it is often best to start off with a
fairly general model and simplify it.
● Regression analysis can become very complicated (well beyond the scope of this book), involving issues such as multicollinearity and simultaneous equations. However, the methods given in this chapter can provide helpful insights into a range of problems, especially if the potential shortcomings of the model are appreciated.
Reference
G. S. Maddala, Introduction to Econometrics, 3rd edn, Wiley, 2001.

Key terms and concepts
autocorrelation
dummy variables
measurement error
multicollinearity
omitted variable bias
simultaneity
spurious regression
regression coefficients
Problems
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
8.1 (a) Using the data in Problem 7.1 (page 273), estimate a multiple regression model of the birth rate explained by GNP, the growth rate and the income ratio. Comment upon:
(i) the sizes and signs of the coefficients,
(ii) the significance of the coefficients,
(iii) the overall significance of the regression.
(b) How would you simplify the model?
(c) Test for the joint significance of the coefficients on growth and the income ratio.
(d) Repeat the above steps for all 26 observations. Comment.
(e) Do you feel your understanding of the birth rate is improved after estimating the multiple regression equation?
(f) What other possible explanatory variables do you think it might be worth investigating?
8.2 The following data show the real price of butter and real incomes, to supplement the data in Problem 7.2 (see page 274).
Year Price of butter Real income Year Price of butter Real income
1970 105.5 70.3 1980 119.2 92.1
1971 130.9 71.1 1981 114.2 91.4
1972 131.9 77.1 1982 114.5 90.9
1973 99.5 82.1 1983 110.0 93.3
1974 89.6 81.5 1984 107.9 96.8
1975 92.1 81.9 1985 100.0 100.0
1976 109.1 81.7 1986 104.2 104.5
1977 118.2 79.9 1987 99.8 108.1
1978 123.4 85.8 1988 100.2 114.6
1979 130.6 90.7
(a) Estimate a multiple regression model of the demand for margarine. Do the coefficients have the expected signs?
(b) Test the significance of the individual coefficients and of the regression as a whole.
(c) Should the model be simplified?
(d) Calculate the elasticity of demand. How does it differ from your earlier answer?
(e) Estimate the cross-price demand elasticity.
(f) Should other variables be added to improve the model, in your view?
8.3 Using the results from Problem 8.1, forecast the birth rate of a country with the following characteristics: GNP equal to 3000, a growth rate of 3% p.a. and an income ratio of 7. Construct the point estimate only.
8.4 Given the following data for 1989 and 1990:
Year Price of margarine Price of butter Real income
1989 79.3 104.3 120.2
1990 79.3 97.0 122.7
(a) Predict the levels of margarine consumption in the two years.
(b) The actual values of consumption for the two years were 3.47 and 3.19. How accurate are your forecasts?
(c) Test for the stability of the coefficients between sample and forecast periods.
8.5 How would you most appropriately measure the following variables?
(a) social class, in a model of alcohol consumption
(b) crime
(c) central bank independence from political interference.
8.6 As Problem 8.5, for:
(a) the output of a car firm, in a production function equation
(b) potential trade union influence in wage bargaining
(c) the performance of a school.
8.7 Would it be better to use time-series or cross-section data in the following models?
(a) the relationship between the exchange rate and the money supply
(b) the determinants of divorce
(c) the determinants of hospital costs.
Explain your reasoning.
8.8 As Problem 8.7 for:
a measurement of economies of scale in the production of books
b the determinants of cinema attendances
c the determinants of the consumption of perfume.
8.9 How would you estimate a model explaining the following variables?
(a) airline efficiency
(b) infant mortality
(c) bank profits.
You should consider such issues as whether to use time-series or cross-section data, the explanatory variables to use (and any measurement problems), any relevant data transformations, and the expected results.
8.10 As Problem 8.9, for:
(a) investment
(b) the pattern of UK exports (i.e. which countries they go to)
(c) attendance at football matches.
8.11 R. Dornbusch and S. Fischer (in R. E. Caves and L. B. Krause, Britain’s Economic Performance, Brookings, 1980) report the following equation for predicting the UK balance of payments:

B = 0.29 + 0.24U + 0.17 ln Y − 0.004t − 0.10 ln P − 0.24 ln C
t ratios: (0.56) (5.9) (2.5) (3.8) (3.2) (3.9)
R² = 0.76, s_e = 0.01, n = 36 (quarterly data 1970:1–1978:1)
where
B: the current account of the balance of payments as a percentage of gross domestic product (a balance of payments deficit of 3% of GDP would be recorded as −3.0, for example)
U: the rate of unemployment
Y: the OECD index of industrial production
t: a time trend
P: the price of materials relative to the GDP deflator (price index)
C: an index of UK competitiveness (a lower value of the index implies greater competitiveness)
ln indicates the natural logarithm of a variable.
(a) Explain why each variable is included in the regression. Do they all have the expected sign for the coefficient?
(b) Which of the following lead to a higher balance of payments (BOP) deficit relative to GDP: (i) higher unemployment, (ii) higher OECD industrial production, (iii) higher material prices, (iv) greater competitiveness?
(c) What is the implied shape of the relationship between B and (i) U, (ii) Y?
(d) Why cannot a double log equation be estimated for these data? What implications does this have for obtaining elasticity estimates? Why are elasticity estimates not very useful in this context?
(e) Given the following values of the explanatory variables, estimate the state of the current account (point estimate): unemployment rate 10%, OECD index 110, time trend 37, materials price index 100, competitiveness index 90.
8.12 In a cross-section study of the determinants of economic growth (National Bureau of Economic Research, Macroeconomic Annual, 1991), Stanley Fischer obtained the following regression equation:

GY = 1.38 − 0.52RGDP70 + 2.51PRIM70 + 11.16INV − 4.75INF + 0.17SUR − 0.33DEBT80 − 2.02SSA − 1.98LAC
t ratios: (−5.9) (2.69) (3.91) (2.7) (4.34) (−0.79) (−3.71) (−3.76)
R² = 0.60, n = 73
where
GY: growth per capita, 1970–1985
RGDP70: real GDP per capita, 1970
PRIM70: primary school enrolment rate, 1970
INV: investment/GNP ratio
INF: inflation rate
SUR: budget surplus/GNP ratio
DEBT80: foreign debt/GNP ratio
SSA: dummy for sub-Saharan Africa
LAC: dummy for Latin America and the Caribbean
(a) Explain why each variable is included. Does each have the expected sign on its coefficient? Are there any variables which are left out, in your view?
(b) If a country were to increase its investment ratio by 0.05, by how much would its estimated growth rate increase?
(c) Interpret the coefficient on the inflation variable.
(d) Calculate the F statistic for the overall significance of the regression equation. Is it significant?
(e) What do the SSA and LAC dummy variables tell us?
8.13 (Project) Build a suitable model to predict car sales in the UK. You should use time-series data (at least 20 annual observations). You should write a report in a similar manner to Problem 7.12 (see page 275).
Answers to exercises
Exercise 8.1
(a) Demand rises rapidly until around 1990, then rises more slowly. Price falls quite quickly until 1990, then rises. This may relate to the pattern of travel demand above.
(b) The cross-plot of travel (vertical axis) against price is not clear-cut; there may be a slight negative relationship. Again, there is not an obvious bivariate relationship between travel and income.
(c) Economic theory would suggest a negative price coefficient and a positive income coefficient.
(d) If bus and rail are substitutes for car travel, one would expect positive coefficients on their prices. However, they might be complements – commuters may drive to the station to catch the train.
Exercise 8.2
(a) The regression is:
Source | SS df MS Number of obs = 20
----------+------------------------------ F(2, 17) = 483.10
Model | 138 136.463 2 69 068.2316 Prob > F = 0.0000
Residual | 2430.48675 17 142.969809 R-squared = 0.9827
----------+------------------------------ Adj R-squared = 0.9807
Total | 140 566.95 19 7 398.26053 Root MSE = 11.957
-----------------------------------------------------------------------
car | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+------------------------------------------------------------
rpcar | -6.390429 0.7639393 -8.37 0.000 -8.0022 -4.778658
rpdi | 6.048783 0.2340236 25.85 0.000 5.555037 6.54253
_cons | 748.1112 83.857 8.92 0.000 571.1884 925.034
-------------------------------------------------------------------
(b) The signs of the coefficients are as expected. A unit increase in price lowers demand by 6.4 units; a unit rise in income raises demand by about 6 units. Without knowledge of the units of measurement, it is hard to give a more precise interpretation. Both coefficients are highly significant, as is the F statistic. 98% of the variation of car travel demand is explained by these two variables, a high figure.
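As an illustration of what lies behind output like this, a multiple regression can be sketched with numpy's least-squares routine. The data below are invented (the book's car-travel series is not reproduced here) and the function name `ols` is illustrative; the 'true' coefficients are chosen to echo the fitted equation above, so the sketch recovers them exactly.

```python
import numpy as np

def ols(y, X):
    """OLS by least squares; returns (coefficients, R-squared).
    Coefficient order: [constant, slope_1, slope_2, ...]."""
    Xc = np.column_stack([np.ones(len(y)), X])   # add a constant column
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    resid = y - Xc @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return beta, r2

# Illustrative data only: two regressors standing in for price and income
rng = np.random.default_rng(1)
X = rng.uniform(50, 150, size=(20, 2))
y = 748.1 - 6.39 * X[:, 0] + 6.05 * X[:, 1]      # exact relation, for checkability

beta, r2 = ols(y, X)
print(beta.round(2), round(r2, 4))
```

With noisy data the same code returns least-squares estimates and an R-squared below 1, mirroring the Stata output above.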
Exercise 8.3
(a) The forecast values are 661.9 and 706.3 in 2000 and 2001. These compare with actual values of 618 and 624, so the errors are −6.6% and −11.7%. Assuming 2000 and 2001 would be the same as 1999 would actually give better results.
(b) In logs, the results are:
Source | SS df MS Number of obs = 20
----------+------------------------------ F(2, 17) = 599.39
Model | 0.557417045 2 0.278708523 Prob > F = 0.0000
Residual | 0.007904751 17 0.000464985 R-squared = 0.9860
----------+------------------------------ Adj R-squared = 0.9844
Total | 0.565321796 19 0.029753779 Root MSE = 0.02156
-----------------------------------------------------------------------
lcar | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+------------------------------------------------------------
lrpcar | −1.192195 0.1410668 −8.45 0.000 −1.48982 −0.8945699
lrpdi | 0.8408944 0.0296594 28.35 0.000 0.7783184 0.9034704
_cons | 8.193106 0.7057587 11.61 0.000 6.704085 9.682127
-----------------------------------------------------------------------
These results were produced using Stata. The layout is similar to that of Excel. Prob-values are indicated by 'Prob > F' and 'P>|t|'. 'rpcar' indicates the real price of car travel, 'rpdi' indicates real personal disposable income. Later on, an 'l' in front of a variable name indicates it is in log form.
Demand is elastic with respect to price (e = −1.19) and slightly less than unit-elastic for income (e = 0.84). The coefficients are again highly significant.
(c) Price and income elasticities from the linear model are −6.4 × 101.1/526.5 = −1.23 and 6.0 × 70.2/526.5 = 0.80. These are very similar to the log coefficients.
(d) The forecasts in logs are 6.492 and 6.561, which translate into 659.8 and 706.8. The predictions and errors are similar to those of the linear model.
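The elasticity-at-the-means calculation in part (c) is a one-liner; a minimal sketch follows (the helper name is illustrative, and the means 101.1, 70.2 and 526.5 are the values used above).

```python
def point_elasticity(b, xbar, ybar):
    """Elasticity of y with respect to x in a linear model y = a + b*x,
    evaluated at the sample means: e = b * xbar / ybar."""
    return b * xbar / ybar

price_elasticity = point_elasticity(-6.4, 101.1, 526.5)
income_elasticity = point_elasticity(6.0, 70.2, 526.5)
print(round(price_elasticity, 2), round(income_elasticity, 2))  # → -1.23 0.8
```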
(e) For the linear model the Chow test is

F = ((ESS_P − ESS_1)/n_2) / (ESS_1/(n_1 − k − 1)) = ((7672.6 − 2430.5)/2) / (2430.5/(20 − 2 − 1)) = 18.3

The critical value is F(2, 17) = 3.59, so there appears to be a change between estimation and forecast periods. A similar calculation for the log model yields an F statistic of 13.9 (ESS_P = 0.0208), also significant.
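As a sketch, this predictive-failure (Chow) test can be wrapped in a small function; the ESS figures below (7672.6 pooled with the forecast period, 2430.5 from the estimation period) are those used above, and the function name is illustrative.

```python
def chow_forecast_f(ess_pooled, ess_est, n_est, n_forecast, k):
    """Chow predictive-failure test:
    F = ((ESS_P - ESS_1)/n2) / (ESS_1/(n1 - k - 1)),
    where ESS_P pools the estimation and forecast periods."""
    return ((ess_pooled - ess_est) / n_forecast) / (ess_est / (n_est - k - 1))

# Linear car-travel model: 20 estimation obs, 2 forecast obs, k = 2 regressors
f = chow_forecast_f(7672.6, 2430.5, 20, 2, 2)
print(round(f, 1))  # → 18.3
```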
Exercise 8.4
(a) The residuals from the log regression, plotted against time, show some evidence of positive autocorrelation; in particular, the last two residuals, from the forecast period, are substantially larger than the rest.
(b) The Durbin–Watson statistic is DW = 1.52, against an upper critical value of d_U = 1.54. The test statistic just falls into the uncertainty region, but the evidence for autocorrelation is very mild.
(c) Autocorrelation would imply biased standard errors, so inference would be dubious, but the coefficients themselves are still unbiased.
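The Durbin–Watson statistic itself is easy to compute from a residual series; a minimal sketch follows, with a made-up series (the regression residuals are not reproduced in the text).

```python
def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2). Values near 2 suggest no
    autocorrelation; well below 2 suggests positive autocorrelation."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

# Alternating residuals (negative autocorrelation) push DW above 2
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # → 3.0
```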
Exercise 8.5
(a) The correlations are:
| rpcar rpdi rprail rpbus
--------+------------------------------
rpcar | 1.0000
rpdi | −0.3112 1.0000
rprail | −0.1468 0.9593 1.0000
rpbus | −0.1421 0.9632 0.9827 1.0000
The price of car travel has a low correlation with the other variables, which are all highly correlated with each other (r > 0.95).
(b) There may be omitted variable bias. Since the omitted variables are correlated with income, the income coefficient we have observed may be misleading. The
car price variable is unlikely to be affected much as it has a low correlation with
the omitted variables.
(c) The results are:
Source | SS df MS Number of obs = 20
----------+------------------------------- F(4, 15) = 285.36
Model | 0.557989155 4 0.139497289 Prob > F = 0.0000
Residual | 0.007332641 15 0.000488843 R-squared = 0.9870
----------+------------------------------- Adj R-squared = 0.9836
Total | 0.565321796 19 0.029753779 Root MSE = 0.02211
-----------------------------------------------------------------------
lcar | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+------------------------------------------------------------
lrpcar | −1.195793 0.1918915 −6.23 0.000 −1.6048 −0.786786
lrpdi | 0.8379483 0.1372577 6.10 0.000 0.5453904 1.130506
lrprail | 0.3104458 0.3019337 1.03 0.320 −0.3331106 0.9540023
lrpbus | −0.3085937 0.3166891 −0.97 0.345 −0.9836004 0.3664131
_cons | 8.22269 0.7318088 11.24 0.000 6.662877 9.782503
-----------------------------------------------------------------------
The new price variables are not signiﬁcant so there is unlikely to have been a
serious OVB problem. Neither car price nor income coefﬁcients have changed.
The simpler model seems to be preferred.
(d) The restricted equation (in logs) is

y = β1 + β2 PCAR + β3 RPDI + β4 (PRAIL + PBUS) + u

and estimating this yields ESS_R = 0.007901. The test statistic is therefore

F = ((0.007901 − 0.007333)/1) / (0.007333/(20 − 4 − 1)) = 1.16

This is not significant, so the hypothesis of equal coefficients is accepted.
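The restricted-versus-unrestricted F statistic used here (and again in Exercise 8.7) can be sketched as a small helper, with the ESS values from the answer above; the function name is illustrative.

```python
def f_restriction(ess_r, ess_u, q, n, k):
    """F test of q linear restrictions:
    F = ((ESS_R - ESS_U)/q) / (ESS_U/(n - k - 1)),
    where k is the number of regressors in the unrestricted model."""
    return ((ess_r - ess_u) / q) / (ess_u / (n - k - 1))

# One restriction: equal coefficients on the rail and bus prices
print(round(f_restriction(0.007901, 0.007333, 1, 20, 4), 2))  # → 1.16
```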
Exercise 8.6
(a) The result is:
Source | SS df MS Number of obs = 20
----------+------------------------------- F(5, 14) = 232.28
Model | 0.558588344 5 0.111717669 Prob > F = 0.0000
Residual | 0.006733452 14 0.000480961 R-squared = 0.9881
----------+------------------------------- Adj R-squared = 0.9838
Total | 0.565321796 19 0.029753779 Root MSE = 0.02193
-----------------------------------------------------------------------
lcar | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-----------------------------------------------------------------------
lrpcar | −1.107049 0.2062769 −5.37 0.000 −1.549469 −0.6646293
lrpdi | 0.8898566 0.1438706 6.19 0.000 0.581285 1.198428
lrprail | 0.5466294 0.3667016 1.49 0.158 −0.2398673 1.333126
lrpbus | −0.4867887 0.3523676 −1.38 0.189 −1.242542 0.2689648
d1990 | −0.0314327 0.0281614 −1.12 0.283 −0.091833 0.0289676
_cons | 7.352081 1.065511 6.90 0.000 5.066787 9.637375
-----------------------------------------------------------------------
The new coefficient (−0.03) suggests car travel is 3% lower after 1990 than before, ceteris paribus. However, the coefficient is not significantly different from zero, so there is little evidence of a structural break. The change in car usage appears due to changes in prices and income.
(b) The result is:
Source | SS df MS Number of obs = 20
----------+------------------------------- F(6, 13) = 191.34
Model | 0.558991816 6 0.093165303 Prob > F = 0.0000
Residual | 0.00632998 13 0.000486922 R-squared = 0.9888
----------+------------------------------- Adj R-squared = 0.9836
Total | 0.565321796 19 0.029753779 Root MSE = 0.02207
-----------------------------------------------------------------------
lcar | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+------------------------------------------------------------
lrpcar | −1.116536 0.2078126 −5.37 0.000 −1.565488 −0.6675841
lrpdi | 1.107112 0.2791366 3.97 0.002 0.5040736 1.71015
lrprail | 0.558322 0.3691905 1.51 0.154 −0.2392655 1.355909
lrpbus | −0.2707759 0.4266312 −0.63 0.537 −1.192457 0.6509048
d1990 | −0.036812 0.0289451 −1.27 0.226 −0.099344 0.02572
trend | −0.0099434 0.0109234 −0.91 0.379 −0.033542 0.0136552
_cons | 5.553859 2.247619 2.47 0.028 0.6981737 10.40954
-----------------------------------------------------------------------
The trend is not significant. Note that the income coefficient has changed substantially. This is due to the high correlation between income and the trend (r = 0.99). It seems preferable to keep income and exclude the trend.
Exercise 8.7
The F statistic is

F = ((0.007905 − 0.007333)/2) / (0.007333/(20 − 4 − 1)) = 0.59

This is less than the critical value of F(2, 15) = 3.68, so the hypothesis that both coefficients are zero is accepted.
9 Data collection and sampling methods
Contents
Learning outcomes 318
Introduction 319
Using secondary data sources 319
Make sure you collect the right data 320
Try to get the most up-to-date ﬁgures 320
Keep a record of your data sources 321
Check your data 321
Using electronic sources of data 321
Collecting primary data 323
The meaning of random sampling 324
Types of random sample 326
Simple random sampling 326
Stratiﬁed sampling 327
Cluster sampling 330
Multistage sampling 331
Quota sampling 332
Calculating the required sample size 333
Collecting the sample 335
The sampling frame 335
Choosing from the sampling frame 336
Interviewing techniques 336
Case study: the UK Expenditure and Food Survey 338
Introduction 338
Choosing the sample 338
The sampling frame 339
Collection of information 339
Sampling errors 339
Summary 339
Key terms and concepts 340
References 340
Problems 341
Learning outcomes
By the end of this chapter you should be able to:
● recognise the distinction between primary and secondary data sources
● avoid a variety of common pitfalls when using secondary data
● make use of electronic sources to gather data
● recognise the main types of random sample and understand their relative merits
● appreciate how such data are collected
● conduct a small sample survey yourself.
Complete your diagnostic test for Chapter 9 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL with
additional supporting resources.
STFE_C09.qxd 26/02/2009 09:14 Page 318

Introduction
It may seem a little odd to look at data collection now after several chapters
covering the analysis of data. Collection of data logically comes ﬁrst but the
fact is that most people’s experience is as a user of data which determines their
priorities. Also it is difﬁcult to have the motivation for learning about data
collection when one does not know what it is subsequently used for. Having
spent considerable time learning how to analyse data it is now time to look at
their collection and preparation.
There are two reasons why you might ﬁnd this chapter useful. First it will
help if you have to carry out some kind of survey yourself. Second it will help
you in your data analysis even if you are using someone else’s data. Knowing
the issues involved in data collection can help your judgement of the quality of
the data you are using.
When conducting statistical research, there are two ways of obtaining data:
1 use secondary data sources, such as the UN Yearbook, or
2 collect sample data personally (a primary data source).
The ﬁrst category should nowadays be divided into two subsections: printed
and electronic sources. The latter is obviously becoming more important as time
progresses but printed documentation still has its uses. Using secondary data
sources sounds simple, but it is easy to waste valuable time by making elementary errors. The first part of this chapter provides some simple advice to help you
avoid such mistakes.
Much of this text has been concerned with the analysis of sample evidence
and the inferences that can be drawn from it. It has been stressed that this
evidence must come from randomly drawn samples and although the notion of
randomness was discussed in Chapter 2 the precise details of random sampling
have not been set out.
The second part of this chapter is therefore concerned with the problems of
collecting sample survey data prior to their analysis. The decision to collect the
data personally depends upon the type of problem faced, the current availability of data relating to the problem, and the time and cost needed to conduct a
survey. It should not be forgotten that the ﬁrst question that needs answering
is whether the answer obtained is worth the cost of ﬁnding it. It is probably not
worthwhile for the government to spend £50 000 to ﬁnd out how many biscuits
people eat on average, although it may be worth biscuit manufacturers doing this. The sampling procedure is always subject to some limit on cost, therefore, and the researcher is trying to obtain the best value for money.
Using secondary data sources
Much of the research in economics and ﬁnance is based on secondary data
sources i.e. data which the researcher did not collect herself. The data may be
in the form of ofﬁcial statistics such as those published in Economic Trends or
they may come from unofficial surveys. In either case one has to use the data as presented; there is no control over sampling procedures.
It may seem easy enough to look up some ﬁgures in a publication but there are
a number of pitfalls for the unwary. The following advice comes from experience, some of it painful, and it may help you to avoid wasting time and effort. I have also learned much from the experiences of my students, whom I have also watched suffer.
A lot of data are now available online so the advice given here covers both
printed and electronic sources with a separate section for the latter.
Make sure you collect the right data
This may seem obvious, but most variables can be measured in a variety of different ways. Suppose you want to measure the cost of labour to firms over time. Should you use the wage rate or earnings? The latter includes payment for extra hours, such as overtime payments, and reflects general changes in the length of the working week. Is the wage measured per hour or per week? Does it include part-time workers? If so, a trend in the proportion of part-timers will bias the wage series. Does the series cover all workers, men only, or women only? Again, changes in the composition will influence the wage series. What about tax and social security costs – are they included? There are many questions one could ask.
One needs to have a clear idea, therefore, of the precise variable one needs to collect. This will presumably depend upon the issue in question. Economic theory might provide some guidance: for instance, theory suggests that firms care about real wage rates (i.e. after taking account of inflation, and so related to the price of the goods the firm sells), so this is what one should measure. Check the definition of any series you collect; this is often at the back of the printed publication or in a separate supplement giving explanatory notes and definitions. Make sure that the definition has not changed over the time period you require: the definition of unemployment used in the UK changed about 20 times in the 1980s, generally with the effect of reducing measured unemployment even if actual unemployment was unaffected. In the UK the geographical coverage of data may vary: one series may relate to the UK, another to Great Britain, and yet another to England and Wales. Care should obviously be taken if one is trying to compare such series.
Try to get the most up-to-date ﬁgures
Many macroeconomic series are revised as more information becomes available.
The balance of payments serves as a good example. The ﬁrst edition of this book
showed the balance of payments current balance in £m for the UK for 1970
as published in successive years, as follows:

1971  1972  1973  1974  1975  1976  1977  1978  . . .  1986
 579   681   692   707   735   733   695   731  . . .   795

The difference between the largest and smallest figures is of the order of 37%, a wide range. In the third edition of this book the figure (from the 1999 edition of Economic Trends Annual Supplement) was £911m, which is 57% higher than the initial estimate. The latest figure at the time of writing is £819m. Most series are better than this. The balance of payments is hard to measure because it is the small difference between two large numbers, exports and imports. A 5% increase
in measured exports and a 5% decrease in measured imports could thus change the measured balance by 100% or more.
One should always try to get the most up-to-date figures, therefore, which often means working backwards through printed data publications: use the current issue first and get data back as far as is available, then find the previous issue to go back a little further, and so on. This can be tedious, but it will also give some idea of the reliability of the data, from the size of data revisions.
Keep a record of your data sources
You should always keep precise details of where you obtained each item of data. If you need to go back to the original publication (e.g. to check on the definition of a series) you will then be able to find it easily. It is easy to spend hours, if not days, trying to find the source of some interesting numbers that you wish to update. 'Precise details' means the name of the publication, issue number or date, and table or page number. It also helps to keep the library reference number of the publication if it is obscure. It is best to take a photocopy of the data (but check copyright restrictions) rather than just copy it down, if possible.
Keeping data in Excel or another spreadsheet
Spreadsheets are ideal for keeping your data. It is often a good idea to keep the data all together in one worksheet, extracting portions of them as necessary and analysing them in another worksheet. Alternatively, it is usually quite easy to transfer data from the spreadsheet to another program (e.g. SPSS or Stata) for more sophisticated analysis. In most spreadsheets you can attach a comment to any cell, so you can use this to keep a record of the source of each observation, changes of definition, etc. Thus you can retain all the information about your data together in one place.
Check your data
Once you have collected your data you must check them. Once you have done this, you must check them again. Better still, persuade someone else to help with the second check. Note that if your data are wrong then all your subsequent calculations could be incorrect and you will have wasted much time. I have known many students who have spent months or even years on a dissertation or thesis, who have then found an error in the data they collected earlier.
A useful way to check the data is first to graph them (e.g. a time-series plot). Obvious outliers will show up and you can investigate them for possible errors. Do not just rely on the graphs, however; look through your data and check them against the original source. Do not forget that the original source could be wrong too, so be wary of 'unusual' observations.
Using electronic sources of data
A vast amount of data are now available electronically, usually online, and this is becoming increasingly the norm. Sometimes the data are available free, but sometimes they have to be paid for, especially if they have a commercial value.
¹ I wrote this for the previous edition of this book. I can no longer find the same data on Statbase; it seems to have disappeared into the ether.
My experience suggests that many students nowadays only consider online resources, which I feel is a mistake. Not everything is online and, even when it is, it can be extremely hard to find. It can sometimes take less time to go to the library, find the appropriate journal and type in the numbers. As an estimate, 100 observations should take no longer than about 10 minutes to type into a computer, which is probably quicker than finding them electronically, converting to the right format, etc. Hence the advantage of online data lies principally with large datasets.
Obtaining data electronically should avoid input errors and provide consistent, up-to-date figures. However, this is not always guaranteed. For example, the UK Office for National Statistics (ONS) online databank provides plenty of information, but some of the series clearly have breaks in them and there is little warning of this in the on-screen documentation. The series for revenue per admission to cinemas (roughly the price of admission) goes:

1963   1964   1965   1966   1967
37.00  40.30  45.30  20.60  21.80

which strongly suggests an artificial break in the series in 1966, especially as admissions fell by 12% between 1965 and 1966. Later in the series the observations appear to be divided by 100. The lesson is that even with electronic data you should check the numbers to ensure they are correct.¹
You need to follow the same advice with electronic sources as with printed ones: make sure you collect the right variables and keep a note of your source. Online sources do not seem to be as good as many printed sources when it comes to providing definitions of the variables. It is often unclear if the data are in real terms, seasonally adjusted, etc. Sometimes you may need to go to the printed document to find the definitions, even if the data themselves come from the internet. Keeping a note of your source means taking down the URL of the site you visit. Remember that some sites generate the page 'on demand', so the web address is not a permanent one and typing it in later on will not take you back to the same source. In these circumstances it may be better to note the 'root' part of the address, e.g. www.imf.org/data/, rather than the complete detail. You should also take a note of the date you accessed the site; this may be needed if you put the source into a bibliography.
Tips on downloading data
● If you are downloading a spreadsheet, save it to your hard disk, then include the URL of the source within the spreadsheet itself. You will always know where it came from. You can do the same with Word documents.
● You cannot do this with PDF files, which are read-only. You could save the file to your disk including the URL within the file name, but avoid putting extra full stops in the file name (that confuses the operating system): replace them with hyphens.
● You can use the 'Text select tool' within Acrobat to copy items of data from a PDF file and then paste them into a spreadsheet.
● Often, when pasting several columns of such data into Excel, all the numbers go into a single column. You can fix this using the Data, Text to Columns menu. Experimentation is required, but it works well.
Since there are now so many online sources, and they are constantly changing, a list of useful data sites rapidly becomes out of date. The following sites seem to have withstood the test of time so far and have a good chance of surviving throughout the life of this edition.
● The UK Ofﬁce for National Statistics is at http://www.statistics.gov.uk/ and
their Statbase service supplies over 1000 datasets online for free. This is tied
to information on 13 ‘themes’ such as education agriculture etc.
● The Data and Story Library at http://lib.stat.cmu.edu/DASL/ is just that:
datasets with accompanying statistical analyses which are useful for learning.
● The IMF’s World Economic Database is at http://www.imf.org/ follow the
links to publications World Economic Outlook then the database. It has
macroeconomic series for most countries for several years. It is easy to down-
load in csv text format for use in spreadsheets.
● The Biz/Ed site at http://www.bized.co.uk/ contains useful material on
business including ﬁnancial case studies of companies as well as economic
data. There is a link from here to the Penn World Tables which contain
national accounts data for many countries on a useful comparable basis
from 1960 onwards. Alternatively visit the Penn home page at http://
pwt.econ.upenn.edu/.
● The World Bank provides a lot of information particularly relating to
developing countries at http://www.worldbank.org/data/. Much of the
data appears to be in .pdf format so although it is easy to view on-screen it
cannot be easily transferred into a spreadsheet or similar software.
● Bill Goffe’s Resources for Economists site http://rfe.org contains a data
section which is a good starting point for data sources.
● Google. Possibly the most useful website of all. Intelligent use of this search
tool is often the best way to ﬁnd what you want.
● http://davidmlane.com/hyperstat/ has an online textbook and glossary. This
is useful if you have a computer handy but not a textbook.
● Financial and business databases are often commercial enterprises and hence
are not freely available. Two useful free or partially free sites however are
The Financial Times http://www.ft.com/home/uk and Yahoo Finance http://
ﬁnance.yahoo.com/.
Collecting primary data
Primary data are data that you have collected yourself from original sources
often by means of a sample survey. This has the advantage that you can design
the questionnaire to include the questions of interest to you and you have total
control over all aspects of data collection. You can also choose the size of the sample (as long as you have sufficient funds available) so as to achieve the desired width of any confidence intervals.
Almost all surveys rely upon some method of sampling, whether random or not. The probability distributions which have been used in previous chapters as the basis of the techniques of estimation and hypothesis testing rely upon the samples having been drawn at random from the population. If this is not the case, then the formulae for confidence intervals, hypothesis tests, etc. are incorrect and not strictly applicable (they may be reasonable approximations, but it is difficult to know how reasonable). In addition, the results about the bias and precision of estimators will be incorrect. For example, suppose an estimate of the average expenditure on repairs and maintenance by car owners is obtained from a sample survey. A poor estimate would arise if only Rolls-Royce owners were sampled, since they are not representative of the population as a whole. The precision of the estimator (the sample mean X̄) is likely to be poor, because the mean of the sample could either be very low (Rolls-Royce cars are very reliable, so rarely need repairs) or very high (if they do break down, the high quality of the car necessitates a costly repair). This means the confidence interval estimate will be very wide and thus imprecise. It is not immediately obvious if the estimator would be biased upwards or downwards.
Thus some form of random sampling method is needed in order to use the theory of the probability distributions of random variables. Nor should it be believed that the theory of random sampling can be ignored if a very large sample is taken, as the following cautionary tale shows. In 1936 the Literary Digest tried to predict the result of the forthcoming US election by sending out 10 million mail questionnaires. Two million were returned, but even with this enormous sample size Roosevelt's vote was incorrectly estimated by a margin of 19 percentage points. The problem is that those who respond to questionnaires are not a random sample of those who receive them.
The meaning of random sampling
The definition of random sampling is that every element of the population should have a known, non-zero probability of being included in the sample. The problem with the sample of cars used above was that Ford cars (for example) had a zero probability of being included. Many sampling procedures give an equal probability of being selected to each member of the population, but this is not an essential requirement. It is possible to adjust the sample data to take account of unequal probabilities of selection. If, for example, Rolls-Royces had a much greater chance of being included than Fords, then the estimate of the population mean would be calculated as a weighted average of the sample observations, with greater weight being given to the few 'Ford' observations than to the relatively abundant 'Rolls-Royce' observations. A very simple illustration of this is given below. Suppose that for the population we have the following data:
                      Rolls-Royce   Ford
Number in population  20 000        2 000 000
Annual repair bill    £1000         £200
Then the true average repair bill is

μ = (20 000 × 1000 + 2 000 000 × 200)/2 020 000 = £207.92
Suppose the sample data are as follows:
                          Rolls-Royce   Ford
Number in sample          20            40
Probability of selection  1/1000        1/50 000
Repair bill               £990          £205
To calculate the average repair bill from the sample data we use a weighted average, using the relative population sizes as weights, not the sample sizes:

X̄ = (20 000 × 990 + 2 000 000 × 205)/2 020 000 = £212.77
If the sample sizes were used as weights, the average would come out at £466.67, which is substantially incorrect. As long as the probability of being in the sample is known (and hence the relative population sizes must be known) the weight can be derived, but if the probability is zero this procedure breaks down.
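One way to implement this weighting is to weight each sample observation by the inverse of its selection probability. The sketch below is illustrative (the helper is not code from the text); with the repair-bill figures it reproduces the £212.77 estimate.

```python
def weighted_mean(values, probs):
    """Estimate a population mean from a sample with unequal (known, non-zero)
    selection probabilities, weighting each observation by 1/p_i."""
    weights = [1 / p for p in probs]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# 20 Rolls-Royces at £990 (p = 1/1000) and 40 Fords at £205 (p = 1/50 000)
values = [990] * 20 + [205] * 40
probs = [1 / 1000] * 20 + [1 / 50000] * 40
print(round(weighted_mean(values, probs), 2))  # → 212.77
```

The unweighted sample mean of the same data is £466.67, the substantially incorrect figure noted above.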
Other theoretical assumptions necessary for deriving the probability distribution of the sample mean or proportion are that the population is of infinite size and that each observation is independently drawn. In practice the former condition is never satisfied, since no population is of infinite size, but most populations are large enough that it does not matter. For each observation to be independently drawn (i.e. the fact of one observation being drawn does not alter the probability of others in the sample being drawn) strictly requires that sampling be done with replacement, i.e. each observation drawn is returned to the population before the next observation is drawn. Again, in practice this is often not the case (sampling being done without replacement), but again this is of negligible practical importance where the population is large relative to the sample.
On occasion the population is quite small and the sample constitutes a substantial fraction of it. In these circumstances the finite population correction (fpc) should be applied to the formula for the variance of X̄, the fpc being given by

fpc = 1 − n/N    (9.1)

where N is the population size and n the sample size. The table below illustrates its usage:
Variance of X̄ from an infinite population:   σ²/n
Variance of X̄ from a finite population:      σ²/n × (1 − n/N)

Example values of fpc:
n       20      25      50      100
N       50      100     1000    10 000
fpc     0.60    0.75    0.95    0.99
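Equation (9.1) and the table values are easily reproduced; a minimal sketch in Python (not from the book):

```python
# Finite population correction: the variance of the sample mean from a
# finite population is (sigma^2 / n) * (1 - n/N), per equation (9.1).

def fpc(n, N):
    """Finite population correction, 1 - n/N."""
    return 1 - n / N

# The example values from the table in the text
for n, N in [(20, 50), (25, 100), (50, 1000), (100, 10_000)]:
    print(n, N, round(fpc(n, N), 2))  # 0.60, 0.75, 0.95, 0.99

# Sampling the whole population (n = N) leaves no uncertainty at all
assert fpc(100, 100) == 0.0
```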

Chapter 9 • Data collection and sampling methods
The finite population correction serves to narrow the confidence interval, because a sample size of, say, 25 reveals more about a population of 100 than about a population of 100 000, so there is less uncertainty about population parameters. When the sample size constitutes only a small fraction of the population (e.g. 5% or less) the finite population correction can be ignored in practice. If the whole population is sampled (n = N) then the variance becomes zero and there is no uncertainty about the population mean.
A further important aspect of random sampling occurs when there are two samples to be analysed, when it is important that the two samples are independently drawn. This means that the drawing of the first sample does not influence the drawing of the second sample. This is a necessary condition for the derivation of the probability distribution of the difference between the sample means (or proportions).
Types of random sample
The meaning and importance of randomness in the context of sampling has been explained. However, there are various different types of sampling, all of them random, but which have different statistical properties. Some methods lead to greater precision of the estimates, while others can lead to considerable cost savings in the collection of the sample data, but at the cost of lower precision. The aim of sampling is usually to obtain the most precise estimates of the parameter in question, but the best method of sampling will depend on the circumstances of each case. If it is costly to sample individuals, a sampling method which lowers cost may allow a much larger sample size to be drawn, and thus good, precise estimates to be obtained, even if the method is inherently not very precise. These issues are investigated in more detail below, as a number of different sampling methods are examined.
Simple random sampling
This type of sampling has the property that every possible sample that could be obtained from the population has an equal chance of being selected. This implies that each element of the population has an equal probability of being included in the sample, but this is not the defining characteristic of simple random sampling. As will be shown below, there are sampling methods where every member of the population has an equal chance of being selected, but some samples (i.e. certain combinations of population members) can never be selected.
The statistical methods in this book are based upon the assumption of simple random sampling from the population. It leads to the most straightforward formulae for estimation of the population parameters. Although many statistical surveys are not based upon simple random sampling, the use of statistical tests based on simple random sampling is justified, since the sampling process is often hypothetical. For example, if one were to compare annual growth rates of two countries over a 30-year period, a z test on the difference of two sample means (i.e. the average annual growth rate in each country) would be conducted. In a sense the data are not a sample, since they are the only possible data for those two countries over that time period. Why not, therefore, just regard
the data as constituting the whole population? Then it would just be a case of finding which country had the higher growth rate; there would be no uncertainty about it.
The alternative way of looking at the data would be to suppose that there exists some hypothetical population of annual growth rates, and that the data for the two countries were drawn by simple random sampling from this population. Is this story consistent with the data available? In other words, could the data we have simply arise by chance? If the answer to this is no (i.e. the z score exceeds the critical value) then there is something causing a difference between the two countries (it may not be clear what that something is). In this case it is reasonable to assume that all possible samples have an equal chance of selection, i.e. that simple random sampling takes place. Since the population is hypothetical, one might as well suppose it to have an infinite number of members (again required by sampling theory).
Stratiﬁed sampling
Returning to the practical business of sampling, one problem with simple random sampling is that it is possible to collect ‘bad’ samples, i.e. those which are unrepresentative of the population. An example of this is what we may refer to as the ‘basketball player’ problem: in trying to estimate the average height of the population, the sample, by sheer bad luck, contains a lot of basketball players. One way round this problem is to ensure that the proportion of basketball players in the sample accurately reflects the proportion of basketball players in the population (i.e. very small). The way to do this is to divide up the population into ‘strata’ (e.g. basketball players and non-players) and then to ensure that each stratum is properly represented in the sample. This is best illustrated by means of an example.
A survey of newspaper readership, which is thought to be associated with age, is to be carried out. Older people are thought to be more likely to read newspapers, as younger people are more likely to use other sources (principally the internet) to obtain the news. Suppose the population is made up of three age strata: old, middle-aged and young, as follows:

Percentage of population in age group
Old    Middle-aged    Young
20     50             30
Suppose a sample of size 100 is taken. With luck it would contain 20 old people, 50 who are middle-aged and 30 young people, and thus would be representative of the population as a whole. But if, by bad luck or bad sample design, all 100 people in the sample were middle-aged, poor results might be obtained, since newspaper readership differs between age groups.
To avoid this type of problem a stratified sample is taken, which ensures that all age groups are represented in the sample. This means that the survey would have to ask people about their age as well as their reading habits. The simplest form of stratified sampling is equiproportionate sampling, whereby a stratum which constitutes, say, 20% of the population also makes up 20% of the sample. For the example above the sample would be made up as follows:
Class               Old    Middle-aged    Young    Total
Number in sample    20     50             30       100
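Equiproportionate allocation is simple enough to sketch in Python (an illustration, not from the book):

```python
# Equiproportionate stratified sampling: each stratum's share of the
# sample equals its share of the population.

def allocate(shares, n):
    """Allocate a sample of size n across strata in proportion to their
    population shares (assumed to sum to 1)."""
    return {stratum: round(n * share) for stratum, share in shares.items()}

shares = {"old": 0.20, "middle-aged": 0.50, "young": 0.30}
print(allocate(shares, 100))  # {'old': 20, 'middle-aged': 50, 'young': 30}
```

Note that for awkward population shares, naive rounding like this may not sum exactly to n, so in practice the remainders need handling.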
It should be clear why stratified sampling constitutes an improvement over simple random sampling, since it rules out ‘bad’ samples, i.e. those not representative of the population. It is simply impossible to get a sample consisting completely of middle-aged people. In fact, it is impossible to get a sample in anything but the proportions 20:50:30, as in the population; this is ensured by the method of collecting the sample.
It is easy to see when stratification leads to large improvements over simple random sampling. If there were no difference between strata (age groups) in reading habits, then there would be no gain from stratification. If reading habits were the same regardless of age group, there would be no point in dividing up the population according to that factor. On the other hand, if there were large differences between strata, but within strata reading habits were similar, then the gains from stratification would be large. The fact that reading habits are similar within strata means that even a small sample from a stratum should give an accurate picture of that stratum.
Stratification is beneficial therefore when
● the between-strata differences are large, and
● the within-strata differences are small.
These benefits take the form of greater precision of the estimates, i.e. narrower confidence intervals.² The greater precision arises because stratified sampling makes use of supplementary information – i.e. the proportion of the population in each age group. Simple random sampling does not make use of this. Obviously, therefore, if those proportions of the population are unknown, stratified sampling cannot be carried out. However, even if the proportions are only known approximately, there could be a gain in precision.
In this example age is a stratification factor, i.e. a variable which is used to divide the population into strata. Other factors could of course be used, such as income or even height. A good stratification factor is one which is related to the subject of investigation. Income would therefore probably be a good stratification factor, because it is related to reading habits, but height is not, since there is probably little difference between tall and short people regarding the newspaper they read. What is a good stratification factor obviously depends upon the subject of study. A bed manufacturer might well find height to be a good stratification factor if conducting an enquiry into preferences about the size of beds. Although good stratification factors improve the precision of estimates, bad factors do not make them worse; there will simply be no gain over simple random sampling. It would be as if there were no differences between the age groups in reading habits, so that ensuring the right proportions in the sample is irrelevant, but it has no detrimental effects.
² The formulae for calculating confidence intervals with stratified sampling are not given here, since they merit a whole book to themselves. The interested reader should consult, for example, C. A. Moser and G. Kalton, Survey Methods in Social Investigation (1971), Heinemann.
Proportional allocation of sample observations to the different strata, as done above, is the simplest method but is not necessarily the best. For the optimal allocation there should generally be a divergence from proportional allocation, and the sample should have more observations in a particular stratum (relative to proportional allocation):
● the more diverse the stratum, and
● the cheaper it is to sample the stratum.
Starting from the 20:50:30 proportional allocation derived earlier, suppose that older people all read the same newspaper, but youngsters read a variety of titles. Then the representation of youngsters in the sample should be increased and that of older people reduced. If it really were true that every old person read the same paper, then one observation from that class would be sufficient to yield all there is to know about it. Furthermore, if it is cheaper to sample younger readers (perhaps because they are easier to contact than older people), then again the representation of youngsters in the sample should be increased. This is because, for a given budget, it will allow a larger total sample size.
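The text states the two rules only qualitatively. The standard formula behind them (not given in the book) makes stratum h's sample size proportional to N_h × σ_h / √c_h, where σ_h is the stratum's standard deviation and c_h its cost per observation. A sketch with purely hypothetical numbers:

```python
import math

# Optimal allocation with sampling costs (Neyman allocation when costs are
# equal): n_h is proportional to pop_share * sd / sqrt(cost).  The formula
# and the illustrative figures below are NOT from the book.

def optimal_allocation(strata, n):
    """strata: {name: (pop_share, std_dev, cost_per_obs)} -> sample sizes."""
    weight = {h: share * sd / math.sqrt(cost)
              for h, (share, sd, cost) in strata.items()}
    total = sum(weight.values())
    return {h: round(n * w / total) for h, w in weight.items()}

# Hypothetical figures: the old all read much the same paper (tiny spread),
# the young are diverse in their reading and cheap to sample.
strata = {"old": (0.20, 0.5, 4.0),
          "middle-aged": (0.50, 2.0, 4.0),
          "young": (0.30, 4.0, 1.0)}
print(optimal_allocation(strata, 100))
```

As the text predicts, the diverse, cheap-to-sample young stratum gets far more than its proportional 30 observations, and the homogeneous old stratum far fewer than 20.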
STATISTICS IN PRACTICE

Surveying concert-goers
A colleague and I carried out a survey of people attending a concert in Brighton by Jamiroquai (hope they’re still popular by the time you read this) to find out who they were, how much they spent in the town and how they travelled to the concert. The spreadsheet below gives some of the results.
The data were collected by face-to-face interviews before the concert. We did not have a sampling frame, so the student interviewers simply had to choose the sample themselves on the night. The one important instruction about sampling we gave them was that they should not interview more than one person in any group. People in the same group are likely to be influenced by each other (e.g. travel together), so we would not get independent observations, reducing the effective sample size.
From the results you can see that 41.1% either worked or studied in Brighton, and that only one person in the sample was neither working nor studying. The second half of the table shows that 64.4% travelled to the show in a car (obviously adding to congestion in the town), about half of whom shared a car ride. Perhaps surprisingly, Brighton residents were just as likely to use their car to travel as were those from out of town.
The average level of spending was £24.20, predominantly on food (£7.38), drink (£5.97) and shopping (£5.37). The last category had a high variance associated with it – many people spent nothing, one person spent £200 in the local shops.
Cluster sampling
A third form of sampling is cluster sampling which, although intrinsically inefficient, can be much cheaper than other forms of sampling, allowing a larger sample size to be collected. Drawing a simple or a stratified random sample of size 100 from the whole of Britain would be very expensive to collect, since the sample observations would be geographically very spread out. Interviewers would have to make many long and expensive journeys simply to collect one or two observations. To avoid this, the population can be divided into ‘clusters’ (e.g. regions or local authorities) and one or more of these clusters are then randomly chosen. Sampling takes place only within the selected clusters, and is therefore geographically concentrated, and the cost of sampling falls, allowing a larger sample to be collected for the same expenditure of money.
Within each cluster one can have either a 100% sample or a lower sampling fraction, which is called multistage sampling (this is explained further below). Cluster sampling gives unbiased estimates of population parameters but, for a given sample size, these are less precise than the results from simple or stratified sampling. This arises in particular when the clusters are very different from each other but fairly homogeneous within themselves. In this case, once a cluster is chosen, if it is unrepresentative of the population a poor (inaccurate) estimate of the population parameter is inevitable. The ideal circumstances for cluster sampling are when all clusters are very similar, since in that case examining one cluster is almost as good as examining the whole population.
Dividing up the population into clusters and dividing it into strata are similar procedures, but the important difference is that sampling is from one or at most a few clusters, but from all strata. This is reflected in the characteristics which make for good sampling. In the case of stratified sampling it is beneficial
if the between-strata differences are large and the within-strata differences small. For cluster sampling this is reversed: it is desirable to have small between-cluster differences but heterogeneity within clusters. Cluster sampling is less efficient (precise) for a given sample size, but is cheaper and so can offset this disadvantage with a larger sample size. In general, cluster sampling needs a much larger sample to be effective, so is only worthwhile where there are significant gains in cost.
Multistage sampling
Multistage sampling was briefly referred to in the previous section and is commonly found in practice. It may consist of a mixture of simple, stratified and cluster sampling at the various stages of sampling. Consider the problem of selecting a random sample of 1000 people from a population of 25 million, to find out about voting intentions. A simple random sample would be extremely expensive to collect, for the reasons given above, so an alternative method must be found. Suppose further that it is suspected that voting intentions differ according to whether one lives in the north or south of the country, and whether one is a home owner or renter. How is the sample to be selected? The following would be one appropriate method.
First, the country is divided up into clusters of counties or regions, and a random sample of these taken, say one in five. This would be the first way of reducing the cost of selection, since only one-fifth of all counties now need to be visited. This one-in-five sample would be stratified to ensure that north and south were both appropriately represented. To ensure that each voter has an equal chance of being in the sample, the probability of a county being drawn should be proportional to its adult population. Thus a county with twice the population of another should have twice the probability of being in the sample.
Having selected the counties, the second stage would be to select a random sample of local authorities within each selected county. This might be a one-in-ten sample from each county, and would be a simple random sample within each cluster. Finally, a selection of voters from within each local authority would be taken, stratified according to tenure. This might be a one-in-500 sample. The sampling fractions would therefore be

1/5 × 1/10 × 1/500 = 1/25 000

So from the population of 25 million voters a sample of 1000 would be collected. For different population sizes the sampling fractions could be adjusted so as to achieve the goal of a sample size of 1000.
The sampling procedure is a mixture of simple, stratified and cluster sampling. The two stages of cluster sampling allow the selection of 50 local authorities for study, and so costs are reduced. The north and south of the country are both adequately represented, and housing tenures are also correctly represented in the sample by the stratification at the final stage. The resulting confidence intervals will be complicated to calculate, but should give an improvement over the method of simple random sampling.
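The chain of sampling fractions can be checked exactly (a small sketch, not from the book):

```python
from fractions import Fraction

# Multistage sampling fractions from the text: 1/5 of counties,
# 1/10 of local authorities within them, then 1/500 of voters.
overall = Fraction(1, 5) * Fraction(1, 10) * Fraction(1, 500)
print(overall)               # 1/25000
print(25_000_000 * overall)  # 1000 voters sampled
```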
STATISTICS IN PRACTICE
The UK Time Use Survey
The UK Time Use Survey provides a useful example of the effects of multistage sampling. It uses a mixture of cluster and stratified sampling, and the results are weighted to compensate for unequal probabilities of selection into the sample and for the effects of non-response. Together these act to increase the size of standard errors, relative to those obtained from a simple random sample of the same size. This increase can be measured by the design factor, defined as the ratio of the true standard error to the one arising from a simple random sample of the same size. For the time use survey the design factor is typically 1.5 or more. Thus the standard errors are increased by 50% or more, but a simple random sample of the same size would be much more expensive to collect (e.g. the clustering means that only a minority of geographical areas are sampled).
The following table shows the average amount of time spent sleeping by 16–24 year olds, in minutes per day:

          Mean     True s.e.   95% CI          Design factor   n      Effective sample size
Male      544.6    6.5         531.9–557.3     1.63            1090   412
Female    545.7    4.2         537.3–554.0     1.14            1371   1058
The true standard error, taking account of the sample design, is 6.5 minutes for men. The design factor is 1.63, meaning this standard error is 63% larger than for a similar sized (n = 1090) simple random sample. Equivalently, a simple random sample of size n = 412 (= 1090/1.63²) would achieve the same precision, but at greater cost.
How the design factor is made up is shown in the following table:

Design factor (deft)   Deft due to stratification   Deft due to clustering   Deft due to weighting
1.63                   1.00                         1.17                     1.26
It can be seen that stratification has no effect on the standard error, but both clustering and the post-sample weighting serve to increase the standard errors.
Source: The UK 2000 Time Use Survey Technical Report (2003), Her Majesty’s Stationery Office.
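The effective-sample-size calculation in the box follows directly from the definition of the design factor; a minimal sketch (not from the book):

```python
# Effective sample size under a complex design: n_eff = n / deft^2, where
# deft is the ratio of the true standard error to that of a simple random
# sample of the same size.

def effective_sample_size(n, deft):
    return n / deft ** 2

# Using the rounded design factor 1.63 for the male subsample
print(round(effective_sample_size(1090, 1.63)))  # 410
```

The printed value, 410, differs slightly from the 412 reported in the table, presumably because the survey's design factor is rounded to 1.63.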
Quota sampling
Quota sampling is a non-random method of sampling, and therefore it is impossible to use sampling theory to calculate confidence intervals from the sample data, or to find whether or not the sample will give biased results. Quota sampling simply means obtaining the sample information as best one can, for example by asking people in the street. However, it is by far the cheapest method of sampling and so allows much larger sample sizes. As shown above, large sample sizes can still give biased results if sampling is non-random, but in some cases the budget is too small to afford even the smallest properly conducted random sample, so a quota sample is the only alternative.
Even with quota sampling, where the interviewer is simply told to go out and obtain, say, 1000 observations, it is worth making some crude attempt at stratification. The problem with human interviewers is that they are notoriously non-random, so that when they are instructed to interview every tenth person they see (a reasonably random method), if that person turns out to be a shabbily dressed tramp slightly the worse for drink, they are quite likely to select the eleventh person instead. Shabbily dressed tramps slightly the worse for drink are therefore under-represented in the sample. To combat this sort of problem the interviewers are given quotas to fulfil, for example 20 men and 20 women, 10 old-age pensioners, one shabbily dressed tramp, etc., so that the sample will at least broadly reflect the population under study and give reasonable results.
It is difficult to know how accurate quota samples are, since it is rare for their results to be checked against proper random samples or against the population itself. Probably the most common quota samples relate to voting intentions, and so can be checked against actual election results. The 1992 UK general election provides an interesting illustration. The opinion polls predicted a fairly substantial Labour victory, but the outcome was a narrow Conservative majority. An enquiry concluded that the erroneous forecast occurred because a substantial number of voters changed their minds at the last moment, and that there was ‘differential turn-out’, i.e. Conservative supporters were more likely to vote than Labour ones. Since then pollsters have tried to take this factor into account when trying to predict election outcomes.
STATISTICS IN PRACTICE

Can you always believe surveys?
Many surveys are more interested in publicising something than in finding out the facts. One has to be wary of surveys finding that people enjoy high-rise living . . . when the survey is sponsored by an elevator company. In July 2007 a survey of 1000 adults found that ‘the average person attends 3.4 weddings each year’. This sounds suspiciously high to me. I’ve never attended three or more weddings in a year, nor have friends I have asked. Let’s do some calculations. There were 283 730 weddings in the UK in 2005. There are about 45m adults, so if they each attend 3.4 weddings that makes 45m × 3.4 = 153 million attendees. This means 540 per wedding. That seems excessively high (remember this excludes children) and probably means the sample design was poor, obtaining an unrepresentative result.
A good way to make a preliminary judgement on the likely accuracy of a survey is to ask ‘who paid for this?’
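The plausibility check in the box is just arithmetic, and can be reproduced directly (a sketch, not from the book):

```python
# Sanity-check the survey claim: 45 million adults each attending
# 3.4 weddings a year, against 283,730 weddings in the UK in 2005.
adults = 45_000_000
weddings = 283_730

attendees = adults * 3.4          # about 153 million claimed attendances
per_wedding = attendees / weddings
print(round(per_wedding))         # roughly 540 guests per wedding -- implausible
```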
Calculating the required sample size
Before collecting sample data it is obviously necessary to know how large the sample size has to be. The required sample size will depend upon two factors:
● the desired level of precision of the estimate, and
● the funds available to carry out the survey.
The greater the precision required, the larger the sample size needs to be, other things being equal. But a larger sample will obviously cost more to collect and
this might conflict with a limited amount of funds being available. There is a trade-off, therefore, between the two desirable objectives of high precision and low cost. The following example shows how these two objectives conflict.
A firm producing sweets wishes to find out the average amount of pocket money children receive per week. It wants to be 99% confident that the estimate is within 20 pence of the correct value. How large a sample is needed?
The problem is one of estimating a confidence interval, turned on its head. Instead of having the sample information (X̄, s and n) and calculating the confidence interval for μ, the desired width of the confidence interval is given and it is necessary to find the sample size n which will ensure this. The formula for the 99% confidence interval, assuming a Normal rather than t distribution (i.e. it is assumed that the required sample size will be large), is

X̄ − 2.58 × √(s²/n) to X̄ + 2.58 × √(s²/n)    (9.2)

Diagrammatically this can be represented as in Figure 9.1.
The firm wants the distance between X̄ and μ to be no more than 20 pence in either direction, which means that the confidence interval must be 40 pence wide. The value of n which makes the confidence interval 40 pence wide has to be found. This can be done by solving the equation
20 = 2.58 × √(s²/n)

and hence, by rearranging:

n = (2.58² × s²)/20²    (9.3)
All that is now required to solve the problem is the value of s², the sample variance, but since the sample has not yet been taken this is not available. There are a number of ways of trying to get round this problem:
● using the results of existing surveys, if available
● conducting a small preliminary survey
● guessing.
These may not seem very satisfactory (particularly the last) but something has to be done, and some intelligent guesswork should give a reasonable estimate of s². Suppose, for example, that a survey of children’s spending taken five years previously showed a standard deviation of 30 pence. It might be reasonable to
Figure 9.1  The desired width of the confidence interval
expect that the standard deviation of spending would be similar to the standard deviation of income, so 30 pence updated for inflation can be used as an estimate of the standard deviation. Suppose that five years’ inflation turns the 30 pence into 50 pence. Using s = 50 we obtain

n = (2.58² × 50²)/20² = 41.6

giving a required sample size of 42 (the sample size has to be an integer). This is a large (n ≥ 25) sample size, so the use of the Normal distribution was justified.
Is the firm willing to pay for such a large sample? Suppose it was willing to pay out £1000 in total for the survey, which costs £600 to set up and then £6 per person sampled. The total cost would be £600 + 42 × £6 = £852, which is within the firm’s budget. If the firm wished to spend less than this, it would have to accept a smaller sample size and thus a lower precision, or a lower level of confidence. For example, if only a 95% confidence level were required, the appropriate z score would be 1.96, yielding

n = (1.96² × 50²)/20² = 24.01

A sample size of 24 would only cost £600 + 24 × £6 = £744. At this sample size the assumption that X̄ follows a Normal distribution becomes less tenable, so the results should be treated with caution. Use of the t distribution is tricky, because the appropriate t value depends upon the number of degrees of freedom, which in turn depends on the sample size, which is what is being looked for!
The general formula for finding the required sample size is

n = zα² × s²/p²    (9.4)

where zα is the z score appropriate for the (100 − α)% confidence level and p is the desired accuracy (20 pence in this case).
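Equation (9.4) and the worked numbers above are easy to verify; a minimal sketch in Python (not from the book):

```python
import math

# Required sample size from equation (9.4): n = z^2 * s^2 / p^2, where p is
# the desired accuracy (the half-width of the confidence interval).

def required_n(z, s, p):
    return z ** 2 * s ** 2 / p ** 2

n99 = required_n(2.58, 50, 20)   # 99% confidence level
n95 = required_n(1.96, 50, 20)   # 95% confidence level
print(round(n99, 1), round(n95, 2))  # 41.6 24.01
print(math.ceil(n99))                # 42 respondents needed
```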
Collecting the sample
The sampling frame
We now move on to the fine detail of how to select the individual observations which make up the sample. In order to do this it is necessary to have some sort of sampling frame, i.e. a list of all the members of the population from which the sample is to be drawn. This can be a problem if the population is extremely large, for example the population of a country, since it is difficult to manipulate so much information (cutting up 50 million pieces of paper to put into a hat for a random draw is a tedious business). Alternatively, the list might not even exist or, if it does, might not be in one place convenient for consultation and use. In this case there is often an advantage to multistage sampling, for the selection of regions or even local authorities is fairly straightforward and not too time-consuming. Once at this lower level, the sampling frame is more manageable – each local authority has an electoral register, for example – and individual
observations can be relatively easily chosen. Thus it is not always necessary to
have a complete sampling frame for the entire population in one place.
Choosing from the sampling frame
There is a variety of methods available for selecting a sample of, say, 1000 observations from a sampling frame of 25 000 names, varying from the manual to the electronic. The oldest method is to cut up 25 000 pieces of paper, put them in a large hat, shake it (to randomise) and pick out 1000. This is fairly time-consuming, however, and has some pitfalls – if the pieces are not all cut to the same size, is the probability of selection the same? It is much better if the population in the sampling frame is numbered in some way, for then one only has to select random numbers. This can be done by using a table of random numbers (see Table A1 on page 412, for example) or a computer. The use of random number tables for such purposes is an important feature of statistics, and in 1955 the Rand Corporation produced a book entitled A Million Random Digits with 100,000 Normal Deviates. This book, as the title suggests, contained nothing but pages of random numbers, which allowed researchers to collect random samples. Interestingly, the authors did not bother fully to proofread the text, since a few random errors here and there wouldn’t matter! These numbers were calculated electronically, and nowadays every computer has a facility for rapidly choosing a set of random numbers. It is an interesting question how a computer, which follows rigid rules of behaviour, can select random numbers which, by definition, are unpredictable by any rule.
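The computerised version of the hat is a one-liner in most languages; a minimal Python sketch (an illustration, not from the book):

```python
import random

# Draw a simple random sample of 1000 names from a numbered sampling frame
# of 25,000 -- the electronic equivalent of pieces of paper in a hat.
frame = list(range(1, 25_001))        # stand-in for the numbered list of names

rng = random.Random(42)               # seeded only so the draw is reproducible
sample = rng.sample(frame, k=1000)    # sampling without replacement

print(len(sample), len(set(sample)))  # 1000 1000 -- no member drawn twice
```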
A further alternative, if a 1-in-25 sample is required, is to select a random starting point between 1 and 25 and then select every subsequent 25th observation (e.g. the 3rd, 28th, 53rd, etc.). This is a satisfactory procedure if the sampling frame is randomly sorted to start with, but otherwise there can be problems. For example, if the list is sorted by income (poorest first), a low starting value will almost certainly give an underestimate of the population mean. If all the numbers were randomly selected, this ‘error’ in the starting value would not be important.
Interviewing techniques
Good training of interviewers is vitally important to the results of a survey. It is very easy to lead an interviewee into a particular answer to a question. Consider the following two sets of questions:
A
1 Do you know how many people were killed by the atomic bomb at Hiroshima?
2 Do you think nuclear weapons should be banned?
B
1 Do you believe in nuclear deterrence?
2 Do you think nuclear weapons should be banned?
A2 is almost certain to get a higher ‘yes’ response than B2. Even a different ordering of the questions can have an effect upon the answers (consider asking A2 before A1). The construction of the questionnaire has to be done with care, therefore. The manner in which the questions are asked is also important, since
it can often suggest the answer. Good interviewers are trained to avoid these
problems by sticking precisely to the wording of the question and not suggesting
an expected answer.
Telephone surveys
An article by M. Collins in the Journal of the Royal Statistical Society reveals some
of the difﬁculties in conducting surveys by telephone. First, the sampling frame is
incomplete: although most people have a telephone, some are not listed in
the directory. In the late 1980s this was believed to be around 12% of all numbers,
but it has since grown to around 40%. Part of this trend, of course, may
be due to people growing fed up with being pestered by salespersons and ‘market
researchers’. Researchers have responded with ‘random digit dialling’, which is
presumably made easier by modern computerised equipment.
Matters are unlikely to improve for researchers in the future. The answering
machine is often used as a barrier to unwanted calls and many residential lines
connect to fax machines. Increasing deregulation and mobile phone use mean it
will probably become more and more difﬁcult to obtain a decent sampling frame
for a proper survey.
Source: M. Collins, ‘Sampling for UK telephone surveys’, Journal of the Royal Statistical Society, Series A, 1999, 162(1), 1–4.
Even when these procedures are adhered to, there can be various types of
response bias. The ﬁrst problem is non-response, due to the subject not being
at home when the interviewer calls. There might be a temptation to remove that
person from the sample and call on someone else, but this should be resisted.
There could well be important differences between those who are at home
all day and those who are not, especially if the survey concerns employment or
spending patterns, for example. Continued efforts should be made to contact
the subject. One should be wary of surveys which have low response rates,
particularly where it is suspected that the non-response is in some way systematic
and related to the goal of the survey.
A second problem is that subjects may not answer the question truthfully,
for one reason or another, sometimes inadvertently. An interesting example of
this occurred in the survey into sexual behaviour carried out in Britain in 1992
(see Nature, 3 December 1992). Among other things, this found the following:
● The average number of heterosexual partners during a woman’s lifetime is 3.4.
● The average number of heterosexual partners during a man’s lifetime is 9.9.
This may be in line with one’s beliefs about behaviour, but in fact the ﬁgures
must be wrong. The total number of partners of all women must, by deﬁnition,
equal the total number for all men. Since there are approximately equal numbers
of males and females in the UK, the averages must therefore be about the
same. So how do the above ﬁgures come about?
It is too much to believe that international trade holds the answer. It seems
unlikely that British men are so much more attractive to foreign women than
British women are to foreign men. Nor is an unrepresentative sample likely: it
was carefully chosen and quite large (around 20 000). The answer would appear
to be that some people are lying. Either women are being excessively modest
or, more likely, men are boasting. Perhaps the answer is to divide by three
whenever a man talks about his sexual exploits!
For an update on this story, see the article by J. Wadsworth et al., ‘What is
a mean? An examination of the inconsistency between men and women in
reporting sexual partnerships’, Journal of the Royal Statistical Society, Series A, 1996,
159(1), 111–123.
Case study: the UK Expenditure and Food Survey
Introduction
The Expenditure and Food Survey (EFS) is an example of a large government
survey which examines households’ expenditure patterns (with a particular focus
on food expenditures) and income receipts. It is worth having a brief look at
it, therefore, to see how the principles of sampling techniques outlined in this
chapter are put into practice. The EFS succeeded the Family Expenditure Survey
in 2001 and uses a similar design. The EFS is used for many different purposes,
including the calculation of weights to be used in the UK Retail Price Index
and the assessment of the effects of changes in taxes and state beneﬁts upon
different households.
Choosing the sample
The sample design is known as a three-stage rotating stratiﬁed random
sample. This is obviously quite complex, so it will be examined stage by stage.
Stage 1
The country is ﬁrst divided into around 150 strata, each stratum made up of a
number of local authorities sharing similar characteristics. The characteristics
used as stratiﬁcation factors are:
● geographic area
● urban or rural character, based on a measure of population density
● prosperity, based on a measure of property values.
A stratum might therefore be made up of local authorities in the South West
region, of medium population density and high prosperity.
In each quarter of the year, one local authority from each stratum is chosen
at random, the probability of selection being proportional to population. Once an
authority has been chosen it remains in the sample for one year (four quarters)
before being replaced. Only a quarter of the authorities in the sample are replaced
in any quarter, which gives the sample its ‘rotating’ characteristic. Each quarter
some authorities are discarded, some kept and some new ones brought in.
Stage 2
From each local authority selected, four wards (smaller administrative units) are
selected, one to be used in each of the four quarters for which the local authority
appears in the sample.
Stage 3
Finally, within each ward, 16 addresses are chosen at random, and these
constitute the sample.
The sampling frame
The Postcode Address File, a list of all postal delivery addresses, is used as the
sampling frame. Previously the register of electors in each ward was used, but it had
some drawbacks: it was under-representative of those who have no permanent
home or who move frequently (e.g. tramps, students, etc.). The fact that many
people took themselves off the register in the early 1990s, in order to avoid paying
the Community Charge, could also have affected the sample. The addresses
are chosen from the register by interval sampling from a random starting point.
About 12 000 addresses are targeted each year, but around 11% prove to be
business addresses, leaving approximately 11 000 households. The response rate
is about 60%, meaning that the actual sample consists of about 6500 households
each year. Given the complexity of the information gathered, this is a remarkably
good ﬁgure.
Collection of information
The data are collected by interview and by asking participants to keep a diary
in which they record everything they purchase over a two-week period. Highly
skilled interviewers are required to ensure accuracy and compliance with the
survey, and each participating family is visited several times. As a small inducement
to cooperate, each member of the family is paid a small sum of money
(£10); it is to be hoped that the anticipation of this does not distort their
expenditure patterns.
Sampling errors
Given the complicated survey design, it is difﬁcult to calculate sampling errors
exactly. The multistage design of the sample actually tends to increase the
sampling error relative to a simple random sample, but of course this is offset
by cost savings which allow a greatly increased sample size. Overall the results
of the survey are of good quality and can be veriﬁed by comparison with other
statistics, such as retail sales, for example.
Summary
● A primary data source is one where you obtain the data yourself or have
access to all the original observations.
● A secondary data source contains a summary of the original data usually in
the form of tables.
● When collecting data, always keep detailed notes of the sources of all
information, how it was collected, precise deﬁnitions of the variables, etc.
● Some data can be obtained electronically which saves having to type them
into a computer but the data still need to be checked for errors.
● There are various types of random sample, including simple, stratiﬁed and
clustered random samples. The methods are sometimes combined in
multistage samples.
● The type of sampling affects the size of the standard errors of the sample
statistics. The most precise sampling method is not necessarily the best if it
costs more to collect since the overall sample size that can be afforded will
be smaller.
● Quota sampling is a non-random method of sampling which has the
advantage of being extremely cheap. It is often used for opinion polls and surveys.
● The sampling frame is the list or lists from which the sample is drawn. If it
omits important elements of the population its use could lead to biased results.
● Careful interviewing techniques are needed to ensure reliable answers are
obtained from participants in a survey.
Key terms and concepts

cluster sampling
ﬁnite population correction
multistage sampling
online data sources
primary and secondary data
quota sampling
random sample
sampling frame
sampling methods
simple random sampling
spreadsheet
stratiﬁed sampling

References

C. A. Moser and G. Kalton, Survey Methods in Social Investigations, 1971, Heinemann.
Rand Corporation, A Million Random Digits with 100 000 Normal Deviates, 1955, The Glencoe Press.
Problems

Some of the more challenging problems are indicated by highlighting the problem
number in colour.

9.1 What issues of deﬁnition arise in trying to measure ‘output’?

9.2 What issues of deﬁnition arise in trying to measure ‘unemployment’?
9.3 Find the gross domestic product for both the UK and the US for the period 1995–2003.
Obtain both series in constant prices.
9.4 Find ﬁgures for the monetary aggregate M0 for the years 1995–2003 in the UK in nominal
terms.
9.5 A ﬁrm wishes to know the average weekly expenditure on food by households to within £2,
with 95% conﬁdence. If the variance of food expenditure is thought to be about 400, what
sample size does the ﬁrm need to achieve its aim?
9.6 A ﬁrm has £10 000 to spend on a survey. It wishes to know the average expenditure on gas
by businesses to within £30, with 99% conﬁdence. The variance of expenditure is believed
to be about 40 000. The survey costs £7000 to set up and then £15 to survey each ﬁrm. Can
the ﬁrm achieve its aim with the budget available?
9.7 (Project) Visit your college library or online sources to collect data to answer the following
question: has women’s remuneration risen relative to men’s over the past 10 years?
You should write a short report on your ﬁndings. This should include a section describing
the data collection process, including any problems encountered and decisions you had
to make. Compare your results with those of other students. It might be interesting to
compare your experiences of using online and ofﬂine sources of data.
9.8 (Project) Do a survey to ﬁnd the average age of cars parked on your college campus.
A letter or digit denoting the registration year can be found on the number plate; precise
details can be obtained in various guides to used-car prices. You might need stratiﬁed
sampling (e.g. if administrators have newer cars than faculty and students). You could
extend the analysis by comparing the results with a public car park. You should write a
brief report outlining your survey methods and the results you obtain. If several students
do such a survey, you could compare results.
10 Index numbers
Contents
Learning outcomes 343
Introduction 343
A simple index number 344
A price index with more than one commodity 345
Using base-year weights: the Laspeyres index 346
Using current-year weights: the Paasche index 349
Units of measurement 351
Using expenditures as weights 353
Comparison of the Laspeyres and Paasche indices 354
The story so far – a brief summary 355
Quantity and expenditure indices 355
The Laspeyres quantity index 355
The Paasche quantity index 356
Expenditure indices 356
Relationships between price, quantity and expenditure indices 357
Chain indices 359
The Retail Price Index 360
Discounting and present values 362
An alternative investment criterion: the internal rate of return 364
Nominal and real interest rates 365
Inequality indices 366
The Lorenz curve 367
The Gini coefﬁcient 370
Is inequality increasing 371
A simpler formula for the Gini coefﬁcient 372
Concentration ratios 374
Summary 376
Key terms and concepts 376
References 376
Problems 377
Answers to exercises 382
Appendix: Deriving the expenditure share form of the Laspeyres
price index 385
Learning outcomes

By the end of this chapter you should be able to:
● represent a set of data in index number form
● understand the role of index numbers in summarising or presenting data
● recognise the relationship between price, quantity and expenditure index
numbers
● turn a series measured at current prices into one at constant prices or in
volume terms
● splice separate index number series together
● measure inequality using index numbers.
Introduction

‘Consumer price index up 3.8%. Retail price index up 4.6%.’ (UK, June 2008)
‘Vietnam reports an inﬂation rate of 27.04%’ (July 2008)
‘Zimbabwe inﬂation at 2 200 000%’ (July 2008)
The above headlines reveal startling differences between the inﬂation rates of
three different countries. This chapter is concerned with how such measures
are constructed and then interpreted. Index numbers are not restricted to
measuring inﬂation, though that is one of the most common uses. There are
also indexes of national output, of political support, of corruption in different
countries of the world and even of happiness (Danes are the happiest, it seems).
An index number is a descriptive statistic, in the same sense as the mean or
standard deviation, which summarises a mass of information into some readily
understood statistic. As such, it shares the advantages and disadvantages of
other summary statistics: it provides a useful overview of the data but misses out
the ﬁner detail. The retail price index (RPI) referred to above is one example,
which summarises information about the prices of different goods and services,
aggregating them into a single number. We have used index numbers earlier
in the book (e.g. in the chapters on regression) without fully explaining their
derivation or use. This will now be remedied.
Index numbers are most commonly used for following trends in data over
time, such as the RPI measuring the price level or the index of industrial
production (IIP) measuring the output of industry. The RPI also allows calculation
of the rate of inﬂation, which is simply the rate of change of the price index,
and from the IIP it is easy to measure the rate of growth of output. Index
numbers are also used with cross-section data; for example, an index of regional
house prices would summarise information about the different levels of house
prices in different regions of the country at a particular point in time. There
are many other examples of index numbers in use, common ones being the
Complete your diagnostic test for Chapter 10 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL, with
additional supporting resources.
Financial Times All Share index, the trade-weighted exchange rate index and
the index of the value of retail sales.
This chapter will explain how index numbers are constructed from original
data and the problems which arise in doing this. There is also a brief discussion
of the RPI, to illustrate some of these problems and to show how they are
resolved in practice. Finally, a different set of index numbers is examined, which
are used to measure inequality, such as inequality in the distribution of income
or in the market shares held by different ﬁrms competing in a market.
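Since the rate of inflation is simply the rate of change of the price index, it can be computed in one line. A minimal Python sketch (the index values here are invented for the example, not actual RPI figures):

```python
# Hypothetical price index values for three successive years
rpi = {2006: 100.0, 2007: 104.3, 2008: 109.1}

def inflation(index, year):
    """Annual inflation rate: the percentage change in the index
    between year - 1 and year."""
    return round((index[year] / index[year - 1] - 1) * 100, 1)

print(inflation(rpi, 2008))   # 4.6
```

The same calculation applied to an output index such as the IIP gives the growth rate of output.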
A simple index number
We begin with the simplest case where we wish to construct an index number
series for a single commodity. In this case we shall construct an index number
series representing the price of coal. This is a series of numbers showing in each
year the price of coal and how it changes over time. More precisely we measure
the cost of coal to industrial users for the years 2002–2006. Later in the chapter
we will expand the analysis to include other fuels and thereby construct an index
of the price of energy as a whole. The raw data for coal are given in Table 10.1
adapted from the Digest of UK Energy Statistics available on the internet.
We assume that the product itself has not changed from year to year so that the
index provides a fair representation of costs. This means for example that the
quality of coal has not changed during the period.
To construct a price index from these data we choose one year as the refer-
ence year 2002 in this case and set the price index in that year equal to 100.
The prices in the other years are then measured relative to the reference year
ﬁgure of 100. The index and its construction are presented in Table 10.2.
All we have done so far is to change the form in which the information is
presented. We have perhaps gained some degree of clarity (for example, it is easy
to see that the price in 2006 is 18% higher than in 2002), but we have lost the
original information about the actual level of prices. Since it is usually relative
prices that are of interest, this loss of information about the actual price level
is not too serious, and information about relative prices is retained by the price
Table 10.1 The price of coal, 2002–2006

2002 2003 2004 2005 2006
Price (£/tonne) 36.97 34.03 37.88 44.57 43.63

Table 10.2 The price index for coal, 2002 = 100

Year Price Index
2002 36.97 100.0 (= 36.97/36.97 × 100)
2003 34.03 92.0 (= 34.03/36.97 × 100)
2004 37.88 102.5 (= 37.88/36.97 × 100)
2005 44.57 120.6 (etc.)
2006 43.63 118.0
STFE_C10.qxd 26/02/2009 09:16 Page 344

slide 362:

A price index with more than one commodity
345
Table 10.3 The price index for coal, 2004 = 100

Year Price Index
2002 36.97 97.6 (= 36.97/37.88 × 100)
2003 34.03 89.8 (= 34.03/37.88 × 100)
2004 37.88 100.0 (= 37.88/37.88 × 100)
2005 44.57 117.7 (etc.)
2006 43.63 115.2
index. For example, using either the index or actual prices, we can see that the
price of coal was 8% lower in 2003 than in 2002.
In terms of a formula, we have calculated

P_t = (price of coal in year t / price of coal in 2002) × 100

where P_t represents the value of the index in year t.
The choice of reference year is arbitrary and we can easily change it to a
different year. If we choose 2004 to be the reference year, then we set the price
in that year equal to 100 and again measure all other prices relative to it. This is
shown in Table 10.3, which can be derived from Table 10.2 or directly from the
original data on prices. You should choose whichever reference year is most
convenient for your purposes. Whichever year is chosen, the informational content
is the same.
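The calculations in Tables 10.2 and 10.3, including the change of reference year, are easily automated. A short Python sketch using the coal prices of Table 10.1:

```python
# Coal prices, pounds/tonne (Table 10.1)
coal_price = {2002: 36.97, 2003: 34.03, 2004: 37.88,
              2005: 44.57, 2006: 43.63}

def price_index(prices, ref_year):
    """Express each price relative to the reference year (= 100)."""
    base = prices[ref_year]
    return {year: round(p / base * 100, 1) for year, p in prices.items()}

print(price_index(coal_price, 2002))  # 2003 -> 92.0, 2006 -> 118.0 (Table 10.2)
print(price_index(coal_price, 2004))  # 2002 -> 97.6, 2006 -> 115.2 (Table 10.3)
```

Rebasing is just a second call with a different reference year; the ratio between any two years is unchanged, which is the point of the exercise below.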
Exercise 10.1

Average house prices in the UK for 2000–2004 were:

Year 2000 2001 2002 2003 2004
Price (£) 86 095 96 337 121 137 140 687 161 940

(a) Turn this into an index with a reference year of 2000.
(b) Recalculate the index with reference year 2003.
(c) Check that the ratio of house prices in 2004 relative to 2000 is the same for both
indexes.
A price index with more than one commodity
Constructing an index for a single commodity is a simple process, but only of
limited use, mainly in terms of presentation. Once there is more than a single
commodity, index numbers become more useful but are more difﬁcult to
calculate. Industry uses other sources of energy as well as coal, such as gas, petroleum
and electricity, and managers might wish to know the overall price of energy,
which affects their costs. This is a more common requirement in reality than
the simple index number series calculated above. If the price of each fuel
were rising at the same rate, say at 5% per year, then it is straightforward to say
STFE_C10.qxd 26/02/2009 09:16 Page 345

slide 363:

Chapter 10 • Index numbers
346
that the price of energy is also rising at 5% per year. But supposing, as is likely,
that the prices are all rising at different rates, as shown in Table 10.4. Is it now
possible to say how fast the price of energy is increasing? Several different prices
now have to be combined in order to construct an index number, a more
complex process than the simple index number calculated above.
From the data presented in Table 10.4 we can calculate that the price of coal
has risen by 18% over the ﬁve-year period, petrol has risen by 97%, electricity
by 85% and gas by 131%. It is fairly clear that prices are rising rapidly, but how do
we measure this precisely?
Using base-year weights: the Laspeyres index
We tackle the problem by taking a weighted average of the price changes of the
individual fuels, the weights being derived from the quantities of each fuel used
by the industry. Thus, if industry uses relatively more coal than petrol, more
weight is given to the rise in the price of coal in the calculation.
We put this principle into effect by constructing a hypothetical ‘shopping
basket’ of the fuels used by industry and measuring how the cost of this basket
has risen (or fallen) over time. Table 10.5 gives the quantities of each fuel
consumed by industry in 2002 (again from the Digest of UK Energy Statistics), and
it is this which forms the shopping basket. 2002 is referred to as the base year,
since it is the quantities consumed in this year which are used to make up the
shopping basket.
The cost of the basket in 2002 prices therefore works out as shown in
Table 10.6, using information from Tables 10.4 and 10.5.
The ﬁnal column of the table shows the expenditure on each of the four energy
inputs, and the total cost of the basket is 8581.01 (this is in £m, so altogether
about £8.58bn was spent on energy by industry). This sum may be written as

∑i p_0i q_0i = 8581.01

where the summation is calculated over all four fuels. Here p refers to prices,
q to quantities. The ﬁrst subscript (0) refers to the year, the second (i) to each
Table 10.4 Fuel prices to industry, 2002–2006

Year Coal (£/tonne) Petroleum (£/tonne) Electricity (£/MWh) Gas (£/therm)
2002 36.97 132.24 29.83 0.780
2003 34.03 152.53 28.68 0.809
2004 37.88 153.71 31.26 0.961
2005 44.57 204.28 42.37 1.387
2006 43.63 260.47 55.07 1.804

Table 10.5 Quantities of fuel used by industry, 2002

Coal (m tonnes) 1.81
Petroleum (m tonnes) 5.70
Electricity (m MWh) 112.65
Gas (m therms) 5641
STFE_C10.qxd 26/02/2009 09:16 Page 346

slide 364:

A price index with more than one commodity
347
energy source in turn. We refer to 2002 as year 0, 2003 as year 1, etc., for brevity
of notation. Thus, for example, p_01 means the price of coal in 2002, q_12 the
consumption of petroleum by industry in 2003.
We now need to ﬁnd what the 2002 basket of energy would cost in each of
the subsequent years, using the prices pertaining to those years. For example,
for 2003 we value the 2002 basket using the 2003 prices. This is shown in
Table 10.7 and yields a cost of £8725.39m.
Firms would therefore have to spend an extra £144m (8725 − 8581) in 2003
to buy the same quantities of energy as in 2002. This amounts to an additional
1.7% over the expenditure in 2002. The sum of £8725m may be expressed
as ∑p_1i q_0i, since it is obtained by multiplying the prices in year 1 (2003) by
quantities in year 0 (2002).
Similar calculations for subsequent years produce the costs of the 2002 basket
as shown in Table 10.8.
It can be seen that if ﬁrms had purchased the same quantities of each energy
source in the following years, they would have had to pay more in each
subsequent year up to 2006.
To obtain the energy price index from these numbers, we measure the cost
of the basket in each year relative to its 2002 cost, i.e. we divide the cost of the
basket in each successive year by ∑p_0i q_0i and multiply by 100.
Table 10.6 Cost of the energy basket, 2002

Price Quantity Price × quantity
Coal (£/tonne) 36.97 1.81 66.916
Petroleum (£/tonne) 132.24 5.70 753.768
Electricity (£/MWh) 29.83 112.65 3360.350
Gas (£/million therms) 0.780 5641 4399.980
Total 8581.013
Table 10.7 The cost of the 2002 energy basket at 2003 prices

2003 Price 2002 Quantity Price × quantity
Coal (£/tonne) 34.03 1.81 61.594
Petroleum (£/tonne) 152.53 5.70 869.421
Electricity (£/MWh) 28.68 112.65 3230.802
Gas (£/million therms) 0.809 5641 4563.569
Total 8725.386
Table 10.8 The cost of the energy basket, 2002–2006

Year Formula Cost
2002 ∑p_0 q_0 8581.01
2003 ∑p_1 q_0 8725.39
2004 ∑p_2 q_0 9887.15
2005 ∑p_3 q_0 13 842.12
2006 ∑p_4 q_0 17 943.65

Note: For brevity we have dropped the i subscript in the formula.
STFE_C10.qxd 26/02/2009 09:16 Page 347

slide 365:

Chapter 10 • Index numbers
348
This index is given in Table 10.9 and is called the Laspeyres price index, after
its inventor. We say that it uses base-year weights, i.e. quantities in the base
year (2002) form the weights in the basket.
We have set the value of the index to 100 in 2002, i.e. the reference year and
the base year coincide, though this is not essential.
The Laspeyres index for year n, with the base year as year 0, is given by the
following formula:

P_n^L = (∑p_ni q_0i / ∑p_0i q_0i) × 100    (10.1)

Henceforth we shall omit the i subscript on prices and quantities in the formulae
for index numbers, for brevity. The index shows that energy prices increased
by 109.11% over the period – a rapid rate of increase. The rise amounts to an
average increase of 20.25% p.a. in the cost of energy. During the same period
prices in general rose by 12.5%, or 3.0% p.a., so in relative terms energy became
markedly more expensive.
The choice of 2002 as the base year for the index was an arbitrary one;
any year will do. If we choose 2003 as the base year, then the cost of the 2003
basket is evaluated in each year (including 2002), and this will result in a slightly
different Laspeyres index. The calculations are in Table 10.10. The ﬁnal two
columns of the table compare the Laspeyres index constructed using the 2003
and 2002 baskets respectively, the former adjusted to 2002 = 100. A very small
Table 10.9 The Laspeyres price index

Year Formula Index
2002 (∑p_0 q_0 / ∑p_0 q_0) × 100 100.00
2003 (∑p_1 q_0 / ∑p_0 q_0) × 100 101.68 (= 8725.39/8581.01 × 100)
2004 (∑p_2 q_0 / ∑p_0 q_0) × 100 115.22 (= 9887.15/8581.01 × 100)
2005 (∑p_3 q_0 / ∑p_0 q_0) × 100 161.31 (etc.)
2006 (∑p_4 q_0 / ∑p_0 q_0) × 100 209.11
Table 10.10 The Laspeyres price index using the 2003 basket

Year Cost of 2003 basket Index (2003 = 100) Index (2002 = 100) Index using 2002 basket (2002 = 100)
2002 8707.50 98.24 100 100
2003 8863.52 100 101.79 101.68
2004 10 033.45 113.20 115.23 115.22
2005 14 040.80 158.41 161.25 161.31
2006 18 198.34 205.32 209.00 209.11
STFE_C10.qxd 26/02/2009 09:16 Page 348

slide 366:

A price index with more than one commodity
349
difference can be seen, which is due to the fact that consumption patterns were
very similar in 2002 and 2003. It would not be uncommon to get a larger
difference between the series than in this instance.
The Laspeyres price index shows the increase in the price of energy for the
‘average’ ﬁrm, i.e. one which consumes energy in the same proportions as the
2002 basket overall. There are probably very few such ﬁrms: most would use
perhaps only one or two energy sources. Individual ﬁrms may therefore experience
price rises quite different from those shown here. For example, a ﬁrm
depending upon electricity alone would face an 85% price increase over the four years,
signiﬁcantly different from the ﬁgure of 109% suggested by the Laspeyres index.
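The Laspeyres calculation of equation (10.1) can be reproduced directly from the data in Tables 10.4 and 10.5. A minimal Python sketch (fuel order in each list: coal, petroleum, electricity, gas):

```python
# Prices (Table 10.4) and base-year quantities (Table 10.5)
prices = {
    2002: [36.97, 132.24, 29.83, 0.780],
    2003: [34.03, 152.53, 28.68, 0.809],
    2004: [37.88, 153.71, 31.26, 0.961],
    2005: [44.57, 204.28, 42.37, 1.387],
    2006: [43.63, 260.47, 55.07, 1.804],
}
base_q = [1.81, 5.70, 112.65, 5641]        # the 2002 basket

def basket_cost(p, q):
    """Cost of basket q at prices p: sum of price * quantity."""
    return sum(pi * qi for pi, qi in zip(p, q))

def laspeyres(prices, base_q, base_year):
    """Equation (10.1): sum(p_n * q_0) / sum(p_0 * q_0) * 100."""
    base_cost = basket_cost(prices[base_year], base_q)
    return {yr: round(basket_cost(p, base_q) / base_cost * 100, 2)
            for yr, p in prices.items()}

print(laspeyres(prices, base_q, 2002))
# {2002: 100.0, 2003: 101.68, 2004: 115.22, 2005: 161.31, 2006: 209.11}
```

The printed values reproduce the index column of Table 10.9; changing `base_q` to the 2003 quantities reproduces the alternative series of Table 10.10.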
Exercise 10.2

(a) The prices of fuels used by industry, 1999–2003, were:

Year Coal (£/tonne) Petroleum (£/tonne) Electricity (£/MWh) Gas (£/therm)
1999 34.77 104.93 36.23 0.546
2000 35.12 137.90 34.69 0.606
2001 38.07 148.10 31.35 0.816
2002 34.56 150.16 29.83 0.780
2003 34.50 140.00 28.44 0.807

and the quantities consumed by industry were:

Coal (m tonnes) Petroleum (m tonnes) Electricity (m MWh) Gas (m therms)
1999 2.04 5.33 110.98 6039

Calculate the Laspeyres price index of energy based on these data, using 1999
as the reference year.

(b) Recalculate the index making 2001 the reference year.

(c) The quantities consumed in 2000 were:

Coal (m tonnes) Petroleum (m tonnes) Electricity (m MWh) Gas (m therms)
2000 0.72 5.52 114.11 6265

Calculate the Laspeyres index using this basket and compare to the answer to
part (a).
Using current-year weights: the Paasche index

Firms do not, of course, consume the same basket of energy every year. One
would expect them to respond to changes in the relative prices of fuels and to
other factors. Technological progress means that the efﬁciency with which the
fuels can be used changes, causing ﬂuctuations in demand. Table 10.11 shows
the quantities consumed in the years after 2002 and indicates that ﬁrms did
indeed alter their pattern of consumption.
Each of these annual patterns of consumption could be used as the ‘shopping
basket’ for the purpose of constructing the Laspeyres index, and each would
give a slightly different price index, as we saw with the usage of the 2002 and
2003 baskets. One cannot say that one of these is more correct than the others.
One further problem is that whichever basket is chosen remains the same
over time and eventually becomes unrepresentative of the current pattern of
consumption.
The Paasche index (denoted P_n^P, to distinguish it from the Laspeyres index)
overcomes these problems by using current-year weights to construct the index;
in other words, the basket is continually changing. Suppose 2002 is to be the
reference year, so P_0^P = 100. To construct the Paasche index for 2003 we use
the 2003 weights (or basket); for the 2004 value of the index we use the 2004
weights, and so on. An example will clarify matters.
The Paasche index for 2003 will be the cost of the 2003 basket at 2003 prices
relative to its cost at 2002 prices, i.e.

P_1^P = (∑p_1 q_1 / ∑p_0 q_1) × 100

P_1^P = (8863.52 / 8707.50) × 100 = 101.79

The general formula for the Paasche index in year n is given in equation (10.2):

P_n^P = (∑p_n q_n / ∑p_0 q_n) × 100    (10.2)

Table 10.12 shows the calculation of this index for the later years.
The Paasche formula gives a slightly different result than does the Laspeyres,
as is usually the case. The Paasche should generally give a slower rate of increase
than does the Laspeyres index. This is because one would expect proﬁt-maximising
ﬁrms to respond to changing relative prices by switching their
consumption in the direction of the inputs which are becoming relatively cheaper.
The Paasche index, by using the current weights, captures this change, but the
Laspeyres, assuming ﬁxed weights, does not. This may happen slowly, as it takes
time for ﬁrms to switch to different fuels, even if technically possible. This is
why the Paasche can increase faster than the Laspeyres in some years (e.g. 2003),
although in the long run it should increase more slowly.
Table 10.11 Quantities of energy used, 2003–2006

Year Coal (m tonnes) Petroleum (m tonnes) Electricity (m MWh) Gas (m therms)
2003 1.86 6.27 113.36 5677
2004 1.85 6.45 115.84 5258
2005 1.79 6.57 118.52 5226
2006 1.71 6.55 116.31 4910
Table 10.12 The Paasche price index
Cost of basket at current prices Cost at 2002 prices Index
2002 8581.01 8581.01 100
2003 8863.52 8707.50 101.79
2004 9735.60 8478.09 114.83
2005 13 692.05 8546.72 160.20
2006 17 043.52 8228.72 207.12
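These calculations are easy to check with a few lines of code. The sketch below (Python; an added check, not part of the original text) reproduces the 2003 Laspeyres and Paasche price indices from the chapter's energy data, with the 2003 prices taken from the expenditure-share worked example later in the chapter:

```python
# Laspeyres and Paasche price indices for 2003 (2002 = reference year).
p0 = [36.97, 132.24, 29.83, 0.78]    # 2002 prices: coal, petroleum, electricity, gas
q0 = [1.81, 5.70, 112.65, 5641]      # 2002 quantities
p1 = [34.03, 152.53, 28.68, 0.809]   # 2003 prices
q1 = [1.86, 6.27, 113.36, 5677]      # 2003 quantities

def basket_cost(prices, quantities):
    """Cost of a basket of quantities valued at the given prices."""
    return sum(p * q for p, q in zip(prices, quantities))

# Laspeyres: base-year (2002) basket, current vs base prices
laspeyres = basket_cost(p1, q0) / basket_cost(p0, q0) * 100
# Paasche: current-year (2003) basket, current vs base prices
paasche = basket_cost(p1, q1) / basket_cost(p0, q1) * 100

print(round(laspeyres, 2))  # 101.68
print(round(paasche, 2))    # 101.79
```

The Paasche figure matches the 101.79 derived above, and the Laspeyres figure matches Table 10.9.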
STFE_C10.qxd 26/02/2009 09:16 Page 350

Is one of the indices more 'correct' than the other? The answer is that neither
is definitively correct. It can be shown that the 'true' value lies somewhere
between the two, but it is difficult to say exactly where. If all the items which
make up the index increased in price at the same rate then the Laspeyres and
Paasche indices would give the same answer, so it is the change in relative prices,
and the resultant change in consumption patterns, which causes problems.
Units of measurement
It is important that the units of measurement in the price and quantity tables
be consistent. Note that in the example the price of coal was measured in
£/tonne and the consumption was measured in millions of tonnes. The other
fuels were similarly treated (in the case of electricity, one MWh equals one
million watt-hours). But suppose we had measured electricity consumption in
kWh instead of MWh (1 MWh = 1000 kWh) but still measured its price in £ per
MWh? We would then have 2002 data of 29.83 for price, as before, but 112 650
for quantity. It is as if electricity consumption had been boosted 1000-fold, and
this would seriously distort the results. The Laspeyres energy price index would
be, by a similar calculation to the one above:

2002    2003    2004    2005    2006
100     96.2    104.8   142.09  184.68

This is incorrect and shows a much lower value than the correct Laspeyres
index, because electricity is now given too much weight in the calculation and
electricity prices were rising less rapidly than others.
The Human Development Index
One of the more interesting indices to appear in recent years is the Human
Development Index (HDI) produced by the United Nations Development
Programme (UNDP). The HDI aims to provide a more comprehensive socioeconomic
measure of a country's progress than GDP (national output). Output is a measure
of how well-off we are in material terms, but makes no allowance for the quality
of life and other factors.
The HDI combines a measure of well-being (GDP per capita) with longevity (life
expectancy) and knowledge (based on literacy and years of schooling). As a result
each country obtains a score, from 0 (poor) to 1 (good). Some selected values are
given in the following table.
Country      HDI 1970   HDI 1980   HDI 1992   Rank (HDI 92)   Rank (GDP)
Canada       0.887      0.911      0.932      1               11
UK           0.873      0.892      0.919      10              19
Hong Kong    0.737      0.830      0.875      24              22
Gabon        0.378      0.468      0.525      114             42
Senegal      0.176      0.233      0.322      143             114
One can see that there is an association between the HDI and GDP, but not a
perfect one. Canada has the world's 11th highest GDP per capita but comes top of
the HDI rankings. In contrast, Gabon, some way up the GDP rankings, is much
lower when the HDI is calculated.
So how is the HDI calculated from the initial data? How can we combine
life expectancy (which can stretch from 0 to 80 years or more) with literacy (the
proportion of the population who can read and write)? The answer is to score all
of the variables on a scale from 0 to 1.
The HDI sets a range for national average life expectancy between 25 and
85 years. A country with a life expectancy of 52.9 (the case of Gabon) therefore
scores 0.465, i.e. 52.9 is 46.5% of the way between 25 and 85.
Adult literacy can vary between 0 and 100% of the population, so needs no
adjustment. Gabon's figure is 0.625. The scale used for years of schooling is 0 to
15, so Gabon's very low average of 2.6 yields a score of 0.173. Literacy and
schooling are then combined in a weighted average, with a 2/3 weight on literacy,
to give a score for knowledge of 2/3 × 0.625 + 1/3 × 0.173 = 0.473.
For income, Gabon's average of 3498 is compared to the global average of
5185 to give a score of 0.636. (Incomes above 5185 are manipulated to avoid
scores above 1.)
A simple average of 0.465, 0.473 and 0.636 then gives Gabon's final figure of
0.525. One can see that its income score is brought down by the poorer scores
in the two other categories, resulting in a poorer HDI ranking.
The construction of this index number shows how disparate information can be
brought together into a single index number for comparative purposes. Further
work by UNDP adjusts the HDI on the basis of gender and reveals the stark result
that no country treats its women as well as it does its men.
Adapted from: Human Development Report 1994 and other years. More on the HDI can be found at
http://www.undp.org/
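The scoring steps above can be sketched in a few lines (Python; an illustrative reconstruction, with the income score of 0.636 taken as given from the text since the UNDP's income adjustment formula is not reproduced here):

```python
# Sketch of the HDI component scoring described above, using Gabon's figures.
# The scaling ranges (25-85 years of life expectancy, 0-15 years of schooling)
# are those quoted in the text.

def scale(value, lo, hi):
    """Score a value as its proportion of the distance between lo and hi."""
    return (value - lo) / (hi - lo)

life = scale(52.9, 25, 85)        # longevity score: 0.465
literacy = 0.625                  # adult literacy (already a 0-1 proportion)
schooling = scale(2.6, 0, 15)     # years-of-schooling score: about 0.173
# Knowledge: weighted average, 2/3 on literacy (about 0.474 here; the text
# rounds the schooling score first, giving 0.473 - the final HDI is 0.525 either way)
knowledge = 2/3 * literacy + 1/3 * schooling
income = 0.636                    # income score, after the UNDP adjustment

hdi = (life + knowledge + income) / 3
print(round(life, 3), round(hdi, 3))  # 0.465 0.525
```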
It is possible to make some manipulations of the units of measurement,
usually to make calculation easier, as long as all items are treated alike. If, for
example, all prices were measured in pence rather than pounds (so all prices
in Table 10.4 were multiplied by 100) then this would have no effect on the
resultant index, as you would expect. Similarly, if all quantity figures were
measured in thousands of tonnes, thousands of therms and thousands of MWh,
there would be no effect on the index, even if prices remained in £/tonne etc.
But if electricity were measured in pence per MWh while all other fuels were
in £/tonne, a wrong answer would again be obtained. Quantities consumed
should also be measured over the same time period, for example millions of
therms per annum. It does not matter what the time period is (days, weeks,
months or years) as long as all the items are treated similarly.
Exercise 10.3
The quantities of energy used in subsequent years were:

        Coal        Petroleum   Electricity   Gas
        (m tonnes)  (m tonnes)  (m MWh)       (m therms)
2001    1.69        6.60        111.34        6142
2002    1.10        5.81        112.37        5650
2003    0.69        6.69        113.93        5880

Calculate the Paasche index for 1999–2003 with 1999 as reference year. Compare
this to the Laspeyres index result.
Using expenditures as weights
On occasion the quantities of each commodity consumed are not available,
but expenditures are, and a price index can still be constructed using slightly
modified formulae. It is often easier to find the expenditure on a good than to
know the actual quantity consumed (think of housing as an example). We shall
illustrate the method with a simplified example, using the data on energy prices
and consumption for the years 2002 and 2003 only. The data are repeated in
Table 10.13.
The data for consumption are assumed to be no longer available, only
the expenditure on each energy source as a percentage of total expenditure.
Expenditure is derived as the product of price and quantity consumed.
The formula for the Laspeyres index can be easily manipulated to accord with
the data as presented in Table 10.13.
The Laspeyres index formula based on expenditure shares is given in
equation (10.3):¹

P_n^L = ∑(p_n/p_0 × s_0) × 100    (10.3)

Equation (10.3) is made up of two component parts. The first, p_n/p_0, is simply
the price in year n relative to the base-year price for each energy source. The
second component, s_0 = p_0 q_0 / ∑p_0 q_0, is the share or proportion of total
expenditure spent on each energy source in the base year, the data for which are in
Table 10.13. It should be easy to see that the sum of the s_0 values is 1, so that
equation (10.3) calculates a weighted average of the individual price increases,
the weights being the expenditure shares.
The calculation of the Laspeyres index for 2003, using 2002 as the base year, is
therefore

P_n^L = 34.03/36.97 × 0.0078 + 152.53/132.24 × 0.0878 + 28.68/29.83 × 0.3916 + 0.809/0.780 × 0.5128
     = 1.0168

giving the value of the index as 101.68, the same value as derived earlier using
the more usual methods. Values of the index for subsequent years are calculated
Table 10.13 Expenditure shares, 2002

                      Prices    Quantities   Expenditure   Share (%)
Coal (£/tonne)        36.97     1.81         66.92         0.8
Petroleum (£/tonne)   132.24    5.70         753.77        8.8
Electricity (£/MWh)   29.83     112.65       3360.35       39.2
Gas (£/therm)         0.78      5641         4399.98       51.3
Total                                        8581.01       100.0

Note: The 0.8% share of coal is calculated as 66.92/8581.01 × 100; the others are
calculated similarly.
¹ See the Appendix to this chapter (page 385) for the derivation of this formula.
by appropriate application of equation 10.3 above. This is left as an exercise for
the reader who may use Table 10.9 to verify the answers.
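As a check on equation (10.3), the short sketch below (Python; an added illustration, not part of the original text) recomputes the 2003 index from base-year expenditure shares, using the prices and quantities behind Table 10.13:

```python
# Laspeyres price index for 2003 from expenditure shares (equation 10.3).
p0 = {'coal': 36.97, 'petroleum': 132.24, 'electricity': 29.83, 'gas': 0.78}
q0 = {'coal': 1.81, 'petroleum': 5.70, 'electricity': 112.65, 'gas': 5641}
p1 = {'coal': 34.03, 'petroleum': 152.53, 'electricity': 28.68, 'gas': 0.809}

total = sum(p0[k] * q0[k] for k in p0)            # total 2002 expenditure: 8581.01
shares = {k: p0[k] * q0[k] / total for k in p0}   # base-year expenditure shares s_0

# Weighted average of the price relatives p_1/p_0, weights s_0
index = sum(p1[k] / p0[k] * shares[k] for k in p0) * 100
print(round(index, 2))  # 101.68
```

Because the shares are used unrounded, the result agrees exactly with the Laspeyres index computed from the full price and quantity data.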
The Paasche index may similarly be calculated from data on prices and
expenditure shares, as long as these are available for each year for which the
index is required. The formula for the Paasche index is

P_n^P = 1 / ∑(p_0/p_n × s_n) × 100    (10.4)

The calculation of the Paasche index is also left as an exercise.
Comparison of the Laspeyres and Paasche indices
The advantages of the Laspeyres index are that it is easy to calculate and that it
has a fairly clear intuitive meaning, i.e. the cost each year of a particular basket
of goods. The Paasche index involves more computation and it is less easy to
envisage what it refers to. As an example of this point, consider the following
simple case. The Laspeyres index values for 2004 and 2005 are 115.22 and
161.31. The ratio of these two numbers, 1.40, would suggest that prices rose by
40% between these years. What does this figure actually represent? The 2005
Laspeyres index has been divided by the same index for 2004, i.e.

P_3^L / P_2^L = (∑p_3 q_0 / ∑p_0 q_0 × 100) / (∑p_2 q_0 / ∑p_0 q_0 × 100) = ∑p_3 q_0 / ∑p_2 q_0

which is the ratio of the cost of the 2002 basket at 2005 prices to its cost at 2004
prices. This makes some intuitive sense. Note that it is not the same as the
Laspeyres index for 2005 with 2004 as base year, which would require using q_2
in the calculation.
If the same is done with the Paasche index numbers, a rise of 39.5% is
obtained between 2004 and 2005, virtually the same result. But the meaning
of this is not so clear, since the relevant formula is

P_3^P / P_2^P = (∑p_3 q_3 / ∑p_0 q_3) / (∑p_2 q_2 / ∑p_0 q_2)

which does not simplify further. This is a curious mixture of 2004 and 2005
quantities, and 2002, 2004 and 2005 prices!
The major advantage of the Paasche index, however, is that the weights are
continuously updated, so that the basket of goods never becomes out of date. In
the case of the Laspeyres index the basket remains unchanged over a period,
becoming less and less representative of what is being bought by consumers.
When revision is finally made there may therefore be a large change in the
weighting scheme. The extra complexity of calculation involved in the Paasche
index is less important now that computers do most of the work.
Exercise 10.4
(a) Calculate the share of expenditure going to each of the four fuel types in the
previous exercises and use this result to recalculate the Laspeyres and Paasche
indices, using equations (10.3) and (10.4).
(b) Check that the results are the same as calculated in previous exercises.
The story so far – a brief summary
We have encountered quite a few different concepts and calculations thus far,
and it might be worthwhile to briefly summarise what we have covered before
moving on. In order, we have examined:
● a simple index for a single commodity;
● a Laspeyres price index, which uses base-year weights;
● a Paasche price index, which uses current-year weights and is an alternative
to the Laspeyres formulation;
● the same Laspeyres and Paasche indices, but calculated using the data in a
slightly different form, using expenditure shares rather than quantities.
We now move on to examine quantity and expenditure indices, then look at
the relationship between them all.
Quantity and expenditure indices
Just as one can calculate price indices, it is also possible to calculate quantity and
value (or expenditure) indices. We first concentrate on quantity indices, which
provide a measure of the total quantity of energy consumed by industry each
year. The problem again is that we cannot easily aggregate the different sources
of energy. It makes no sense to add together tonnes of coal and petroleum,
therms of gas and megawatts of electricity. Some means has to be found to put
these different fuels on a comparable basis. To do this we now reverse the roles
of prices and quantities: the quantities of the different fuels are weighted by
their different prices (prices represent the value to the firm, at the margin, of
each different fuel). As with price indices, one can construct both Laspeyres and
Paasche quantity indices.
The Laspeyres quantity index
The Laspeyres quantity index for year n is given by

Q_n^L = ∑q_n p_0 / ∑q_0 p_0 × 100    (10.5)

i.e. it is the ratio of the cost of the year n basket to the cost of the year 0 basket,
both valued at year 0 prices. Note that it is the same as equation (10.1) but with
prices and quantities reversed.
Using 2002 as the base year, the cost of the 2003 basket at 2002 prices is

∑q_1 p_0 = 1.86 × 36.97 + 6.27 × 132.24 + 113.36 × 29.83 + 5677 × 0.78 = 8707.50

and the cost of the 2002 basket at 2002 prices is 8581.01, calculated earlier. The
value of the quantity index for 2003 is therefore

Q_1^L = 8707.50 / 8581.01 × 100 = 101.47
In other words, if prices had remained constant between 2002 and 2003,
industry would have consumed 1.47% more energy, and spent 1.47% more also.
The value of the index for subsequent years is shown in Table 10.14, using
the formula given in equation (10.5).
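Equation (10.5) applied to the quantities in Table 10.11 can be sketched as follows (Python; an added check using the 2002 prices and quantities from the earlier tables):

```python
# Laspeyres quantity index: year-n baskets valued at constant 2002 prices.
p0 = [36.97, 132.24, 29.83, 0.78]     # 2002 prices: coal, petroleum, electricity, gas
q = {                                  # quantities from Table 10.11 (plus 2002)
    2002: [1.81, 5.70, 112.65, 5641],
    2003: [1.86, 6.27, 113.36, 5677],
    2004: [1.85, 6.45, 115.84, 5258],
    2005: [1.79, 6.57, 118.52, 5226],
    2006: [1.71, 6.55, 116.31, 4910],
}

base = sum(p * qty for p, qty in zip(p0, q[2002]))   # 8581.01
for year in sorted(q):
    cost = sum(p * qty for p, qty in zip(p0, q[year]))
    print(year, round(cost / base * 100, 2))
```

The output reproduces the index column of Table 10.14 (100, 101.47, 98.80, 99.60, 95.89).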
The Paasche quantity index
Just as there are Laspeyres and Paasche versions of the price index, the same is
true for the quantity index. The Paasche quantity index is given by

Q_n^P = ∑q_n p_n / ∑q_0 p_n × 100    (10.6)

and is the analogue of equation (10.2) with prices and quantities reversed. The
calculation of this index is shown in Table 10.15, which shows a similar trend
to the Laspeyres index in Table 10.14. Normally one would expect the Paasche
to show a slower increase than the Laspeyres quantity index: firms should
switch to inputs whose relative prices fall; the Paasche gives lesser weight (current
prices) to these quantities than does the Laspeyres (base-year prices) and
thus shows a slower rate of increase.
Expenditure indices
The expenditure or value index is simply an index of the cost of the year n
basket at year n prices, and so it measures how expenditure changes over time.
The formula for the index in year n is

E_n = ∑p_n q_n / ∑p_0 q_0 × 100    (10.7)

There is obviously only one value index, and one does not distinguish between
Laspeyres and Paasche formulations. The index can be easily derived, as shown
Table 10.14 Calculation of the Laspeyres quantity index

        ∑p_0 q_n    Index
2002    8581.01     100
2003    8707.50     101.47  (= 8707.50/8581.01 × 100)
2004    8478.09     98.80   (= 8478.09/8581.01 × 100)
2005    8546.72     99.60
2006    8228.72     95.89
Table 10.15 Calculation of the Paasche quantity index

        ∑p_n q_n    ∑p_n q_0    Index
2002    8581.01     8581.01     100
2003    8863.52     8725.39     101.58
2004    9735.60     9887.15     98.47
2005    13 692.05   13 842.12   98.92
2006    17 043.52   17 943.65   94.98

Note: The final column is calculated as the ratio of the previous two columns.
in Table 10.16. The expenditure index shows how industry's expenditure on
energy is changing over time. Thus expenditure in 2006 was 99% higher than
in 2002, for example.
The increase in expenditure over time is a consequence of two effects:
(i) changes in the prices of energy, and (ii) changes in quantities purchased. It
should therefore be possible to decompose the expenditure index into price and
quantity effects. You may not be surprised to learn that these effects can be
measured by the price and quantity indices we have already covered. We look at
this decomposition in more detail in the next section.
Relationships between price, quantity and expenditure indices
Just as multiplying a price by a quantity gives total value or expenditure, the
same is true of index numbers. The value index can be decomposed as the
product of a price index and a quantity index. In particular, it is the product
of a Paasche quantity index and a Laspeyres price index, or the product of
a Paasche price index and a Laspeyres quantity index. This can be very simply
demonstrated using ∑ notation:

E_n = ∑p_n q_n / ∑p_0 q_0 = ∑p_n q_n / ∑p_n q_0 × ∑p_n q_0 / ∑p_0 q_0 = Q_n^P × P_n^L
(Paasche quantity times Laspeyres price index)    (10.8)

or

E_n = ∑p_n q_n / ∑p_0 q_0 = ∑p_n q_n / ∑p_0 q_n × ∑p_0 q_n / ∑p_0 q_0 = P_n^P × Q_n^L
(Paasche price times Laspeyres quantity index)    (10.9)

Thus increases in value or expenditure can be decomposed into price and quantity
effects. Two decompositions are possible, and they give slightly different answers.
It is also evident that a quantity index can be constructed by dividing a value
index by a price index, since by simple manipulation of equations (10.8) and
(10.9) we obtain

Q_n^P = E_n / P_n^L    (10.10)

and

Q_n^L = E_n / P_n^P    (10.11)
Note that dividing the expenditure index by a Laspeyres price index gives a
Paasche quantity index and dividing by a Paasche price index gives a Laspeyres
Table 10.16 The expenditure index

        ∑p_n q_n    Index
2002    8581.01     100
2003    8863.52     103.29
2004    9735.60     113.46
2005    13 692.05   159.56
2006    17 043.52   198.62

Note: The expenditure index is a simple index of the expenditures in the previous column.
Table 10.17 Deflating the expenditure series

        Expenditure at    Laspeyres      Expenditure in    Index
        current prices    price index    volume terms
2002    8581.01           100            8581.01           100
2003    8863.52           101.68         8716.86           101.58
2004    9735.60           115.22         8449.49           98.47
2005    13 692.05         161.31         8487.99           98.92
2006    17 043.52         209.11         8150.55           94.98
quantity index. In either case we go from a series of expenditures to one representing
quantities, having taken out the effect of price changes. This is known
as deflating a series and is a widely used and very useful technique. We shall
reconsider our earlier data in the light of this; Table 10.17 provides the detail.
Column 2 of the table shows the expenditure on fuel at current prices, or in cash
terms. Column 3 contains the Laspeyres price index, repeated from Table 10.9
above. Deflating (dividing) column 2 by column 3 and multiplying by 100 yields
column 4, which shows expenditure on fuel in quantity or volume terms. The
final column turns the volume series in column 4 into an index with 2002 = 100.
This final index is equivalent to a Paasche quantity index (see equation (10.10)),
as can be seen by comparison with Table 10.15 above.
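The deflating step in Table 10.17 can be sketched as follows (Python; an added check — small differences in the second decimal place arise because the price index is entered here rounded to two decimal places, whereas the table uses unrounded values):

```python
# Deflating expenditure at current prices by the Laspeyres price index
# recovers a volume series; its index form is the Paasche quantity index.
years = [2002, 2003, 2004, 2005, 2006]
expenditure = [8581.01, 8863.52, 9735.60, 13692.05, 17043.52]
laspeyres_price = [100, 101.68, 115.22, 161.31, 209.11]

# Volume series: expenditure divided by the price index, times 100
volume = [e / p * 100 for e, p in zip(expenditure, laspeyres_price)]
# Turn the volume series into an index with 2002 = 100
volume_index = [v / volume[0] * 100 for v in volume]
print([round(v, 2) for v in volume_index])
```

The result tracks the Paasche quantity index of Table 10.15 (100, 101.58, 98.47, 98.92, 94.98) to within rounding.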
Trap
A common mistake is to believe that once a series has been turned into an index
it is inevitably in real or volume terms. This is not the case. One can have an
index of a cash (or nominal) series (e.g. in Table 10.16 above) or of a real series
(the final column of Table 10.17). An index number is really just a change of the
units of measurement to something more useful for presentation purposes; it is
not the same as deflating the series.
In the example above we used the energy price index to deflate the expenditure
series. However, it is also possible to use a general price index, such as the
retail price index or the GDP deflator, to deflate. This gives a slightly different result,
both in numerical terms and in its interpretation. Deflating by a general price
index yields a series of expenditures in constant prices, or in real terms. Deflating
by a specific price index (e.g. of energy) results in a quantity or volume series.
An example should clarify this (see Problem 10.11 for data). The government
spends billions of pounds each year on the health service. If this cash expenditure
series is deflated by a general price index (e.g. the GDP deflator) then we
obtain expenditure on health services at constant prices, or real expenditure on
the health service. If the NHS pay and prices index is used as a deflator, then the
result is an index of the quantity or volume of health services provided. Since
the NHS index tends to rise more rapidly than the GDP deflator, the volume
series rises more slowly than the series of expenditure at constant prices. This
can lead to a vigorous, if pointless, political debate: the government claims it is
spending more on the health service in real terms, while the opposition claims
that the health service is getting fewer resources. As we have seen, both can be right.
Exercise 10.5
(a) Use the data from earlier exercises to calculate the Laspeyres quantity index.
(b) Calculate the Paasche quantity index.
(c) Calculate the expenditure index.
(d) Check that dividing the expenditure index by the price index gives the quantity
index (remember that there are two ways of doing this).
The real rate of interest
Another example of 'deflating' is calculating the 'real' rate of interest. This adjusts
the actual (sometimes called 'nominal') rate of interest for changes in the value
of money, i.e. inflation. If you earn a 7% rate of interest on your money over a year,
but the price level rises by 5% at the same time, you are clearly not 7% better off.
The real rate of interest in this case would be given by

real interest rate = (1 + 0.07)/(1 + 0.05) − 1 = 0.019 = 1.9%    (10.12)

In general, if r is the interest rate and i is the inflation rate, the real rate of
interest is given by

real interest rate = (1 + r)/(1 + i) − 1    (10.13)

A simpler method is often used in practice, which gives virtually identical results
for small values of r and i. This is to subtract the inflation rate from the interest
rate, giving 7 − 5 = 2% in this case.
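A one-line check of equations (10.12) and (10.13), comparing the exact real rate with the simple approximation (Python; an added illustration):

```python
# Exact real rate of interest (equation 10.13) vs the subtraction shortcut.
def real_rate(r, i):
    """Exact real rate from nominal rate r and inflation rate i (as decimals)."""
    return (1 + r) / (1 + i) - 1

exact = real_rate(0.07, 0.05)   # 0.019, i.e. 1.9%
approx = 0.07 - 0.05            # 0.02, i.e. 2% - close for small r and i
print(round(exact, 3), approx)
```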
Chain indices
Whenever an index number series over a long period of time is wanted, it is
usually necessary to link together a number of separate, shorter indices, resulting
in a chain index. Without access to the original raw data it is impossible to
construct a proper Laspeyres or Paasche index, so the result will be a mixture
of different types of index number, but it is the best that can be done in the
circumstances.
Suppose that the following two index number series are available. Access to
the original data is assumed to be impossible.

Laspeyres price index for energy, 2002–2006 (from Table 10.9):
2002    2003    2004    2005    2006
100     101.68  115.22  161.31  209.11

Laspeyres price index for energy, 1998–2002 (1999 = 100):
1998    1999    2000    2001    2002
104.54  100     104.63  116.68  111.87
The two series have different reference years and use different shopping
baskets of consumption. The ﬁrst index measures the cost of the 2002 basket in
each of the subsequent years. The second measures the price of the 1999 basket
Table 10.18 A chain index of energy prices, 1998–2006

        'Old' index    'New' index    Chain index
1998    104.54         –              104.54
1999    100            –              100
2000    104.63         –              104.63
2001    116.68         –              116.68
2002    111.87         100            111.87
2003    –              101.68         113.75
2004    –              115.22         128.90
2005    –              161.31         180.46
2006    –              209.11         233.93

Note: After 2002 the chain index values are calculated by multiplying the 'new' index by
1.1187, e.g. 113.75 = 101.68 × 1.1187 for 2003.
in surrounding years. There is an 'overlap' year, which is 2002. How do we
combine these into one continuous index covering the whole period?
The obvious method is to use the ratio of the costs of the two baskets in
2002 (111.87/100 = 1.1187) to alter one of the series. To base the continuous
series on 1999 = 100 requires multiplying each of the post-2002 figures by
1.1187, as is demonstrated in Table 10.18. Alternatively, the continuous series
could just as easily be based on 2002 = 100, by dividing the pre-2002 numbers
by 1.1187.
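The chaining in Table 10.18 can be sketched as follows (Python; an added illustration using the two series given above):

```python
# Chain-linking two index series at the 2002 overlap year (Table 10.18).
old = {1998: 104.54, 1999: 100, 2000: 104.63, 2001: 116.68, 2002: 111.87}
new = {2002: 100, 2003: 101.68, 2004: 115.22, 2005: 161.31, 2006: 209.11}

link = old[2002] / new[2002]   # 1.1187: ratio of the two series in the overlap year
chain = dict(old)              # pre-2002 values carried over unchanged
for year, value in new.items():
    if year > 2002:
        chain[year] = round(value * link, 2)   # rescale post-2002 values

print(chain[2003], chain[2006])  # 113.75 233.93
```

Basing the series on 2002 = 100 instead would mean dividing the `old` values by `link` rather than multiplying the `new` ones.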
The continuous series is not a proper Laspeyres index number, as can be seen
if we examine the formulae used. We shall examine the 2006 figure, 233.93, by
way of example. This figure is calculated as 233.93 = 209.11 × 111.87/100, which
in terms of our formulae is

∑p_06 q_02 / ∑p_02 q_02 × ∑p_02 q_99 / ∑p_99 q_99 × 100    (10.14)

The proper Laspeyres index for 2006, using 1999 weights, is

∑p_06 q_99 / ∑p_99 q_99 × 100    (10.15)

There is no way that this latter equation can be derived from equation (10.14),
proving that the former is not a properly constructed Laspeyres index number.
Although it is not a proper index number series, it does have the advantage of the
weights being revised and therefore more up to date.
Similar problems arise when deriving a chain index from two Paasche index
number series. Investigation of this is left to the reader; the method follows that
outlined above for the Laspeyres case.
The Retail Price Index
As an example consider the UK Retail Price Index which is one of the more
sophisticated of index numbers involving the recording of the prices of
around 550 items each month and weighting them on the basis of households’
expenditure patterns, as revealed by the Expenditure and Food Survey (the EFS
was explained in more detail in Chapter 9 on sampling methods). The principles
involved in the calculation are similar to those set out above, with slight
differences due to a variety of reasons.
The RPI is something of a compromise between a Laspeyres and a Paasche
index. It is calculated monthly, and within each calendar year the weights used
remain constant, so that it takes the form of a Laspeyres index. Each January,
however, the weights are updated on the basis of evidence from the EFS, so that
the index is in fact a set of chain-linked Laspeyres indices, the chaining taking
place in January each year. Despite its formal appearance as a Laspeyres index,
the RPI measured over a period of years has the characteristics of a Paasche
index, due to the annual change in the weights.
Another departure from principle is the fact that about 14% of households
are left out when expenditure weights are calculated. These consist of most
pensioner households (10%) and the very rich (4%), because they tend to have
significantly different spending patterns from the rest of the population and
their inclusion would make the index too unrepresentative. A separate RPI is
calculated for pensioners, while the very rich have to do without one.
A change in the quality of goods purchased can also be problematic, as
alluded to earlier. If a manufacturer improves the quality of a product and
charges more, is it fair to say that the price has gone up? Sometimes it is possible
to measure the improvement (if the power of a vacuum cleaner is increased, for
example), but other cases are more difficult, such as if the punctuality of a train
service is improved. By how much has quality improved? In many circumstances
the statistician has to make a judgement about the best procedure to
adopt. The ONS does make explicit allowance for the increase in quality of
personal computers, for example, taking account of such factors as increased
memory and processing speed.
Prices in the long run
Table 10.19 shows how prices have changed over the longer term. The 'inflation-adjusted'
column shows what the item would have cost if its price had risen in line with
the overall retail price index. It is clear that some relative prices have changed
substantially, and you can try to work out the reasons.
Table 10.19 80 years of prices: 1914–1994

Item                                      1914 price   Inflation-adjusted price   1994 price
Car                                       £730         £36 971                    £6995
London–Manchester 1st class rail fare     £2.45        £124.08                    £130
Pint of beer                              1p           53p                        £1.38
Milk (quart)                              1.5p         74p                        70p
Bread                                     2.5p         £1.21                      51p
Butter                                    6p           £3.06                      68p
Double room at Savoy Hotel, London        £1.25        £63.31                     £195
The Office for National Statistics has gone back even further and shown that
since 1750 prices have increased about 140 times. Most of this occurred after
1938: up till then prices had risen only about three-fold over two centuries
(about half a per cent per year on average); since then prices have risen 40-fold,
or about 6% per annum.
Exercise 10.6
The index of energy prices for the years 1995–1999 was:

1995    1996    1997    1998    1999
100     86.3    85.5    88.1    88.1

Use these data to calculate a chain index from 1995 to 2006, setting 1995 = 100.
Discounting and present values
Deflating makes expenditures in different years comparable by correcting for the
effect of inflation: the future sum is deflated (reduced) because of the increase
in the general price level. Discounting is a similar procedure for comparing amounts
across different years, correcting for time preference. For example, suppose that
by investing £1000 today a firm can receive £1100 in a year's time. To decide if
the investment is worthwhile, the two amounts need to be compared.
If the prevailing interest rate is 12%, then the firm could simply place
its £1000 in the bank and earn £120 interest, giving it £1120 at the end of the
year. Hence the firm should not invest in this particular project; it does better
keeping money in the bank. The investment is not undertaken because

£1000 × (1 + r) > £1100

where r is the interest rate (12%, or 0.12). Alternatively, this inequality may be
expressed as

£1000 > £1100 / (1 + r)

The expression on the right-hand side of the inequality sign is the present value
(PV) of £1100 received in one year's time. Here r is the rate of discount, and is
equal to the rate of interest in this example because this is the rate at which the
firm can transform present into future income, and vice versa. In what follows
we use the terms interest rate and discount rate interchangeably. The term
1/(1 + r) is known as the discount factor; multiplying an amount by the discount
factor results in the present value of the sum.
We can also express the inequality as follows, by subtracting £1000 from
each side:

0 > −£1000 + £1100 / (1 + r)

The right-hand side of this expression is known as the net present value (NPV)
of the project. It represents the difference between the initial outlay and the
present value of the return generated by the investment. Since this is negative,

the investment is not worthwhile; the money would be better placed on deposit
in a bank. The general rule is to invest if the NPV is positive.
Similarly, the present value of £1100 to be received in two years' time is

PV = £1100 / (1 + 0.12)^2 = £876.91

when r = 12%. In general, the PV of a sum S to be received in t years is

PV = S / (1 + r)^t

The PV may be interpreted as the amount a firm would be prepared to pay
today to receive an amount S in t years' time. Thus a firm would not be prepared
to make an outlay of more than £876.91 in order to receive £1100 in two years'
time. It would gain more by putting the money on deposit and earning 12%
interest per annum.
Most investment projects involve an initial outlay followed by a series of
receipts over the following years, as illustrated by the figures in Table 10.20. In
order to decide if the investment is worthwhile, the present value of the income
stream needs to be compared to the initial outlay. The PV of the income stream
is obtained by adding together the present value of each year's income. Thus we
calculate²

PV = S_1/(1 + r) + S_2/(1 + r)^2 + S_3/(1 + r)^3 + S_4/(1 + r)^4    (10.16)

or, more concisely, using ∑ notation:

PV = ∑ S_t/(1 + r)^t    (10.17)

Columns 3 and 4 of the table show the calculation of the present value. The
discount factors, 1/(1 + r)^t, are given in column 3. Multiplying column 2 by
column 3 gives the individual elements of the PV calculation (as in equation
(10.16) above) and their sum is 1034.14, which is the present value of the
returns. Since the PV is greater than the initial outlay of 1000, the investment
generates a return of at least 12% and so is worthwhile.
Table 10.20 The cash flows from an investment project

Year    Outlay or income    Discount factor    Discounted income
2001    Outlay −1000
2002    Income 300          0.893              267.86
2003    Income 400          0.797              318.88
2004    Income 450          0.712              320.30
2005    Income 200          0.636              127.10
Total                                          1034.14

Note: The discount factors are calculated as 0.893 = 1/1.12, 0.797 = 1/1.12^2, etc.
² This present value example has only four terms, but in principle there can be any
number of terms stretching into the future.
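The PV calculation in Table 10.20 can be verified with a short helper function (Python; my own sketch, not from the text):

```python
# Net present value of the Table 10.20 project, discounted at 12%.
def npv(rate, outlay, incomes):
    """NPV: discounted future incomes (years 1, 2, ...) minus the initial outlay."""
    pv = sum(s / (1 + rate) ** t for t, s in enumerate(incomes, start=1))
    return pv - outlay

incomes = [300, 400, 450, 200]           # receipts in 2002-2005
pv = npv(0.12, 0, incomes)               # PV of the returns alone
print(round(pv, 2))                      # 1034.14
print(round(npv(0.12, 1000, incomes), 2))  # 34.14: positive, so invest
```

A positive NPV (34.14 here) reproduces the conclusion in the text: the project beats leaving the money on deposit at 12%.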
STFE_C10.qxd 26/02/2009 09:16 Page 363
Chapter 10 • Index numbers
364
An alternative investment criterion: the internal rate of return

The investment rule can be expressed in a different manner, using the internal rate of return (IRR). This is the rate of discount which makes the NPV equal to zero, i.e. the present value of the income stream is equal to the initial outlay. An IRR of 10% equates £1100 received next year to an outlay of £1000 today. Since the IRR is less than the market interest rate (12%), this indicates that the investment is not worthwhile: it only yields a rate of return of 10%. The rule 'invest if the IRR is greater than the market rate of interest' is equivalent to the rule 'invest if the net present value is positive, using the interest rate to discount future revenues'.
In general it is mathematically difficult to find the IRR of a project with a stream of future income, except by trial and error methods. The IRR is the value of r which sets the NPV equal to zero, i.e. it is the solution to

NPV = −S_0 + Σ S_t/(1 + r)^t = 0    (10.18)

where S_0 is the initial outlay. Fortunately, most spreadsheet programs have an internal routine for its calculation. This is illustrated in Figure 10.1, which shows the calculation of the IRR for the data in Table 10.20 above.
Cell C13 contains the formula '=IRR(C6:C10, 0.1)' – this can be seen just above the column headings – which is the function used in Excel to calculate the internal rate of return. The financial flows of the project are in cells C6:C10; the value 0.1 (= 10%) is an initial guess at the answer – Excel starts from this value and then tries to improve upon it. The IRR for this project is found to be 13.7%, which is indeed above the market interest rate of 12%. The final two columns show that the PV of the income stream, when discounted using the internal rate of return, is equal to the initial outlay, as it should be. The discount factors in the penultimate column are calculated using r = 13.7%.
Figure 10.1
Calculation of IRR
Note: The first term in the series is the initial outlay (cell C4) and it is entered as a negative number. If a positive value is entered, the IRR function will not work.
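Excel's IRR routine is essentially automating a trial-and-error search. A minimal sketch of the same idea in Python (our illustration, not the book's; it relies on NPV falling steadily as r rises, which holds for a conventional project of an outlay followed by receipts):

```python
# Internal rate of return by bisection: find r such that NPV(r) = 0.
# The outlay appears first and must be negative, as in Figure 10.1.
def npv(flows, r):
    return sum(s / (1 + r) ** t for t, s in enumerate(flows))

def irr(flows, lo=0.0, hi=1.0, tol=1e-8):
    # NPV decreases as r increases, so bisect on the sign of NPV
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(flows, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

project = [-1000, 300, 400, 450, 200]
rate = irr(project)   # close to the 13.7% reported in the text
```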
The IRR is particularly easy to calculate if the income stream is a constant monetary sum. If the initial outlay is S_0 and a sum S is received each year in perpetuity (like a bond), then the IRR is simply

IRR = S/S_0

For example, if an outlay of £1000 yields a permanent income stream of £120 p.a., then the IRR is 12%. This should be intuitively obvious, since investing £1000 at an interest rate of 12% would give you an annual income of £120.
Although the NPV and IRR methods are identical in the above example, this is not always the case in more complex examples. When comparing two investment projects of different sizes, it is possible for the two methods to come up with different rankings. Delving into this issue is beyond the scope of this book but, in general, the NPV method is the more reliable of the two.
Nominal and real interest rates

The above example took no account of possible inflation. If there were a high rate of inflation, part of the future returns to the project would be purely inflationary gains and would not reflect real resources. Is it possible our calculation is misleading under such circumstances?
There are two ways of dealing with this problem:

1 use the actual cash flows and the nominal (market) interest rate to discount, or
2 use real (inflation-adjusted) flows and the real interest rate.

These two methods should give the same answer.
If an income stream has already been deflated to real terms, then the present value should be obtained by discounting by the real interest rate, not the nominal (market) rate. Table 10.21 illustrates the principle. Column 1 repeats the income flows in cash terms from Table 10.20. Assuming an inflation rate of i = 7% per annum gives the price index shown in column 2, based on 2001 = 100. This is used to deflate the cash series to real terms, shown in column 3. This is in constant (2001) prices. If we were presented only with the real income series and could not obtain the original cash flows, we would have to discount the real series by the real interest rate r_r, defined by
Table 10.21 Discounting a real income stream

Year              Cash     Price    Real      Real discount   Discounted
                  flows    index    income    factor          sums
                  (1)      (2)      (3)       (4)             (5)
2001  Outlay     −1000     100
2002  Income       300     107.0    280.37    0.955           267.86
2003               400     114.5    349.38    0.913           318.88
2004               450     122.5    367.33    0.872           320.30
2005               200     131.1    152.58    0.833           127.10
Total                                                        1034.14
1 + r_r = (1 + r)/(1 + i)    (10.19)

With a nominal interest rate of 12% and an inflation rate of 7% this gives

1 + r_r = (1 + 0.12)/(1 + 0.07) = 1.0467    (10.20)
so that the real interest rate is 4.67% (and, in this example, it is the same every year). The discount factors used to discount the real income flows are shown in column 4 of the table, based on the real interest rate; the discounted sums are in column 5 and the present value of the real income series is £1034.14. This is the same as was found earlier by discounting the cash figures by the nominal interest rate. Thus one can discount either the nominal (cash) values using the nominal discount rate, or the real flows by the real interest rate. Make sure you do not confuse the nominal and real interest rates.
The real interest rate can be approximated by subtracting the inflation rate from the nominal interest rate, i.e. 12 − 7 = 5%. This gives a reasonably accurate approximation for low values of the interest and inflation rates (below about 10% p.a.). Because of the simplicity of the calculation, this method is often preferred.
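Both routes can be verified in a few lines of Python (our illustration, using the figures from Table 10.20):

```python
# Discounting real flows at the real rate reproduces the present value
# obtained by discounting nominal flows at the nominal rate.
nominal_flows = [300, 400, 450, 200]   # cash figures from Table 10.20
r, i = 0.12, 0.07                      # nominal interest and inflation rates

r_real = (1 + r) / (1 + i) - 1         # exact real rate, equation (10.19)
real_flows = [s / (1 + i) ** t for t, s in enumerate(nominal_flows, 1)]

pv_nominal = sum(s / (1 + r) ** t for t, s in enumerate(nominal_flows, 1))
pv_real = sum(s / (1 + r_real) ** t for t, s in enumerate(real_flows, 1))
# both are about 1034.14; the approximation r - i = 5% would give a
# slightly different (less accurate) answer
```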
Exercise 10.7
(a) An investment of £100 000 yields returns of £25 000, £35 000, £30 000 and £15 000 in each of the subsequent four years. Calculate the present value of the income stream and compare it to the initial outlay, using an interest rate of 10% per annum.
(b) Calculate the internal rate of return on this investment.

Exercise 10.8
(a) An investment of £50 000 yields cash returns of £20 000, £25 000, £30 000 and £10 000 in each subsequent year. The rate of inflation is a constant 5% and the rate of interest is constant at 9%. Use the rate of inflation to construct a price index and discount the cash flows to real terms.
(b) Calculate the real discount rate.
(c) Use the real discount rate to calculate the present value of the real income flows.
(d) Compare the answer to part (c) to the result where the nominal cash flows and nominal interest rate are used.
Inequality indices

A separate set of index numbers is used specifically in the measurement of inequality, such as inequality in the distribution of income. We have already seen how we can measure the dispersion of a distribution via the variance and standard deviation. This is based upon the deviations of the observations about the mean. An alternative idea is to measure the difference between every pair of observations, and this forms the basis of a statistic known as the Gini coefficient. This would probably have remained an obscure measure, due to the complexity
Table 10.22 The distribution of gross income in the UK, 2006–2007

Range of weekly          Mid-point of    Number of
household income (£)     interval (£)    households
0–                        50               516
100–                     150              3095
200–                     250              3869
300–                     350              3095
400–                     450              2579
500–                     550              2063
600–                     650              2063
700–                     750              1548
800–                     850              1290
900–                     950              1032
1000–                   1250              4385
Total                                   25 534
of calculation, were it not for Max O. Lorenz, who showed that there is an attractive visual interpretation of it, now known as the Lorenz curve, and a relatively simple calculation of the Gini coefficient based on this curve.
We start off by constructing the Lorenz curve, based on data for the UK income distribution in 2006, and proceed then to calculate the Gini coefficient. We then use these measures to look at inequality both over time in the UK and across different countries.
We then examine another manifestation of inequality, in terms of the market shares of firms. For this analysis we look at the calculation of concentration ratios and at their interpretation.
The Lorenz curve

Table 10.22 shows the data for the distribution of income in the UK, based on data from the Family Resources Survey 2006–07, published by the ONS. The data report the total weekly income of each household, which means that income is recorded after any cash benefits from the state (e.g. a pension) have been received, but before any taxes have been paid.
The table indicates a substantial degree of inequality. For example, the poorest 14% of households earn £200 per week or less, while the richest 17% earn more than £1000 (five times as much). Although these figures give some idea of the extent of inequality, they relate only to relatively few households at the extremes of the distribution. A Lorenz curve is a way of graphically presenting the whole distribution. A typical Lorenz curve is shown in Figure 10.2.
Households are ranked along the horizontal axis, from poorest to richest, so that the median household, for example, is halfway along the axis. On the vertical axis is measured the cumulative share of income, which goes from 0% to 100%. A point such as A on the diagram indicates that the poorest 30% of households earn 5% of total income. Point B shows that the poorest half of
the population earn only 18% of income, and hence the other half earn 82%. Joining up all such points maps out the Lorenz curve.
A few things are immediately obvious about the Lorenz curve:
● Since 0% of households earn 0% of income, and 100% of households earn 100% of income, the curve must run from the origin up to the opposite corner.
● Since households are ranked from poorest to richest, the Lorenz curve must lie below the 45° line, which is the line representing complete equality. The further away from the 45° line is the Lorenz curve, the greater is the degree of inequality.
● The Lorenz curve must be concave from above: as we move to the right we encounter successively richer individuals, so the cumulative income grows faster.
Table 10.23 shows how to generate a Lorenz curve for the data given in Table 10.22. The task is to calculate the (x, y) coordinates for the Lorenz curve. These are given in columns 6 and 8 respectively of the table. Column 5 of the table calculates the proportion of households in each income category (i.e. the relative frequencies, as in Chapter 1) and these are then cumulated in column 6. These are the figures which are used along the horizontal axis. Column 4 calculates the total income going to each income class, by multiplying the class frequency by the mid-point. The proportion of total income going to each class is then calculated in column 7 (class income divided by total income). Column 8 cumulates the values in column 7.
Using columns 6 and 8 of the table we can see, for instance, that the poorest 2% of the population have about 0.2% of total income (one-tenth of their 'fair share'), the poorer half have about 25% of income, and the top 20% have about 40% of total income. Figure 10.3 shows the Lorenz curve, plotted using the data in columns 6 and 8 of the table above.
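The column-by-column calculation just described can be sketched in Python (our own illustration; the inputs are the mid-points and frequencies of Table 10.22):

```python
# Build the Lorenz-curve coordinates (columns 6 and 8 of Table 10.23)
# from the grouped data of Table 10.22.
mid = [50, 150, 250, 350, 450, 550, 650, 750, 850, 950, 1250]
households = [516, 3095, 3869, 3095, 2579, 2063, 2063, 1548, 1290, 1032, 4385]

total_hh = sum(households)
income = [m * f for m, f in zip(mid, households)]   # class income (column 4)
total_income = sum(income)

x = []  # cumulative % of households
y = []  # cumulative % of income
cum_h = cum_inc = 0
for h, inc in zip(households, income):
    cum_h += h
    cum_inc += inc
    x.append(100 * cum_h / total_hh)
    y.append(100 * cum_inc / total_income)
# x starts at about 2.0 and y at about 0.2, the figures quoted in the text
```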
Figure 10.2
Typical Lorenz curve
The Gini coefficient

The Gini coefficient is a numerical representation of the degree of inequality in a distribution and can be derived directly from the Lorenz curve. The Lorenz curve is illustrated once again in Figure 10.4, and the Gini coefficient is simply the ratio of area A to the sum of areas A and B.
Denoting the Gini coefficient by G, we have

G = A/(A + B)    (10.21)

and it should be obvious that G must lie between 0 and 1. When there is total equality the Lorenz curve coincides with the 45° line, area A then disappears, and G = 0. With total inequality (one household having all the income), area B disappears, and G = 1. Neither of these extremes is likely to occur in real life; instead one will get intermediate values, but the lower the value of G, the less inequality there is (though see the caveats listed below). One could compare two countries, for example, simply by examining the values of their Gini coefficients.
The Gini coefficient may be calculated from the following formulae for areas A and B, using the x and y co-ordinates from Table 10.23:

B = 1/2 × (x_1 − x_0) × (y_1 + y_0)
  + 1/2 × (x_2 − x_1) × (y_2 + y_1)
  + ...
  + 1/2 × (x_k − x_{k−1}) × (y_k + y_{k−1})    (10.22)

x_0 = y_0 = 0 and x_k = y_k = 100 represent the two end-points of the Lorenz curve and the other x and y values are the coordinates of the intermediate points. k is the number of classes for income in the frequency table. Area A is then given by³

Figure 10.4
Calculation of the Gini coefficient from the Lorenz curve

³ The value 5000 is correct if one uses percentages, as here (it is 1/2 × 100 × 100, the area of the triangle). If one uses percentages expressed as decimals, then A = 0.5 − B.
A = 5000 − B    (10.23)

and the Gini coefficient is then calculated as

G = A/(A + B) = A/5000    (10.24)
Thus for the data in Table 10.23 we have
B × 2.0 − 0 × 0.2 + 0 10.25
+ 14.1 − 2.0 × 3.3 + 0.2
+ 29.3 − 14.1 × 9.8 + 3.3
+ 41.4 − 29.3 × 17.1 + 9.8
+ 51.5 − 41.4 × 24.8 + 17.1
+ 59.6 − 51.5 × 32.5 + 24.8
+ 67.7 − 59.6 × 41.5 + 32.5
+ 73.7 − 67.7 × 49.3 + 41.5
+ 78.8 − 73.7 × 56.6 + 49.3
+ 82.8 − 78.8 × 63.2 + 56.6
+ 100 − 82.8 × 100 + 63.2
3210.5
Therefore area A 5000 − 3210.5 1789.5 and we obtain
G 0.3579 10.26
or approximately 36.
This method implicitly assumes that the Lorenz curve is made up of straight-line segments connecting the observed points, which is in fact not true: it should be a smooth curve. Since the straight lines will lie inside the true Lorenz curve, area B is over-estimated, and so the calculated Gini coefficient is biased downwards. The true value of the Gini coefficient is slightly greater than 36%, therefore. The bias will be greater (a) the fewer the number of observations and (b) the more concave is the Lorenz curve (i.e. the greater is inequality). The bias is unlikely to be substantial, however, so is best left untreated.
An alternative method of calculating G is simply to draw the Lorenz curve on gridded paper and count squares. This has the advantage that you can draw a smooth line joining the observations and avoid the bias problem mentioned above. This alternative method can prove reasonably quick and accurate, but has the disadvantage that you cannot use a computer to do it!
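The straight-line (trapezoidal) calculation of equations (10.22)–(10.24) can be sketched in Python (our illustration; it uses the rounded coordinates quoted in equation (10.25), so the result differs very slightly from the 0.3579 printed in the text):

```python
# Gini coefficient from Lorenz-curve coordinates (in %), with (0, 0) prepended.
x = [0, 2.0, 14.1, 29.3, 41.4, 51.5, 59.6, 67.7, 73.7, 78.8, 82.8, 100]
y = [0, 0.2, 3.3, 9.8, 17.1, 24.8, 32.5, 41.5, 49.3, 56.6, 63.2, 100]

# Area B under the Lorenz curve, trapezium by trapezium (equation 10.22)
B = sum(0.5 * (x[i] - x[i - 1]) * (y[i] + y[i - 1]) for i in range(1, len(x)))
A = 5000 - B        # equation (10.23)
G = A / 5000        # equation (10.24); about 0.357
```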
Is inequality increasing?

The Gini coefficient is only useful as a comparative measure, for looking at trends in inequality over time or for comparing different countries or regions. Table 10.24, taken from the Statbase website, shows the value of the Gini coefficient for the UK over the past 10 years and shows how it was affected by the tax system. The results are based on equivalised income, i.e. after making a correction for differences in family size.⁴ For this reason there is a slight difference from the Gini coefficient calculated above, which uses unadjusted data.

⁴ This is because a larger family needs more income to have the same living standard as a smaller one.
Using equivalised income appears to make little difference in this case (compare the 'gross income' column with the earlier calculation).
The table shows essentially two things:
1 The Gini coefficient changes little over time, suggesting that the income distribution is fairly stable.
2 The biggest reduction in inequality comes through cash benefits paid out by the state, rather than through taxes. In fact, the tax system appears to increase inequality rather than to reduce it, primarily because of the effects of indirect taxes.
Recent increases in inequality are a reversal of the historical trend. The figures presented in Table 10.25, from L. Soltow,⁵ provide estimates of the Gini coefficient in earlier times. These figures suggest that a substantial decline in the Gini coefficient has occurred in the last century or so, perhaps related to the process of economic development. It is difficult to compare Soltow's figures directly with the modern ones because of such factors as the quality of data and different definitions of income.
A simpler formula for the Gini coefficient

Kravis, Heston and Summers⁶ provide estimates of 'world' GDP by decile, and these figures, presented in Table 10.26, will be used to illustrate another method of calculating the Gini coefficient.
These figures show that the poorer half of the world population earns only about 10% of world income, and that a third of world income goes to the richest 10% of the population. This suggests a higher degree of inequality than for a single country such as the UK, as one might expect.
Table 10.24 Gini coefﬁcients for the UK 1995/96–2005/06
Original income Gross income Disposable income Post-tax income
1995/96 51.9 35.7 32.5 36.5
2000/01 51.3 37.5 34.6 38.9
2005/06 51.9 37.3 33.6 37.3
Note: Gross income is original income plus certain state beneﬁts such as pensions. Taking
off direct taxes gives disposable income and subtracting other taxes gives post-tax income.
⁵ L. Soltow, 'Long run changes in British income inequality', Economic History Review, 1968, 21, 17–29.
⁶ 'Real GDP per capita for more than 100 countries', Economic Journal, 1978, 88, 215–242.
Table 10.25 Gini coefﬁcients in past times
Year Gini
1688 0.55
1801–03 0.56
1867 0.52
1913 0.43–0.63
When the class intervals contain equal numbers of households, e.g. when the data are given for deciles of the income distribution as here, formula (10.22) for area B simplifies to

B = (100/2k) × (y_0 + 2y_1 + 2y_2 + ... + 2y_{k−1} + y_k) = (100/k) × (Σ y_i − 50)    (10.27)

where k is the number of intervals (e.g. 10 in the case of deciles, 5 for quintiles). Thus you simply sum the y values, subtract 50,⁷ and multiply by 100/k. The y values for the Kravis et al. data appear in the final row of Table 10.26, and their sum is 282.3. We therefore obtain

B = (100/10) × (282.3 − 50) = 2323

Hence

A = 5000 − 2323 = 2677

and

G = 2677/5000 = 0.5354

or about 53%. This is surprisingly similar to the figure for original income in the UK but, of course, differences in definition, measurement, etc. may make direct comparison invalid. While the Gini coefficient may provide some guidance when comparing inequality over time or across countries, one needs to take care in its interpretation.

⁷ If using decimal percentages, subtract 0.5.
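The shortcut formula (10.27) takes only a couple of lines of Python (our illustration, using the cumulative shares from the final row of Table 10.26):

```python
# Gini coefficient from decile data via B = (100/k) * (sum of cumulative y - 50).
cumulative = [1.5, 3.6, 6.0, 8.4, 11.7, 16.9, 25.3, 42.4, 66.5, 100.0]
k = len(cumulative)                      # 10 deciles

B = (100 / k) * (sum(cumulative) - 50)   # area B = 2323
G = (5000 - B) / 5000                    # about 0.5354, i.e. roughly 53%
```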
Exercise 10.9
(a) The same data as used in the text are presented below, but with fewer class intervals:

Range of income (£)    Mid-point of interval (£)    Number of households
0–                       100                          3611
200–                     300                          6964
400–                     500                          4643
600–                     700                          3611
800–                     900                          2321
1000–                   1250                          4385
Total                                               25 534

Draw the Lorenz curve for these data.
(b) Calculate the Gini coefficient for these data and compare it to that calculated earlier.
Table 10.26 The world distribution of income by decile

Decile           1     2     3     4     5      6      7      8      9      10
GDP share (%)    1.5   2.1   2.4   2.4   3.3    5.2    8.4    17.1   24.1   33.5
Cumulative (%)   1.5   3.6   6.0   8.4   11.7   16.9   25.3   42.4   66.5   100.0
Exercise 10.10
Given shares of total income of 8%, 15%, 22%, 25% and 30% by each quintile of a country's population, calculate the Gini coefficient.

STATISTICS IN PRACTICE

Inequality and development
Table 10.27 presents figures for the income distribution in selected countries around the world. They are in approximately ascending order of national income.

Table 10.27 Income distribution figures in selected countries

                 Year       Quintile shares (%)               Top 10%   Gini
                            1     2      3      4      5
Bangladesh       1981–82    6.6   10.7   15.3   22.1   45.3   29.5      0.36
Kenya            1976       2.6   6.3    11.5   19.2   60.4   45.8      0.51
Côte d'Ivoire    1985–86    2.4   6.2    10.9   19.1   61.4   43.7      0.52
El Salvador      1976–77    5.5   10.0   14.8   22.4   47.3   29.5      0.38
Brazil           1972       2.0   5.0    9.4    17.0   66.6   50.6      0.56
Hungary          1982       6.9   13.6   19.2   24.5   35.8   20.5      0.27
Korea, Rep.      1976       5.7   11.2   15.4   22.4   45.3   27.5      0.36
Hong Kong        1980       5.4   10.8   15.2   21.6   47.0   31.3      0.38
New Zealand      1981–82    5.1   10.8   16.2   23.2   44.7   28.7      0.37
UK               1979       7.0   11.5   17.0   24.8   39.7   23.4      0.31
Netherlands      1981       8.3   14.1   18.2   23.2   36.2   21.5      0.26
Japan            1979       8.7   13.2   17.5   23.1   37.5   22.4      0.27
The table shows that countries have very different experiences of inequality, even for similar levels of income (e.g. compare Bangladesh and Kenya). Hungary, the only former communist country, shows the greatest equality, although whether income accurately measures people's access to resources in such a regime is perhaps debatable. Note that countries with fast growth, such as Korea and Hong Kong, do not have to have a high degree of inequality. Developed countries seem to have uniformly low Gini coefficients.
Source: World Development Report 2002.
Concentration ratios

Another type of inequality is the distribution of market shares of the firms in an industry. We all know that Microsoft currently dominates the software market with a large market share. In contrast, an industry such as bakery has many different suppliers and there is little tendency to dominance. The concentration ratio is a commonly used measure to examine the distribution of market shares among firms competing in a market. Of course, it would be possible to measure this using the Lorenz curve and Gini coefficient, but the concentration ratio has the advantage that it can be calculated on the basis of less information and also tends to focus attention on the largest firms in the industry. The concentration
Table 10.28 Sales figures for an industry (millions of units)

Firm     A     B     C    D    E    F    G    H    I    J
Sales    180   115   90   62   35   25   19   18   15   10
ratio is often used as a measure of the competitiveness of a particular market but, as with all statistics, it requires careful interpretation.
A market is said to be concentrated if most of the demand is met by a small number of suppliers. The limiting case is monopoly, where the whole of the market is supplied by a single firm. We shall measure the degree of concentration by the five-firm concentration ratio, which is the proportion of the market held by the largest five firms, denoted C_5. The larger is this proportion, the greater the degree of concentration and, potentially, the less competitive is that market. Table 10.28 gives the (imaginary) sales figures of the 10 firms in a particular industry.
For convenience the firms have already been ranked by size, from A (the largest) to J (the smallest). The output of the five largest firms is 482 out of a total of 569, so the five-firm concentration ratio is C_5 = 84.7%, i.e. 84.7% of the market is supplied by the five largest firms.
Without supporting evidence, it is hard to interpret this figure. Does it mean that the market is not competitive and the consumer being exploited? Some industries, such as the computer industry, have a very high concentration ratio, yet it is hard to deny that they are fiercely competitive. On the other hand, some industries with no large firms have restrictive practices, entry barriers, etc. which mean that they are not very competitive (lawyers might be one example). A further point is that there may be a threat of competition from outside the industry which keeps the few firms acting competitively.
Concentration ratios can be calculated for different numbers of largest firms, for example the three-firm or four-firm concentration ratio. Straightforward calculation reveals them to be 67.7% and 78.6% respectively for the data given in Table 10.28. There is little reason, in general, to prefer one measure to the others, and they may give different pictures of the degree of concentration in an industry.
The concentration ratio calculated above relates to the quantity of output produced by each firm, but it is possible to do the same with sales revenue, employment, investment or any other variable for which data are available. The interpretation of the results will be different in each case. For example, the largest firms in an industry, while producing the majority of output, might not provide the greater part of employment if they use more capital-intensive methods of production. Concentration ratios obviously have to be treated with caution, therefore, and are probably best combined with case studies of the particular industry before conclusions are reached about the degree of competition.
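The calculation is a one-liner in Python (our illustration, using the sales figures of Table 10.28):

```python
# n-firm concentration ratio: share of industry output supplied by the
# n largest firms.
def concentration_ratio(sales, n):
    top = sorted(sales, reverse=True)[:n]   # the n largest firms
    return sum(top) / sum(sales)

sales = [180, 115, 90, 62, 35, 25, 19, 18, 15, 10]
c5 = concentration_ratio(sales, 5)   # about 0.847, i.e. 84.7%
c3 = concentration_ratio(sales, 3)   # about 0.677
```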
Exercise 10.11
Total sales in an industry are £400m. The largest five firms have sales of £180m, £70m, £40m, £25m and £15m. Calculate the three- and five-firm concentration ratios.
Key terms and concepts

base year
chain index
concentration ratio
deflating a data series
discounting
Gini coefficient
internal rate of return
Laspeyres index
Lorenz curve
Paasche index
present value
reference year
retail price index
weighted average
References

Kravis, I. B., Heston, A. and Summers, R., 'Real GDP per capita for more than one hundred countries', Economic Journal, 1978, 88(349), 215–242.
Soltow, L., 'Long run changes in British income inequality', Economic History Review, 1968, 21(1), 17–29.
Summary
● An index number summarises the variation of a variable over time or across
space in a convenient way.
● Several variables can be combined into one index providing an average
measure of their individual movements. The retail price index is an example.
● The Laspeyres price index combines the prices of many individual goods
using base-year quantities as weights. The Paasche index is similar but uses
current-year weights to construct the index.
● Laspeyres and Paasche quantity indices can also be constructed combining a
number of individual quantity series using prices as weights. Base-year prices
are used in the Laspeyres index current-year prices in the Paasche.
● A price index series multiplied by a quantity index series results in an index of expenditures. Rearranging this demonstrates that deflating (dividing) an expenditure series by a price series results in a volume (quantity) index. This is the basis of deflating a series in cash (or nominal) terms to one measured in real terms (i.e. adjusted for price changes).
● Two series covering different time periods can be spliced together as long as
there is an overlapping year to give one continuous chain index.
● Discounting the future is similar to deﬂating but corrects for the rate of time
preference rather than inﬂation. A stream of future income can thus be
discounted and summarised in terms of its present value.
● An investment can be evaluated by comparing the discounted present value
of the future income stream to the initial outlay. The internal rate of return
of an investment is a similar but alternative way of evaluating an investment
project.
● The Gini coefﬁcient is a form of index number that is used to measure
inequality e.g. of incomes. It can be given a visual representation using a
Lorenz curve diagram.
● For measuring the inequality of market shares in an industry the concentra-
tion ratio is commonly used.
Problems
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
10.1 The data below show exports and imports for the UK 1987–1992 in £bn at current prices.
1987 1988 1989 1990 1991 1992
Exports 120.6 121.2 126.8 133.3 132.1 135.5
Imports 122.1 137.4 147.6 148.3 140.2 148.3
a Construct index number series for exports and imports setting the index equal to 100
in 1987 in each case.
(b) Is it possible, using only the two indices, to construct an index number series for the balance of trade? If so, do so; if not, why not?
10.2 The following data show the gross trading proﬁts of companies 1987–1992 in the UK in £m.
1987 1988 1989 1990 1991 1992
61 750 69 180 73 892 74 405 78 063 77 959
a Turn the data into an index number series with 1987 as the reference year.
b Transform the series so that 1990 is the reference year.
(c) What increase has there been in profits between 1987 and 1992? Between 1990 and 1992?
10.3 The following data show energy prices and consumption in 1995–1999 analogous to the
data in the chapter for the years 2002–2006.
Prices Coal £/tonne Petroleum £/tonne Electricity £/MWh Gas £/therm
1995 37.27 92.93 40.07 0.677
1996 35.41 98.33 39.16 0.464
1997 34.42 90.86 36.87 0.509
1998 35.16 87.23 36.67 0.560
1999 34.77 104.93 36.23 0.546
Quantities Coal m tonnes Petroleum m tonnes Electricity m MWh Gas m therms
1995 2.91 6.37 102.88 4938
1996 2.22 6.21 105.45 5406
1997 2.14 5.64 107.31 5565
1998 1.81 5.37 107.97 5639
1999 2.04 5.33 110.98 6039
a Construct a Laspeyres price index using 1995 as the base year.
(b) Construct a Paasche price index. Compare this result with the Laspeyres index. Do they differ significantly?
(c) Construct Laspeyres and Paasche quantity indices. Check that they satisfy the condition that E_n = P_L × Q_P, etc. (where E_n denotes the expenditure index).
10.4 The prices of different house types in south-east England are given in the table below:
Year Terraced houses Semi-detached Detached Bungalows Flats
1991 59 844 77 791 142 630 89 100 47 676
1992 55 769 73 839 137 053 82 109 43 695
1993 55 571 71 208 129 414 82 734 42 746
1994 57 296 71 850 130 159 83 471 44 092
(a) If the numbers of each type of house in 1991 were 1898, 1600, 1601, 499 and 1702 respectively, calculate the Laspeyres price index for 1991–1994, based on 1991 = 100.
b Calculate the Paasche price index based on the following numbers of dwellings:
Year Terraced houses Semi-detached Detached Bungalows Flats
1992 1903 1615 1615 505 1710
1993 1906 1638 1633 511 1714
1994 1911 1655 1640 525 1717
c Compare Paasche and Laspeyres price series.
10.5 (a) Using the data in Problem 10.3, calculate the expenditure shares on each fuel in 1995 and the individual price index number series for each fuel, with 1995 = 100.
(b) Use these data to construct the Laspeyres price index using the expenditure shares approach. Check that it gives the same answer as in Problem 10.3(a).
10.6 The following table shows the weights in the retail price index and the values of the index
itself for 1990 and 1994.
           Food    Alcohol    Housing   Fuel     Household   Clothing   Personal   Travel   Leisure
                   and                  and      items                  goods
                   tobacco              light
Weights
1990       205     111        185       50       111         69         39         152      78
1994       187     111        158       45       123         58         37         162      119
Prices
1990       121.0   120.7      163.7     115.9    116.9       115.0      122.7      121.2    117.1
1994       139.5   162.1      156.8     133.9    132.4       116.0      152.4      150.7    145.7
(a) Calculate the Laspeyres price index for 1994, based on 1990 = 100.
(b) Draw a bar chart of the expenditure weights in 1990 and 1994 to show how spending patterns have changed. What major changes have occurred? Do individuals seem to be responding to changes in relative prices?
(c) The pensioner price index is similar to the general index calculated above, except that it excludes housing. What effect does this have on the index? What do you think is the justification for this omission?
(d) If consumers spent, on average, £188 per week in 1990 and £240 per week in 1994, calculate the real change in expenditure on food.
(e) Do consumers appear rational, i.e. do they respond as one would expect to relative price changes? If not, why not?
10.7 Construct a chain index from the following data series:

           1998   1999   2000   2001   2002   2003   2004
Series 1   100    110    115    122    125
Series 2                 100    107    111    119    121

What problems arise in devising such an index, and how do you deal with them?
10.8 Construct a chain index for 1995–2004 using the following data, setting 1998 = 100.

     1995   1996   1997   1998   1999   2000   2001   2002   2003   2004
     87     95     100    105
                          98     93     100    104    110
                                                      100    106    112
10.9 Industry is complaining about the rising price of energy. It demands to be compensated for any rise over 5% in energy prices between 2003 and 2004. How much would this compensation cost? Which price index should be used to calculate the compensation, and what difference would it make? Use the energy price data in the chapter.
10.10 Using the data in Problem 10.6 above calculate how much the average consumer would
need to be compensated for the rise in prices between 1990 and 1994.
10.11 The following data show expenditure on the National Health Service in cash terms, the
GDP deflator, the NHS pay and prices index, population and population of working age:

Year   NHS expenditure   GDP deflator    NHS pay and prices   Population   Population of
       (£m)              (1973 = 100)    index (1973 = 100)   (000s)       working age (000s)
       (1)               (2)             (3)                  (4)          (5)
1987   21 495            442             573                  56 930       34 987
1988   23 601            473             633                  57 065       35 116
1989   25 906            504             678                  57 236       35 222
1990   28 534            546             728                  57 411       35 300
1991   32 321            585             792                  57 801       35 467

In all the following answers set your index to 1987 = 100.
a Turn the expenditure cash figures into an index number series.
b Calculate an index of ‘real’ NHS expenditure using the GDP deflator. How does this
alter the expenditure series?
c Calculate an index of the volume of NHS expenditure using the NHS pay and prices
index. How and why does this differ from the answer arrived at in b?
d Calculate indices of real and volume expenditure per capita. What difference does this
make?
e Suppose that those not of working age cost twice as much to treat, on average, as
those of working age. Construct an index of the need for health care and examine how
health care expenditures have changed relative to need.
f How do you think the needs index calculated in e could be improved?
Chapter 10 • Index numbers
10.12 a If w represents the wage rate and p the price level, what is w/p?
b If Δw represents the annual growth in wages and i is the inflation rate, what is Δw − i?
c What does ln w − ln p represent (ln = natural logarithm)?
10.13 A firm is investing in a project and wishes to receive a rate of return of at least 15% on it.
The stream of net income is:

Year     1     2     3     4
Income   600   650   700   400

a What is the present value of this income stream?
b If the investment costs £1600, should the firm invest? What is the net present value of
the project?
10.14 A firm uses a discount rate of 12% for all its investment projects. Faced with the
following choice of projects, which yields the higher NPV?

Project   Outlay   Income stream
                   1      2      3      4      5      6
A         5600     1000   1400   1500   2100   1450   700
B         6000     800    1400   1750   2500   1925   1200
10.15 Calculate the internal rate of return for the project in Problem 10.13. Use either trial and
error methods or a computer to solve.
10.16 Calculate the internal rates of return for the projects in Problem 10.14.
10.17 a Draw a Lorenz curve and calculate the Gini coefficient for the wealth data in
Chapter 1, Table 1.3.
b Why is the Gini coefficient typically larger for wealth distributions than for income
distributions?
10.18 a Draw a Lorenz curve and calculate the Gini coefficient for the 1979 wealth data
contained in Problem 1.5, Chapter 1. Draw the Lorenz curve on the same diagram as
you used in Problem 10.17.
b How does the answer compare to 2003?
10.19 The following table shows the income distribution by quintile for the UK in 2006–2007, for
various definitions of income:

Quintile     Income measure
             Original   Gross   Disposable   Post-tax
1 (bottom)   3          7       7            6
2            7          10      12           11
3            15         16      16           16
4            24         23      22           23
5 (top)      51         44      42           44
a Use equation 10.27 to calculate the Gini coefficient for each of the four categories of
income.
b For the ‘original income’ category, draw a smooth Lorenz curve on a piece of gridded
paper and calculate the Gini coefficient using the method of counting squares. How
does your answer compare to that for part a?
10.20 For the Kravis, Heston and Summers data (Table 10.26), combine the deciles into
quintiles and calculate the Gini coefficient from the quintile data. How does your answer
compare with the answer given in the text, based on deciles? What do you conclude about
the degree of bias?
10.21 Calculate the three-firm concentration ratio for employment in the following industry:

Firm        A      B     C     D      E     F     G     H
Employees   3350   290   440   1345   821   112   244   352
10.22 Compare the degrees of concentration in the following two industries. Can you say which
is likely to be more competitive?

Firm                 A     B     C     D     E     F     G     H     I     J
Sales (industry 1)   337   384   696   321   769   265   358   521   880   334
Sales (industry 2)   556   899   104   565   782   463   477   846   911   227
10.23 (Project) The World Development Report contains data on the income distributions of
many countries around the world, by quintile. Use these data to compare income
distributions across countries, focusing particularly on the differences between poor,
middle-income and rich countries. Can you see any pattern emerging? Are there countries
which do not fit into this pattern? Write a brief report summarising your findings.
Answers to exercises
Exercise 10.1
a 100, 111.9, 140.7, 163.4, 188.1.
b 61.2, 68.5, 86.1, 100, 115.1.
c 115.1/61.2 = 1.881.
Exercise 10.2

                        1999    2000     2001     2002     2003
a (1999 = 100)          100     104.63   116.68   111.87   111.30
b (2001 = 100)          85.70   89.67    100      95.87    95.39
c (using 2000 basket)   100     104.69   116.86   112.08   111.52

Exercise 10.3
The Paasche index is:

1999   2000     2001     2002     2003
100    104.69   117.27   111.09   110.93
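The two index formulae used in these exercises are easy to compute directly. Below is a minimal sketch with made-up prices and quantities for two goods (not the chapter's energy data); the function names are our own:

```python
# Laspeyres (base-year basket) and Paasche (current-year basket) price
# indices, expressed with the base year = 100.

def laspeyres(p0, q0, pn):
    # sum(pn*q0) / sum(p0*q0), scaled to 100
    return 100 * sum(p * q for p, q in zip(pn, q0)) / sum(p * q for p, q in zip(p0, q0))

def paasche(p0, pn, qn):
    # sum(pn*qn) / sum(p0*qn), scaled to 100
    return 100 * sum(p * q for p, q in zip(pn, qn)) / sum(p * q for p, q in zip(p0, qn))

p0, q0 = [10.0, 5.0], [100, 200]   # base-year prices and quantities
pn, qn = [12.0, 5.5], [90, 220]    # current-year prices and quantities

PL = laspeyres(p0, q0, pn)   # 115.0
PP = paasche(p0, pn, qn)     # 114.5
```

Here the Laspeyres index exceeds the Paasche, the usual pattern when consumers shift purchases towards the good whose relative price has fallen.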
Exercise 10.4
a Expenditure shares in 1999 are:

              Expenditure   Share
Coal          70.93         0.9%
Petroleum     559.28        7.0%
Electricity   4020.81       50.6%
Gas           3297.29       41.5%

giving the Laspeyres index for 2000 as

P = (0.606/0.546) × 0.009 + (137.90/104.93) × 0.070 + (35.12/34.77) × 0.506
    + (36.23/34.69) × 0.415 = 1.0463, or 104.63.

The expenditure shares in 2000 are 0.3%, 8.9%, 46.3% and 44.4%, which allows the
2000 Paasche index to be calculated as

P = 100 ÷ [(0.546/0.606) × 0.003 + (104.93/137.90) × 0.089 + (34.77/35.12) × 0.463
    + (34.69/36.23) × 0.444] = 104.69.

Later years can be calculated in similar fashion.
Exercise 10.5
a/b The Laspeyres and Paasche quantity indexes are:
Laspeyres index Paasche index
1999 100 100
2000 102.65 102.71
2001 102.40 102.91
2002 98.18 97.50
2003 101.46 101.12
c The expenditure index is 100, 107.46, 120.08, 109.07, 112.55.
d The Paasche quantity index times the Laspeyres price index (or vice versa) gives the
expenditure index.
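Part (d)'s identity – Paasche quantity × Laspeyres price = expenditure index, and likewise Laspeyres quantity × Paasche price – holds algebraically, so it can be checked on any made-up data. A sketch:

```python
# Verify: expenditure index = Paasche quantity x Laspeyres price
#                           = Laspeyres quantity x Paasche price.

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

p0, q0 = [10.0, 5.0, 2.0], [100, 200, 500]   # base year (made-up)
pn, qn = [12.0, 5.5, 1.8], [90, 220, 650]    # current year (made-up)

expenditure = inner(pn, qn) / inner(p0, q0)
laspeyres_price = inner(pn, q0) / inner(p0, q0)
paasche_price = inner(pn, qn) / inner(p0, qn)
laspeyres_qty = inner(p0, qn) / inner(p0, q0)
paasche_qty = inner(pn, qn) / inner(pn, q0)
```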
Exercise 10.6
The full index is (using Laspeyres indexes):

       1995 = 100   1999 = 100   2002 = 100   Chain index
1995   100                                    100
1996   86.3                                   86.3
1997   85.5                                   85.5
1998   88.1                                   88.1
1999   88.1         100                       88.1
2000                104.63                    92.2
2001                116.68                    102.8
2002                111.87       100          98.6
2003                111.30       101.68       100.2
2004                             115.22       113.6
2005                             161.31       159.0
2006                             209.11       206.1
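The splicing above can be reproduced in a few lines. The sketch below links the three series at their common years (1999 and 2002) and recovers the chain column:

```python
# Chain three overlapping index series: A (1995=100), B (1999=100),
# C (2002=100). Each new series is rescaled so that it agrees with the
# existing chain in the overlap year.

A = {1995: 100, 1996: 86.3, 1997: 85.5, 1998: 88.1, 1999: 88.1}
B = {1999: 100, 2000: 104.63, 2001: 116.68, 2002: 111.87, 2003: 111.30}
C = {2002: 100, 2003: 101.68, 2004: 115.22, 2005: 161.31, 2006: 209.11}

chain = dict(A)                      # 1995-1999 taken directly from A
link = chain[1999] / B[1999]         # splice factor for B
for year in [2000, 2001, 2002]:
    chain[year] = B[year] * link
link = chain[2002] / C[2002]         # splice factor for C
for year in [2003, 2004, 2005, 2006]:
    chain[year] = C[year] * link

rounded = {y: round(v, 1) for y, v in chain.items()}
```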
Exercise 10.7
a The discounted figures are:

Year    Investment/yield   Discount factor   Discounted yield
0       −100 000
1       25 000             0.9091            22 727.3
2       35 000             0.8264            28 925.6
3       30 000             0.7513            22 539.4
4       15 000             0.6830            10 245.2
Total                                        84 437.5

The present value is less than the initial outlay.
b The internal rate of return is 2.12%.
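Both numbers can be checked with a short script; the bisection search below is a generic way of finding the internal rate of return (the rate at which NPV is zero):

```python
# Exercise 10.7: outlay 100 000 now, then yields 25 000, 35 000,
# 30 000, 15 000 in years 1-4, discounted at 10%.

def npv(rate, cashflows):
    # cashflows[t] is received in year t (t = 0 is today)
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

flows = [-100000, 25000, 35000, 30000, 15000]

pv = npv(0.10, flows) + 100000        # PV of the income stream at 10%

# NPV falls as the rate rises (all later flows are positive), so the
# IRR can be found by bisection.
lo, hi = -0.99, 1.0
for _ in range(100):
    mid = (lo + hi) / 2
    if npv(mid, flows) > 0:
        lo = mid
    else:
        hi = mid
irr = (lo + hi) / 2
```

`pv` comes out at about 84 437.5, below the 100 000 outlay, and `irr` at about 2.12%, matching the answers above.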
Exercise 10.8
a Deflating to real income gives:

Year   Investment/yield   Price index   Real income
0      −50 000            100           −50 000.0
1      20 000             105           19 047.6
2      25 000             110.250       22 675.7
3      30 000             115.763       25 915.1
4      10 000             121.551       8 227.0

b The real discount rate is 1.09/1.05 = 1.038, or 3.8% p.a.
c/d
Nominal   Discount   Discounted   Real        Discount   Discounted
values    factor     value        values      factor     value
−50 000                           −50 000.0
20 000    0.917      18 348.6     19 047.6    0.963      18 348.6
25 000    0.842      21 042.0     22 675.7    0.928      21 042.0
30 000    0.772      23 165.5     25 915.1    0.894      23 165.5
10 000    0.708      7 084.3      8 227.0     0.861      7 084.3
Totals               69 640.4                            69 640.4

The present value is the same in both cases and exceeds the initial outlay.
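The equality of the two totals is not a coincidence: deflating by prices and then discounting at the real rate divides each flow by 1.05^t × (1.09/1.05)^t = 1.09^t, exactly the nominal calculation. A sketch:

```python
# Nominal flows discounted at 9% vs deflated flows discounted at the
# real rate 1.09/1.05 - 1 (about 3.8%).

nominal = [20000, 25000, 30000, 10000]                       # years 1-4
real = [cf / 1.05 ** t for t, cf in enumerate(nominal, 1)]   # deflated flows

real_rate = 1.09 / 1.05 - 1

pv_nominal = sum(cf / 1.09 ** t for t, cf in enumerate(nominal, 1))
pv_real = sum(cf / (1 + real_rate) ** t for t, cf in enumerate(real, 1))
```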
Exercise 10.9
b

Range of   Mid-    Number of    Total        Households   Cumulative   Income   Cumulative
income     point   households   income       (%)          x (%)        (%)      y (%)
0–         100     3611         361 100      14.1         14.1         2.4      2.4
200–       300     6964         2 089 200    27.3         41.4         14.1     16.5
400–       500     4643         2 321 500    18.2         59.6         15.6     32.1
600–       700     3611         2 527 700    14.1         73.7         17.0     49.1
800–       900     2321         2 088 900    9.1          82.8         14.0     63.1
1000–      1250    4385         5 481 250    17.2         100.0        36.9     100.0
Totals             25 535       14 869 650   100.0                     100.0

The Gini coefficient is then calculated as follows: B = 0.5 × {14.1 × (2.4 + 0) + 27.3
× (16.5 + 2.4) + 18.2 × (32.1 + 16.5) + 14.1 × (49.1 + 32.1) + 9.1 × (63.1 + 49.1) +
17.2 × (100 + 63.1)} = 3201. Area A = 5000 − 3201 = 1799. Hence Gini = 1799/5000 =
0.360, very similar to the value in the text using more categories of income.
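The same area calculation can be scripted; the sketch below applies the trapezium rule to the cumulative percentages in the table (tiny differences from B = 3201 arise because the hand calculation rounds intermediate products):

```python
# Gini from grouped data: B = area under the Lorenz curve, built up
# segment by segment from cumulative % of households (x) and income (y).

cum_x = [14.1, 41.4, 59.6, 73.7, 82.8, 100.0]
cum_y = [2.4, 16.5, 32.1, 49.1, 63.1, 100.0]

B = 0.0
x_prev = y_prev = 0.0
for x, y in zip(cum_x, cum_y):
    B += 0.5 * (x - x_prev) * (y + y_prev)   # trapezium under the segment
    x_prev, y_prev = x, y

gini = (5000 - B) / 5000
```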
Exercise 10.10
B = 100/5 × (246 − 50) = 3920. Hence A = 1080 and Gini = 0.216. (246 is the sum of
the cumulative y values.)
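The shortcut B = 100/5 × (Σ cumulative shares − 50) is the same trapezium calculation specialised to five equal-width groups. A sketch using hypothetical quintile shares (not the exercise's data):

```python
# Quintile Gini: shares are % of total income received by each fifth,
# poorest first. The shares below are made up for illustration.

shares = [5, 10, 15, 25, 45]
cum = []
running = 0
for s in shares:
    running += s
    cum.append(running)             # cumulative shares: 5, 15, 30, 55, 100

B = (100 / 5) * (sum(cum) - 50)     # area under the Lorenz curve
gini = (5000 - B) / 5000
```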
Exercise 10.11
C3 = 290/400 = 72.5% and C5 = 82.5%.
Appendix Deriving the expenditure share form of
the Laspeyres price index
We can obtain the expenditure share version of the formula from the standard
formula given in equation 10.1:

P_L = Σ p_n q_0 ÷ Σ p_0 q_0

Multiplying and dividing each term in the numerator by p_0 gives

P_L = Σ (p_n/p_0) × p_0 q_0 ÷ Σ p_0 q_0
    = Σ [(p_n/p_0) × (p_0 q_0 ÷ Σ p_0 q_0)]
    = Σ (p_n/p_0) × s_0

where s_0 = p_0 q_0 ÷ Σ p_0 q_0 is the base-year expenditure share. This is equation
10.3 in the text (the × 100 is omitted from this derivation for simplicity).
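The equivalence just derived can be confirmed numerically; in this sketch (made-up three-good data) the basket form and the share form agree to machine precision:

```python
# Basket form: sum(pn*q0)/sum(p0*q0).
# Share form:  sum((pn/p0) * s0), with s0 = p0*q0/sum(p0*q0).

p0 = [2.0, 3.0, 1.5]      # base-year prices (made-up)
q0 = [10, 4, 8]           # base-year quantities (made-up)
pn = [2.2, 3.3, 1.4]      # current-year prices (made-up)

base_spend = sum(p * q for p, q in zip(p0, q0))
basket_form = sum(p * q for p, q in zip(pn, q0)) / base_spend

shares = [p * q / base_spend for p, q in zip(p0, q0)]
share_form = sum((a / b) * s for a, b, s in zip(pn, p0, shares))
```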
Seasonal adjustment of
time-series data 11
Contents
Learning outcomes 386
Introduction 387
The components of a time series 387
Isolating the trend 390
Isolating seasonal factors 393
Seasonal adjustment 396
An alternative method for ﬁnding the trend 398
Forecasting 399
Further issues 400
Summary 401
Key terms and concepts 401
Problems 402
Answers to exercises 404
Learning outcomes
By the end of this chapter you should be able to:
● recognise the different elements that make up a time series
● isolate the trend from a series by either the additive or multiplicative method
or by using linear regression
● ﬁnd the seasonal factors in a series
● use the seasonal factors to seasonally adjust the data
● forecast the series taking account of seasonal factors
● appreciate the issues involved in the process of seasonal adjustment.
Complete your diagnostic test for Chapter 11 now to create your personal study
plan. Exercises with an icon are also available for practice in MathXL with
additional supporting resources.
Introduction
‘Economists noticed some signs in the data that suggest a turning point may be in
the ofﬁng. The claimant count although down also showed February’s ﬁgure had
been revised to show a rise of 600 between January and February the ﬁrst occasion in
17 months that there had been an increase.’
The Guardian 17 April 2008
The quote above describes economists trying to spot a ‘turning point’ in the
unemployment data early in 2008. This is an extremely difﬁcult task for several
reasons:
● By deﬁnition a turning point is a point at which a previous trend changes.
● Data are ‘noisy’ containing a lot of random movements.
● There may be seasonal factors involved, e.g. perhaps February’s figures are
usually considerably higher than January’s, so what are we to make of a small
increase?
This chapter is concerned with the interpretation of time-series data such
as unemployment, retail sales, stock prices, etc. Agencies such as government,
businesses and trade unions are interested in knowing how the economy is
changing over time. Government may want to lower interest rates in response
to an economic slowdown, businesses may want to know how much extra stock
they need for Christmas, and trade unions will find pay bargaining more difficult
if economic conditions worsen. For all of them an accurate picture of the
economy is important.
In this chapter we will show how to decompose a time series such as
unemployment into its component parts: trend, cycle, seasonal and random.
We then use this breakdown to seasonally adjust the original data, i.e. to remove
any variation due solely to time-of-year effects (a vivid example would be the
Christmas season in the case of retail sales). This allows us to see any changes to
the underlying data more easily. Knowing the seasonal pattern of the data also helps
with forecasting: knowing that unemployment tends to be above trend in
September can aid us in forecasting future levels in that month.
The methods used in this chapter are relatively straightforward compared
to other more sophisticated methods that are available. However they do illus-
trate the essential principles and give similar answers to the more advanced
methods. Later in the chapter we discuss some of the more complex issues that
can arise.
The components of a time series
Unemployment data will be used to illustrate the methods involved in decomposing
a time series into its component parts. A similar analysis could be carried
out for other time-series data, common examples being monthly sales data for a
firm or quarterly data on the money supply. As always, one should begin by
looking at the raw data, and the best way of doing this is via a time-series chart.
Table 11.1 presents the monthly unemployment ﬁgures for the period
January 2004 to December 2006 and Figure 11.1 shows a plot of the data. The
chart shows an upwards trend to unemployment around which there also
appears to be a cycle of some kind.
Any time series such as this is made up of two types of elements:
1 systematic components, such as a trend, cycle and seasonals, and
2 random elements, which are by definition unpredictable.
It would be difficult to analyse a series which is completely random, such as
the result of tossing a coin (see Figure 2.1 in Chapter 2). A look at the unemployment
data, however, suggests that the series is definitely non-random – there is
evidence of an upward trend and there does appear to be a seasonal component.
The latter can be seen better if we superimpose each year on the same graph, as
shown in Figure 11.2, which adds 2003 and 2007.
Table 11.1 UK unemployment 2004–2006
2004 2005 2006
January 1401 1406 1539
February 1430 1405 1589
March 1425 1397 1602
April 1380 1379 1615
May 1389 1392 1649
June 1427 1433 1718
July 1462 1482 1761
August 1466 1509 1773
September 1445 1552 1753
October 1422 1556 1701
November 1383 1525 1662
December 1373 1494 1645
Note: The data are in 000s so there were 1 401 000 people unemployed in January 2004. This
is according to the International Labour Office (ILO) definition of unemployment (series MGTP
in Statbase).
Figure 11.1
Chart of unemployment
data
The series generally show peaks around February and August, with dips
around May and December. The autumn peak occurs a little later in 2005,
which may be associated with the increase in the general level of unemployment
(note the high levels of unemployment through 2006). If one wished to
predict unemployment for February 2008, the trend would be projected forwards
and account taken of the fact that unemployment tends to be slightly
above the trend in February. This also sheds some light upon the Guardian quote
at the top of the chapter. A slight rise in unemployment in February is not
surprising and may not indicate a longer term increase in unemployment (though
note the quote refers to the claimant count measure of unemployment, slightly
different from the measure used in Table 11.1).
A time series can be decomposed into four components: three of them
systematic and one random.
1 A trend: many economic variables are trended over time (as noted in
Chapter 1 for the investment series). This measures the longer term direction
of the series, whether increasing, decreasing or unchanging.
2 A cycle: most economies tend to progress unevenly, mixing periods of rapid
growth with periods of relative stagnation. This business cycle can vary in
length, which makes it difficult to analyse. Consequently it is often ignored
or combined together with the trend.
3 A seasonal component: this is a regular, short-term (one-year) cycle. Sales of
ice cream vary seasonally for obvious reasons. Since it is a regular cycle, it
is relatively easy to isolate.
4 A random component: this is what is left over after the above factors have
been taken into account. By definition it cannot be predicted.
These four elements can be combined in either an additive or multiplicative
model. The additive model of unemployment is

X_t = T + C + S + R    (11.1)

where X represents unemployment, T the trend component, C the cycle, S the
seasonal component and R the random element.
The multiplicative model is

X_t = T × C × S × R    (11.2)
Figure 11.2
Superimposed time
series graphs of
unemployment
There is little to choose between the two alternatives; the multiplicative
formulation will be used in the rest of this chapter. This is the method generally
used by the Office for National Statistics in officially published series.
The analysis of unemployment proceeds as follows.
1 First, the trend is isolated from the original data by the method of moving
averages.
2 Second, the actual unemployment figures are then compared to the trend to
see which months tend to have unemployment above trend. This allows
seasonal factors to be extracted from the data.
3 Finally, the seasonal factors are used to seasonally adjust the data so that
the underlying movement in the figures can be observed.
Isolating the trend
There is a variety of methods for isolating the trend from time-series data. The
method used here is that of moving averages, one of several methods of smoothing
the data. These smoothing methods iron out the short-term fluctuations
in the data by averaging successive observations. For example, to calculate the
three-month moving average figure for the month of July, one would take the
average of the unemployment figures for June, July and August. The three-month
moving average for August would be the average of the July, August and
September figures. The figures for 2004 are therefore:

July: (1427 + 1462 + 1466)/3 = 1451.67
August: (1462 + 1466 + 1445)/3 = 1457.67

Note that two values (1462 and 1466) are common to the two calculations,
so that the two averages tend to be similar and the data series is smoothed out.
Thus the moving average is calculated by moving through the data series,
taking successive three-month averages.
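The three-month averaging just described can be sketched as follows, using the 2004 figures from Table 11.1 (the first and last months of the run have no average):

```python
# Three-month centred moving average of the 2004 unemployment data (000s).

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
u = [1401, 1430, 1425, 1380, 1389, 1427, 1462, 1466, 1445, 1422, 1383, 1373]

ma3 = {}
for i in range(1, len(u) - 1):
    # average of the previous, current and next months
    ma3[months[i]] = round((u[i - 1] + u[i] + u[i + 1]) / 3, 2)
```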
The choice of the three-month moving average was arbitrary; it could just as
easily have been a four-, five- or 12-month moving average process. How should
the appropriate length of the moving average process be chosen? This depends
upon the degree of smoothing of the data that is desired and upon the nature
of the fluctuations. The longer the period of the moving average process, the
greater the smoothing of the data, since the greater is the number of terms
in the averaging process. In the case of unemployment data the fluctuations
are probably fairly consistent from year to year since, for example, school leavers
arrive on the unemployment register at the same time every year, causing a
jump in the figures. A 12-month moving average process would therefore be
appropriate to smooth this data series.
Table 11.2 shows how the 12-month moving average series is calculated. The
calculation is the same in principle as the three-month moving average, but there
is one slight complication: that of centring the data. The ‘Unemployment’ column
(1) of the table repeats the raw data from Table 11.1. In column 2 are calculated
the successive 12-month totals. Thus the total of the first 12 observations
(Jan–Dec 2004) is 17 003, and this is placed in the middle of 2004, between the
Table 11.2 Calculation of the moving average series
Month Unemployment 12-month Centred Moving
total 12-month total average
1 2 3 4
2004 Jan 1401
2004 Feb 1430
2004 Mar 1425
2004 Apr 1380
2004 May 1389
2004 Jun 1427
17 003
2004 Jul 1462
17 008
17 005.5 1417.1
2004 Aug 1466
16 983
16 995.5 1416.3
2004 Sep 1445
16 995
16 969.0 1414.1
2004 Oct 1422
16 954
16 954.5 1412.9
2004 Nov 1383
16 957
16 955.5 1413.0
2004 Dec 1373
16 963
16 960.0 1413.3
2005 Jan 1406
16 983
16 973.0 1414.4
2005 Feb 1405
17 026
17 004.5 1417.0
2005 Mar 1397
17 133
17 079.5 1423.3
2005 Apr 1379
17 267
17 200.0 1433.3
2005 May 1392
17 409
17 338.0 1444.8
2005 Jun 1433
17 530
17 469.5 1455.8
2005 Jul 1482
17 663
17 596.5 1466.4
2005 Aug 1509
17 847
17 755.0 1479.6
2005 Sep 1552
18 052
17 949.5 1495.8
2005 Oct 1556
18 288
18 170.0 1514.2
2005 Nov 1525
18 545
18 416.5 1534.7
2005 Dec 1494
18 830
18 687.5 1557.3
2006 Jan 1539
19 109
18 969.5 1580.8
2006 Feb 1589
19 373
19 241.0 1603.4
2006 Mar 1602
19 574
19 473.5 1622.8
2006 Apr 1615
19 719
19 646.5 1637.2
2006 May 1649
19 856
19 787.5 1649.0
2006 Jun 1718
20 007
19 931.5 1661.0
2006 Jul 1761
2006 Aug 1773
2006 Sep 1753
2006 Oct 1701
2006 Nov 1662
2006 Dec 1645
Note: In column 2 are the 12-month totals, e.g. 17 003 is the sum of the values from 1401 to
1373. In column 3 these totals are centred on the appropriate month, e.g. 17 005.5 = (17 003
+ 17 008)/2. The final column is column 3 divided by 12.
months of June and July. The sum of observations 2–13 is 17 008 and falls
between July and August, and so on. Notice that it is impossible to calculate any
total before June/July by the moving average process using the data from the
table. A similar effect occurs at the end of the series, in the second half of 2006.
Values at the beginning and end of the period in question are always lost by this
method of smoothing. The greater the length of the moving average process, the
greater the number of observations lost.
It is inconvenient to have this series falling between the months, so it is
centred in column 3. This is done by averaging every two consecutive months’
figures, so the June/July and July/August figures are averaged to give the July
figure as follows:

(17 003 + 17 008)/2 = 17 005.5
This centring problem always arises when the length of the moving average
process is an even number. An alternative to having to centre the data is to use
a 13-month moving average. This gives virtually identical results but it seems
more natural to use a 12-month average for monthly data.
Column 4 of Table 11.2 is equal to column 3 divided by 12 and so gives the
average of 12 consecutive observations and this is the moving average series.
Comparison of the original data with the smoothed series shows the latter to
be free of the short-term ﬂuctuations present in the former. The two series are
graphed together in Figure 11.3.
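The centring arithmetic of Table 11.2 can be scripted; the sketch below computes the 12-month totals, centres adjacent pairs, and divides by 12 (index 0 is January 2004):

```python
# Centred 12-month moving average of the Table 11.1 data (000s).

u = [1401, 1430, 1425, 1380, 1389, 1427, 1462, 1466, 1445, 1422, 1383, 1373,  # 2004
     1406, 1405, 1397, 1379, 1392, 1433, 1482, 1509, 1552, 1556, 1525, 1494,  # 2005
     1539, 1589, 1602, 1615, 1649, 1718, 1761, 1773, 1753, 1701, 1662, 1645]  # 2006

totals = [sum(u[i:i + 12]) for i in range(len(u) - 11)]   # successive 12-month totals

ma = {}
for i in range(6, len(u) - 6):                 # six months lost at each end
    centred = (totals[i - 6] + totals[i - 5]) / 2
    ma[i] = centred / 12
```

With the data above this reproduces the moving average column of Table 11.2 (e.g. about 1417.1 for July 2004).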
The chart shows the upward trend clearly starting around January 2005 and
also reveals how this trend appears to level off at the end of 2006. Note also that
the trend appears to start levelling off around the middle of 2006 while actual
unemployment is increasing quite rapidly at that point. Actual unemployment
does not start to drop until September 2006. The moving average thus anticipates
the movements in unemployment. This is not really so surprising since future
values of unemployment are used in the calculation of the moving average
ﬁgure for each month. Note that for this chart we have used data from late 2003
and early 2007 to derive the moving average values at the beginning and end of
the period i.e. we have ﬁlled in the missing values in Table 11.2.
Figure 11.3
Unemployment and its
moving average
The moving average captures the trend and the cycle. How much of the cycle
is included is debatable – the longer the period of the moving average, the less
of the cycle is captured, i.e. the smoother the series in general. The difficulty in
disentangling the cyclical element is that it is unclear how long the cycle is, or
even whether it exists in the data. For the sake of argument we will assume that
our moving average fully captures both trend and cycle.
Exercise 11.1
Use the quarterly data below to calculate the four-quarter moving average series
and draw a graph of the two series for 2001–2004:

       Q1    Q2    Q3    Q4
2000   –     –     152   149
2001   155   158   155   153
2002   159   166   160   155
2003   162   167   164   160
2004   170   172   172   165
2005   175   179   –     –
Isolating seasonal factors
Having obtained the trend-cycle (henceforth we refer to this as the trend, for
brevity), the original data may be divided by the trend values from Table 11.2,
column 4, to leave only the seasonal and random components. This can be
understood by manipulating equation 11.2. Ignoring the cyclical component
we have

X = T × S × R    (11.3)

Dividing the original data series X by the trend values T therefore gives the
seasonal and random components:

X/T = S × R    (11.4)

Table 11.3 gives the results of this calculation.
The final column of the table shows the ratio of the actual unemployment
level to the trend value (again, the trend values for January–June 2004 and
July–December 2006 were calculated using data from outside the sample range,
which are not shown). The value for January 2004, 0.968, shows the unemployment
level in that month to be 3.2% below the trend. The July 2004 figure
is 3.2% above trend, etc. Other months’ figures can be interpreted in the same
way. Closer examination of the table shows that unemployment tends to be
above its trend in the summer months (July to September), below trend in
winter (November to January), and otherwise to be on or about the trend line.
The next task is to disentangle the seasonal and random components which
make up the ‘Ratio’ value in the final column of the table. We make the assumption
that the random component has a mean value of zero. Then if we average
the ‘Ratio’ values for a particular month the random components should
approximately cancel out, leaving just the seasonal component.
Hence the seasonal factor S can be obtained by averaging the three S × R components
(for 2004, 2005 and 2006) for each month. For example, for January the
seasonal component is obtained as follows:

S = (0.968 + 0.994 + 0.974)/3 = 0.979    (11.5)

The more years we have available entering this averaging process, the more
accurate is the estimate of the seasonal component. The seasonal component in
this case for January is therefore 0.979 − 1 = −0.021, or −2.1%, and there are
Table 11.3 Isolating seasonal and random components
Unemployment Trend Ratio
Jan 04 1401 1446.9 0.968
Feb 04 1430 1438.2 0.994
Mar 04 1425 1430.4 0.996
Apr 04 1380 1424.6 0.969
May 04 1389 1420.6 0.978
Jun 04 1427 1418.0 1.006
Jul 04 1462 1417.1 1.032
Aug 04 1466 1416.3 1.035
Sep 04 1445 1414.1 1.022
Oct 04 1422 1412.9 1.006
Nov 04 1383 1413.0 0.979
Dec 04 1373 1413.3 0.971
Jan 05 1406 1414.4 0.994
Feb 05 1405 1417.0 0.992
Mar 05 1397 1423.3 0.982
Apr 05 1379 1433.3 0.962
May 05 1392 1444.8 0.963
Jun 05 1433 1455.8 0.984
Jul 05 1482 1466.4 1.011
Aug 05 1509 1479.6 1.020
Sep 05 1552 1495.8 1.038
Oct 05 1556 1514.2 1.028
Nov 05 1525 1534.7 0.994
Dec 05 1494 1557.3 0.959
Jan 06 1539 1580.8 0.974
Feb 06 1589 1603.4 0.991
Mar 06 1602 1622.8 0.987
Apr 06 1615 1637.2 0.986
May 06 1649 1649.0 1.000
Jun 06 1718 1661.0 1.034
Jul 06 1761 1672.5 1.053
Aug 06 1773 1682.0 1.054
Sep 06 1753 1688.6 1.038
Oct 06 1701 1691.0 1.006
Nov 06 1662 1689.8 0.984
Dec 06 1645 1686.5 0.975
Note: The ‘Ratio’ column is simply unemployment divided by its trend value, e.g.
0.968 = 1401/1446.9.
negative, positive and slightly negative random components in 2004, 2005 and
2006 respectively.
Table 11.4 shows the calculation of the seasonal components for each month
using the method described above.
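Each seasonal factor is just the mean of that month's actual/trend ratios across the years; for example, for January (ratios from Table 11.3):

```python
# January seasonal factor: average of the 2004-2006 actual/trend ratios.

jan_ratios = [0.968, 0.994, 0.974]
S_jan = round(sum(jan_ratios) / len(jan_ratios), 3)   # 0.979
```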
Previous editions of this book calculated seasonal factors for earlier time periods,
and Table 11.5 provides a comparison of three time periods. First, it should be
stated that the definition has changed between 2004–2006 and the earlier decades
(which used the claimant count rather than the ILO definition), so one has to
be wary of the comparison. It is noticeable, however, that the pattern over the year
has changed considerably since the 1980s; indeed, it has approximately reversed
itself. This demonstrates that a seasonal pattern is not necessarily unchanging
over time and can be altered by factors such as changes in the law (unemployment
benefit entitlements, etc.) and the changing pattern of the labour market
in general. This also highlights the importance of the length of the moving
average process that is used. If one calculated this over the 22-year period in the
table, the seasonal effects from one decade might cancel out those from another.
Exercise 11.2
Using the data from Exercise 11.1, calculate the seasonal factors for each quarter.
Table 11.4 Calculating the seasonal factors
2004 2005 2006 Average
January 0.968 0.994 0.974 0.979
February 0.994 0.992 0.991 0.992
March 0.996 0.982 0.987 0.988
April 0.969 0.962 0.986 0.972
May 0.978 0.963 1.000 0.980
June 1.006 0.984 1.034 1.008
July 1.032 1.011 1.053 1.032
August 1.035 1.020 1.054 1.036
September 1.022 1.038 1.038 1.033
October 1.006 1.028 1.006 1.013
November 0.979 0.994 0.984 0.985
December 0.971 0.959 0.975 0.969
Table 11.5 Comparison of seasonal factors in different decades
1982–1984 1991–1993 2004–2006
January 1.042 1.028 0.979
February 1.033 1.028 0.992
March 1.019 1.022 0.988
April 1.009 1.021 0.972
May 0.983 0.997 0.980
June 0.963 0.980 1.008
July 0.982 1.006 1.032
August 0.983 1.018 1.036
September 1.001 1.006 1.033
October 0.992 0.979 1.013
November 0.997 0.982 0.985
December 1.002 1.004 0.969
Seasonal adjustment
Having found the seasonal factors, the original data can now be seasonally
adjusted. This procedure eliminates the seasonal component from the original
series, leaving only the trend, cyclical and random components. It therefore
removes the regular month-by-month differences and makes it easier to
compare one month directly with another. Seasonal adjustment is now simple: the
original data are divided by the seasonal factors shown in Table 11.4 above.
Equation 11.6 demonstrates the principle:

X/S = T × C × R    (11.6)
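Adjustment itself is then a single division per observation. A sketch for the first three months of 2004; note that using the rounded factors of Table 11.4 can shift a result by a unit relative to Table 11.6, which appears to have been computed with unrounded factors:

```python
# Seasonally adjust by dividing each value by its month's seasonal factor.

factors = {"Jan": 0.979, "Feb": 0.992, "Mar": 0.988}   # from Table 11.4
raw = {"Jan": 1401, "Feb": 1430, "Mar": 1425}          # from Table 11.1

adjusted = {m: round(raw[m] / factors[m]) for m in raw}
```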
Table 11.6 shows the calculation of the seasonally adjusted figures.
The final column of the table adds the official seasonally adjusted figures
available on Statbase (series MGSC). Although that uses slightly more sophisticated
Table 11.6 Seasonally adjusted unemployment
Unemployment Seasonal factor Seasonally S.A. series
adjusted series from Statbase
Jan 04 1401 0.979 1432 1434
Feb 04 1430 0.992 1441 1431
Mar 04 1425 0.988 1442 1437
Apr 04 1380 0.972 1419 1435
May 04 1389 0.980 1417 1439
Jun 04 1427 1.008 1415 1425
Jul 04 1462 1.032 1417 1407
Aug 04 1466 1.036 1415 1404
Sep 04 1445 1.033 1399 1398
Oct 04 1422 1.013 1403 1411
Nov 04 1383 0.985 1404 1423
Dec 04 1373 0.969 1417 1425
Jan 05 1406 0.979 1437 1444
Feb 05 1405 0.992 1416 1413
Mar 05 1397 0.988 1414 1417
Apr 05 1379 0.972 1418 1435
May 05 1392 0.980 1420 1438
Jun 05 1433 1.008 1421 1426
Jul 05 1482 1.032 1436 1426
Aug 05 1509 1.036 1456 1442
Sep 05 1552 1.033 1503 1503
Oct 05 1556 1.013 1536 1543
Nov 05 1525 0.985 1548 1566
Dec 05 1494 0.969 1542 1549
Jan 06 1539 0.979 1573 1578
Feb 06 1589 0.992 1601 1601
Mar 06 1602 0.988 1621 1627
Apr 06 1615 0.972 1661 1666
May 06 1649 0.980 1682 1687
Jun 06 1718 1.008 1704 1704
Jul 06 1761 1.032 1707 1699
Aug 06 1773 1.036 1711 1701
Sep 06 1753 1.033 1698 1697
Oct 06 1701 1.013 1679 1680
Nov 06 1662 0.985 1687 1696
Dec 06 1645 0.969 1698 1695
Note: The adjusted series is obtained by dividing the ‘Unemployment’ column by the
‘Seasonal factor’ column.
Figure 11.4
Unemployment and
seasonally adjusted
unemployment
methods of adjustment, the results are similar to those we have calculated.
Figure 11.4 graphs unemployment and the seasonally adjusted series.
Note that in some months the two series move in opposite directions. For
example, in November 2005 the unadjusted series showed a fall in unemployment
of about 2%, yet the adjusted series rises slightly, by about 0.8%. In other
words, the fall in unemployment was discounted, as unemployment usually falls
in November (compare the October and November seasonal factors) and this fall
was relatively small. Hence the correct conclusion was that unemployment was
still rising, and indeed the graph continues to rise through much of 2006.
Fitting a moving average to a series using Excel
Many software programs can automatically produce a moving average of a data series. Microsoft Excel does this using a 12-period moving average which is not centred but located at the end of the averaged values. For example, the average of the Jan–Dec 1991 figures is placed against December 1991, not between June and July as was done above. This cuts off 11 observations at the beginning of the period but none at the end. Figure 11.5 compares the moving averages calculated by Excel and by the centred moving average method described earlier.
The Excel method appears much less satisfactory: it is always lagging behind the actual series, in contrast to the centred method. However, it has the advantage that the trend value for the latest month can always be calculated.
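The contrast between the two placements can be sketched as follows. This is a hypothetical illustration, not the book's code: `trailing_ma` places each 12-month average against the last month in the window, as Excel does, while `centred_ma` averages two adjacent 12-month averages so that the result aligns with a calendar month:

```python
# Trailing vs centred 12-month moving averages, illustrated on a
# simple synthetic series of 36 "months".
def trailing_ma(x, window=12):
    # average of the window ending at each observation
    return [sum(x[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(x))]

def centred_ma(x, window=12):
    # average of two successive trailing averages -> centred on a month
    t = trailing_ma(x, window)
    return [(a + b) / 2 for a, b in zip(t, t[1:])]

series = list(range(1, 37))
print(trailing_ma(series)[:2])  # [6.5, 7.5]
print(centred_ma(series)[:2])   # [7.0, 8.0]
```

The trailing version loses 11 observations at the start and none at the end; the centred version loses roughly six at each end, which is why Excel's variant can always supply a trend value for the latest month.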
Figure 11.5: Chart of Excel moving average series

Chapter 11 • Seasonal adjustment of time-series data
Exercise 11.3 Again using the data from Exercise 11.1, construct the seasonally adjusted series for 2001–2004 and graph the unadjusted and adjusted series.
An alternative method for finding the trend
Chapter 9 on regression showed how a straight line could be fitted to a set of data as a means of summarising it. This offers an alternative means of smoothing data and finding a trend line. The dependent variable in the regression is unemployment, which is regressed on a time trend variable. This is simply measured 1, 2, 3, . . . , 36 and is denoted by the letter t. January 2004 is therefore represented by 1, February 2004 by 2, etc. Since the trend appears to be non-linear, however, a fitted linear trend is unlikely to be accurate for forecasting. The regression equation can be made non-linear by including a t² term, for example. For January 2004 this would be 1, for February 2004 it would be 4, etc. The equation thus becomes

X_t = a + bt + ct² + e_t    (11.7)

where e_t is the error term, which in this case is composed of the cyclical, seasonal and random elements of the series. The trend component is given by a + bt + ct².
The calculated regression equation is (calculation not shown)

X_t = 1416.1 − 4.15t + 0.390t² + e_t    (11.8)

The trend values for each month can easily be calculated from this equation by inserting the values t = 1, 2, 3, etc., as appropriate. January 2004, for example, is found by substituting t = 1 and t² = 1 into equation 11.8, giving

X_t = 1416.1 − 4.15 × 1 + 0.390 × 1² = 1412.29    (11.9)

which compares to 1446.88 using the moving average method. For July 2004 (t = 7) we obtain

X_t = 1416.1 − 4.15 × 7 + 0.390 × 7² = 1406.12    (11.10)

compared to the moving average estimate of 1417.13. The two methods give slightly different results, but not by a great deal.
The analysis can then proceed as before. The seasonal factors are calculated for each month and year by first dividing the actual value by the estimated trend value (hence 1401/1412.29 = 0.992 for January 2004) and then averaging the January values across the three years to give the January seasonal factor. This is left as an exercise (see Exercise 11.4 and Problem 11.5) and gives similar results to the moving average method. One final point to note is that the regression method has the advantage of not losing observations at the beginning and end of the sample period.
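The whole procedure — fit the quadratic trend by least squares, divide actual values by the trend, then average each month's ratio across the years — can be sketched like this. The helper is hypothetical, not from the book, and `numpy.polyfit` stands in for the regression calculation:

```python
import numpy as np

def quadratic_trend_factors(x, period=12):
    """Fit X_t = a + b*t + c*t^2 and return (trend, seasonal factors)."""
    t = np.arange(1, len(x) + 1)
    c, b, a = np.polyfit(t, x, 2)       # polyfit returns highest power first
    trend = a + b * t + c * t ** 2
    ratios = np.asarray(x) / trend      # actual / trend
    # average each calendar period's ratio across the years
    factors = [ratios[m::period].mean() for m in range(period)]
    return trend, factors
```

Applied to the 36 monthly unemployment figures with `period=12`, the ratios for January 2004, January 2005 and January 2006 would be averaged into the January factor, and so on for each month.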
a Using the data from Exercise 11.1 calculate a regression of X on t and t
2
and a
constant to ﬁnd the trend-cycle series. Use observations for 2001–2004 only for
the regression equation.
b Graph the original series and the calculated trend line.
c Use the new trend line to calculate the seasonal factors.
Exercise 11.4
Forecasting
It is possible to forecast future levels of unemployment using the methods outlined above. Each component of the series is forecast separately and the results multiplied together. As an example, the level of unemployment for January 2007 will be forecast.
The trend can only be forecast using the regression method, since the moving average method requires future values of unemployment, which is what is being forecast. January 2007 corresponds to time period t = 37, so the forecast of the trend by the regression method is

X_t = 1416.1 − 4.15 × 37 + 0.390 × 37² = 1796.9    (11.11)

The seasonal factor for January is 0.988, so the trend figure is multiplied by this, giving

1796.9 × 0.988 = 1776.2    (11.12)

The cyclical component is ignored and the random component set to a value of 1 (in the multiplicative model; zero in the additive model). This leaves 1776.2 as the forecast for January 2007. In the event, the actual figure was 1664, so the forecast is not very good, with an error of −6.3%. A chart of unemployment, the trend (using the regression method) and the forecast for the first six months of 2007 reveals the problem (see Figure 11.6).
The forecast relentlessly follows the trend line upwards. Because it is a quadratic trend (i.e. involving terms up to t²) it cannot predict a turning point, which seems to have occurred around the end of 2006. Nevertheless, the error in the forecast for January 2007 would alert observers to the likelihood that some kind of change has occurred and that unemployment is no longer following its upward trend.
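Equations 11.11 and 11.12 can be reproduced in a couple of lines. Note that using the rounded coefficients as printed in equation 11.8 gives a trend of 1796.46 rather than the text's 1796.9 (which comes from the unrounded regression output), and a forecast slightly below 1776.2:

```python
# Forecast January 2007: evaluate the quadratic trend at t = 37 and
# multiply by the January seasonal factor (multiplicative model).
a, b, c = 1416.1, -4.15, 0.390   # rounded coefficients, equation 11.8
january_factor = 0.988

t = 37
trend = a + b * t + c * t ** 2
forecast = trend * january_factor
print(round(trend, 2), round(forecast, 1))  # 1796.46 1774.9
```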
Exercise 11.5 Use the results of Exercise 11.4 to forecast the values of X for 2005Q1 and 2005Q2. How do they compare to the actual values?

Figure 11.6: Forecasting unemployment
Further issues
As stated earlier ofﬁcial methods of seasonal adjustment are more sophisticated
than those shown here though with similar results. The main additional
features that we have omitted are as follows:
● Ad hoc adjustments – the original data may be ‘incorrect’ for an obvious reason. A strike, for example, might lower output in a particular month. This not only gives an atypical figure for the month but will also affect the calculation of the seasonal factors. Hence such an observation might be corrected in some way before the seasonal factors are calculated.
● Calendar effects – months are not all of the same length, so retail sales, for example, might vary simply because there are more shopping days, especially if a month happens to have five weekends. Overseas trade statistics are routinely adjusted for the number of days in each month. Easter is another problem because it is not on a regular date and so can affect monthly figures in different ways depending on where it falls.
● Forecasting methods – the trend is calculated by a mixture of regression and moving average methods, avoiding some of the problems exhibited above when forecasting.
The above analysis has taken a fairly mechanical approach to the analysis of time series and has not sought the reasons why the data might vary seasonally. The seasonal adjustment factors are simply used to adjust the original data for regular monthly effects, whatever the cause. Further investigation of the causes might be worthwhile, as it might improve forecasting. For example, unemployment varies seasonally because of, among other things, greater employment opportunities in summer (e.g. deckchair attendants) and school leavers entering the register in September. The availability of summer jobs might be predictable (based on forecasts of the number of tourists, the weather, etc.) and the number of school leavers next year can presumably be predicted from the number of pupils at present in their final year. These sorts of considerations should provide better forecasts than blindly following the rules set out above.
Using adjusted or unadjusted data
Seasonal adjustment can introduce problems into data analysis as well as resolve them. Although seasonal adjustment can help in interpreting figures, if the adjusted data are then used in further statistical analysis they can mislead. It is well known, for example, that seasonal adjustment can introduce a cyclical component into a data series which originally had no cyclical element to it. This occurs because a large random deviation from the trend will enter the moving average process for 12 different months (or whatever is the length of the moving average process), and this tends to turn occasional random disturbances into a cycle. Note also that the adjusted series will start to rise before the random shock in these circumstances.
The question then arises as to whether adjusted or unadjusted data are best used in, say, regression analysis. Use of unadjusted data means that the coefficient estimates may be contaminated by the seasonal effects; using adjusted
STFE_C11.qxd 26/02/2009 09:18 Page 400

slide 418:

Summary
401
data runs into the kind of problems outlined above. A suitable compromise is to use unadjusted data with seasonal dummy variables. In this case the estimation of parameters and seasonal effects is dealt with simultaneously and generally gives the best results.
A further advantage of this regression method is that it allows the significance of the seasonal variations to be established. An F-test for the joint significance of the seasonal coefficients will tell you whether any of the seasonal effects are statistically significant. If not, seasonal dummies need not be included in the regression equation.
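The compromise can be sketched as a regression of the unadjusted series on a time trend plus quarterly dummies. This is a hypothetical illustration using plain least squares (the book itself works with standard regression output); quarter 1 is the base category:

```python
import numpy as np

def fit_with_seasonal_dummies(x):
    """Regress x on a constant, a time trend and Q2-Q4 dummies."""
    n = len(x)
    t = np.arange(1, n + 1)
    const = np.ones(n)
    # dummy for quarter q: 1 when the observation falls in quarter q
    d2 = (t % 4 == 2).astype(float)
    d3 = (t % 4 == 3).astype(float)
    d4 = (t % 4 == 0).astype(float)
    A = np.column_stack([const, t, d2, d3, d4])
    coefs, *_ = np.linalg.lstsq(A, np.asarray(x, float), rcond=None)
    return coefs  # [intercept, trend, Q2, Q3, Q4]
```

The estimated dummy coefficients measure each quarter's deviation from the Q1 baseline, so trend and seasonal effects are estimated simultaneously, and their joint significance can then be tested with an F-test.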
Finally, it should be remembered that decomposing a time series is not a clear-cut procedure. It is often difficult to disentangle the separate effects, and different methods will give different results. The seasonally adjusted unemployment figures given in Statbase are slightly different from the series calculated here, due to slightly different techniques being applied. The differences are not great, and the direction of the seasonal effects is the same even if the sizes are slightly different.
Summary
● Seasonal adjustment of data allows us to see some of the underlying features, shorn of the distraction of seasonal effects such as the Christmas effect on retail sales.
● The four components of a time series are the trend, the cycle, the seasonal component and the random residual.
● These components may be thought of either as being multiplied together or added together to make up the series. The former method is more common.
● The trend (possibly mixed with the cycle) can be identified by the method of moving averages or by the use of a regression equation.
● Removing the trend-cycle values from a series leaves only the seasonal and random components.
● The residual component can then be eliminated by averaging the data over successive years (e.g. take the average of the January seasonal and random components over several years).
● Having isolated the seasonal effect in such a manner, it can be eliminated from the original series, leaving the seasonally adjusted series.
● Knowledge of the seasonal effects can be useful in forecasting future values of
the series.
Key terms and concepts

additive model
calendar effects
cycle
forecasting
moving average
multiplicative model
random residual
seasonal adjustment
seasonal component
trend
trend regression
Problems
Some of the more challenging problems are indicated by highlighting the problem
number in colour.
11.1 The following table contains data for consumers’ non-durables expenditure in the UK, in constant 2003 prices.
Q1 Q2 Q3 Q4
1999 – – 153 888 160 187
2000 152 684 155 977 160 564 164 437
2001 156 325 160 069 165 651 171 281
2002 161 733 167 128 171 224 176 748
2003 165 903 172 040 176 448 182 769
2004 171 913 178 308 182 480 188 733
2005 175 174 180 723 184 345 191 763
2006 177 421 183 785 187 770 196 761
2007 183 376 188 955 – –
a Graph the series and comment upon any apparent seasonal pattern. Why does it occur?
b Use the method of centred moving averages to ﬁnd the trend values for 2000–2006.
c Use the moving average figures to find the seasonal factors for each quarter (use the multiplicative model).
d By approximately how much does expenditure normally increase in the fourth quarter?
e Use the seasonal factors to obtain the seasonally adjusted series for non-durable
expenditure.
f Were retailers happy or unhappy at Christmas in 2000? How about 2006?
11.2 Repeat the exercise using the additive model. In Problem 11.1(c) above, subtract the moving average figures from the original series. In (e), subtract the seasonal factors from the original data to get the adjusted series. Is there a big difference between this and the multiplicative model?
11.3 The following data relate to car production in the UK (not seasonally adjusted).
2003 2004 2005 2006 2007
January – 141.3 136 119.1 124.2
February – 141.1 143.5 131.2 115.6
March – 163 153.3 159 138
April – 129.6 139.8 118.6 120.4
May – 143.1 132 132.3 127.4
June – 155.5 144.3 139.3 137.5
July 146.3 140.5 130.2 117.8 129.7
August 91.4 83.2 97.1 73 –
September 153.5 155.3 149.9 122.3 –
October 153.4 135.1 124.8 116.1 –
November 142.9 149.3 149.7 128.6 –
December 112.4 109.7 95.3 84.8 –
a Graph the data for 2004–2006 by overlapping the three years, as was done in Figure 11.2, and comment upon any seasonal pattern.
b Use a 12-month moving average to ﬁnd the trend values for 2004–2006.
c Find the monthly seasonal factors (multiplicative method). Describe the seasonal pattern that emerges.
d By how much is the August production figure below the July figure, in general?
e Obtain the seasonally adjusted series. Compare it with the original series and comment.
f Compare the seasonal pattern found with that for consumers’ expenditure in
Problem 11.1.
11.4 Repeat Problem 11.3 using the additive model and compare results.
11.5 a Using the data of Problem 11.1, fit a regression line through the data using t and t² as explanatory variables (t is a time trend 1–36). Use only the observations from 2000–2006. Calculate the trend values using the regression.
b Calculate the seasonal factors (multiplicative model) based upon this trend. How do they compare to the values found in Problem 11.1?
c Predict the value of consumers’ expenditure for 2007 Q4.
d Calculate the seasonal factors using the additive model and predict consumers’ expenditure for 2007 Q4.
11.6 a Using the data from Problem 11.3 (2004–2006 only), fit a linear regression line to obtain the trend values. By how much, on average, does car production increase per year?
b Calculate the seasonal factors (multiplicative model). How do they compare to the values in Problem 11.3?
c Predict car production for April 2007.
11.7 A computer will be needed to solve this and the next problem.
a Repeat the regression equation from Problem 11.5 but add three seasonal dummy variables, for quarters 2, 3 and 4, to the regressors. The dummy for quarter 2 takes the value 1 in Q2, 0 in the other quarters. The Q3 dummy takes the value 1 in Q3, 0 otherwise, etc. How does this affect the coefficients on the time trend variables? Use data for 2000–2006 only.
b How do the t-ratios on the time coefficients compare with the values found in Problem 11.5? Account for the difference.
c Compare the coefficients on the seasonal dummy variables with the seasonal factors found in Problem 11.5(d). Comment on your results.
11.8 a How many seasonal dummy variables would be needed for the regression approach to the data in Problem 11.3?
b Do you think the approach would bring as reliable results as it did for consumers’ expenditure?
11.9 Project: Obtain quarterly unadjusted data for a suitable variable (some suggestions are given below) and examine its seasonal pattern. Write a brief report on your findings. You should:
a Say what you expect to find, and why.
b Compare different methods of adjustment.
c Use your results to try to forecast the value of the variable at some future date.
d Compare your results, if possible, with the ‘official’ seasonally adjusted series.
Some suitable variables are: the money stock, retail sales, rainfall, interest rates, house prices.
And the series are graphed as follows:

Exercise 11.4
a The regression equation is X = 153.9 + 0.85t + 0.01t² (note: the coefficient on t² is very small, so this is virtually a straight line).
b [Graph of the original series and the calculated trend line]
Answers to exercises
c The seasonal factors are calculated as follows:
Quarter X Predicted X Ratio SA factor
2001 Q1 155 154.737 1.002 1.006
2001 Q2 158 155.617 1.015 1.026
2001 Q3 155 156.518 0.990 1.001
2001 Q4 153 157.440 0.972 0.967
2002 Q1 159 158.382 1.004 1.006
2002 Q2 166 159.345 1.042 1.026
2002 Q3 160 160.329 0.998 1.001
2002 Q4 155 161.333 0.961 0.967
2003 Q1 162 162.358 0.998 1.006
2003 Q2 167 163.404 1.022 1.026
2003 Q3 164 164.470 0.997 1.001
2003 Q4 160 165.557 0.966 0.967
2004 Q1 170 166.665 1.020 1.006
2004 Q2 172 167.793 1.025 1.026
2004 Q3 172 168.942 1.018 1.001
2004 Q4 165 170.112 0.970 0.967
Exercise 11.5
Substituting t = 17 and t = 18 into the regression equation gives predicted values of 171.302 and 172.513 for Q1 and Q2 respectively. Multiplying by the relevant seasonal factors (1.006 and 1.026) gives 172.304 and 177.005. These are close to, but slightly below, the actual values. The errors are 1.6% and 1.1% respectively.
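The arithmetic can be checked with a short sketch. It uses the rounded coefficients quoted in Exercise 11.4(a), so the trend values come out marginally below the 171.302 and 172.513 above, which were computed from the unrounded regression output:

```python
# Exercise 11.5: evaluate the fitted trend at t = 17, 18 and apply
# the Q1 and Q2 seasonal factors from the Exercise 11.4 answer.
a, b, c = 153.9, 0.85, 0.01          # rounded regression coefficients
factors = {"Q1": 1.006, "Q2": 1.026}

forecasts = {}
for t, q in [(17, "Q1"), (18, "Q2")]:
    trend = a + b * t + c * t ** 2
    forecasts[q] = trend * factors[q]
print({q: round(v, 2) for q, v in forecasts.items()})
# {'Q1': 172.27, 'Q2': 176.92}
```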
Important formulae used in this book
Formula | Description | Notes

μ = Σx/N | Mean of a population | Use when all individual observations are available
μ = Σfx/Σf | Mean of a population | Use with grouped data. f represents the class frequencies
X̄ = Σx/n | Mean of a sample | n is the number of observations in the sample
X̄ = Σfx/Σf | Mean of a sample | Use with grouped data
m = x_L + (x_U − x_L) × {(N + 1)/2 − F}/f | Median, where data are grouped | x_L and x_U represent the lower and upper limits of the interval containing the median. F represents the cumulative frequency up to (but excluding) the interval
σ² = Σ(x − μ)²/N | Variance of a population |
σ² = Σf(x − μ)²/Σf | Population variance (grouped data) |
s² = Σ(x − X̄)²/(n − 1) | Sample variance |
s² = Σf(x − X̄)²/(n − 1) | Sample variance (grouped data) |
c.v. = σ/μ | Coefficient of variation |
z = (x − μ)/σ | z-score | Measures the distance from observation x to the mean μ, measured in standard deviations
Σf(x − μ)³/(Nσ³) | Coefficient of skewness |
g = (x_T/x_1)^(1/(T−1)) − 1 | Rate of growth | Measures the average rate of growth between years 1 and T
G = (Πx)^(1/n) | Geometric mean of n observations on x |
P_n^L = (Σp_n q_0 / Σp_0 q_0) × 100 | Laspeyres price index for year n with base year 0 |
P_n^L = Σ(p_n/p_0 × s_0) × 100 | Laspeyres price index using expenditure weights s |
P_n^P = (Σp_n q_n / Σp_0 q_n) × 100 | Paasche price index for year n |
P_n^P = (1 / Σ(p_0/p_n × s_n)) × 100 | Paasche price index using expenditure weights s |
Q_n^L = (Σq_n p_0 / Σq_0 p_0) × 100 | Laspeyres quantity index |
Q_n^P = (Σq_n p_n / Σq_0 p_n) × 100 | Paasche quantity index |
E_n = (Σp_n q_n / Σp_0 q_0) × 100 | Expenditure index |
PV = S_t/(1 + r)^t | Present value | The value now of a sum S to be received in t years’ time, using discount rate r
NPV = −S_0 + Σ S_t/(1 + r)^t | Net present value | The value of an investment S_0 now, yielding S_t per annum, discounted at a constant rate r
nCr = n!/(r!(n − r)!) | Combinatorial formula | n! = n × (n − 1) × . . . × 1
P(r) = nCr × P^r × (1 − P)^(n−r) | Binomial distribution | In shorthand notation, r ~ B(n, P)
P(x) = (1/√(2πσ²)) × e^(−½((x − μ)/σ)²) | Normal distribution | In shorthand notation, x ~ N(μ, σ²)
X̄ ± 1.96 × s/√n | 95% confidence interval for the mean | Large samples, using Normal distribution
X̄ ± t_v × s/√n | 95% confidence interval for the mean | Small samples, using t distribution. t_v is the critical value of the t distribution for v = n − 1 degrees of freedom
p ± 1.96 × √(p(1 − p)/n) | 95% confidence interval for a proportion | Large samples only
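As a quick illustration of the index-number formulae, here is a sketch with made-up two-good data (not from the book) comparing the Laspeyres index, which uses base-year quantity weights, with the Paasche index, which uses current-year quantity weights:

```python
# Laspeyres: sum(pn*q0)/sum(p0*q0) * 100 -- base-year quantity weights
# Paasche:   sum(pn*qn)/sum(p0*qn) * 100 -- current-year quantity weights
p0, pn = [2.0, 5.0], [3.0, 6.0]    # base- and current-year prices
q0, qn = [10, 4], [12, 3]          # base- and current-year quantities

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

laspeyres = dot(pn, q0) / dot(p0, q0) * 100
paasche = dot(pn, qn) / dot(p0, qn) * 100
print(round(laspeyres, 1), round(paasche, 1))  # 135.0 138.5
```

The two indices differ because consumers shift quantities towards goods whose relative prices fall, so the Paasche index typically weights price rises less heavily.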