Applied Business Statistics (2) : Applied Business Statistics (2) 8 September 2007, SUFE/Webster University- Shanghai I really have no clue what this Webster professor is talking about…
Agenda 8 September 2007 : Agenda 8 September 2007 Your feed back on 1 September assignments
Amazing Census-statistic
Visual Presentation statistics your S&P 500 company
X-Y Correlation on your company
Share price of your company
Share price sampling assignment
ATM experiment
Finishing introduction ch. 1-3 last week
Combinations and Permutations
Binomial Distribution
Poisson Distribution
Introduction textbook: continuous probability distributions
Normal Distribution
Sample size
Hypothesis Testing
Assignments for next week
In the news
AOB
Agenda 8 September 2007 : Agenda 8 September 2007 Let’s look at your assignments:
Your Team-Your Company
US Census bureau; the wow effect…
Forecasting the share price…
3) Experiment ATM machines- testing Poisson…
Probability of hole in one?
Look up and experience assignment : Look up and experience assignment Discover the census…
You may want to click around in the data preferably business sections
Find something odd that you did not know, so you can create the “wow” effect in your class
Present next week!
NYSE: New York Stock Exchange : NYSE: New York Stock Exchange As an investor you would like to predict share price movements
If you were successful you can become a billionaire!
Businesses need to have interest rate visions, inflation forecasts, currency forecasts….
We are all trying to do better than the market…
The market…(Sep 2006-Sep 2007) : The market…(Sep 2006-Sep 2007)
Share price Sampling assignment : Share price Sampling assignment Observe the share price of your company over the past 5 years
Take a random sample of share prices (closings per date)
Analyse your findings
Based on your findings, can you determine a relation between the date/period and the share price?
Now calculate the delta of share price movements over the past 5 years.
Take again a random sample (not the same)
Can you determine a relation between the delta share movement and the date/period of the year?
Experiment assignment: Bank’s ATM : Experiment assignment: Bank’s ATM Watch an ATM for 5 minutes (T) and figure out for what % of these 5 minutes the number of people arriving at the machine was 0, 1, 2, 3, etc.
Repeat the experiment say 5 times
Now make a discrete probability distribution based on your findings
Present your results in class
Name the place/location of the machine, the date and the specific time measured (15:03-15:08)
Do not cooperate with classmates; the more registrations we have the better….
The theoretical distribution that should describe this well is the so-called Poisson distribution!
Counting-permutations and combinations : Counting-permutations and combinations If event A can happen in n1 ways and event B can happen in n2 ways
Then event A and B can happen in n1*n2 ways
If k independent events (A, B, C, etc.) can each happen in n different ways, then together they can happen in n*n*n*… ways, or n^k.
Simple multiplication rule : Simple multiplication rule A restaurant offers 10 different entrees, 25 different main courses and 5 different desserts
How many different meals can you arrange from these?
10*25*5= 1250 …
If you think this is too hard to understand immediately, you can see this by taking a simple example…
Example : Example A European car license plate consists of 6 characters, each a letter or a digit
Assume that you can use all letters and all 10 digits
Your company is responsible for issuing these license plates
Will you have enough possibilities to make sure you will generate unique plates for now and the future…
Calculate the number of different licence plates
The Netherlands currently has about 6.3 mln. cars
GO!
Your answer : Your answer 26 letters and 10 digits give 36 possible characters for each of the 6 positions on the license plate
The events are independent
There are 36 (place 1) * 36 (place 2) * … * 36 (place 6) = 36^6 = 2,176,782,336 possibilities… This is about 4 times the number of vehicles in the entire world!
Maybe they are overdoing it?
Simple multiplication rule… : Simple multiplication rule… Offer a sandwich on white or brown bread
With Ham, cheese or peanut butter
How many different sandwiches?
White with Ham, White with Cheese, White with Peanut…
Brown with Ham, Brown with Cheese, Brown with Peanut
So in total 6 different sandwiches
Or 2*3=6
Factorial (n!) : Factorial (n!) Sometimes choosing one option limits the number of choices left for subsequent events. This is the case with permutations: the number of different ways in which objects can be arranged.
The number of permutations of n objects taken r at a time is n!/(n-r)!, where n! = n*(n-1)*…*2*1
Example; warehouse location A can be filled with 4 containers, 6 containers are shipped in ; how many different ways are there to fill location A?
Go!
Well… : Well… Container loc 1 in A can be filled in 6 ways
Loc 2 in A in 5 ways if loc 1 is taken
Loc 3 in A in 4 ways if loc 1 and 2 are taken
Loc 4 in A in 3 ways if loc 1,2 and 3 are taken
Thus 6*5*4*3= 360
Or applying the permutation rule: n!/(n-r)! = 6!/(6-4)! = (6*5*4*3*2*1)/(2*1) = 360
An auditor has 9 audits to do : An auditor has 9 audits to do Tomorrow only 5 can be done
In how many different orders can tomorrow’s task be carried out?
Permutations: n!/(n-r)!=…
Calculate!
Well… : Well… 9!/(9-5)!=15120….
Combinations : Combinations Unlike permutations combinations consider only the possible sets of objects regardless of the order in which the members of the set are arranged
The number of combinations of n objects taken r at a time is:
n!/(r!*(n-r)!)
Determine the number of combinations for the containers in warehouse loc.A
Well… : Well… There are 15 different combinations of containers that can be stored in loc A.
n!/(r!*(n-r)!) = 6!/(4!*2!) = (6*5*4*3*2*1)/((4*3*2*1)*(2*1)) = 15
You may want to check the answer…by giving each container a colour
So. : So. If you try to make as many different combinations as possible from, let’s say, ABCD
And you want combinations of 3
ABC, BCD, ABD, CDA are the only combinations, or 4!/(3!*(4-3)!) = 4
Remember our auditor? : Remember our auditor? If the auditor does not consider the order in which tomorrow’s audits are carried out, how many combinations of 5 audits can he choose from?
n!/(r!*(n-r)!) = …
Calculate
Well… : Well… 9!/(5!*(9-5)!) = 126 combinations
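The counting examples so far can be checked with a few lines of Python. This is only a sketch; the variable names are made up for illustration, and `math.perm` and `math.comb` require Python 3.8 or later.

```python
import math

# Multiplication rule: independent choices multiply.
meals = 10 * 25 * 5      # entrees * main courses * desserts
plates = 36 ** 6         # 36 possible characters in each of 6 positions

# Permutations (order matters): n! / (n - r)!
container_orders = math.perm(6, 4)   # 6 containers, 4 locations
audit_orders = math.perm(9, 5)       # 9 audits, 5 done tomorrow

# Combinations (order ignored): n! / (r! * (n - r)!)
container_sets = math.comb(6, 4)
audit_sets = math.comb(9, 5)

print(meals, plates)
print(container_orders, audit_orders)
print(container_sets, audit_sets)
```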
Discrete probabilities.. : Discrete probabilities.. Toss a coin twice;
Outcomes H(heads), or T(tails)
P(H)=0.5 and P(T)=0.5
Possible outcomes of this experiment:
HH,TH,HT, or TT
P(H=0)=0.25, P(H=1)=0.5 and P(H=2)=0.25
Simple distribution follows as well as
E(H)=0*0.25+1*0.5+2*0.25=1
STDEV(H)=((0-1)^2*0.25+(1-1)^2*0.5+(2-1)^2*0.25)^0.5=(0.5)^0.5=0.707
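A minimal sketch of the same calculation in Python, assuming the distribution is stored as a dictionary of value and probability pairs (the helper names are illustrative):

```python
import math

# Number of heads in two fair coin tosses: value -> probability
dist = {0: 0.25, 1: 0.5, 2: 0.25}

def mean(d):
    # E(x) = sum of x * P(x)
    return sum(x * p for x, p in d.items())

def variance(d):
    # sum of (x - mu)^2 * P(x)
    mu = mean(d)
    return sum((x - mu) ** 2 * p for x, p in d.items())

mu = mean(dist)
sigma = math.sqrt(variance(dist))
print(mu, round(sigma, 3))
```

The same two helpers work for any discrete distribution whose probabilities add up to 1.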
Binomial Distribution (Bernoulli Process) : Binomial Distribution (Bernoulli Process)
There are 2 or more consecutive trials
In each trial there are just 2 possible outcomes “success” or “failure”
The trials are statistically independent (the outcome of any trial does not affect the outcome of another trial)
The probability of a success and failure stays the same for every trial
Discrete probability distributions : Discrete probability distributions For instance, arrivals at a bank’s ATM follow a discrete probability distribution, since the number of people arriving at the machine over a period of time t is a discrete variable (0, 1, 2, 3, …).
For discrete probability distributions : For discrete probability distributions The mean= µ=E(x)=∑xi*P(xi)
The variance is σ^2 = E[(x-µ)^2] = ∑(xi-µ)^2*P(xi)
The standard deviation is σ = variance^0.5
A salesman has contacted customers over many days and found that contacts who became customers (xi) are distributed like:
Binomial distribution : Binomial distribution Each trial of a binomial experiment has 2 possible outcomes, success or failure (hence “bi”)
Also called a Bernoulli process
Two or more consecutive trials
Each trial is a success or failure
The trials are mutually independent
The success rate remains the same over several trials
Do it now assignment… : Do it now assignment… Graph the distribution of contacts that became customers per day
Calculate the mean, variance and standard deviation
Your answers : Your answers Horizontal axis: 0, 1, 2, …, 6
Vertical axis: P(xi)
Mean: ∑ xi*P(xi) = 0*0.05+1*0.1+…+6*0.1 = 3.2 contacts on average become customers
Variance: (0-3.2)^2*0.05+(1-3.2)^2*0.1+…+(6-3.2)^2*0.1 = 2.66
Standard deviation: 2.66^0.5 = 1.63 contacts
Two distributions are discrete distributions:
The Binomial distribution
The Poisson distribution
Sony researchers assignment : Sony researchers assignment Sony has found that 60% of VCR owners know how to program their VCR but want to test this. At their service centre they select 3 VCR owners. What is the probability that 2 of these 3 are able to program their VCR? (S=success=60% and F=failure 40%)
How many different possible outcomes are there?
How many outcomes with 2 successes?
What is the cum probability calculated?
Well… : Well… Three outcomes contain exactly 2 successes (SSF, SFS and FSS), each with probability 0.6*0.6*0.4 = 0.144
Add up: 0.144 + 0.144 + 0.144 = 0.432
Formula Binomial distribution : Formula Binomial distribution The probability of having x successes in n trials:
P(x) = n!/(x!*(n-x)!) * p^x * (1-p)^(n-x), with p = P(success), x = number of successes and n = number of trials
Thus: 3!/(2!*(3-2)!) * (0.6)^2 * (0.4)^(3-2) = 0.432. The first factor is the number of combinations; (0.6)^2 covers the 2 successes and (0.4)^1 the 1 failure, together the joint probability of 2 times success and 1 time failure
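The binomial formula above translates directly into a small function. The name `binom_pmf` is only an illustrative choice; `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Sony example: 2 of 3 VCR owners can program their VCR, p = 0.6
p_two_of_three = binom_pmf(2, 3, 0.6)
print(round(p_two_of_three, 3))
```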
Binomial distribution characteristics : Binomial distribution characteristics Mean: µ = E(x) = expected number of successes = n*p
Variance: σ^2 = E[(x-µ)^2] = n*p*(1-p)
In which p = probability of success
n= number of trials
Class assignment: the lawsuit : Class assignment: the lawsuit Kellogg’s is subject to a lawsuit concerning the use of non-biodegradable packaging…
The trial will be by jury; Kellogg’s chief counsel believes that the success of the defense will depend highly on how many of the 9 jurors are corporate stockholders…
We use the binomial distribution for n=9
The jurors are selected from a large county in which 20% of the adults own stocks (π=0.20)
What is the probability that the jury will include at least 3 stockholders (successes)?
What is the probability that the majority of the jury will be stockholders?
Using the binomial distribution: based on these findings, should the legal counsel aim his legal arguments at Kellogg’s shareholders?
Your answer… : Your answer… Follow the binomial distribution for n=9 and k=3 or higher
π=0.20; 20% of adults hold stocks
Add P(k=3)+P(k=4)+ etc.; what is the cumulative result?
Majority means k=5 or higher; what is that cumulative result?
So?
You use the binomial distribution (table B at the back of your book) : You use the binomial distribution (table B at the back of your book) Take the n=9 position of the table (the jury consists of 9 persons)
The success rate is 20% (π)
The probabilities are as shown:
We are looking for k=3 or higher, so we add P(x=3)+P(x=4)+…+P(x=9) = 0.2618 = 26.18%
The probability that k=5 or higher is even smaller: 1.96%
The legal counsel can save his breath towards the shareholders and come up with something better…
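Instead of reading the table, the two cumulative probabilities for the jury can be summed directly. This is a sketch using the binomial formula from the earlier slide; `binom_pmf` is an illustrative helper name.

```python
from math import comb

def binom_pmf(k, n, p):
    # P(exactly k successes in n trials with success probability p)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 9, 0.20   # 9 jurors, 20% of adults own stocks

# At least 3 stockholders: k = 3, 4, ..., 9
at_least_3 = sum(binom_pmf(k, n, p) for k in range(3, n + 1))

# A majority of stockholders: k = 5, 6, ..., 9
majority = sum(binom_pmf(k, n, p) for k in range(5, n + 1))

print(round(at_least_3, 4), round(majority, 4))
```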
Class assignment: City Wayne : Class assignment: City Wayne In city Wayne there are 41,636 residents registered
We know that in this city 20% of the residents were born in Wayne
We will take 5 trials and pick 5 residents
What is the probability that of these 5 residents 0, 1, 2, 3, 4 or 5 residents were born in Wayne? (P(0), P(1), P(2), etc.)
What is the probability that out of these 5 residents 3 or more were born in Wayne? (P(x>=3))
Using the binomial formula… : Using the binomial formula… P(0) = 5!/(0!*(5-0)!) * 0.2^0 * 0.8^5 = 0.328
Similarly:
P(1)= 0.4096
P(2)= 0.2048
P(3)= 0.0512
P(4)= 0.0064
P(5)= 0.0003
Check the binomial table in your book for n=5 and P(success)=0.20….
P(x>=3)=P(3)+P(4)+P(5)= 0.0579 (5.8%)
Using Excel…for Binomial… : Using Excel…for Binomial… Open statistical Fx (functions)
Open BINOMDIST…function
Fill in the number of successes you want the probability of (for instance, 1 out of 5 persons is from Wayne)
Fill in the number of trials (in our case 5)
Fill in the probability of success (in our case 0.20)
We are looking for P(1), so not for the cumulative probability (put FALSE in the last box)
Excel calculates P(1)=0.4096…. (try!)
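For those without Excel at hand, the same numbers can be reproduced in Python. The function `binomdist` below is a hypothetical helper written for this sketch; it only mimics the argument order of Excel's BINOMDIST (number of successes, trials, probability of success, cumulative flag).

```python
from math import comb

def binomdist(number_s, trials, probability_s, cumulative):
    """Hypothetical stand-in for Excel's BINOMDIST, same argument order."""
    def pmf(x):
        return comb(trials, x) * probability_s ** x * (1 - probability_s) ** (trials - x)
    if cumulative:
        # P(X <= number_s)
        return sum(pmf(x) for x in range(number_s + 1))
    return pmf(number_s)

# Full distribution for 5 residents, 20% born in Wayne
probs = [binomdist(x, 5, 0.20, False) for x in range(6)]
for x, p in enumerate(probs):
    print(x, round(p, 4))

# P(x >= 3) as the complement of the cumulative P(x <= 2)
p_three_or_more = 1 - binomdist(2, 5, 0.20, True)
print(round(p_three_or_more, 4))
```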
The Poisson distribution : The Poisson distribution Say you want to describe;
Customer arrivals at a service point
Defects in manufactured material
Number of work related deaths
The Poisson distribution is a family of distributions with a shape determined by its mean λ (lambda)
The probability that a random event will occur exactly x times over a given span of time t is:
P(x) = (λ^x * e^(-λ))/x!, with λ = E(x) = the mean, e = the mathematical constant 2.71828 (the base of the natural logarithm) and e^(-λ) = 1/(2.71828)^λ
Note that for the Poisson distribution the mean=variance= λ
Arrival time analysis : Arrival time analysis Customers at a service counter
Ambulances
Queues Post office
ATM arrival
Any counter arrivals
We now discuss random variables of the discrete type…
Class assignment: Birth rates… : Class assignment: Birth rates… In an urban district the number of births is expected to be the same as last year (when 438 children were born), an average of 438/365 days = 1.2 per day
Daily births are distributed according to the Poisson distribution…
For any given day what is the probability that no children are born in the district?
And what is the probability that no more than 1 birth will occur on a given day?
Follow the Poisson distribution formula… : Follow the Poisson distribution formula… P(x) = (λ^x * e^(-λ))/x!
λ = 1.2, the mean birth rate per day
We are looking for P(x=0) so:
(1.2^0 * 2.71828^(-1.2))/0! = 0.3012
Similarly for P(1), P(2) etc.
P(1)=0.3614
P(2)=0.2169
P(3)=0.0867
P(4)=0.026
P(5)=0.0062
P(X<=1)= P(0)+P(1)= 0.3012+ 0.3614= 0.6626 (66.26%)
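The Poisson probabilities for the birth example can be reproduced with a short function. This is a sketch; `poisson_pmf` is an illustrative name, and λ = 1.2 is the rounded mean used on the slide.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(x) = lam^x * e^(-lam) / x!
    return lam ** x * exp(-lam) / factorial(x)

lam = 1.2   # mean number of births per day (438 / 365, rounded)

for x in range(6):
    print(x, round(poisson_pmf(x, lam), 4))

# No more than 1 birth on a given day
p_at_most_1 = poisson_pmf(0, lam) + poisson_pmf(1, lam)
print(round(p_at_most_1, 4))
```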
Using Poisson tables… : Using Poisson tables… Table will indicate value of λ
For x=1,2,3 etc. the probabilities can be read directly from the table (try)
P(0)= 0.3012 in table
P(8)= almost 0
After that the table “stops”, because the remaining probabilities are negligibly small (though not exactly zero)
Using Excel…for Poisson… : Using Excel…for Poisson… Open statistical Fx (functions)
Open POISSON function
Enter number of occurrence (lets say 0)
Enter the mean (1.2 in our case)
We want the individual probability not the cum. So enter false in the logic box
Excel calculates the P(0)…as
…. (try!)
Testing Poisson at ATM assignment : Testing Poisson at ATM assignment Gather the data of your class mates
Figure out what the mean is of your combined research
Define the Poisson distribution
Calculate variance and standard deviation
Graph this distribution
Test your findings with new observations as next week’s assignment
Poisson for Riksa’s arrivals?
Probability distributions for continuous random variables… : Probability distributions for continuous random variables…
Variable x can take on any value or range of values on a continuum
Its probability density function is defined as f(x)
The most famous and widely used distribution in this category is the Normal distribution
Is the clarity of a diamond Normally distributed?
We are using the N distribution : We are using the N distribution If we are looking for ranges of values (a value interval) in an area that basically is infinite
We search for these values under the bell shaped curve
The total area under the curve represents probability 1 (100%)
The interval under the curve defines an area that we can find the probability for in the N distribution table…
The Normal distribution is a mathematical function… : The Normal distribution is a mathematical function… f(x) = 1/(σ*(2π)^0.5) * e^(-(x-µ)^2/(2σ^2)), with e = 2.71828, π = 3.14159, µ = the mean and σ = the standard deviation
So the shape of the distribution is defined by… : So the shape of the distribution is defined by… The mean
The standard deviation
Let us now look at some examples from real life…
What are the µ and σ of their performance?
The General Aviation Association : The General Aviation Association All single-engine piston aircraft with 4 or more seats fly on average about 130 hours per annum
The σ = 30 hours
Between how many flying hours per annum do 95.5% of these aircraft fall?
Between how many flying hours per annum do 99.7% of these aircraft fall?
The skilled statistician recognizes : The skilled statistician recognizes 95,5% is 2σ around the mean
The mean is 130 hrs. and σ= 30 hrs.
So 95,5% of the planes are in the interval 130 hrs plus/minus 2*30 hrs. or between 70 hrs. and 190 hrs.
And 99.7% is the 3σ interval, so this % of planes falls within the range 130 hrs. plus/minus 3*30 hrs., or between 40 and 220 hrs. There are almost no planes outside this range…
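The 2σ and 3σ rules can be verified with Python's standard library. This is a sketch that assumes, as the slide does, that flying hours are normally distributed; the helper names are illustrative and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

flying_hours = NormalDist(mu=130, sigma=30)

def k_sigma_interval(dist, k):
    # Interval of k standard deviations around the mean
    return dist.mean - k * dist.stdev, dist.mean + k * dist.stdev

def coverage(dist, low, high):
    # Area under the curve between low and high
    return dist.cdf(high) - dist.cdf(low)

for k in (2, 3):
    low, high = k_sigma_interval(flying_hours, k)
    print(k, low, high, round(coverage(flying_hours, low, high), 4))
```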
Comparing coffee machines : Comparing coffee machines Jura produces the S90 and E75 machines. After 5 years of marketing these products and following their repair services Jura has found that these machines differ in annual output.
The S90 has a mean of 220 liters per annum of quality output without repairs. The E75 has a mean of 265 liters per annum. The variability of the S90 is higher than the variability of the E75. Jura has measured that σ of the S90 is 32 liters and that of the E75 is 21 liters.
Quickly shape these distributions….
Define the 1,2 and 3 sigma levels for both machines….
The standard normal distribution : The standard normal distribution If we were to follow all the different N distributions (all different combinations of mean and standard deviation), we would need a bible with tables to find the % probability under the bell curve
We can simplify this by taking the standard normal distribution (Z)
The z-score for a normal distribution is: z= (x-µ)/σ
Remember that we did this to compare the height of Jordan and Lobo?
What if the population is not normally distributed… : What if the population is not normally distributed… In many cases the population is not normally distributed or/and
We have no knowledge about its actual distribution
However, provided that the sample size we take is big enough (n>=30), the sampling distribution of the mean can still be assumed to be normal
This is what is known as the Central Limit Theorem…
This theorem specifies:
For large simple random samples from a population that is not normally distributed, the sampling distribution of the mean will be approximately normal, with its mean µ(s) equal to the population mean µ and its standard deviation being σ(s) = σ/n^0.5
If the sample size is increased, the sampling distribution will be closer to the normal distribution…
Reconsider the planes case : Reconsider the planes case The original mean was 130 hrs.; the z-score for 130 is (130-130)/30 = 0
Original 170 hours is z-score (170-130)/(30)= 1.33
And for say x=100 in the original distribution the z-score is (100-130)/30= - 1.00
For each value of z= -3,-2,-1,0,1,2,3 etc. there is an original x-value
Draw this distribution with 2 scales for resp. x and z…
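To fill in the two scales of the drawing, a tiny helper for converting between x and z may be handy. This is a sketch; the function names are illustrative.

```python
def z_score(x, mu, sigma):
    # Distance from the mean, measured in standard deviations
    return (x - mu) / sigma

def x_value(z, mu, sigma):
    # Inverse: the original x that belongs to a given z
    return mu + z * sigma

mu, sigma = 130, 30   # flying hours per annum

for x in (100, 130, 170):
    print(x, round(z_score(x, mu, sigma), 2))

for z in range(-3, 4):
    print(z, x_value(z, mu, sigma))
```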
Remember Poisson distribution? : Remember Poisson distribution? Poisson measures the probability that a discrete number of arrivals take place in a time period
The distribution that measures the time interval in between these arrivals is a continuous distribution and is thus the counterpart of the Poisson distribution
This continuous distribution is called the exponential distribution
For x = the length of the interval (time, distance etc.) between occurrences:
f(x) = λ*e^(-λx); both x and λ should be >0
λ = the mean (and the variance) of the Poisson distribution
1/λ = the mean and standard deviation of the corresponding exponential distribution
Exponential distribution-911 : Exponential distribution-911 We measure again surfaces under the curve of the distribution of F(x)
P(X>k) = e^(-λk), where k = the time, space or distance until the next occurrence…
Let us consider a real life case of the 911 calls in the US (112 Netherlands)
Calls to 911 in New York have been found to be Poisson distributed with an average of about 10 calls per hour…
If we measure in minutes what is λ and what is 1/λ of the corresponding exponential distribution?
What is the probability that the next call will occur at least 5 minutes from now? (not earlier than that)
Remember: P(X>k) = e^(-λk)
Your answer… : Your answer… 10 calls per hour means λ = 10/60 = 1/6 call per minute, so 1/λ = 6 minutes between calls on average
If k=5: P(x>5) = e^(-(1/6)*5) = 0.4346
If you calculate this for x=0,1,2,3,4,6,7,etc. you can draw this exponential distribution
The whole area under the curve is 1
The area k=5 or higher under the curve thus represents 43.46% of the whole area
Now calculate the probability that the next call will arrive between 3 and 8 minutes from now….
Use the same metrics : Use the same metrics P(x>3)=0.6065
P(x>8)=0.2636
Deduct: P(x>3)-P(x>8)=0.3429
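The 911 calculations can be checked as follows. This is a sketch; `prob_wait_longer` is an illustrative name for the survival function P(X>k) = e^(-λk).

```python
from math import exp

lam = 10 / 60   # 10 calls per hour = 1/6 call per minute

def prob_wait_longer(k, lam):
    # P(X > k): the next occurrence is more than k time units away
    return exp(-lam * k)

p_more_than_5 = prob_wait_longer(5, lam)

# Next call arrives between 3 and 8 minutes from now
p_between_3_and_8 = prob_wait_longer(3, lam) - prob_wait_longer(8, lam)

print(1 / lam)                       # mean minutes between calls
print(round(p_more_than_5, 4))
print(round(p_between_3_and_8, 4))
```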
Draw! Some exponential distributions for some λ’s
MTBF for laser printers : MTBF for laser printers HP sells laser printers and has found that the Mean Time Between Failure of the model shown is about 4000 hours
The corresponding distribution of x= hours between failures, is exponentially distributed
What is the probability that this machine will operate for another 2500 hours without experiencing a failure ?
What is the probability that the next failure will occur within the next 3000 hours ?
Your answer… : Your answer… MTBF = 4000 hours, so λ = 1/4000 = 0.00025 failures per hour…
P(X>2500) = e^(-(1/4000)*2500) = 2.71828^(-0.625) = 0.5353 = 53.53%
P(X<3000) = 1-P(X>3000) = 1-e^(-(1/4000)*3000) = 1-0.4724 = 0.5276 = 52.76%
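The same two printer probabilities in Python, as a sketch that assumes the exponential model stated on the slide (the variable names are illustrative):

```python
from math import exp

mtbf_hours = 4000
lam = 1 / mtbf_hours   # failures per hour

# Runs another 2500 hours without a failure: P(X > 2500)
p_survive_2500 = exp(-lam * 2500)

# Next failure within 3000 hours: P(X < 3000) = 1 - P(X > 3000)
p_fail_within_3000 = 1 - exp(-lam * 3000)

print(round(p_survive_2500, 4), round(p_fail_within_3000, 4))
```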
Moving to inferential statistics… : Moving to inferential statistics… Using sample data to learn about a population
Using information from samples to draw conclusions on the population is known as inferential statistics
Sampling distributions : Sampling distributions We are now looking at samples to learn about a population (mean and standard deviation)
If we take many samples we generate a series of means and standard deviations
This in itself is a new probability distribution with a mean and standard deviation… (droste effect)
For processes that are “in control” : For processes that are “in control” The means of samples (with n>30) should tend to be normally distributed with the mean of these samples being close to µ
The standard deviation of these sample means (the standard error) is then close to σ/n^0.5
There is a 95.5% probability that the sample mean will be within z=2 standard errors of the mean of the population
And a 99.73% probability that the sample mean will be within z=3 standard errors of the mean of the population
This rule is called the Central Limit Theorem
Olive Oil Filling machine : Olive Oil Filling machine Sold by the supplier with a process standard deviation of 0,1 ounces and is said to fill olive oil cans with 12 ounces of product
We take a sample at this machine with n=30
Between what weight levels does the 2-sigma interval lie, i.e. the interval in which 95.5% of the sample means on this machine fall?
Your answer… : Your answer… 12 ounces + or – 2*(σ/n^ 0,5 )=
12 ounces + or – 2*(0,1/30^0,5)= from 11,963 ounces to 12.037 ounces
If the sample means fall consistently outside this interval then the machine has drifted off from specs! Call your supplier…
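A short sketch of the control interval for the filling machine, using the supplier's figures from the slide (variable names are illustrative):

```python
from math import sqrt

target = 12.0     # ounces per can, according to the supplier
sigma = 0.1       # process standard deviation in ounces
n = 30            # sample size

standard_error = sigma / sqrt(n)

# 2-sigma interval for the sample mean (about 95.5% of sample means)
low = target - 2 * standard_error
high = target + 2 * standard_error

print(round(low, 3), round(high, 3))
```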
Remember the 1 engine planes? : Remember the 1 engine planes? Say that we now take samples of 1 engine planes and n=36
Remember that we think µ= 130 hrs.
The sample mean is 138 hrs.
Remember that assumed was σ=30 hrs.
So the standard deviation of the sample mean (the standard error) is?
The corresponding z-score is?
What is the probability that the average flying time of the planes in the sample (36) was at least 138 hrs.?
Draw the distribution!
Your answer… : Your answer… Standard deviation of the sample mean = σ/n^0.5
So that is: 30 hrs./36^0.5 = 5 hrs.
Z-score = (sample mean - pop. mean)/st. dev. of the sample mean = (138-130)/5 = 1.6
Z-Table at z=1.6 gives 0.4452, which is the area between 0 (the mean of the z distribution) and z=1.6
Thus the area beyond z=1.6 is the remainder of the upper half: 0.5-0.4452 = 0.0548 = 5.48%
Draw so that µ=130 falls at the same point as z=0 and x=138 falls at the same point as z=1.6…
(Figure: normal curve for samples of n=36 planes, with x=130 at z=0 and x=138 at z=1.6; area .4452 between them and .0548 beyond z=1.6)
Using the standard normal distribution table: : Using the standard normal distribution table: Z=1.600 gives directly 0.4452
Using Excel:
Use NORMSDIST function
Enter z=1.6
Will provide the cumulative probability 0.9452
Subtract 0.5 (the area to the left of the mean) to find 0.4452…
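The table lookup and the Excel result can both be reproduced with the standard normal distribution in Python. This is a sketch; `statistics.NormalDist` requires Python 3.8 or later and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 130, 30        # assumed population mean and st. dev. (hours)
n, sample_mean = 36, 138

standard_error = sigma / sqrt(n)
z = (sample_mean - mu) / standard_error

cumulative = NormalDist().cdf(z)        # what NORMSDIST returns
area_mean_to_z = cumulative - 0.5       # what the z-table lists
area_beyond_z = 1 - cumulative          # P(sample mean >= 138)

print(standard_error, z)
print(round(cumulative, 4), round(area_mean_to_z, 4), round(area_beyond_z, 4))
```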
Degrees of freedom (Df) : Degrees of freedom (Df) Crucial for the understanding of distributions
Why divide by (n-1) instead of n while calculating the sample variance s^2
Suppose we consider the four data:
x1=10; x2=12; x3=16; x4=18
The sample mean of these data is:…
Given that the mean is known, how many of our data points are Free To Move?
So if we have 1 missing value, this one is predetermined if we know the mean value.
What happens if we are missing 2 values in the sample, say x3 and x4? We now know that x3+x4=34. If we choose a value for x3, then x4 is determined if the mean is known (this is assumed)
So when we have n data points and we know their mean, the mean acts as a restriction (deviations from the mean sum up to zero), leaving us with (n-1) degrees of freedom!
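The restriction can be made visible with the four data points from the slide. This is a sketch using Python's `statistics` module, whose `variance` divides by (n-1) and whose `pvariance` divides by n.

```python
from statistics import mean, variance, pvariance

data = [10, 12, 16, 18]
m = mean(data)

# Deviations from the mean always sum to zero
deviations = [x - m for x in data]
print(m, sum(deviations))

# If the mean is known, the last value is fixed by the other three
known = data[:3]
last_value = len(data) * m - sum(known)
print(last_value)

# Sample variance divides by n-1, population variance by n
print(variance(data), pvariance(data))
```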
Normally we do not know µ and σ : Normally we do not know µ and σ In this case we use the t-distribution also called Student distribution to calculate probabilities
t = (sample mean - µ)/(s/n^0.5), with s = the standard deviation of the sample
The mean of this distribution is 0
The distribution is more spread out than the Normal distribution but has the same bell shape (thicker tails)
To use the t-table you need to know the degrees of freedom, which is (n-1)
Sample employees : Sample employees n=90 employees from manufacturing
The average number of overtime hours last week in this group was: 8.46 hours
The sample’s standard deviation was: s=3.61 hrs.
What is the 98% confidence interval for the population mean?
Remember df=(n-1)=89
The confidence interval is centred on the sample mean and extends a distance of t*s/n^0.5 on either side
s and n are known! t can be found in the t-table
For a 98% confidence interval the 0.01 column is needed (for a 90% interval the 0.05 column); thus (1-confidence level)/2 is your column…
GO!
Your answer… : Your answer… Sample mean: 8,46 hours
t value in table at column 0,01 and df=89 is 2.369
s=3.61 hrs. and n=90
8.46 ± 2.369*3.61/90^0.5 = 8.46 ± 0.90
With 98% confidence the population mean lies between 7.56 and 9.36 hrs. of overtime
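A sketch of the interval calculation in Python. The standard library has no t-distribution, so the t value 2.369 is simply taken from the t-table as on the slide; the variable names are illustrative.

```python
from math import sqrt

n = 90
sample_mean = 8.46     # overtime hours
s = 3.61               # sample standard deviation
t_table_value = 2.369  # from the t-table: column 0.01, df = 89

margin = t_table_value * s / sqrt(n)
low, high = sample_mean - margin, sample_mean + margin

print(round(margin, 2), round(low, 2), round(high, 2))
```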
What Sample size… : What Sample size… Remember the rule of thumb n=N^0,5
A more precise calculation is:
n= (z^2*σ^2)/(max.error acceptable) ^2
z = the z value corresponding to the desired level of confidence
σ = the known or estimated standard deviation of the population
max. error = the concrete value of the acceptable error
Teenagers market… : Teenagers market… Marketers always keep an eye on teenagers
A marketer for mobile phones wants to know the average amount that teenagers earn during the summer holidays
The marketer wants a 95% confidence that the sample mean is within EUR 50 of the actual population mean (all teenagers in NL)
The marketer has estimated that σ=EUR 400
What should be the sample size?
(apply the formula: n = (z^2*σ^2)/error^2)
Your answer… : Your answer… 95% confidence refers to z=1,96
error= EUR 50
σ= EUR 400 (assumed)
So applying the formula: n= 1,96^2*400^2/50^2= 246 (rounded)
Say we want an error of 25 EUR : Say we want an error of 25 EUR To get a feel for the relation between accuracy and the size of the sample what should then be the sample size?
And what if the error should be 5 EUR? 983 ? 25000 ? 34879? 123900? Or…
Your answer… : Your answer… 1.96^2*400^2/25^2 = 984 (rounded up)
Halving the accepted error increases the required sample size by a factor of 4 (from 246 to 984)
And with a small error like 5 EUR: 1.96^2*400^2/5^2 = 24,587 (rounded up)…
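To get a feel for how quickly the sample size grows, the formula can be wrapped in a small function. This is a sketch; `sample_size_for_mean` is an illustrative name and the result is always rounded up.

```python
from math import ceil

def sample_size_for_mean(z, sigma, max_error):
    # n = z^2 * sigma^2 / error^2, rounded up to a whole respondent
    return ceil(z ** 2 * sigma ** 2 / max_error ** 2)

z = 1.96       # 95% confidence
sigma = 400    # estimated standard deviation in EUR

for error in (50, 25, 5):
    print(error, sample_size_for_mean(z, sigma, error))
```

An error that is 10 times smaller requires a sample that is 100 times larger.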
A travel agency…assignment: : A travel agency…assignment: Wants to determine the proportion of US adults that have ever vacationed in Mexico
The agency wants to be 95% confident that the sample error will be no more than 3%
Assuming the travel agency has no idea about the actual value of the population proportion, what sample size is necessary to have 95% confidence that the sample proportion will be within 3% of the actual population proportion?
How many US adults vacationed in Mexico?
Your answer… : Your answer… For the 95% confidence (see table) the z-value will be 1.96
If the agency has no idea about the population ratio they will use p=0.5 (or (1-p)=0.5)
Following the formula: n = z^2*p*(1-p)/e^2 = 1.96^2*0.5*0.5/(0.03^2) = 1067.1; always round up, so 1,068
If the agency believes however that the ratio in the population in no case is higher than p=0.3 then: 1.96^2*0.3*0.7/(0.03^2)=896.4 rounded 897…
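The proportion version of the sample size formula, as a sketch (`sample_size_for_proportion` is an illustrative name; p = 0.5 is the most conservative choice when nothing is known about the population):

```python
from math import ceil

def sample_size_for_proportion(z, p, max_error):
    # n = z^2 * p * (1 - p) / e^2, rounded up
    return ceil(z ** 2 * p * (1 - p) / max_error ** 2)

print(sample_size_for_proportion(1.96, 0.5, 0.03))   # no prior idea about p
print(sample_size_for_proportion(1.96, 0.3, 0.03))   # p believed to be at most 0.3
```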
Hypothesis Testing (Samples) : Hypothesis Testing (Samples)
The new Fat Free Pringles…
Are they as good as the existing tastes (containing fat)?
44 testers were given 2 bowls of Pringles…one with and one without fat
If the two bowls tasted the same then each tester would have a chance of 50% to correctly indicate the bowl that contained fat and the one that did not…
However 25 of 44 testers (almost 57%) identified the bowl without fat correctly
Does that result mean that Pringles failed in their attempt to make the Fat Free Pringles taste the same as the regular Pringles?
We will use hypothesis testing to find the right (statistical) conclusion…
Step 1) formulate H0 (null hypothesis) : Step 1) formulate H0 (null hypothesis) A statement about a population parameter
The H0 should be “a nothing out of the ordinary statement” and is usually believed to be true unless we have overwhelming statistical evidence it is not true…
The alternative hypothesis (H1) holds if the H0 hypothesis is false…
Directional versus Non-directional testing : Directional versus Non-directional testing A directional test/claim is one in which the population parameter is believed to be smaller than or equal to…or bigger than… (inequality)
For instance: no more than 20% of the cans are damaged… In this case we use a one-tail test
A non-directional test/claim is one in which the population parameter is believed to be a certain value (equality)
35% of car drivers are senior citizens…in this case we use a two tailed test
Errors in hypothesis testing… : Errors in hypothesis testing… Reject H0 while H0 is true: type 1 error; the probability of this type of error is α (alpha) = the level of significance
Accept H0 while H0 is false: type 2 error; its probability is expressed as β (beta)
Step 2) Choose level of significance…(alpha level) : Step 2) Choose level of significance…(alpha level) Basically by choosing the significance level we are choosing the chance that we will make a type 1 error during our test…
Significance levels of 10%, 5% or 1% are quite common (either one sided-for directional tests or two sided for non-directional tests)
Step 3) Choose the test statistic… : Step 3) Choose the test statistic… In most cases this will be the z-statistic or t-statistic corresponding to the normal and t-distribution…
Run a z-test if σ is known and the population is assumed to be normally distributed
Run a t-test if σ is unknown…
Using the z-statistic (σ known) - Example : Using the z-statistic (σ known) - Example When a robot welder is in adjustment, its mean time to perform its task is 1.3250 minutes
Past experience has found that the standard deviation of the cycle time is 0.0396 minutes
An incorrect mean operating time can disrupt the efficiency of other activities along the production line
For a recent sample of 80 jobs (n=80) the data are in attached Excel file: Welding Robot…
Using z-statistic (continued) : Using z-statistic (continued) Alpha=5%
Using z-statistic : Using z-statistic 1) Select the significance level: we are doing a two-sided test with, let’s say, alpha=5% (2.5% at each side), i.e. if the robot is running properly there is only a 5% chance that we make the mistake of concluding that it needs adjustment…
2) The assumed mean under H0: µ=1.3250, so H1: µ≠1.3250
3) Calculate the test statistic:
z = (x (sample average) - 1.3250)/σ(s)
Assuming a normal distribution, σ(s) = σ/n^0.5
So: z = (1.3229-1.3250)/(0.0396/80^0.5) = -0.47
Using z-statistic (continued) : Using z-statistic (continued)
4) Identify critical values for test statistic at α=5% z= -1.96 or z=+1.96
5) Compare the calculated test statistic with the critical values in order to decide whether to accept H0 or reject it…; in this case the calculated z value (-0.47) falls within the non-rejection region, so at α=5% the null hypothesis cannot be rejected!
Based on this there is no evidence that the robot welder is in need of adjustment…and thus the difference between the population mean of 1.3250 and the sample mean of 1.3229 minutes can be attributed to chance variation…
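The five steps of the robot welder test can be sketched in a few lines of Python. The numbers come from the slides; the variable names are only chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

# Values from the robot welder example
mu0, sigma, n, x_bar = 1.3250, 0.0396, 80, 1.3229
alpha = 0.05  # two-sided test, 2.5% in each tail

z = (x_bar - mu0) / (sigma / sqrt(n))          # test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, about 1.96
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
reject_h0 = abs(z) > z_crit

print(f"z={z:.2f}  critical=+/-{z_crit:.2f}  p={p_value:.3f}  reject H0: {reject_h0}")
```

The calculated z of about -0.47 stays well inside ±1.96, so H0 is not rejected.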
A 95% confidence interval… : A 95% confidence interval… In this case the 95% confidence interval would have been:
try…
Your answer… : Your answer… X (average)± z*σ/√n=
1.3229±1.96*0.0396/√80=
1.3142 (lower bound) and 1.3316 (upper bound) around the sample mean
This confidence interval tells us that the mean of the population could indeed be 1.3250, since that value lies inside the interval…
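A minimal sketch of the same confidence interval calculation in Python, using the slide values (the variable names are chosen for this illustration only):

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 1.3229, 0.0396, 80   # sample mean, known sigma, sample size
confidence = 0.95

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96
margin = z * sigma / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(f"95% CI: {lower:.4f} to {upper:.4f}")
print("Hypothesized mean 1.3250 inside the interval:", lower < 1.3250 < upper)
```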
Z-statistic test (one tail) : Z-statistic test (one tail) Light bulbs have mean life time of 1030 hrs with stdev 90 hrs
A company considers buying a new bulb with a longer life time (same price)
The management wants to make sure that the new bulb has a longer life time since they use a large number of them
A sample of n=40 is taken of the new bulb
The mean life time of the sample bulbs is found to be 1061.6 hrs
Underlying bulb data… : Underlying bulb data…
Class assignment: Follow the test procedure… : Class assignment: Follow the test procedure… H0: µ≤1030 hrs H1:µ> 1030 hrs
H0 states that the life time of the new bulbs is not longer than that of the old ones…
Let’s choose α=5%: the chance of believing that the new bulb has a longer life while that’s not true
Test statistic: z= ?
Critical value: at 5% z= ?
So do we reject or accept H0 ?
Is the difference between the old mean of 1030 hrs and the mean of the sample 1061.6 hrs of the new lamp too large to have occurred by chance…or not?
Your answer… : Your answer… Test statistic: z= (1061.6-1030)/(90/ √40)= 2.22
Critical value: at 5% z=+1.645
So we reject H0 and accept H1 that the new bulb’s life is longer than the old one…
The difference between the old mean of 1030 hrs and the mean of the sample 1061.6 hrs is too large to have occurred by chance…
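As a check on the class assignment, the one-tail z-test for the bulbs can be sketched as follows (slide values; names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, x_bar = 1030, 90, 40, 1061.6   # values from the bulb example
alpha = 0.05                                   # one-tail (right) test

z = (x_bar - mu0) / (sigma / sqrt(n))
z_crit = NormalDist().inv_cdf(1 - alpha)       # about 1.645
p_value = 1 - NormalDist().cdf(z)              # right-tail p-value
reject_h0 = z > z_crit

print(f"z={z:.2f}  critical={z_crit:.3f}  p={p_value:.3f}  reject H0: {reject_h0}")
```

The p-value of about 0.013 matches the Minitab output for this example.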
Using Minitab for testing… : Using Minitab for testing… For the 40 bulbs tested open the Minitab sheet
Use: basic statistics/z-test
Enter in input range the hypothesized mean of 1030 hrs
Enter the known population stdev=90 into the sigma box
Click labels
Enter the level of significance 5% into the alpha box under options…
Note the p-value of 0.013, telling you that if H0 is true there is only a 1.3% probability of getting a sample mean this large (1061.6) or larger by chance…
One-Sample Z
Test of mu = 1030 vs > 1030
The assumed standard deviation = 90
N   Mean    SE Mean  95% Lower Bound  Z     P
40  1061.6  14.2     1038.2           2.22  0.013
Testing with the t-test (σ unknown) two tail test… : Testing with the t-test (σ unknown) two tail test… A credit manager claims that the average account of customers is $410
An auditor takes a sample of 18 accounts and finds a mean of $511.33 and a standard deviation of $183.75
Sample data are in the next slide
If the sample results do not support the manager’s claim then the auditor will check all the accounts
(and the company will have to pay for that)
What should the auditor do?
Underlying data… : Underlying data…
Follow the test procedure… : Follow the test procedure… H0: µ=$410 H1: µ≠$410
N=18 α=5% (two tail test)
We are using the t-statistic so the t-distribution will be used to describe the sample distribution
t=(x average - µ)/(s/ √n) =
t= ($511.33-$410)/($183.75/ √18)=2.34
Critical values: n=18 df=n-1=17
So from t-table: at 2.5% in each tail (two sided) t=±2.11
The calculated t falls in the rejection area since it exceeds 2.11 so we will reject H0
The auditor should, based on this, proceed to check all accounts (and charge the cost to the company…)
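The auditor's calculation can be sketched in Python as below. Python's standard library has no t-distribution, so the critical value 2.11 is simply taken from the t-table as on the slide (an assumption of this sketch, not a computed value).

```python
from math import sqrt

mu0, x_bar, s, n = 410, 511.33, 183.75, 18   # values from the auditor example
df = n - 1                                    # degrees of freedom = 17
t_crit = 2.11                                 # t-table, 2.5% in each tail, df=17

t = (x_bar - mu0) / (s / sqrt(n))
reject_h0 = abs(t) > t_crit

print(f"t={t:.2f}  df={df}  critical=+/-{t_crit}  reject H0: {reject_h0}")
```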
Testing with the t-test (σ unknown) one tail test… : Testing with the t-test (σ unknown) one tail test… A tire company claims that a new tire has a mean life time of 60,000 miles
A skeptical car magazine wants to test this claim and takes a sample of 36 tires
The test data are in the next slide Life time: 60,000 miles?
Follow the test procedure… : Follow the test procedure…
Follow the test procedure… : Follow the test procedure… H0: µ≥ 60,000 miles H1:µ< 60,000 miles
Let’s say we use significance level α=0.01
The test statistic t=(x average - µ)/(s/ √n)
Calculate t= (58,341.69- 60,000)/605.42=-2.739
Critical t-value: at α=0.01; t=-2.438
So the calculated t value based on the sample lies in the rejection area and we will not accept H0; the editor’s doubt with respect to the lifetime of the tires seems to be justified by the sample results…
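A short sketch of the tire test, assuming the standard error s/√n = 605.42 and the table value -2.438 exactly as stated on the slide (neither is recomputed here):

```python
mu0 = 60_000          # claimed mean life time in miles
x_bar = 58_341.69     # sample mean from the slide
std_error = 605.42    # s / sqrt(n) as given on the slide, n = 36
t_crit = -2.438       # t-table, alpha = 0.01, left tail, df = 35

t = (x_bar - mu0) / std_error
reject_h0 = t < t_crit   # left-tail test: reject when t is below the critical value

print(f"t={t:.3f}  critical={t_crit}  reject H0: {reject_h0}")
```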
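Using Minitab for t-tests… : Using Minitab for t-tests… One-Sample T
Test of mu = 60000 vs > 60000
N   Mean   StDev  SE Mean  95% Lower Bound  T      P
36  58342  3623   604      57322            -2.75  0.995
Note: this output was run with the alternative “greater than”, the opposite tail to H1: µ< 60,000; for the “less than” alternative the p-value is 1 - 0.995 = 0.005, which is below α=0.01
Basic statistics, simple t-test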
Testing Proportions : Testing Proportions Sometimes we want to test a proportion in a population based on a sample
Voting % (claim 65% test…)
Unsatisfied customers (claim 5% test…)
Operating spec of a machine (claim supplier 3% test…)
Two tail Test on Proportion…(example) known σ : Two tail Test on Proportion…(example) known σ The director of an MBA school claims that 70% of graduated students after 3 years end up in a job related to their MBA study…
A sample of n=200 is taken to test this claim and from this sample it was calculated that 66% of the students fulfill the claim
Use the 5% significance level
The estimated stdev of the sample proportion is:
s=√(p*(1-p)/n)= √(0.70*0.30/200)=0.0324
The calculated value of z= (p(s)-p)/s=
z=(0.66-0.70)/0.0324= -1.23
Critical values z=±1.96
So the calculated z-value lies in the non-rejection area
Conclusion: the claim of the MBA Program Director could indeed be true…
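The two-tail proportion test for the MBA claim can be sketched like this (slide values; variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

p0, p_sample, n = 0.70, 0.66, 200   # claimed proportion, sample proportion, sample size
alpha = 0.05                        # two-tail test

std_error = sqrt(p0 * (1 - p0) / n)            # standard deviation under H0
z = (p_sample - p0) / std_error
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
reject_h0 = abs(z) > z_crit

print(f"s={std_error:.4f}  z={z:.2f}  critical=+/-{z_crit:.2f}  reject H0: {reject_h0}")
```

Note that z is negative because the sample proportion (66%) lies below the claimed 70%.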
One tail test on proportions known σ… : One tail test on proportions known σ… The US government closes hospitals with mortality rates of over 5%
From a sample we know that in one hospital 100 operations have been performed with mortality rate 7%
At the 1% level of significance was the mortality rate of this hospital significantly greater than 5%?
Perform the test and draw a conclusion….
Your answer…. : Your answer…. H0: p≤5% H1: p>5%
α=1%
s=√(p*(1-p)/n)= √(0.05*0.95/100)=0.02179
The calculated value of z= (p(s)-p)/s=
z=(0.07-0.05)/0.02179= 0.92
Critical value at α=1%; z=+2.33
So if z>2.33 we reject H0, but in our case the calculated z=0.92 is smaller than the critical value and thus we will accept H0
Concluding: the mortality of the hospital is not significantly higher than 5% and could as well be 5% (that is what we tested); that it is 7% in the sample might be due to chance only…
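A sketch of the one-tail hospital test in Python, using the slide values (names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

p0, p_sample, n = 0.05, 0.07, 100   # closure threshold, observed rate, operations
alpha = 0.01                        # one-tail (right) test

std_error = sqrt(p0 * (1 - p0) / n)
z = (p_sample - p0) / std_error
z_crit = NormalDist().inv_cdf(1 - alpha)   # about 2.33
reject_h0 = z > z_crit

print(f"s={std_error:.5f}  z={z:.2f}  critical={z_crit:.2f}  reject H0: {reject_h0}")
```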
Try that in Minitab (basic stats) : Try that in Minitab (basic stats) Test and CI for One Proportion
Test of p = 0.7 vs p not = 0.7
Sample  X    N    Sample p  95% CI                Exact P-Value
1       132  200  0.660000  (0.589844, 0.725332)  0.247
Pooled Variance t-Test for two independent samples… : Pooled Variance t-Test for two independent samples… Sometimes we would like to know whether the difference between the means of two independent samples is large enough to reject the possibility that their population means are the same…
Comparing two models of printer’s speed
Comparing the tensile strength of steel bars
Example… : Example…
A new training program for CPA’s developed by a software company has 2 versions
10 students are trained with format 1; their performance is stated in the attached table (nr of errors)
12 students are trained with format 2; their performance is stated in the attached table (nr of errors)
Test if the two formats are significantly different…
Take significance-level 10%
Follow the test procedure… : Follow the test procedure… Ho: µ1=µ2 H1: µ1≠µ2
Two tailed test with α=10%
Format 1 students sample shows average nr of errors of 6 with stdev 3.127
Format 2 students sample shows average nr of errors of 8.167 with stdev 3.326
t=((x(s1)-x(s2))-(µ1-µ2))/√(s²*(1/n1+1/n2))
s²=((10-1)*(3.127)²+(12-1)*(3.326)²)/(10+12-2)= 10.484
So t=((6 – 8.167)-(0))/√(10.484*(1/10+1/12))=-1.563
Critical values (5% each side) t=±1.725 note df=20
The t-statistic falls in the non-rejection area so we cannot reject H0: based on these results the two formats could well have the same population mean number of errors…
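The pooled variance calculation can be sketched as follows. The sample figures are from the slides; the critical value 1.725 is taken from the t-table (df=20) rather than computed, since the standard library has no t-distribution.

```python
from math import sqrt

n1, mean1, s1 = 10, 6.0, 3.127     # format 1: size, mean errors, stdev
n2, mean2, s2 = 12, 8.167, 3.326   # format 2: size, mean errors, stdev
t_crit = 1.725                     # t-table, 5% in each tail, df = 20

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
# Under H0 the hypothesized difference mu1 - mu2 is 0
t = ((mean1 - mean2) - 0) / sqrt(pooled_var * (1 / n1 + 1 / n2))
reject_h0 = abs(t) > t_crit

print(f"pooled variance={pooled_var:.3f}  t={t:.3f}  df={df}  reject H0: {reject_h0}")
```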
Linear Regression and correlation : Linear Regression and correlation Simple linear regressions have the form: Yi=β0 + β1 Xi + εi
Where: Y= value of the dependent (endogenous) variable and i indicates the ith measurement
Where: X= value of the independent (exogenous) variable (i indicates similar)
For each X the Y values follow a normal distribution and vice versa
β0= the y-intercept of the regression line
β1= the slope of the regression line
ε= random error or residual value; the difference of the actual value of Y and the value of Y described by the line β0+β1*X for every i
How to find the best fitting line? : How to find the best fitting line?
The ordinary least squares method (OLS) minimizes the sum of the squared ε’s, the vertical distances from the points to the line
So for each measured actual value of Y for a given X, the difference with the predicted value of Y (by the regression line) is measured, squared and added up
The line with the lowest ∑ εi^2 is the best fitting line for the given actual points
Let’s take an example…
We measured the following relation : We measured the following relation What line describes the best fit based on OLS:
Y=7+2X or Y=1+ 3X ?
GO!
In Excel…Data Analysis : In Excel…Data Analysis Use Fx Regression
Excel will estimate b0 (intercept) and b1(slope)…for the best fitting line…
OLS Assignment: finding values for b0 and b1 : OLS Assignment: finding values for b0 and b1 The slope of the regression line is defined as:
b1=(∑xi*yi – n*x(mean)*y(mean))/((∑xi^2)-n*(x(mean))^2)
The Y-intercept can be found as:
b0= Y(mean)-b1*X(mean), since the OLS regression line always passes through (X(mean),Y(mean))
Apply these formulas in finding b0 and b1 in the following case:
Homework case: MBA points and assignment productivity : Homework case: MBA points and assignment productivity Prof.B. thinks there is a relationship between the Business Statistics score of 5 students and their assignment productivity
He measures:
Determine b0 and b1
Determine the regression line…
Show the calculations!
Assignment to verify : Assignment to verify Verify your regression line estimates in Excel and determine the coefficient of determination R² (measure of fit)
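The slope and intercept formulas can be tried out with a small sketch like the one below. The five (x, y) pairs are hypothetical numbers invented for this illustration; they are not the homework data.

```python
# Hypothetical data, NOT the homework measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi**2 for xi in x)

# b1 = (sum(xi*yi) - n*x_mean*y_mean) / (sum(xi^2) - n*x_mean^2)
b1 = (sum_xy - n * x_mean * y_mean) / (sum_x2 - n * x_mean**2)
# b0 = y_mean - b1*x_mean, because the line passes through (x_mean, y_mean)
b0 = y_mean - b1 * x_mean

print(f"regression line: Y = {b0:.2f} + {b1:.2f} X")
```

Replacing the two lists with your own measurements gives the estimates to compare with the Excel output.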
Covariance between 2 variables : Covariance between 2 variables Sxy=∑(xi-X(mean))*(yi-y(mean))/(n-1)
Say that x= the number of commercials and y=sales value of the product in $ 100’s
Calculate the sample covariance sxy
You need to know how to do this since you need it to calculate the correlation coefficient: r xy
The correlation coefficient between x and y is: : The correlation coefficient between x and y is: r xy= Cov(x,y)/(sx*sy) = (∑(xi-mean X)(yi-meanY)/(n-1))/(sx*sy)
Calculate r xy for the commercials/sales volume case… Dr.pepper commercial
Your answer… : Your answer… ∑(xi-x mean)(yi- y mean)=99
(n-1)=9
S xy= 99/9=11
Sx=(∑(xi-x mean)^2/(n-1))^0.5= 1.49
Sy= similar 7.93
R xy= S xy/(S x*S y)= 11/(1.49*7.93)= +0.93 a strong positive correlation
Check with Excel!
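If Excel is not at hand, the arithmetic of this answer can be checked with a few lines of Python, starting from the summary figures on the slide (the raw commercials/sales data are not reproduced here):

```python
sum_cross_products = 99   # sum of (xi - x mean)(yi - y mean), from the slide
n = 10                    # so that n - 1 = 9
s_x, s_y = 1.49, 7.93     # sample standard deviations from the slide

s_xy = sum_cross_products / (n - 1)   # sample covariance
r_xy = s_xy / (s_x * s_y)             # correlation coefficient

print(f"s_xy={s_xy:.0f}  r_xy={r_xy:.2f}")
```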
Linear correlation coefficient : Linear correlation coefficient Also called the Pearson product moment correlation coefficient
Can also be calculated using:
r= (n∑xy-(∑x)(∑y))/((n(∑x^2)-(∑x)^2)^0.5 * (n(∑y^2)-(∑y)^2)^0.5)
You may want to check your findings of the commercials assignment… Abracadabra…?
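No magic involved: the sketch below computes r both ways, from the covariance definition and from the computational formula, and shows that they agree. The data are hypothetical numbers invented for this illustration, not the commercials data.

```python
from math import sqrt

# Hypothetical data, NOT the commercials/sales measurements
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Definition: r = s_xy / (s_x * s_y)
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)
s_x = sqrt(sum((xi - x_mean) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_mean) ** 2 for yi in y) / (n - 1))
r_definition = s_xy / (s_x * s_y)

# Computational formula using only sums
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi**2 for xi in x)
sum_y2 = sum(yi**2 for yi in y)
r_computational = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x**2) * sqrt(n * sum_y2 - sum_y**2)
)

print(f"r (definition)={r_definition:.4f}  r (computational)={r_computational:.4f}")
```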
Homework assignment NYSE-Euro zone equity markets : Homework assignment NYSE-Euro zone equity markets Whatever your business is, often you have to come up with:
An idea
Gather data
Do research
Analyse
Structure your findings
Advise your company or customers
Refine your idea etc.
It is believed that all European equity markets follow the NYSE… : It is believed that all European equity markets follow the NYSE… As an investment banker you are asked by your potential customers to show the strong correlation between the NYSE and the AEX, DAX, Euronext, CAC, FTSE, DJ Euro 50…
You have to assume a correlation and assume a model
You have to gather raw data
You have to calculate the correlation
You have to describe the linear relation
You have to advise your potential customers
You have to test your theory…
Your homework assignment… : Your homework assignment… Gather data on these bourses
Develop a model
Test your model
Assess correlation
Refine your model (to improve correlation)
Advise your customers…
Next week : Next week Your assignment portfolio defines your mid term grade!
Simple Linear Regression and Correlation analysis!
Grading your Portfolio… : Grading your Portfolio… Failing to prepare is preparing to fail!
In praise of Thomas Bayes : In praise of Thomas Bayes Mathematical rule explaining how you should change your existing beliefs in the light of new evidence
A set of developing information influences probabilities
The Microsoft Office assistant (the paper clip that tries to help the user) uses this algorithm
When the user calls for the assistant, the computer analyses recent actions and changes the probabilities of the different pieces of help information…
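A toy sketch of this kind of updating with Bayes' rule is shown below. All probabilities are invented for the illustration and say nothing about how the real assistant was tuned; the scenario (a user typing "Dear") is likewise hypothetical.

```python
# Hypothetical numbers, for illustration only
prior_letter = 0.20             # belief that the user is writing a letter
p_dear_given_letter = 0.90      # chance of typing "Dear" when writing a letter
p_dear_given_other = 0.05       # chance of typing "Dear" in any other document

# New evidence: the user has just typed "Dear"
p_dear = (p_dear_given_letter * prior_letter
          + p_dear_given_other * (1 - prior_letter))

# Bayes' rule: P(letter | Dear) = P(Dear | letter) * P(letter) / P(Dear)
posterior_letter = p_dear_given_letter * prior_letter / p_dear

print(f"belief before evidence: {prior_letter:.2f}")
print(f"belief after evidence:  {posterior_letter:.2f}")
```

With these made-up figures the belief that the user is writing a letter jumps from 20% to about 82% after one piece of evidence.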
Econometrics! : Econometrics! Application of statistical theory to economic investigations…
Say your theory is that the profitability of a business is highly related to the industry it’s active in
But there is an “omitted variable bias” there are many other factors like leadership style, economic climate,…
There is also the reverse causality problem; doing better helps to improve the economic climate…
Robustness and precision affect the reliability of the outcomes; were enough situations measured?
Econometrics is a rich field of analysis… Omitted variables?
Reverse causality-autocorrelation?
Robustness quest?
Figures of fun; remember scale : Figures of fun; remember scale Figure and fictitious come from the same Latin root!
Remember what we said about scales…
Compare the stock market charts for Bangkok and Manila
Assignment marketing statistics : Assignment marketing statistics You are marketing manager for baby clothes; you want to launch a new baby clothes line in a new area but you need to know more about the probability of baby births in the area
In a new county last year 438 babies were born and this year’s figure will be about the same according to officials
What is the probability that on a given day 0,1,2,3 etc. babies will be born if this natural process follows the Poisson distribution?
Graph the distribution; is there, based on this, reason to believe the new clothes line should be launched during a specific season?
You may want to use the Poisson function in Excel and graph from there…
Is the market size attractive? Why/Why not?
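As an alternative to the Excel function, the daily probabilities can be sketched in Python. The sketch assumes a 365-day year, so the mean is 438/365 = 1.2 births per day; the helper name `poisson_pmf` is made up for this illustration.

```python
from math import exp, factorial

births_per_year = 438
lam = births_per_year / 365   # mean births per day = 1.2 (assuming 365 days)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return exp(-lam) * lam**k / factorial(k)

for k in range(7):
    p = poisson_pmf(k, lam)
    # Crude text "graph": one star per percentage point
    print(f"{k} births: {p:.3f} {'*' * round(p * 100)}")
```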
Poisson in your company…assignment : Poisson in your company…assignment Try to find out in what processes of your company the Poisson distribution will be applicable.
Define the process; assume a mean value and define X and T (time)
Develop the distribution
Plot the results
Draw conclusions…
IQ’s are rising! : IQ’s are rising! IQ tests 1932: mean=100, σ = 15 (2σ level marked); IQ tests 2003: mean=120!
Sugar in oranges! : Sugar in oranges! The citrus industry in Florida makes extensive use of statistics.
Truckloads of oranges are weighed at the receiving juice plant.
A dozen oranges are randomly selected; the amount of sugar in the squeezed juice of these oranges is the basis for payment for the load of oranges!
Thanks! see you next week… : Thanks! see you next week… r=% of class participants BUS 5760 versus class list Webster 100%= 7 students