Added: January 28, 2008 This presentation is Public
Presentation Category :Education
Presentation Transcript

Applied Business Statistics (2) : Applied Business Statistics (2) 8 September 2007, SUFE/Webster University- Shanghai I really have no clue what this Webster professor is talking about…


Agenda 8 September 2007 : Agenda 8 September 2007 Your feedback on 1 September assignments Amazing Census-statistic Visual Presentation statistics your S&P 500 company X-Y Correlation on your company Share price of your company Share price sampling assignment ATM experiment Finishing introduction ch. 1-3 last week Combinations and Permutations Binomial Distribution Poisson Distribution Introduction textbook: continuous probability distributions Normal Distribution Sample size Hypothesis Testing Assignments for next week In the news AOB


Agenda 8 September 2007 : Agenda 8 September 2007 Let’s look at your assignments: Your Team-Your Company US Census bureau; the wow effect… Forecasting the share price… 3) Experiment ATM machines- testing Poisson… Probability of hole in one?


Look up and experience assignment : Look up and experience assignment Discover the census… You may want to click around in the data, preferably in the business sections. Find something odd that you did not know and that can create the “wow” effect in your class. Present next week!


NYSE: New York Stock Exchange : NYSE: New York Stock Exchange As an investor you would like to predict share price movements. If you were successful you could become a billionaire! Businesses need to have interest rate visions, inflation forecasts, currency forecasts… We are all trying to do better than the market…


The market…(Sep 2006-Sep 2007) : The market…(Sep 2006-Sep 2007)


Share price Sampling assignment : Share price Sampling assignment Observe the share price of your company over the past 5 years. Take a random sample of share prices (closing price per date). Analyse your findings. Can you, based on your findings, determine a relation between the date/period and the share price? Now calculate the delta of share price movements over the past 5 years. Again take a random sample (not the same one). Can you determine a relation between the delta of the share price and the date/period of the year?


Experiment assignment: Bank’s ATM : Experiment assignment: Bank’s ATM Watch an ATM for 5 minutes (T) and figure out for what % of these 5 minutes the number of people arriving at the machine was 0, 1, 2, 3, etc. Repeat the experiment, say, 5 times. Now make a discrete probability distribution based on your findings. Present your results in class. Name the place/location of the machine, the date and the specific time measured (e.g. 15:03-15:08). Do not cooperate with classmates; the more registrations we have the better… The theoretical distribution that should describe this well is the so-called Poisson distribution!


Counting-permutations and combinations : Counting-permutations and combinations If event A can happen in n1 ways and event B can happen in n2 ways, then events A and B together can happen in n1*n2 ways. If each of k independent events (A, B, C, etc.) can happen in n different ways, then the events together can happen in n*n*…*n (k factors) = n^k ways.


Simple multiplication rule : Simple multiplication rule A restaurant offers 10 different entrees, 25 different main courses and 5 different desserts. How many different meals can you arrange from these? 10*25*5 = 1250… If you think this is too hard to understand immediately, you can see it by taking a simple example…


Example : Example A European car license plate consists of 6 characters, each a letter or a digit. Assume that you can use all 26 letters and all 10 digits, and that characters may repeat. Your company is responsible for issuing these license plates. Will you have enough possibilities to make sure you can generate unique plates now and in the future? Calculate the number of different license plates. The Netherlands currently has about 6.3 mln. cars. GO!


Your answer : Your answer 26 letters and 10 digits give 36 possible characters for each of the 6 positions on the license plate. The events are independent. There are 36 (place 1) * 36 (place 2) * … * 36 (place 6) = 36^6 = 2,176,782,336 possibilities… This is about 4 times the number of vehicles in the entire world! Maybe they are overdoing it?


Simple multiplication rule… : Simple multiplication rule… Offer a sandwich on white or brown bread With Ham, cheese or peanut butter How many different sandwiches? White with Ham, White with Cheese, White with Peanut… Brown with Ham, Brown with Cheese, Brown with Peanut So in total 6 different sandwiches Or 2*3=6
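
The multiplication rule can be checked by brute force. The following is a minimal sketch, assuming Python; the list names are hypothetical and simply restate the sandwich and license plate examples above.

```python
from itertools import product

# Hypothetical option lists taken from the sandwich example.
breads = ["white", "brown"]
fillings = ["ham", "cheese", "peanut butter"]

# Every (bread, filling) pair is one distinct sandwich.
sandwiches = list(product(breads, fillings))
for bread, filling in sandwiches:
    print(f"{bread} with {filling}")

# The count equals the product of the number of choices at each step.
assert len(sandwiches) == len(breads) * len(fillings)

# Same rule for the license plate: 36 characters in each of 6 positions.
plates = 36 ** 6
print(plates)
```

Enumerating works for small cases only; for the license plates the product 36^6 is computed directly rather than listed.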


Factorial (n!) : Factorial (n!) Sometimes events have different numbers of possibilities: if you choose one, that limits the number of choices left for subsequent events. This is the case with permutations, the number of different ways in which objects can be arranged. The number of permutations of n objects taken r at a time is n!/(n-r)!. Example: warehouse location A can be filled with 4 containers and 6 containers are shipped in; how many different ways are there to fill location A? Go!


Well… : Well… Container loc 1 in A can be filled in 6 ways. Loc 2 in A in 5 ways if loc 1 is taken. Loc 3 in A in 4 ways if loc 1 and 2 are taken. Loc 4 in A in 3 ways if loc 1, 2 and 3 are taken. Thus 6*5*4*3 = 360. Or applying the permutation rule: n!/(n-r)! = 6!/(6-4)! = (6*5*4*3*2*1)/(2*1) = 360


An auditor has 9 audits to do : An auditor has 9 audits to do Tomorrow only 5 can be done In how many different orders can tomorrow’s task be carried out? Permutations: n!/(n-r)!=… Calculate!


Well… : Well… 9!/(9-5)!=15120….


Combinations : Combinations Unlike permutations, combinations consider only the possible sets of objects, regardless of the order in which the members of the set are arranged. The number of combinations is n!/(r!*(n-r)!). Determine the number of combinations for the containers in warehouse loc. A


Well… : Well… There are 15 different combinations of containers that can be stored in loc A: n!/(r!*(n-r)!) = (6*5*4*3*2*1)/(4!*2*1) = 15. You may want to check the answer… by giving each container a colour


So. : So. If you try to make as many different combinations as possible from, let's say, ABCD, and you want combinations of 3, then ABC, BCD, ABD and CDA are the only combinations, or 4!/(3!*(4-3)!) = 4


Remember our auditor? : Remember our auditor? If the auditor does not consider the order in which tomorrow’s audits are carried out, how many combinations of 5 audits can he choose from? n!/(r!*(n-r)!) = … Calculate


Well… : Well… 9!/(5!*(9-5)!) = 126 combinations
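
The permutation and combination rules used for the containers and the auditor can be written out as a short sketch, assuming Python. The function names n_permutations and n_combinations are hypothetical helpers, not part of the course material.

```python
import math

def n_permutations(n, r):
    """Ordered selections of r objects out of n: n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)

def n_combinations(n, r):
    """Unordered selections of r objects out of n: n! / (r! * (n - r)!)"""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(n_permutations(6, 4))   # containers in location A, order matters: 360
print(n_permutations(9, 5))   # auditor, order of audits matters: 15120
print(n_combinations(6, 4))   # containers, order ignored: 15
print(n_combinations(9, 5))   # auditor, order ignored: 126
```

Each combination of r objects can be ordered in r! ways, which is why the permutation count is always r! times the combination count.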


Discrete probabilities.. : Discrete probabilities.. Toss a coin twice. Outcomes: H (heads) or T (tails), with P(H)=0.5 and P(T)=0.5. Possible outcomes of this experiment: HH, TH, HT or TT. With H now counting the number of heads: P(H=0)=0.25, P(H=1)=0.5 and P(H=2)=0.25. A simple distribution follows, as well as E(H) = 0*0.25+1*0.5+2*0.25 = 1 and STDEV(H) = ((0-1)^2*0.25+(1-1)^2*0.5+(2-1)^2*0.25)^0.5 = (0.5)^0.5 = 0.707
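
The mean and standard deviation of the two-toss distribution can be reproduced with a small sketch, assuming Python. The function name mean_and_sd and the dictionary layout are illustrative choices only.

```python
import math

def mean_and_sd(distribution):
    """Mean and standard deviation of a discrete distribution {value: probability}."""
    mean = sum(x * p for x, p in distribution.items())
    variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
    return mean, math.sqrt(variance)

# Number of heads in two tosses of a fair coin.
heads = {0: 0.25, 1: 0.5, 2: 0.25}
mu, sigma = mean_and_sd(heads)
print(mu, round(sigma, 3))
```

The same function works for any discrete distribution whose probabilities add up to 1, such as the results of the ATM experiment.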


Binomial Distribution (Bernoulli Process) : Binomial Distribution (Bernoulli Process) There are 2 or more consecutive trials. In each trial there are just 2 possible outcomes, “success” or “failure”. The trials are statistically independent (the outcome of any trial does not affect the outcome of another trial). The probability of a success and of a failure stays the same for every trial


Discrete probability distributions : Discrete probability distributions For instance, arrivals at a bank ATM follow a discrete probability distribution, since the number of people arriving at the machine over a period of time t is a discrete variable: 0, 1, 2, 3, etc.


For discrete probability distributions : For discrete probability distributions The mean = µ = E(x) = ∑xi*P(xi). The variance = σ^2 = E((x-µ)^2) = ∑(xi-µ)^2*P(xi). The standard deviation is σ = variance^0.5. A salesman has contacted customers over many days and found that contacts who became customers (xi) are distributed like:


Binomial distribution : Binomial distribution Each trial of a binomial experiment has 2 possible outcomes, success or failure (bi). Also called a Bernoulli process. Two or more consecutive trials. Each trial is a success or failure. The trials are mutually independent. The success rate remains the same over all trials


Do it now assignment… : Do it now assignment… Graph the distribution of contacts that became customers per day Calculate the mean, variance and standard deviation


Your answers : Your answers Horizontal axis: 0, 1, 2, …, 6. Vertical axis: P(xi). Mean: ∑xi*P(xi) = 0*0.05+1*0.1+…+6*0.1 = 3.2, so on average 3.2 contacts per day become customers. Variance: (0-3.2)^2*0.05+(1-3.2)^2*0.1+…+(6-3.2)^2*0.1 = 2.66. Standard deviation: 2.66^0.5 = 1.63 contacts. Two important discrete distributions are: the Binomial distribution and the Poisson distribution. Building Customer Platform


Sony researchers assignment : Sony researchers assignment Sony has found that 60% of VCR owners know how to program their VCR but wants to test this. At their service centre they select 3 VCR owners. What is the probability that 2 of these 3 are able to program their VCR? (S = success = 60% and F = failure = 40%) How many different possible outcomes are there? How many outcomes with 2 successes? What is the cumulative probability calculated?


Well… : Well… Each of the 3 outcomes with 2 successes (SSF, SFS, FSS) has probability 0.6*0.6*0.4 = 0.144. Add up: 0.144 + 0.144 + 0.144. Your answer: 0.432


Formula Binomial distribution : Formula Binomial distribution The probability of having x successes in n trials: P(x) = n!/(x!*(n-x)!) * P(success)^(nr. of successes) * (1-P(success))^(nr. of trials - nr. of successes). Thus: 3!/(2!*(3-2)!) * (0.6)^2 * (0.4)^(3-2) = 0.432. Here 3!/(2!*(3-2)!) is the nr. of combinations, the exponent 2 is the nr. of successes, the exponent (3-2) is the nr. of failures, and (0.6)^2*(0.4)^(3-2) is the joint probability of 2 successes and 1 failure
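
The binomial formula translates almost literally into code. This is a minimal sketch, assuming Python 3.8 or later for math.comb; binomial_pmf is a hypothetical helper name.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials
    with success probability p on each trial."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Sony example: 2 of 3 VCR owners can program their VCR, p = 0.6.
sony = binomial_pmf(2, 3, 0.6)
print(round(sony, 3))

# The probabilities for x = 0..3 cover all outcomes, so they sum to 1.
total = sum(binomial_pmf(x, 3, 0.6) for x in range(4))
print(round(total, 6))
```

Excel users get the same number from BINOMDIST with the cumulative flag set to FALSE.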


Binomial distribution characteristics : Binomial distribution characteristics Mean: µ = E(x) = expected number of successes = n*π. Variance: σ^2 = E((x-µ)^2) = n*π*(1-π). In which π = probability of success on a single trial and n = number of trials


Class assignment: the lawsuit : Class assignment: the lawsuit Kellog’s is subject to a lawsuit concerning the use of non-biodegradable packaging… The trial will be by jury; Kellog’s chief counsel believes that the success of the defense will depend highly on how many of the 9 jurors are corporate stockholders… We use the binomial distribution for n=9. The jurors are selected from a large county in which 20% of the adults own stocks (π=0.20). What is the probability that the jury will include at least 3 stockholders (successes)? What is the probability that the majority of the jury will be stockholders? Should the legal counsel, based on these findings, aim his legal arguments at Kellog’s shareholders? Using the binomial distribution


Your answer… : Your answer… Follow the binomial distribution for n=9 and k=3 or higher, with π=0.20 (20% of adults hold stocks). Add P(k=3)+P(k=4)+ etc.; what is the cumulative result? Majority means k=5 or higher; what is that cumulative result? So?


You use the binomial distribution (table B at the back of your book) : You use the binomial distribution (table B at the back of your book) Take the n=9 position of the table (the jury consists of 9 persons). The success rate is 20% (π). The probabilities are as shown: we are looking for k=3 or higher, adding P(x=3)+P(x=4)+…+P(x=9) = 0.2618 = 26.18%. The probability that k=5 or higher is even smaller: 1.96%. The legal counsel can save his breath towards the shareholders and come up with something better…
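
Instead of reading the table, the two "at least" probabilities for the jury can be summed directly. A minimal sketch, assuming Python 3.8 or later; the helper names binomial_pmf and at_least are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def at_least(k_min, n, p):
    """Probability of k_min or more successes: sum the upper tail."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

# Jury of 9, each juror a stockholder with probability 0.20.
p_three_or_more = at_least(3, 9, 0.20)
p_majority = at_least(5, 9, 0.20)
print(round(p_three_or_more, 4), round(p_majority, 4))
```

Summing the tail is the same operation as adding P(x=3) through P(x=9) from the printed table.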


Class assignment: City Wayne : Class assignment: City Wayne In the city of Wayne there are 41,636 registered residents. We know that 20% of the residents of this city were born in Wayne. We will take 5 trials and pick 5 residents. What is the probability that of these 5 residents 0, 1, 2, 3, 4 or 5 were born in Wayne (P(0), P(1), P(2), etc.)? What is the probability that 3 or more of these 5 residents were born in Wayne (P(x>=3))?


Using the binomial formula… : Using the binomial formula… P(0) = 5!/(0!*(5-0)!) * 0.20^0 * 0.80^5 = 0.3277. Similarly: P(1)=0.4096, P(2)=0.2048, P(3)=0.0512, P(4)=0.0064, P(5)=0.0003. Check the binomial table in your book for n=5 and a success probability of 0.20… P(x>=3) = P(3)+P(4)+P(5) = 0.0579 (5.8%)


Using Excel…for Binomial… : Using Excel…for Binomial… Open the statistical functions (Fx). Open the BINOMDIST function. Fill in the number of successes you want the probability of (for instance, 1 person out of 5 is from Wayne). Fill in the nr. of trials (in our case 5). Fill in the probability of success (0.20). We are looking for P(1), so not for the cumulative probability (put FALSE in the last box). Excel calculates P(1)=0.4096… (try!)


The Poisson distribution : The Poisson distribution Say you want to describe: customer arrivals at a service point, defects in manufactured material, number of work related deaths. The Poisson distribution is a family of distributions with a shape determined by its mean λ (lambda). The probability that a random event will occur exactly x times over a given span of time t is P(x) = (λ^x * e^(-λ))/x!, with λ = E(x) = the mean, e = the mathematical constant 2.71828 (the base of natural logarithms), and e^(-λ) = 1/(2.71828)^λ. Note that for the Poisson distribution the mean = variance = λ


Arrival time analysis : Arrival time analysis Customers at a service counter Ambulances Queues Post office ATM arrival Any counter arrivals We now discuss random variables of the discrete type…


Class assignment: Birth rates… : Class assignment: Birth rates… In an urban district the number of births is expected to be the same as last year (last year 438 children were born), an average of 438/365 days = 1.2 per day. Daily births are distributed according to the Poisson distribution… For any given day, what is the probability that no children are born in the district? And what is the probability that no more than 1 birth will occur on a given day?


Follow the Poisson distribution formula… : Follow the Poisson distribution formula… P(x) = (λ^x * e^(-λ))/x!, with λ = 1.2 the mean birth rate per day. We are looking for P(x=0), so: (1.2^0 * 2.71828^(-1.2))/0! = 0.3012. Similarly for P(1), P(2), etc.: P(1)=0.3614, P(2)=0.2169, P(3)=0.0867, P(4)=0.0260, P(5)=0.0062. P(x<=1) = P(0)+P(1) = 0.3012+0.3614 = 0.6626 (66.26%)
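
The birth rate calculation can be reproduced with a few lines. A minimal sketch, assuming Python; poisson_pmf is a hypothetical helper name.

```python
import math

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the mean number is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 438 / 365  # mean births per day, 1.2

p_none = poisson_pmf(0, lam)
p_at_most_one = poisson_pmf(0, lam) + poisson_pmf(1, lam)
print(round(p_none, 4), round(p_at_most_one, 4))

# Tabulate the first few probabilities, as a Poisson table would.
for x in range(6):
    print(x, round(poisson_pmf(x, lam), 4))
```

The same function can be reused for the ATM experiment once the class has estimated its own mean arrival rate.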


Using Poisson tables… : Using Poisson tables… Table will indicate value of λ For x=1,2,3 etc. the probabilities can be read directly from the table (try) P(0)= 0.3012 in table P(8)= almost 0 After that the distribution “stops”


Using Excel…for Poisson… : Using Excel…for Poisson… Open the statistical functions (Fx). Open the POISSON function. Enter the number of occurrences (let's say 0). Enter the mean (1.2 in our case). We want the individual probability, not the cumulative one, so enter FALSE in the logic box. Excel calculates P(0) as … (try!)


Testing Poisson at ATM assignment : Testing Poisson at ATM assignment Gather the data of your class mates Figure out what the mean is of your combined research Define the Poisson distribution Calculate variance and standard deviation Graph this distribution Test your findings with new observations as next week’s assignment Poisson for Riksa’s arrivals ?


Probability distributions for continuous random variables… : Probability distributions for continuous random variables… Variable x can take on any value or range of values on a continuum. Its probability density function is defined as f(x). The most famous and widely used distribution in this category is the Normal distribution. Is the clarity of a diamond Normally distributed?


We are using the N distribution : We are using the N distribution when we are looking for ranges of values (a value interval) on a scale that is basically infinite. We search for these values under the bell shaped curve. The total area under the curve represents probability 1 (100%). An interval under the curve defines an area, and the probability for that area can be found in the N distribution table…


The Normal distribution is a mathematical function… : The Normal distribution is a mathematical function… f(x) = (1/(σ*(2π)^0.5)) * e^(-(x-µ)^2/(2σ^2)), in which e = 2.71828, µ = the mean, σ = the standard deviation and π = 3.14159


So the shape of the distribution is defined by… : So the shape of the distribution is defined by… The mean The standard deviation Let us now look at some examples of real life… What is the µ and σ of their performance ?


The General Aviation Association : The General Aviation Association All single piston driven engine aircraft with 4 or more seats are measured to fly about 130 hours per annum in total. The σ = 30 hours. How many flying hours do 95.5% of these aircraft fly per annum? How many hours do 99.7% of these aircraft fly per annum?


The skilled statistician recognizes : The skilled statistician recognizes 95.5% is 2σ around the mean. The mean is 130 hrs. and σ = 30 hrs. So 95.5% of the planes are in the interval 130 hrs. plus/minus 2*30 hrs., or between 70 hrs. and 190 hrs. And 99.7% is the 3σ interval, so this % of planes falls within the range 130 hrs. plus/minus 3*30 hrs., or between 40 and 220 hrs. There are almost no planes outside this range… Airplane promoting sunny vacation resort…


Comparing coffee machines : Comparing coffee machines Jura produces the S90 and E75 machines. After 5 years of marketing these products and following their repair services, Jura has found that these machines differ in annual output. The S90 has a mean of 220 liters per annum of quality output without repairs. The E75 has a mean of 265 liters per annum. The variability of the S90 is higher than the variability of the E75. Jura has measured that σ of the S90 is 32 liters and that of the E75 is 21 liters. Quickly sketch these distributions… Define the 1, 2 and 3 sigma levels for both machines…


The standard normal distribution : The standard normal distribution If we were to follow all the different N distributions (all different combinations of mean and standard deviation), we would need a bible of tables to find the % probability under the bell curve. We can simplify this by using the standard normal distribution (Z). The z-score for a normal distribution is: z = (x-µ)/σ. Remember that we did this to compare the height of Jordan and Lobo?


What if the population is not normally distributed… : What if the population is not normally distributed… In many cases the population is not normally distributed and/or we have no knowledge about its actual distribution. However, provided that the sample size we take is big enough (n>=30), the sampling distribution of the mean can still be assumed to be normal. This is what is known as the Central Limit Theorem… This theorem specifies: for large simple random samples from a population that is not normally distributed, the sampling distribution of the mean will be approximately normal, with mean µ(s) = µ and standard deviation σ(s) = σ/n^0.5. If the sample size is increased, the sampling distribution will be closer to the normal distribution…


Reconsider the planes case : Reconsider the planes case The original mean was 130 hrs.; the z-score for 130 is (130-130)/30 = 0. The original 170 hours has z-score (170-130)/30 = 1.33. And for, say, x=100 in the original distribution the z-score is (100-130)/30 = -1.00. For each value of z = -3, -2, -1, 0, 1, 2, 3, etc. there is an original x-value. Draw this distribution with 2 scales, for x and z respectively…
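
The two scales for the planes case can be tabulated with a short sketch, assuming Python. The helper names z_score and x_value are hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

def x_value(z, mu, sigma):
    """Original value that corresponds to a given z-score."""
    return mu + z * sigma

mu, sigma = 130, 30  # flying hours per annum

print(round(z_score(170, mu, sigma), 2))
print(z_score(100, mu, sigma))

# x-axis labels that line up with z = -3 ... 3 on the second scale.
for z in range(-3, 4):
    print(z, x_value(z, mu, sigma))
```

The z = -2 and z = 2 rows give the 70 to 190 hour interval, and the z = -3 and z = 3 rows give the 40 to 220 hour interval from the General Aviation example.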


Remember Poisson distribution? : Remember Poisson distribution? Poisson measures the probability that a discrete number of arrivals takes place in a time period. The distribution that measures the time interval in between these arrivals is a continuous distribution and is thus the counterpart of the Poisson distribution. This continuous distribution is called the exponential distribution. For x = the length of the interval (time, distance, etc.) between occurrences: f(x) = λ*e^(-λx), where both x and λ should be >0. λ = the mean (and variance) of the Poisson distribution; 1/λ = the mean and standard deviation of the corresponding exponential distribution


Exponential distribution-911 : Exponential distribution-911 We again measure areas under the curve of the distribution f(x). P(X>k) = e^(-λk), where k = the time, space or distance until the next occurrence… Let us consider a real life case: 911 calls in the US (112 in the Netherlands). Calls to 911 in New York have been found to be Poisson distributed with an average of about 10 calls per hour… If we measure in minutes, what is λ and what is 1/λ of the corresponding exponential distribution? What is the probability that the next call will occur at least 5 minutes from now (not earlier than that)? Remember: P(X>k) = e^(-λk)


Your answer… : Your answer… λ = 10/60 = 1/6 calls per minute, so 1/λ = 6 minutes between calls on average. If k=5: P(x>5) = e^(-(1/6)*5) = 0.4346. If you calculate this for x = 0, 1, 2, 3, 4, 6, 7, etc. you can draw this exponential distribution. The whole area under the curve is 1. The area for k=5 or higher under the curve thus represents 43.46% of the whole area. Now calculate the probability that the next call will arrive between 3 and 8 minutes from now…


Use the same metrics : Use the same metrics P(x>3) = 0.6065, P(x>8) = 0.2636. Subtract: P(x>3)-P(x>8) = 0.3429. Draw! Some exponential distributions for some λ’s
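
The waiting-time probabilities for the 911 calls follow from P(X>k) = e^(-λk). A minimal sketch, assuming Python; prob_wait_longer_than is a hypothetical helper name.

```python
import math

def prob_wait_longer_than(k, lam):
    """P(X > k) for an exponential waiting time with rate lam."""
    return math.exp(-lam * k)

lam = 10 / 60  # 10 calls per hour, expressed as calls per minute
mean_wait = 1 / lam  # average minutes between calls

p_gt_5 = prob_wait_longer_than(5, lam)

# P(3 < X < 8) is the area beyond 3 minus the area beyond 8.
p_between_3_and_8 = prob_wait_longer_than(3, lam) - prob_wait_longer_than(8, lam)

print(round(mean_wait, 1), round(p_gt_5, 4), round(p_between_3_and_8, 4))
```

Working with the areas beyond a point keeps every calculation in the same form; any interval probability is then a difference of two such areas.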


MTBF for laser printers : MTBF for laser printers HP sells laser printers and has found that the Mean Time Between Failures of the above model is about 4000 hours. The corresponding variable, x = hours between failures, is exponentially distributed. What is the probability that this machine will operate for another 2500 hours without experiencing a failure? What is the probability that the next failure will occur within the next 3000 hours?


Your answer… : Your answer… MTBF = 4000 hours, so λ = 1/4000 = 0.00025 failures per hour… P(X>2500) = e^(-(1/4000)*2500) = 2.71828^(-0.625) = 0.5353 = 53.53%. P(X<3000) = 1-P(X>3000) = 1-e^(-(1/4000)*3000) = 1-0.4724 = 0.5276 = 52.76%


Moving to inferential statistics… : Moving to inferential statistics… Using sample data to learn about a population Using information from samples to draw conclusions on the population is known as inferential statistics


Sampling distributions : Sampling distributions We are now looking at samples to learn about a population (mean and standard deviation) If we take many samples we generate a series of means and standard deviations This in itself is a new probability distribution with a mean and standard deviation… (droste effect)


For processes that are “in control” : For processes that are “in control” The means of samples (with n>30) should tend to be normally distributed, with the mean of these sample means being close to µ. The standard deviation of these sample means (the standard error) is then close to σ/n^0.5. There is a 95.5% probability that the sample mean will be within z=2 standard errors of the mean of the population, and a 99.73% probability that the sample mean will be within z=3 standard errors of the mean of the population. This rule follows from the Central Limit Theorem


Olive Oil Filling machine : Olive Oil Filling machine Sold by the supplier with a process standard deviation of 0.1 ounces and said to fill olive oil cans with 12 ounces of product. We take samples at this machine with n=30. Between what weight levels should 95.5% of the sample means fall on this machine (the 2 sigma interval)?


Your answer… : Your answer… 12 ounces + or – 2*(σ/n^0.5) = 12 ounces + or – 2*(0.1/30^0.5) = from 11.963 ounces to 12.037 ounces. If the sample means fall consistently outside this interval then the machine has drifted off from specs! Call your supplier…


Remember the 1 engine planes? : Remember the 1 engine planes? Say that we now take a sample of 1 engine planes with n=36. Remember that we think µ = 130 hrs. The sample mean is 138 hrs. Remember that σ = 30 hrs. was assumed. So the standard deviation of the sample mean (the standard error) is? The corresponding z-score is? What is the probability that the average flying time of the planes in the sample (36) was at least 138 hrs.? Draw the distribution!


Your answer… : Your answer… Standard deviation of the sample mean = σ/n^0.5, so that is: 30 hrs./36^0.5 = 5 hrs. Z-score = (sample mean - pop. mean)/st. dev. of sample mean = (138-130)/5 = 1.6. The z-table at z=1.6 gives 0.4452, and this is the area in between 0 (the mean of the z distribution) and z=1.6. Thus the area beyond z=1.6 is the complement: 0.5-0.4452 = 0.0548 = 5.48%. Draw so that µ=130 falls at the same point as z=0 and x=138 falls at the same point as z=1.6… (Figure labels: X=130 at Z=0; X=138 at Z=1.6; areas .4452 and .0548; n=36 planes)
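
The same tail probability can be computed without a z-table. A minimal sketch, assuming Python 3.8 or later, where the standard library provides statistics.NormalDist; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 130, 30, 36   # population mean, population st. dev., sample size
sample_mean = 138

standard_error = sigma / math.sqrt(n)      # 30 / 6 = 5 hours
z = (sample_mean - mu) / standard_error    # 8 / 5 = 1.6

# NormalDist() is the standard normal; cdf(z) is the area to the left of z.
area_left = NormalDist().cdf(z)
p_at_least_138 = 1 - area_left
print(round(area_left, 4), round(p_at_least_138, 4))
```

Note that cdf gives the cumulative area from minus infinity, like NORMSDIST in Excel, whereas the printed z-table gives only the area between 0 and z.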


Using the standard normal distribution table: : Using the standard normal distribution table: Z=1.600 gives 0.4452 directly. Using Excel: use the NORMSDIST function and enter z=1.6. This will provide the cumulative probability 0.9452. Interpret the result: subtract 0.5 to find 0.4452…


Degrees of freedom (Df) : Degrees of freedom (Df) Crucial for the understanding of distributions. Why divide by (n-1) instead of n when calculating the sample variance s^2? Suppose we consider the four data points x1=10, x2=12, x3=16, x4=18. The sample mean of these data is:… Given that the mean is known, how many of our data points are Free To Move? If we have 1 missing value, this one is predetermined if we know the mean value. What happens if we are missing 2 values in the sample, say x3 and x4? We now know that x3+x4=34. If we choose a value for x3 then x4 is determined if the mean is known (this is assumed). So when we have n data points and we know their mean, the mean acts as a restriction (deviations from the mean sum up to zero), leaving us with (n-1) degrees of freedom!
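
The restriction that the deviations sum to zero, and the resulting division by (n-1), can be seen on the four data points. A minimal sketch, assuming Python and its standard statistics module.

```python
import statistics

data = [10, 12, 16, 18]
mean = statistics.mean(data)

# Deviations from the mean always sum to zero, so once n - 1 of them
# are known the last one is fixed: only n - 1 are free to move.
deviations = [x - mean for x in data]
print(mean, deviations, sum(deviations))

# Sample variance: divide the squared deviations by n - 1, not n.
sample_variance = sum(d ** 2 for d in deviations) / (len(data) - 1)
print(round(sample_variance, 3))

# The standard library makes the same distinction:
# variance() divides by n - 1, pvariance() divides by n.
print(round(statistics.variance(data), 3), statistics.pvariance(data))
```

The difference between the two divisors matters most for small samples like this one and fades as n grows.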


Normally we do not know µ and σ : Normally we do not know µ and σ In this case we use the t-distribution, also called the Student distribution, to calculate probabilities: t = (sample mean - µ)/(s/n^0.5), with s = the standard deviation of the sample. The mean of this distribution is 0. The distribution is more spread out than the Normal distribution but has a similar shape (with thicker tails). To use the t-table you need to know the degrees of freedom; this is (n-1)


Sample employees : Sample employees n=90 employees from manufacturing. The average number of overtime hours last week in this group was 8.46 hours. The sample's standard deviation was s=3.61 hrs. What is the 98% confidence interval for the population mean? Remember df=(n-1)=89. The confidence interval for the population mean is centred on the sample mean and extends a distance of t*s/n^0.5 on either side; s and n are known, and t can be found in the t-table. For a 98% confidence interval the 0.01 column is needed (for a 90% interval the 0.05 column); thus (1-confidence level)/2 is your column… GO!


Your answer… : Your answer… Sample mean: 8.46 hours. The t value in the table at column 0.01 and df=89 is 2.369; s=3.61 hrs. and n=90. 8.46 + or – 2.369*3.61/90^0.5 = 8.46 + or – 0.90. With a reliability of 98% the population mean is anywhere in between 7.56 and 9.36 hrs. of overtime
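
The interval arithmetic can be sketched as follows, assuming Python. The standard library has no t-distribution, so the critical value 2.369 is taken from the t-table as in the slide; t_confidence_interval is a hypothetical helper name.

```python
import math

def t_confidence_interval(sample_mean, s, n, t_critical):
    """Confidence interval for the population mean when sigma is unknown.
    t_critical must be looked up in a t-table for df = n - 1."""
    margin = t_critical * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Overtime example: n = 90, mean 8.46 hrs, s = 3.61 hrs,
# t = 2.369 from the table (0.01 column, df = 89).
low, high = t_confidence_interval(8.46, 3.61, 90, 2.369)
print(round(low, 2), round(high, 2))
```

Only the critical value changes when a different confidence level is wanted; the rest of the calculation stays the same.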


What Sample size… : What Sample size… Remember the rule of thumb n=N^0,5 A more precise calculation is: n= (z^2*σ^2)/(max.error acceptable) ^2 Z= desired level of confidence corresponding z value σ= known or estimated standard deviation population max. error = concrete value of acceptable error


Teenagers market… : Teenagers market… Marketers always keep an eye on teenagers. A marketer for mobile phones wants to know the average amount that teenagers earn during the summer holidays. The marketer wants 95% confidence that the sample mean is within EUR 50 of the actual population mean (all teenagers in NL). The marketer has estimated that σ = EUR 400. What should be the sample size? (Apply the formula: n = (z^2*σ^2)/error^2)


Your answer… : Your answer… 95% confidence refers to z=1,96 error= EUR 50 σ= EUR 400 (assumed) So applying the formula: n= 1,96^2*400^2/50^2= 246 (rounded)


Say we want an error of 25 EUR : Say we want an error of 25 EUR To get a feel for the relation between accuracy and the size of the sample what should then be the sample size? And what if the error should be 5 EUR? 983 ? 25000 ? 34879? 123900? Or…


Your answer… : Your answer… 1.96^2*400^2/25^2 = 984 (rounded). Halving the accepted error increases the required sample size by a factor of 4 (from 246 to 984). And with a small error like 5 EUR: 1.96^2*400^2/5^2 = 24587 (rounded)…
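
The relation between the accepted error and the sample size is easy to explore in code. A minimal sketch, assuming Python; sample_size_for_mean is a hypothetical helper that applies n = (z^2*σ^2)/error^2 and rounds up.

```python
import math

def sample_size_for_mean(z, sigma, max_error):
    """Smallest n for which z * sigma / sqrt(n) does not exceed max_error."""
    return math.ceil((z * sigma / max_error) ** 2)

z, sigma = 1.96, 400  # 95% confidence, estimated st. dev. in EUR

for max_error in (50, 25, 5):
    print(max_error, sample_size_for_mean(z, sigma, max_error))
```

Because the error sits in the denominator squared, dividing the accepted error by 10 (from EUR 50 to EUR 5) multiplies the required sample size by about 100.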


A travel agency…assignment: : A travel agency…assignment: Wants to determine the proportion of US adults that have ever vacationed in Mexico The agency wants to be 95% confident that the sample error will be no more than 3% Assuming the travel agency has no idea about the actual value of the population proportion, what sample size is necessary to have 95% confidence that the sample proportion will be within 3% of the actual population proportion? How many US adults vacationed in Mexico?


Your answer… : Your answer… For the 95% confidence (see table) the z-value will be 1.96 If the agency has no idea about the population ratio they will use p=0.5 (or (1-p)=0.5) Following the formula: n=z^2*p*(1-p)/e^2=1.96^2*0.5*0.5/(0.03^2)= 1067.1 (always round up) 1,068 If the agency believes however that the ratio in the population in no case is higher than p=0.3 then: 1.96^2*0.3*0.7/(0.03^2)=896.4 rounded 897…
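
The proportion version of the sample size formula can be sketched the same way, assuming Python; sample_size_for_proportion is a hypothetical helper name.

```python
import math

def sample_size_for_proportion(z, p, max_error):
    """Smallest n for which the margin z * sqrt(p * (1 - p) / n)
    does not exceed max_error."""
    return math.ceil(z ** 2 * p * (1 - p) / max_error ** 2)

# No idea about the population proportion: p = 0.5 is the worst case.
print(sample_size_for_proportion(1.96, 0.5, 0.03))

# Proportion believed to be at most 0.3.
print(sample_size_for_proportion(1.96, 0.3, 0.03))
```

The product p*(1-p) is largest at p = 0.5, which is why that value is the safe choice when nothing is known about the population.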


Hypothesis Testing (Samples) : Hypothesis Testing (Samples) The new Fat Free Pringles… Are they as good as the existing tastes (containing fat)? 44 testers were given 2 bowls of Pringles…one with and one without fat If the two bowls tasted the same then each tester would have a chance of 50% to correctly indicate the bowl that contained fat and the one that did not… However 25 of 44 testers (almost 57%) identified the bowl without fat correctly Does that result mean that Pringles failed in their attempt to make the Pringles taste the same as the Regular Pringles? We will use hypothesis testing to find the right (statistical) conclusion…


Step 1) formulate H0 (null hypothesis) : Step 1) formulate H0 (null hypothesis) A statement about a population parameter The H0 should be “a nothing out of the ordinary statement” and is usually believed to be true unless we have overwhelming statistical evidence it is not true… The alternative hypothesis (H1) holds if the H0 hypothesis is false…


Directional versus Non-directional testing : Directional versus Non-directional testing A directional test/claim is one in which the population parameter is believed to be smaller than or equal to… or bigger than… a certain value (inequality). For instance: no more than 20% of the cans are damaged… In this case we use a one-tail test. A non-directional test/claim is one in which the population parameter is believed to be equal to a certain value (equality). For instance: 35% of car drivers are senior citizens… In this case we use a two-tail test


Errors in hypothesis testing… : Errors in hypothesis testing… Reject H0 while H0 is true (type 1 error; the probability of this type of error is α (alpha) = the level of significance). Accept H0 while H0 is false (type 2 error; its probability is expressed as β (beta))


Step 2) Choose level of significance…(alpha level) : Step 2) Choose level of significance…(alpha level) Basically by choosing the significance level we are choosing the chance that we will make a type 1 error during our test… Significance levels of 10%, 5% or 1% are quite common (either one sided-for directional tests or two sided for non-directional tests)


Step 3) Choose the test statistic… : Step 3) Choose the test statistic… In most cases this will be the z-statistic or t-statistic, corresponding to the normal and t-distribution… Run a z-test if σ is known and the population is assumed to be normally distributed. Run a t-test if σ is unknown…


Using the z-statistic (σ known) - Example : Using the z-statistic (σ known) - Example A robot welder is in adjustment Its mean time to perform its task is 1.3250 minutes Past experience has found that the standard deviation of the cycle time is 0.0396 minutes An incorrect mean operating time can disrupt the efficiency of other activities along the production line For a recent sample of 80 jobs (n=80) the data are in attached Excel file: Welding Robot…


Using z-statistic (continued) : Using z-statistic (continued) Alpha=5%


Using z-statistic : Using z-statistic 1) Select the significance level: we are doing a two sided test with let’s say alpha=5% (2.5% at each side) i.e. if the robot is running properly there is only 5% chance that we make the mistake to conclude it needs adjustment… 2) The assumed mean under H0: µ=1.3250 so H1: µ≠1.3250 3) Calculate the test statistic: z= (x (sample average)-1.3250)/σ(s) Assuming normal distribution σ(s)= σ/√n So: z= (1.3229-1.3250)/(0.0396/√80) = -0.47


Using z-statistic (continued) : Using z-statistic (continued) 4) Identify critical values for test statistic at α=5% z= -1.96 or z=+1.96 5) Compare calculated test statistic with critical values in order to be able to decide to accept H0 or reject it…; in this case the calculated z value (-0.47) falls within the non-rejection region so at α=5% the null hypothesis can not be rejected! Based on this the robot welder is not in need of adjustment…and thus the difference between the population mean of 1.3250 and the sample mean of 1.3229 minutes is due to chance variation…
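The five steps for the robot welder can be retraced in a few lines. This is only a sketch using the summary figures quoted on the slides (the raw Excel data are not available here); the variable names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Summary values taken from the slides
mu_0 = 1.3250    # mean cycle time under H0 (minutes)
sigma = 0.0396   # known population standard deviation
n = 80           # sample size
x_bar = 1.3229   # sample mean
alpha = 0.05     # two-sided significance level

standard_error = sigma / sqrt(n)
z = (x_bar - mu_0) / standard_error
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value
reject_h0 = abs(z) > z_crit

print(f"z = {z:.2f}, critical values = +/-{z_crit:.2f}, reject H0: {reject_h0}")
```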


A 95% confidence interval… : A 95% confidence interval… In this case the 95% confidence interval would have been: try…


Your answer… : Your answer… X (average)± z*σ/√n= 1.3229±1.96*0.0396/√80= 1.3142 (lower limit) and 1.3316 (upper limit) around the sample mean The hypothesized value 1.3250 lies inside this confidence interval, which tells us that the mean of the population could indeed be 1.3250…
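For checking the interval, a small helper along these lines might be handy. It assumes a known σ and a normal sampling distribution, as on the slide; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when sigma is known."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


# Robot welder figures from the slides
lower, upper = z_confidence_interval(1.3229, 0.0396, 80)
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
print("1.3250 inside interval:", lower <= 1.3250 <= upper)
```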


Z-statistic test (one tail) : Z-statistic test (one tail) Light bulbs have mean life time of 1030 hrs with stdev 90 hrs A company is considering buying a new bulb with a longer life time (same price) The management wants to make sure that the new bulb has a longer life time since they use a large number of them A sample of n=40 is taken of the new bulb The mean life time of the sample bulbs is found to be 1061.6 hrs


Underlying bulb data… : Underlying bulb data…


Class assignment: Follow the test procedure… : Class assignment: Follow the test procedure… H0: µ≤1030 hrs (suggesting that the life time of the new bulbs is not longer than that of the old ones…) H1: µ>1030 hrs Let’s choose α=5%: the chance of believing that the new bulb has a longer life while that’s not true Test statistic: z= ? Critical value: at 5% z= ? So do we reject or accept H0 ? Is the difference between the old mean of 1030 hrs and the mean of the sample 1061.6 hrs of the new lamp too large to have occurred by chance…or not?


Your answer… : Your answer… Test statistic: z= (1061.6-1030)/(90/ √40)= 2.22 Critical value: at 5% z=+1.645 So we reject H0 and accept H1 that the new bulb’s life is longer than the old one… The difference between the old mean of 1030 hrs and the mean of the sample 1061.6 hrs is too large to have occurred by chance…


Using Minitab for testing… : Using Minitab for testing… For the 40 bulbs tested open the Minitab sheet Use: basic statistics/z-test Enter in input range the hypothesized mean of 1030 hrs Enter the known population stdev=90 into the sigma box Click labels Enter the level of significance 5% into the alpha box under options… Note the p-value of 0.013 telling you that, if H0 were true, there is only 1.3% probability of getting a sample mean this large (1061.6) by chance…
One-Sample Z
Test of mu = 1030 vs > 1030
The assumed standard deviation = 90
 N    Mean  SE Mean  95% Lower Bound     Z      P
40  1061.6     14.2           1038.2  2.22  0.013
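If Minitab is not at hand, the z-value and p-value for the bulbs can be approximated with plain Python. This is a sketch from the summary figures on the slides, not from the underlying bulb data.

```python
from math import sqrt
from statistics import NormalDist

# Summary values from the light bulb slides
mu_0 = 1030.0   # mean life time under H0 (hours)
sigma = 90.0    # known population standard deviation
n = 40          # sample size
x_bar = 1061.6  # sample mean of the new bulbs
alpha = 0.05    # one-tail significance level

z = (x_bar - mu_0) / (sigma / sqrt(n))
z_crit = NormalDist().inv_cdf(1 - alpha)  # upper-tail critical value
p_value = 1 - NormalDist().cdf(z)         # P(Z >= z) under H0

print(f"z = {z:.2f}, critical value = {z_crit:.3f}, p-value = {p_value:.3f}")
print("reject H0" if z > z_crit else "cannot reject H0")
```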


Testing with the t-test (σ unknown) two tail test… : Testing with the t-test (σ unknown) two tail test… A credit manager claims that the average account of customers is $410 An auditor takes a sample of 18 accounts and finds a mean of $511.33 and a standard deviation of $183.75 Sample data are in the next slide If the sample results do not support the manager’s claim then the auditor will check all the accounts (and the company will have to pay for that) What should the auditor do?


Underlying data… : Underlying data…


Follow the test procedure… : Follow the test procedure… H0: µ=$410 H1: µ≠$410 n=18 α=5% (two tail test) We are using the t-statistic so the t-distribution will be used to describe the sample distribution t=(x average - µ)/(s/√n) = ($511.33-$410)/($183.75/√18)=2.34 Critical values: n=18 df=n-1=17 So from t-table: at 2.5% in each tail t=±2.11 The calculated t falls in the rejection area since it exceeds 2.11 so we will reject H0 The auditor should, based on this, proceed to check all accounts (and charge the cost to the company…)
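The auditor's calculation could be scripted roughly as follows. The critical value 2.110 is simply typed in from a t-table (df=17, 2.5% in each tail) because the Python standard library has no t-distribution; everything else comes from the slide.

```python
from math import sqrt

# Summary values from the slide
mu_0 = 410.0    # claimed average account ($)
n = 18          # number of sampled accounts
x_bar = 511.33  # sample mean ($)
s = 183.75      # sample standard deviation ($)

# Assumption: table value for df = 17 and 2.5% in each tail, not computed here
t_crit = 2.110

df = n - 1
t = (x_bar - mu_0) / (s / sqrt(n))
reject_h0 = abs(t) > t_crit

print(f"t = {t:.2f} with df = {df}, critical values = +/-{t_crit}")
print("reject H0: check all accounts" if reject_h0 else "cannot reject H0")
```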


Testing with the t-test (σ unknown) one tail test… : Testing with the t-test (σ unknown) one tail test… A tire company claims that a new tire has a mean life time of 60,000 miles A skeptical car magazine wants to test this claim and takes a sample of 36 tires The test data are in the next slide Life time: 60,000 miles?


Follow the test procedure… : Follow the test procedure…


Follow the test procedure… : Follow the test procedure… H0: µ≥ 60,000 miles H1:µ< 60,000 miles Let’s say we use significance level α=0.01 The test statistic t=(x average - µ)/(s/ √n) Calculate t= (58,341.69- 60,000)/605.42=-2.739 Critical t-value: at α=0.01; t=-2.438 So the calculated t value based on the sample lies in the rejection area and we will not accept H0; the editor’s doubt with respect to the lifetime of the tires seems to be justified by the sample results…


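Using Minitab for t-tests… : Using Minitab for t-tests… Basic statistics, simple t-test
One-Sample T
Test of mu = 60000 vs > 60000
 N   Mean  StDev  SE Mean  95% Lower Bound      T      P
36  58342   3623      604            57322  -2.75  0.995
Note: this run used the alternative "greater than" (vs > 60000), hence P=0.995; for the lower-tail test H1: µ<60,000 the p-value is 1-0.995=0.005, which is below α=0.01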


Testing Proportions : Testing Proportions Sometimes we want to test a proportion in a population based on a sample Voting % (claim 65% test…) Unsatisfied customers (claim 5% test…) Operating spec of a machine (claim supplier 3% test…)


Two tail Test on Proportion…(example) known σ : Two tail Test on Proportion…(example) known σ The director of an MBA school claims that 70% of graduated students after 3 years end up in a job related to their MBA study… A sample of n=200 is taken to test this claim and from this sample it was calculated that 66% of the students fulfill the claim Use the 5% significance level The estimated stdev of the sample proportion is: S=√(p*(1-p)/n)= √(0.70*0.30/200)=0.0324 The calculated value of z= (p(s)-p)/S= (0.66-0.70)/0.0324= -1.23 Critical values z=±1.96 So the calculated z-value lies in the non-rejection area Conclusion: the claim of the MBA Program Director could indeed be true…


One tail test on proportions known σ… : One tail test on proportions known σ… The US government closes hospitals with mortality rates of over 5% From a sample we know that in one hospital 100 operations have been performed with mortality rate 7% At the 1% level of significance was the mortality rate of this hospital significantly greater than 5%? Perform the test and draw a conclusion….


Your answer…. : Your answer…. Ho: p≤5% H1: p>5% α=1% S=√(p*(1-p)/n)= √(0.05*0.95/100)=0.02179 The calculated value of z= (p(s)-p)/S= (0.07-0.05)/0.02179= 0.92 Critical value at α=1% (one tail): z=+2.33 So if z>2.33 we reject Ho, but in our case the calculated z=0.92 is smaller than the critical value and thus we will accept Ho Conclusion: the mortality of the hospital is not significantly higher than 5% and could as well be 5% (that is what we tested); that it is 7% in the sample might be due to chance only…


Try that in Minitab (basic stats) : Try that in Minitab (basic stats)
Test and CI for One Proportion
Test of p = 0.7 vs p not = 0.7
Sample    X    N  Sample p  95% CI                Exact P-Value
1       132  200  0.660000  (0.589844, 0.725332)  0.247
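Both proportion examples (MBA graduates, hospital mortality) can be retraced with the normal approximation used on the slides. The sketch below follows the convention z=(p(s)-p)/S, so the MBA z-value comes out negative. Note that Minitab's exact binomial p-value for the MBA case (0.247) will differ somewhat from the normal-approximation p-value computed here. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(p_sample: float, p_0: float, n: int) -> float:
    """z-statistic for a sample proportion against a hypothesized p_0."""
    standard_error = sqrt(p_0 * (1 - p_0) / n)
    return (p_sample - p_0) / standard_error


normal = NormalDist()

# MBA example: two-tail test at alpha = 5%
z_mba = one_proportion_z(0.66, 0.70, 200)
p_mba = 2 * normal.cdf(-abs(z_mba))
reject_mba = abs(z_mba) > normal.inv_cdf(0.975)

# Hospital example: one-tail (upper) test at alpha = 1%
z_hospital = one_proportion_z(0.07, 0.05, 100)
p_hospital = 1 - normal.cdf(z_hospital)
reject_hospital = z_hospital > normal.inv_cdf(0.99)

print(f"MBA:      z = {z_mba:.2f}, p = {p_mba:.3f}, reject H0: {reject_mba}")
print(f"Hospital: z = {z_hospital:.2f}, p = {p_hospital:.3f}, reject H0: {reject_hospital}")
```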


Pooled Variance t-Test for two independent samples… : Pooled Variance t-Test for two independent samples… Sometimes we would like to know whether the difference between the means of two independent samples is large enough to reject the possibility that their population means are the same… Comparing the speed of two printer models Comparing the tensile strength of steel bars


Example… : Example… A new training program for CPA’s developed by a software company has 2 versions 10 students are trained with format 1 their performance is stated in attached table (nr of errors) 12 students are trained with format 2 their performance is stated in attached table (nr of errors) Test if the two formats are significantly different… Take significance-level 10%


Follow the test procedure… : Follow the test procedure… Ho: µ1=µ2 H1: µ1≠µ2 Two tailed test with α=10% Format 1 students sample shows average nr of errors of 6 with stdev 3.127 Format 2 students sample shows average nr of errors of 8.167 with stdev 3.326 t=((x(s1)-x(s2))-(µ1-µ2))/√(s^2*(1/n1+1/n2)) where the pooled variance is s^2=((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2)= ((10-1)*(3.127)^2+(12-1)*(3.326)^2)/(10+12-2)= 10.484 So t=((6-8.167)-0)/√(10.484*(1/10+1/12))= -1.563 Critical values (5% each side) t=±1.725 note df=20 The t-statistic falls in the non-rejection area so we cannot reject Ho: based on these results the two formats could well have the same population mean number of errors…
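The pooled-variance calculation is easy to get wrong by hand, so a short check may help. This sketch works from the summary statistics on the slide; the critical value 1.725 is typed in from a t-table (df=20, 5% in each tail) rather than computed.

```python
from math import sqrt

# Summary statistics from the slide (number of errors per student)
n1, mean1, s1 = 10, 6.0, 3.127    # format 1
n2, mean2, s2 = 12, 8.167, 3.326  # format 2

# Assumption: table value for df = 20 and 5% in each tail, not computed here
t_crit = 1.725

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
t = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
reject_h0 = abs(t) > t_crit

print(f"pooled variance = {pooled_var:.3f}, t = {t:.3f}, df = {df}")
print("reject H0" if reject_h0 else "cannot reject H0")
```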


Linear Regression and correlation : Linear Regression and correlation Simple linear regressions have the form: Yi=β0 + β1 Xi + εi Where: Y= value of the dependent (endogenous) variable and i indicates the ith measurement Where: X= value of the independent (exogenous) variable (i indicates similar) For each X the Y values follow a normal distribution and vice versa β0= the y-intercept of the regression line β1= the slope of the regression line ε= random error or residual value; the difference between the actual value of Y and the value of Y described by the line β0+β1*X for every i


How to find the best fitting line? : How to find the best fitting line? The ordinary least squares (OLS) method minimizes the sum of the squared ε’s, the vertical distances from the points to the line So for each measured actual value of Y for a given X the difference with the predicted value of Y (by the regression line) is calculated, squared and added The line with the lowest ∑ εi^2 is the best fitting line for the given actual points Let’s take an example…


We measured the following relation : We measured the following relation What line describes the best fit based on OLS: Y=7+2X or Y=1+ 3X ? GO!
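The measured points for this slide are not reproduced in the transcript, so the data below are purely hypothetical and serve only to show how the two candidate lines would be compared: compute ∑ εi^2 for each line and pick the smaller one.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared vertical distances between points and the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements (NOT the data from the slide)
xs = [1, 2, 3, 4, 5]
ys = [5, 6, 10, 13, 16]

sse_line_a = sum_squared_errors(xs, ys, b0=7, b1=2)  # Y = 7 + 2X
sse_line_b = sum_squared_errors(xs, ys, b0=1, b1=3)  # Y = 1 + 3X

print(f"SSE for Y=7+2X: {sse_line_a}")
print(f"SSE for Y=1+3X: {sse_line_b}")
print("better fit:", "Y=7+2X" if sse_line_a < sse_line_b else "Y=1+3X")
```

With the real measurements from class the outcome may of course differ.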


In Excel…Data Analysis : In Excel…Data Analysis Use Fx Regression Excel will estimate b0 (intercept) and b1(slope)…for the best fitting line…


OLS Assignment: finding values for b0 and b1 : OLS Assignment: finding values for b0 and b1 The slope of the regression line is defined as: b1=(∑xi*yi – n*x(mean)*y(mean))/((∑xi^2)-n*(x(mean))^2) The Y-intercept can be found as: b0= Y(mean)-b1(X(mean)) since the OLS regression line always passes through (X(mean),Y(mean)) Apply these formulas in finding b0 and b1 in the following case;


Homework case: MBA points and assignment productivity : Homework case: MBA points and assignment productivity Prof.B. thinks there is a relationship between the Business Statistics score of 5 students and their assignment productivity He measures: Determine b0 and b1 Determine the regression line… Show the calculations!
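To check hand calculations of b0 and b1, the two OLS formulas can be coded directly. The scores below are hypothetical placeholders (the homework table is not reproduced in the transcript); replace them with the five measured pairs.

```python
def ols_coefficients(xs, ys):
    """Return (b0, b1) using b1 = (sum(xy) - n*x_mean*y_mean) / (sum(x^2) - n*x_mean^2)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (sum_xy - n * x_mean * y_mean) / (sum_x2 - n * x_mean**2)
    b0 = y_mean - b1 * x_mean  # the line passes through (x_mean, y_mean)
    return b0, b1


# Hypothetical data (NOT the homework figures)
xs = [1, 2, 3, 4, 5]
ys = [5, 6, 10, 13, 16]

b0, b1 = ols_coefficients(xs, ys)
print(f"regression line: Y = {b0:.2f} + {b1:.2f} X")
```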


Assignment to verify : Assignment to verify Verify your regression line estimates in Excel and determine the regression coefficient (measure of fit)


Covariance between 2 variables : Covariance between 2 variables Sxy=∑(xi-x(mean))*(yi-y(mean))/(n-1) Say that x= the number of commercials and y= sales value of the product in $100’s Calculate the sample covariance Sxy You need to know how to do this since you need it to calculate the correlation coefficient r xy (Data table: X, Y)


The correlation coefficient between x and y is: : The correlation coefficient between x and y is: r xy= Cov(x,y)/(sx*sy) = (∑(xi-mean X)(yi-mean Y)/(n-1))/(sx*sy) Calculate r xy for the commercials/sales volume case… Dr. Pepper commercial


Your answer… : Your answer… ∑(xi-x mean)(yi-y mean)=99 (n-1)=9 S xy= 99/9=11 Sx=(∑(xi-x mean)^2/(n-1))^0.5= 1.49 Sy= similar: 7.93 R xy= S xy/(S x*S y)= 11/((1.49)*(7.93))= +0.93 a strong positive correlation Check with Excel!


Linear correlation coefficient : Linear correlation coefficient Also called the Pearson product moment correlation coefficient Can also be calculated using: r= (n∑xy-(∑x)(∑y))/((n(∑x^2)-(∑x)^2)^0.5 * (n(∑y^2)-(∑y)^2)^0.5) You may want to check your findings of the commercials assignment… Abracadabra…?
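Both routes to r (via the covariance, and via the computational formula) should give the same number. The sketch below checks that. The commercials/sales table is not reproduced in the transcript, so the ten data pairs are an assumed stand-in, chosen only because they reproduce the summary figures quoted in the answer (∑=99, S xy=11, Sx=1.49, Sy=7.93); they should not be taken as the actual class data.

```python
from math import sqrt

# Assumed illustrative data (NOT from the slides): commercials (x), sales in $100's (y)
xs = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
ys = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]
n = len(xs)

x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Route 1: covariance divided by the product of the standard deviations
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
s_x = sqrt(sum((x - x_mean) ** 2 for x in xs) / (n - 1))
s_y = sqrt(sum((y - y_mean) ** 2 for y in ys) / (n - 1))
r_from_covariance = s_xy / (s_x * s_y)

# Route 2: computational (Pearson product moment) formula
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
r_computational = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x**2) * sqrt(n * sum_y2 - sum_y**2)
)

print(f"S xy = {s_xy:.2f}, Sx = {s_x:.2f}, Sy = {s_y:.2f}")
print(f"r (covariance route)    = {r_from_covariance:.4f}")
print(f"r (computational route) = {r_computational:.4f}")
```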


Homework assignment NYSE-Euro zone equity markets : Homework assignment NYSE-Euro zone equity markets Whatever your business is often you have to come up with: An idea Gather data Do research Analyse Structure your findings Advise your company or customers Refine your idea etc.


It is believed that all European equity markets follow the NYSE… : It is believed that all European equity markets follow the NYSE… As an investment banker you are asked by your potential customers to show the strong correlation between the NYSE and the AEX, DAX, Euronext, CAC, FTSE, DJ Euro 50… You have to assume a correlation and assume a model You have to gather raw data You have to calculate the correlation You have to describe the linear relation You have to advise your potential customers You have to test your theory…


Your homework assignment… : Your homework assignment… Gather data on these bourses Develop a model Test your model Assess correlation Refine your model (to improve correlation) Advise your customers…


Next week : Next week Your assignment portfolio defines your mid term grade! Simple Linear Regression and Correlation analysis!


Grading your Portfolio… : Grading your Portfolio… Failing to prepare is preparing to fail!


In praise of Thomas Bayes : In praise of Thomas Bayes Mathematical rule explaining how you should change your existing beliefs in the light of new evidence A set of developing information influences probabilities The Microsoft Office assistant (the paper clip that tries to help the user) uses this algorithm When the user calls for the assistant the computer analyses recent actions and changes the probability on different helping information…


Econometrics! : Econometrics! Application of statistical theory to economic investigations… Say your theory is that the profitability of a business is highly related to the industry it’s active in But there is an “omitted variable bias”: there are many other factors like leadership style, economic climate,… There is also the reverse causality problem; doing better helps to improve the economic climate… Robustness and precision affect the reliability of the outcomes; were enough situations measured? Econometrics is a rich field of analysis… Omitted variables? Reverse causality-autocorrelation? Robustness quest?


Figures of fun; remember scale : Figures of fun; remember scale Figure and fictitious come from the same Latin root! Remember what we said about scales… Compare the stock market charts for Bangkok and Manila


Assignment marketing statistics : Assignment marketing statistics You are marketing manager for baby clothes; you want to launch a new baby clothes line in a new area but you need to know more about the probability of baby births in the area In a new county last year 438 babies were born and this year’s figure will be about the same according to officials What is the probability that on a given day 0,1,2,3 etc. babies will be born if this natural process follows the Poisson distribution? Graph the distribution; is there, based on this, reason to believe the new clothes line should be launched during a specific season? You may want to use the Poisson function in Excel and graph from there… Is the market size attractive? Why/Why not?
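As an alternative to the Excel route, the daily birth probabilities can be sketched in Python. The only inputs are the 438 births per year from the slide and an assumed 365-day year, giving a mean of 438/365 = 1.2 births per day; the text "bar chart" is just a quick stand-in for a proper graph.

```python
from math import exp, factorial

births_per_year = 438  # from the slide
days_per_year = 365    # assumption: non-leap year, births spread evenly
lam = births_per_year / days_per_year  # mean births per day


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean lam."""
    return lam**k * exp(-lam) / factorial(k)


probabilities = {k: poisson_pmf(k, lam) for k in range(7)}

for k, p in probabilities.items():
    bar = "#" * round(p * 50)  # crude text bar chart
    print(f"P(X = {k}) = {p:.4f}  {bar}")
```

Because a single fixed mean is used for every day, this model by construction shows no seasonality; any seasonal launch argument would need monthly birth figures.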


Poisson in your company…assignment : Poisson in your company…assignment Try to find out in what processes of your company the Poisson distribution will be applicable. Define the process; assume a mean value and define X and T (time) Develop the distribution Plot the results Draw conclusions…


IQ’s are rising! : IQ’s are rising! IQ tests: mean=100 IQ 1932 σ = 15 2σ level IQ 2003 IQ tests: mean=120 !


Sugar in oranges! : Sugar in oranges! The citrus industry in Florida makes extensive use of statistics. Truckloads of oranges are weighed at the receiving juice plant. A dozen oranges are randomly selected; the amount of sugar in the squeezed juice of these oranges is the basis for payment for the load of oranges!


Thanks! see you next week… : Thanks! see you next week… r=% of class participants BUS 5760 versus class list Webster 100%= 7 students