stat 401 notes
opsheoran1968
Added: January 31, 2011
License: All Rights Reserved
Presentation Description: Descriptive Statistics

Presentation Transcript

Statistical Methods for Research Workers: Stat 401
Theory: Mon-Wed 10-11 A.M.
Practical: Thursday 10-12 A.M.
Mid term: 30 Marks
Final: 45 Marks
Practical: 25 Marks
Mid and Final Examinations will be held on the last day of the examination week announced by the University.

Books recommended
Theory and Problems of Statistics by M.R. Spiegel
Statistical Techniques in Agricultural and Biological Research by D. Raghava Rao
Statistical Methods by G.W. Snedecor and W.G. Cochran
Statistical Methods for Agricultural Workers by V.G. Panse and P.V. Sukhatme
Statistical Methods for Research Workers

Course Outline
Week 1: Registration and introduction of the course
Week 2: Frequency distribution, graphical representation of data, measures of central tendency, partition values and graphical method of their determination
Week 3: Measures of dispersion, moments, skewness and kurtosis
Week 4: Definition of probability, additive and multiplicative laws of probability and related problems, conditional probability
Week 5: Random variable, probability distribution, mathematical expectation and its properties
Week 6: Binomial and Poisson distributions and their related problems
Week 7: Normal distribution, normal curve, standard normal variate, related problems, normal approximation of the Binomial and Poisson distributions
Week 8: Concept of sampling from a population, simple random sampling (with or without replacement), sampling distribution of the mean, difference of means, proportion and difference of proportions
Week 9: Student's t-distribution, chi-square distribution and Snedecor's F-distribution, confidence intervals for means and difference of means
Week 10-11: Mid term examination

What is Statistics
"Statistics is a way to get information from data"
Data -> Statistics -> Information
Data: Facts, especially numerical facts, collected together for reference or information.
Information: Knowledge communicated concerning some particular fact.
Statistics may be defined as the science of collection, presentation, analysis and interpretation of numerical data.

Importance of Statistics
Statistics and Planning: Statistical tools relating to production, consumption, price, investment, income, expenditure etc., and various advanced statistical techniques for handling and analysing such complex data, are of great importance.
Statistics and Mathematics: Statistics is related to and dependent upon mathematics.
The modern theory of statistics has its foundation in the theory of probability, which is a particular branch of the more advanced mathematical theory of measure and integration. Hence statistics is a branch of applied mathematics which specialises in data.
Statistics and Economics: Various statistical analysis techniques are very useful in solving a variety of economic problems such as wages, price, consumption, production, distribution of income and wealth etc. Statistical tools like index numbers, time series analysis, demand analysis and forecasting techniques are extensively used for efficient planning and economic development of a country.
Statistics and Business: Statistics is an important tool of production control. In business, more and more statistical techniques are used for studying the needs and desires of consumers and for many other purposes. The success of a businessman more or less depends upon the accuracy and precision of his statistical forecasting.

Slide 5:
Statistics and Biology, Astronomy and Medical Sciences: The association between statistical methods and biological theories was first studied by Francis Galton in his work on regression. According to Professor Karl Pearson, the whole theory of heredity rests on a statistical basis. He says "The whole problem of evolution is the problem of vital statistics." In astronomy, the Gaussian "Normal Law of Errors" for the study of the movement of stars and planets was developed by using the principle of least squares. In medical sciences also, the statistical tools for the collection, presentation and analysis of observed facts relating to the causes and incidence of diseases, and the results obtained from the use of various drugs and medicines, are of great importance. Moreover, the efficacy of a manufactured drug or injection is tested by using a test of significance.
Statistics and Psychology and Education: In education and psychology statistics has found wide applications, e.g. determination of the reliability and validity of a test, factor analysis etc.

Limitation of Statistics
Statistics is not suited to the study of qualitative phenomena.
Statistics does not study individuals.
Statistical laws are not exact.
Statistics is liable to be misused.

Key Statistical Concepts…
Population: The population is the set of all measurements of interest to the investigator. A population may be finite or infinite.
Sample: A sample is a subset of measurements selected from the population of interest.
Variable: A variable is a characteristic that changes over time or varies across different individual subjects.
Variables can be classified as categorical (qualitative) or quantitative (numerical).
Categorical: Categorical variables take on values that are names or labels. Examples are the color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier).
Quantitative: Quantitative variables are numerical. They represent a measurable quantity. For example, when we speak of the population of a city, we are talking about the number of people in the city, a measurable attribute of the city. Therefore, population would be a quantitative variable.

Slide 8:
Quantitative variables can be classified as
Discrete: Can assume only a countable number of values.
Example: Flip a coin repeatedly and count the number of heads. The number of heads could be any integer value between 0 and plus infinity, but it could not be a non-integer value in that range. We could not, for instance, get 2.3 heads. Therefore, the number of heads must be a discrete variable.
Continuous: Can assume all of the infinitely many values corresponding to a line interval.
Example: The weight of students in a class is between 60 kg and 70 kg. The weight of a student would be an example of a continuous variable, since a student's weight could take on any value between 60 and 70 kg.
Classification of data
Broadly, data are classified into four classes.
Quantitative data: When the measurements of a variable are quantitative, the data are called quantitative data, e.g. weight of individuals, marks obtained by students etc.
Qualitative data: When we cannot make measurements but the observations can be differentiated, such as Sex (Male or Female) or Education (literate or illiterate).
Chronological data: When time series data are collected for a particular phenomenon, they are called chronological data, e.g. wheat production in Haryana in various years.

Year        Production
66-67       14.36
67-68       15.68
68-69       22.36
2009-10     256.38

Geographical data: When data are presented geographically, e.g. foodgrains production in various states of India.

Presentation of data
When observations are available on a single characteristic of a large number of individuals, it becomes necessary to condense the data as far as possible without losing any information of interest. The raw data as collected do not give any useful information. Hence data are arranged either in an array or in a frequency distribution so that some meaningful information can be extracted.
Let us consider the marks in statistics obtained by 70 students:

32 47 41 51 41 30 39 18 48 53
54 32 31 46 15 37 32 56 42 48
38 26 50 40 38 42 35 22 62 51
44 21 45 31 37 41 44 18 37 47
68 41 30 52 52 60 42 38 38 34
41 53 48 21 28 49 42 36 41 29
30 33 37 35 29 37 38 40 32 49

Slide 11:
A better way may be to express the figures in ascending or descending order of magnitude, commonly termed an array.

15 18 18 21 21 22 26 28 29 29
30 30 30 31 31 32 32 32 32 33
34 35 35 36 37 37 37 37 37 38
38 38 38 38 39 40 40 41 41 41
41 41 41 42 42 42 42 44 44 45
46 47 47 48 48 48 49 49 50 51
51 52 52 53 53 54 56 60 62 68

This arrangement does not reduce the bulk of data.
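As a small illustration of the array arrangement described above, the following Python sketch sorts the 70 raw marks into ascending order. The variable names `marks` and `array` are chosen here for illustration and are not part of the original notes.

```python
# Marks of the 70 students, copied from the raw listing in the notes.
marks = [
    32, 47, 41, 51, 41, 30, 39, 18, 48, 53,
    54, 32, 31, 46, 15, 37, 32, 56, 42, 48,
    38, 26, 50, 40, 38, 42, 35, 22, 62, 51,
    44, 21, 45, 31, 37, 41, 44, 18, 37, 47,
    68, 41, 30, 52, 52, 60, 42, 38, 38, 34,
    41, 53, 48, 21, 28, 49, 42, 36, 41, 29,
    30, 33, 37, 35, 29, 37, 38, 40, 32, 49,
]

# An "array" in the sense of the notes: the same observations in
# ascending order of magnitude.
array = sorted(marks)

# The array still holds all 70 values, so the bulk is not reduced;
# it only makes the smallest and largest marks easy to read off.
print(len(array), array[0], array[-1])  # prints: 70 15 68
```

Sorting makes the extremes (15 and 68) visible at a glance, but all 70 values are still present.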
A much better representation is the frequency distribution.

Frequency Distribution
Def: The method or way in which the observations are classified and distributed in the proper class intervals is called a frequency distribution.
Frequency (continuous data): The number of observations lying in any class interval is known as the frequency of that class.
Frequency (discrete data): The number of times a value is repeated in a series is known as the frequency of that value.

How to construct a Frequency distribution
Put a bar ( | ), called a tally mark, against the number each time it occurs. After a number has occurred four times, the fifth occurrence is represented by putting a cross tally mark over the first four tallies.

Marks   No. of students     Marks   No. of students
15      1                   40      2
18      2                   41      6
21      2                   42      4
22      1                   44      2
26      1                   45      1
28      1                   46      1
29      2                   47      2
30      3                   48      3
31      2                   49      2
32      4                   50      1
33      1                   51      2
34      1                   52      2
35      2                   53      2
36      1                   54      1
37      5                   56      1
38      5                   60      1
39      1                   62      1
                            68      1
Total frequency: 70

Slide 14:
When the identity of the individuals is not relevant and the order in which the observations arise is not important, the data can be condensed by dividing the observed range of the variable into a suitable number of class intervals and recording the number of observations in each class. Such a table showing the distribution of the frequencies in the different classes is called a frequency table, and the manner in which the class frequencies are distributed over the class intervals is called the grouped frequency distribution of the variable.

Rule for construction of a frequency distribution
The classes should be clearly defined and non-overlapping.
The classes should be exhaustive, i.e. each of the given values should be included in one of the classes.
As far as possible, class intervals should be of equal width.
Open ended classes should be avoided.
The number of classes should be neither too large nor too small.
It should preferably lie between 5 and 15. We can determine the approximate number of classes by
$k = 1 + 3.322 \log_{10} N$, where N is the total frequency.
Number of class intervals: $k = 1 + 3.322 \log_{10}(70) = 1 + 3.322 \times 1.845 = 7.13 \approx 7$
Having fixed the number of classes, divide the range (the difference between the greatest and the smallest observation) by it; the nearest integer to this value gives the magnitude of the class interval. Broad class intervals (i.e. fewer classes) will yield rough estimates, while for a high degree of accuracy small class intervals are desirable.
Class width = (Maximum observation - Minimum observation) / k
Class width: $h = (68 - 15)/7 = 7.57 \approx 8$
Class limits should be chosen in such a way that the mid value of the class interval and the actual average of the observations in that class interval are as near as possible.

Graphical Representation of data
The graphical representation of data makes the reading more interesting, less time-consuming and easily understandable. The disadvantage of graphical presentation is that it lacks detail and is less accurate. It is often useful to represent a frequency distribution by means of a diagram which makes the unwieldy data understandable and conveys to the eye the general run of the observations. The following methods are used for graphical representation of data:
Histogram
Frequency Polygon
Frequency Curve
Pie-chart

Histogram
A histogram is a diagram which represents the class intervals and frequencies in the form of rectangles. There will be as many adjoining rectangles as there are class intervals.
How to draw a Histogram
Mark class intervals on the X-axis and frequencies on the Y-axis. The scales for the two axes need not be the same.
Class intervals must be exclusive. If the intervals are in inclusive form, convert them to the exclusive form.
Draw rectangles with class intervals as bases and the corresponding frequencies as heights.
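The class-count rule, the class width and the rectangle heights described above can be sketched in Python for the 70 marks. This is only an illustrative sketch: the names `k`, `width`, `lower` and `counts` are chosen here, the first class is assumed to start at the smallest mark (15), and the width is rounded up with `math.ceil` so that every mark falls in a class (for this data that agrees with the nearest-integer value of 8 used in the notes).

```python
import math

# Marks of the 70 students, copied from the raw listing in the notes.
marks = [
    32, 47, 41, 51, 41, 30, 39, 18, 48, 53,
    54, 32, 31, 46, 15, 37, 32, 56, 42, 48,
    38, 26, 50, 40, 38, 42, 35, 22, 62, 51,
    44, 21, 45, 31, 37, 41, 44, 18, 37, 47,
    68, 41, 30, 52, 52, 60, 42, 38, 38, 34,
    41, 53, 48, 21, 28, 49, 42, 36, 41, 29,
    30, 33, 37, 35, 29, 37, 38, 40, 32, 49,
]

n = len(marks)

# Number of classes from k = 1 + 3.322 * log10(N), rounded to an integer.
k = round(1 + 3.322 * math.log10(n))

# Class width = range / k, rounded up (assumption: ceil instead of nearest).
width = math.ceil((max(marks) - min(marks)) / k)

# Assumption: the first class starts at the smallest observation.
lower = min(marks)

# Exclusive classes [lower, lower + width), [lower + width, ...), and so on.
counts = [0] * k
for m in marks:
    counts[(m - lower) // width] += 1

# Each count is the height of one histogram rectangle.
for i, c in enumerate(counts):
    start = lower + i * width
    print(f"{start}-{start + width}: {c}")
```

With these assumptions the classes are 15-23, 23-31, ..., 63-71 (upper limits excluded), and the frequencies add up to 70.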
The class limits are marked on the horizontal axis and the frequency is marked on the vertical axis. Thus a rectangle is constructed on each class interval.
If the intervals are equal, the height of each rectangle is proportional to the corresponding class frequency.
If the intervals are unequal, the area of each rectangle is proportional to the corresponding class frequency.

Example
[Figure: example histogram; X-axis: class interval, Y-axis: frequency]

Slide 19:
Example: The daily wages of 50 workers, Table (a) and Table (b)

Slide 20:
Frequency Polygon
A frequency polygon is obtained by joining the mid-points of the tops of the adjacent rectangles of the histogram. Equivalently, in a frequency distribution the mid-value of each class is obtained; then, on graph paper, the frequency is plotted against the corresponding mid-value. These points are joined by straight lines, which may be extended in both directions to meet the X-axis to form a polygon.

Slide 21:
[Figure: Frequency Polygon]

Pie-Chart and its construction
Sometimes a circle is used to represent given data. The various parts of the data are proportionally represented by sectors of the circle. The graph is then called a Pie Graph or Pie Chart.
1. Find the angle of each sector. The total of the data corresponds to 360°. Let x° be the angle at the centre for item A; then
$x^\circ = \frac{\text{value of item A}}{\text{total of data}} \times 360^\circ$

Measures of Central tendency
Def: The tendency of the observations to gather in the central part of the data set is referred to as central tendency. The basic measures of central tendency are
Mean
Median
Mode
Geometric Mean
Harmonic Mean
Characteristics of an ideal measure of central tendency
It should be strictly defined.
It should be easy to calculate.
It should be based on all the observations.
It should be suitable for further mathematical treatment.
It should be less affected by fluctuations of sampling.
It should not be affected by extreme values.

Mean or Arithmetic Mean
The mean or average of a data set is the sum of all the values in the data set divided by the total number of values.
Case I: For un-grouped data
Let x1, x2, ..., xn be n observations; then the mean is given by
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example: Find the mean of the data set {20, 25, 34, 50}.
Mean = (20 + 25 + 34 + 50)/4 = 129/4 = 32.25

Case II: For a frequency distribution
If x1, x2, ..., xn have frequencies f1, f2, ..., fn respectively, then the mean is given by
$\bar{x} = \frac{1}{N}\sum_{i=1}^{n} f_i x_i$, where $N = \sum_{i=1}^{n} f_i$
Note: In the case of a grouped or continuous frequency distribution, x is taken as the mid value of the corresponding class.

Slide 26:
Example: a) Find the arithmetic mean of the following frequency distribution. b) Calculate the arithmetic mean of marks from the following table.

Slide 27:
Note: When the values of x or f are large, the calculation of the mean is time consuming and error prone. In such cases we take deviations from an arbitrary point A.
Let the variate values x1, x2, ..., xn have frequencies f1, f2, ..., fn respectively. Let A be any point (any value of xi). Compute the deviations $d_i = x_i - A$. The mean can then be calculated as
$\bar{x} = A + \frac{1}{N}\sum_{i=1}^{n} f_i d_i$
Proof: Since $d_i = x_i - A$,
$f_i d_i = f_i (x_i - A) = f_i x_i - f_i A$
Taking summation on both sides over i from 1 to n, we get
$\sum f_i d_i = \sum f_i x_i - A \sum f_i = \sum f_i x_i - A N$
Dividing both sides by N,
$\frac{1}{N}\sum f_i d_i = \bar{x} - A$, i.e. $\bar{x} = A + \frac{1}{N}\sum f_i d_i$
Note: We can take any value of A, but usually the value of x corresponding to the middle part of the distribution is the most convenient.

Slide 28:
In the case of a grouped or continuous distribution we take the deviations as
$d_i = \frac{x_i - A}{h}$
where h is the width of the class interval, and the mean formula becomes
$\bar{x} = A + \frac{h}{N}\sum_{i=1}^{n} f_i d_i$
Example: Calculate the mean for the following frequency distribution.

Properties of Arithmetic Mean
The algebraic sum of the deviations of a set of values from their arithmetic mean is zero.
Proof: Let x1, x2, ..., xn be the values of the variate X with frequencies f1, f2, ..., fn respectively.
Then we want to prove
$\sum_{i=1}^{n} f_i (x_i - \bar{x}) = 0$
Now
$\sum f_i (x_i - \bar{x}) = \sum f_i x_i - \bar{x} \sum f_i = \sum f_i x_i - N\bar{x}$   (1)
We know that $\bar{x} = \frac{1}{N}\sum f_i x_i$, so $\sum f_i x_i = N\bar{x}$. Putting this in (1) we get
$\sum f_i (x_i - \bar{x}) = N\bar{x} - N\bar{x} = 0$

Mean of composite series
If two series of sizes n1 and n2 have means $\bar{x}_1$ and $\bar{x}_2$, the mean of the combined series is
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$

Slide 31:
Example: The average salary of male employees in a firm was Rs. 5200 and that of female employees was Rs. 4200. The mean salary of all the employees was Rs. 5000. Find the percentage of male and female employees.
Suppose n1 are the male and n2 the female employees. Then
$5000 = \frac{5200 n_1 + 4200 n_2}{n_1 + n_2}$, so $800 n_2 = 200 n_1$ and $n_1 / n_2 = 4$.
Hence there are 80 per cent male and 20 per cent female employees.

Median
Def: The median of a distribution is the value of the variable which divides it into two equal parts. The median is known as a positional average.
Case 1: For un-grouped data
If the number of observations is odd, the median is the middle value after the values have been arranged in ascending or descending order of magnitude.
If the number of observations is even, the median is obtained by taking the arithmetic mean of the two middle values.
Example 1: Find the median of the data set {9, 17, 14, 21, 27}.
The data set can be arranged as {9, 14, 17, 21, 27}. The middle value is 17, so the median is 17.
Example 2: Find the median of the data set {14, 12, 10, 8, 16, 26}.
The data set can be arranged as {8, 10, 12, 14, 16, 26}. The middle values are 12 and 14, so the median is (12 + 14)/2 = 13.

Slide 33:
Case 2: For a discrete frequency distribution
In the case of a discrete frequency distribution, the median is obtained by considering the cumulative frequencies as follows:
1. Find N/2, where $N = \sum f_i$.
2. Find the cumulative frequency just greater than N/2.
3. The corresponding value of x is the median.
Example: Obtain the median for the following frequency distribution.
Here N = 120, so N/2 = 60. The cumulative frequencies are 8, 18, 29, 45, 65, 90, 105, 114, 120.
The cumulative frequency just greater than 60 is 65.
Hence 5 is the median of this distribution.

Case 3: Median for a continuous frequency distribution
The class corresponding to the cumulative frequency just greater than N/2 is called the median class, and the value of the median is obtained as
$\text{Median} = l + \frac{h}{f}\left(\frac{N}{2} - c\right)$
where
l = lower limit of the median class
f = frequency of the median class
h = class width
c = cumulative frequency of the class preceding the median class
and $N = \sum f_i$.
Example: Find the median wage of the following distribution.
Obtain the median class: since the c.f. just greater than N/2 = 21.5 is 28, 4000-5000 is the median class, with l = 4000, f = 20, c = 8, h = 1000. Hence
$\text{Median} = 4000 + \frac{1000}{20}(21.5 - 8) = 4675$

Merits and De-merits of median
Merits
It is not at all affected by extreme values.
It can be calculated for distributions with open ended classes.
Demerits
In the case of an even number of observations the median cannot be determined exactly.
It is not based on all the observations.
Mathematical treatment is not possible.
Compared to the mean, it is affected much more by fluctuations of sampling.
Uses
The median is the only average to be used while dealing with qualitative data.

Mode
The mode is the value which occurs most frequently in a set of observations and around which the other items of the set cluster closely.
Example: Find the mode of the data set {23, 15, 19, 20, 16, 19}.
Step 1: The data set can be ordered as {15, 16, 19, 19, 20, 23}.
Step 2: In the data set, the value 19 occurs two times.
Step 3: Hence, the mode is 19.
Case 2: In the case of a discrete frequency distribution, the mode is the value of x corresponding to the maximum frequency. For example, in a distribution whose maximum frequency, 25, occurs at x = 4, the mode is 4.
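The median and mode examples above can be checked with a short Python sketch. It uses the standard-library `statistics` module for the un-grouped cases; the helper `grouped_median` is a hypothetical name introduced here for the median-class formula, and the total frequency N = 43 is inferred from the value N/2 = 21.5 quoted in the wage example.

```python
from statistics import median, mode

# Un-grouped data: odd and even number of observations.
median_odd = median([9, 17, 14, 21, 27])         # middle value after sorting
median_even = median([14, 12, 10, 8, 16, 26])    # mean of the two middle values

# Mode of the un-grouped example: the most frequent value.
mode_value = mode([23, 15, 19, 20, 16, 19])


def grouped_median(l, h, f, c, n):
    """Median of a continuous frequency distribution.

    l: lower limit of the median class
    h: class width
    f: frequency of the median class
    c: cumulative frequency of the class preceding the median class
    n: total frequency N
    """
    return l + (h / f) * (n / 2 - c)


# Wage example: l = 4000, f = 20, c = 8, h = 1000, and N = 43 (inferred).
median_wage = grouped_median(l=4000, h=1000, f=20, c=8, n=43)

print(median_odd, median_even, mode_value, median_wage)
# prints: 17 13.0 19 4675.0
```

The grouped formula interpolates linearly inside the median class, which is why it returns a value between the class limits 4000 and 5000.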
Case 3: In the case of a continuous frequency distribution the mode is given by
$\text{Mode} = l + \frac{h (f_1 - f_0)}{2 f_1 - f_0 - f_2}$
where
l = lower limit of the modal class
h = width of the class
f1 = frequency of the modal class
f0 = frequency of the class preceding the modal class
f2 = frequency of the class succeeding the modal class
Example: Find the mode for the following distribution.
Since class 40-50 has the maximum frequency, it is the modal class; hence l = 40, h = 10, f1 = 28, f0 = 12, f2 = 20 and
$\text{Mode} = 40 + \frac{10 (28 - 12)}{2 \times 28 - 12 - 20} = 40 + \frac{160}{24} = 46.67$
Merits of Mode
Like the median, the mode can be located in some cases merely by inspection.
It is not affected by extreme values.
It can be calculated even if the frequency distribution has class intervals of unequal magnitude, provided the modal class and the classes preceding and succeeding it are of the same width.
It can also be calculated for distributions with open ended classes.

Geometric Mean
The geometric mean of a set of n observations is the nth root of their product and is denoted by G. Let x1, x2, ..., xn be the n observations; then
$G = (x_1 x_2 \cdots x_n)^{1/n}$
Taking logs on both sides,
$\log G = \frac{1}{n}\sum_{i=1}^{n} \log x_i$
In the case of a frequency distribution,
$\log G = \frac{1}{N}\sum_{i=1}^{n} f_i \log x_i$
Uses of Geometric Mean
To find the rate of population growth and the rate of interest.
In the construction of index numbers.

Slide 39:
Harmonic Mean: The harmonic mean of a number of observations, none of which is zero, is the reciprocal of the arithmetic mean of the reciprocals of the given values and is denoted by H:
$H = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}}$
For a frequency distribution,
$H = \frac{1}{\frac{1}{N}\sum_{i=1}^{n} \frac{f_i}{x_i}}$
Partition values: The values which divide the series into a number of equal parts are called partition values.
The three points which divide the series into four equal parts are called quartiles:
$Q_i = l + \frac{h}{f}\left(\frac{iN}{4} - c\right)$, where i = 1, 2, 3
The nine points which divide the series into ten equal parts are called deciles and are denoted by D1, D2, ..., D9:
$D_i = l + \frac{h}{f}\left(\frac{iN}{10} - c\right)$, where i = 1, 2, ..., 9
The ninety-nine values which divide the series into 100 equal parts are called percentiles and are denoted by P1, P2, ..., P99:
$P_i = l + \frac{h}{f}\left(\frac{iN}{100} - c\right)$, where i = 1, 2, ..., 99
Here l, h, f and c refer to the class containing the partition value, exactly as for the median class.

Measure of Dispersion
Mean, median and mode give us an idea of the concentration of the observations about the central part of a distribution. With averages alone we cannot draw useful conclusions about a distribution.
Series A: 7, 8, 9, 10, 11    Mean(A) = 9
Series B: 3, 6, 9, 12, 15    Mean(B) = 9
Series C: 1, 5, 9, 13, 17    Mean(C) = 9
Which series is the most consistent? The mean must be supported by some other measure in order to draw useful information from the data.

Dispersion
Def: The degree to which numerical data tend to spread about an average value is called variation or dispersion.
Measures of dispersion are broadly classified into two categories:
Measures which express the spread of the observations in terms of the distance between the values of selected observations. Examples: range, inter-quartile deviation.
Measures which express the spread of the observations in terms of the average deviation of the observations from some central value. Examples: standard deviation, mean deviation etc.
Measures of dispersion
Range
Quartile deviation
Semi-inter quartile deviation
Mean Deviation
Standard Deviation
Quartile coefficient of dispersion
Coefficient of variation
Coefficient of mean deviation

Slide 42:
Range: The range is the difference between the two extreme observations of a distribution. If A and B are the greatest and the smallest observations respectively in a distribution, then its range is given by
Range = Xmax - Xmin = A - B
Note: The range is the simplest but a crude measure of dispersion. As it is based on only the two extreme observations, it is not a reliable measure of dispersion.
Quartile deviation: The quartile deviation or semi-interquartile range Q is given by
$Q = \frac{Q_3 - Q_1}{2}$
where Q1 and Q3 are the first and third quartiles of the distribution, respectively.
Note: The quartile deviation is a better measure than the range as it makes use of 50% of the data. Since it ignores the other 50% of the data, it cannot be regarded as a reliable measure.

Slide 43:
Mean Deviation: If a variable X takes the values x1, x2, ...,
xn with frequencies f1, f2, ..., fn, respectively, then the mean deviation from the average A (usually the mean, median or mode) is given by
$\text{M.D.} = \frac{1}{N}\sum_{i=1}^{n} f_i |x_i - A|$
where $|x_i - A|$ represents the modulus or absolute value of the deviation $(x_i - A)$, in which negative signs are ignored.
As the mean deviation is based on all the observations, it is a better measure of dispersion than the range or the quartile deviation. But the step of ignoring the signs of the deviations $(x_i - A)$ creates artificiality.
Remark: The mean deviation is least when taken from the median.

Slide 44:
Example: Calculate (i) the quartile deviation (Q.D.) and (ii) the mean deviation (M.D.) from the mean, for the following data.
i) Here N = 50. The c.f. just greater than 12.75 is 19. Hence the corresponding class 20-30 contains Q1. The c.f. just greater than 37.25 is 41. Hence the corresponding class 40-50 contains Q3.
ii) Mean

Standard Deviation and Root Mean Square Deviation
The standard deviation is the positive square root of the arithmetic mean of the squares of the deviations of the given values from their mean. It is denoted by σ. If a variable X takes the values x1, x2, ..., xn with frequencies f1, f2, ..., fn, respectively, then the standard deviation is given by
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}$
It is suitable for further mathematical treatment and is also based on all the observations. The standard deviation is regarded as the best and most powerful measure of dispersion.

Slide 46:
The square of the standard deviation is called the variance and is given by
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^2$
The root mean square deviation, denoted by s, is given by
$s = \sqrt{\frac{1}{N}\sum_{i=1}^{n} f_i (x_i - A)^2}$
where A is any arbitrary number. $s^2$ is called the mean square deviation.
Relation between σ and s:
$s^2 = \sigma^2 + d^2$, where $d = \bar{x} - A$
Hence $s^2 \ge \sigma^2$, and the mean square deviation and the root mean square deviation are least when the deviations are taken from the mean.

Different formulae for calculating variances
Case I: When the mean comes out to be a whole number, i.e. an integer:
$\sigma^2 = \frac{1}{N}\sum f_i (x_i - \bar{x})^2$
Case II: When the mean is not a whole number but a fraction, the calculation with the above formula is very cumbersome and time consuming. In that case use
$\sigma^2 = \frac{1}{N}\sum f_i x_i^2 - \left(\frac{1}{N}\sum f_i x_i\right)^2$
Case III: If the values of x and f are large, the calculation of fx and fx² is quite tedious. In this case we take deviations from an arbitrary point A:
$d_i = x_i - A$, i.e. $x_i = A + d_i$   (1)
Multiplying by $f_i$, summing over i from 1 to n and dividing by N, we get
$\bar{x} = A + \bar{d}$   (2)
Subtracting (2) from (1) we get
$x_i - \bar{x} = d_i - \bar{d}$

Slide 48:
Therefore
$\sigma_x^2 = \frac{1}{N}\sum f_i (x_i - \bar{x})^2 = \frac{1}{N}\sum f_i (d_i - \bar{d})^2 = \sigma_d^2 = \frac{1}{N}\sum f_i d_i^2 - \left(\frac{1}{N}\sum f_i d_i\right)^2$
Hence the standard deviation and the variance are independent of a change of origin.
When class intervals are given and the class width h is large, change the scale as well. Let
$d_i = \frac{x_i - A}{h}$
Then $x_i - \bar{x} = h (d_i - \bar{d})$ and
$\sigma_x^2 = h^2 \sigma_d^2$
Hence, the variance is independent of a change of origin but not of scale.

Variance of Combined Series
If two series of sizes n1 and n2 have means $\bar{x}_1$, $\bar{x}_2$ and standard deviations σ1, σ2, the variance of the combined series is
$\sigma^2 = \frac{n_1 (\sigma_1^2 + d_1^2) + n_2 (\sigma_2^2 + d_2^2)}{n_1 + n_2}$
where $d_1 = \bar{x}_1 - \bar{x}$, $d_2 = \bar{x}_2 - \bar{x}$ and $\bar{x}$ is the combined mean.

Slide 50:
Example: Calculate the mean and standard deviation for the following table giving the age distribution of 542 members.

Slide 51:
Example: The first of two samples has 100 items with mean 15 and standard deviation 3. The whole group has 250 items with mean 15.6 and standard deviation (value missing). Find the standard deviation of the second group.
We are given n1 = 100, n1 + n2 = 250, $\bar{x}_1 = 15$, $\bar{x} = 15.6$ and σ1 = 3. We want to find σ2.
Since n1 + n2 = 250, we have 100 + n2 = 250 and hence n2 = 250 - 100 = 150.
We know that the combined mean is given by
$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$, so $15.6 = \frac{100 \times 15 + 150 \bar{x}_2}{250}$ and $\bar{x}_2 = 16$.
The variance σ² of the combined series is given by the formula
$\sigma^2 = \frac{n_1 (\sigma_1^2 + d_1^2) + n_2 (\sigma_2^2 + d_2^2)}{n_1 + n_2}$
which is then solved for σ2.

Slide 52:
Coefficient of Dispersion
Whenever we want to compare the variability of two series which differ widely in their averages or are measured in different units, we do not calculate measures of dispersion but the coefficients of dispersion, which are pure numbers independent of the units of measurement. The coefficients of dispersion based on the different measures of dispersion are:
1. Based on the range: C.D. = $\frac{A - B}{A + B}$, where A and B are the greatest and the smallest observations
2. Based on the quartile deviation: C.D. = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$
3. Based on the mean deviation: C.D. = $\frac{\text{mean deviation}}{\text{average from which it is calculated}}$
4. Based on the standard deviation: C.D. = $\frac{\sigma}{\bar{x}}$
Coefficient of Variation: 100 times the coefficient of dispersion based on the standard deviation is called the coefficient of variation (C.V.),
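i.e. $\text{C.V.} = 100 \times \frac{\sigma}{\bar{x}}$

Slide 53:
Example: The analysis of monthly wages paid to the workers of two firms A and B belonging to the same industry gives the following results.

                                    Firm A    Firm B
Number of workers                   500       600
Average daily wage                  186.00    175.00
Variance of distribution of wages   81        100

(i) Which firm, A or B, has a larger wage bill?
(ii) In which firm, A or B, is there greater variability in individual wages?
(iii) Calculate (a) the average daily wage, and (b) the variance of the distribution of wages of all the workers in the firms A and B taken together.
Firm A: number of wage earners n1 = 500, average daily wage $\bar{x}_1 = 186$, variance $\sigma_1^2 = 81$.
Firm B: number of wage earners n2 = 600, average daily wage $\bar{x}_2 = 175$, variance $\sigma_2^2 = 100$.
(i) Wage bill of A = 500 × 186 = 93,000; wage bill of B = 600 × 175 = 105,000. Firm B has the larger wage bill.
(ii) C.V. of A = 100 × 9/186 = 4.84; C.V. of B = 100 × 10/175 = 5.71. Firm B has greater variability in individual wages.
(iii) Combined mean: $\bar{x} = \frac{500 \times 186 + 600 \times 175}{1100} = 180$. With $d_1 = 186 - 180 = 6$ and $d_2 = 175 - 180 = -5$, the combined variance is
$\sigma^2 = \frac{500 (81 + 36) + 600 (100 + 25)}{1100} = 121.36$

Moments
The rth moment of a variable x about a given point x = A is
$\mu_r' = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - A)^r$
The rth moment about the mean is
$\mu_r = \frac{1}{N}\sum_{i=1}^{n} f_i (x_i - \bar{x})^r$
In particular,
$\mu_0 = 1$, $\mu_1 = 0$, $\mu_2 = \sigma^2$, and $\mu_1' = \bar{x} - A$
Relation between moments about the mean in terms of moments about any point:
$\mu_2 = \mu_2' - \mu_1'^2$
$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2\mu_1'^3$
$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'\mu_1'^2 - 3\mu_1'^4$
Pearson's β and γ Coefficients
Karl Pearson defined the following four coefficients based upon the first four moments about the mean:
$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$, $\gamma_1 = +\sqrt{\beta_1}$, $\beta_2 = \frac{\mu_4}{\mu_2^2}$, $\gamma_2 = \beta_2 - 3$
Note: These coefficients are pure numbers independent of the units of measurement.

Slide 55:
Example: Calculate the first four moments of the following distribution about the mean and hence find β1 and β2.
Here we take A = 4 and compute the moments about the point x = 4; the moments about the mean are then obtained from the relations above.

Skewness
Skewness means 'lack of symmetry'. We study skewness to have an idea about the shape of the curve which we can draw with the help of the given data. In a symmetrical distribution, the mean, median and mode are equal to each other and they divide the distribution into two equal parts such that one part is the mirror image of the other.
[Figure: symmetrical frequency curve; Y-axis: frequency]

Slide 57:
If some observations of very high (low) magnitude are added to such a distribution, its right (left) tail gets elongated. These observations are also known as extreme observations. The presence of extreme observations on the right hand side of a distribution makes it positively skewed, and the three averages, viz. mean, median and mode, will no longer be equal. For a positively skewed distribution Mean > Median > Mode.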
The presence of extreme observations on the left side of the distribution makes it negatively skewed, and the relationship between mean, median and mode is Mean < Median < Mode.

Slide 58:
Symmetric and Skewed Distributions
A normal distribution, the standard example of a symmetric distribution, has the following characteristics:
The mean and median are equal.
The mean and variance completely describe the distribution.
68.3% of observations lie between (mean ± 1 standard deviation)
95.5% of observations lie between (mean ± 2 standard deviations)
99.7% of observations lie between (mean ± 3 standard deviations)

Slide 59:
A distribution is said to be skewed if
the mean, median and mode fall at different points,
the quartiles are not equidistant from the median, and
the curve drawn with the help of the given data is not symmetrical but stretched more to one side than to the other.
Measures of Skewness (M = mean, Md = median, Mo = mode)
Sk = M - Md
Sk = M - Mo
Sk = (Q3 - Md) - (Md - Q1)
These are absolute measures of skewness. As in dispersion, for comparing two series we do not calculate these absolute measures but the relative measures called coefficients of skewness, which are pure numbers independent of the units of measurement.

Slide 60:
The coefficients of skewness
1. Karl Pearson's coefficient of skewness
$S_k = \frac{\text{Mean} - \text{Mode}}{\sigma}$
where σ is the standard deviation of the distribution.
The sign of Sk gives the direction and its magnitude gives the extent of skewness, i.e. if Sk > 0 the distribution is positively skewed and if Sk < 0 it is negatively skewed.
Sk depends upon the mode, so if the mode is not well defined we cannot find Sk from this formula. But the empirical relation between mean, median and mode states that for a moderately asymmetrical distribution
Mean - Mode ≈ 3(Mean - Median)
so that
$S_k = \frac{3(\text{Mean} - \text{Median})}{\sigma}$
The limits for Karl Pearson's coefficient of skewness are ±3. However, these limits are rarely attained.

Slide 61:
Example: Compute Karl Pearson's coefficient of skewness from the following data.
Here we take deviations d = x - 61.

Slide 62:
To find the mode, we note that height is a continuous variable.
It is assumed that the height has been measured under the approximation that a measurement greater than 58 but less than 58.5 is taken as 58, while a measurement greater than or equal to 58.5 but less than 59 is taken as 59. Thus the given data can be written with class boundaries at the half units.
By inspection, the modal class is 60.5-61.5; thus we have l = 60.5, f1 = 42, f2 = 35, f0 = 30 and h = 1, and
$\text{Mode} = 60.5 + \frac{1 \times (42 - 30)}{2 \times 42 - 30 - 35} = 60.5 + \frac{12}{19} = 61.13$
Hence Karl Pearson's coefficient of skewness is $S_k = (\text{Mean} - \text{Mode})/\sigma$, and the distribution is positively skewed.

Slide 63:
2. Prof. Bowley's coefficient of skewness (based on quartiles)
$S_k = \frac{(Q_3 - M_d) - (M_d - Q_1)}{(Q_3 - M_d) + (M_d - Q_1)} = \frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1}$
Note: Bowley's coefficient of skewness is also known as the quartile coefficient of skewness and is especially useful in situations where quartiles and the median are used. The limits of Bowley's coefficient of skewness are ±1.
Example: For the data given in the above example, compute Bowley's coefficient of skewness.

Slide 64:
Computation of Q1: The c.f. just greater than N/4 = 46.75 is 58. Hence the corresponding class 59.5-60.5 contains Q1.
Computation of Md (Q2): The c.f. just greater than N/2 = 93.5 is 100. Since the modal class 60.5-61.5 has frequency 42 and 58 + 42 = 100, the class 60.5-61.5 contains the median.
Computation of Q3: The corresponding class 62.5-63.5 contains Q3.
Bowley's coefficient then follows from the formula above.

Slide 65:
3. Based upon moments
$S_k = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$
Note: Sk = 0 if either β1 = 0 or β2 = -3. But since $\beta_2 = \mu_4 / \mu_2^2$ cannot be negative, Sk = 0 if and only if β1 = 0. Thus for a symmetrical distribution β1 = 0. Hence β1 is taken as a measure of skewness.

Kurtosis
Kurtosis is a measure of the shape of a distribution. It measures the relative peakedness of the frequency curve. Frequency curves can be divided into three categories depending upon the shape of their peak:
Leptokurtic
Mesokurtic
Platykurtic

Slide 67:
A measure of kurtosis is given by
$\beta_2 = \frac{\mu_4}{\mu_2^2}$, $\gamma_2 = \beta_2 - 3$
β2 = 3 for a mesokurtic curve.
When β2 > 3, the curve is more peaked than the mesokurtic curve and is termed leptokurtic.
When β2 < 3, the curve is less peaked than the mesokurtic curve and is called platykurtic.
Example: The first four moments of a distribution about the mean are 0, 2.5, 0.7 and 18.75.
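Examine the skewness and kurtosis of the distribution.
Here $\mu_2 = 2.5$, $\mu_3 = 0.7$ and $\mu_4 = 18.75$. To examine skewness, we compute $\beta_1$:
$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(0.7)^2}{(2.5)^3} = \frac{0.49}{15.625} \approx 0.031$
Since $\mu_3 > 0$ and $\beta_1$ is small, the distribution is moderately positively skewed.
To examine kurtosis, we compute $\beta_2$:
$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{18.75}{(2.5)^2} = \frac{18.75}{6.25} = 3$
Hence the curve is mesokurtic.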
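The skewness and kurtosis coefficients discussed in these slides can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the formulas are the standard textbook forms of the grouped-data mode, Pearson's, Bowley's and the moment-based coefficients, assumed here to match the ones on the original slides. It reuses the modal-class figures (l = 60.5, f1 = 42, f0 = 30, f2 = 35, h = 1) and the moments example (2.5, 0.7, 18.75) from the text.

```python
import math


def grouped_mode(l, h, f0, f1, f2):
    """Mode of a grouped distribution: l + h*(f1 - f0) / (2*f1 - f0 - f2)."""
    return l + h * (f1 - f0) / (2 * f1 - f0 - f2)


def pearson_skewness(mean, mode, sd):
    """Karl Pearson's coefficient: (mean - mode) / sd."""
    return (mean - mode) / sd


def pearson_skewness_from_median(mean, median, sd):
    """Median-based form, using Mean - Mode ~ 3 * (Mean - Median)."""
    return 3 * (mean - median) / sd


def bowley_skewness(q1, median, q3):
    """Bowley's quartile coefficient: (Q3 + Q1 - 2*Md) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


def moment_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2, gamma1, gamma2) from moments about the mean."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    gamma1 = mu3 / mu2 ** 1.5  # keeps the sign of mu3, unlike beta1
    gamma2 = beta2 - 3         # excess kurtosis
    return beta1, beta2, gamma1, gamma2


def moment_skewness(beta1, beta2):
    """Moment-based coefficient: sqrt(b1)*(b2 + 3) / (2*(5*b2 - 6*b1 - 9))."""
    return math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))


def kurtosis_type(beta2, tol=1e-9):
    """Classify a curve by comparing beta2 with 3 (tol is an assumed tolerance)."""
    if math.isclose(beta2, 3.0, abs_tol=tol):
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"


# Modal-class figures quoted in the height example.
mode = grouped_mode(l=60.5, h=1, f0=30, f1=42, f2=35)
print("mode:", round(mode, 2))

# Moments example: mu2 = 2.5, mu3 = 0.7, mu4 = 18.75.
beta1, beta2, gamma1, gamma2 = moment_coefficients(2.5, 0.7, 18.75)
print("beta1:", round(beta1, 5), "beta2:", beta2)
print("direction of skewness:", "positive" if gamma1 > 0 else "not positive")
print("kurtosis:", kurtosis_type(beta2))
```

Because β1 is a squared quantity it cannot show the direction of skewness, so the sketch also returns γ1, whose sign follows μ3. The mean and standard deviation of the height data are not given in the text, so Pearson's coefficient for that example is not evaluated here.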