Lies Damn Lies and Anti Statistics

Lies, Damn Lies and Anti-Statistics:

Alan McSweeney

Objective:

January 11, 2011
Introduce the concept of distorting “anti-statistics”, illustrate how “anti-statistics” can be identified, and define how statistics should be constructed to yield insight and meaning.

Statistics:

A statistic has two roles, primary and secondary.
Primary: to summarise and describe the data while preserving information and reducing the volume of raw data.
Secondary: to provide and enable insight.
Where an alleged statistic does not perform these functions it is an “anti-statistic”. It distorts the underlying information (raw data), either deliberately or accidentally, or it provides no insight or an inaccurate view of the underlying information.
Most people are scared of large sets of numbers, and anti-statistics exploit this fear.

Statistics and Anti-Statistics:

Statistics: descriptive, insightful, informative, enlightening.
Anti-statistics: distorting, promoting misinterpretation, misinformative, concealing.

Statistics - Primary Function:

To describe the data while preserving information and reducing the volume of raw data. This means taking a large amount of raw data and producing descriptive summaries without losing or distorting the underlying raw data. This is the more important function of a statistic.

Statistics - Secondary Function:

To provide and enable insight. By reducing the volume of raw data, you can gain insight into what the data means. This enables you to see the wood for the trees, know the amount and type of wood, and make decisions about the use of the wood. This function is secondary: it only applies once the primary function is satisfied.

Data, Information, Knowledge and Action Cycle:

Good statistics provide information that creates knowledge and enables correct actions. The cycle runs from data to information, from information to knowledge, from knowledge to action, and back to data.

Information – Lots of It:


Sample Information:

4,000 numbers representing the annual salaries of individuals. Sample data only. 100% of the information is available here, but it is very hard to see patterns, understand the situation, gain insight, make effective decisions and understand their consequences. The numbers do not lie, but they are innocent creatures and can be made to lie. We need techniques that extract meaning and provide insight without losing the information the data represents.

Statistics:

I can take all this and give you one derived number, the average: 107941.931

Statistic:

4,000 numbers reduced to 1. This reduces the amount of data by 99.975% (another “statistic”), but I have lost information. The average value of 107941.931 is at best a simplistic view of the data and at worst a distortion that misrepresents the source data. If I use the average without looking to understand the raw data in more detail, I am potentially creating a distortion.

More Statistics:

Be careful what statistics are used. Do not generate statistics just because you can. The use of statistics can give a false impression of certainty or meaning where there is none.
Average: the sum of all the values divided by the number of values. Value: 107941.93
Standard Deviation: a measure of how widely values are dispersed from the average value. Value: 59904.19
Kurtosis: a value that describes the relative peakedness or flatness of a distribution, where a positive value indicates a relatively peaked distribution and a negative value indicates a relatively flat distribution. Value: 0.112
Skewness: a measure of the asymmetry of a distribution around the average, where a positive value indicates a distribution with an asymmetric tail extending toward more positive values and a negative value indicates a distribution with an asymmetric tail extending toward more negative values. Value: 0.731
Mode: the most frequently occurring value. Value: 23958
Median: the number in the middle, where half the numbers have values that are greater than the median and half have values that are less; also called the 50th percentile. Value: 97909.5
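
The definitions above can be sketched in a few lines of Python. This is only an illustration: the ten salaries are invented, not taken from the 4,000-value sample, and the skewness and kurtosis helpers use simple population moments, which is an assumption and may differ slightly from the formulas behind the values quoted on the slide.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed standardised values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores 0."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0


# Hypothetical salaries, invented for illustration only.
salaries = [24000, 31000, 38000, 38000, 52000,
            61000, 75000, 98000, 140000, 290000]

summary = {
    "average": statistics.fmean(salaries),
    "standard_deviation": statistics.stdev(salaries),
    "median": statistics.median(salaries),
    "mode": statistics.mode(salaries),
    "skewness": skewness(salaries),
    "excess_kurtosis": excess_kurtosis(salaries),
}

for name, value in summary.items():
    print(f"{name}: {value:,.3f}")
```

Even on this tiny invented sample the average sits well above the median and the skewness is positive, the same pattern as in the slide's figures.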

Interpreting the Statistics:

Statistic, value and interpretation:
Average (107941.93): the average is higher than the median, indicating that the data is spread unequally towards higher values.
Standard Deviation (59904.19): the high standard deviation indicates the underlying data is spread across a wide range of values.
Kurtosis (0.112): the positive value indicates that there is a peak in the data.
Skewness (0.731): the positive value indicates a distribution with an unequal and heavy tail extending toward higher values.
Mode (23958): in a large set of data where only a small number of data values are the same, this is meaningless.
Median (97909.5): when the median is less than the average, the data is unequally distributed with a heavy tail extending toward higher values.
I now know that most of the data is clustered around lower values and that there is a heavy tail toward higher values, indicating a small number of people earning large salaries.

Let’s Take a Look at the Data:

Characteristics of the data: it increases quickly from zero, is clustered around lower values, drops gradually from the peak and has a heavy tail. In standard terms the distribution is positively skewed (skewed to the right): most values sit on the left and the long tail extends to the right, which agrees with the positive skewness of 0.731. This type of data distribution is very common.

Statistics:

The usefulness of a statistic depends on the underlying data. The average really only makes sense when the data is symmetrically/equally distributed. Otherwise, the average is distorted by the unequal distribution of the data. The standard deviation also really only makes sense when the data is symmetrically distributed.
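
A minimal sketch of this point, using two invented five-value data sets (assumptions for illustration, not the sample data): one symmetric, and one where a single large value has been swapped in.

```python
import statistics

# Hypothetical values, invented for illustration only.
symmetric = [40000, 45000, 50000, 55000, 60000]
skewed = [40000, 45000, 50000, 55000, 310000]  # one very large salary

symmetric_mean = statistics.fmean(symmetric)
symmetric_median = statistics.median(symmetric)
skewed_mean = statistics.fmean(skewed)
skewed_median = statistics.median(skewed)

print("symmetric: mean", symmetric_mean, "median", symmetric_median)
print("skewed:    mean", skewed_mean, "median", skewed_median)
```

In the symmetric set the average and the median agree. In the skewed set the median does not move, but the average doubles and no longer describes a typical value.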

Statistics:

Be careful of obscure statistics such as kurtosis and skewness. They have a use, but their meaning is quite specific and may not be appropriate.

Descriptive Statistics:

Look for statistics that contain measures of data location and clustering, measures of dispersion and variability, and measures of association. Look at the underlying data, how it was collected and what it measures. If the data is of poor quality or measures the wrong values, any derived information will have very limited worth. There are lots of statistics that can be produced from the raw data. Produce only meaningful statistics. Do not throw statistics at the data.

Some Common Descriptive and Summarising Statistics:

Data location and clustering:
Average: simple average.
Weighted Average: average of values weighted according to a value such as their importance.
Truncated/Interpercentile Average: average of a centralised subset of the data.
Median: the 50th percentile.
Mode: the most commonly occurring value.
Dispersion, variability and shape:
Variance: measure of the amount of variation within the data.
Standard Deviation: square root of the variance.
Range: the spread of the data values.
Skewness: measure of the asymmetry of the distribution of the data.
Kurtosis: measure of the “peakedness” and the length of the tail of the distribution of the data.
Percentiles: value below which a certain percent of the data fall.
Association:
Correlation: correlation has a specific meaning that may not be relevant to the data.
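
Two of the less familiar location measures listed above can be sketched as follows. The helper names, the values and the weights are all hypothetical, and the truncated average here simply drops a fixed number of values from each end, which is one of several possible conventions.

```python
import statistics


def weighted_average(values, weights):
    """Average in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def truncated_average(values, drop_each_side):
    """Average of the central values after dropping the extremes."""
    ordered = sorted(values)
    kept = ordered[drop_each_side:len(ordered) - drop_each_side]
    return statistics.fmean(kept)


# Hypothetical data, invented for illustration only.
values = [10, 20, 30, 40, 1000]

simple = statistics.fmean(values)
trimmed = truncated_average(values, drop_each_side=1)
weighted = weighted_average([10, 20, 30], weights=[1, 1, 2])

print("simple average:", simple)
print("truncated average:", trimmed)
print("weighted average:", weighted)
```

The truncated average ignores the single extreme value that dominates the simple average, which is why it can be a better measure of location for unequally distributed data.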

Another Look at the Sample Data:

This view shows salary against the cumulative percentage of the people surveyed.

Percentiles:

A percentile of a set of data is the value below which that percentage of the data lies. The median is the 50th percentile: the value below which 50% of the data lies. Quartiles are the percentiles for 25%, 50% and 75%. Percentiles are useful in summarising data.

Percentiles for Sample Data:

The 4,000 numbers are reduced to 10 numbers. For example: 10% of people earn 38,332 or less; 20% of people earn 54,834 or less; 10% of people earn between 192,871 and 299,433. This successfully reduces the volume of data while preserving more information.
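
A sketch of the same reduction using the standard library. The 4,000 salaries below are generated by an invented formula purely to get a data set clustered around lower values with a long upper tail; they are not the presentation's data. Treating the "10 numbers" as the nine decile cut points plus the maximum is also an assumption.

```python
import statistics

# Hypothetical salaries: an invented formula, not the original sample.
salaries = [20000 + (i * i) // 50 for i in range(4000)]

# Nine cut points (10th to 90th percentile), then the maximum as the 100th.
deciles = statistics.quantiles(salaries, n=10, method="inclusive")
percentiles = deciles + [max(salaries)]

for level, value in zip(range(10, 101, 10), percentiles):
    print(f"{level}% of people earn {value:,.0f} or less")

# Check the reading of the first number against the raw data.
share_below_first = sum(s <= percentiles[0] for s in salaries) / len(salaries)
```

Each of the ten numbers can be read directly as "this percentage of people earn this much or less", which is what makes percentiles a more faithful summary than a single average.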

Anti-Statistics:

Unfortunately, anti-statistics are everywhere. They take a number of general forms or types, such as:
Type 1: statement based on measurement of an incorrect value.
Type 2: statement without scale or reference.
Type 3: statement based on grouping of categories (with possible distortion of categories).
Type 4: statement based on inaccurate or unspecified association or correlation.

Sample Type 1 Anti-Statistic:

Chimpanzee DNA is 99.7% the same as human DNA. What does this statement mean? Do chimpanzees make cars/houses/PCs/etc. that are 99.7% as good as those made by humans? If the statement is true then what is being measured may be invalid. For example, 000000000000000000000000 and 000000000000000000000001 are 99% the same based on the length of the lines in their characters. Or a lot of DNA is not involved in the development process and this is being included in the measurements. Or a small change in DNA has a substantial impact on what is produced.
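
The idea of measuring the wrong thing can be sketched with the two strings from the slide. The metric below, the share of character positions that match, is a stand-in chosen for illustration and is an assumption; it is not the line-length comparison the slide describes, and the function name is hypothetical.

```python
def positional_similarity(a: str, b: str) -> float:
    """Share of character positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


first = "000000000000000000000000"
second = "000000000000000000000001"

similarity = positional_similarity(first, second)
print(f"character positions that match: {similarity:.1%}")

# Read as numbers, the same two strings are simply zero and one.
print("as numbers:", int(first), "and", int(second))
```

The strings match at every position but one, so the similarity score is very high, yet it says nothing about the property that matters: one value is zero and the other is not.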

Sample Type 2 Anti-Statistic:

Statements of the form “X is the greatest cause of Y”, such as “Car crashes are the greatest cause of deaths among males in their 20s and 30s”. These are meaningless because there is no scale or reference point. The statement creates an impression of scale and severity that is at best not justified and at worst incorrect. Take a look at the underlying life expectancy data.

Type 2 Anti-Statistic:

[Charts: probability of a person dying within a year at each year of life; probability of a person dying within a year for the first 35 years.]

Type 2 Anti-Statistic:

The underlying life expectancy data shows that young people have very little chance of dying. Death rates are uniformly very low after the first year of life until about age 50. So a statement such as “Car crashes are the greatest cause of deaths among males in their 20s and 30s” will inevitably be true, because nothing else really kills young males. Death due to illness is uncommon among this group, so any other cause will dominate.
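
The missing scale can be made explicit with a small calculation. Every figure below is invented for illustration (deaths per 100,000 young males per year, by cause); none comes from the life expectancy data in the presentation.

```python
# Hypothetical deaths per 100,000 people per year, invented for illustration.
deaths_per_100k = {
    "car crashes": 20,
    "illness": 10,
    "other accidents": 8,
    "other causes": 7,
}

total_deaths = sum(deaths_per_100k.values())
leading_cause = max(deaths_per_100k, key=deaths_per_100k.get)

# Relative view: the share of all deaths due to the leading cause.
share_of_deaths = deaths_per_100k[leading_cause] / total_deaths

# Absolute view: the chance that any one person dies of that cause in a year.
absolute_risk = deaths_per_100k[leading_cause] / 100_000

print(f"greatest cause of death: {leading_cause}")
print(f"share of all deaths: {share_of_deaths:.0%}")
print(f"chance per person per year: {absolute_risk:.3%}")
```

With these assumed figures the "greatest cause" accounts for a large share of deaths while the chance for any individual stays tiny. The first number without the second is the anti-statistic.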

Sample Type 3 Anti-Statistic:

Statements of the form “N% of people do/have done X at least N times/with defined frequency”. These typically arise as the results of tendentious surveys designed to create a false impression of severity, such as:
“75% of people admit to X up to N times a year.” There is no indication of how the 75% is spread across the range of 1 to N times.
“65% of people admit to having a negative experience up to N times due to X.” There is no indication of the spread of negative experiences across the range of 1 to N.
Such statements are generally a result of combining the responses to two or more questions or categories, for example: How often have you done/experienced X? Once, twice, three times …

Type 3 Anti-Statistic:

How often have you done/experienced X?
Once: 45%
Twice: 10%
Three times: 8%
4-8 times: 5%
8-12 times: 2%
The total of these is 70%. A statement that 70% of people have done/experienced X up to 12 times a year distorts the distribution of the underlying data, which is skewed towards lower rates of occurrence.
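
A short sketch of what the grouped headline hides, using the response percentages listed on the slide. The variable names are hypothetical.

```python
# Survey responses from the slide: answer category -> percent of respondents.
responses = {
    "Once": 45,
    "Twice": 10,
    "Three times": 8,
    "4-8 times": 5,
    "8-12 times": 2,
}

# The headline figure lumps every category together.
grouped_total = sum(responses.values())

# Within that group, how many only ever answered "Once"?
share_once = responses["Once"] / grouped_total

# And how many fall in the top category the headline draws attention to?
share_top = responses["8-12 times"] / grouped_total

print(f"grouped headline figure: {grouped_total}% of respondents")
print(f"of those, answered 'Once': {share_once:.0%}")
print(f"of those, answered '8-12 times': {share_top:.0%}")
```

Most of the grouped respondents sit in the lowest category and only a small fraction in the highest, which is exactly the information the "up to 12 times" wording conceals.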

Sample Type 4 Anti-Statistic:

Statements of the form “Taking/doing A makes you N% more likely to be/experience B”.
There are two issues. Causation: is there a real causal relationship? Degree of causation: how strong is the causal relationship?
An association does not imply causation. Any of the following may lie behind an association:
A might cause B.
B might cause A.
A might cause B and B might cause A.
A might cause C that might cause B.
A might cause C that might cause D … that might cause B.
A might cause C that might cause B, and A might cause D that works against B, but the A-C-B causation is greater than the A-D-B negative causation.
Measuring error.
Random data that was skewed.
Deliberate or malicious misrepresentation.
The cause might be partial or contributory.
Be careful of any statement of a relationship that does not demonstrate how causation happens.

Association and Causation Scenarios:

[Diagram: association and causation scenarios between A and B, some direct and some passing through intermediate factors C and D. Arrows are labelled “Causes or Influences” and, in the final scenario, “Negatively Causes or Influences”.]

Association and Causation:

A very common scenario where an association or causation is asserted: A takes or does D, and taking or doing D affects or causes B.

Association and Causation:

The real association or causation is actually along the lines of:
A is a member of a group C.
Members of group C have a greater tendency to take or do D.
Members of group C also take or do E.
Taking or doing E affects or causes B.
Taking or doing D has little or no effect or influence on B, or even negatively impacts B.
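
The scenario above can be sketched with a small constructed population. All counts are invented, and as a simplifying assumption everyone in group C does E and nobody outside it does. B is built to depend only on E (50% of those who do E experience B, 10% of those who do not), so D has no effect at all.

```python
# Invented counts: (does_D, does_E) -> (number of people, number with B).
cells = {
    (True, True): (320, 160),   # in group C, does D
    (False, True): (80, 40),    # in group C, does not do D
    (True, False): (120, 12),   # outside group C, does D
    (False, False): (480, 48),  # outside group C, does not do D
}


def rate_of_b(does_d=None, does_e=None):
    """Share of people with B, optionally restricted by D and/or E."""
    people = with_b = 0
    for (d, e), (count, b_count) in cells.items():
        if does_d is not None and d != does_d:
            continue
        if does_e is not None and e != does_e:
            continue
        people += count
        with_b += b_count
    return with_b / people


# Naive comparison: D-takers versus everyone else.
naive_ratio = rate_of_b(does_d=True) / rate_of_b(does_d=False)
print(f"doing D makes B {naive_ratio - 1:.0%} more likely (naive view)")

# Comparison within each level of E: the apparent effect of D disappears.
for e in (True, False):
    print(f"does E={e}: B rate with D {rate_of_b(True, e):.0%}, "
          f"without D {rate_of_b(False, e):.0%}")
```

The headline "doing D makes B much more likely" is true of the raw counts and yet entirely produced by group membership: once people are compared within the same level of E, D makes no difference.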

Type 4 Anti-Statistic:

Occurs very frequently. A percentage association can give a false sense of certainty: it just measures the looseness of the association and often misrepresents the degree of causation. Unless the precise nature of the causative relationship can be defined, take it with a large dose of salt.

Summary:

Statistics are designed to provide insight without distorting the meaning of the underlying data or losing information. Anti-statistics are used to distort the underlying data to create false impressions. So there are Lies, Damn Lies and Anti-Statistics.

More Information:

Alan McSweeney, alan@alanmcsweeney.com
