Investigative Data Mining in Fraud Detection
Overview (1)
Investigative Data Mining and Problems in Fraud Detection
Definitions
Technical and Practical Problems
Existing Fraud Detection Methods
Widely used methods
The Crime Detection Method
Comparisons with Minority Report
Classifiers as Precogs
Combining Output as Integration Mechanisms
Cluster Detection as Analytical Machinery
Visualisation Techniques as Visual Symbols
Overview (2)
Implementing the Crime Detection System: Preparation Component
Investigation objectives
Collected data
Preparation of collected data to achieve objectives
Implementing the Crime Detection System: Action Component
Which experiments generate best predictions?
Which is the best insight?
How can the new models and insights be deployed within an organisation?
Contributions and Recommendations
Significant research contributions
Proposed solutions
Literature and Acknowledgements
Dick P K (1956) Minority Report, Orion Publishing Group, London, Great Britain.
Abagnale F (2001) The Art of the Steal: How to Protect Yourself and Your Business from Fraud, Transworld Publishers, NSW, Australia.
Mena J (2003) Investigative Data Mining for Security and Criminal Detection, Butterworth Heinemann, MA, USA.
Elkan C (2001) Magical Thinking in Data Mining: Lessons From CoIL Challenge 2000, Department of Computer Science and Engineering, University of California, San Diego, USA.
Prodromidis A (1999) Management of Intelligent Learning Agents in Distributed Data Mining Systems, Unpublished PhD thesis, Columbia University, USA.
Berry M and Linoff G (2000) Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley and Sons, New York, USA.
Han J and Kamber M (2001) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
Witten I and Frank E (1999) Data Mining: Practical Machine Learning Tools and Techniques with Java, Morgan Kaufmann Publishers, CA, USA.
Investigative Data Mining and Problems in Fraud Detection
Investigative Data Mining - Definitions
Investigative
Official attempt to extract some truth, or insights, about criminal activity from data
Data Mining
Process of discovering, extracting, and analysing meaningful patterns, structure, models, and rules from large quantities of data.
Spans several research areas such as databases, machine learning, neural networks, data visualisation, statistics, and distributed data mining.
Investigative Data Mining
Applied to law enforcement, industry, and private databases
Fraud Detection - Definitions
Fraud
Criminal deception, use of false representations to obtain an unjust advantage, or to injure the rights and interests of another
Diversity of Fraud
Against organisations, governments, and individuals
Committed by external parties, internal management, and non-management employees
Caused by customers, service providers, and suppliers
Prevalent in insurance, credit card, and telecommunications
Most common in automobile, travel, and household contents
Cost of Fraud
Automobile insurance fraud alone – AUD$32 million for nine Australian companies
Fraud Detection Problems - Technical
Imperfect data
Usually not collected for data mining
Inaccurate, incomplete, and irrelevant data attributes
Highly skewed data
Many more legitimate than fraudulent examples
Higher chances of overfitting
Black-box predictions
Numerical outputs incomprehensible to people
Fraud Detection Problems - Practical
Lack of domain knowledge
Important attributes, likely relationships, and known patterns
Three types of fraud offenders and their modus operandi
Great variety of fraud scenarios over time
Soft fraud – Cost of investigation > Cost of fraud
Hard fraud – Circumvents anti-fraud measures
Assessing data mining potential
Predictive accuracy is a useless measure for skewed data sets
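A minimal sketch of why predictive accuracy says little on skewed data. It assumes a hypothetical sample of 1,000 claims with the 6% fraud rate reported on the Data Understanding slide, and a trivial classifier that labels every claim legitimate. All names and values are illustrative, not taken from the thesis.

```python
# Hypothetical sample: 1000 claims, 6% fraudulent (1 = fraud, 0 = legitimate).
labels = [1] * 60 + [0] * 940

# A trivial "classifier" that never flags fraud.
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Hits are fraudulent claims that were actually flagged.
hits = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
hit_rate = hits / sum(labels)

print(f"accuracy = {accuracy:.2f}")   # high, yet no fraud is caught
print(f"hit rate = {hit_rate:.2f}")
```

The classifier scores 94% accuracy while detecting no fraud at all, which is why the later slides assess models by hits, false alarms, and cost savings instead.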
Existing Fraud Detection Methods
Widely Used Methods in Fraud Detection
Insurance Fraud
Cluster detection -> decision tree induction -> domain knowledge, statistical summaries, and visualisations
Special case: neural network classification -> cluster detection
Credit Card Fraud
Decision tree and naive Bayesian classification -> stacking
Telecommunications Fraud
Cluster detection -> scores and rules
The Crime Detection Method
Comparisons with Minority Report
Precogs
Foresee and prevent crime
Each precog contains multiple classifiers
Integration Mechanisms
Combine predictions
Analytical Machinery
Record, study, compare, and represent predictions in simple terms
Single “computer”
Visual Symbols
Explain the final predictions
Graphical visualisations, numerical scores, and descriptive rules
The Crime Detection Method
Classifiers as Precogs
Precog One: Naive Bayesian Classifiers
Statistical paradigm
Simple and Fast
Redundant and not normally distributed attributes*
Precog Two: C4.5 Classifiers
Computer metaphor
Explain patterns and quite fast
Scalability and efficiency issues*
Precog Three: Backpropagation Classifiers
Brain metaphor
Long training times and extensive parameter tuning*
Advantages and disadvantages
*For details on how the problems were tackled, please refer to the thesis
Combining Output as Integration Mechanisms
Cross Validation
Divides training data into eleven data partitions
Each data partition used for training, testing, and evaluation once*
Slightly better success rate
Bagging
Unweighted majority voting on each example or instance
Combine predictions from same algorithm or different algorithms*
Increases success rate
*For details on how the technique works, please refer to the thesis
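The unweighted majority voting described under Bagging can be sketched as follows. This is an illustrative sketch only: the function name, the labels, and the three prediction lists are hypothetical, and the thesis may organise the vote differently.

```python
from collections import Counter

def majority_vote(predictions_per_classifier):
    """Combine label lists (one per classifier) by unweighted majority vote."""
    combined = []
    for votes in zip(*predictions_per_classifier):
        # most_common(1) returns the label with the highest tally;
        # ties resolve to the label that was counted first.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three classifiers on four instances.
clf_a = ["fraud", "legal", "legal", "fraud"]
clf_b = ["fraud", "fraud", "legal", "legal"]
clf_c = ["legal", "fraud", "legal", "fraud"]

final = majority_vote([clf_a, clf_b, clf_c])
print(final)  # ['fraud', 'fraud', 'legal', 'fraud']
```

The same function applies whether the lists come from classifiers of one algorithm trained on different partitions or from different algorithms, since only the predicted labels are counted.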
Combining Output as Integration Mechanisms
Stacking
Meta-classifier
Base classifiers present predictions to meta-classifier*
Determines the most reliable classifiers
*For details on how the technique works, please refer to the thesis
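One simple way to picture a meta-classifier is a lookup table learned from held-out data: for each combination of base predictions, it records the true label seen most often. The slide does not say which meta-learner the thesis uses, so the table-based learner, the function names, and the data below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_meta(base_predictions, true_labels):
    """Learn, for each combination of base predictions, the most common true label."""
    table = defaultdict(Counter)
    for combo, label in zip(zip(*base_predictions), true_labels):
        table[combo][label] += 1
    return {combo: counts.most_common(1)[0][0] for combo, counts in table.items()}

def predict_meta(meta, combo, default=0):
    """Look up the learned label; fall back to a default for unseen combinations."""
    return meta.get(tuple(combo), default)

# Hypothetical held-out data: 1 = fraud, 0 = legitimate.
truth  = [1, 1, 0, 0, 1, 0]
base_a = [1, 1, 0, 0, 1, 0]   # reliable on this sample
base_b = [0, 1, 1, 0, 0, 1]   # unreliable on this sample

meta = train_meta([base_a, base_b], truth)
print(predict_meta(meta, (1, 0)))  # 1: follows base_a when the two disagree
print(predict_meta(meta, (0, 1)))  # 0: again follows base_a
```

On this toy sample the meta-classifier learns to trust the first base classifier, which is the sense in which stacking "determines the most reliable classifiers".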
Combining Output as Integration Mechanisms
Stacking (2)
Cluster Detection as Analytical Machinery and Visualisation Techniques as Visual Symbols
Analytical Machinery: Self Organising Maps
Clusters high dimensional elements into simpler, low dimensional maps
Automatically groups similar instances together
Do not specify an easy-to-understand model*
Visual Symbols: Classification and Clustering Visualisations
Classification visualisation – confusion matrix
- naive Bayesian visualisation
Clustering visualisation - column graph
*For details on how the problems were tackled, please refer to the thesis
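A very small Self Organising Map can be sketched in plain Python to show the mechanics: find the best matching unit for each input, then pull that unit and its map neighbours toward the input. The one-dimensional map, the fixed neighbourhood radius, the linearly decaying learning rate, and the toy data are all assumptions chosen for brevity, not the configuration used in the thesis.

```python
import random

def best_matching_unit(weights, x):
    """Index of the map unit whose weight vector is closest to x."""
    dists = [sum((w - v) ** 2 for w, v in zip(unit, x)) for unit in weights]
    return dists.index(min(dists))

def train_som(data, n_units=4, epochs=50, lr=0.5, radius=1, seed=0):
    """Train a tiny one-dimensional SOM; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)          # learning rate decays toward zero
        for x in data:
            bmu = best_matching_unit(weights, x)
            for i, unit in enumerate(weights):
                if abs(i - bmu) <= radius:        # update the BMU and its neighbours
                    for d in range(dim):
                        unit[d] += rate * (x[d] - unit[d])
    return weights

# Hypothetical inputs already scaled to [0, 1].
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.95], [0.85, 0.9]]
som = train_som(data)
clusters = [best_matching_unit(som, x) for x in data]
print(clusters)  # one map-unit index per input instance
```

Each instance is assigned to the index of its best matching unit, so instances with similar attribute values tend to land on the same or adjacent units. The map itself is only a grid of weight vectors, which is why it does not yield an easy-to-understand model on its own.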
Steps in the Crime Detection Method
Implementing the Crime Detection System: Preparation Component
The Crime Detection System
The Crime Detection System: Preparation Component
Problem Understanding
Determine investigation objectives
- Choose
- Explain
Assess situation
- Available tools
- Available data set
- Cost model*
Determine data mining objectives
- Max hits/Min false alarms
Produce project plan
- Time
- Tools
*For details, refer to the thesis
The Crime Detection System: Preparation Component
Data Understanding
Describe data
- 11550 examples (1994 and 1995)
- 3870 instances (1996)
- 33 attributes
- 6% fraudulent
Explore data
- Claim trends by month
- Age of vehicles
- Age of policy holder
Verify data
- Good data quality
- Duplicate attribute, highly skewed attributes
The Crime Detection System: Preparation Component
Data Preparation
Select data
- All, except one attribute, are retained for analysis
Clean data
- Missing values replaced
- Spelling mistakes corrected
Format data
- All characters converted to lowercase
- Underscore symbol
The Crime Detection System: Preparation Component
Data Preparation
Construct data
- Derived attributes
- weeks_past*
- is_holidayweek_claim*
- age_price_wsum*
- Numerical input
- 14 attributes scaled between 0 and 1
- 19 attributes represented by one-of-N or binary encoding*
*For details, refer to the thesis
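The two numerical-input steps above, scaling to the range 0 to 1 and one-of-N encoding, can be sketched as follows. The helper names and sample values are hypothetical. The slides give the attribute counts but not the exact scaling or encoding rules, so min-max scaling is an assumption here.

```python
def min_max_scale(values):
    """Scale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_of_n(values):
    """Encode each categorical value as a list with a single 1 (one-of-N)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical attribute values, not taken from the thesis data set.
vehicle_age = [0, 2, 5, 10]
accident_area = ["urban", "rural", "urban", "rural"]

print(min_max_scale(vehicle_age))   # [0.0, 0.2, 0.5, 1.0]
print(one_of_n(accident_area))      # [[0, 1], [1, 0], [0, 1], [1, 0]]
```

Numerical inputs of this kind matter mainly for the backpropagation classifiers, which cannot consume raw category labels.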
The Crime Detection System: Preparation Component
Data Preparation
Partition data
- Data multiplication or oversampling
- For example, 50/50 distribution
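Oversampling to a chosen class distribution can be sketched by replicating the minority (fraudulent) examples until they reach the target share. The function below is a hypothetical illustration. The thesis may build its partitions differently, for example by splitting the legitimate examples across several partitions.

```python
def oversample(rows, minority_label=1, target_fraction=0.5):
    """rows: list of (features, label). Replicate minority rows to reach target_fraction."""
    minority = [r for r in rows if r[1] == minority_label]
    majority = [r for r in rows if r[1] != minority_label]
    if not minority or not 0 < target_fraction < 1:
        raise ValueError("need minority rows and a fraction strictly between 0 and 1")
    # Solve needed / (needed + len(majority)) = target_fraction for needed.
    needed = round(target_fraction * len(majority) / (1 - target_fraction))
    # Cycle through the minority rows until enough copies exist.
    replicated = [minority[i % len(minority)] for i in range(needed)]
    return majority + replicated

# Hypothetical data: 94 legitimate and 6 fraudulent examples (6% fraud).
rows = [((i,), 0) for i in range(94)] + [((i,), 1) for i in range(6)]
balanced = oversample(rows)
fraud_share = sum(1 for _, y in balanced if y == 1) / len(balanced)
print(len(balanced), fraud_share)  # 188 0.5
```

Passing `target_fraction=0.4` instead would produce roughly a 40/60 fraud-to-legitimate split, the distribution the Build models (1) slide reports as most appropriate under the cost model.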
Implementing the Crime Detection System: Action Component
The Crime Detection System: Action Component
Modelling
Generate experiment design (1)
The Crime Detection System: Action Component
Modelling
Generate experiment design (2)
The Crime Detection System: Action Component
Modelling
Build models (1)
- Bagged X outperformed Averaged W
- Bagged Z performed marginally better than Averaged Y
- Experiment II achieved higher cost savings than I and III
- 40/60 distribution most appropriate under the cost model
- Experiment V achieved higher cost savings than II and IV
- C4.5 algorithm is the best algorithm for the data set
The Crime Detection System: Action Component
Modelling
Build models (2)
- Experiment VIII achieved slightly better cost savings than V
- Combining models from different algorithms is better than the single algorithm
- The top 15 classifiers from stacking consisted of 9 C4.5, 4 backpropagation, and 2 naive Bayesian classifiers*
*For details, refer to the thesis
The Crime Detection System: Action Component
Modelling
Build models (3)
- No scores from D2K software
- Experiment IX demonstrates sorted scores and predefined thresholds result in focused investigations*
- Satisfies Pareto’s Law
- Rules did not provide insights
- Already in domain knowledge and data attribute exploration*
- Experiment X requires 5 clusters for visualisation*
- age_of_policyholder
- weeks_past, is_holidayweek_claim
- make, accident_area, vehicle_category, age_price_wsum, number_of_cars, base_policy
*For details, refer to the thesis
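The idea behind sorted scores and predefined thresholds in Experiment IX can be sketched as ranking claims by score and passing only those at or above a cut-off to investigators. The claim identifiers, scores, and threshold below are hypothetical. The slides do not state the thresholds actually used.

```python
def flag_for_investigation(scores, threshold):
    """Return (claim_id, score) pairs at or above threshold, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(cid, s) for cid, s in ranked if s >= threshold]

# Hypothetical claim scores in [0, 1]; higher means higher relative fraud risk.
scores = {"c1": 0.91, "c2": 0.12, "c3": 0.67, "c4": 0.08, "c5": 0.85}
print(flag_for_investigation(scores, threshold=0.8))
# [('c1', 0.91), ('c5', 0.85)]
```

Raising or lowering the threshold trades the number of claims investigated against the share of fraud caught, which is how a small, highly ranked subset can account for most of the savings.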
The Crime Detection System: Action Component
Modelling
Assess models (1)
- Training and score data sets too small*
- Student’s t-test with k-1 degrees of freedom*
- McNemar’s hypothesis test*
*For details, refer to the thesis
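McNemar's test compares two classifiers on the same instances using only the cases where exactly one of them is correct. A minimal sketch of the statistic with the usual continuity correction follows. The prediction lists are hypothetical, and the slide does not show how the thesis applies the test.

```python
def mcnemar_statistic(truth, pred_a, pred_b):
    """McNemar's chi-squared statistic (with continuity correction) for two classifiers."""
    # b: A correct and B wrong; c: A wrong and B correct. Agreeing cases are ignored.
    b = sum(1 for y, pa, pb in zip(truth, pred_a, pred_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(truth, pred_a, pred_b) if pa != y and pb == y)
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical results on 30 fraudulent instances (1 = fraud).
truth  = [1] * 30
pred_a = [1] * 15 + [0] * 5 + [1] * 10   # A alone correct on 15 instances
pred_b = [0] * 15 + [1] * 5 + [1] * 10   # B alone correct on 5 instances

stat = mcnemar_statistic(truth, pred_a, pred_b)
print(round(stat, 2))  # 4.05
```

The statistic is compared against the chi-squared distribution with one degree of freedom, whose 5% critical value is about 3.84, so on this toy sample the difference between the two classifiers would be judged significant.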
The Crime Detection System: Action Component
Modelling
Assess models (2)
- Clusters 1, 2, and 3 have higher occurrences of fraud in 1996
- Clusters 1, 3, and 5 consist of several makes of inexpensive cars
- Utility vehicles, rural areas, and liability policies
- Clusters 2 and 4 contain claims submitted many weeks after the “accidents”
- Toyota, sport cars, and multiple policies
The Crime Detection System: Action Component
Modelling
Assess models (3)
- Statistical evaluation of descriptive cluster profiles
- Cluster 4
- 3121 Toyota car claims, 6% or 187 fraudulent
- 2148 Toyota sedan car claims, expect 6% or 129 to be fraudulent, with a standard deviation of about 10
- Actual 171 fraudulent Toyota sedan car claims, a z-score of 3.8 standard deviations
- This is an insight because it is statistically reliable, not known previously, and actionable
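The Cluster 4 figures can be checked with a short calculation. This sketch assumes the number of fraudulent claims follows a binomial distribution with the 6% Toyota fraud rate, an assumption the slide does not state. The claim counts are taken from the slide.

```python
import math

n = 2148            # Toyota sedan claims in Cluster 4 (from the slide)
p = 0.06            # fraud rate among Toyota claims (from the slide)
observed = 171      # fraudulent Toyota sedan claims actually found

expected = n * p
std_dev = math.sqrt(n * p * (1 - p))   # binomial standard deviation (assumption)
z = (observed - expected) / std_dev

print(f"expected = {expected:.0f}, std dev = {std_dev:.1f}, z = {z:.1f}")
# expected = 129, std dev = 11.0, z = 3.8
```

Under this assumption the standard deviation comes to about 11 claims, close to the roughly 10 quoted on the slide, and the resulting z-score of about 3.8 matches the reported value.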
The Crime Detection System: Action Component
Modelling
Assess models (4)
- Append main predictions from 3 algorithms and final predictions from bagging to 615 fraudulent instances
- 25 cannot be detected by any algorithms, highest lift in Clusters 1 and 2
- All can be detected by at least 1 algorithm in Cluster 3
- Not all fraudulent instances can be detected
- Domain knowledge, cluster detection, and statistics offer explanation
- 101 cannot be detected by 2 algorithms
- Weakness of bagging
- Other alternatives
The Crime Detection System: Action Component
Evaluation
Evaluate results
- Experiment VIII generated the best predictions, with cost savings of about $168,000. This is almost 30% of the total cost savings possible
- Most statistically reliable insight is the knowledge of 21 to 25 year olds who drive sport cars
Review process
- Unsupervised learning to derive clusters first
- More training data partitions
- More skewed distributions
- Cost model too simplistic
- Probabilistic Neural Networks
The Crime Detection System: Action Component
Deployment
Plan deployment
- Manage geographically distributed databases using distributed data mining
- Take time into account
Plan monitoring and maintenance
- Determined by rate of change in external environment and organisational requirements
- Rebuild models when cost savings are below a certain percentage of maximum cost savings possible
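The rebuild rule above can be sketched as a simple check. The function name, the 25% threshold, and the dollar figures are hypothetical. The slides do not state which percentage should trigger a rebuild.

```python
def needs_rebuild(achieved_savings, max_possible_savings, min_fraction):
    """True when achieved savings fall below min_fraction of the maximum possible."""
    if max_possible_savings <= 0:
        raise ValueError("max_possible_savings must be positive")
    return achieved_savings / max_possible_savings < min_fraction

# Hypothetical figures and threshold, for illustration only.
print(needs_rebuild(168_000, 560_000, min_fraction=0.25))  # False: 30% is above 25%
print(needs_rebuild(100_000, 560_000, min_fraction=0.25))  # True: about 18% is below 25%
```

In practice the check would run on each new scoring period, with the threshold set by how quickly the external environment and organisational requirements change.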
Contributions and Recommendations
Contributions
New Crime Detection Method
Crime Detection System
Cost Model
Visualisations
Statistics
Score-based Feature
Extensive Literature Review
In-depth Analysis of Algorithms
Recommendations – Technical Problems
Imperfect data
Statistical evaluation and confidence intervals
Preparation component of crime detection system
Derived attributes
Cross validation
Highly skewed data
Partitioned data with most appropriate distribution
Cost model
Black-box predictions
Classification and clustering visualisation
Sorted scores and predefined thresholds, rules
Recommendations – Practical Problems
Lack of domain knowledge
Action component of crime detection system
Extensive literature review
Great variety of fraud scenarios over time
SOM
Crime detection method
Choice of algorithms
Assessing data mining potential
Quality and quantity of data
Cost model
z-scores
INVESTIGATIVE DATA MINING IN FRAUD DETECTION
Scores are numbers within a specified range that indicate the relative risk that a particular data instance may be fraudulent; they are used to rank instances
Rules are expressions in the form of Body → Head, where Body describes the conditions under which the rule is generated and Head is the class label
Figure 1: Predictions using Precogs, Analytical Machinery, and Visual Symbols
Transforming Minority Report from Science Fiction to Science Fact
1 INTRODUCTION
The world is overwhelmed with terabytes of data, but there are only a few effective and efficient ways to analyse and interpret it.
The purpose of the research is to simulate the Precrime System from the science fiction novel, Minority Report, using data mining methods and techniques, to extract insights from enormous amounts of data to detect white-collar crime.
The application is in uncovering fraudulent claims in automobile insurance.
The objectives are to overcome the technical and practical problems of data mining in fraud detection.
3 RESULTS ON AUTOMOBILE INSURANCE DATA
Through the use of integration mechanisms, the highest cost savings are achieved
The analytical machinery facilitated the interesting discovery of 21 to 25 year old fraudsters who used sport cars as their crime tool
4 DISCUSSION
The black-box approach of the precogs is transformed into a semi-transparent approach by using analytical machinery and visual symbols to analyse and interpret the predictions.
Precogs can be shared between organisations to increase the accuracy of the predictions, without violating competitive and legal requirements.
The analytical machinery transforms multidimensional data into two-dimensional clusters which contain similar data, enabling the data analyst to easily differentiate the groups of fraud. It also allows the data analyst to assess the algorithms' ability to cope with evolving fraud.
The crime detection method provides a flexible step-by-step approach to generating predictions from any three algorithms, and uses some form of integration mechanism to increase the likelihood of correct final predictions.
5 CONCLUSION
Other possible applications of this crime detection method are:
Anti-terrorism
Burglary
Customs declaration fraud
Drug-related homicides
Drug smuggling
Government financial transactions
Sexual offences
2 THE CRIME DETECTION METHOD
Precogs, or precognitive elements, are entities which have the knowledge to predict that something will happen. Figure 1 uses three precogs to foresee and prevent crime by stopping potentially guilty criminals
Each precog contains multiple classification models, or classifiers, trained with one data mining technique to extrapolate the future
The three precogs are different from each other because they are trained by different data mining algorithms. For example, the first, second, and third precog are trained using naive Bayesian, C4.5, and backpropagation algorithms.
The precogs require numerical inputs of past examples to output corresponding predictions for new instances
Integration Mechanisms are needed. As each precog outputs its many predictions for each instance, all are counted and the class with the highest tally is chosen as the main prediction
Figure 1 shows that the main predictions can be combined either by majority count (bagging) or the predictions can be fed back into one of the precogs (stacking), to derive a final prediction
Analytical Machinery, or cluster detection, records, studies, compares, and represents the precogs’ predictions in easily understood terms
The analytical machinery is represented by the Self Organising Map (SOM) which clusters the similar data into groups
Figure 1 demonstrates that main predictions and final predictions are appended to the clustered data to determine the fraud characteristics which cannot be detected, and the most important attributes are selected for visualisation.
Visual Symbols, or visualisations, integrate human perceptual abilities in the data analysis process by presenting the data in some visual and interactive form.
The naive Bayesian and C4.5 visualisations facilitate analysis of classifier predictions and performance, and column graphs aid the interpretation of clustering results.
REFERENCES
Dick P K (1956) Minority Report, Orion Publishing Group, London, Great Britain.
Done by Clifton Phua for Honours 2003
Supervised by Dr. Damminda Alahakoon
Questions?