Hsinchun Chen, Ph.D., Director, COPLINK Center of Excellence, Artificial Intelligence Lab, University of Arizona : Hsinchun Chen, Ph.D., Director, COPLINK Center of Excellence, Artificial Intelligence Lab, University of Arizona. Crime Data Mining and Visualization for Intelligence and Security Informatics: The COPLINK Research
Acknowledgement: NSF, CIA, NIJ, COPS, TPD, PPD, KCC
Outline : Outline COPLINK Data Mining and Visualization Framework
COPLINK Testbed: Data Characteristics
COPLINK Connect and Detect Systems: Using COPLINK Data for Information Sharing and Crime Relationship Identification
COPLINK Visual Data Mining Research: Crime Visualization, Agent, Deception Detection, Criminal Network Analysis
Outline : Outline The COPLINK Crime Data Mining and Visualization Framework
Introduction : Introduction The concern about national security has increased significantly since the terrorist attack on September 11, 2001
Intelligence agencies such as the CIA and FBI are actively collecting and analyzing information to investigate terrorists’ activities
Local law enforcement agencies have also become more alert to criminal activities in their own jurisdictions that may be relevant to national security
Challenge : Challenge The difficulty of analyzing the large volumes of data involved in criminal and terrorist activities
Some criminal activities are highly organized and relevant data can be voluminous, yet diffuse in geography and time span
Hard to see the overall picture until tragic events happen
New crime types emerge as technology evolves, e.g., cybercrimes, which can be difficult to detect because busy network traffic and frequent online transactions generate large amounts of data, of which only a tiny portion relates to criminal activity
KDD : KDD Knowledge discovery and dissemination (KDD) techniques hold the promise of making it easy, convenient, and practical to explore very large databases
We present a general research framework and suggest high-impact challenge problems for KDD
Two dimensions: (1) Crime types and security concerns; (2) Crime analysis approaches and techniques
Crime Types at the Local Law Enforcement Level (1) : Crime Types at the Local Law Enforcement Level (1) Traffic violations
Offenders are cited or arrested when traffic violations are discovered by police officers
Sexual assault and other sexual offenses (e.g., child molesting)
Theft: illegal seizure of properties (e.g., robbery, burglary, larceny, motor vehicle theft, etc.)
Fraud: intentional perversion of truth in order to induce another to part with something of value or to surrender a legal right
e.g., forgery and counterfeiting, embezzlement, and identity deception
Crime Types at the Local Law Enforcement Level (2) : Crime Types at the Local Law Enforcement Level (2) Gang/drug offense: illegal sales or possession of drugs
Organized criminal activities are frequently found (e.g., with gangs) and can be traced through various sources of evidence (e.g., persons involved, vehicles, locations)
Violent crime: criminal activities that involve the use of force or armed weapons (e.g., guns, narcotics, bombs)
Typically, behavior of the criminals can be traced and location and time of incident are critical in identifying the suspects
Crime Types at the National Security Level (1) : Crime Types at the National Security Level (1) Sex crime: Prostitution can be an organized crime that involves more than one country. Examples include the illegal trading of prostitutes, organized pedophilia, etc.
Theft: The theft of national secret or weapon information can cause severe damage on the national or international level.
Fraud: It refers to deceptive behavior conducted in an illegal way. Specific crime types include transnational money laundering, identity fraud, and transnational financial fraud.
Crime Types at the National Security Level (2) : Crime Types at the National Security Level (2) Gang/drug offenses: drug trafficking conducted by organized gangs across national borders is an important type of crime in this category
Other types: setting up and running international criminal organization (e.g., the Mafia in Italy and the U.S., Yakuza in Japan, and The Chinese Triads in Hong Kong)
Violent crime: the act of terrorism. Examples include bombing, hijacking, bioterrorism, etc.
Terrorism – the unlawful use of force or violence against persons or property to intimidate or coerce a government, the civilian population, or any segment thereof, in furtherance of political or social objectives (FBI definition).
Crime Types at the National Security Level (3) : Crime Types at the National Security Level (3) Cybercrime: computer-mediated activities which are illegal and which can be conducted through global electronic networks
Owing to the pervasiveness of the Internet, cybercrime can occur on both local and national levels
The intentions of cyber-criminals can be political, social, or financial
Examples of cybercrime include Internet frauds, network intrusion, cyber-piracy, cyber-pornography, theft of confidential information, hate crime (race and religion), etc.
Crime Analysis Approaches and Techniques (1) : Crime Analysis Approaches and Techniques (1) Association Rules Mining
The process of discovering frequently occurring criminal elements in a database
Intrusion detection: to identify patterns of program executions and user activities as association rules
Classification
The process of finding the common properties among different crime entities and classifying them into groups
Clustering
The process of grouping criminal items into classes of similar characteristics
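The association rule idea above can be sketched as counting how often pairs of entities appear in the same incident. This is a minimal illustration, not the COPLINK implementation: the toy incidents, the function name `pair_rules`, and the `min_support` cutoff are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy incidents: each is the set of entities named in one report.
incidents = [
    {"person:A", "vehicle:V1", "location:L1"},
    {"person:A", "vehicle:V1"},
    {"person:B", "location:L1"},
    {"person:A", "vehicle:V1", "person:B"},
]

def pair_rules(transactions, min_support=0.5):
    """Return {(x, y): (support, confidence)} for rules x -> y over item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of incidents containing both items
        if support < min_support:
            continue
        # Confidence of x -> y is P(y in incident | x in incident).
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules

rules = pair_rules(incidents)
```

Here only the person A / vehicle V1 pair clears the support cutoff, so it is the only association reported in either direction.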
Crime Analysis Approaches and Techniques (2) : Crime Analysis Approaches and Techniques (2) Social Network Analysis
Establish a network that illustrates the roles of criminals, the flow of tangible/intangible goods and information, and the associations among these entities
Sequential Pattern Mining
Find frequently occurring sequences of items over a set of transactions that occurred at different times (e.g., to detect temporal pattern of network attack)
String Comparators
Algorithms used in fraud and deception detection
Entity Extraction
The process of identifying patterns of particular types from unstructured data such as text, image, or audio materials
A KDD Research Framework for Crime Data Mining : A KDD Research Framework for Crime Data Mining
Challenge Problems : Challenge Problems How can criminal identities and events be detected and extracted automatically and correctly across different crime types and from different media sources?
How can criminal and intelligence patterns be identified automatically and correctly as clusters and associations?
How can criminal and intelligence patterns be classified automatically and correctly and used for future event prediction?
How can criminal and intelligence analysis results be summarized and presented in an intuitive and effective visual format for analysts?
Outline : Outline COPLINK Testbed: Data Characteristics
Tucson PD Data Sources : Tucson PD Data Sources TPD Record Management System:
Stores a wide range of information from incident reports to warrants to pawn tickets, from person descriptions to vehicles to weapons and property items. Incident data goes back as early as 1983.
Database: Litton PRC RMS31 on Oracle 7.3, Compaq OpenVMS
TPD Mug Shot Database:
Stores about 90,000 mug shots taken by the ID Department.
Database: ImageWare on SQL Server 7.0, Windows NT 4.0 Server
TPD Gang Database:
Stores comprehensive information about 3,200 gang members: their activities, aliases, physical descriptions, vehicles, etc.
Database: In House Access 97, Windows NT 4.0 Server
Tucson PD RMS Documents : Tucson PD RMS Documents Incident Reports:
Report number, crime type, precinct, MOs, date and time.
Pawn Tickets:
Ticket number, date and time.
Warrants:
Warrant number, docket number, type and issue date.
Field Interviews:
FI number, type, precinct, date and time.
Tucson PD RMS Data Objects : Tucson PD RMS Data Objects Person:
True names, aliases, descriptions, addresses, IDs, marks and phone numbers.
Organization:
Name, address and phones.
Vehicle:
VIN, license plate, make, model, style, year and colors.
Property:
Serial number, type, make, model, size and colors.
Weapon:
Serial number, type, manufacturer, caliber and colors.
COPLINK Database: Tucson PD : COPLINK Database: Tucson PD
Phoenix PD Data Sources : Phoenix PD Data Sources Police Automated Computer Entry System, PACE:
Stores a wide range of information including: incident reports, citations, field interrogations, person descriptions, vehicles, property items, and weapons.
Seven years of data (1996 to 2002) are extracted into Coplink 2.5.
Database: Unisys DMS II on Unisys Clearpath system
Phoenix PD RMS Documents : Phoenix PD RMS Documents Incident Reports:
Report number, crime type, precinct, MOs, date and time.
Citations:
Citation number, type, charges, precinct, date and time.
Arrests:
Booking number, type, charges, date and time.
Phoenix PD RMS Data Objects : Phoenix PD RMS Data Objects Person:
True name, aliases, descriptions, addresses, IDs, marks and phones.
Organization:
Name, address and phones.
Vehicle:
VIN, license plate, make, model, style, year and colors.
Property:
Serial number, type, make, model, size and colors.
Weapon:
Serial number, type, manufacturer, caliber and colors.
COPLINK Database: Phoenix PD : COPLINK Database: Phoenix PD
TPD Data vs. PPD Data : TPD Data vs. PPD Data Area coverage: TPD data represent criminal data in Tucson area only. PPD data comprise incident data from several adjacent agencies: Phoenix PD, Scottsdale PD and Glendale PD.
Narrative data: PPD has a much more comprehensive collection of narrative data: 1,800,000 at PPD vs. 300,000 at TPD.
Property data: PPD has a more thorough collection of property data: 2,000,000 at PPD vs. 400,000 at TPD.
Mug shot data: While TPD data include mug shot information, PPD data do not.
Data Scrubbing (KDD testbed) : Data Scrubbing (KDD testbed) Data scrubbed: person last names, person IDs, phones and extensions, street and apartment numbers, vehicle license plates.
All scrubbed names remain meaningful, e.g., person names (Johnson, Martinez).
To maintain data consistency, the uniqueness of each entity is preserved after scrubbing, i.e., the same original entity has the same scrubbed entity.
Narratives are excluded from the testbed because there appears to be no reliable way to identify and scrub person names, person IDs, phones, addresses, and license plates in the narrative text.
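The consistency requirement above (the same original entity always maps to the same scrubbed entity, and distinct entities stay distinct) can be sketched with a lookup table. This is an illustration only: the class name `ConsistentScrubber` and the replacement pool are hypothetical, and the actual scrubbing procedure used for the testbed is not described in the slides.

```python
class ConsistentScrubber:
    """Assigns each original value one stable, unique replacement."""

    def __init__(self, replacement_pool):
        # Pool of meaningful replacement values (hypothetical examples below).
        self._unused = list(replacement_pool)
        self._mapping = {}

    def scrub(self, original):
        # Reuse the earlier replacement so entity uniqueness is preserved.
        if original not in self._mapping:
            if not self._unused:
                raise ValueError("replacement pool exhausted")
            self._mapping[original] = self._unused.pop(0)
        return self._mapping[original]

scrubber = ConsistentScrubber(["Johnson", "Martinez", "Lee", "Brown"])
first = scrubber.scrub("original-surname-1")
second = scrubber.scrub("original-surname-2")
```

Calling `scrub` again with the first surname returns the same replacement, so joins across tables still line up after scrubbing.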
COPLINK Documentation : COPLINK Documentation Sample COPLINK ERD, Entity Relationship Diagram
COPLINK Documentation : COPLINK Documentation COPLINK Data Dictionary: 217 Tables, 1000 attributes
COPLINK Data Formats : COPLINK Data Formats Delimited ASCII text files
SQL Server 2000 backup file
SQL Server 2000 detached database
Oracle 8i/9i dump file
Oracle 8i/9i transportable tablespace
DB2 UDB 7 backup file
TPD PPD data available: 4/1/2003
Outline : Outline COPLINK Connect and Detect Systems: Using COPLINK Data for Information Sharing and Crime Relationship Identification
Slide32 : COPLINK Progression
1990-present NSF CISE funding (IIS, Digital Government, Digital Library, NSDL, ITR, IDM, CSS), NIH/NLM, DARPA
1997 NIJ COPLINK funding; Web-enabled data warehousing
NIJ AGILE interoperability funding; information sharing
NSF Digital Government funding; data/text mining, agents, and knowledge management; COPLINK Center of Excellence
NSF/CIA KDD funding; national security research, COPLINK testbed; Border Safe research
Goal: A model and testbed for law enforcement and national security research
COPLINK Recognitions : COPLINK Recognitions
Time Magazine Global Business December 23, 2002
"Data Miners" Americans got a glimpse of how such a system might work this fall during the Washington-sniper investigation.
Life Week Magazine November 18, 2002
The Washington Post November 7, 2002
"A Missing Link Most Wanted" Linking facts in the sniper case will be a big test of what Coplink can do.
The New York Times November 2, 2002
"An Electronic Cop That Plays Hunches" It is an Internet-based system called Coplink, developed at an artificial intelligence laboratory…
Tucson Citizen October 23, 2002
"Tucson Cops, local software to help in D.C. sniper probe" A computer database system that Tucson police employ in crime investigations will be used in the hunt for the Washington, D.C.-area sniper or snipers.
Arizona Daily Star October 23, 2002
"Sniper probe to get help from Tucson" A program developed by the University of Arizona will be used to try to capture the Washington, D.C., area sniper.
The Innovation Groups August 5, 2002
"Regional Information Sharing Project for Huntsville, Texas Law Enforcement Agencies" The city of Huntsville, TX recently granted a contract to implement COPLINK.
Los Angeles Times May 20, 2002
"Making a Digital Government" Lawrence Brandt's latest job is to get federal agencies to share technology and information
KMWorld March 2002
Law enforcement is an information-intensive process, beginning with data collection at crime scenes and extending through records management and analysis of data to support crime-solving.
DG Online December 2001
"Super Detective" When University of Arizona professor Hsinchun Chen combined police databases for a consortium of city police agencies, a super-detective was born.
POLICE Magazine March 2001
Coplink Shifts and Shares Information – Fast.
Law Enforcement Technology Magazine March 2001
Software For Data Searchers.
The POLICE CHIEF March 2001
Information Sharing System "Coplink".
United Daily News (Taiwan) February 2, 2001
AI Lab's Chinese semantic retrieval system is the engine behind UDN's (United Daily News) acclaimed intelligent news search service.
Tucson Citizen January 17, 2001
"Use of COPLINK spreads, fuels company's growth"
Arizona Daily Star January 7, 2001
"Technology developed in Tucson is helping police catch criminals faster."
FCW.com April 03, 2000
Changing the Rules of the Game. How Coplink is Helping Police Departments Match Evidence Across Boundaries of Time and Space.
TechBeat August 1999
It's called a "Web-based intuitive integrated interface." But in layman's terms it's called "Coplink." What it will do is help put an end to a serious problem faced by law enforcement every day... the inability to exchange information about criminal cases across jurisdictions
NLECTC June 1999
No tool similar to Coplink has been available previously because the technology that would foster this kind of connectivity and interoperability did not exist.
Government Computer News January, 1998
"COPLINK intranet (designed by the AI Lab) will bring Arizona crime fighters to the data they need."
Software Components : Software Components (Source: Knowledge Computing Corporation, Tucson, AZ)
COPLINK Connect : COPLINK Connect Consolidating & Sharing Information promotes problem solving and collaboration Records Management Systems (RMS) Mugshots Database Gang Database
COPLINK Connect Functionality : COPLINK Connect Functionality Generic, common XML based criminal elements representation
Data migration (batch and incremental) and mapping for all major databases and legacy systems
Database independent: ODBC compliance data warehouse
Multi-layered Web-based architecture: database server, Web server, browser
Powerful and flexible search tools for various reports, e.g., incidents, warrants, pawns, etc.
Graphical browser-based GUI interface for ease of use, training and maintenance
H. Chen, J. Schroeder, R. V. Hauck, L. Ridgeway, H. Atabakhsh, H. Gupta, C. Boarman, K. Rasmussen, and A. W. Clements, “COPLINK Connect: Information and Knowledge Management for Law Enforcement,” Decision Support Systems, Special Issue on Digital Government, 2002, forthcoming.
COPLINK Detect : COPLINK Detect Consolidated information enables targeted problem solving via powerful investigative criminal association analysis
COPLINK Detect Functionality : COPLINK Detect Functionality Simple association rule mining applied to criminal elements relationships
Generic, common XML based representation for criminal relationships
Incremental data migration and association analysis on databases
Support powerful, multi-attribute queries using partial crime information
Graphical browser-based GUI interface for simple crime relationship analysis and case retrieval
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, 2002, forthcoming.
COPLINK Detect 2.0/2.5 : COPLINK Detect 2.0/2.5
COPLINK Connect/Detect Status : COPLINK Connect/Detect Status Systems widely adopted for law enforcement information sharing and analysis. Commercialized and supported by KCC
Systems deployed at: Tucson, Phoenix, Salt River (AZ), Huntsville (TX), Polk County/Des Moines (IA), Ann Arbor (MI), Montgomery (DC), Henderson County (NC), Boston (MA), Redmond, Spokane (WA), Shawnee County (KS)
Under deployment/development: San Diego (CA), Pima County (AZ), Philadelphia (PA), Hennepin County (MN), US Customs Border Patrol (AZ), Middlesex County (NJ), State of Alaska, State of Hawaii
Outline : Outline COPLINK Visual Data Mining Research: Crime Visualization, Criminal Network Analysis, Deception Detection, Agent
COPLINK Visual Data Mining Research : COPLINK Visual Data Mining Research COPLINK Criminal Network Analysis: Association Tree, Association Network Analysis, Temporal-Spatial Visualization
P1000: A Picture is worth 1000 words.
Use visual representations and effective HCI to assist in more efficient and effective crime analysis
Leverage different representations and algorithms: hyperbolic trees, network placement algorithms, structural analysis, geo-spatial mapping, time visualization
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing Law Enforcement Data and Knowledge,” Communications of the ACM, 2002, forthcoming.
Existing Network Analysis Tools : Existing Network Analysis Tools First generation — manual approach
Anacapa Chart (Harper & Harris, 1975)
Second generation — graphics-based approach
Analyst’s Notebook, Netmap, Watson
COPLINK hyperbolic tree view, network view
Third generation — structural analysis approach
Anacapa Chart (1st generation) : Anacapa Chart (1st generation)
Manually extract criminal associations from data files
Construct an association matrix and draw a link chart based on the association matrix
Figure panels: Association Matrix; Link Chart
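The two manual steps above can be sketched in code: count co-occurrences to fill an association matrix, then list the non-zero cells as link-chart edges. This is an illustrative sketch only; the function names and the rule "two people are associated when they appear in the same incident" are assumptions, not part of the Anacapa method as stated in the slides.

```python
from itertools import combinations

def association_matrix(incidents):
    """Count, for each pair of people, how many incidents name both."""
    people = sorted({p for incident in incidents for p in incident})
    index = {p: i for i, p in enumerate(people)}
    matrix = [[0] * len(people) for _ in people]
    for incident in incidents:
        for a, b in combinations(sorted(set(incident)), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1  # keep the matrix symmetric
    return people, matrix

def link_chart_edges(people, matrix):
    """List each associated pair once, with its co-occurrence count."""
    return [
        (people[i], people[j], matrix[i][j])
        for i in range(len(people))
        for j in range(i + 1, len(people))
        if matrix[i][j] > 0
    ]

# Hypothetical incidents, each listing the people involved.
people, matrix = association_matrix([["A", "B"], ["A", "B", "C"], ["C", "D"]])
edges = link_chart_edges(people, matrix)
```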
Analyst’s Notebook, Netmap, Watson (2nd generation) : Analyst’s Notebook, Netmap, Watson (2nd generation)
Slide47 : COPLINK Association Tree and Network (2nd generation)
COPLINK Criminal Structural Analysis (3rd generation) : COPLINK Criminal Structural Analysis (3rd generation) Criminal association identification
Using shortest-path algorithms to find the strongest associations between two or more criminals in a network
SNA (Social Network Analysis)
Using blockmodel analysis to detect subgroups and patterns of interactions between groups
Identifying leaders, gatekeepers, and outliers from a criminal network
J. Xu & H. Chen, “Criminal Network Analysis: A Data Mining Perspective,” AI Lab Technical Report, 2002.
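One way to read "shortest path finds the strongest association" is sketched below. It assumes each link carries a strength in (0, 1] and that the strength of a path is the product of its link strengths; under that assumption, maximizing the product is the same as minimizing the sum of negative log strengths, which Dijkstra's algorithm handles. The weighting scheme and the function name `strongest_path` are assumptions for illustration, not details confirmed by the slides.

```python
import heapq
import math

def strongest_path(graph, source, target):
    """graph: {node: {neighbor: strength in (0, 1]}}. Returns (path, strength)."""
    best = {source: 0.0}  # lowest accumulated -log(strength) found so far
    previous = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, strength in graph.get(node, {}).items():
            new_cost = cost - math.log(strength)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], math.exp(-best[target])

# Hypothetical network: the indirect route A-B-C (0.9 * 0.8) beats A-C (0.5).
network = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"A": 0.5, "B": 0.8},
}
path, strength = strongest_path(network, "A", "C")
```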
A 9/11 Terrorist Network : A 9/11 Terrorist Network
The proposed framework : The proposed framework
Experiment : Experiment Data Sets
TPD incident summaries
Time period—Narcotics: 2000-present; Gangs: 1995-present
Size
Two testing networks
Narcotics (60 individuals)
Gang (24 individuals)
The COPLINK SNA Project (The narcotic network example) : The COPLINK SNA Project (The narcotic network example)
The COPLINK SNA Project (The gang network example) : The COPLINK SNA Project (The gang network example)
Patterns Found : Patterns Found The chain structure of the narcotic network
Implications: disrupt the network by breaking the chain
The star structure of the gang network
Implications: disrupt the network by removing the leader
Slide55 : Expert Validation. Subgroup labels: white gangs who were involved in murders and shootings; white gangs who sold crack cocaine; a group of black gangs
Slide56 : “Yes, these two groups are together very often” “(211) and (173) are best friends”
Slide57 : “He is very important. He has a lot of money and sells drugs. His girl friend brings a lot of dancers in the city and buy drugs”
COPLINK Spatial-Temporal Visualization: Timeline Tool : COPLINK Spatial-Temporal Visualization: Timeline Tool Visualizes the chronologically ordered set of events associated with user-selected database entities
Events placed along horizontal axis
Entities placed along vertical axis
Entities can be grouped together
Each row contains all events associated with the entities in a group
Time-based Zooming
User can zoom into a specific time interval for more detail, while hiding uninteresting portions of the timeline
Slide59 : COPLINK Spatial-Temporal Visualization: GeoMapping Tool
Plots location of incident events within a selected time interval
Zooming/panning capabilities
User-selectable GIS layers
Overview map
Provides context to the currently selected region
Plot events over time
Plot events as they occur, using different color shadings to indicate when each occurred relative to other events
Plot events as they occur and remove them after they are over, using directed arrows to highlight movement from one event to the next in time
Slide60 : COPLINK Spatial-Temporal Visualization: Periodic Pattern Tool
Reveals periodic patterns of incident occurrence
Incident events will be plotted continuously on a circular graph
Time period represented along circle (day, week, month, etc.)
Height from center indicates number of incidents that occurred at that specific time
Customizable granularity (e.g. year, month, day, etc.)
3-sigma statistical significance line
Indicates unusually large or small number of occurrences at a specific time
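The 3-sigma line can be illustrated with a small sketch: bin incidents by time slot, then flag slots whose counts lie more than three standard deviations from the mean count. How the tool actually computes its significance line is not stated in the slides, so the per-bin statistics, the population standard deviation, and the function name `unusual_bins` are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

def unusual_bins(event_bins, n_bins, k=3.0):
    """Return bins whose incident count is more than k sigma from the mean."""
    counts = Counter(event_bins)
    series = [counts.get(b, 0) for b in range(n_bins)]
    mu, sigma = mean(series), pstdev(series)
    return [b for b, c in enumerate(series) if abs(c - mu) > k * sigma]

# Hypothetical data: 10 incidents in every hour of the day, plus 90 extra at 02:00.
hours = [h for h in range(24) for _ in range(10)] + [2] * 90
flagged = unusual_bins(hours, n_bins=24)
```

With hourly bins this flags only the 02:00 slot; changing `n_bins` corresponds to the tool's customizable granularity.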
COPLINK Visual Data Mining Research : COPLINK Visual Data Mining Research Deception Detection, a data mining approach
“An agent must spell a suspect’s name exactly right, or the FBI computer will not recognize it. That can be particularly frustrating in cases such as the Sept. 11 probe, in which suspects have used multiple names and sometimes created identities by switching a few letters in their names.” – FBI
FBI’s problem with 9/11 suspect names, e.g., “Majed M.GH Moqed,” “Majed Moqed,” and “Majed Mashaan Moqed,” and DOB, e.g., “01-01-1976” and “03-03-1976.”
A deception taxonomy was created based on criminal deceptions in law enforcement databases
Patterns existed in criminal deceptions, e.g., SSN variations, name variations, etc.
Phonetic and syntactic string comparators are adopted
Promising initial testing result: 94% accuracy in deception detection
G. Wang, H. Chen, H. Atabakhsh, “Automatically Detecting Deceptive Criminal Identities,” Communications of the ACM, forthcoming, 2002.
A Taxonomy for Deceptions in Criminal Identity : A Taxonomy for Deceptions in Criminal Identity
A Taxonomy of Deceptions in Criminal Identity: Name Deception : A Taxonomy of Deceptions in Criminal Identity: Name Deception Name Deception:
Either false first name or false last name (62.5%)
Only the middle initial is changed (62.5%)
Similar pronunciation but different spelling (42%)
A completely false name (29.2%)
Using abbreviated names or adding extra letters (29.2%)
Leaving out the first name or last name (29.2%)
Exchanging last name and first name (8%)
A Taxonomy of Deceptions in Criminal Identity: DOB, SSN, Residency : A Taxonomy of Deceptions in Criminal Identity: DOB, SSN, Residency DOB and ID (SSN) deception:
In most cases, criminals only make minor changes in DOB and SSN, e.g., 19700207 becomes 19700208
Residency deception:
42% criminals in the collection deceived on address information. In most cases, only one portion of the address is changed slightly, e.g., street number.
String Comparators : String Comparators Phonetic Russell SoundEx code: Newcombe [1959], encodes a name with a format having a prefix letter followed by a three-digit number,
e.g., PEARCE and PIERCE both coded as: “P620”.
However, phonetic matching is particularly poor at finding matches [Zobel and Dart 1996];
Spelling string comparator [Jaro 1976; Winkler 1990].
compares spelling variations between two strings instead of phonetic codes
Limitation: common characters in both strings must be within half the length of the shorter string
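The Soundex encoding described above (a prefix letter followed by three digits) can be sketched as follows. This follows the commonly published American Soundex rules, which may differ in small details from the Russell variant the slide cites; the function name `soundex` is chosen for the example.

```python
# Consonant groups that share a Soundex digit.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}

def soundex(name):
    """Encode a name as its first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not separate repeated codes; vowels do
            previous = digit
    return (code + "000")[:4]  # pad or truncate to letter + three digits
```

Both `soundex("PEARCE")` and `soundex("PIERCE")` give `"P620"`, matching the example on the slide.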
Other Approximate String Matching tool : Other Approximate String Matching tool Agrep [Wu, Manber 1992]: A general string matching algorithm that can handle character variations of insertion, deletion, and substitution.
The pattern is represented as a bit array. The computation only involves simple bit operations (RightShift) and logic operations (AND, OR) on bit arrays.
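$$R^{d}_{j+1} = \mathrm{Rshift}[R^{d}_{j}] \;\mathrm{AND}\; S_c \;\mathrm{OR}\; \mathrm{Rshift}[R^{d-1}_{j} \;\mathrm{OR}\; R^{d-1}_{j+1}] \;\mathrm{OR}\; R^{d-1}_{j}$$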
Agrep has been integrated into Unix and been in wide use since June 1991
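A compact sketch of the bit-parallel recurrence above, written with Python integers as bit arrays. It is an illustration of the technique rather than the agrep source: bit i of `R[d]` means the first i+1 pattern characters match a suffix of the text read so far with at most d errors, the slide's Rshift becomes a left shift with the low bit set, and the function name `fuzzy_find` is hypothetical.

```python
def fuzzy_find(pattern, text, k):
    """Return the index in text where the first match with <= k errors ends, or -1."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    full = (1 << m) - 1
    masks = {}  # S_c: bit i is set when pattern[i] == c
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    # With d errors allowed, the first d pattern characters can always be deleted.
    R = [(1 << d) - 1 for d in range(k + 1)]
    accept = 1 << (m - 1)
    for j, c in enumerate(text):
        S = masks.get(c, 0)
        old = R[:]  # the R^d_j values, before reading text[j]
        R[0] = ((old[0] << 1) | 1) & S
        for d in range(1, k + 1):
            R[d] = (
                (((old[d] << 1) | 1) & S)   # characters match
                | old[d - 1]                # insertion: extra character in text
                | ((old[d - 1] << 1) | 1)   # substitution
                | ((R[d - 1] << 1) | 1)     # deletion: pattern character skipped
            ) & full
        if R[k] & accept:
            return j
    return -1
```

For example, `fuzzy_find("abc", "xxabdxx", 1)` reports a match ending at index 3, because "ab" is within one deletion of "abc".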
Algorithm Design : Algorithm Design Compare corresponding fields of each pair of records (disagreement): $S_{\mathrm{name}}$, $S_{\mathrm{DOB}}$, $S_{\mathrm{addr}}$, and $S_{\mathrm{ID}}$
To capture different types of name deceptions, calculate the normalized Euclidean distance for the overall dissimilarity between two records, i.e., Disagreement =
Experimental Results (Training: 80 cases) : Experimental Results (Training: 80 cases) Table: Distance matrix, the distance value shows the degree of disagreement between each pair of records in the training data set.
Experimental Results (Training: 80 cases) : Experimental Results (Training: 80 cases) Table: Determining best threshold value (0.48)
Experimental Results (Testing: 40 cases) : Experimental Results (Testing: 40 cases) Table: Accuracy of deception detection when the best threshold value (0.48) is applied to the testing data set (40 records)
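The overall disagreement score and threshold test can be sketched as below. The exact normalization is not legible in the slides, so this assumes each field score lies in [0, 1] and that the Euclidean distance is normalized by the number of fields; treating pairs that fall below the 0.48 threshold as likely the same person is also an assumption, as are the function names.

```python
import math

def disagreement(field_scores):
    """Normalized Euclidean distance over per-field disagreement scores in [0, 1]."""
    return math.sqrt(sum(s * s for s in field_scores) / len(field_scores))

def likely_same_person(field_scores, threshold=0.48):
    """Assumed decision rule: low overall disagreement suggests one identity."""
    return disagreement(field_scores) < threshold

# Hypothetical scores for (name, DOB, address, ID): only the name differs noticeably.
score = disagreement([0.6, 0.0, 0.0, 0.0])
```

With equal weighting, a single field that disagrees moderately still leaves the pair under the threshold, which fits the observation that most deceptions change only one field slightly.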
COPLINK Agent Research : COPLINK Agent Research COPLINK Agent: alert and collaboration in a wireless architecture
Enhance police information timeliness, collaboration, mobility, and safety via a web-based wireless alerting system (under testing at TPD)
Real-time alert of time-critical information from multiple databases, e.g., CAD (computer-aided dispatching) database, MVD
Identify and inform officers/detectives who are working on similar cases
Push time-critical information via wireless and personalized communications, i.e., web alert, email, cell phone, and pager
COPLINK Agent: Wireless Alert and Collaboration : COPLINK Agent: Wireless Alert and Collaboration Allows Patrol Officers to enhance their community expertise
Further promotes Officer safety through curbside knowledge
Secure wireless access and alert: laptop, PDA, pager, cell phone Alert: 24-7 monitoring of time-critical information from different databases
Collaboration: Automatically informing detectives working on similar cases
COPLINK Agent: Vehicle Search Form : COPLINK Agent: Vehicle Search Form Multi-DB Search Alert Method Notification setting
COPLINK Agent: Web and E-mail Collaboration Alerts : COPLINK Agent: Web and E-mail Collaboration Alerts Web Alert Email Alert
COPLINK Agent: Cell Phone and Pager Alert : COPLINK Agent: Cell Phone and Pager Alert Pager alert with case number Cell phone alert
Agent User Study and Result Summary : Agent User Study and Result Summary Study Design:
Case study method based on structured interviews, archival records analysis, and usability survey.
Use QUIS (Questionnaire for User Interaction Satisfaction) survey instrument developed by the HCI Lab at the U. of Maryland.
10 participants: crime analysts and detectives in several TPD units.
Positive feedback on system Effectiveness and Efficiency:
Monitoring: “… the information I have received back was instrumental in making at least 2 felony cases that will be prosecuted on the federal level.”
Collaboration from CAD Alert: “… allowing us to respond to incidents we know are important that the field units perhaps don’t realize in a timely manner.”
Multi-database Search: “The Tucson City Court Search was helpful because I located one of my suspects on her court date.”
High User Satisfaction from QUIS survey items:
Averaged 5.5 for 49 items on a 7-point Likert scale (7: most useful).
Strengths: Offers good Investigative power; Easy to read layout; Potential for Collaborative information sharing; CAD Integration; High intention to use.
Weaknesses: Lack of help messages; Difficult for inexperienced users; Obscure user preference settings.
Cyber Crime Analysis : Cyber Crime Analysis Cyber crime refers to computer-mediated activities which are illegal and which can be conducted through global electronic networks [1].
Crime types
Cyber attacks: network intrusion, email bombing
Distribution of illegal materials in the cyberspace: pirate software/CD, child pornography, cyber fraud
[1] Thomas, D. and Loader, B.D., “Introduction - Cyber Crime: law enforcement, security and surveillance in the information age,” in Cyber Crime: Law enforcement, security and surveillance in the information age, Taylor & Francis Group, New York, NY, 2000.
Challenge and Solution : Challenge and Solution Data collection – use web spiders and peer-to-peer software to download messages/files from the network.
Rule forming – ask domain experts to form rules to determine the legality and severity of the messages/files.
Identity tracing
There are thousands of millions of Internet users; cyber criminals use different IDs to hide their identities on the Internet.
Law enforcement agencies do not have enough resources to trace the activities of each suspicious ID; they must focus their efforts on the major suspects.
Authorship Analysis technique can help identify the author of a message based on the person’s writing style.
Authorship Analysis : Authorship Analysis Authorship analysis attempts to determine the likelihood of a particular author having written a piece of work based on some characteristics of the author [2].
The essence of this technique is the formation of a set of metrics, or forensics, that remain relatively constant for a large number of writings created by the same person [3].
In the cyber crime research context, this technique can help determine whether a set of illegal Internet messages belongs to the same user based on the person’s writing style.
[2] A. Gray, P. Sallis, and S. MacDonell, “Software forensics: Extending authorship analysis techniques to computer programs,” in Proc. 3rd Biannual Conf. Int. Assoc. of Forensic Linguists (IAFL'97), pages 1-8, 1997.
[3] O. de Vel, “Mining e-mail authorship,” in Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD) 2000.
Experiment – Data Collection : Experiment – Data Collection An experiment was conducted to test the prediction accuracy of authorship analysis algorithm.
2 types of data were used:
70 email messages
3 students provided 20-30 email messages each.
Messages were randomly chosen by their authors and covered a variety of topics.
153 newsgroup messages
3 popular USENET newsgroups related to software trading were selected.
misc.forsale.computers.other.software
misc.forsale.computers.pc-specific.software
misc.forsale.computers.mac-specific.software
9 users who frequently posted messages in the 3 newsgroups were chosen.
Messages posted by these users were manually checked, with help from domain experts, to determine whether they were illegal (i.e., involving sales of pirated software).
For each user, 10-30 messages containing illegal content were manually downloaded.
Experiment – Feature Extraction : Experiment – Feature Extraction Previous research suggested that style markers and structural features are good indicators of an author’s style [3].
Three types of message text features were used in this experiment to determine the authorship:
Style Markers (205 features)
average sentence length, total number of characters, total number of punctuations, etc.
Structural Features (11 features)
has a greeting, has a salutation, position of reply text, number of attachments, etc.
Content-specific Features (9 features, for newsgroup messages only)
has a list of products, position of price (in subject, in body, in list), etc.
Style markers were extracted automatically using programs.
Structural and content-specific features were extracted manually.
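Automatic extraction of style markers can be sketched for the three markers named above. This is a minimal illustration covering only those three of the 205 features; the function name, the tokenization rules, and measuring sentence length in words are assumptions, since the slides do not define them.

```python
import re
import string

def style_markers(message):
    """Compute a few simple style markers for one message."""
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    words = re.findall(r"[A-Za-z']+", message)
    return {
        "total_characters": len(message),
        "total_punctuation": sum(ch in string.punctuation for ch in message),
        # Sentence length is measured in words here (an assumption).
        "average_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

markers = style_markers("Hello there. How are you?")
```

Each message then becomes a numeric feature vector that a classifier can consume.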
Experiment – Classification Results : Experiment – Classification Results A Support Vector Machine classifier [4] was used to predict the authorship of the messages based on the extracted features.
10-fold cross validation method was used.
Improvement in accuracy was observed with different combinations of message features.
[4] C.-W. Hsu and C.-J. Lin. “A comparison on methods for multi-class support vector machines,” IEEE Transactions on Neural Networks, 13, pages 415-425, 2002.
Implication : Implication A list of most active cyber criminals can be compiled based on the number and severity of illegal messages they post.
Law enforcement agencies can assign resources accordingly to target those criminals on the top of the list.
The remaining challenge is how to validate the results of such an experiment, or any cyber crime research, so they can be used as grounds for prosecution. A comprehensive validation method is necessary before research findings could be presented as evidence in court.
For project information: http://ai.bpa.arizona.edu/COPLINK hchen@eller.arizona.edu : For project information: http://ai.bpa.arizona.edu/COPLINK hchen@eller.arizona.edu