Spatial Data Mining: Yang Yubin
Joint Laboratory for Geoinformation Science
The Chinese University of Hong Kong
yangyubin@cuhk.edu.hk
Agenda: Motivation and General Description
Data Mining: Basic Concepts
Data Mining Techniques
Spatial Data Mining
Spatial Data Mining Scenarios in Meteorology and Weather Forecasting
Conclusions
Questions & Discussions
Why do we need Data Mining?: Large number of records (cases) (10^8-10^12 bytes)
One thousand (10^3) bytes = 1 kilobyte (KB)
One million (10^6) bytes = 1 megabyte (MB)
One billion (10^9) bytes = 1 gigabyte (GB)
One trillion (10^12) bytes = 1 terabyte (TB)
High dimensional data (variables)
10-10^4 attributes
Only a small portion, typically 5% to 10%, of the collected data is ever analyzed
We are drowning in data, but starving for knowledge!
Scientific Viewpoint: Data collected and stored at enormous speeds (Gbyte/hour)
remote sensor on a satellite
telescope scanning the skies
scientific simulations generating terabytes of data
Classical modeling techniques are infeasible
Data reduction
Cataloging, classifying, segmenting data
Helps scientists in Hypothesis Formation
Current Situations (1): Great efforts for construction and maintenance of large information databases
Data cannot be analyzed by standard statistical methods
numerous missing records
data are qualitative rather than quantitative
We do not always know what information might be represented or how relevant it might be to the questions
Current Situations (2): The ways and means for using all this data lag far behind the increase in available data
Information is often:
found only by coincidence (internet)
not explicitly available (company databases)
accessible to the human eye only with a lot of processing power (astronomical, meteorological and earth observation data)
This leads to a clear demand for means of uncovering the information and knowledge hidden in the massive quantities of data
What is Data Mining?: Data mining is concerned with solving problems by analyzing existing data
“Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from huge amount of data”
Alternative Names: Knowledge Discovery in Databases (KDD)
A term that originated in the Artificial Intelligence (AI) field
KDD consists of several steps (one of which is Data Mining)
Data Mining vs. KDD: Knowledge Discovery in Databases (KDD): The whole process of finding useful information and patterns in data
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process
Data mining is the core of the knowledge discovery process
KDD Process: Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner
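The five KDD steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the record layout (station, temp, unit), the two sources, and the per-station mean standing in for a real mining algorithm are all assumptions, not part of the original slides.

```python
from statistics import mean

def select(sources):
    """Selection: gather records from several sources into one list."""
    return [rec for src in sources for rec in src]

def preprocess(records):
    """Preprocessing: cleanse data by dropping records with missing values."""
    return [r for r in records if None not in r.values()]

def transform(records):
    """Transformation: convert every temperature to a common unit (Celsius)."""
    out = []
    for r in records:
        temp = r["temp"]
        if r["unit"] == "F":
            temp = (temp - 32) * 5 / 9
        out.append({"station": r["station"], "temp_c": temp})
    return out

def mine(records):
    """Data mining: a per-station mean, standing in for a real algorithm."""
    by_station = {}
    for r in records:
        by_station.setdefault(r["station"], []).append(r["temp_c"])
    return {s: mean(v) for s, v in by_station.items()}

def interpret(patterns):
    """Interpretation/evaluation: present results in a readable form."""
    return [f"{s}: mean {t:.1f} C" for s, t in sorted(patterns.items())]

# Hypothetical input: two sources with different units and one missing value.
source_a = [{"station": "A", "temp": 20.0, "unit": "C"},
            {"station": "A", "temp": None, "unit": "C"}]
source_b = [{"station": "A", "temp": 71.6, "unit": "F"},
            {"station": "B", "temp": 50.0, "unit": "F"}]

report = interpret(mine(transform(preprocess(select([source_a, source_b])))))
print(report)
```

Data mining proper is only the fourth function in the chain; the other four steps surround it.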
Data Mining: A KDD Process: Data mining: core of knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Typical Data Mining Architecture: [Figure: layered architecture, bottom to top: Databases and Data Warehouse (data cleaning & data integration, filtering); Database or data warehouse server; Data mining engine; Pattern evaluation; Graphical user interface; plus a Knowledge-base]
Data Mining: Confluence of Multiple Disciplines: [Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Information Theory, Algorithms, and other disciplines]
Data Mining is: A “hot” word for a class of techniques that find patterns in data
A user-centric, interactive process which leverages analysis technologies and computing power
A group of techniques that find relationships that have not previously been discovered
Not reliant on an existing database
A relatively easy task that requires knowledge of the business problem/subject matter expertise
Experts and clients are needed to: Define and redefine problems
Determine relevant aspects of the problem
Supply the data
Remove errors from the data
Provide constraints on possible patterns
Interpret patterns and possibly reject implausible ones
Evaluate predicted effects…
Primary Data Mining Tasks (1): Descriptive Modeling
Finding a compact description for a large dataset [Concept Description]
Clustering people or things into groups based on their attributes [Clustering]
Associating what events are likely to occur together [Association Rule]
Sequencing what events are likely to lead to later events [Sequential Pattern Analysis]
Discovering the most significant changes [Deviation Detection]
Primary Data Mining Tasks (2): Predictive Modeling
Classifying people or things into groups by recognizing patterns [Classification]
Forecasting what may happen in the future by mapping a data item to a real-valued prediction variable [Regression]
Concept Description: Characterization: provides a concise and succinct summarization of the given collection of data
Discrimination: provides descriptions comparing two or more collections of data
can handle complex data types of the attributes
a more automated process
Slide21: Concept description: Characterization [Figure: an Initial Relation generalized into a Generalized Relation]
Clustering: Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Clustering
Grouping a set of data objects into clusters based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
Example
Land use: Identification of areas of similar land use in an earth observation database
City-planning: Identifying groups of houses according to their house type, value, and geographical location
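As a rough illustration of the clustering principle above, here is a minimal k-means sketch applied to the city-planning example. The slides do not name a specific algorithm at this point, so k-means is an assumption, as are the house records (planar position in km plus value) and the choice of the first k points as initial centroids.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical houses: (x in km, y in km, value in units of 100k).
houses = [(1.0, 1.0, 2.0), (1.2, 0.9, 2.1), (0.8, 1.1, 1.9),
          (8.0, 8.0, 6.0), (8.2, 7.9, 6.2), (7.9, 8.1, 5.9)]
labels, centroids = kmeans(houses, k=2)
print(labels)
```

In practice the attributes would need to be scaled to comparable ranges before distances are computed; the toy values here are already on similar scales.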
Association rule: Association (correlation and causality)
age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
Association rule mining
Finding frequent patterns, associations, correlations among sets of items or objects in transaction databases, relational databases, and other information repositories
Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database
Motivation: finding regularities in data
What products were often purchased together?
Example: Association rule: Itemsets A1, A2 drawn from the items {a1, …, ak}
Find all the rules A1 → A2 with min confidence and support
support, s, probability that a transaction contains A1 ∪ A2
confidence, c, conditional probability that a transaction having A1 also contains A2
Let min_support = 50%, min_conf = 50%:
a1 → a3 (50%, 66.7%)
a3 → a1 (50%, 100%)
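The support and confidence definitions can be checked with a few lines of code. The transaction table from the original slide did not survive extraction, so the four transactions below are hypothetical, chosen only so that they yield the same figures (50%, 66.7%) and (50%, 100%).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical transactions (assumption, not the original slide's table).
transactions = [
    {"a1", "a2", "a3"},
    {"a1", "a3"},
    {"a1", "a4"},
    {"a2", "a5", "a6"},
]

s = support({"a1", "a3"}, transactions)
c_13 = confidence({"a1"}, {"a3"}, transactions)
c_31 = confidence({"a3"}, {"a1"}, transactions)
print(f"a1 -> a3: support {s:.1%}, confidence {c_13:.1%}")
print(f"a3 -> a1: support {s:.1%}, confidence {c_31:.1%}")
```

Both rules share the same support because support depends only on the union of the two itemsets; confidence differs because the conditioning itemset differs.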
Sequential Pattern Analysis: Given a set of sequences, find the complete set of frequent subsequences
Applications of sequential pattern
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital camera, within 3 months.
Weblog click streams
Telephone calling patterns
Given support threshold min_sup = 2, a subsequence contained in at least 2 sequences is a sequential pattern
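A small sketch of how the support of a candidate subsequence might be counted against min_sup. The shopping sequences are hypothetical, and each sequence is simplified to a list of single items rather than itemsets.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def seq_support(pattern, database):
    """Number of sequences in the database that contain pattern as a subsequence."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))

# Hypothetical customer shopping sequences.
database = [
    ["computer", "CD-ROM", "digital camera"],
    ["computer", "printer", "CD-ROM"],
    ["CD-ROM", "computer"],
]
min_sup = 2
pattern = ["computer", "CD-ROM"]
is_frequent = seq_support(pattern, database) >= min_sup
print(seq_support(pattern, database), is_frequent)
```

Order matters: the third customer bought the same two items in the opposite order and therefore does not support the pattern.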
Deviation Detection: Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Periodicity analysis
Similarity-based analysis
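One simple way to flag objects that do not comply with the general behavior of the data is a z-score test. This is a sketch of one common technique, not a method named in the slides; the readings and the threshold of 2 standard deviations are assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one obvious exception.
readings = [10, 11, 9, 10, 12, 50]
outliers = zscore_outliers(readings)
print(outliers)
```

Because the outlier itself inflates the mean and standard deviation, robust variants (for example, based on the median) are often preferred on real data.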
Classification and Regression: Classification:
constructs a model (classifier) based on the training set and uses it in classifying new data
Example: Climate Classification,…
Regression:
models continuous-valued functions, i.e., predicts unknown or missing values
Example: stock trends prediction,…
Classification (1): Model Construction: Classification algorithms learn a classifier (model) from the training data, for example:
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classification (2): Prediction Using the Model: (Jeff, Professor, 4) Tenured?
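Applying the learned rule to the unseen record can be written out directly. The function name and the lower-casing of the rank are illustrative choices; the rule itself is the one from the model construction slide.

```python
def tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes' (otherwise 'no')."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# The unseen record from the slide: (Jeff, Professor, 4).
prediction = tenured("Professor", 4)
print(prediction)
```

Jeff satisfies the rank condition, so the model predicts tenure even though his 4 years fall below the years threshold.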
Classification Techniques: Decision Tree Induction
Bayesian Classification
Neural Networks
Genetic Algorithms
Fuzzy Set and Logic
Regression: Regression is similar to classification
First, construct a model
Second, use model to predict unknown value
Methods
Linear and multiple regression
Non-linear regression
Regression is different from classification
Classification predicts a categorical class label
Regression models continuous-valued functions
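The two-step pattern (construct a model, then use it to predict an unknown value) can be sketched with ordinary least-squares fitting of a straight line. The data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Step 1: construct the model from known (x, y) pairs (hypothetical data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(xs, ys)

# Step 2: use the model to predict an unknown, continuous value.
predicted = slope * 5.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

Unlike the classification rule, the output here is a real number rather than a class label.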
Are All the “Discovered” Patterns Interesting?: A data mining task may generate thousands of patterns; not all of them are interesting.
Interestingness measures:
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. Subjective interestingness measures:
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Spatial Data Mining: Spatial Patterns
Spatial outliers
Location prediction
Associations, co-locations
Hotspots, Clustering, trends, …
Primary Tasks
Mining Spatial Association Rules
Spatial Classification and Prediction
Spatial Data Clustering Analysis
Spatial Outlier Analysis
Example: Unusual warming of the Pacific Ocean (El Niño) affects weather in the USA…
Spatial Data Mining Results: Understanding spatial data, discovering relationships between spatial and nonspatial data, construction of spatial knowledge bases, etc.
In various forms
The description of the general weather patterns in a set of geographic regions is a spatial characteristic rule.
The comparison of two weather patterns in two geographic regions is a spatial discriminant rule.
A rule like “most cities in Canada are close to the Canada-US border” is a spatial association rule
near(x, coast) ^ southeast(x, USA) → hurricane(x), (70%)
Others: spatial clusters,…
What is Spatial Data?: The data related to objects that occupy space
traffic, bird habitats, global climate, logistics, ...
Object types:
Points, Lines, Polygons, etc.
Used in/for:
GIS - Geographic Information Systems
Meteorology
Astronomy
Environmental studies, etc.
Basic Concepts (1): Spatial data mining performs the same functions as general data mining, with the end objective of finding patterns in geography, meteorology, etc.
The main difference: spatial autocorrelation
the neighbors of a spatial object may have an influence on it and therefore have to be considered as well
Spatial attributes
Topological
adjacency or inclusion information
Geometric
position (longitude/latitude), area, perimeter, boundary polygon
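Spatial autocorrelation can be quantified. One standard statistic is Moran's I, which the slides do not mention by name; the sketch below uses binary neighbor weights on a hypothetical chain of four adjacent locations.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    neighbors maps each location index to the list of its neighboring indices.
    """
    n = len(values)
    mu = sum(values) / n
    dev = [v - mu for v in values]
    w_total = sum(len(nb) for nb in neighbors.values())
    cross = sum(dev[i] * dev[j] for i, nb in neighbors.items() for j in nb)
    return (n / w_total) * cross / sum(d * d for d in dev)

# Four locations in a row; each is adjacent to the next (topological adjacency).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

clustered = morans_i([1, 1, 5, 5], chain)    # similar values sit next to each other
alternating = morans_i([1, 5, 1, 5], chain)  # neighbors always differ
print(round(clustered, 3), round(alternating, 3))
```

A positive value indicates that neighboring objects tend to have similar values, which is why neighbors have to be considered in spatial mining; a negative value indicates that neighbors tend to differ.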
Basic Concepts (2): Spatial neighborhood
Topological relation
“intersect”, “overlap”, “disjoint”, …
distance relation
“close_to”, “far_away”,…
direction/orientation relation
“left_of”, “west_of”,…
Global model might be inconsistent with regional models [Figure: Global Model vs. Local Model]
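The distance and direction relations that define a spatial neighborhood can be expressed as simple predicates. This is a sketch under simplifying assumptions: sites are points with planar coordinates in km, and the 50 km threshold for "close_to" is arbitrary.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Site:
    name: str
    x: float  # easting in km (hypothetical planar coordinates)
    y: float  # northing in km

def close_to(a, b, max_km=50.0):
    """Distance relation: true if the two sites are within max_km of each other."""
    return math.hypot(a.x - b.x, a.y - b.y) <= max_km

def west_of(a, b):
    """Direction relation: true if a lies west of b."""
    return a.x < b.x

def neighborhood(target, sites, max_km=50.0):
    """Names of all other sites that satisfy close_to with respect to target."""
    return [s.name for s in sites if s != target and close_to(target, s, max_km)]

sites = [Site("A", 0, 0), Site("B", 30, 40), Site("C", 200, 10)]
near_a = neighborhood(sites[0], sites)
print(near_a, west_of(sites[0], sites[2]))
```

Topological relations such as "intersect" or "overlap" need polygon geometry and are usually delegated to a GIS library rather than written by hand.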
Applications: NASA Earth Observing System (EOS): Earth science data
National Inst. of Justice: crime mapping
Census Bureau, Dept. of Commerce: census data
Dept. of Transportation (DOT): traffic data
National Inst. of Health (NIH): cancer clusters
……
Example: What Kind of Houses Are Highly Valued? (Associative Classification)
Meteorological Data Mining: Motivation
Many analysis methods must be applied to fast-growing data for climate studies
Result
Appropriate presentation instruments (graphs, maps, reports, etc) must be applied
Examples
Spatial outliers can be associated with disastrous natural events such as tornadoes, hurricanes, and forest fires
Associations between disaster events and certain meteorological observations
Case Studies (1): Astronomy: SKICAT (SKy Image Cataloging and Analysis Tool) (Caltech, US)
The Palomar Observatory discovered 22 quasars with the help of data mining
the Second Palomar Observatory Sky Survey (POSS-II)
decision tree methods
classification of galaxies, stars and other stellar objects
About 3 TB of sky images were analyzed
Case Studies (2): NCAR & UCAR: National Center for Atmospheric Research (NCAR) & University Corporation for Atmospheric Research (UCAR), US
http://www.ucar.edu/
“Automatic Fuzzy Logic-based systems now compete with human forecasts”
Richard Wagoner, Deputy Director at Research Applications Program (RAP), NCAR
Intelligent Weather System (IWS)
Detection and forecast in the areas of en-route turbulence, en-route icing, ceiling/visibility, and convective hazards in the aviation community
Road winter maintenance, airport operations, and flash flood forecasting
Operational Application: Prediction System: WIND-2
WIND: “Weather Is Not Discrete”
Consists of three parts:
Data
Past airport weather observations, 30 years of hourly observations, time series of 300,000 detailed observations
Recent and current observations (METARs)
Model based guidance (knowledge of near-term changes, e.g., imminent wind-shift, onset/cessation of precipitation)
Fuzzy similarity-measuring algorithm
Prediction composition: predictions based on k nearest neighbors (k-NN, an instance-based method)
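The idea of composing a prediction from the k most similar past cases can be sketched as follows. This is not the WIND-2 algorithm: plain Euclidean distance stands in for the fuzzy similarity measure, the prediction is a simple average, and the observation variables and values are hypothetical.

```python
import math

def knn_forecast(current, history, k=3):
    """Average the outcomes of the k past cases most similar to the current one.

    history is a list of (observation_vector, outcome) pairs.
    """
    ranked = sorted(history, key=lambda rec: math.dist(current, rec[0]))
    nearest = ranked[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)

# Hypothetical past cases:
# (temperature C, dew point C, wind speed kt) -> visibility in km one hour later.
history = [
    ((10.0, 9.5, 3.0), 1.0),
    ((10.5, 9.8, 2.0), 2.0),
    ((11.0, 10.0, 4.0), 3.0),
    ((20.0, 5.0, 15.0), 10.0),
    ((22.0, 4.0, 12.0), 10.0),
]
forecast = knn_forecast((10.2, 9.6, 3.0), history, k=3)
print(forecast)
```

The three humid, calm cases are the nearest analogs, so the forecast is the mean of their outcomes; the two dry, windy cases are ignored.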
Operational Application: Hybrid methods are used to predict weather
Dynamical approach - based upon equations of the atmosphere, uses finite element techniques
Empirical approach - similar weather situations lead to similar outcomes
WIND runs in real-time for meteorologically different sites
Data-mining/forecast process takes about one second
Case Studies (3): CrossGrid (EU): Objective
To develop, implement and exploit new Grid components for interactive compute- and data-intensive applications such as decision support systems for flooding crisis teams and air pollution modeling combined with weather forecasting
Main tasks in Meteorological applications package
Data mining for atmospheric circulation patterns
Find a set of representative prototypes of the atmospheric patterns in a region of interest
Weather forecasting for maritime applications
Ocean wave forecasting by models of various complexity
Slide49: Data
ERA-15 using a T106L31 model (from 1978 to 1994) with 1.125° resolution
Terabytes
Comprises data from approx. 20 variables (such as temperature, humidity, pressure, etc.) at 30 pressure levels of a 360x360 node grid
Slide50: Dept. of Applied Mathematics
Universidad de Cantabria
Santander, Spain
Case Studies (4): Typhoon Image Data Mining: Objective
To establish algorithms and database models for the discovery of information and knowledge useful for typhoon analysis and prediction
Content-based image retrieval technology to search for similar cloud patterns in the past
Data mining technology to extract spatio-temporal pattern information that is meaningful from a meteorological viewpoint
Result
Alignment of Multiple Typhoons, Explore by Projection to 2D Plane, Diurnal Analysis
Methods: Archive of approximately 34,000 typhoon images for the northern and southern hemispheres
Various data mining approaches
Principal component analysis (PCA), K-means clustering, self-organizing map (SOM), wavelet transform
Retrieval of historical similar patterns from image databases to perform instance-based typhoon analysis and prediction
Extracting the eigenvectors of the whole typhoon image collection
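Extracting the leading eigenvector of an image collection is the core of PCA. The sketch below uses power iteration on the covariance matrix of tiny, hypothetical two-pixel "images"; a real typhoon archive would use a numerical library and far larger vectors, and the start vector chosen here is an assumption that fails if it happens to be orthogonal to the leading eigenvector.

```python
import math

def first_principal_component(rows, iterations=50):
    """Leading eigenvector of the sample covariance matrix, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] + [0.0] * (d - 1)  # arbitrary start vector (assumption)
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical flattened images with two pixels each, varying along one direction.
images = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
pc1 = first_principal_component(images)
print([round(x, 4) for x in pc1])
```

Projecting each image onto the first few such eigenvectors gives a low-dimensional representation in which similar cloud patterns lie close together.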
Case Studies (5): LEAD: Linked Environments for Atmospheric Discovery
To accommodate the real time, on-demand, and dynamically-adaptive nature of mesoscale problems
Complexities: vastly disparate, high volume and bandwidth data
Tremendous computational demands
Used in accessing, preparing, assimilating, predicting, managing, mining/analyzing, and displaying a broad array of meteorological and related information
Data Mining Solution Center: ITSC, The Univ. of Alabama in Huntsville, US
http://datamining.itsc.uah.edu/index.jsp
ADaM: Algorithm Development and Mining
Component architecture data mining toolkit
For geophysical phenomena detection and feature extraction
Applications
Detecting tropical cyclones and estimating their maximum sustained wind speed
Mesocyclone Identification from RADAR
Detecting Cumulus Cloud Fields in GOES Images
ADaM (cont’d): Mesoscale Convective Systems Detection
EOS Special Sensor Microwave/Imager (SSM/I) Brightness Temperature Swaths from DMSP F13 and F14
Rain Detection Using SSM/I
Lightning Detection Using OLS
Rain Accumulation Study
Case Studies (6): Rainfall Classification (University of Oklahoma, Norman): To classify significant and interesting features within a two-dimensional spatial field of meteorological data
Observed or predicted rainfall
Data source
Estimates of hourly accumulated rainfall
Using radar and rain gauge data
“Attributes” for classification
Statistical parameters representing the distribution of rainfall amounts across the region
Classification Method
Hierarchical cluster analysis
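Hierarchical cluster analysis over statistical attributes can be sketched with a small agglomerative routine. The slides do not state the linkage rule or the exact attributes, so single linkage and the (mean, standard deviation) pairs below are assumptions.

```python
import math

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two closest clusters and merge them.
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical attributes per rainfall field: (mean mm/h, standard deviation mm/h).
fields = [(0.5, 0.2), (0.6, 0.3), (5.0, 2.0), (5.5, 2.2), (12.0, 6.0)]
clusters = agglomerative(fields, n_clusters=3)
print(clusters)
```

Stopping the merging at different cluster counts yields the hierarchy: here three groups separate light, moderate, and heavy rainfall fields.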
Many Others…: JARtool Project (Fayyad et al., NASA)
Identifying volcanoes on the surface of Venus from images transmitted by the Magellan spacecraft
More than 30,000 high resolution Synthetic Aperture Radar (SAR) images of the surface of Venus from different angles
The obtained accuracy was about 80%
What can we learn from those scenarios?: Data Mining is a promising approach to meteorological analysis
Very strong interaction between scientists and the knowledge discovery system is necessary
The users define features of the meteorological phenomena based on their expert knowledge
The system extracts the instances of such phenomena
Then, further analysis of phenomena is possible
Conclusions: Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data mining, and other steps
Data Mining can be performed in a variety of information repositories
Data mining Tasks: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Slide62: And now discussion