Ch1-04

Presentation Transcript

Data Mining: Concepts and Techniques (Slides for Textbook, Chapter 1):

Data Mining: Concepts and Techniques
Slides for Textbook, Chapter 1
©Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj

Chapter 1 Introduction:

Chapter 1 Introduction
Data
What is data mining?

1.1 Data:

1.1 Data
Numbers
Curves, figures
Sounds
Papers, books
Web, telephone process

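The Challenge of the New Era - Data Explosion:

The Challenge of the New Era - Data Explosion
Banking, credit cards
Telephone
Genes (human and other organisms)
Astronomy (all kinds of celestial bodies)
Web pages
Chinese medicine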

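Data:

Data
Data are not necessarily numbers. They can be:
Multimedia
Geographic data
Spectral and chromatographic fingerprint profiles
Gene microarrays, fingerprint profiles
Microscopic identification of Chinese medicines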

Mining Multimedia Databases:

Mining Multimedia Databases
Refining or combining searches:
Search for “blue sky” (top layout grid is blue)
Search for “blue sky and green meadows” (top layout grid is blue and bottom is green)
Search for “airplane in blue sky” (top layout grid is blue and keyword = “airplane”)
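The layout-grid queries above can be sketched in a few lines of Python. This is only an illustration, assuming each image has already been reduced to a dominant color for its top and bottom grid regions plus a set of keywords. The `Image` record, the `search` function, and the sample catalog are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    """Hypothetical descriptor: dominant color per layout region plus keywords."""
    name: str
    top: str
    bottom: str
    keywords: set = field(default_factory=set)


def search(images, top=None, bottom=None, keyword=None):
    """Return names of images that satisfy every constraint that is not None."""
    hits = []
    for img in images:
        if top is not None and img.top != top:
            continue
        if bottom is not None and img.bottom != bottom:
            continue
        if keyword is not None and keyword not in img.keywords:
            continue
        hits.append(img.name)
    return hits


# Made-up catalog, for illustration only.
catalog = [
    Image("beach.jpg", top="blue", bottom="yellow"),
    Image("field.jpg", top="blue", bottom="green"),
    Image("jet.jpg", top="blue", bottom="blue", keywords={"airplane"}),
    Image("forest.jpg", top="green", bottom="green"),
]

blue_sky = search(catalog, top="blue")
blue_sky_green_meadows = search(catalog, top="blue", bottom="green")
airplane_in_blue_sky = search(catalog, top="blue", keyword="airplane")
```

Each added constraint narrows the previous result set, which is what "refining or combining searches" amounts to in this simplified model.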

Slide9:

Traditional Spatial Data Analysis

Slide11:

Get more data/information
HPLC-DAD 3D chromatogram
HPLC chromatogram of nucleoside of Cordyceps sinensis at one wavelength
Hyphenated instrument

Slide12:

Yulin, Guangxi
Zhaoqing, Guangdong
Vietnam
Yunnan

Slide14:

DNA microarray

Slide16:

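Research on the microscopic identification of Chinese medicines and proprietary Chinese medicines commonly used in Hong Kong (RGC)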

Data Analysis:

Data Analysis
Univariate statistics
Multivariate statistics
Data mining

Multivariate Analysis:

Multivariate Analysis
Regression analysis
Principal component analysis
Factor analysis
Structural equation models
Canonical correlation analysis
Discriminant analysis
Cluster analysis

1.2 What Is Data Mining?:

1.2 What Is Data Mining?
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and summarize the data in novel ways that are both understandable and useful to the data owner.
-- David Hand, Heikki Mannila, and Padhraic Smyth; 2001

Slide20:

What Is Data Mining?
Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.
-- Michael J. A. Berry and Gordon S. Linoff; 2000

What Is Data Mining?:

What Is Data Mining?
Knowledge discovery in databases (KDD)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases, etc.

What Is Data Mining?:

What Is Data Mining?
Inside stories
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

What Is Data Mining?:

What Is Data Mining?
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

Data Mining: A KDD Process:

Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
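The stages named on this slide can be read as a chain of functions, each feeding the next. The sketch below is a toy illustration under that assumption. The function names, the record layout, and the "mining" step (a simple frequency count) are all hypothetical stand-ins, not the method of the textbook.

```python
def integrate(*sources):
    """Data integration: merge several sources into one list (the 'warehouse')."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if None not in r.values()]


def select(records, fields):
    """Selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


def mine(records, field):
    """Data mining (toy version): count how often each value of `field` occurs."""
    counts = {}
    for r in records:
        counts[r[field]] = counts.get(r[field], 0) + 1
    return counts


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least `min_count` times."""
    return {k: v for k, v in patterns.items() if v >= min_count}


# Two made-up source databases.
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "milk"}, {"customer": "c4", "item": "bread"}]

warehouse = clean(integrate(store_a, store_b))
task_data = select(warehouse, ["item"])
patterns = mine(task_data, "item")
knowledge = evaluate(patterns, min_count=2)
```

Only the `mine` step corresponds to "data mining" in the narrow sense; the rest of the chain is what makes the slide call it a KDD process.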

Problems:

Problems
Classification
Pattern recognition
Association (correlation)
Description
Visualization
Etc.
AT&T, Ernst & Young, IRS, DGBAS, Credit Card, etc.

Data Mining Methods:

Data Mining Methods
Characterization
Decision tree
Association or affinity grouping
Classification/prediction
Discrimination
Regression
Clustering
Outlier analysis
Description and visualization

Data Mining vs. Statistics:

Data Mining vs. Statistics
Data Mining                                              Statistics
Large amount of data: 1,000,000,000 rows, 3,000 columns  Large amount of data: 10,000 rows, 20 columns
Happenstance data                                        Systematically gathered data
Why sample? We have a large parallel computer            Sample -- we even get error estimates!!
PowerPoint shows                                         Overhead foils
Reasonable price for software: $2,000,000                Reasonable price for software
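The remark that a sample comes with error estimates can be made concrete. The sketch below assumes a synthetic population (normally distributed values, chosen only for illustration), draws a simple random sample of 10,000 rows, and reports the sample mean together with its standard error. None of the numbers come from the slides except the sample size.

```python
import math
import random
import statistics

random.seed(0)

# Synthetic stand-in for a large table column (assumption, not real data).
population = [random.gauss(50.0, 10.0) for _ in range(200_000)]

# Simple random sample without replacement.
sample = random.sample(population, 10_000)

sample_mean = statistics.fmean(sample)
# Standard error of the mean: sample standard deviation / sqrt(sample size).
std_error = statistics.stdev(sample) / math.sqrt(len(sample))

# Computed here only to check the estimate; normally it would be unknown.
population_mean = statistics.fmean(population)

print(f"sample mean = {sample_mean:.3f} +/- {std_error:.3f}")
print(f"population mean = {population_mean:.3f}")
```

The standard error is the "error estimate" a sample provides: scanning every row gives the population mean directly, but the sample reaches a comparable answer from a small fraction of the data.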

Applications of Data Mining:

Applications of Data Mining
Bank, credit card
Marketing
Web intelligence
Communication (media)
Risk management
Genetics
Chinese medicine
Chemistry

eBusiness:

eBusiness Marketing & Data Mining

Data mining (KDD):

Data mining (KDD)
Clear task
Good data sets
Methods depend on the task

1.3 Complexity:

1.3 Complexity
Descriptor     Data Set Size in Bytes   Storage Mode
Tiny           10^2                     Piece of Paper
Small          10^4                     A Few Pieces of Paper
Medium         10^6                     A Floppy Disk
Large          10^8                     Hard Disk
Huge           10^10                    Multiple Hard Disks
Massive        10^12                    Robotic Magnetic Tape Storage Silos
Supermassive   10^15                    Distributed Data Archives

1.3 Complexity:

1.3 Complexity
Algorithmic Complexity
O(n^(1/2))    Plot a scatterplot
O(n)          Calculate means, variances, kernel density estimates
O(n log(n))   Calculate fast Fourier transforms
O(n c)        Calculate singular value decomposition of an r x c matrix; solve a multiple linear regression
O(n^2)        Solve most clustering algorithms
O(a^n)        Detect multivariate outliers
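To see why these classes matter, one can estimate run times for the data set sizes in the Tiny-to-Massive taxonomy. The sketch below is a rough illustration under stated assumptions: n is read as a count of observations, the machine performs 10^9 operations per second, c = 10 columns, and a = 2. These values are hypothetical and do not come from the slides. Everything is kept in log10 so that the exponential row does not overflow.

```python
import math

# Assumptions for illustration only (not from the slides).
C = 10                 # number of columns c in the O(n c) row
A = 2                  # base a in the O(a^n) row
OPS_PER_SECOND = 1e9   # hypothetical machine speed

# Each entry returns log10(number of operations) for n observations.
LOG10_OPS = {
    "O(n^(1/2))": lambda n: 0.5 * math.log10(n),
    "O(n)": lambda n: math.log10(n),
    "O(n log n)": lambda n: math.log10(n * math.log2(n)),
    "O(n c)": lambda n: math.log10(n * C),
    "O(n^2)": lambda n: 2 * math.log10(n),
    "O(a^n)": lambda n: n * math.log10(A),
}


def log10_seconds(label, n):
    """log10 of the estimated run time in seconds for n observations."""
    return LOG10_OPS[label](n) - math.log10(OPS_PER_SECOND)


# Sizes from Tiny (10^2) to Massive (10^12), read here as item counts.
for n in (10**2, 10**4, 10**6, 10**8, 10**10, 10**12):
    row = ", ".join(
        f"{label}: 1e{log10_seconds(label, n):.1f} s" for label in LOG10_OPS
    )
    print(f"n = 10^{round(math.log10(n))}: {row}")
```

Under these assumptions an O(n^2) method on a Medium data set (n = 10^6) already needs about 10^3 seconds, the same as an O(n) pass over a Massive one (n = 10^12).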

Statistical Data Mining:

Statistical Data Mining
We need statistical methodologies/algorithms that are computable (under the constraints of computer memory and complexity).
So all statistical methodologies need to be labeled with their complexity.
For powerful O(n^2) methodologies, an approximate O(n) algorithm is needed.
Sampling is recommended.
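As one possible illustration of replacing an O(n^2) computation with a sampled approximation, the sketch below compares the exact mean pairwise distance of a data set with an estimate from randomly drawn pairs. The choice of statistic, the function names, and the uniform test data are assumptions made for this example and are not taken from the slides.

```python
import random


def mean_pairwise_distance_exact(xs):
    """O(n^2): average |xi - xj| over all unordered pairs."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(xs[i] - xs[j])
    return total / (n * (n - 1) / 2)


def mean_pairwise_distance_sampled(xs, n_pairs, rng):
    """Approximation: average |xi - xj| over n_pairs randomly drawn pairs."""
    total = 0.0
    for _ in range(n_pairs):
        a, b = rng.sample(xs, 2)  # two values from distinct positions
        total += abs(a - b)
    return total / n_pairs


rng = random.Random(0)
data = [rng.uniform(0.0, 1.0) for _ in range(2000)]

exact = mean_pairwise_distance_exact(data)                 # about 2 million pairs
approx = mean_pairwise_distance_sampled(data, 2000, rng)   # 2000 pairs

print(f"exact = {exact:.4f}, sampled = {approx:.4f}")
```

With the number of sampled pairs proportional to n, the cost grows linearly instead of quadratically, at the price of a sampling error that shrinks as more pairs are drawn.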

1.4 Data Mining Example:

1.4 Data Mining Example
The sport of choice for the urban poor is BASKETBALL.
The sport of choice for maintenance level employees is BOWLING.
The sport of choice for front-line workers is FOOTBALL.
The sport of choice for supervisors is BASEBALL.
The sport of choice for middle management is TENNIS.
The sport of choice for corporate officers is GOLF.

CONCLUSION:

CONCLUSION The higher you are in the corporate structure, the smaller your balls become.

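A Mainland Love Letter:

A Mainland Love Letter
Dear Qi:
Under the warm care of the organization and the attention of our leaders, our relationship has developed vigorously along a healthy path over the past year. This is mainly reflected in the following:
(1) We exchanged 121 letters in total, an average of one every 3.01 days. Of these, you wrote 51 to me (42.1%) and I wrote 70 to you (57.9%). The letters averaged 1,502 characters each; the longest ran to 5,215 characters and the shortest to 624.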

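A Mainland Love Letter:

A Mainland Love Letter
(2) We went on 98 dates in total, an average of one every 3.7 days. You took the initiative to ask me out 38 times (38.7%), and I asked you out 60 times (61.3%). Each date lasted 3.8 hours on average; the longest was 6.4 hours and the shortest 1.6 hours.
(3) I visited your parents at your home 38 times, an average of once every 9.4 days; you visited my parents at my home 36 times, an average of once every 10 days.
All of this fully demonstrates that through a year of contact we have reached a consensus on being in love. The mainstream of our love is mutual understanding, mutual care, and mutual help, on the basis of equality and mutual benefit.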

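A Mainland Love Letter:

A Mainland Love Letter
Of course, everything has two sides, and shortcomings are unavoidable. Although we are both enthusiastic, the figures above show that development is still somewhat uneven and that a certain gap in enthusiasm remains. This is a shortcoming on the road forward. I am confident that in the new year we will build on our achievements, overcome our shortcomings, advance hand in hand, and open up a new chapter in our love.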

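A Mainland Love Letter:

A Mainland Love Letter
I therefore offer three suggestions for your consideration:
(1) Center everything on the word "love".
(2) Firmly grasp the word "closeness".
(3) Put the word "union" into practice.
Let us carry forward the spirit of unity and hard struggle, jointly revitalize our love, strive to reach new heights, and climb to a new level. In the spirit of "our marriage is ours to arrange, and we arrange it well for ourselves", let us create glory together!
Yours, Xiaohui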

Lincoln & Kennedy:

Lincoln & Kennedy The incidence of coincidence is so previewed, that it cannot be considered coincidence. Abraham Lincoln was elected to Congress in 1846. John F. Kennedy was elected to Congress in 1946. Abraham Lincoln was elected President in 1860. John F. Kennedy was elected President in 1960. The names Lincoln and Kennedy each contain seven letters.

Lincoln & Kennedy:

Lincoln & Kennedy Both were particularly concerned with civil rights. Both wives lost their children while living in the White House. Both Presidents were shot on a Friday. Both Presidents were shot in the head. Both were shot in the presence of their wives. The secretary of each President warned him not to go, to the theatre and to Dallas, respectively.

Lincoln & Kennedy:

Lincoln & Kennedy Lincoln's secretary was named Kennedy. Kennedy's secretary was named Lincoln. Both were assassinated by Southerners. Both were succeeded by Southerners. Both successors were named Johnson. Andrew Johnson, who succeeded Lincoln, was born in 1808. Lyndon Johnson, who succeeded Kennedy, was born in 1908.

Lincoln & Kennedy:

Lincoln & Kennedy John Wilkes Booth, who assassinated Lincoln, was born in 1839. Lee Harvey Oswald, who assassinated Kennedy, was born in 1939. Both assassins were known by their three names. Both names are comprised of fifteen letters. Lincoln was shot at the theatre named 'Kennedy.' Kennedy was shot in a car called 'Lincoln.'

Lincoln & Kennedy:

Lincoln & Kennedy Booth ran from the theatre and was caught in a warehouse. Oswald ran from a warehouse and was caught in a theatre. Booth and Oswald were assassinated before their trials. And here's the kicker... A week before Lincoln was shot, he was in Monroe, Maryland. A week before Kennedy was shot, he was in Monroe, Marilyn.
