Presentation Transcript
Multimedia and Graph mining: Multimedia and Graph mining Christos Faloutsos
CMU
CONGRATULATIONS!: CONGRATULATIONS!
Outline: Outline Problem definition / Motivation
Biological image mining
Graphs and power laws
Streams and forecasting
Conclusions
Motivation: Motivation Data mining: ~ find patterns (rules, outliers)
How do detached cat retinas evolve?
How do real graphs look like?
How do (numerical) streams look like?
ViVo: cat retina mining: ViVo: cat retina mining with Ambuj Singh, Mark Verardo, Vebjorn Ljosa, Arnab Bhattacharya (UCSB)
Jia-Yu Tim Pan, HJ Yang (CMU)
Detachment Development: Detachment Development Normal 1 day after
detachment 3 days after
detachment 7 days after
detachment 28 days after
detachment 3 months after
detachment
Data and Problem: Data and Problem (Problem) What happens in retina after detachment?
What tissues (regions) are involved?
How do they change over time?
How will a program convey this info?
More than classification “we want to learn what classifier learned”
Main idea: Main idea extract characteristic visual ‘words’
Equivalent to characteristic keywords, in a collection of text documents
Visual vocabulary?: Visual vocabulary?
Visual vocabulary?: Visual vocabulary? news:
president,
minister,
economic sports:
baseball,
score,
penalty
Visual Vocabulary (ViVo) generation: Visual Vocabulary (ViVo) generation Step 1: Tile image Step 2: Extract tile features Step 3: ViVo generation Visual vocabulary Feature 1 Feature 2 8x12 tiles
Biological interpretation: Biological interpretation
Which tissue is significant on 7-day?: Which tissue is significant on 7-day?
FEMine: Mining Fly Embryos: FEMine: Mining Fly Embryos
With: With Eric Xing (CMU CS)
Bob Murphy (CMU – Bio)
Tim Pan (CMU -> Google)
Andre Balan (U. Sao Paulo)
Outline: Outline Problem definition / Motivation
Biological image mining
Graphs and power laws
Streams and forecasting
Conclusions
Graphs - why should we care?: Graphs - why should we care?
Graphs - why should we care?: Graphs - why should we care? Internet Map [lumeta.com] Food Web [Martinez ’91] Protein Interactions [genomebiology.com] Friendship Network [Moody ’01]
Joint work with: Joint work with Dr. Deepayan Chakrabarti (CMU/Yahoo R.L.)
Problem: network and graph mining: Problem: network and graph mining How does the Internet look like?
How does the web look like?
What constitutes a ‘normal’ social network?
What is ‘normal’/‘abnormal’?
which patterns/laws hold?
Graph mining: Graph mining Are real graphs random?
Laws and patterns: Laws and patterns NO!!
Diameter
in- and out- degree distributions
other (surprising) patterns
Laws – degree distributions: Laws – degree distributions Q: avg degree is ~3 - what is the most probable degree? degree count ?? 3
Laws – degree distributions: Laws – degree distributions Q: avg degree is ~3 - what is the most probable degree? degree
Solution:: Solution: The plot is linear in log-log scale [FFF’99]
freq = degree (-2.15) O = -2.15 Exponent = slope Outdegree Frequency Nov’97 -2.15
But:: But: Q1: How about graphs from other domains?
Q2: How about temporal evolution?
The Peer-to-Peer Topology: The Peer-to-Peer Topology Frequency versus degree
Number of adjacent peers follows a power-law [Jovanovic+]
More power laws:: More power laws: citation counts: (citeseer.nj.nec.com 6/2001) log(#citations) log(count) Ullman
Slide29: Swedish sex-web Nodes: people (Females; Males)
Links: sexual relationships Liljeros et al. Nature 2001 4781 Swedes; 18-74;
59% response rate. Albert Laszlo Barabasi
http://www.nd.edu/~networks/
Publication%20Categories/
04%20Talks/2005-norway-3hours.ppt
More power laws:: More power laws: web hit counts [w/ A. Montgomery] Web Site Traffic log(in-degree) log(count) Zipf ``ebay’’
epinions.com: epinions.com who-trusts-whom [Richardson + Domingos, KDD 2001] (out) degree count trusts-2000-people user
A famous power law: Zipf’s law: A famous power law: Zipf’s law Bible - rank vs frequency (log-log)
similarly, in many other languages; for customers and sales volume; city populations etc etc
log(rank) log(freq)
Olympic medals (Sidney’00, Athens’04):: Olympic medals (Sidney’00, Athens’04): log( rank) log(#medals)
More power laws: areas – Korcak’s law: More power laws: areas – Korcak’s law Scandinavian lakes area vs complementary cumulative count (log-log axes) log(count( >= area)) log(area)
But:: But: Q1: How about graphs from other domains?
Q2: How about temporal evolution?
Time evolution: Time evolution with Jure Leskovec (CMU)
and Jon Kleinberg (Cornell)
(‘best paper’ KDD05)
Evolution of the Diameter: Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter:
diameter ~ O(log N)
diameter ~ O(log log N)
What is happening in real data?
Evolution of the Diameter: Evolution of the Diameter Prior work on Power Law graphs hints at slowly growing diameter:
diameter ~ O(log N)
diameter ~ O(log log N)
What is happening in real data?
Diameter shrinks over time
As the network grows the distances between nodes slowly decrease
Diameter – ArXiv citation graph: Diameter – ArXiv citation graph Citations among physics papers
1992 –2003
One graph per year time [years] diameter
Diameter – “Autonomous Systems”: Diameter – “Autonomous Systems” Graph of Internet
One graph per day
1997 – 2000 number of nodes diameter
Diameter – “Affiliation Network”: Diameter – “Affiliation Network” Graph of collaborations in physics – authors linked to papers
10 years of data time [years] diameter
Diameter – “Patents”: Diameter – “Patents” Patent citation network
25 years of data
time [years] diameter
Temporal Evolution of the Graphs: Temporal Evolution of the Graphs N(t) … nodes at time t
E(t) … edges at time t
Suppose that
N(t+1) = 2 * N(t)
Q: what is your guess for
E(t+1) =? 2 * E(t)
Temporal Evolution of the Graphs: Temporal Evolution of the Graphs N(t) … nodes at time t
E(t) … edges at time t
Suppose that
N(t+1) = 2 * N(t)
Q: what is your guess for
E(t+1) =? 2 * E(t)
A: over-doubled!
But obeying the ``Densification Power Law’’
Densification – Physics Citations: Densification – Physics Citations Citations among physics papers
2003:
29,555 papers, 352,807 citations
N(t) E(t) 1.69
Densification – Physics Citations: Densification – Physics Citations Citations among physics papers
2003:
29,555 papers, 352,807 citations
N(t) E(t) 1.69 1: tree
Densification – Physics Citations: Densification – Physics Citations Citations among physics papers
2003:
29,555 papers, 352,807 citations
N(t) E(t) 1.69 clique: 2
Densification – Patent Citations: Densification – Patent Citations Citations among patents granted
1999
2.9 million nodes
16.5 million edges
Each year is a datapoint N(t) E(t) 1.66
Densification – Autonomous Systems: Densification – Autonomous Systems Graph of Internet
2000
6,000 nodes
26,000 edges
One graph per day N(t) E(t) 1.18
Densification – Affiliation Network: Densification – Affiliation Network Authors linked to their publications
2002
60,000 nodes
20,000 authors
38,000 papers
133,000 edges N(t) E(t) 1.15
Graphs - Conclusions: Graphs - Conclusions Real graphs obey some surprising patterns
which can help us spot anomalies / outliers
A lot of interest from web searching companies
recommendation systems
link spamming
trust propagation
HUGE graphs (Millions and Billions of nodes)
Outline: Outline Problem definition / Motivation
Biological image mining
Graphs and power laws
Streams and forecasting
Conclusions
Why care about streams?: Why care about streams?
Why care about streams?: Why care about streams? Sensor devices
Temperature, weather measurements
Road traffic data
Geological observations
Patient physiological data
sensor-Andrew project
Embedded devices
Network routers
Co-evolving time sequences: Co-evolving time sequences Joint work with
Jimeng Sun (CMU)
Dr. Spiros Papadimitriou (CMU/IBM)
Dr. Yasushi Sakurai (NTT)
Prof. Jeanne VanBriesen (CMU/CEE)
Motivation: Motivation water distribution network normal operation sensors
near leak sensors
away
from leak
Motivation: Motivation water distribution network normal operation major leak chlorine concentrations sensors
near leak sensors
away
from leak
Motivation: Motivation actual measurements
(n streams) k hidden variable(s) spot: “hidden (latent) variables”
Motivation: Motivation chlorine concentrations Phase 1 Phase 1 Phase 2 Phase 2 actual measurements
(n streams) k hidden variable(s) k = 2 spot: “hidden (latent) variables”
Motivation: Motivation chlorine concentrations Phase 1 Phase 1 Phase 2 Phase 2 Phase 3 Phase 3 actual measurements
(n streams) k hidden variable(s) k = 1 spot: “hidden (latent) variables”
SPIRIT / InteMon: SPIRIT / InteMon http://warsteiner.db.cs.cmu.edu/demo/intemon.jsp
http://localhost:8080/demo/graphs.jsp
self-* storage system (PDL/CMU)
1 PetaByte storage
self-monitoring, self-healing: self-*
with Jimeng Sun (CMU/CS)
Evan Hoke (CMU/CS-ug)
Prof. Greg Ganger (CMU/CS/ECE)
John Strunk (CMU/ECE)
Related project: Related project Anomaly detection in network traffic (Zhang, Xie)
Conclusions: Conclusions Biological images, graphs & streams pose fascinating problems
self-similarity, fractals and power laws work, when other methods fail!
Books: Books Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)
Contact info: Contact info christos@cs.cmu.edu
www.cs.cmu.edu/~christos
Wean Hall 7107
Ph#: x8.1457
and, again WELCOME!