Slide1 : in collaboration with
Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath,
Ralf Schenkel, Fabian Suchanek, Martin Theobald
DB and IR: Two Parallel Universes : DB and IR: Two Parallel Universes canonical
application: accounting libraries data type: numbers,
short strings text foundation: algebraic /
logic based probabilistic /
statistics based search
paradigm: Boolean retrieval
(exact queries,
result sets/bags) ranked retrieval
(vague queries,
result lists) Database Systems Information Retrieval market
leaders: Oracle, IBM DB2,
MS SQL Server, etc. Google, Yahoo!, MSN,
Verity, Fast, etc.
Why DB&IR Now? – Application Needs : Why DB&IR Now? – Application Needs Global health-care management for monitoring epidemics
News archives for journalists, press agencies, etc.
Product catalogs for houses, cars, vacation places, etc.
Customer support & CRM in insurances, telcom, retail, software, etc.
Bulletin boards for social communities
Enterprise search for projects, skills, know-how, etc.
Personalized & collaborative search in digital libraries, Web, etc.
Comprehensive archive of blogs with time-travel search Simplify life for application areas like: Typical data:
Disease (DId, Name, Category, Pathogen …) UMLS-Categories ( … )
Patient (… Age, HId, Date, Report, TreatedDId) Hospital (HId, Address …) Typical query:
symptoms of tropical virus diseases and reported anomalies
with young patients in central Europe in the last two weeks
Why DB&IR Now? – Platform Desiderata : Why DB&IR Now? – Platform Desiderata Structured data (records) Unstructured data (documents) Unstructured
search
(keywords) Structured
search
(SQL,XQuery) DB Systems IR Systems
Search Engines Keyword Search on
Relational Graphs
(IIT Bombay, UCSD, MSR, Hebrew U,
CU Hong Kong, Duke U, ...) Querying entities &
relations from IE
(MSR Beijing, UW Seattle,
IBM Almaden, UIUC, MPI, … )
Why DB&IR Forever? : Why DB&IR Forever? Turn the Web, Web2.0, and Web3.0 into the world‘s
most comprehensive knowledge base („semantic DB“) ! Data enrichment at very large scale
Text and speech are key sources of
knowledge production (publications, patents, conferences, meetings, ...)
Outline : Outline • Past • Future • Present : Matter, Antimatter, and Wormholes : From Data to Knowledge : XML and Graph IR
Parallel Universes: A Closer Look : Parallel Universes: A Closer Look Matter Antimatter user = programmer
query = precise spec.
of info request
interaction via API user = your kids
query = approximation
of user‘s real info needs
interaction process via GUI strength: indexing, QP
weakness: user model strength: ranking model
weakness: interoperability eval. measure: efficiency
(throughput, response time,
TPC-H, XMark, …) eval. measure: effectiveness
(precision, recall, F1, MAP, NDCG,
TREC & INEX benchmarks, …
Slide8 : DB IR 1990 1995 2000 2005 VAGUE
(Motro) Proximal Nodes
(Baeza-Yates et al.) WHIRL
(Cohen) Prob. Datalog
(Fuhr et al.) INEX XPath XPath
Full-Text Prob. DB
(Cavallo&Pittarelli) Prob. Tuples
(Barbara et al.)
WHIRL: IR over Relations [W.W. Cohen: SIGMOD’98] : WHIRL: IR over Relations [W.W. Cohen: SIGMOD’98] Add text-similarity selection and join to relational algebra Example: Select * From Movies M, Reviews R
Where M.Plot ~ ”fight“ And M.Year > 1990 And R.Rating > 3
And M.Title ~ R.Title And M.Plot ~ R.Comment Title Plot … Year Movies Title Comment … Rating Reviews Matrix Hero Matrix 1 Matrix
Reloaded Matrix
Eigenvalues Ying xiong
aka. Hero Shrek 2 … matrix spectrum
… orthonormal … … fight for peace …
… sword fight …
dramatic colors … … In ancient China … fights
… sword fight …
fights Broken Sword … In the near future …
computer hacker Neo …
… fight training … … cool fights …
new techniques … … fights …
and more fights …
… fairly boring … 1999 2002 2004 In Far Far Away …
our lovely hero
fights with cat killer … 4 1 5 5 DB&IR for query-time data integration
More recent work: MinorThird, Spider, DBLife, etc.
But scoring models fairly ad hoc
XXL: Early XML IR : XXL: Early XML IR [Anja Theobald, GW: Adding Relevance toXML, WebDB’00] Which professors
from Saarbruecken (SB)
are teaching IR and have
research projects on XML? Union of heterogeneous sources without global schema Similarity-aware XPath:
//~Professor [//* = ”~SB“]
[//~Course [//* = ”~IR“] ]
[//~Research [//* = ”~XML“] ] Similarity-aware XPath:
//~Professor [//* = ”~SB“]
[//~Course [//* = ”~IR“] ]
[//~Research [//* = ”~XML“] ]
XXL: Early XML IR : XXL: Early XML IR [Anja Theobald, GW: Adding Relevance toXML, WebDB’00] Scoring and ranking:
tf*idf for content condition
ontological similarity for
relaxed tag condition
score aggregation with
probabilistic independence Similarity-aware XPath:
//~Professor [//* = ”~Saarbruecken“]
[//~Course [//* = ”~IR“] ]
[//~Research [//* = ”~XML“] ] Which professors
from Saarbruecken (SB)
are teaching IR and have
research projects on XML? Motivation: Union of heterogeneous sources has no schema
The Past: Lessons Learned : The Past: Lessons Learned DB&IR: added flexible ranking to (semi) structured querying
to cope with schema and instance diversity
but ranking seems „ad hoc“ and
not consistently good in benchmarks
to win benchmark: tuning needed,
but tuning is easier if ranking is principled !
ontologies are mixed blessing:
quality diverse, concept similarity subtle,
danger of topic drift
ontology-based query expansion
(into large disjunctions)
poses efficiency challenge
Outline : Outline Past • Future • Present : Matter, Antimatter, and Wormholes : From Data to Knowledge : XML and Graph IR
TopX: 2nd Generation XML IR : TopX: 2nd Generation XML IR ”Semantic“ XPath Full-Text query:
/Article
[ftcontains(//Person, ”Max Planck“)]
[ftcontains(//Work, ”quantum physics“)]
//Children[@Gender = ”female“]//Birthdates supported by TopX engine: http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/
http://topx.sourceforge.net
Exploit tags & structure for better precision
Can relax tag names & structure for better recall
Principled ranking by probabilistic IR (Okapi BM25 for XML)
Efficient top-k query processing (using improved TA)
Robust ontology integration (self-throttling to avoid topic drift)
Efficient query expansion (on demand, by extended TA)
Relevance feedback for automatic query rewriting [Martin Theobald, Ralf Schenkel, GW: VLDB’05, VLDB Journal]
Commercial Break : Commercial Break [Martin Theobald, Ralf Schenkel, GW: VLDB’95] TopX demo
today 3:30 – 5:30
Principled Ranking by Probabilistic IR : Principled Ranking by Probabilistic IR odds for item d with
terms di being relevant for
query q = {q1, …, qm} binary features, conditional independence of features [Robertson & Sparck-Jones 1976] Now estimate pi and qi values from
relevance feedback,
pseudo-relevance feedback,
corpus statistics
by MLE (with statistical smoothing)
and store precomputed pi, qi in index „God does not play dice.“ (Einstein)
IR does. with related to but different from
statistical language models
Probabilistic Ranking for SQL : Probabilistic Ranking for SQL SQL queries that return many answers need ranking Examples:
Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, …)
Select * From Houses Where View = ”Lake“ And City In (”Redmond“, ”Bellevue“)
Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, …)
Select * From Movies Where Genre = ”Romance“ And Era = ”90s“ odds for tuple d with
attributes XY relevant for
query q: X1=x1 … Xm=xm Estimate prob‘s, exploiting workload W: [S. Chaudhuri, G. Das, V. Hristidis, GW: TODS‘06] Example: frequent queries
… Where Genre = ”Romance“ And Actor1 = ”Hugh Grant“
… Where Actor1 = ”Hugh Grant“ And Actor2 = ”Julia Roberts“
boosts HG and JR movies in ranking for Genre = ”Romance“ And Era = ”90s“
From Tables and Trees to Graphs : From Tables and Trees to Graphs Example:
Conferences (CId, Title, Location, Year) Journals (JId, Title)
CPublications (PId, Title, CId) JPublications (PId, Title, Vol, No, Year)
Authors (PId, Person) Editors (CId, Person)
Select * From * Where * Contains ”Gray, DeWitt, XML, Performance“ And Year > 95 Schema-agnostic keyword search over multiple tables:
graph of tuples with foreign-key relationships as edges [BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS] Result is connected tree with nodes that contain
as many query keywords as possible Ranking: with nodeScore based on tf*idf or prob. IR
and edgeScore reflecting importance of relationships (or confidence, authority, etc.) Related use cases:
XML beyond trees
RDF graphs
ER graphs (e.g. from IE)
social networks Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)
The Present: Observations & Opportunities : The Present: Observations & Opportunities Probabilistic IR and statistical language models
yield principled ranking and high effectiveness
(related to prob. relational models (Suciu, Getoor, …) but different)
Structural similarity and ranking
based on tree edit distance (FleXPath, Timber, …)
Aim for comprehensive XML ranking model
capturing content, structure, ontologies
Aim to generate structure skeleton
in XPath query from user feedback
Good progress on performance
but still many open efficiency issues
Outline : Outline Past • Future Present : Matter, Antimatter, and Wormholes : From Data to Knowledge : XML and Graph IR
Knowledge Queries : Knowledge Queries Nobel laureate who survived both world wars and his children drama with three women making a prophecy
to a British nobleman that he will become king proteins that inhibit both protease and some other enzyme connection between Thomas Mann and Goethe differences in Rembetiko music from Greece and from Turkey neutron stars with Xray bursts > 1040 erg s-1 & black holes in 10‘‘ market impact of Web2.0 technology in December 2006 sympathy or antipathy for Germany from May to August 2006 Turn the Web, Web2.0, and Web3.0 into the world‘s
most comprehensive knowledge base („semantic DB“) ! Answer „knowledge queries“ such as:
Three Roads to Knowledge : Three Roads to Knowledge Handcrafted High-Quality Knowledge Bases
(Semantic-Web-style ontologies, encyclopedias, etc.) Large-scale Information Extraction & Harvesting:
(using pattern matching, NLP, statistical learning, etc.
for product search, Web entity/object search, ...) Social Wisdom from Web 2.0 Communities
(social tagging, folksonomies, human computing,
e.g.: del.icio.us, flickr, answers.yahoo, iknow.baidu, ...) Social Wisdom from Web 2.0 Communities
(social tagging, folksonomies, human computing,
e.g.: del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)
High-Quality Knowledge Sources : High-Quality Knowledge Sources universal „common-sense“ ontologies:
SUMO (Suggested Upper Merged Ontology): 60 000 OWL axioms
Cyc: 5 Mio. facts (OpenCyc: 2 Mio. facts)
domain-specific ontologies:
UMLS (Unified Medical Language System): 1 Mio. biomedical concepts
135 categories, 54 relations (e.g. virus causes disease | symptom)
GeneOntology, etc.
thesauri and concept networks:
WordNet: 200 000 concepts (word senses) and hypernym/hyponym relations
can be cast into OWL-lite (or typed graph with statistical weights)
lexical sources:
Wikipedia (1.8 Mio. articles, 40 Mio. links, 100 languages) etc.
hand-tagged natural-language corpora:
TEI (Text Encoding Initiative) markup of historic encyclopedia
FrameNet: sentences classified into frames with semantic roles growing with strong momentum
High-Quality Knowledge Sources : High-Quality Knowledge Sources General-purpose thesauri and concept networks: WordNet family enzyme -- (any of several complex proteins that are produced by cells and
act as catalysts in specific biochemical reactions)
=> protein -- (any of a large group of nitrogenous organic compounds
that are essential constituents of living cells; ...)
=> macromolecule, supermolecule
...
=> organic compound -- (any compound of carbon
and another element or a radical)
...
=> catalyst, accelerator -- ((chemistry) a substance that initiates or
accelerates a chemical reaction
without itself being affected)
=> activator -- ((biology) any agency bringing about activation; ...)
can be cast into
OWL-lite or into
graph, with weights for relation strengths
(derived from co-occurrence statistics)
High-Quality Knowledge Sources : High-Quality Knowledge Sources Wikipedia and other lexical sources
Exploit Hand-Crafted Knowledge : {{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]
[[Humboldt-Universität zu Berlin]]
[[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students =
[[Gustav Ludwig Hertz]]
…
| known_for = [[Planck's constant]],
[[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources
YAGO: Yet Another Great Ontology[F. Suchanek, G. Kasneci, GW: WWW 2007] : YAGO: Yet Another Great Ontology [F. Suchanek, G. Kasneci, GW: WWW 2007] Turn Wikipedia into explicit knowledge base (semantic DB)
Exploit hand-crafted categories and templates
Represent facts as explicit knowledge triples:
relation (entity1, entity2)
(in 1st-order logic, compatible with RDF, OWL-lite, XML, etc.)
Map (and disambiguate) relations into WordNet concept DAG entity1 entity2 relation Max_Planck Kiel bornIn Kiel City isInstanceOf Examples:
YAGO Knowledge Representation : YAGO Knowledge Representation Entity Max_Planck April 23, 1858 Person City Country subclass Location subclass instanceOf subclass subclass bornOn “Max Planck” means “Dr. Planck” means subclass October 4, 1947 diedOn Kiel bornIn Nobel Prize Erwin_Planck FatherOf hasWon Scientist means “Max Karl Ernst Ludwig Planck” Physicist instanceOf subclass Biologist subclass concepts individuals words Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/ Accuracy: 97%
NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07] : NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07] queries with regular expressions Ling $x scientist isa hasFirstName | hasLastName $y Zhejiang locatedIn* worksFor conjunctive queries Beng Chin Ooi (coAuthor
| advisor)* Kiel $x scientist isa bornIn Graph-based search on YAGO-style knowledge bases
with built-in ranking based on confidence and informativeness
statistical language model for result graphs
Ranking Factors : Ranking Factors Confidence:
Prefer results that are likely to be correct
Certainty of IE
Authenticity and Authority of Sources Informativeness:
Prefer results that are likely important
May prefer results that are likely new to user
Frequency in answer
Frequency in corpus (e.g. Web)
Frequency in query log Compactness:
Prefer results that are tightly connected
Size of answer graph bornIn (Max Planck, Kiel) from
„Max Planck was born in Kiel“
(Wikipedia) livesIn (Elvis Presley, Mars) from
„They believe Elvis hides on Mars“
(Martian Bloggeria) q: isa (Einstein, $y) isa (Einstein, scientist) isa (Einstein, vegetarian) q: isa ($x, vegetarian) isa (Einstein, vegetarian) isa (Al Nobody, vegetarian) Einstein vegetarian Bohr Nobel Prize Tom Cruise 1962 isa isa bornIn diedIn won won
Slide31 : Information Extraction (IE): Text to Records combine NLP, pattern matching, lexicons, statistical learning
Knowledge Acquisition from the Web : Knowledge Acquisition from the Web Learn Semantic Relations from Entire Corpora at Large Scale
(as exhaustively as possible but with high accuracy)
Examples:
all cities, all basketball players, all composers
headquarters of companies, CEOs of companies, synonyms of proteins
birthdates of people, capitals of countries, rivers in cities
which musician plays which instruments
who discovered or invented what
which enzyme catalyzes which biochemical reaction Existing approaches and tools
(Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], …):
almost-unsupervised pattern matching and learning:
seeds (known facts) patterns (in text) (extraction) rule (new) facts
Methods for Web-Scale Fact Extration : city(Beijing)
plays(Coltrane, sax) city(Beijing) old center of Beijing
plays(Coltrane, sax) sax player Coltrane city(Beijing) old center of Beijing old center of X
plays(Coltrane, sax) sax player Coltrane Y player X Methods for Web-Scale Fact Extration Example:
city (Seattle) in downtown Seattle
city (Seattle) Seattle and other towns
city (Las Vegas) Las Vegas and other towns
plays (Zappa, guitar) playing guitar: … Zappa
plays (Davis, trumpet) Davis … blows trumpet seeds text rules new facts Example:
city (Seattle) in downtown Seattle in downtown X
city (Seattle) Seattle and other towns X and other towns
city (Las Vegas) Las Vegas and other towns X and other towns
plays (Zappa, guitar) playing guitar: … Zappa playing Y: … X
plays (Davis, trumpet) Davis … blows trumpet X … blows Y Example:
city (Seattle) in downtown Seattle in downtown X
city (Seattle) Seattle and other towns X and other towns
city (Las Vegas) Las Vegas and other towns X and other towns
plays (Zappa, guitar) playing guitar: … Zappa playing Y: … X
plays (Davis, trumpet) Davis … blows trumpet X … blows Y Example:
city (Seattle) in downtown Seattle in downtown X
city (Seattle) Seattle and other towns X and other towns
city (Las Vegas) Las Vegas and other towns X and other towns
plays (Zappa, guitar) playing guitar: … Zappa playing Y: … X
plays (Davis, trumpet) Davis … blows trumpet X … blows Y in downtown Beijing city(Beijing)
Coltrane blows sax plays(C., sax)
Performance of Web-IE : Performance of Web-IE State-of-the-art precision/recall results: Anecdotic evidence: invented (A.G. Bell, telephone)
married (Hillary Clinton, Bill Clinton)
isa (yoga, relaxation technique)
isa (zearalenone, mycotoxin)
contains (chocolate, theobromine)
contains (Singapore sling, gin) invented (Johannes Kepler, logarithm tables)
married (Segolene Royal, Francois Hollande)
isa (yoga, excellent way)
isa (your day, good one)
contains (chocolate, raisins)
plays (the liver, central role)
makes (everybody, mistakes) relation precision recall corpus systems
countries 80% 90% Web KnowItAll
cities 80% ??? Web KnowItAll
scientists 60% ??? Web KnowItAll
headquarters 90% 50% News Snowball, LEILA
birthdates 80% 70% Wikipedia LEILA
instanceOf 40% 20% Web Text2Onto, LEILA
Open IE 80% ??? Web TextRunner precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%
Beyond Surface Learning with LEILA : Beyond Surface Learning with LEILA Almost-unsupervised Statistical Learning with Dependency Parsing Limitation of surface patterns:
who discovered or invented what
“Tesla’s work formed the basis of AC electric power”
Learning to Extract Information by Linguistic Analysis [F.Suchanek, G.Ifrim, GW: KDD‘06] LEILA outperforms other Web-IE methods
in terms of precision, recall, F1, but:
dependency parser is slow
one relation at a time “Al Gore funded more work for a better basis of the Internet”
IE Efficiency and Accuracy Tradeoffs : IE Efficiency and Accuracy Tradeoffs precision vs. recall: two-stage processing (filter pipeline)
recall-oriented harvesting
precision-oriented scrutinizing
preprocessing
indexing: NLP trees & graphs, N-grams, PoS-tag patterns ?
exploit ontologies? exploit usage logs ?
turn crawl&extract into set-oriented query processing
candidate finding
efficient phrase, pattern, and proximity queries
optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD‘06] IE is cool, but what‘s in it for DB folks? [see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
The Future: Challenges : The Future: Challenges Generalize YAGO approach (Wikipedia + WordNet)
Methods for comprehensive, highly accurate
mappings across many knowledge sources
cross-lingual, cross-temporal
scalable in size, diversity, number of sources
Pursue DB support towards efficient IE (and NLP)
Achieve Web-scale IE throughput that can
sustain rate of new content production (e.g. blogs)
with > 90% accuracy and Wikipedia-like coverage
Integrate handcrafted knowledge with NLP/ML-based IE
Incorporate social tagging and human computing
Outline : Outline Past Future Present : Matter, Antimatter, and Wormholes : From Data to Knowledge : XML and Graph IR
Major Trends in DB and IR : Major Trends in DB and IR malleable schema (later) deep NLP, adding structure record linkage info extraction graph mining entity-relationship graph IR ontologies ranking Database Systems Information Retrieval statistical language models data uncertainty programmability search as Web Service dataspaces Web objects Web 2.0 Web 2.0
Conclusion : Conclusion DB&IR integration agenda:
models − ranking, ontologies, prob. SQL ?, graph IR ?
languages and APIs − XQuery Full-Text++ ?
systems − drop SQL, go light-weight ?
− combine with P2P, Deep Web, ... ?
Rethink progress measures and experimental methodology
Address killer app(s) and grand challenge(s):
from data to knowledge (Web, products, enterprises)
integrate knowledge bases, info extraction, social wisdom
cope with uncertainty; ranking as first-class principle
Bridge cultural differences between DB and IR:
co-locate SIGIR and SIGMOD
DB&IR: Both Sides Now : DB&IR: Both Sides Now Joni Mitchell (1969): Both Sides Now
…
I've looked at life from both sides now, From up and down, and still somehow It's life's illusions i recall. I really don't know life at all. Thank You !