Integrating Online and Geospatial Information Sources : Integrating Online and Geospatial Information Sources Craig A. Knoblock
University of Southern California
Acknowledgements : Acknowledgements “Slide Integration”
Thanks to Subbarao Kambhampati
We jointly gave a version of this tutorial at AAAI-02
Also many thanks to
Jose Luis Ambite
Greg Barish
Alon (Ha)Levy
Steve Minton
Ion Muslea
Sheila Tejada
for permission to use/mutilate some of their slides
Overview : Overview Motivation for Information Integration
Database Refresher
Accessing Information Sources
Record Linkage
Data Integration/Query Planning
Plan Execution
Standards for Integration/Mediation
Discussion
Preamble & Platitudes : Preamble & Platitudes Internet is growing at an enormous rate
Even after the bubble-burst
All kinds of information sources are online
Web pages, databases masquerading as web pages, Services, Sensors
Promise of unprecedented information access to every Tom, Dick and Mary..
But, right now, they still need to “know” where to go, and be willing to manually put together bits and pieces of information gleaned from various sources and services
“Information Integration” aims to do this automatically.
Isn’t web mostly text? : Isn’t web mostly text? The invisible web is mostly structured…
Most web servers have back end database servers
They dynamically convert (wrap) the structured data into readable english
=> The capital of India is New Delhi.
So, if we can “unwrap” the text, we have structured data!
(un)wrappers, learning wrappers etc…
Note also that such dynamic pages cannot be crawled...
The (coming) Semi-structured web
Most pages are at least “semi”-structured
XML standard is expected to ease the transfer of such pages.
The Services
Travel services, mapping services
The Sensors
Stock quotes, current temperatures, ticket prices…
Why isn’t this just : Why isn’t this just Search engines do text-based retrieval of URLS
Works reasonably well for single document texts, or for finding sites based on single document text
Cannot integrate information from multiple documents
Cannot do effective “query relaxation” or generalization
Cannot link documents and databases
The aim of Information integration is to support query processing over structured and semi-structured sources as well as services.
Why isn’t this just : Why isn’t this just No common schema
Sources with heterogeneous schemas (and ontologies)
Semi-structured sources
Legacy Sources
Not relational-complete
Variety of access/process limitations
Autonomous sources
No central administration
Uncontrolled source content overlap
Unpredictable run-time behavior
Makes query execution hard
Presence of “services”
Need to “compose” services
Databases Distributed Databases
(An exceedingly brief) Database Refresher : (An exceedingly brief) Database Refresher
Slide9 : Database
(relational) Database Manager
(DBMS)
-Storage mgmt
-Query processing
-View management
-(Transaction processing) Query
(SQL) Answer
(relation) Traditional Database Architecture
Database Outline : Database Outline What we care about
Structured data representations
Relational databases
Deductive databases
Structured query languages
SQL
Views (& materialized views)
Slide11 : Relational Data: Terminology Name Price Category Manufacturer
gizmo $19.99 gadgets GizmoWorks
Power gizmo $29.99 gadgets GizmoWorks
SingleTouch $149.99 photography Canon
MultiTouch $203.99 household Hitachi tuples Attribute names Product Product(name: string, Price: real, category: enum, Manufacturer: string) (Arity=4) schema
Relational Algebra : Relational Algebra Operators
tuple sets as input, new set as output
Operations
Union, Intersection, difference, ..
Selection (s)
Projection ()
Cartesian product (X)
Join ( )
SQL: A query language for Relational Algebra : SQL: A query language for Relational Algebra Many standards out there: SQL92, SQL2, SQL3, SQL99
Select attributes
From relations (possibly multiple, joined)
Where conditions (selections) “Find companies that manufacture products
bought by Joe Blow”
SELECT Company.name
FROM Company, Product
WHERE Company.name=Product.maker
AND Product.name IN
(SELECT product
FROM Purchase
WHERE buyer = “Joe Blow”); Other features:
aggregation, group-by
etc.
Deductive Databases : Deductive Databases Relations viewed as predicates.
Interrelations between relations expressed as “datalog” rules
(Horn clauses, without function symbols)
Queries correspond to datalog programs
“Conjunctive queries” are datalog programs with a single non-recursive rule [Correspond to SPJ queries in SQL]
Emprelated(Name,Dname) :- Empdep(Name,Dname)
Emprelated(Name,Dname) :- Empdep(Name,D1), Emprelated(D1,Dname)
EDB predicate IDB predicate
Datalog : Datalog Datalog Program = set of datalog rules
Datalog rule = conjunctive query
big-LA-buyers(Buyer,Seller, Price) :-
person(Buyer, “Los Angeles”),
purchase(Buyer, Seller, Product, Price),
Price > 10000.
Datalog : Datalog Datalog Program = set of datalog rules
Datalog rule = conjunctive query
big-LA-buyers(Buyer,Seller, Price) :-
person(Buyer, “Los Angeles”),
purchase(Buyer, Seller, Product, Price),
Price > 10000.
Buyer, Seller, Price
[ Product [ person(Buyer, “Los Angeles”) ^
purchase(Buyer, Seller, Product, Price) ^
Price > 10000) ]
big-LA-buyers(Buyer,Seller, Price) ] Datalog First-Order Logic
Views : Views CREATE VIEW Seattle-view AS
SELECT buyer, seller, product, store
FROM Person, Purchase
WHERE Person.city = “Seattle” AND
Person.name = Purchase.buyer We can later use the views:
SELECT name, store
FROM Seattle-view, Product
WHERE Seattle-view.product = Product.name AND
Product.category = “shoes” What’s really happening when we query a view?? Virtual
vs.
Materialized
Conjunctive Queries and Views : Conjunctive Queries and Views CREATE VIEW Big-LA-buyers AS
SELECT buyer, seller, price
FROM Person, Purchase
WHERE Person.city = “Los Angeles” AND
Person.name = Purchase.buyer AND
Purchase.price > 10000
big-LA-buyers(Buyer,Seller, Price) :-
person(Buyer, “Los Angeles”),
purchase(Buyer, Seller, Product, Price),
Price > 10000.
Datalog rule ~ view definition
Rule body ~ select-from-where construct of SQL
Integrator vs. DBMS : Integrator vs. DBMS No common schema
Sources with heterogeneous schemas
Semi-structured sources
Legacy Sources
Not relational-complete
Variety of access/process limitations
Autonomous sources
No central administration
Uncontrolled source content overlap
Lack of source statistics
Tradeoffs between query plan cost, coverage, quality etc.
Multi-objective cost models
Unpredictable run-time behavior
Makes query execution hard
Presence of “services”
Need to “compose” services
Reprise
Acessing Information Sources : Acessing Information Sources Wrapper Learning
Wrappers & Information Agents : A
G
E
N
T GIVE ME: Thai food
< $20
“A”-rated Thai
< $20 “A”rated Wrappers & Information Agents
Wrapper Induction : Wrapper Induction Problem description:
Web sources present data in human-readable format
take user query
apply it to data base
present results in “template” HTML page
To integrate data from multiple sources, one must first extract relevant information from Web pages
Task: learn extraction rules based on labeled examples
Hand-writing rules is tedious, error prone, and time consuming
Example of Extraction Task : Example of Extraction Task NAME Casablanca Restaurant
STREET 220 Lincoln Boulevard
CITY Venice
PHONE (310) 392-5751
Rule Learning : Rule Learning Machine learning:
Use past experiences to improve performance
Rule learning:
INPUT:
Labeled examples: training & testing data
Admissible rules (hypotheses space)
Search strategy
Desired output:
Rule that performs well both on training and testing data
STALKER [Muslea et al, ’98 ’99 ’01] : STALKER [Muslea et al, ’98 ’99 ’01] Hierarchical wrapper induction
Decomposes a hard problem in several easier ones
Extracts items independently of each other
Each rule is a finite automaton
Advantages:
Powerful extraction language (eg, embedded list)
One hard-to-extract item does not affect others
Disadvantage:
Does not exploit item order (sometimes may help)
STALKER: The Wrapper Architecture : Extraction
Rules Query Data Information Extractor EC Tree STALKER: The Wrapper Architecture
Extraction Rules : Extraction Rules Extraction rule: sequence of landmarks Name: Joel’s Phone: (310) 777-1111
Review: … SkipTo(Phone) SkipTo() SkipTo()
Slide28 : Name: KFC
Cuisine: Fast Food
Locations:
Venice (310) 123-4567,
(800) 888-4412.
L.A. (213) 987-6543.
Encino (818) 999-4567,
(888) 727-3131. RESTAURANT
Name List ( Locations ) Cuisine
City List (PhoneNumbers)
AreaCode Phone The Embedded Catalog Tree (ECT)
Slide29 : Training Examples: Name: Del Taco
Phone (toll free) : ( 800 ) 123-4567
Cuisine ...
Name: Burger King
Phone : ( 310 ) 987-9876
Cuisine: …
Example of Rule Induction
Learning the Extraction Rules : Inductive
Learning
System Extraction
Rules Labeled Pages Learning the Extraction Rules GUI
Slide31 : Training Examples: Name: Del Taco
Phone (toll free) : ( 800 ) 123-4567
Cuisine ...
Name: Burger King
Phone : ( 310 ) 987-9876
Cuisine: …
Initial candidate: SkipTo( ( )
Example of Rule Induction
Slide32 :
SkipTo( ( ) ... SkipTo(Phone) SkipTo( ( ) ... SkipTo(:) SkipTo(()
Training Examples: Name: Del Taco
Phone (toll free) : ( 800 ) 123-4567
Cuisine ...
Name: Burger King
Phone : ( 310 ) 987-9876
Cuisine: …
Initial candidate: SkipTo( ( )
Example of Rule Induction
Slide33 : Initial candidate: SkipTo( ( )
… SkipTo(Phone) SkipTo(:) SkipTo( ( ) ...
SkipTo( ( ) ... SkipTo(Phone) SkipTo( ( ) ... SkipTo(:) SkipTo(()
Training Examples: Name: Del Taco
Phone (toll free) : ( 800 ) 123-4567
Cuisine ...
Name: Burger King
Phone : ( 310 ) 987-9876
Cuisine: …
Example of Rule Induction
Active Learning & Information Agents : Active Learning & Information Agents Active Learning
Idea: system selects most informative exs. to label
Advantage: fewer examples to reach same accuracy
Information Agents
One agent may use hundreds of extraction rules
Small reduction of examples per rule => big impact on user
Want to achieve 100% accuracy with as few examples as possible
Slide35 : Name: Joel’s
Phone: (310) 777-1111
Review: The chef… SkipTo( Phone: ) Name: Kim’s
Phone: (213) 757-1111
Review: Korean … Name: Chez Jean
Phone: (310) 666-1111
Review: … Name: Burger King
Phone:(818) 789-1211
Review: ... Name: Café del Rey
Phone: (310) 111-1111
Review: ... Name: KFC
Phone: (800) 111-7171
Review:... Unlabeled Examples Training Examples Which example should be labeled next?
Slide36 : Name: KFC
Phone: (310) 111-1111
Review: Fried chicken … SkipTo( Phone: ) BackTo( ( Number ) ) Two ways to find start of the phone number: Multi-view Learning
Slide37 : + - Unlabeled data Co-Testing Labeled data
Slide38 : Name: Joel’s
Phone: (310) 777-1111
Review: ... SkipTo( Phone: ) Name: Kim’s
Phone: (213) 757-1111
Review: ... Co-Testing for Wrapper Induction BackTo( (Number) )
Slide39 : Not all queries are equally informative … Phone: (800) 171-1771
Fax: (111) 111-1111
Review: … … Phone: -
Review: Founded a century ago (1891) , this …
Slide40 : Learn “content description” for item to be extracted
Too general for extraction
( Nmb ) Nmb – Nmb can’t tell a phone number from a fax number
Useful at discriminating among query candidates
Learned field description
Starts with: ( Nmb )
Ends with: Nmb – Nmb
Contains: Nmb Punct
Length: [6,6] Weak Views
Naïve & Aggressive Co-Testing : Naïve & Aggressive Co-Testing
Naïve Co-Testing:
Query: randomly chosen contention point
Output: rule with fewest mistakes on queries
Aggressive Co-Testing:
Query: contention point that most violates weak view
Output: committee vote (2 rules + weak view)
Empirical Results: 33 Difficult Tasks : Empirical Results: 33 Difficult Tasks 33 most difficult of the 140 extraction tasks
Each view: > 7 labeled examples for best accuracy
At least 100 examples for task
Results in 33 Difficult Domains : Results in 33 Difficult Domains Extraction
Tasks
Results in 33 Difficult Domains : Results in 33 Difficult Domains Extraction
Tasks
Results in 33 Difficult Domains : Results in 33 Difficult Domains Extraction
Tasks
Automatic Wrapper Generation [Crescenzi, Mecca, & Merialdo, 2001] : Automatic Wrapper Generation [Crescenzi, Mecca, & Merialdo, 2001] Automatically generates wrappers web pages
Supports nested structures and lists
Applies to large, complex pages with regular structure
Approach
Start with the first page and create a union-free regular expression that defines the wrapper
Match each successive sample against the wrapper
Mismatches result in generalizations of the regular expression
Example Matching : Example Matching
Types of Mismatches : Types of Mismatches String mismatches are used to discover fields of the document
Tag mismatches can indicate either optional elements or iterators
For iterations, mismatch is caused by repeated elements in a list
End of the list corresponds to last matching token
Beginning of list corresponds to one of the mismatched tokens
These create possible “squares”
Limitations : Limitations Assumptions:
Pages are well-structured
Want to extract at the level of entire fields
Structure can be modeled without disjunctions
Search space for explaining mismatches is huge
Uses a number of heuristics to prune space
Limited backtracking
Limit on number of choices to explore
Patterns cannot be delimited by optionals
Will result in pruning possible wrappers
Record Linkage : Record Linkage Integrating Data Across Sources
Record Linkage : Record Linkage Problem
Different sources typically represent and format information differently.
As a result, determining if two sources are referring to the same object can be difficult.
Example
Is “Joe Cool” the same person as “Joseph B. Cool”?
What if they have the same telephone number?
What if Joe Cool’s number is 310-322-0730 and Joseph B. Cool’s number is 310-640-2973?
Example Data Integration Problem : Example Data Integration Problem Zagat’s Restaurant
Guide Source Department of Health
Restaurant Source Art’s Delicatessen
Ca’ Brea
CPK
The Grill
Patina
Philippe’s The Original
The Tillerman Art’s Deli
California Pizza Kitchen
Campanile
Citrus
Grill, The
Philippe The Original
Spago How to align (or join) the objects across different sources
Information Retrieval Approach [Cohen, 1998] : Information Retrieval Approach [Cohen, 1998] Idea: Evaluate the similarity of records via textual similarity. Used in Whirl (Cohen 1998).
Follows the same approach used by classical IR algorithms (including web search engines).
First, “stemming” is applied to each entry.
E.g. “Joe’s Diner” -> “Joe [‘s] Diner”
Then, entries are compared by counting the number of words in common.
Note: Infrequent words weighted more heavily by TFIDF metric = Term Frequency Inverse Document Frequency
Unsupervised Record Linkage : Unsupervised Record Linkage Idea: Analyze data and automatically cluster pairs into three groups:
Let R = P(obs | Same) / P(obs| Different)
Matched if R > threshold TU
Unmatched if R < threshold TL
Ambiguous if TL < R < TU
This model for computing decision rules was introduced by Felligi & Sunter in 1969
Particularly useful for statistically linking large sets of data, e.g., by US Census Bureau
Unsupervised Record Linkage (cont.) : Unsupervised Record Linkage (cont.) Winkler (1998) used EM algorithm to estimate P(obs | Same) and P(obs | Different)
EM computes the maximum likelihood estimate
The algorithm iteratively determines the parameters most likely to generate the observed data.
Additional mathematical techniques must be used to adjust for “relative frequencies”
I.e. last name of “Smith” is much more frequent than “Knoblock”.
Supervised Active Learning Approach[Tejada, Knoblock & Minton, 2001] : Supervised Active Learning Approach [Tejada, Knoblock & Minton, 2001] Supervised learning. System learns:
Which attributes to weight more heavily:
Transformation rules
Mapping Rules : Mapping Rules Set of Similarity Scores Mapping Rules
Name Street Phone
.967 .973 .3
.17 .3 .74
.8 .542 .49
.95 .97 .67
… Name > .8 & Street > .79 => mapped
Name > .89 => mapped
Street < .57 => not mapped
Transformation Weights : Transformation Weights Appropriate transformations depend on the application domain
Restaurants, companies, airports…
…and on the different attributes within an application
Acronym is more appropriate for restaurant name than phone number
Learn likelihood that if a transformation is applied then two object match Transformation Weight = P (match | transformation)
Slide59 : Learning Object Mappings Candidate Generator:
Judge textual similarity of mappings
Reduce number of mappings considered for classification
Mapping Learner:
Active learning technique to learn mapping rules and transformation weights
System chooses most informative example for the user to label
Minimize the amount of user interaction Candidate
Generator Set of
Mapped
Objects Source 1 Source 2 Mapping
Learner User Input Active Atlas
Mapping Rule Learner : Mapping Rule Learner
Committee Disagreement : Committee Disagreement Chooses an example based on the disagreement of the query committee
In this case CPK, California Pizza Kitchen is the most informative example based on disagreement
Art’s Deli, Art’s Delicatessen
CPK, California Pizza Kitchen
Ca’Brea, La Brea Bakery
Yes Yes Yes
Yes No Yes
No No No
Examples M1 M2 M3 Committee
Data Integration/Query Planning : Data Integration/Query Planning
Principal Dimensions of Data Integration : Principal Dimensions of Data Integration Virtual vs. materialized architecture
Access: query only or query & update?
Mediated schema and query reformulation
Content Descriptions
Global-as-view
Local-as-view
Language for descriptions and queries: conjunctive queries (CQs), union of CQs, Datalog (recursion), first-order logic (,,), description logics, …
Types of Sources
Structured (DB’s) vs. semi-structured (Web)
Source capabilities: positive and negative
Materialized Architecture: Data Warehouse : Materialized Architecture: Data Warehouse
Virtual Architecture:Mediator : Virtual Architecture: Mediator
Virtual Integration Architecture : Virtual Integration Architecture Leave the data in the sources
When a query comes in:
Determine the relevant sources to the query
Break down the query into sub-queries for the sources
Get the answers from the sources, and combine them appropriately
Data is fresh. Approach scalable
Issues:
Relating Sources & Mediator
Reformulating the query
Efficient planning & execution Data source wrapper Data source wrapper Data source wrapper Mediator: User queries Mediated schema Data source catalog Reformulation engine optimizer Execution engine Garlic [IBM], Hermes[UMD];Tsimmis, InfoMaster[Stanford]; DISCO[INRIA]; Information Manifold [AT&T]; SIMS/Ariadne[USC];Emerac/Havasu[ASU]
Desiderata for Relating Source-Mediator Schemas : Desiderata for Relating Source-Mediator Schemas Expressive power: distinguish between sources with closely related data. Hence, be able to prune access to irrelevant sources.
Easy addition: make it easy to add new data sources.
Reformulation: be able to reformulate a user query into a query on the sources efficiently and effectively.
Nonlossy: be able to handle all queries that can be answered by directly accessing the sources Reformulation
Source Descriptions : Source Descriptions Elements of source descriptions:
Contents: source contains movies, directors, cast.
Constraints: only movies produced after 1965.
Completeness: contains all American movies.
Capabilities:
Negative: source requires movie title or director as input
Positive: source can perform selections, joins, …
Approaches to Specification of Source Descriptions : Approaches to Specification of Source Descriptions Global-as-View (GAV):
Mediator relation defined as a view over source relations
Ex: TSIMMIS (Stanford), HERMES (Maryland)
Local-as-View (LAV):
Source relation defined as view over mediator relations
Ex: Information Manifold (AT&T), Tukwila(UW), InfoMaster (Stanford), Ariadne (USC)
View ~ named query ~ logical formula
Query Reformulation : Query Reformulation Problem: rewrite the user query expressed in the mediated schema into a query expressed in the source schemas
Given a query Q in terms of the mediated-schema relations, and descriptions of the information sources,
Find a query Q’ that uses only the source relations, such that
Q’ |= Q (i.e., answers are correct; i.e., Q’ ⊆ Q) and
Q’ provides all possible answers to Q given the sources
Answering queries using views : Answering queries using views Given query q and view definitions V={V1…Vn}
q’ is an Equivalent Rewriting of q using V if:
q’ refers only to views in V, and
q’ = q
q’ is a Maximally-Contained Rewriting of q using V if:
q’ refers only to views in V, and
q’ ⊆ q, and
there is no rewriting q1, such that q’ ⊆ q1 ⊆ q and q1 q
Global-as-View (GAV) : Global-as-View (GAV) Each mediator relation is defined as a view over source relations.
MovieActor(title,actor)
DB1(id,title,actor,year)
MovieActor(title,actor) DB2(title,director,actor,year)
MovieReview(title, review) DB1(id,title,actor,year) ^ DB3(id,review)
Query Reformulation in GAV : Query Reformulation in GAV Query reformulation = rule unfolding+simplification
Query: Find reviews for ‘DeNiro’ movies
q(title,review) :- MovieActor(title,‘DeNiro’),
MovieReview(title,review)
1. q’(title,review) :- DB1(id,title,‘DeNiro’,year),
DB1(id,title,actor,year’), DB3(id,review)
2. q’(title,review) :-
DB2(title,director,‘DeNiro’,year),
DB1(id,title,actor, year’), DB3(id,review)
Local-as-View (LAV) : Local-as-View (LAV) Each source relation is defined as a view over mediator relations
V1(title, year, director) Movie(title,year,director,genre) ^ American(director) ^ year ≥1960 ^ genre = ‘Comedy’
V2 (title, review) Movie(title,year,director,genre) ^ year≥1990 ^ MovieReview(title, review) ⊆ ⊆
Query Reformulation in LAV : Query Reformulation in LAV Reformulated query:
q’(title,review) :- V1(title,year,director),
V2(title,review) q’ ⊆ q V1(title, year, director) Movie(title,year,director,genre) ^
American(director) ^ year ≥1960 ^ genre = ‘Comedy’
V2 (title, review) Movie(title,year,director,genre) ^ year≥1990 ^ MovieReview(title, review) Query: Reviews for comedies produced after 1950
q(title,review) :- Movie(title,year,director,’Comedy’), year ≥1950, MovieReview(title,review)
Integrating GIS and ImageryGlobal as View Approach [Gupta et al.] : Integrating GIS and Imagery Global as View Approach [Gupta et al.] GIS Source
Soil maps
Parcel maps
Digital elevation maps
Transportation network maps
Image Library
Satellite imagery
Aerial images
Property photographs
Mediation in MIX : Mediation in MIX Mediator defined by building an structured representation of both GIS and image sources
Mediator relations defined by:
Containment conditions
Spatial or temporal joins
Logical associations
Queries and results in XML
Mediation in MIX (cont.) : Mediation in MIX (cont.) Wrappers
Construct wrappers for the GIS and image data sources
Evaluating spatial queries
Determine subqueries to each of the sources
Compose results and produce integrated XML document
Spatial data converter used to handle conversions between sources (e.g., UTM to USGS 7.5’ quad)
Example : Example Produce a table of aerial imagery and photographs of houses broken down by 5-year increments and Total Assessed Value
Result : Result
Plan Execution : Plan Execution
Motivation : Motivation Problem
Information gathering may involve accessing and integrating data from many sources
Total time to execute these plans may be large
Why?
Unpredictable network latencies
Varying remote source capabilities
Thus, execution is often I/O-bound
Complicating factor: binding patterns
During execution, many sources cannot be queried until a previous source query has been answered
GAV vs. LAV : GAV vs. LAV Not modular
Addition of new sources changes the mediated schema
Can be awkward to write mediated schema without loss of information
Query reformulation easy
reduces to view unfolding (polynomial)
Can build hierarchies of mediated schemas
Best when
Few, stable, data sources
well-known to the mediator (e.g. corporate integration)
Garlic, TSIMMIS, HERMES
Modular--adding new sources is easy
Very flexible--power of the entire query language available to describe sources
Reformulation is hard
Involves answering queries only using views (can be intractable)
Best when
Many, relatively unknown data sources
possibility of addition/deletion of sources
Information Manifold, InfoMaster, Emerac
Traditional Approaches : Traditional Approaches Executing information gathering plans
Generate a plan
Plan typically consists of a partial ordering of the operators
Execute the plan based on the given order
Operators process all of their input data before transmitting any results to consumer(s)
Operators as fast as their most latent input
Long delays due to the dependencies in the plan
Dataflow vs Von-Neumann : Dataflow vs Von-Neumann ADD ADD ADD MUL ADD MUL ((a + b) * (c + d)) a b c d a b c d actor arc
Streaming Dataflow : Streaming Dataflow Plans consist of a network of operators
Each operator like a function
Example: Wrapper, Select, etc.
Operators produce and consume data
Operators “fire” when any part of any input data becomes available
Data routed between operators are relations
Zero or more tuples with one or more attributes
Wrapper Select Join Wrapper Input Output Plan
Parallelism of Streaming Dataflow : Parallelism of Streaming Dataflow Dataflow (horizontal parallelism)
Decentralized, independent operator execution
Enables "maximally parallel" operator execution
Also known as the "dataflow limit"
Streaming/pipelining (vertical parallelism)
Producer emits tuples to consumer ASAP
Producer & consumer can process same relation simultaneously
Effective because information gathering latencies can be high – even at the tuple level
Data often "trickles" out of I/O-bound operators
Example: The RepInfo Agent : Example: The RepInfo Agent INPUT
Any street address
e.g., 4767 Admiralty Way, Marina del Rey, CA, 90292
OUTPUT
Federal reps
2 senators,
1 house member
For each rep:
Recent news
Real-time funding
information
RepInfo Sources : RepInfo Sources
RepInfo Sources : RepInfo Sources
RepInfo Sources : RepInfo Sources
OpenSecrets – Navigation + Fetching! : OpenSecrets – Navigation + Fetching!
OpenSecrets – Navigation + Fetching! : OpenSecrets – Navigation + Fetching!
OpenSecrets – Navigation + Fetching! : OpenSecrets – Navigation + Fetching!
OpenSecrets – Navigation + Fetching! : OpenSecrets – Navigation + Fetching!
RepInfo agent plan : RepInfo agent plan Wrapper
OpenSecrets
(member page) Join
name Select
senators,
house reps Wrapper
Vote-Smart address all officials senators & house reps graph URL recent news combined results Wrapper
OpenSecrets
(funding page) funding URL Wrapper
Yahoo News Wrapper
OpenSecrets
(names page) member URL
Adaptive Query Execution : Adaptive Query Execution Network Query Engines
Tukwila
Operator reordering
Optimized operators
Telegraph
Tuple-level adaptivity
Niagara
Partial results for blocking operators
Agent Execution Language
Theseus
Speculative execution
How to speculate? : How to speculate? General problem
Means for issuing and confirming predictions
Two new operators
Speculate: Makes predictions based on "hints"
Confirm: Prevents errant results from exiting plan
Speculate answers hints confirmations predictions/additions Confirm confirmations probable results actual results
How to speculate? : How to speculate? Example: RepInfo
Make predictions about officials based on address
Makes practical sense:
Representatives do not change often
Addresses-to-reps is a many-to-one relationship
Speedups beyond 2 : Speedups beyond 2 Cascading speculation
Speculation on speculation
Functional dependencies
Enable early confirmation because subsequent FD processing is deterministic W W W W W W W W W W S S S S S S S S S G
Learning to Speculate : Learning to Speculate Accurate predictions
The better our prediction accuracy, the better the speedup
Example
Predict federal officials given an address
Categories of predictions
How do we deal with…?
New hints
Making novel predictions
Caching : Caching Associate answers with previously seen hints
Advantages
Simple
Disadvantages
Requires lots of space
Only supports previously seen predictions on previously seen data (category A)
Other ways to predict : Other ways to predict Classification
4780 Admiralty Way, Marina del Rey, CA 90292
Likely reps = (Boxer, Feinstein, and Harman)
We have learned that zip code and city are the features that most likely indicate the representative
Translation
Some data have predictable transformations
Example: the OpenSecrets source
Member URL
http://www.opensecrets.org/politicians/summary.asp?CID=N00006750
Funding URL
http://www.opensecrets.org/politicians/sector.asp?CID=N00006750
Standards for Integration/Mediation : Standards for Integration/Mediation
The X-standards… : The X-standards… XML: an on-the-wire representation for data
Xquery: a query language for XML
Xschema/DTD: a schema description language for XML data
RDF: a language for meta-data description
WSDL/SOAP/UDDI: languages for describing services
HTML vs. XML : HTML vs. XML
Bibliography
Foundations of Databases
Abiteboul, Hull, Vianu
Addison Wesley, 1995
Data on the Web
Abiteoul, Buneman, Suciu
Morgan Kaufmann, 1999
Foundations…
Abiteboul
Hull
Vianu
Addison Wesley
1995
…
“Self-describing”
-Schema info part of the data
-Good for data exchange
(albeit baroque for storage)
XML Terminology : XML Terminology tags: book, title, author, …
start tag: , end tag:
elements: …,…
elements are nested
empty element: abbrv.
an XML document: single root element well formed XML document: if it has matching tags
Why are Database folks so excited about XML? : Why are Database folks so excited about XML? XML is just a syntax for (self-describing) data
This is still exciting because
No standard syntax for relational data
With XML, we can
Translate any legacy data to XML
Can exchange data in XML format
Ship over the web, input to any application
XML vs. Relational Data : XML vs. Relational Data XML is meant as a language that supports both Text and Structured Data
Conflicting demands...
XML supports semi-structured data
In essence, the schema can be union of multiple schemas
Easy to represent books with or without prices, books with any number of authors etc.
XML supports free mixing of text and data
using the #PCDATA type
XML is ordered (while relational data is unordered)
Querying XML : Querying XML Requirements:
Need to handle lack of schema.
We may not know much about the data, so we need to navigate the XML.
Need to support both “information retrieval” and “SQL-style” queries.
Ordered vs. un-ordered XML
“Human readable”
like SQL?
Candidates
Many… based on conflicting requirements
XSL: Makes IR folks happy
XML-QL: Makes DB folks happy
Xquery : W3C’s attempt to make everybody (un)happy
Example Query : Example Query
{ for $b in /bib/book
where $b/publisher = "Addison-Wesley"
and $b/@year > 1991
return
{ $b/title }
}
“For all books after 1991,
return with Year changed from
a tag to an attribute”
TCP/IP Illustrated
Advanced Programming in the Unix environment
Result Query
Impact of XML on Integration : Impact of XML on Integration If and when all sources accept Xqueries and exchange data in XML format, then
Mediator can accept user queries in Xquery
Access sources using Xquery
Get data back in XML format
Merge results and send to user in XML format
How about now?
Sources can use XML adapters (middle-ware)
XML middleware for Databases : XML middleware for Databases XML adapters (middle-ware) received significant attention in DB community
SilkRoute (AT&T)
Xperanto (IBM)
Issues:
Need to convert relational data into XML
Tagging (easy)
Need to convert Xquery queries into equivalent SQL queries
Trickier as Xquery supports schema querying
A single query may be mapped into a union of SQL queries
Is XML standardization a magical solution for Integration? : Is XML standardization a magical solution for Integration? If all WEB sources standardize into XML format
Source access (wrapper generation issues) become easier to manage
BUT all other problems remain
Still need to relate source (XML)schemas to mediator (XML)schema
Still need to reason about source overlap, source access limitations etc.
Still need to manage execution in the presence of source/network uncertainities
“Semantic Web” : “Semantic Web” The LAV/GAV approaches assume that some human expert will do the actual schema mapping
The “semantic-web” initiative attempts to automate schema mapping
Idea: Allow pages to write logical axioms relating their vocabulary (tags) to other external tags
Support automatic inference of relations between source and mediator schema using these rules
DAML+OIL
Review : Review Motivation for Information Integration
Database Refresher
Accessing Information Sources
Record Linkage
Data Integration/Query Planning
Plan Execution
Standards for Integration/Mediation
Discussion
Discussion : Discussion Many opportunities for integrating online sources with geospatial sources
There are many online sources that can be related to geospatial sources
Online databases, text documents, phone books, schedules, …
Possible uses
Augmenting what is known about geospatial entities
Using online sources to analyze imagery
Combining online sources with geospatial sources for research and analysis
Effect of coastal pollution on the economy of coastal regions
Placing text documents in a geospatial context