logging in or signing up s07 query processing in data integration Sudiksha Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 209 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 18, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Supporting keyword queries over databases: A simpl{e/istic} first step: Supporting keyword queries over databases: A simpl{e/istic} first step As you may have heard, Google struck deals with AZ, UT and some other states to make their public records databases searchable by “Google Users” Clearly the databases are not going to be searched with SQL queries, but rather with keyword search. How does one do keyword search over database records? One simple idea is to Generate the universal relation of the database (i.e., de-normalize and create a single big table) Write out each tuple as a simple HTML file Support keyword queries on these HTML files Looks like an obviously dumb idea (we are destroying the structure in the tuple by writing it out as an html file!) So we need a “smart-sounding” name for the idea We shall call it “SURFACING THE DEEP WEB” We surface the deepweb into HTML pages since we have a very nice hammer in terms of keyword based search that can then be used on those pages… Slide added after the class discussionSoft Joins..WHIRL [Cohen]: Soft Joins..WHIRL [Cohen] We can extend the notion of Joins to “Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output in a ranked form—with the rank proportional to the similarity Neat idea… but does have some implementation difficulties Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples Uses A*-search to focus on top-K answers (See Surajit et. Al. CIDR 2005 who argue for a whole new query algebra to help support top-K query processing) WHIRL queries: WHIRL queries Assume two relations: review(movieTitle,reviewText): archive of reviews listing(theatre, movieTitle, showTimes, …): now showing WHIRL queries: WHIRL queries “Find reviews of sci-fi comedies [movie domain] FROM review SELECT * WHERE r.text~’sci fi comedy’ (like standard ranked retrieval of “sci-fi comedy”) “ “Where is [that sci-fi comedy] playing?” FROM review as r, LISTING as s, SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’ (best answers: titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” and the review text is similar to “sci-fi comedy”) WHIRL queries: WHIRL queries Similarity is based on TFIDF rare words are most important. Search for high-ranking answers uses inverted indices…. WHIRL queries: WHIRL queries Similarity is based on TFIDF rare words are most important. Search for high-ranking answers uses inverted indices…. - It is easy to find the (few) items that match on “important” terms - Search for strong matches can prune “unimportant terms” WHIRL results: WHIRL results This sort of worked: Interactive speeds (<0.3s/q) with a few hundred thousand tuples. For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains. Average precision better than 90% on 5-way joinsWHIRL and soft integration: WHIRL and soft integration WHIRL worked for a number of web-based demo applications. e.g., integrating data from 30-50 smallish web DBs with <1 FTE labor WHIRL could link many data types reasonably well, without engineering WHIRL generated numerous papers (Sigmod98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001) WHIRL was relational But see ELIXIR (SIGIR2001) WHIRL users need to know schema of source DBs WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics Text fields with few misspellimgs WHIRL was memory-based all data must be centrally stored—no federated data. small datasets onlyString Similarity Metrics: String Similarity Metrics Tf-idf measures are not really very good at handling similarity between “short textual attributes” (e.g. titles) String similarity metrics are more suitable String similarity can be handled in terms of Edit distance (# of primitive ops such as “backspace”, “overtype”) needed to convert one string into another N-gram distance (see next slide)N-gram distance: N-gram distance An n-gram of a string is a contiguous n-character subsequence of the string 3 grams of string “hitchhiker” are {hit, itc, tch, chh, hhi, hik, ike, ker} “space” can be treated as a special character A string S can be represented as a set of its n-grams Similarity between two strings can be defined in terms of the similarity between the sets Can do jaccard similarity N-grams are to strings what K-shingles are to documents Document duplicate detection is often done in terms of the set similarity between its shingles Each shingle is hashed to a hash signature. A jaccard similarity is computed between the document shingle sets Useful for plagiarism detection (see Turnitin software does it..)Performance: PerformanceQuery Processing in Data Integration(Gathering and Using Source Statistics): Query Processing in Data Integration (Gathering and Using Source Statistics) Query Optimization Challenges: Query Optimization Challenges -- Deciding what to optimize --Getting the statistics on sources --Doing the optimization What to Optimize: What to Optimize Traditional DB optimizers compare candidate plans purely in terms of the time they take to produce all answers to a query. In Integration scenarios, the optimization is “multi-objective” Total time of execution Cost to first few tuples Often, the users are happier with plans that give first tuples faster Coverage of the plan Full coverage is no longer an iron-clad requirement Too many relevant sources, Uncontrolled overlap between the sources Can’t call them all! (Robustness, Access premiums…)Roadmap: Roadmap We will first focus on optimization issues in vertical integration (“data aggregation” ) scenarios Learning source statistics Using them to do source selection Then move to optimization issues in horizontal integration (“data linking”) scenarios. Join optimization issues in data integration scenariosQuery Processing Issues in Data Aggregation: Query Processing Issues in Data Aggregation Recall that in DA, all sources are exporting fragments of the same relation R E.g. Employment opps; bibliography records; item/price records etc The fragment of R exported by a source may have fewer columns and/or fewer rows The main issue in DA is “Source Selection” Given a query q, which source(s) should be selected and in what order Objective: Call the least number of sources that will give most number of high-quality tuples in the least amount of time Decision version: Call k sources that …. Quality of tuples– may be domain specific (e.g. give lowest price records) or domain independent (e.g. give tuples with fewest null values) Issues affecting Source Selection in DA: Issues affecting Source Selection in DA Source Overlap In most cases you want to avoid calling overlapping sources …but in some cases you want to call overlapping sources E.g. to get as much information about a tuple as possible; to get the lowest priced tuple etc. Source latency You want to call sources that are likely to respond fast Source quality You want to call sources that have high quality data Domain independent: E.g. High density (fewer null values) Domain specific E.g. sources having lower cost books Source “consistency”? Exports data that is error free Learning Source Statistics: Learning Source Statistics Coverage, overlap, latency, density and quality statistics about sources are not likely to be exported by sources! Need to learn them Most of the statistics are source and query specific Coverage and Overlap of a source may depend on the query Latency may depend on the query Density may depend on the query Statistics can be learned in a qualitative or quantitative way LCW vs. coverage/overlap statistics Feasible access patterns vs. binding pattern specific latency statistics Quantitative is more general and amenable to learning Too costly to learn statistics w.r.t. each specific query Challenge: Find right type of query classes with respect to which statistics are learned Query class definition may depend on the type of statistics Since sources, user population and network are all changing, statistics need to be maintained (through incremental changes)Source Limitations: Source Limitations Sources are not really fully-relational databases Legacy systems Limited access patters (Can’s ask a white-pages source for the list of all numbers) Limited local processing power Typically only selections (on certain attributes) are supported Access limitations modeled in terms of allowed (“feasible”) binding patterns with which the source can be accessed E.g. S(X,Y,Z) with feasible patterns f,f,b or b,b,f Access Restrictions & Recursive Reformulations: Access Restrictions & Recursive Reformulations Create Source S1 as select * from Cites given paper1 Create Source S2 as select paper from ASU-Papers Create Source S3 as select paper from AwardPapers given paper Query: select * from AwardPapers S1bf(p1,p2) :- cites(p1,p2) S2(p) :- Asp(p) S3b(p) :- Awp(p) Q(p) :- Awp(p) Awp(p) :- S3b(p) Asp(p) :- S2(p) Cites(p1,p2) :- S1bf(p) Dom(p) :- S2(p) Dom(p) :- Dom(p1), S1(p1,p) [Kwok&Weld, 96; Duschka &Levy, 97]Learning & Using Source Statistics: Learning & Using Source Statistics Managing Source Overlap: Managing Source Overlap Often, sources on the Internet have overlapping contents The overlap is not centrally managed (unlike DDBMS—data replication etc.) Reasoning about overlap is important for plan optimality We cannot possibly call all potentially relevant sources! Qns: How do we characterize, get and exploit source overlap? Qualitative approaches (LCW statements) Quantitative approaches (Coverage/Overlap statistics)Local Completeness Information: Local Completeness Information If sources are incomplete, we need to look at each one of them. Often, sources are locally complete. Movie(title, director, year) complete for years after 1960, or for American directors. Question: given a set of local completeness statements, is a query Q’ a complete answer to Q? True source contents Advertised description Guarantees (LCW; Inter-source comparisons) Problems: 1. Sources may not be interested in giving these! Need to learn hard to learn! 2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is uselessQuantitative ways of modeling inter-source overlap: Quantitative ways of modeling inter-source overlap Coverage & Overlap statistics [Koller et. al., 97] S1 has 80% of the movies made after 1960; while S2 has 60% of the movies S1 has 98% of the movies stored in S2 Computing cardinalities of unions given intersections Who gives these statistics? -Third party -Probing BibFinder Case Study: BibFinder Case Study See the bibfinder slidesCase Study: BibFinder: Case Study: BibFinder BibFinder: A popular CS bibliographic mediator Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer More than 58000 real user queries collected Mediated schema relation in BibFinder: paper(title, author, conference/journal, year) Primary key: title+author+year Focus on Selection queries Q(title, author, year) :- paper(title, author, conference/journal, year), conference=SIGMOD Selecting top-K sources for a given query: Selecting top-K sources for a given query Given a query Q, and sources S1….Sn, we need the coverage and overlap statistics of sources Si w.r.t. Q P(S|Q) is the coverage (Probability that a random tuple belonging to Q is exported by source S) P({S1..Sj}|Q) is the overlap between S1..Sj w.r.t. query Q (Probability that a random tuple belonging to Q is exported by all the sources S1..Sj). If we have the coverage and overlap statistics, then it is possible to pick the top-K sources that will give maximal number of tuples for Q. Computing Effective Coverage provided by a set of sources: Computing Effective Coverage provided by a set of sources Suppose we are calling 3 sources S1, S2, S3 to answer a query Q. The effective coverage we get is P(S1US2US3|Q). In order to compute this union, we need the intersection (overlap) statistics (in addition to the coverage statistics) Given the above, we can pick the optimal 3-sources for answering Q by considering all 3-sized subsets of source set S1….Sn, and picking the set with highest coverageSelecting top-K sources: the greedy way: Selecting top-K sources: the greedy way Selecting optimal K sources is hard in general. One way to reduce cost is to select sources greedily, one after other. For example, to select 3 sources, we select first source Si as the source with highest P(Si|Q) value. To pick the jth source, we will compute the residual coverage of each of the remaining sources, given the 1,2…j-1 sources we have already picked The residual coverage computation requires overlap statistics). For example picking a third source in the context of sources S1 and S2 will require us to calculate:Response time can depend on the query type: Response time can depend on the query type Range queries on year Effect of binding author field --Response times can also depend on the time of the day, and the day of the week.Multi-objective Query optimization: Multi-objective Query optimization Need to optimize queries jointly for both high coverage and low response time Staged optimization won’t quite work. An idea: Make the source selection be dependent on both (residual)coverage and response timeResults on BibFinder: Results on BibFinderChallenges in gathering overlap statistics: Challenges in gathering overlap statistics Sources are incomplete and partially overlapping Calling every possible source is inefficient and impolite Need coverage and overlap statistics to figure out what sources are most relevant for every possible query! We introduce a frequency-based approach for mining these statisticsSlide40: Outline Motivation BibFinder/StatMiner Architecture StatMiner Approach Automatically learning AV Hierarchies Discovering frequent query classes Learning coverage and overlap Statistics Using Coverage and Overlap Statistics StatMiner evaluation with BibFinder Related Work ConclusionMotivation: Motivation We introduce StatMiner A threshold based hierarchical mining approach Store statistics w.r.t. query classes Keep more accurate statistics for more frequently asked queries Handling the efficiency and accuracy tradeoffs by adjusting the thresholdsSlide42: BibFinder/StatMiner Slide43: Query List Each query q corresponds to a Vector of coverage/overlap Statistics. If there are 3 sources S1, s2, s3, we have: [P(S1|q),P(S2|q), P(S3|q) P(S1&S2|q), P(S2&S3|q) P(S1&S3|q) P(S1&S2&S3|q) ] A sparse vector with exponential dimensions By keeping thresholds on min overlap, we can avoid remembering small values The larger the thresholds, the sparser the vectorsIssues in Storing & Using Statistics: Issues in Storing & Using Statistics Storing statistics for each query is disadvantageous Too many queries Stored statistics can only be useful if the same query comes up again Idea1: Focus on only frequently asked queries Idea 2: Store statistics w.r.t. query classes Generate query classes by clustering.. When a new query comes, we can map it to some existing query classes But Clustering directly on queries won’t work Because we won’t know how to map a new query into existing query classes Idea: First do “subspace” clustering—cluster attribute values A query class is then defined as a cross product of attribute value clustersSlide45: AV Hierarchies and Query ClassesSlide46: StatMiner A query is a vector of overlap statisticsSlide47: Learned Conference HierarchySlide48: Using Coverage and Overlap Statistics to Rank SourcesSlide49: Outline Motivation BibFinder/StatMiner Architecture StatMiner Approach Automatically learning AV Hierarchies Discovering frequent query classes Learning coverage and overlap Statistics Using Coverage and Overlap Statistics StatMiner evaluation with BibFinder Related Work ConclusionBibFinder/StatMiner Evaluation: BibFinder/StatMiner Evaluation Experimental setup with BibFinder: Mediator relation: Paper(title,author,conference/journal,year) 25000 real user queries are used. Among them 4500 queries are randomly chosen as test queries. AV Hierarchies for all of the four attributes are learned automatically. 8000 distinct values in author, 1200 frequent asked keywords itemsets in title, 600 distinct values in conference/journal, and 95 distinct values in year.Slide51: Learned Conference HierarchySpace Consumption for Different minfreq and minoverlap: Space Consumption for Different minfreq and minoverlap We use a threshold on the support of a class, called minfreq, to identify frequent classes We use a minimum support threshold minoverlap to prune overlap statistics for uncorrelated source sets. As we increase any of the these two thresholds, the memory consumption drops, especially in the beginning.Accuracy of the Learned Statistics: Accuracy of the Learned Statistics Absolute Error No dramatic increases Keeping very detailed overlap statistics would not necessarily increase the accuracy while requiring much more space. For example: minfreq=0.13 and minoverlap=0.1 versus minfreq=0.33 and minoverlap=0Plan Precision: Plan Precision Here we observe the average precision of the top-2 source plans The plans using our learned statistics have high precision compared to random select, and it decreases very slowly as we change the minfreq and minoverlap threshold.Slide55: Plan Precision on Controlled Sources We observer the plan precision of top-5 source plans (totally 25 simulated sources). Using greedy select do produce better plans. See Section 3.8 and Section 3.9 for detailed informationNumber of Distinct Results: Number of Distinct Results Here we observe the average number of distinct results of top-2 source plans. Our methods gets on average 50 distinct answers, while random search gets only about 30 answers.Applications: Applications Path Selection in Bioinformatics [LNRV03] More and More Bioinformatics sources available on Internet Thousands of paths existing for answering users’ queries Path Coverage and Overlap Statistics are needed Text Database Selection in Information Retrieval StatMiner can provide a better way of learning and storing representatives of the databases Main Ideas Maintain a query list and discover frequent asked keyword-sets Learn keyword-set hierarchy based on the statistics distance Learn and store coverage (document frequency) for frequent asked keyword-set classes. A new query will be mapped to a set of close classes and use their statistics to estimate statistics for the query. Advantages Multiple-word-term & ScalabilityLatency statistics(Or what good is coverage without good response time?): Latency statistics (Or what good is coverage without good response time?) Sources vary significantly in terms of their response times The response time depends both on the source itself, as well as the query that is asked of it Specifically, what fields are bound in the selection query can make a difference ..So, learn statistics w.r.t. binding patterns Query Binding Patterns: Query Binding Patterns A binding pattern refers to which arguments of a relational query are “bound” Given a relation S(X,Y,Z) A query S(“Rao”, Y, “Tom”) has binding pattern bfb A query S(X,Y, “TOM”) has binding pattern ffb Binding patterns can be generalized to take “types” of bindings E.g. S(X,Y,1) may be ffn (n being numeric binding) and S(X,Y, “TOM”) may be ffs (s being string binding) Sources tend to have different latencies based on the binding pattern In extreme cases, certain binding patterns may have infinite latency (i.e., you are not allowed to ask that query) Called “infeasible” binding patterns(Digression): (Digression) LCWs are the “qualitative” versions of quantitative coverage/overlap statistics Feasible binding patterns are “qualitative” versions of quantitative latency statisticsBinding-specific latency stats are more effective: Binding-specific latency stats are more effective Combining coverage and response time: Combining coverage and response time Qn: How do we define an optimal plan in the context of both coverage/overlap and response time requirements? An instance of “multi-objective” optimization General solution involves presenting a set of “pareto-optimal” solutions to the user and let her decide Pareto-optimal set is a set of solutions where no solution is dominated by another one in all optimization dimensions (i.e., both better coverage and lower response time) Another idea is to combine both objectives into a single weighted objective Combining Coverage & Repsonse Time:Two Models: Combining Coverage & Repsonse Time: Two ModelsIt is possible to optimize for first tuples: It is possible to optimize for first tuplesDifferent “kinds” of plans: Different “kinds” of plans You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
s07 query processing in data integration Sudiksha Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 209 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 18, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Supporting keyword queries over databases: A simpl{e/istic} first step: Supporting keyword queries over databases: A simpl{e/istic} first step As you may have heard, Google struck deals with AZ, UT and some other states to make their public records databases searchable by “Google Users” Clearly the databases are not going to be searched with SQL queries, but rather with keyword search. How does one do keyword search over database records? One simple idea is to Generate the universal relation of the database (i.e., de-normalize and create a single big table) Write out each tuple as a simple HTML file Support keyword queries on these HTML files Looks like an obviously dumb idea (we are destroying the structure in the tuple by writing it out as an html file!) So we need a “smart-sounding” name for the idea We shall call it “SURFACING THE DEEP WEB” We surface the deepweb into HTML pages since we have a very nice hammer in terms of keyword based search that can then be used on those pages… Slide added after the class discussionSoft Joins..WHIRL [Cohen]: Soft Joins..WHIRL [Cohen] We can extend the notion of Joins to “Similarity Joins” where similarity is measured in terms of vector similarity over the text attributes. So, the join tuples are output in a ranked form—with the rank proportional to the similarity Neat idea… but does have some implementation difficulties Most tuples in the cross-product will have non-zero similarities. So, need query processing that will somehow just produce highly ranked tuples Uses A*-search to focus on top-K answers (See Surajit et. Al. CIDR 2005 who argue for a whole new query algebra to help support top-K query processing) WHIRL queries: WHIRL queries Assume two relations: review(movieTitle,reviewText): archive of reviews listing(theatre, movieTitle, showTimes, …): now showing WHIRL queries: WHIRL queries “Find reviews of sci-fi comedies [movie domain] FROM review SELECT * WHERE r.text~’sci fi comedy’ (like standard ranked retrieval of “sci-fi comedy”) “ “Where is [that sci-fi comedy] playing?” FROM review as r, LISTING as s, SELECT * WHERE r.title~s.title and r.text~’sci fi comedy’ (best answers: titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” and the review text is similar to “sci-fi comedy”) WHIRL queries: WHIRL queries Similarity is based on TFIDF rare words are most important. Search for high-ranking answers uses inverted indices…. WHIRL queries: WHIRL queries Similarity is based on TFIDF rare words are most important. Search for high-ranking answers uses inverted indices…. - It is easy to find the (few) items that match on “important” terms - Search for strong matches can prune “unimportant terms” WHIRL results: WHIRL results This sort of worked: Interactive speeds (<0.3s/q) with a few hundred thousand tuples. For 2-way joins, average precision (sort of like area under precision-recall curve) from 85% to 100% on 13 problems in 6 domains. Average precision better than 90% on 5-way joinsWHIRL and soft integration: WHIRL and soft integration WHIRL worked for a number of web-based demo applications. e.g., integrating data from 30-50 smallish web DBs with <1 FTE labor WHIRL could link many data types reasonably well, without engineering WHIRL generated numerous papers (Sigmod98, KDD98, Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000, JAIR2001) WHIRL was relational But see ELIXIR (SIGIR2001) WHIRL users need to know schema of source DBs WHIRL’s query-time linkage worked only for TFIDF, token-based distance metrics Text fields with few misspellimgs WHIRL was memory-based all data must be centrally stored—no federated data. small datasets onlyString Similarity Metrics: String Similarity Metrics Tf-idf measures are not really very good at handling similarity between “short textual attributes” (e.g. titles) String similarity metrics are more suitable String similarity can be handled in terms of Edit distance (# of primitive ops such as “backspace”, “overtype”) needed to convert one string into another N-gram distance (see next slide)N-gram distance: N-gram distance An n-gram of a string is a contiguous n-character subsequence of the string 3 grams of string “hitchhiker” are {hit, itc, tch, chh, hhi, hik, ike, ker} “space” can be treated as a special character A string S can be represented as a set of its n-grams Similarity between two strings can be defined in terms of the similarity between the sets Can do jaccard similarity N-grams are to strings what K-shingles are to documents Document duplicate detection is often done in terms of the set similarity between its shingles Each shingle is hashed to a hash signature. A jaccard similarity is computed between the document shingle sets Useful for plagiarism detection (see Turnitin software does it..)Performance: PerformanceQuery Processing in Data Integration(Gathering and Using Source Statistics): Query Processing in Data Integration (Gathering and Using Source Statistics) Query Optimization Challenges: Query Optimization Challenges -- Deciding what to optimize --Getting the statistics on sources --Doing the optimization What to Optimize: What to Optimize Traditional DB optimizers compare candidate plans purely in terms of the time they take to produce all answers to a query. In Integration scenarios, the optimization is “multi-objective” Total time of execution Cost to first few tuples Often, the users are happier with plans that give first tuples faster Coverage of the plan Full coverage is no longer an iron-clad requirement Too many relevant sources, Uncontrolled overlap between the sources Can’t call them all! (Robustness, Access premiums…)Roadmap: Roadmap We will first focus on optimization issues in vertical integration (“data aggregation” ) scenarios Learning source statistics Using them to do source selection Then move to optimization issues in horizontal integration (“data linking”) scenarios. Join optimization issues in data integration scenariosQuery Processing Issues in Data Aggregation: Query Processing Issues in Data Aggregation Recall that in DA, all sources are exporting fragments of the same relation R E.g. Employment opps; bibliography records; item/price records etc The fragment of R exported by a source may have fewer columns and/or fewer rows The main issue in DA is “Source Selection” Given a query q, which source(s) should be selected and in what order Objective: Call the least number of sources that will give most number of high-quality tuples in the least amount of time Decision version: Call k sources that …. Quality of tuples– may be domain specific (e.g. give lowest price records) or domain independent (e.g. give tuples with fewest null values) Issues affecting Source Selection in DA: Issues affecting Source Selection in DA Source Overlap In most cases you want to avoid calling overlapping sources …but in some cases you want to call overlapping sources E.g. to get as much information about a tuple as possible; to get the lowest priced tuple etc. Source latency You want to call sources that are likely to respond fast Source quality You want to call sources that have high quality data Domain independent: E.g. High density (fewer null values) Domain specific E.g. sources having lower cost books Source “consistency”? Exports data that is error free Learning Source Statistics: Learning Source Statistics Coverage, overlap, latency, density and quality statistics about sources are not likely to be exported by sources! Need to learn them Most of the statistics are source and query specific Coverage and Overlap of a source may depend on the query Latency may depend on the query Density may depend on the query Statistics can be learned in a qualitative or quantitative way LCW vs. coverage/overlap statistics Feasible access patterns vs. binding pattern specific latency statistics Quantitative is more general and amenable to learning Too costly to learn statistics w.r.t. each specific query Challenge: Find right type of query classes with respect to which statistics are learned Query class definition may depend on the type of statistics Since sources, user population and network are all changing, statistics need to be maintained (through incremental changes)Source Limitations: Source Limitations Sources are not really fully-relational databases Legacy systems Limited access patters (Can’s ask a white-pages source for the list of all numbers) Limited local processing power Typically only selections (on certain attributes) are supported Access limitations modeled in terms of allowed (“feasible”) binding patterns with which the source can be accessed E.g. S(X,Y,Z) with feasible patterns f,f,b or b,b,f Access Restrictions & Recursive Reformulations: Access Restrictions & Recursive Reformulations Create Source S1 as select * from Cites given paper1 Create Source S2 as select paper from ASU-Papers Create Source S3 as select paper from AwardPapers given paper Query: select * from AwardPapers S1bf(p1,p2) :- cites(p1,p2) S2(p) :- Asp(p) S3b(p) :- Awp(p) Q(p) :- Awp(p) Awp(p) :- S3b(p) Asp(p) :- S2(p) Cites(p1,p2) :- S1bf(p) Dom(p) :- S2(p) Dom(p) :- Dom(p1), S1(p1,p) [Kwok&Weld, 96; Duschka &Levy, 97]Learning & Using Source Statistics: Learning & Using Source Statistics Managing Source Overlap: Managing Source Overlap Often, sources on the Internet have overlapping contents The overlap is not centrally managed (unlike DDBMS—data replication etc.) Reasoning about overlap is important for plan optimality We cannot possibly call all potentially relevant sources! Qns: How do we characterize, get and exploit source overlap? Qualitative approaches (LCW statements) Quantitative approaches (Coverage/Overlap statistics)Local Completeness Information: Local Completeness Information If sources are incomplete, we need to look at each one of them. Often, sources are locally complete. Movie(title, director, year) complete for years after 1960, or for American directors. Question: given a set of local completeness statements, is a query Q’ a complete answer to Q? True source contents Advertised description Guarantees (LCW; Inter-source comparisons) Problems: 1. Sources may not be interested in giving these! Need to learn hard to learn! 2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is uselessQuantitative ways of modeling inter-source overlap: Quantitative ways of modeling inter-source overlap Coverage & Overlap statistics [Koller et. al., 97] S1 has 80% of the movies made after 1960; while S2 has 60% of the movies S1 has 98% of the movies stored in S2 Computing cardinalities of unions given intersections Who gives these statistics? -Third party -Probing BibFinder Case Study: BibFinder Case Study See the bibfinder slidesCase Study: BibFinder: Case Study: BibFinder BibFinder: A popular CS bibliographic mediator Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect, Network Bibliography, CSB, CiteSeer More than 58000 real user queries collected Mediated schema relation in BibFinder: paper(title, author, conference/journal, year) Primary key: title+author+year Focus on Selection queries Q(title, author, year) :- paper(title, author, conference/journal, year), conference=SIGMOD Selecting top-K sources for a given query: Selecting top-K sources for a given query Given a query Q, and sources S1….Sn, we need the coverage and overlap statistics of sources Si w.r.t. Q P(S|Q) is the coverage (Probability that a random tuple belonging to Q is exported by source S) P({S1..Sj}|Q) is the overlap between S1..Sj w.r.t. query Q (Probability that a random tuple belonging to Q is exported by all the sources S1..Sj). If we have the coverage and overlap statistics, then it is possible to pick the top-K sources that will give maximal number of tuples for Q. Computing Effective Coverage provided by a set of sources: Computing Effective Coverage provided by a set of sources Suppose we are calling 3 sources S1, S2, S3 to answer a query Q. The effective coverage we get is P(S1US2US3|Q). In order to compute this union, we need the intersection (overlap) statistics (in addition to the coverage statistics) Given the above, we can pick the optimal 3-sources for answering Q by considering all 3-sized subsets of source set S1….Sn, and picking the set with highest coverageSelecting top-K sources: the greedy way: Selecting top-K sources: the greedy way Selecting optimal K sources is hard in general. One way to reduce cost is to select sources greedily, one after other. For example, to select 3 sources, we select first source Si as the source with highest P(Si|Q) value. To pick the jth source, we will compute the residual coverage of each of the remaining sources, given the 1,2…j-1 sources we have already picked The residual coverage computation requires overlap statistics). For example picking a third source in the context of sources S1 and S2 will require us to calculate:Response time can depend on the query type: Response time can depend on the query type Range queries on year Effect of binding author field --Response times can also depend on the time of the day, and the day of the week.Multi-objective Query optimization: Multi-objective Query optimization Need to optimize queries jointly for both high coverage and low response time Staged optimization won’t quite work. An idea: Make the source selection be dependent on both (residual)coverage and response timeResults on BibFinder: Results on BibFinderChallenges in gathering overlap statistics: Challenges in gathering overlap statistics Sources are incomplete and partially overlapping Calling every possible source is inefficient and impolite Need coverage and overlap statistics to figure out what sources are most relevant for every possible query! We introduce a frequency-based approach for mining these statisticsSlide40: Outline Motivation BibFinder/StatMiner Architecture StatMiner Approach Automatically learning AV Hierarchies Discovering frequent query classes Learning coverage and overlap Statistics Using Coverage and Overlap Statistics StatMiner evaluation with BibFinder Related Work ConclusionMotivation: Motivation We introduce StatMiner A threshold based hierarchical mining approach Store statistics w.r.t. query classes Keep more accurate statistics for more frequently asked queries Handling the efficiency and accuracy tradeoffs by adjusting the thresholdsSlide42: BibFinder/StatMiner Slide43: Query List Each query q corresponds to a Vector of coverage/overlap Statistics. If there are 3 sources S1, s2, s3, we have: [P(S1|q),P(S2|q), P(S3|q) P(S1&S2|q), P(S2&S3|q) P(S1&S3|q) P(S1&S2&S3|q) ] A sparse vector with exponential dimensions By keeping thresholds on min overlap, we can avoid remembering small values The larger the thresholds, the sparser the vectorsIssues in Storing & Using Statistics: Issues in Storing & Using Statistics Storing statistics for each query is disadvantageous Too many queries Stored statistics can only be useful if the same query comes up again Idea1: Focus on only frequently asked queries Idea 2: Store statistics w.r.t. query classes Generate query classes by clustering.. When a new query comes, we can map it to some existing query classes But Clustering directly on queries won’t work Because we won’t know how to map a new query into existing query classes Idea: First do “subspace” clustering—cluster attribute values A query class is then defined as a cross product of attribute value clustersSlide45: AV Hierarchies and Query ClassesSlide46: StatMiner A query is a vector of overlap statisticsSlide47: Learned Conference HierarchySlide48: Using Coverage and Overlap Statistics to Rank SourcesSlide49: Outline Motivation BibFinder/StatMiner Architecture StatMiner Approach Automatically learning AV Hierarchies Discovering frequent query classes Learning coverage and overlap Statistics Using Coverage and Overlap Statistics StatMiner evaluation with BibFinder Related Work ConclusionBibFinder/StatMiner Evaluation: BibFinder/StatMiner Evaluation Experimental setup with BibFinder: Mediator relation: Paper(title,author,conference/journal,year) 25000 real user queries are used. Among them 4500 queries are randomly chosen as test queries. AV Hierarchies for all of the four attributes are learned automatically. 8000 distinct values in author, 1200 frequent asked keywords itemsets in title, 600 distinct values in conference/journal, and 95 distinct values in year.Slide51: Learned Conference HierarchySpace Consumption for Different minfreq and minoverlap: Space Consumption for Different minfreq and minoverlap We use a threshold on the support of a class, called minfreq, to identify frequent classes We use a minimum support threshold minoverlap to prune overlap statistics for uncorrelated source sets. As we increase any of the these two thresholds, the memory consumption drops, especially in the beginning.Accuracy of the Learned Statistics: Accuracy of the Learned Statistics Absolute Error No dramatic increases Keeping very detailed overlap statistics would not necessarily increase the accuracy while requiring much more space. For example: minfreq=0.13 and minoverlap=0.1 versus minfreq=0.33 and minoverlap=0Plan Precision: Plan Precision Here we observe the average precision of the top-2 source plans The plans using our learned statistics have high precision compared to random select, and it decreases very slowly as we change the minfreq and minoverlap threshold.Slide55: Plan Precision on Controlled Sources We observer the plan precision of top-5 source plans (totally 25 simulated sources). Using greedy select do produce better plans. See Section 3.8 and Section 3.9 for detailed informationNumber of Distinct Results: Number of Distinct Results Here we observe the average number of distinct results of top-2 source plans. Our methods gets on average 50 distinct answers, while random search gets only about 30 answers.Applications: Applications Path Selection in Bioinformatics [LNRV03] More and More Bioinformatics sources available on Internet Thousands of paths existing for answering users’ queries Path Coverage and Overlap Statistics are needed Text Database Selection in Information Retrieval StatMiner can provide a better way of learning and storing representatives of the databases Main Ideas Maintain a query list and discover frequent asked keyword-sets Learn keyword-set hierarchy based on the statistics distance Learn and store coverage (document frequency) for frequent asked keyword-set classes. A new query will be mapped to a set of close classes and use their statistics to estimate statistics for the query. Advantages Multiple-word-term & ScalabilityLatency statistics(Or what good is coverage without good response time?): Latency statistics (Or what good is coverage without good response time?) Sources vary significantly in terms of their response times The response time depends both on the source itself, as well as the query that is asked of it Specifically, what fields are bound in the selection query can make a difference ..So, learn statistics w.r.t. binding patterns Query Binding Patterns: Query Binding Patterns A binding pattern refers to which arguments of a relational query are “bound” Given a relation S(X,Y,Z) A query S(“Rao”, Y, “Tom”) has binding pattern bfb A query S(X,Y, “TOM”) has binding pattern ffb Binding patterns can be generalized to take “types” of bindings E.g. S(X,Y,1) may be ffn (n being numeric binding) and S(X,Y, “TOM”) may be ffs (s being string binding) Sources tend to have different latencies based on the binding pattern In extreme cases, certain binding patterns may have infinite latency (i.e., you are not allowed to ask that query) Called “infeasible” binding patterns(Digression): (Digression) LCWs are the “qualitative” versions of quantitative coverage/overlap statistics Feasible binding patterns are “qualitative” versions of quantitative latency statisticsBinding-specific latency stats are more effective: Binding-specific latency stats are more effective Combining coverage and response time: Combining coverage and response time Qn: How do we define an optimal plan in the context of both coverage/overlap and response time requirements? An instance of “multi-objective” optimization General solution involves presenting a set of “pareto-optimal” solutions to the user and let her decide Pareto-optimal set is a set of solutions where no solution is dominated by another one in all optimization dimensions (i.e., both better coverage and lower response time) Another idea is to combine both objectives into a single weighted objective Combining Coverage & Repsonse Time:Two Models: Combining Coverage & Repsonse Time: Two ModelsIt is possible to optimize for first tuples: It is possible to optimize for first tuplesDifferent “kinds” of plans: Different “kinds” of plans