logging in or signing up pubsubfinal Jolene Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 53 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 15, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... By: zippo (47 month(s) ago) very good!! i am a student ,and very interested in the special topic. will you please send this document to my e-mail address :haiininwan@sina.com.cn thank you very much!! Saving..... Post Reply Close Saving..... Edit Comment Close By: zippo (47 month(s) ago) very good!! i am a student ,and very interested in the special topic what it will you please send this document to my e-mail address :haiininwan@sina.com.cn thank you very much!! Saving..... Post Reply Close Saving..... Edit Comment Close Premium member Presentation Transcript Continuous Queries and Publish/Subscribe: Continuous Queries and Publish/Subscribe Aaron Hirshfield Salinee (NOI) Jencharat Ened Ketri Esther Ryvkina Miriam SpeertPresentation Outline: Presentation Outline 1:40 - 1:50 Introduction to Pub/Sub 1:50 - 2:25 EXPRESSION/EVALUATE 2:25 - 3:10 TAPESTRY 3:10 - 3:20 Break 3:20 - 4:00 Efficient Filtering of XML 4:00 - 4:30 DiscussionReferences: References Aguilera, M., Strom, R., Sturman, D., Astley, M., and Chandra, T. “Matching Events in a Content-based Subscription System”. 18th ACM Symposium on Principles of Distributed Computing, 1999: 53-61 Fabret, F., Jacobsen, H.A., Llirbat, F., Pereira, J. “Filtering Algorithms and Implementation for Very Fast Publish/Subscribe System”. ACM SIGMOD 2001: 115-126. Yalamanchi, A., Srinivasan, J., Gawlick, D. “Managing Expressions as Data in Relational Database Systems”. Proceedings of the 2003 CIDR Conference.References (Continued): References (Continued) Douglas B. Terry, David Goldberg, David Nichols and Brian M. Oki. Continuous Queries over Append-Only Databases. SIGMOD Conference 1992: 321-330. What is Publish/Subscribe?: What is Publish/Subscribe? Connecting information providers to information consumers Delivering published events to subscribers who expressed interests in such events. Group-Based: Group-Based Publishers categorize events into fixed groups such as subjects, channels, or topics. Subscribers choose certain a group(s) and then obtain every event in the specified group(s).Content-Based : Content-Based No pre-defined groups - subscribers can filter criteria along multiple dimensions. No administrative overhead for publishers to maintain groups. Issues: Issues Large number of subscribers Large number/high rates of events Minimize delay Simple and expressive subscription Changing subscriptionHow does it relate to us?: How does it relate to us? Subset of stream database with only select queries Events in form of streams Subscribers’ interests in form of continuous queries or possibly as another streamManaging Expressions as Data in Relational Database Systems: Managing Expressions as Data in Relational Database Systems Aravind Yalamanchi Jagannathan Srinivasan Dieter GawlickFirst, a note on efficiency…: First, a note on efficiency… Save: time space computation computation resourcesMotivation: Motivation Sophisticated Publish/Subscribe systems do not currently exist Authors propose approach to Pub/Sub that allows greater expressiveness Idea:: Idea: I want a list of movies meeting the following criteria: G Rating Cartoon Shorter than 3 hours Could execute a query, or…Different Approach: Different Approach … we could store the query (as data)! Arrival of new data search DB for queries that are satisfied Notify users who issued matching queriesDefinition: Expressions: Definition: Expressions Boolean conditions characterizing user’s interests Syntax-equivalent to SQL-WHERE clause Stored in a column of database table Treated as dataMore on Expressions: More on Expressions can reference variables and functions Example: Tivo/Digital Guide ratingReview(movie) = 4 and Day = 6 and Length(movie) < 200; finds movies that received 4 stars, less than 200 minutes, and are being shown on day 6 (Saturday)Expression Data Type: Expression Data Type Conditional expression stored as text Associated with Expression Set (set of expressions in db) Metadata (variables and functions common to set)Expression Metadata: Expression Metadata Expressions share a common list of variables Metadata made up of these variables, their data types and functions Expressions can use this Metadata in their predicates Expression Metadata Cont’d: Expression Metadata Cont’d Metadata determines evaluation context of expressions in a column Insertion or update to expression: expression checked against metadata to ensure its validity Expression Metadata: Expression Metadata Variable names and their data type List of user-defined Function Definition: EVALUATE: Definition: EVALUATE An operation comparable to (eval) in LISP/Scheme Equivalent SQL Query metadata in FROM clause expression in WHERE clause SELECT 1 FROM (SELECT :Genre as Genre, :Rating as Rating FROM <any_table_with_one_row>) WHERE Genre = ‘Comedy’ and Rating = ‘PG’EVALUATE Operator: EVALUATE Operator Evaluates an expression into 0 or 1. Takes 2 arguments text representation of expression Data Item (String or AnyData type) SELECT * FROM consumer WHERE EVALUATE( consumer.interest, ‘Genre => “COMEDY”, Rating => “PG”’ ) = 1 SELECT * FROM consumer WHERE EVALUATE( consumer.interest, AnyData.convertObject( Movie(‘Comedy’,’PG’))= 1 Use of EVALUTE in SQL: Use of EVALUTE in SQL Use of aggregates to allow complex expressions. using GROUP BY or ORDER BY on the customer’s other attributes SELECT * FROM consumer WHERE EVALUATE(consumer.interest, <movie details> ) = 1 Order BY consumer.ageUse of EVALUTE (continued): Use of EVALUTE (continued) Can be combined with other predicates to allow “Mutual Filtering” (Publishers can filter data, not only Subscribers) notifying only customers with age >= 17 about the R-rated movies Use of EVALUTE (continued): Use of EVALUTE (continued) Perform JOINS on multiple relations order the movies in the newMovie table by the number of subscribers interested in the type of the movie (using count(*) in SQL) SELECT DISTINCT (newMovie.id), count(*) as demand FROM consumer, newMovie WHERE EVALUATE(consumer.interest, <movie details from newMovie table> ) = 1 GROUP BY newMove.id ORDER BY demand DESCUse of EVALUTE (continued): Use of EVALUTE (continued) Maintain complex (many-to-many) relationship between multiple tables. joining customer and movie tables to create a listing of all the actual movies from movie table that each customer is interested in. Expression Indexing: Expression Indexing As in previously seen systems, Oracle developers exploit commonalities Large set of expressions many will tend to share common parts of their predicates Index defined on group of expressions in column to process more “efficiently” with EVALUATEExpression Indexing Cont’d: Expression Indexing Cont’d Exploitation of commonalities leads to more “efficient” processing of expressions Processing costs shared across multiple predicates Example: Given 2 predicates: one wants movie rated PG and one wants movie rated R If one predicate is satisfied, the other is notExploitation of Commonalities: Exploitation of Commonalities Logical grouping of predicates by their LHS ( length(movie) < 2; group on length(movie)) LHS OP RHS Example: want movies shorter than 3 hours and movies shorter than 2 hours, movies shorter than 2 hours satisfy both predicatesPredicate Table: Predicate Table Predicates are grouped by LHS Contains 1 row for each expression Index Processing: Index Processing Data’s attribute compared with the RHS and OP columns Example: SELECT exp_id from predicate_table WHERE G2_OP = ‘=‘ and G2_RHS = :rhs_val OR G2_OP = ‘>’ and G2_RHS < :rhs_val OR … Index on Predicate Table: Index on Predicate Table We can have an index on the Predicate Table. Processing: 1. Indexed predicate group 2. Stored predicate group 3. Sparse predicate group Some expressions are filtered out after each step - Less to check with slower predicate group.Predicate Handling-Comparisons: Predicate Handling-ComparisonsPerformance Characterization: Performance Characterization Oracle says: preliminary experiments promising results for the EFI Scheme Performed best when “fine-tuned” for given expression set-- Tunable Characteristics of an index: list of common predicates list of common ops for these preds # of indexed predicatesFuture Work: Future Work Native support needed for: Expression data type EVALUATE operatorWhy Does It Matter?(What to take away): Why Does It Matter? (What to take away) Expression/EVALUATE allows us to list all interested subscribers for a data by just one query. Makes it easier to scale for large number of subscribers. Using Relational Database Management system for Pub/Sub.Discussion Sneak Preview: Discussion Sneak Preview What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit? Continuous Queries over Append-Only Databases: Continuous Queries over Append-Only Databases Using the Tapestry SystemTalk Topics: Talk Topics Continuous Queries TQL / Tapestry Query Transformations Results ConclusionsContinuous Queries: Continuous Queries What does ‘continuous’ mean? (Continuous Semantics) The results of a continuous query is the set of data that would be returned if the query were executed at every instant in time. Why an ‘append-only’ database? Users are being presented with answers, and we don’t want a data item to be an answer at one point, but not an answer later in time.Periodic Query Example: Periodic Query Example Show all e-mail messages that have received a reply.Problems with Periodic Queries: Problems with Periodic Queries Non-deterministic results If two people make this query, will they obtain the same results? Duplicates can be returned Every time the query is run, the same results might be returned. Inefficiency of the system The query is run over all data, not new data, resulting in slow processing time do to un-necessary work being done by the system.The Tapestry System: The Tapestry System A system that keeps a database of all mail and news messages received. Designed to filter mail and newsgroup messages. Allows users to query over this database. Since Tapestry doesn't use triggers (time is used), it can be implemented in any commercial database that supports SQL applications.TQL: TQL TQL – Tapestry Query Language Works much like SQL, with the addition of a timestamp. Every query uses DISTINCT. No aggregates at this time. Does not support outer joins. TQL monotone incremental SQLTQL Advantages: TQL Advantages Allows ad hoc queries. Can be used in a conventional relational database. Time-oriented queries are supported. Flexible scheduling of query running order.Continuous Query Example: Continuous Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. Method: Subtract old answers from new answers and return this new set of data. This is done because some past answers might be included in the ‘new’ data.Incremental Query Example: Incremental Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. (Same query as before) Method: Run the query on the new data that has arrived, not the entire database or by comparing to past answers.Solutions: Solutions Incremental Queries can eliminate duplicates. Timestamp Query Transformations To run continuous queries and solve the problem of non-determinism, TQL was developed.Query Transformation: Query Transformation Q(t) QM(t) QI (t, τ) QM(t) – monotone query QI (t, τ) – incremental query Monotone query: QM(t1) QM(t2) when t1 < t2 Minimum bounding monotone query Incremental query: QM(t)-QM(τ) QI(t, τ) QM(t) Should return enough, but not too muchMonotone queries: Monotone queries QM(t) = Us<=t Q(s) Monotone query result = = regular query + continuous semantics Not all queries are monotone, but for an append-only database many queries are Monotone queries: Monotone queries Monotone (for append-only) Boolean predicates over column values of a table. Queries involving joins. Possibly non-monotone Functions that read current time. Sub-queries prefaced by ‘NOT EXISTS.’Queries with time: Queries with time SELECT * FROM tbl WHERE tbl.field op GetDate() tbl.field < GetDate() -- monotone (<=) “Select all messages that have been replied to” tbl.field > GetDate() -- non-monotone (>=) “Select all messages that haven’t expired” =, <> -- likely to be non-monotoneQueries with NOT EXIST: Queries with NOT EXIST SELECT * FROM msgs m WHERE NOT EXISTS ( SELECT * FROM msgs m1 WHERE m1.inReplyTo = m.msgId) Same as with time “Select all messages that haven’t been replied to.” Modification of this query that makes more sense in continuous semantics: “Select all messages that haven’t been replied to for 2 weeks.” Query Transformation: Query Transformation Q(t) QM(t) QI (t, τ) Standard form of a query: SELECT … FROM table1, table2, … WHERE (E1 AND E2 AND … Ek) OR ( … ) OR … OR ( … ) If Q is in the form (P ۷ R), then QM is in form (PM ۷ RM) => Can consider each of the ‘AND’ expressions separately.QQM Rewrite Rules:handling time: QQM Rewrite Rules: handling time Assume (for now) no NOT EXISTS For each table mentioned in a query, add a clause: table.ts < t Terms that don’t contain time don’t affect monotonicity. Group all terms with time and simplify. Authorized User: should add example?? (extra slide)QQM Rewrite Rules:handling time: QQM Rewrite Rules: handling time C1 < t AND … AND Cn < t AND t < D1 AND… AND t < Dm AND P MAX (C1,…, Cn) < t AND t < MIN (D1,…, Dm ) AND P C < t AND t < D AND P C < D AND C < t AND PQQM Rewrite Rules:handling NOT EXISTS: QQM Rewrite Rules: handling NOT EXISTS … NOT EXISTS (SELECT * FROM m1, … mk WHERE R(t)) Not always possible to convert to a monotone query without changing the meaning If R(t) not monotone -- usually impossible “Select messages that have no reply” --> “Select messages that are at least N weeks old and have no reply.”QM QI Rewrite Rules: QM QI Rewrite Rules QI (t, τ) = QM(t) AND (NOT QM(τ)) -- simple but inefficient. Just check the records that arrived in (t, τ) interval – seems reasonable SELECT * FROM m WHERE P can rewrite as SELECT * FROM m WHERE P AND τ m.ts < tIn Practice: In Practice Experiments to evaluate performance were conducted. Goal: to prove effectiveness of implementation of continuous queries as incremental queries Description and results -- to followExperiments: Experiments Database 380,000 news articles (messages) 1 table -- “msgs” with schema: {msgId, from, subject, date, newsgroup, inReplyTo, ts} Idea compare times to execute a non-incremental query, vs. incremental version of the same query for different values of τ τ: 100%, 10%, 1% Time measured average of 10-20 executions Authorized User: for time -- talk about warm-up run, not including I/Os, since data in buffer cache, impossibility to flash cacheExperiments: Queries: Experiments: Queries Q1: “Select messages from the comp.databases newsgroup” (simple selection predicate) Q2: “Select messages from all of the comp.sys newsgroups” (selection predicate with LIKE) Q3: “Select messages that have a reply sent to comp.databases” (simple join) Q4: “Select messages that are 4 weeks old and have not been replied to” (selection with NOT EXIST) Q5: “Select first messages in conversation chains greater than 2 in length” (more sophisticated join)Results: Results Expected: QI (100%) comparable to Q QI (1%) QI (10%) QI (100%) In practice:Experiments: Results: Experiments: Results QI (100%) comparable to Q Holds for for Q2, Q3, Q4, Q5 For Q1: incremental uses index on ts, while regular uses index on newsgroup QI (1%) QI (10%) QI (100%) Holds for Q1, Q2, Q4 Q3, Q5: joins. Limitation of Sybase -- can’t utilize multiple indices for joins. Experiments: More Results: Experiments: More Results Sybase can’t deal efficiently with ORed attributes => uses index on newsgroup, not on ts => no improvement with smaller τ. Revised queries Q3 and Q5 perform much better. General technique for rewriting queries with joins – breaking 1 query into a few that are optimized easily. Q3 – original version SELECT m.msgId FROM msgs m, msg m1 WHERE m1.inReplyTo = m.msgId AND m1.newsgroup = “comp.databases” Q3 – incremental version SELECT m.msgId FROM msgs m, msgs m1 WHERE m1.inReplyTo = m.msgId AND m1.newsgroup = “comp.databese” AND (m.ts > OR m1.ts > ) For future discussion: For future discussion Optimization doesn’t happen automatically! Is it a serious limitation? Can you think of any ways to improve this aspect of the system?Other Experiments: Other Experiments Cost of incremental queries as a function of τ Q1: 0% to 100% Cost proportional to τ Cost of incremental queries as a function of database size Q3 (revised) Execution cost doesn’t grow with DB size (stays basically constant).Other Experiments: Other Experiments Previous experiments Simple index built on values of single fields. New experiments Composite index made up of the timestamp value appended to the field value: Optimizer doesn’t have to choose between using index on ts and the newsgroup. Better results for Q1 and Q3 (Q3 was significantly better).Conclusion: Conclusion Continuous Queries TQL / Tapestry Monotone / Incremental QueriesFuture Work: Future Work Make a system that allows for updates and deletions (use the techniques of this paper in a historical database). Supporting aggregate queries. Example – Show all e-mails that have received at least 5 reply’s.Discussion: Discussion Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ?Discussion: Discussion Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use? Next…….: Next……. After the break, we have the paper “Efficient Filtering of XML Documents for Selective Dissemination of Information” by Ened. Following this paper, we conclude with a discussion about all the papers presented today. Efficient Filtering of XML Documents for SDI: Efficient Filtering of XML Documents for SDI Mehmet Altmel, Michael J. Franklin Presentation by Ened KetriRelation to the previous papers: Relation to the previous papersOverview: Overview Introduction, streams connection Alternatives, Motivation Execution model XML use Xpath Xfilter Enhancements Empirical DataReferences: References XML http://www.w3.org/TR/REC-xml XPath http://www.w3.org/TR/path XML Query Model http://www.w3.org/TR/query-datamodel/ SDI : Selective Dissemination of Information: SDI : Selective Dissemination of Information Process of filtering and delivering documents based on user interests Data Collection (input streams) Filtering (predicates) Profiling (continuous queries) Delivery (output streams)Alternatives: Alternatives IR Matching Boolean Similarity Problems Domain Precision/Efficiency ScaleMotivation: Motivation Big improvements in bandwidth Overwhelming Information New tools available (XML related) Millions of users Huge volumes of data (Web)Example Applications: Example Applications Tickers (stocks, sports, weather) Traffic information, GPS Electronic newspapers Entertainment Delivery Streams and SDI: Streams and SDI SDBMS model Queries stored (profiles) Data comes in (XML documents from web) Predicates (filtering) Previously seen relevant systems PSoup (Query indexing) NiagaraCQ (XML oriented)Execution Model: Execution ModelNotes on architecture: Notes on architecture Profile creation Users System Huge set of profiles (millions) Efficient/Quick search capability XML documents as DataXML: XML Why Popularity (interoperability/compatibility) Implementation (well-formedness) Standard Hierarchical structure Precision (matching) EfficiencyXML: XML DTD (Document Type Definition) Example <?xml version="1.0"?> <!DOCTYPE greeting SYSTEM "hello.dtd"> <greeting>Hello, world!</greeting> </xml> Tagging Language Attributes/Elements XPathXPath: XPath XML Query Language W3C standard XML doc Tree of nodes Path expressions Matching Document applicationXPath: XPath Paths Absolute Relative Operators (/) (//) (*) ([..]) Example /movies/*//actors[age<30]/nameLooking back…: Looking back… NiagaraCQ IR based Document data (input) oriented DTD limited selectivity Triggers Scalability Matching Execution modelXFilter: XFilterXFilter: Issues: XFilter: Issues Problems Order checking Wildcards Filter evaluation (nested queries) Solution XPath FSM (Finite State Machine) Query IndexingFSM: FSM States Path nodes State Lists Candidate List, Wait ListFSM: FSMParsing : Parsing SAX Event Handlers Nested path expressionsEnhancements: Enhancements Improve Performance Scalability Solutions List Balancing PrefilteringList Balancing: List Balancing Why Poor selectivity from first elements Skewed Candidate Lists Solution Chose element to be placed in CL based on selectivity and list lengthPreFiltering: PreFiltering Why One level step analysis Unnecessary Work Solution Preliminary parsing of documents Only check queries whose elements appear in the documentWorkload parameters: Workload parametersVarying number of profiles: Varying number of profilesVarying Query/Document depth: Varying Query/Document depthVarying Wildcard Probability: Varying Wildcard ProbabilityConclusions: Conclusions Summary A Pub/Sub system, that basically filters XML documents based on user profiles Internet-scale SDI suitable for huge number of users Profiles are treated as standing queries Future work / Issues Partial results Handling joins/multiple streams List Balancing: List BalancingDiscussion Questions: Discussion Questions What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit?Discussion Questions: Discussion Questions Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ?Discussion Questions: Discussion Questions Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use?Discussion Questions: Discussion Questions Optimization doesn’t happen automatically (remember – queries with joins) Is it a serious limitation? Can you think of any ways to improve this aspect of the system?Discussion Questions: Discussion Questions What would be the optimal granularity in data being served? Is profile/CQ content updating an important feature and if so how could it be implemented? How would a structure/content description language based (XML) standard model of data sources affect Pub/Sub, CQ, SDBMS? You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
pubsubfinal Jolene Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 53 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 15, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... By: zippo (47 month(s) ago) very good!! i am a student ,and very interested in the special topic. will you please send this document to my e-mail address :haiininwan@sina.com.cn thank you very much!! Saving..... Post Reply Close Saving..... Edit Comment Close By: zippo (47 month(s) ago) very good!! i am a student ,and very interested in the special topic what it will you please send this document to my e-mail address :haiininwan@sina.com.cn thank you very much!! Saving..... Post Reply Close Saving..... Edit Comment Close Premium member Presentation Transcript Continuous Queries and Publish/Subscribe: Continuous Queries and Publish/Subscribe Aaron Hirshfield Salinee (NOI) Jencharat Ened Ketri Esther Ryvkina Miriam SpeertPresentation Outline: Presentation Outline 1:40 - 1:50 Introduction to Pub/Sub 1:50 - 2:25 EXPRESSION/EVALUATE 2:25 - 3:10 TAPESTRY 3:10 - 3:20 Break 3:20 - 4:00 Efficient Filtering of XML 4:00 - 4:30 DiscussionReferences: References Aguilera, M., Strom, R., Sturman, D., Astley, M., and Chandra, T. “Matching Events in a Content-based Subscription System”. 18th ACM Symposium on Principles of Distributed Computing, 1999: 53-61 Fabret, F., Jacobsen, H.A., Llirbat, F., Pereira, J. “Filtering Algorithms and Implementation for Very Fast Publish/Subscribe System”. ACM SIGMOD 2001: 115-126. Yalamanchi, A., Srinivasan, J., Gawlick, D. “Managing Expressions as Data in Relational Database Systems”. Proceedings of the 2003 CIDR Conference.References (Continued): References (Continued) Douglas B. Terry, David Goldberg, David Nichols and Brian M. Oki. Continuous Queries over Append-Only Databases. SIGMOD Conference 1992: 321-330. What is Publish/Subscribe?: What is Publish/Subscribe? Connecting information providers to information consumers Delivering published events to subscribers who expressed interests in such events. Group-Based: Group-Based Publishers categorize events into fixed groups such as subjects, channels, or topics. Subscribers choose certain a group(s) and then obtain every event in the specified group(s).Content-Based : Content-Based No pre-defined groups - subscribers can filter criteria along multiple dimensions. No administrative overhead for publishers to maintain groups. Issues: Issues Large number of subscribers Large number/high rates of events Minimize delay Simple and expressive subscription Changing subscriptionHow does it relate to us?: How does it relate to us? Subset of stream database with only select queries Events in form of streams Subscribers’ interests in form of continuous queries or possibly as another streamManaging Expressions as Data in Relational Database Systems: Managing Expressions as Data in Relational Database Systems Aravind Yalamanchi Jagannathan Srinivasan Dieter GawlickFirst, a note on efficiency…: First, a note on efficiency… Save: time space computation computation resourcesMotivation: Motivation Sophisticated Publish/Subscribe systems do not currently exist Authors propose approach to Pub/Sub that allows greater expressiveness Idea:: Idea: I want a list of movies meeting the following criteria: G Rating Cartoon Shorter than 3 hours Could execute a query, or…Different Approach: Different Approach … we could store the query (as data)! Arrival of new data search DB for queries that are satisfied Notify users who issued matching queriesDefinition: Expressions: Definition: Expressions Boolean conditions characterizing user’s interests Syntax-equivalent to SQL-WHERE clause Stored in a column of database table Treated as dataMore on Expressions: More on Expressions can reference variables and functions Example: Tivo/Digital Guide ratingReview(movie) = 4 and Day = 6 and Length(movie) < 200; finds movies that received 4 stars, less than 200 minutes, and are being shown on day 6 (Saturday)Expression Data Type: Expression Data Type Conditional expression stored as text Associated with Expression Set (set of expressions in db) Metadata (variables and functions common to set)Expression Metadata: Expression Metadata Expressions share a common list of variables Metadata made up of these variables, their data types and functions Expressions can use this Metadata in their predicates Expression Metadata Cont’d: Expression Metadata Cont’d Metadata determines evaluation context of expressions in a column Insertion or update to expression: expression checked against metadata to ensure its validity Expression Metadata: Expression Metadata Variable names and their data type List of user-defined Function Definition: EVALUATE: Definition: EVALUATE An operation comparable to (eval) in LISP/Scheme Equivalent SQL Query metadata in FROM clause expression in WHERE clause SELECT 1 FROM (SELECT :Genre as Genre, :Rating as Rating FROM <any_table_with_one_row>) WHERE Genre = ‘Comedy’ and Rating = ‘PG’EVALUATE Operator: EVALUATE Operator Evaluates an expression into 0 or 1. Takes 2 arguments text representation of expression Data Item (String or AnyData type) SELECT * FROM consumer WHERE EVALUATE( consumer.interest, ‘Genre => “COMEDY”, Rating => “PG”’ ) = 1 SELECT * FROM consumer WHERE EVALUATE( consumer.interest, AnyData.convertObject( Movie(‘Comedy’,’PG’))= 1 Use of EVALUTE in SQL: Use of EVALUTE in SQL Use of aggregates to allow complex expressions. using GROUP BY or ORDER BY on the customer’s other attributes SELECT * FROM consumer WHERE EVALUATE(consumer.interest, <movie details> ) = 1 Order BY consumer.ageUse of EVALUTE (continued): Use of EVALUTE (continued) Can be combined with other predicates to allow “Mutual Filtering” (Publishers can filter data, not only Subscribers) notifying only customers with age >= 17 about the R-rated movies Use of EVALUTE (continued): Use of EVALUTE (continued) Perform JOINS on multiple relations order the movies in the newMovie table by the number of subscribers interested in the type of the movie (using count(*) in SQL) SELECT DISTINCT (newMovie.id), count(*) as demand FROM consumer, newMovie WHERE EVALUATE(consumer.interest, <movie details from newMovie table> ) = 1 GROUP BY newMove.id ORDER BY demand DESCUse of EVALUTE (continued): Use of EVALUTE (continued) Maintain complex (many-to-many) relationship between multiple tables. joining customer and movie tables to create a listing of all the actual movies from movie table that each customer is interested in. Expression Indexing: Expression Indexing As in previously seen systems, Oracle developers exploit commonalities Large set of expressions many will tend to share common parts of their predicates Index defined on group of expressions in column to process more “efficiently” with EVALUATEExpression Indexing Cont’d: Expression Indexing Cont’d Exploitation of commonalities leads to more “efficient” processing of expressions Processing costs shared across multiple predicates Example: Given 2 predicates: one wants movie rated PG and one wants movie rated R If one predicate is satisfied, the other is notExploitation of Commonalities: Exploitation of Commonalities Logical grouping of predicates by their LHS ( length(movie) < 2; group on length(movie)) LHS OP RHS Example: want movies shorter than 3 hours and movies shorter than 2 hours, movies shorter than 2 hours satisfy both predicatesPredicate Table: Predicate Table Predicates are grouped by LHS Contains 1 row for each expression Index Processing: Index Processing Data’s attribute compared with the RHS and OP columns Example: SELECT exp_id from predicate_table WHERE G2_OP = ‘=‘ and G2_RHS = :rhs_val OR G2_OP = ‘>’ and G2_RHS < :rhs_val OR … Index on Predicate Table: Index on Predicate Table We can have an index on the Predicate Table. Processing: 1. Indexed predicate group 2. Stored predicate group 3. Sparse predicate group Some expressions are filtered out after each step - Less to check with slower predicate group.Predicate Handling-Comparisons: Predicate Handling-ComparisonsPerformance Characterization: Performance Characterization Oracle says: preliminary experiments promising results for the EFI Scheme Performed best when “fine-tuned” for given expression set-- Tunable Characteristics of an index: list of common predicates list of common ops for these preds # of indexed predicatesFuture Work: Future Work Native support needed for: Expression data type EVALUATE operatorWhy Does It Matter?(What to take away): Why Does It Matter? (What to take away) Expression/EVALUATE allows us to list all interested subscribers for a data by just one query. Makes it easier to scale for large number of subscribers. Using Relational Database Management system for Pub/Sub.Discussion Sneak Preview: Discussion Sneak Preview What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit? Continuous Queries over Append-Only Databases: Continuous Queries over Append-Only Databases Using the Tapestry SystemTalk Topics: Talk Topics Continuous Queries TQL / Tapestry Query Transformations Results ConclusionsContinuous Queries: Continuous Queries What does ‘continuous’ mean? (Continuous Semantics) The results of a continuous query is the set of data that would be returned if the query were executed at every instant in time. Why an ‘append-only’ database? Users are being presented with answers, and we don’t want a data item to be an answer at one point, but not an answer later in time.Periodic Query Example: Periodic Query Example Show all e-mail messages that have received a reply.Problems with Periodic Queries: Problems with Periodic Queries Non-deterministic results If two people make this query, will they obtain the same results? Duplicates can be returned Every time the query is run, the same results might be returned. Inefficiency of the system The query is run over all data, not new data, resulting in slow processing time do to un-necessary work being done by the system.The Tapestry System: The Tapestry System A system that keeps a database of all mail and news messages received. Designed to filter mail and newsgroup messages. Allows users to query over this database. Since Tapestry doesn't use triggers (time is used), it can be implemented in any commercial database that supports SQL applications.TQL: TQL TQL – Tapestry Query Language Works much like SQL, with the addition of a timestamp. Every query uses DISTINCT. No aggregates at this time. Does not support outer joins. TQL monotone incremental SQLTQL Advantages: TQL Advantages Allows ad hoc queries. Can be used in a conventional relational database. Time-oriented queries are supported. Flexible scheduling of query running order.Continuous Query Example: Continuous Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. Method: Subtract old answers from new answers and return this new set of data. This is done because some past answers might be included in the ‘new’ data.Incremental Query Example: Incremental Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. (Same query as before) Method: Run the query on the new data that has arrived, not the entire database or by comparing to past answers.Solutions: Solutions Incremental Queries can eliminate duplicates. Timestamp Query Transformations To run continuous queries and solve the problem of non-determinism, TQL was developed.Query Transformation: Query Transformation Q(t) QM(t) QI (t, τ) QM(t) – monotone query QI (t, τ) – incremental query Monotone query: QM(t1) QM(t2) when t1 < t2 Minimum bounding monotone query Incremental query: QM(t)-QM(τ) QI(t, τ) QM(t) Should return enough, but not too muchMonotone queries: Monotone queries QM(t) = Us<=t Q(s) Monotone query result = = regular query + continuous semantics Not all queries are monotone, but for an append-only database many queries are Monotone queries: Monotone queries Monotone (for append-only) Boolean predicates over column values of a table. Queries involving joins. Possibly non-monotone Functions that read current time. Sub-queries prefaced by ‘NOT EXISTS.’Queries with time: Queries with time SELECT * FROM tbl WHERE tbl.field op GetDate() tbl.field < GetDate() -- monotone (<=) “Select all messages that have been replied to” tbl.field > GetDate() -- non-monotone (>=) “Select all messages that haven’t expired” =, <> -- likely to be non-monotoneQueries with NOT EXIST: Queries with NOT EXIST SELECT * FROM msgs m WHERE NOT EXISTS ( SELECT * FROM msgs m1 WHERE m1.inReplyTo = m.msgId) Same as with time “Select all messages that haven’t been replied to.” Modification of this query that makes more sense in continuous semantics: “Select all messages that haven’t been replied to for 2 weeks.” Query Transformation: Query Transformation Q(t) QM(t) QI (t, τ) Standard form of a query: SELECT … FROM table1, table2, … WHERE (E1 AND E2 AND … Ek) OR ( … ) OR … OR ( … ) If Q is in the form (P ۷ R), then QM is in form (PM ۷ RM) => Can consider each of the ‘AND’ expressions separately.QQM Rewrite Rules:handling time: QQM Rewrite Rules: handling time Assume (for now) no NOT EXISTS For each table mentioned in a query, add a clause: table.ts < t Terms that don’t contain time don’t affect monotonicity. Group all terms with time and simplify. Authorized User: should add example?? (extra slide)QQM Rewrite Rules:handling time: QQM Rewrite Rules: handling time C1 < t AND … AND Cn < t AND t < D1 AND… AND t < Dm AND P MAX (C1,…, Cn) < t AND t < MIN (D1,…, Dm ) AND P C < t AND t < D AND P C < D AND C < t AND PQQM Rewrite Rules:handling NOT EXISTS: QQM Rewrite Rules: handling NOT EXISTS … NOT EXISTS (SELECT * FROM m1, … mk WHERE R(t)) Not always possible to convert to a monotone query without changing the meaning If R(t) not monotone -- usually impossible “Select messages that have no reply” --> “Select messages that are at least N weeks old and have no reply.”QM QI Rewrite Rules: QM QI Rewrite Rules QI (t, τ) = QM(t) AND (NOT QM(τ)) -- simple but inefficient. Just check the records that arrived in (t, τ) interval – seems reasonable SELECT * FROM m WHERE P can rewrite as SELECT * FROM m WHERE P AND τ m.ts < tIn Practice: In Practice Experiments to evaluate performance were conducted. Goal: to prove effectiveness of implementation of continuous queries as incremental queries Description and results -- to followExperiments: Experiments Database 380,000 news articles (messages) 1 table -- “msgs” with schema: {msgId, from, subject, date, newsgroup, inReplyTo, ts} Idea compare times to execute a non-incremental query, vs. incremental version of the same query for different values of τ τ: 100%, 10%, 1% Time measured average of 10-20 executions Authorized User: for time -- talk about warm-up run, not including I/Os, since data in buffer cache, impossibility to flash cacheExperiments: Queries: Experiments: Queries Q1: “Select messages from the comp.databases newsgroup” (simple selection predicate) Q2: “Select messages from all of the comp.sys newsgroups” (selection predicate with LIKE) Q3: “Select messages that have a reply sent to comp.databases” (simple join) Q4: “Select messages that are 4 weeks old and have not been replied to” (selection with NOT EXIST) Q5: “Select first messages in conversation chains greater than 2 in length” (more sophisticated join)Results: Results Expected: QI (100%) comparable to Q QI (1%) QI (10%) QI (100%) In practice:Experiments: Results: Experiments: Results QI (100%) comparable to Q Holds for for Q2, Q3, Q4, Q5 For Q1: incremental uses index on ts, while regular uses index on newsgroup QI (1%) QI (10%) QI (100%) Holds for Q1, Q2, Q4 Q3, Q5: joins. Limitation of Sybase -- can’t utilize multiple indices for joins. Experiments: More Results: Experiments: More Results Sybase can’t deal efficiently with ORed attributes => uses index on newsgroup, not on ts => no improvement with smaller τ. Revised queries Q3 and Q5 perform much better. General technique for rewriting queries with joins – breaking 1 query into a few that are optimized easily. Q3 – original version SELECT m.msgId FROM msgs m, msg m1 WHERE m1.inReplyTo = m.msgId AND m1.newsgroup = “comp.databases” Q3 – incremental version SELECT m.msgId FROM msgs m, msgs m1 WHERE m1.inReplyTo = m.msgId AND m1.newsgroup = “comp.databese” AND (m.ts > OR m1.ts > ) For future discussion: For future discussion Optimization doesn’t happen automatically! Is it a serious limitation? Can you think of any ways to improve this aspect of the system?Other Experiments: Other Experiments Cost of incremental queries as a function of τ Q1: 0% to 100% Cost proportional to τ Cost of incremental queries as a function of database size Q3 (revised) Execution cost doesn’t grow with DB size (stays basically constant).Other Experiments: Other Experiments Previous experiments Simple index built on values of single fields. New experiments Composite index made up of the timestamp value appended to the field value: Optimizer doesn’t have to choose between using index on ts and the newsgroup. Better results for Q1 and Q3 (Q3 was significantly better).Conclusion: Conclusion Continuous Queries TQL / Tapestry Monotone / Incremental QueriesFuture Work: Future Work Make a system that allows for updates and deletions (use the techniques of this paper in a historical database). Supporting aggregate queries. Example – Show all e-mails that have received at least 5 reply’s.Discussion: Discussion Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ?Discussion: Discussion Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use? Next…….: Next……. After the break, we have the paper “Efficient Filtering of XML Documents for Selective Dissemination of Information” by Ened. Following this paper, we conclude with a discussion about all the papers presented today. Efficient Filtering of XML Documents for SDI: Efficient Filtering of XML Documents for SDI Mehmet Altmel, Michael J. Franklin Presentation by Ened KetriRelation to the previous papers: Relation to the previous papersOverview: Overview Introduction, streams connection Alternatives, Motivation Execution model XML use Xpath Xfilter Enhancements Empirical DataReferences: References XML http://www.w3.org/TR/REC-xml XPath http://www.w3.org/TR/path XML Query Model http://www.w3.org/TR/query-datamodel/ SDI : Selective Dissemination of Information: SDI : Selective Dissemination of Information Process of filtering and delivering documents based on user interests Data Collection (input streams) Filtering (predicates) Profiling (continuous queries) Delivery (output streams)Alternatives: Alternatives IR Matching Boolean Similarity Problems Domain Precision/Efficiency ScaleMotivation: Motivation Big improvements in bandwidth Overwhelming Information New tools available (XML related) Millions of users Huge volumes of data (Web)Example Applications: Example Applications Tickers (stocks, sports, weather) Traffic information, GPS Electronic newspapers Entertainment Delivery Streams and SDI: Streams and SDI SDBMS model Queries stored (profiles) Data comes in (XML documents from web) Predicates (filtering) Previously seen relevant systems PSoup (Query indexing) NiagaraCQ (XML oriented)Execution Model: Execution ModelNotes on architecture: Notes on architecture Profile creation Users System Huge set of profiles (millions) Efficient/Quick search capability XML documents as DataXML: XML Why Popularity (interoperability/compatibility) Implementation (well-formedness) Standard Hierarchical structure Precision (matching) EfficiencyXML: XML DTD (Document Type Definition) Example <?xml version="1.0"?> <!DOCTYPE greeting SYSTEM "hello.dtd"> <greeting>Hello, world!</greeting> </xml> Tagging Language Attributes/Elements XPathXPath: XPath XML Query Language W3C standard XML doc Tree of nodes Path expressions Matching Document applicationXPath: XPath Paths Absolute Relative Operators (/) (//) (*) ([..]) Example /movies/*//actors[age<30]/nameLooking back…: Looking back… NiagaraCQ IR based Document data (input) oriented DTD limited selectivity Triggers Scalability Matching Execution modelXFilter: XFilterXFilter: Issues: XFilter: Issues Problems Order checking Wildcards Filter evaluation (nested queries) Solution XPath FSM (Finite State Machine) Query IndexingFSM: FSM States Path nodes State Lists Candidate List, Wait ListFSM: FSMParsing : Parsing SAX Event Handlers Nested path expressionsEnhancements: Enhancements Improve Performance Scalability Solutions List Balancing PrefilteringList Balancing: List Balancing Why Poor selectivity from first elements Skewed Candidate Lists Solution Chose element to be placed in CL based on selectivity and list lengthPreFiltering: PreFiltering Why One level step analysis Unnecessary Work Solution Preliminary parsing of documents Only check queries whose elements appear in the documentWorkload parameters: Workload parametersVarying number of profiles: Varying number of profilesVarying Query/Document depth: Varying Query/Document depthVarying Wildcard Probability: Varying Wildcard ProbabilityConclusions: Conclusions Summary A Pub/Sub system, that basically filters XML documents based on user profiles Internet-scale SDI suitable for huge number of users Profiles are treated as standing queries Future work / Issues Partial results Handling joins/multiple streams List Balancing: List BalancingDiscussion Questions: Discussion Questions What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit?Discussion Questions: Discussion Questions Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ?Discussion Questions: Discussion Questions Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use?Discussion Questions: Discussion Questions Optimization doesn’t happen automatically (remember – queries with joins) Is it a serious limitation? Can you think of any ways to improve this aspect of the system?Discussion Questions: Discussion Questions What would be the optimal granularity in data being served? Is profile/CQ content updating an important feature and if so how could it be implemented? How would a structure/content description language based (XML) standard model of data sources affect Pub/Sub, CQ, SDBMS?