pubsubfinal

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

By: zippo (47 month(s) ago)

very good!! i am a student ,and very interested in the special topic. will you please send this document to my e-mail address :haiininwan@sina.com.cn thank you very much!!

By: zippo (47 month(s) ago)

very good!! i am a student ,and very interested in the special topic what it will you please send this document to my e-mail address :haiininwan@sina.com.cn thank you very much!!

Presentation Transcript

Continuous Queries and Publish/Subscribe: 

Continuous Queries and Publish/Subscribe Aaron Hirshfield Salinee (NOI) Jencharat Ened Ketri Esther Ryvkina Miriam Speert

Presentation Outline: 

Presentation Outline 1:40 - 1:50 Introduction to Pub/Sub 1:50 - 2:25 EXPRESSION/EVALUATE 2:25 - 3:10 TAPESTRY 3:10 - 3:20 Break 3:20 - 4:00 Efficient Filtering of XML 4:00 - 4:30 Discussion

References: 

References Aguilera, M., Strom, R., Sturman, D., Astley, M., and Chandra, T. “Matching Events in a Content-based Subscription System”. 18th ACM Symposium on Principles of Distributed Computing, 1999: 53-61 Fabret, F., Jacobsen, H.A., Llirbat, F., Pereira, J. “Filtering Algorithms and Implementation for Very Fast Publish/Subscribe System”. ACM SIGMOD 2001: 115-126. Yalamanchi, A., Srinivasan, J., Gawlick, D. “Managing Expressions as Data in Relational Database Systems”. Proceedings of the 2003 CIDR Conference.

References (Continued): 

References (Continued) Douglas B. Terry, David Goldberg, David Nichols and Brian M. Oki. Continuous Queries over Append-Only Databases. SIGMOD Conference 1992: 321-330.

What is Publish/Subscribe?: 

What is Publish/Subscribe? Connecting information providers to information consumers Delivering published events to subscribers who expressed interests in such events.

Group-Based: 

Group-Based Publishers categorize events into fixed groups such as subjects, channels, or topics. Subscribers choose certain a group(s) and then obtain every event in the specified group(s).

Content-Based : 

Content-Based No pre-defined groups - subscribers can filter criteria along multiple dimensions. No administrative overhead for publishers to maintain groups.

Issues: 

Issues Large number of subscribers Large number/high rates of events Minimize delay Simple and expressive subscription Changing subscription

How does it relate to us?: 

How does it relate to us? Subset of stream database with only select queries Events in form of streams Subscribers’ interests in form of continuous queries or possibly as another stream

Managing Expressions as Data in Relational Database Systems: 

Managing Expressions as Data in Relational Database Systems Aravind Yalamanchi Jagannathan Srinivasan Dieter Gawlick

First, a note on efficiency…: 

First, a note on efficiency… Save: time space computation computation resources

Motivation: 

Motivation Sophisticated Publish/Subscribe systems do not currently exist Authors propose approach to Pub/Sub that allows greater expressiveness

Idea:: 

Idea: I want a list of movies meeting the following criteria: G Rating Cartoon Shorter than 3 hours Could execute a query, or…

Different Approach: 

Different Approach … we could store the query (as data)! Arrival of new data  search DB for queries that are satisfied Notify users who issued matching queries

Definition: Expressions: 

Definition: Expressions Boolean conditions characterizing user’s interests Syntax-equivalent to SQL-WHERE clause Stored in a column of database table Treated as data

More on Expressions: 

More on Expressions can reference variables and functions Example: Tivo/Digital Guide ratingReview(movie) = 4 and Day = 6 and Length(movie) < 200; finds movies that received 4 stars, less than 200 minutes, and are being shown on day 6 (Saturday)

Expression Data Type: 

Expression Data Type Conditional expression stored as text Associated with Expression Set (set of expressions in db) Metadata (variables and functions common to set)

Expression Metadata: 

Expression Metadata Expressions share a common list of variables Metadata made up of these variables, their data types and functions Expressions can use this Metadata in their predicates

Expression Metadata Cont’d: 

Expression Metadata Cont’d Metadata determines evaluation context of expressions in a column Insertion or update to expression: expression checked against metadata to ensure its validity

Expression Metadata: 

Expression Metadata Variable names and their data type List of user-defined Function

Definition: EVALUATE: 

Definition: EVALUATE An operation comparable to (eval) in LISP/Scheme Equivalent SQL Query metadata in FROM clause expression in WHERE clause SELECT 1 FROM (SELECT :Genre as Genre, :Rating as Rating FROM <any_table_with_one_row>) WHERE Genre = ‘Comedy’ and Rating = ‘PG’

EVALUATE Operator: 

EVALUATE Operator Evaluates an expression into 0 or 1. Takes 2 arguments text representation of expression Data Item (String or AnyData type) SELECT * FROM consumer WHERE EVALUATE( consumer.interest, ‘Genre => “COMEDY”, Rating => “PG”’ ) = 1 SELECT * FROM consumer WHERE EVALUATE( consumer.interest, AnyData.convertObject( Movie(‘Comedy’,’PG’))= 1

Use of EVALUTE in SQL: 

Use of EVALUTE in SQL Use of aggregates to allow complex expressions. using GROUP BY or ORDER BY on the customer’s other attributes SELECT * FROM consumer WHERE EVALUATE(consumer.interest, <movie details> ) = 1 Order BY consumer.age

Use of EVALUTE (continued): 

Use of EVALUTE (continued) Can be combined with other predicates to allow “Mutual Filtering” (Publishers can filter data, not only Subscribers) notifying only customers with age >= 17 about the R-rated movies

Use of EVALUTE (continued): 

Use of EVALUTE (continued) Perform JOINS on multiple relations order the movies in the newMovie table by the number of subscribers interested in the type of the movie (using count(*) in SQL) SELECT DISTINCT (newMovie.id), count(*) as demand FROM consumer, newMovie WHERE EVALUATE(consumer.interest, <movie details from newMovie table> ) = 1 GROUP BY newMove.id ORDER BY demand DESC

Use of EVALUTE (continued): 

Use of EVALUTE (continued) Maintain complex (many-to-many) relationship between multiple tables. joining customer and movie tables to create a listing of all the actual movies from movie table that each customer is interested in.

Expression Indexing: 

Expression Indexing As in previously seen systems, Oracle developers exploit commonalities Large set of expressions many will tend to share common parts of their predicates Index defined on group of expressions in column to process more “efficiently” with EVALUATE

Expression Indexing Cont’d: 

Expression Indexing Cont’d Exploitation of commonalities leads to more “efficient” processing of expressions Processing costs shared across multiple predicates Example: Given 2 predicates: one wants movie rated PG and one wants movie rated R If one predicate is satisfied, the other is not

Exploitation of Commonalities: 

Exploitation of Commonalities Logical grouping of predicates by their LHS ( length(movie) < 2; group on length(movie)) LHS OP RHS Example: want movies shorter than 3 hours and movies shorter than 2 hours, movies shorter than 2 hours satisfy both predicates

Predicate Table: 

Predicate Table Predicates are grouped by LHS Contains 1 row for each expression

Index Processing: 

Index Processing Data’s attribute compared with the RHS and OP columns Example: SELECT exp_id from predicate_table WHERE G2_OP = ‘=‘ and G2_RHS = :rhs_val OR G2_OP = ‘>’ and G2_RHS < :rhs_val OR …

Index on Predicate Table: 

Index on Predicate Table We can have an index on the Predicate Table. Processing: 1. Indexed predicate group 2. Stored predicate group 3. Sparse predicate group Some expressions are filtered out after each step - Less to check with slower predicate group.

Predicate Handling-Comparisons: 

Predicate Handling-Comparisons

Performance Characterization: 

Performance Characterization Oracle says: preliminary experiments  promising results for the EFI Scheme Performed best when “fine-tuned” for given expression set-- Tunable Characteristics of an index: list of common predicates list of common ops for these preds # of indexed predicates

Future Work: 

Future Work Native support needed for: Expression data type EVALUATE operator

Why Does It Matter? (What to take away): 

Why Does It Matter? (What to take away) Expression/EVALUATE allows us to list all interested subscribers for a data by just one query. Makes it easier to scale for large number of subscribers. Using Relational Database Management system for Pub/Sub.

Discussion Sneak Preview: 

Discussion Sneak Preview What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit?

Continuous Queries over Append-Only Databases: 

Continuous Queries over Append-Only Databases Using the Tapestry System

Talk Topics: 

Talk Topics Continuous Queries TQL / Tapestry Query Transformations Results Conclusions

Continuous Queries: 

Continuous Queries What does ‘continuous’ mean? (Continuous Semantics) The results of a continuous query is the set of data that would be returned if the query were executed at every instant in time. Why an ‘append-only’ database? Users are being presented with answers, and we don’t want a data item to be an answer at one point, but not an answer later in time.

Periodic Query Example: 

Periodic Query Example Show all e-mail messages that have received a reply.

Problems with Periodic Queries: 

Problems with Periodic Queries Non-deterministic results If two people make this query, will they obtain the same results? Duplicates can be returned Every time the query is run, the same results might be returned. Inefficiency of the system The query is run over all data, not new data, resulting in slow processing time do to un-necessary work being done by the system.

The Tapestry System: 

The Tapestry System A system that keeps a database of all mail and news messages received. Designed to filter mail and newsgroup messages. Allows users to query over this database. Since Tapestry doesn't use triggers (time is used), it can be implemented in any commercial database that supports SQL applications.

TQL: 

TQL TQL – Tapestry Query Language Works much like SQL, with the addition of a timestamp. Every query uses DISTINCT. No aggregates at this time. Does not support outer joins. TQL  monotone  incremental  SQL

TQL Advantages: 

TQL Advantages Allows ad hoc queries. Can be used in a conventional relational database. Time-oriented queries are supported. Flexible scheduling of query running order.

Continuous Query Example: 

Continuous Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. Method: Subtract old answers from new answers and return this new set of data. This is done because some past answers might be included in the ‘new’ data.

Incremental Query Example: 

Incremental Query Example Show all e-mail messages that have received a reply, and that have not been returned by this query before. (Same query as before) Method: Run the query on the new data that has arrived, not the entire database or by comparing to past answers.

Solutions: 

Solutions Incremental Queries can eliminate duplicates. Timestamp Query Transformations To run continuous queries and solve the problem of non-determinism, TQL was developed.

Query Transformation: 

Query Transformation Q(t)  QM(t)  QI (t, τ) QM(t) – monotone query QI (t, τ) – incremental query Monotone query: QM(t1)  QM(t2) when t1 < t2 Minimum bounding monotone query Incremental query: QM(t)-QM(τ)  QI(t, τ)  QM(t) Should return enough, but not too much

Monotone queries: 

Monotone queries QM(t) = Us<=t Q(s) Monotone query result = = regular query + continuous semantics Not all queries are monotone, but for an append-only database many queries are

Monotone queries: 

Monotone queries Monotone (for append-only) Boolean predicates over column values of a table. Queries involving joins. Possibly non-monotone Functions that read current time. Sub-queries prefaced by ‘NOT EXISTS.’

Queries with time: 

Queries with time SELECT * FROM tbl WHERE tbl.field op GetDate() tbl.field < GetDate() -- monotone (<=) “Select all messages that have been replied to” tbl.field > GetDate() -- non-monotone (>=) “Select all messages that haven’t expired” =, <> -- likely to be non-monotone

Queries with NOT EXIST: 

Queries with NOT EXIST SELECT * FROM msgs m WHERE NOT EXISTS ( SELECT * FROM msgs m1 WHERE m1.inReplyTo = m.msgId) Same as with time “Select all messages that haven’t been replied to.” Modification of this query that makes more sense in continuous semantics: “Select all messages that haven’t been replied to for 2 weeks.”

Query Transformation: 

Query Transformation Q(t)  QM(t)  QI (t, τ) Standard form of a query: SELECT … FROM table1, table2, … WHERE (E1 AND E2 AND … Ek) OR ( … ) OR … OR ( … ) If Q is in the form (P ۷ R), then QM is in form (PM ۷ RM) => Can consider each of the ‘AND’ expressions separately.

QQM Rewrite Rules: handling time: 

QQM Rewrite Rules: handling time Assume (for now) no NOT EXISTS For each table mentioned in a query, add a clause: table.ts < t Terms that don’t contain time don’t affect monotonicity. Group all terms with time and simplify. Authorized User: should add example?? (extra slide)

QQM Rewrite Rules: handling time: 

QQM Rewrite Rules: handling time C1 < t AND … AND Cn < t AND t < D1 AND… AND t < Dm AND P  MAX (C1,…, Cn) < t AND t < MIN (D1,…, Dm ) AND P  C < t AND t < D AND P  C < D AND C < t AND P

QQM Rewrite Rules: handling NOT EXISTS: 

QQM Rewrite Rules: handling NOT EXISTS … NOT EXISTS (SELECT * FROM m1, … mk WHERE R(t)) Not always possible to convert to a monotone query without changing the meaning If R(t) not monotone -- usually impossible “Select messages that have no reply” --> “Select messages that are at least N weeks old and have no reply.”

QM QI Rewrite Rules: 

QM QI Rewrite Rules QI (t, τ) = QM(t) AND (NOT QM(τ)) -- simple but inefficient. Just check the records that arrived in (t, τ) interval – seems reasonable SELECT * FROM m WHERE P can rewrite as SELECT * FROM m WHERE P AND τ  m.ts < t

In Practice: 

In Practice Experiments to evaluate performance were conducted. Goal: to prove effectiveness of implementation of continuous queries as incremental queries Description and results -- to follow

Experiments: 

Experiments Database 380,000 news articles (messages) 1 table -- “msgs” with schema: {msgId, from, subject, date, newsgroup, inReplyTo, ts} Idea compare times to execute a non-incremental query, vs. incremental version of the same query for different values of τ τ: 100%, 10%, 1% Time measured average of 10-20 executions Authorized User: for time -- talk about warm-up run, not including I/Os, since data in buffer cache, impossibility to flash cache

Experiments: Queries: 

Experiments: Queries Q1: “Select messages from the comp.databases newsgroup” (simple selection predicate) Q2: “Select messages from all of the comp.sys newsgroups” (selection predicate with LIKE) Q3: “Select messages that have a reply sent to comp.databases” (simple join) Q4: “Select messages that are 4 weeks old and have not been replied to” (selection with NOT EXIST) Q5: “Select first messages in conversation chains greater than 2 in length” (more sophisticated join)

Results: 

Results Expected: QI (100%) comparable to Q QI (1%)  QI (10%)  QI (100%) In practice:

Experiments: Results: 

Experiments: Results QI (100%) comparable to Q Holds for for Q2, Q3, Q4, Q5 For Q1: incremental uses index on ts, while regular uses index on newsgroup QI (1%)  QI (10%)  QI (100%) Holds for Q1, Q2, Q4 Q3, Q5: joins. Limitation of Sybase -- can’t utilize multiple indices for joins.

Experiments: More Results: 

Experiments: More Results Sybase can’t deal efficiently with ORed attributes => uses index on newsgroup, not on ts => no improvement with smaller τ. Revised queries Q3 and Q5 perform much better. General technique for rewriting queries with joins – breaking 1 query into a few that are optimized easily. Q3 – original version SELECT m.msgId FROM msgs m, msg m1 WHERE m1.inReplyTo = m.msgId AND m1.newsgroup = “comp.databases”  Q3 – incremental version  SELECT m.msgId FROM msgs m, msgs m1 WHERE m1.inReplyTo = m.msgId AND m1.newsgroup = “comp.databese” AND (m.ts >  OR m1.ts > )

For future discussion: 

For future discussion Optimization doesn’t happen automatically! Is it a serious limitation? Can you think of any ways to improve this aspect of the system?

Other Experiments: 

Other Experiments Cost of incremental queries as a function of τ Q1: 0% to 100% Cost proportional to τ Cost of incremental queries as a function of database size Q3 (revised) Execution cost doesn’t grow with DB size (stays basically constant).

Other Experiments: 

Other Experiments Previous experiments Simple index built on values of single fields. New experiments Composite index made up of the timestamp value appended to the field value: Optimizer doesn’t have to choose between using index on ts and the newsgroup. Better results for Q1 and Q3 (Q3 was significantly better).

Conclusion: 

Conclusion Continuous Queries TQL / Tapestry Monotone / Incremental Queries

Future Work: 

Future Work Make a system that allows for updates and deletions (use the techniques of this paper in a historical database). Supporting aggregate queries. Example – Show all e-mails that have received at least 5 reply’s.

Discussion: 

Discussion Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ?

Discussion: 

Discussion Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use?

Next…….: 

Next……. After the break, we have the paper “Efficient Filtering of XML Documents for Selective Dissemination of Information” by Ened. Following this paper, we conclude with a discussion about all the papers presented today.

Efficient Filtering of XML Documents for SDI: 

Efficient Filtering of XML Documents for SDI Mehmet Altmel, Michael J. Franklin Presentation by Ened Ketri

Relation to the previous papers: 

Relation to the previous papers

Overview: 

Overview Introduction, streams connection Alternatives, Motivation Execution model XML use Xpath Xfilter Enhancements Empirical Data

References: 

References XML http://www.w3.org/TR/REC-xml XPath http://www.w3.org/TR/path XML Query Model http://www.w3.org/TR/query-datamodel/

SDI : Selective Dissemination of Information: 

SDI : Selective Dissemination of Information Process of filtering and delivering documents based on user interests Data Collection (input streams) Filtering (predicates) Profiling (continuous queries) Delivery (output streams)

Alternatives: 

Alternatives IR Matching Boolean Similarity Problems Domain Precision/Efficiency Scale

Motivation: 

Motivation Big improvements in bandwidth Overwhelming Information New tools available (XML related) Millions of users Huge volumes of data (Web)

Example Applications: 

Example Applications Tickers (stocks, sports, weather) Traffic information, GPS Electronic newspapers Entertainment Delivery

Streams and SDI: 

Streams and SDI SDBMS model Queries stored (profiles) Data comes in (XML documents from web) Predicates (filtering) Previously seen relevant systems PSoup (Query indexing) NiagaraCQ (XML oriented)

Execution Model: 

Execution Model

Notes on architecture: 

Notes on architecture Profile creation Users System Huge set of profiles (millions) Efficient/Quick search capability XML documents as Data

XML: 

XML Why Popularity (interoperability/compatibility) Implementation (well-formedness) Standard Hierarchical structure Precision (matching) Efficiency

XML: 

XML DTD (Document Type Definition) Example <?xml version="1.0"?> <!DOCTYPE greeting SYSTEM "hello.dtd"> <greeting>Hello, world!</greeting> </xml> Tagging Language Attributes/Elements XPath

XPath: 

XPath XML Query Language W3C standard XML doc  Tree of nodes Path expressions Matching Document application

XPath: 

XPath Paths Absolute Relative Operators (/) (//) (*) ([..]) Example /movies/*//actors[age<30]/name

Looking back…: 

Looking back… NiagaraCQ IR based Document data (input) oriented DTD limited selectivity Triggers Scalability Matching Execution model

XFilter: 

XFilter

XFilter: Issues: 

XFilter: Issues Problems Order checking Wildcards Filter evaluation (nested queries) Solution XPath  FSM (Finite State Machine) Query Indexing

FSM: 

FSM States Path nodes State Lists Candidate List, Wait List

FSM: 

FSM

Parsing : 

Parsing SAX Event Handlers Nested path expressions

Enhancements: 

Enhancements Improve Performance Scalability Solutions List Balancing Prefiltering

List Balancing: 

List Balancing Why Poor selectivity from first elements Skewed Candidate Lists Solution Chose element to be placed in CL based on selectivity and list length

PreFiltering: 

PreFiltering Why One level step analysis Unnecessary Work Solution Preliminary parsing of documents Only check queries whose elements appear in the document

Workload parameters: 

Workload parameters

Varying number of profiles: 

Varying number of profiles

Varying Query/Document depth: 

Varying Query/Document depth

Varying Wildcard Probability: 

Varying Wildcard Probability

Conclusions: 

Conclusions Summary A Pub/Sub system, that basically filters XML documents based on user profiles Internet-scale SDI suitable for huge number of users Profiles are treated as standing queries Future work / Issues Partial results Handling joins/multiple streams

List Balancing: 

List Balancing

Discussion Questions: 

Discussion Questions What if we don’t have all information? Must conform to metadata, so now what? Have we seen Pub/Sub before? What about myYahoo portals? So, if RDMS can be extended to handle Pub/Sub, is this the limit?

Discussion Questions: 

Discussion Questions Scalability of Tapestry: Size of the database -- Yes Number of queries -- ??? Continuous queries and streams: Similarities ? Differences ?

Discussion Questions: 

Discussion Questions Append-only databases: Is it a vital limitation in practice? This approach was invented in 1992. Why didn’t it find its way to wide use?

Discussion Questions: 

Discussion Questions Optimization doesn’t happen automatically (remember – queries with joins) Is it a serious limitation? Can you think of any ways to improve this aspect of the system?

Discussion Questions: 

Discussion Questions What would be the optimal granularity in data being served? Is profile/CQ content updating an important feature and if so how could it be implemented? How would a structure/content description language based (XML) standard model of data sources affect Pub/Sub, CQ, SDBMS?