logging in or signing up telegraph retreat Davidson Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 35 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript TelegraphEndeavour Retreat 2000: Telegraph Endeavour Retreat 2000 Joe HellersteinRoadmap: Roadmap Motivation & Goals Application Scenarios Quickie core technology overview Adaptive dataflow Event-based storage manager Come hear more about these tonight/tomorrow! Status and Plans Dataflow infrastructure & apps Storage manager?Motivations: Motivations Global Data Federation All the data is online – what are we waiting for? The plumbing is coming XML/HTTP, XML/WAP, etc. give LCD communication but how do you flow, summarize, query and analyze data robustly over many sources in the wide area? Ubiquitous computing: more than clients sensors and their data feeds are key smart dust, biomedical (MEMS sensors) each consumer good records (mis)use disposable computing video from surveillance cameras, broadcasts, etc. Huge Data flood a’comin’! will it capsize the good ship Endeavour?Initial Telegraph Goals: Initial Telegraph Goals Unify data access & dataflow apps Commercial wrappers for most infosources Most info-centric apps can be cast as dataflow The data flood needs a big dataflow manager! Goal: a robust, adaptive dataflow engine Unify storage Currently lots of disparate data stores Databases, Files, Email servers (and http access on these) Goal: A single, clean storage manager that can serve: DB records & semantics Files and “semantics” Email folders, calendars, etc. and semanticsChallenge for Dataflow: Volatility!: Challenge for Dataflow: Volatility! Federated query processors A la Cohera, IBM DataJoiner No control over stats, performance, administration Large Cluster Systems “Scaling Out” No control over “system balance” User “CONTROL” of running dataflows Long-running dataflow apps are interactive No control over user interaction Sensor Nets No control over anything! Telegraph Dataflow Engine for these environmentsThe Data Flood: Main Features: The Data Flood: Main Features What does it look like? Never ends: interactivity required Online, controllable algorithms for all tasks! Big: data reduction/aggregation is key Volatile: this scale of devices and nets will not behave nicelyThe Telegraph Dataflow Engine: The Telegraph Dataflow Engine Key technologies Interactive Control interactivity with early answers and examples online aggregation for data reduction Dataflow programming via paths/iterators Elevate query processing frameworks out of DBMSs Long tradition of static optimization here Suggestive, but not sufficient for volatile environments Continuously adaptive flow optimization massively parallel, adaptive dataflow Rivers and Eddies Static Query Plans: Static Query Plans Volatile environments like sensors need to adapt at a much finer grainContinuous Adaptivity: Eddies: Continuous Adaptivity: Eddies How to order and reorder operators over time based on performance, economic/admin feedback Vs.River: River optimizes each operator “horizontally” Eddies optimize a pipeline “vertically” EddyUnifying Storage: Unifying Storage Storage management buried inside specific systems Elevate and expose the core services & semantic options Layout/indexing Concurrent access/modification Recovery Design for clustered environments Replicate for reliability (tie-ins with Ninja) Cluster options: your RAM vs. my disk Events & State Machines for scalability Unify eventflow and dataflow? Share optimization lessons?Status: Adaptive Dataflow: Status: Adaptive Dataflow Initial Eddy results promising, well received (SIGMOD 2K) Finishing Telegraph v0 in Java/Jaguar Prototype now running Demo service to go live on web this summer Analysis queries over web sites We’ve picked a provocative app to go live with (stay tuned!) Incorporates Ninja “path” project for caching Goal: Telegraph is to “facts and figures” as search engines are to “documents” Longer-term goals: Formalize & optimize Eddy/River scheduling policies Study HCI/systems/stats issues for interaction Crawl “Dark Matter” on the web Attack streams from sensors Sequence queries and mining, data reduction, browsing, etc.Status: Unified Storage Manager: Status: Unified Storage Manager Prototype implementation in Java/Jaguar ACID transactions + (non-ACID) Java file access Robust enough to get TPC-W numbers Events/states vs. threads Echoes Gribble/Welsh results: better than threaded under load, but Java complicates detailed mesurement Time to re-evaluate importance of this part Interest? More mindshare in dataflow infrastructure. Vs. tuning an off-the-shelf solution (e.g. Berkeley DB)? Goal? unified lessons about dataflow/eventflow optimization on clusters.Integration with Rest of Endeavour: Integration with Rest of Endeavour Give Be dataflow backbone for diverse “clients” Our own Telegraph apps (federated dataflow, sensors) Replication/delivery dataflow engine for OceanStore Scalable infrastructure for tacit info mining algorithms? Pipes for next version of Iceberg? Telegraph Storage Manager provides storage (xactional/otherwise) for OceanStore? Ninja? Take OceanStore to manage distributed metadata, security Leverage protocols out of TinyOS for sensors Partner with Ninja to manage local metadata? Work with GUIR on interacting with streams?More Info: More Info People: Joe Hellerstein, Mike Franklin, Eric Brewer, Christos Papadimitriou Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah Software http://telegraph.cs.berkeley.edu coming soon ABC interactive data anlysis/cleansing at http://control.cs.berkeley.edu Papers: See http://db.cs.berkeley.edu/telegraphExtra slides for backup: Extra slides for backup Connectivity & Heterogeneity: Connectivity & Heterogeneity Lots of folks working on data format translation, parsing we will borrow, not build currently using JDBC & Cohera Net Query commercial tool, donated by Cohera Corp. gateways XML/HTML (via http) to ODBC/JDBC we may write “Teletalk” gateways from sensors Heterogeneity never a simple problem Control project developed interactive, online data transformation tool: ABCCONTROLContinuous Output and Navigation Technology with Refinement On Line: CONTROL Continuous Output and Navigation Technology with Refinement On Line Data-intensive jobs are long-running. How to give early answers and interactivity? online interactivity over feeds pipelining “online” operators, data “juggle” online data correlation algs: ripple joins, online mining and aggregation statistical estimators, and their performance implications Deliver data to satisfy statistical goals Appreciate interplay of massive data processing, stats, and HCI “Of all men's miseries, the bitterest is this: to know so much and have control over nothing” HerodotusPerformance Regime for CONTROL: Performance Regime for CONTROL New “Greedy” Performance Regime Maximize 1st derivative of the user-happiness function Time 100% CONTROL TraditionalCONTROLContinuous Output and Navigation Technology with Refinement On Line: CONTROL Continuous Output and Navigation Technology with Refinement On Line CONTROLContinuous Output and Navigation Technology with Refinement On Line: CONTROL Continuous Output and Navigation Technology with Refinement On Line River: River We built the world’s fastest sorting machine On the “NOW”: 100 Sun workstations + SAN But it only beat the record under ideal conditions! River: performance adaptivity for data flows on clusters simplifies management and programming perfect for sensor-based streamsDeclarative Dataflow: NOT new: Declarative Dataflow: NOT new Database Systems have been doing this for years Xlate declarative queries into an efficient dataflow plan “query optimization” considers: Alternate data sources (“access methods”) Alternate implementations of operators Multiple orders of operators A space of alternatives defined by transformation rules Estimate costs and “data rates”, then search space But in a very static way! Gather statistics once a week Optimize query at submission time Run a fixed plan for the life of the query And these ideas are ripe to elevate out of DBMSs And outside of DBMSs, the world is very volatile There are surely going to be lessons “outside the box”Static Query Plans: Static Query Plans Volatile environments like sensors need to adapt at a much finer grainContinuous Adaptivity: Eddies: Continuous Adaptivity: Eddies How to order and reorder operators over time based on performance, economic/admin feedback Vs.River: River optimizes each operator “horizontally” Eddies optimize a pipeline “vertically” EddyCompetitive Eddies: Competitive Eddies Potter’s Wheel Anomaly Detection: Potter’s Wheel Anomaly Detection The Data Flood is Real: The Data Flood is Real Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdfDisk Appetite, cont.: Disk Appetite, cont. Greg Papadopoulos, CTO Sun: Disk sales doubling every 9 months Note: only counts the data we’re saving! Translate: Time to process all your data doubles every 18 months MOORE’S LAW INVERTED! (and Moore’s Law may run out in the next couple decades?) Big challenge (opportunity?) for SW systems research Traditional scalability research won’t help “Ideal” linear scaleup is NOT NEARLY ENOUGH!Data Volume: Prognostications: Data Volume: Prognostications Today SwipeStream E.g. Wal-Mart 24 Tb Data Warehouse ClickStream Web Internet Archive: ?? Tb Replicated OS/Apps Tomorrow Sensors Galore DARPA/Berkeley “Smart Dust” Note: the privacy issues only get more complex! Both technically and ethically Temperature, light, humidity, pressure, accelerometer, magneticsExplaining Disk Appetite: Explaining Disk Appetite Areal density increases 60%/yr Yet Mb/$ rises much faster! Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdf You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
telegraph retreat Davidson Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 35 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: November 20, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript TelegraphEndeavour Retreat 2000: Telegraph Endeavour Retreat 2000 Joe HellersteinRoadmap: Roadmap Motivation & Goals Application Scenarios Quickie core technology overview Adaptive dataflow Event-based storage manager Come hear more about these tonight/tomorrow! Status and Plans Dataflow infrastructure & apps Storage manager?Motivations: Motivations Global Data Federation All the data is online – what are we waiting for? The plumbing is coming XML/HTTP, XML/WAP, etc. give LCD communication but how do you flow, summarize, query and analyze data robustly over many sources in the wide area? Ubiquitous computing: more than clients sensors and their data feeds are key smart dust, biomedical (MEMS sensors) each consumer good records (mis)use disposable computing video from surveillance cameras, broadcasts, etc. Huge Data flood a’comin’! will it capsize the good ship Endeavour?Initial Telegraph Goals: Initial Telegraph Goals Unify data access & dataflow apps Commercial wrappers for most infosources Most info-centric apps can be cast as dataflow The data flood needs a big dataflow manager! Goal: a robust, adaptive dataflow engine Unify storage Currently lots of disparate data stores Databases, Files, Email servers (and http access on these) Goal: A single, clean storage manager that can serve: DB records & semantics Files and “semantics” Email folders, calendars, etc. and semanticsChallenge for Dataflow: Volatility!: Challenge for Dataflow: Volatility! Federated query processors A la Cohera, IBM DataJoiner No control over stats, performance, administration Large Cluster Systems “Scaling Out” No control over “system balance” User “CONTROL” of running dataflows Long-running dataflow apps are interactive No control over user interaction Sensor Nets No control over anything! Telegraph Dataflow Engine for these environmentsThe Data Flood: Main Features: The Data Flood: Main Features What does it look like? Never ends: interactivity required Online, controllable algorithms for all tasks! Big: data reduction/aggregation is key Volatile: this scale of devices and nets will not behave nicelyThe Telegraph Dataflow Engine: The Telegraph Dataflow Engine Key technologies Interactive Control interactivity with early answers and examples online aggregation for data reduction Dataflow programming via paths/iterators Elevate query processing frameworks out of DBMSs Long tradition of static optimization here Suggestive, but not sufficient for volatile environments Continuously adaptive flow optimization massively parallel, adaptive dataflow Rivers and Eddies Static Query Plans: Static Query Plans Volatile environments like sensors need to adapt at a much finer grainContinuous Adaptivity: Eddies: Continuous Adaptivity: Eddies How to order and reorder operators over time based on performance, economic/admin feedback Vs.River: River optimizes each operator “horizontally” Eddies optimize a pipeline “vertically” EddyUnifying Storage: Unifying Storage Storage management buried inside specific systems Elevate and expose the core services & semantic options Layout/indexing Concurrent access/modification Recovery Design for clustered environments Replicate for reliability (tie-ins with Ninja) Cluster options: your RAM vs. my disk Events & State Machines for scalability Unify eventflow and dataflow? Share optimization lessons?Status: Adaptive Dataflow: Status: Adaptive Dataflow Initial Eddy results promising, well received (SIGMOD 2K) Finishing Telegraph v0 in Java/Jaguar Prototype now running Demo service to go live on web this summer Analysis queries over web sites We’ve picked a provocative app to go live with (stay tuned!) Incorporates Ninja “path” project for caching Goal: Telegraph is to “facts and figures” as search engines are to “documents” Longer-term goals: Formalize & optimize Eddy/River scheduling policies Study HCI/systems/stats issues for interaction Crawl “Dark Matter” on the web Attack streams from sensors Sequence queries and mining, data reduction, browsing, etc.Status: Unified Storage Manager: Status: Unified Storage Manager Prototype implementation in Java/Jaguar ACID transactions + (non-ACID) Java file access Robust enough to get TPC-W numbers Events/states vs. threads Echoes Gribble/Welsh results: better than threaded under load, but Java complicates detailed mesurement Time to re-evaluate importance of this part Interest? More mindshare in dataflow infrastructure. Vs. tuning an off-the-shelf solution (e.g. Berkeley DB)? Goal? unified lessons about dataflow/eventflow optimization on clusters.Integration with Rest of Endeavour: Integration with Rest of Endeavour Give Be dataflow backbone for diverse “clients” Our own Telegraph apps (federated dataflow, sensors) Replication/delivery dataflow engine for OceanStore Scalable infrastructure for tacit info mining algorithms? Pipes for next version of Iceberg? Telegraph Storage Manager provides storage (xactional/otherwise) for OceanStore? Ninja? Take OceanStore to manage distributed metadata, security Leverage protocols out of TinyOS for sensors Partner with Ninja to manage local metadata? Work with GUIR on interacting with streams?More Info: More Info People: Joe Hellerstein, Mike Franklin, Eric Brewer, Christos Papadimitriou Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah Software http://telegraph.cs.berkeley.edu coming soon ABC interactive data anlysis/cleansing at http://control.cs.berkeley.edu Papers: See http://db.cs.berkeley.edu/telegraphExtra slides for backup: Extra slides for backup Connectivity & Heterogeneity: Connectivity & Heterogeneity Lots of folks working on data format translation, parsing we will borrow, not build currently using JDBC & Cohera Net Query commercial tool, donated by Cohera Corp. gateways XML/HTML (via http) to ODBC/JDBC we may write “Teletalk” gateways from sensors Heterogeneity never a simple problem Control project developed interactive, online data transformation tool: ABCCONTROLContinuous Output and Navigation Technology with Refinement On Line: CONTROL Continuous Output and Navigation Technology with Refinement On Line Data-intensive jobs are long-running. How to give early answers and interactivity? online interactivity over feeds pipelining “online” operators, data “juggle” online data correlation algs: ripple joins, online mining and aggregation statistical estimators, and their performance implications Deliver data to satisfy statistical goals Appreciate interplay of massive data processing, stats, and HCI “Of all men's miseries, the bitterest is this: to know so much and have control over nothing” HerodotusPerformance Regime for CONTROL: Performance Regime for CONTROL New “Greedy” Performance Regime Maximize 1st derivative of the user-happiness function Time 100% CONTROL TraditionalCONTROLContinuous Output and Navigation Technology with Refinement On Line: CONTROL Continuous Output and Navigation Technology with Refinement On Line CONTROLContinuous Output and Navigation Technology with Refinement On Line: CONTROL Continuous Output and Navigation Technology with Refinement On Line River: River We built the world’s fastest sorting machine On the “NOW”: 100 Sun workstations + SAN But it only beat the record under ideal conditions! River: performance adaptivity for data flows on clusters simplifies management and programming perfect for sensor-based streamsDeclarative Dataflow: NOT new: Declarative Dataflow: NOT new Database Systems have been doing this for years Xlate declarative queries into an efficient dataflow plan “query optimization” considers: Alternate data sources (“access methods”) Alternate implementations of operators Multiple orders of operators A space of alternatives defined by transformation rules Estimate costs and “data rates”, then search space But in a very static way! Gather statistics once a week Optimize query at submission time Run a fixed plan for the life of the query And these ideas are ripe to elevate out of DBMSs And outside of DBMSs, the world is very volatile There are surely going to be lessons “outside the box”Static Query Plans: Static Query Plans Volatile environments like sensors need to adapt at a much finer grainContinuous Adaptivity: Eddies: Continuous Adaptivity: Eddies How to order and reorder operators over time based on performance, economic/admin feedback Vs.River: River optimizes each operator “horizontally” Eddies optimize a pipeline “vertically” EddyCompetitive Eddies: Competitive Eddies Potter’s Wheel Anomaly Detection: Potter’s Wheel Anomaly Detection The Data Flood is Real: The Data Flood is Real Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdfDisk Appetite, cont.: Disk Appetite, cont. Greg Papadopoulos, CTO Sun: Disk sales doubling every 9 months Note: only counts the data we’re saving! Translate: Time to process all your data doubles every 18 months MOORE’S LAW INVERTED! (and Moore’s Law may run out in the next couple decades?) Big challenge (opportunity?) for SW systems research Traditional scalability research won’t help “Ideal” linear scaleup is NOT NEARLY ENOUGH!Data Volume: Prognostications: Data Volume: Prognostications Today SwipeStream E.g. Wal-Mart 24 Tb Data Warehouse ClickStream Web Internet Archive: ?? Tb Replicated OS/Apps Tomorrow Sensors Galore DARPA/Berkeley “Smart Dust” Note: the privacy issues only get more complex! Both technically and ethically Temperature, light, humidity, pressure, accelerometer, magneticsExplaining Disk Appetite: Explaining Disk Appetite Areal density increases 60%/yr Yet Mb/$ rises much faster! Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdf