Telegraph Endeavour Retreat 2000 : Telegraph Endeavour Retreat 2000 Joe Hellerstein
Roadmap : Roadmap Motivation & Goals
Application Scenarios
Quickie core technology overview
Adaptive dataflow
Event-based storage manager
Come hear more about these tonight/tomorrow!
Status and Plans
Dataflow infrastructure & apps
Storage manager?
Motivations : Motivations Global Data Federation
All the data is online – what are we waiting for?
The plumbing is coming
XML/HTTP, XML/WAP, etc. give LCD (lowest-common-denominator) communication
but how do you flow, summarize, query and analyze data robustly over many sources in the wide area?
Ubiquitous computing: more than clients
sensors and their data feeds are key
smart dust, biomedical (MEMS sensors)
each consumer good records (mis)use
disposable computing
video from surveillance cameras, broadcasts, etc.
Huge Data flood a’comin’!
will it capsize the good ship Endeavour?
Initial Telegraph Goals : Initial Telegraph Goals Unify data access & dataflow apps
Commercial wrappers for most infosources
Most info-centric apps can be cast as dataflow
The data flood needs a big dataflow manager!
Goal: a robust, adaptive dataflow engine
Unify storage
Currently lots of disparate data stores
Databases, Files, Email servers (and http access on these)
Goal: A single, clean storage manager that can serve:
DB records & semantics
Files and “semantics”
Email folders, calendars, etc. and semantics
Challenge for Dataflow: Volatility! : Challenge for Dataflow: Volatility! Federated query processors
A la Cohera, IBM DataJoiner
No control over stats, performance, administration
Large Cluster Systems “Scaling Out”
No control over “system balance”
User “CONTROL” of running dataflows
Long-running dataflow apps are interactive
No control over user interaction
Sensor Nets
No control over anything!
Telegraph
Dataflow Engine for these environments
The Data Flood: Main Features : The Data Flood: Main Features What does it look like?
Never ends: interactivity required
Online, controllable algorithms for all tasks!
Big: data reduction/aggregation is key
Volatile: this scale of devices and nets will not behave nicely
The Telegraph Dataflow Engine : The Telegraph Dataflow Engine Key technologies
Interactive Control
interactivity with early answers and examples
online aggregation for data reduction
Dataflow programming via paths/iterators
Elevate query processing frameworks out of DBMSs
Long tradition of static optimization here
Suggestive, but not sufficient for volatile environments
Continuously adaptive flow optimization
massively parallel, adaptive dataflow
Rivers and Eddies
Static Query Plans : Static Query Plans Volatile environments like sensors need to adapt at a much finer grain
Continuous Adaptivity: Eddies : Continuous Adaptivity: Eddies How to order and reorder operators over time
based on performance, economic/admin feedback
Vs. River:
River optimizes each operator “horizontally”
Eddies optimize a pipeline “vertically”
Eddy
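The per-tuple reordering described on this slide can be sketched as follows. This is a minimal illustration, not Telegraph code: it assumes selection-only operators, and the names `Op` and `Eddy` and the "one ticket per dropped tuple" reward rule are simplifying assumptions made for this example.

```python
import random

class Op:
    """A selection operator: a named predicate (hypothetical helper)."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1  # more tickets = more likely to be tried first

class Eddy:
    """Routes every tuple through all operators, picking the order per tuple."""
    def __init__(self, ops, seed=0):
        self.ops = ops
        self.rng = random.Random(seed)

    def _pick(self, candidates):
        # Weighted lottery over the operators this tuple has not visited yet.
        weights = [op.tickets for op in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def route(self, tup):
        remaining = list(self.ops)
        while remaining:
            op = self._pick(remaining)
            remaining.remove(op)
            if not op.predicate(tup):
                op.tickets += 1  # assumed reward: the operator dropped a tuple
                return None
        return tup

    def run(self, tuples):
        return [t for t in tuples if self.route(t) is not None]
```

Because operators that drop many tuples collect tickets, they drift toward the front of the pipeline while the flow is running, and the ordering shifts again if the data changes. The output is the same whatever order is chosen; only the work done per tuple differs.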
Unifying Storage : Unifying Storage Storage management buried inside specific systems
Elevate and expose the core services & semantic options
Layout/indexing
Concurrent access/modification
Recovery
Design for clustered environments
Replicate for reliability (tie-ins with Ninja)
Cluster options: your RAM vs. my disk
Events & State Machines for scalability
Unify eventflow and dataflow?
Share optimization lessons?
Status: Adaptive Dataflow : Status: Adaptive Dataflow Initial Eddy results promising, well received (SIGMOD 2K)
Finishing Telegraph v0 in Java/Jaguar
Prototype now running
Demo service to go live on web this summer
Analysis queries over web sites
We’ve picked a provocative app to go live with (stay tuned!)
Incorporates Ninja “path” project for caching
Goal: Telegraph is to “facts and figures” as search engines are to “documents”
Longer-term goals:
Formalize & optimize Eddy/River scheduling policies
Study HCI/systems/stats issues for interaction
Crawl “Dark Matter” on the web
Attack streams from sensors
Sequence queries and mining, data reduction, browsing, etc.
Status: Unified Storage Manager : Status: Unified Storage Manager Prototype implementation in Java/Jaguar
ACID transactions + (non-ACID) Java file access
Robust enough to get TPC-W numbers
Events/states vs. threads
Echoes Gribble/Welsh results: better than threaded under load, but Java complicates detailed measurement
Time to re-evaluate importance of this part
Interest? More mindshare in dataflow infrastructure.
Vs. tuning an off-the-shelf solution (e.g. Berkeley DB)?
Goal? unified lessons about dataflow/eventflow optimization on clusters.
Integration with Rest of Endeavour : Integration with Rest of Endeavour Give
Be dataflow backbone for diverse “clients”
Our own Telegraph apps (federated dataflow, sensors)
Replication/delivery dataflow engine for OceanStore
Scalable infrastructure for tacit info mining algorithms?
Pipes for next version of Iceberg?
Telegraph Storage Manager provides storage (xactional/otherwise) for OceanStore? Ninja?
Take
OceanStore to manage distributed metadata, security
Leverage protocols out of TinyOS for sensors
Partner with Ninja to manage local metadata?
Work with GUIR on interacting with streams?
More Info : More Info People:
Joe Hellerstein, Mike Franklin, Eric Brewer, Christos Papadimitriou
Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah
Software
http://telegraph.cs.berkeley.edu coming soon
ABC interactive data analysis/cleansing at http://control.cs.berkeley.edu
Papers:
See http://db.cs.berkeley.edu/telegraph
Extra slides for backup : Extra slides for backup
Connectivity & Heterogeneity : Connectivity & Heterogeneity Lots of folks working on data format translation, parsing
we will borrow, not build
currently using JDBC & Cohera Net Query
commercial tool, donated by Cohera Corp.
gateways XML/HTML (via http) to ODBC/JDBC
we may write “Teletalk” gateways from sensors
Heterogeneity
never a simple problem
Control project developed interactive, online data transformation tool: ABC
CONTROL Continuous Output and Navigation Technology with Refinement On Line : CONTROL Continuous Output and Navigation Technology with Refinement On Line Data-intensive jobs are long-running. How to give early answers and interactivity?
online interactivity over feeds
pipelining “online” operators, data “juggle”
online data correlation algs: ripple joins, online mining and aggregation
statistical estimators, and their performance implications
Deliver data to satisfy statistical goals
Appreciate interplay of massive data processing, stats, and HCI
“Of all men's miseries, the bitterest is this: to know so much and have control over nothing” (Herodotus)
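The "early answers" idea can be sketched as a running average that reports a confidence interval after every item. This is an illustrative sketch, not the CONTROL implementation: the function name `online_average` is hypothetical, and the interval is a CLT-based approximation that assumes items arrive in random order.

```python
import math

def online_average(stream, z=1.96):
    """Yield (n, running_mean, half_width) after each item.

    half_width is an approximate 95% confidence half-width (z = 1.96),
    valid only under the random-arrival-order assumption; it is infinite
    until at least two items have been seen.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # Welford's running sum of squared deviations
        if n > 1:
            half = z * math.sqrt(m2 / (n - 1) / n)
        else:
            half = math.inf
        yield n, mean, half
```

A user watching the stream of estimates can stop the job as soon as the half-width is small enough for their purpose, instead of waiting for the full scan to finish.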
Performance Regime for CONTROL : Performance Regime for CONTROL New “Greedy” Performance Regime
Maximize 1st derivative of the user-happiness function
[Figure: curves labeled CONTROL and Traditional; axis labels: Time, 100%]
River : River We built the world’s fastest sorting machine
On the “NOW”: 100 Sun workstations + SAN
But it only beat the record under ideal conditions!
River: performance adaptivity for data flows on clusters
simplifies management and programming
perfect for sensor-based streams
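Why "ideal conditions" matter can be seen in a toy simulation. This sketch is not River's implementation; it only compares fixed equal partitioning against a shared queue in which each record goes to whichever consumer is free first. The function names and the per-record `costs` (seconds per record for each consumer) are assumptions for illustration.

```python
import heapq

def static_finish_time(n_records, costs):
    """Equal static partitioning: the slowest consumer sets the finish time."""
    share = n_records / len(costs)
    return max(share * c for c in costs)

def adaptive_finish_time(n_records, costs):
    """Shared queue: each record goes to the consumer that frees up first."""
    # Heap of (time at which the consumer becomes free, consumer index).
    free_at = [(0.0, i) for i in range(len(costs))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_records):
        t, i = heapq.heappop(free_at)
        done = t + costs[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish
```

With identical consumers the two policies finish together. If one consumer in four is four times slower, for example `costs = [1.0, 1.0, 1.0, 4.0]`, static partitioning waits on the slow node while the shared queue routes most records around it.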
Declarative Dataflow: NOT new : Declarative Dataflow: NOT new Database Systems have been doing this for years
Xlate declarative queries into an efficient dataflow plan
“query optimization” considers:
Alternate data sources (“access methods”)
Alternate implementations of operators
Multiple orders of operators
A space of alternatives defined by transformation rules
Estimate costs and “data rates”, then search space
But in a very static way!
Gather statistics once a week
Optimize query at submission time
Run a fixed plan for the life of the query
And these ideas are ripe to elevate out of DBMSs
And outside of DBMSs, the world is very volatile
There are surely going to be lessons “outside the box”
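The static approach above can be sketched for the simplest case, ordering a set of filters. This is a minimal sketch under stated assumptions: each filter is a hypothetical `(name, cost, selectivity)` triple, selectivities are treated as independent, and the search is brute force over all orders rather than a real optimizer's rule-based search.

```python
from itertools import permutations

def expected_cost(order):
    """Expected per-tuple cost of running filters in the given order.

    A tuple reaches a filter only if it passed every earlier one.
    """
    total, surviving = 0.0, 1.0
    for _name, cost, selectivity in order:
        total += surviving * cost
        surviving *= selectivity
    return total

def optimize_once(filters):
    """Static optimization: pick the cheapest order for today's statistics."""
    return min(permutations(filters), key=expected_cost)
```

For `[("A", 1.0, 0.9), ("B", 5.0, 0.1), ("C", 2.0, 0.5)]` the chosen order is C, B, A. If the selectivities of A and B later swap, that fixed plan keeps running even though a different order would now be much cheaper, which is the weakness of optimizing only at submission time.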
Competitive Eddies : Competitive Eddies
Potter’s Wheel Anomaly Detection : Potter’s Wheel Anomaly Detection
The Data Flood is Real : The Data Flood is Real Source: J. Porter, Disk/Trend, Inc.
http://www.disktrend.com/pdf/portrpkg.pdf
Disk Appetite, cont. : Disk Appetite, cont. Greg Papadopoulos, CTO Sun:
Disk sales doubling every 9 months
Note: only counts the data we’re saving!
Translate:
Time to process all your data doubles every 18 months
MOORE’S LAW INVERTED!
(and Moore’s Law may run out in the next couple decades?)
Big challenge (opportunity?) for SW systems research
Traditional scalability research won’t help
“Ideal” linear scaleup is NOT NEARLY ENOUGH!
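The "Translate" step can be checked with a few lines of arithmetic. The sketch assumes that processing speed doubles every 18 months (the commonly quoted Moore's Law figure, which the slide implies but does not state) and that stored data tracks disk sales; the function name is hypothetical.

```python
def processing_time_doubling_months(data_doubling_months, compute_doubling_months):
    """Months for (data volume / compute speed) to double.

    Data grows as 2**(t/d) and compute as 2**(t/c), so their ratio grows as
    2**(t * (1/d - 1/c)) and doubles when t * (1/d - 1/c) == 1.
    """
    rate = 1.0 / data_doubling_months - 1.0 / compute_doubling_months
    if rate <= 0:
        raise ValueError("compute keeps up: processing time does not grow")
    return 1.0 / rate
```

With data doubling every 9 months and compute every 18, the rate is 1/9 - 1/18 = 1/18 per month, so the time to process everything doubles about every 18 months, matching the slide.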
Data Volume: Prognostications : Data Volume: Prognostications Today
SwipeStream
E.g. Wal-Mart 24 Tb Data Warehouse
ClickStream
Web
Internet Archive: ?? Tb
Replicated OS/Apps
Tomorrow
Sensors Galore
DARPA/Berkeley “Smart Dust”
Note: the privacy issues only get more complex!
Both technically and ethically
[Figure labels: temperature, light, humidity, pressure, accelerometer, magnetics]
Explaining Disk Appetite : Explaining Disk Appetite Areal density increases 60%/yr
Yet Mb/$ rises much faster!
Source: J. Porter, Disk/Trend, Inc.
http://www.disktrend.com/pdf/portrpkg.pdf