Quality Aware Sensor Database (QUASAR) Project** : Quality Aware Sensor Database (QUASAR) Project** Sharad Mehrotra
Department of Information and Computer Science
University of California, Irvine **Supported in part by a collaborative NSF ITR grant entitled “real-time data capture, analysis, and querying of dynamic spatio-temporal events” in collaboration with UCLA, U. Maryland, U. Chicago
Talk Outline : Talk Outline Quasar Project
motivation and background
data collection and archival components
query processing
tracking application using QUASAR framework
challenges and ongoing work
Brief overview of other research projects
MARS Project - incorporating similarity retrieval and refinement over structured and semi-structured data to aid interactive data analysis/mining
Database as a Service (DAS) Project - supporting the application service provider model for data management
Emerging Computing Infrastructure… : Emerging Computing Infrastructure… Instrumented
wide-area spaces In-body, in-cell, in-vitro spaces Generational advances to computing infrastructure
sensors will be everywhere
Emerging applications with limitless possibilities
real-time monitoring and control, analysis
New challenges
limited bandwidth & energy
highly dynamic systems
System architectures are due for an overhaul
at all levels of the system OS, middleware, databases, applications
Impact to Data Management … : Impact to Data Management … Traditional data management
client-server architecture
efficient approaches to data storage & querying
query shipping versus data shipping
data changes with explicit update
Emerging Challenge
data producers must be considered as “first class” entities
sensors generate continuously changing highly dynamic data
sensors may store, process, and communicate data
Data Management Architecture Issues : Data Management Architecture Issues Where to store data?
Do not store -- stream model
not suitable if we wish to archive data for future analysis or if data is too important to lose
at the producers
limited storage, network, compute resources
at the servers
server may not be able to cope with high data production rates. May lead to data staleness and/or wasted resources
Where to compute?
At the client, server, data producers Data/query request Data/query result server producer cache
Quasar Architecture : Quasar Architecture Hierarchical architecture
data flows from producers to server to clients periodically
queries flow the other way:
If client cache does not suffices, then
query routed to appropriate server
If server cache does not suffice, then access current data at producer
This is a logical architecture-- producers could also be clients. server Client cache Server cache & archive producer cache data flow Query flow
Quasar: Observations & Approach : Quasar: Observations & Approach Applications can tolerate errors in sensor data
applications may not require exact answers:
small errors in location during tracking or error in answer to query result may be OK
data cannot be precise due to measurement errors, transmission delays, etc.
Communication is the dominant cost
limited wireless bandwidth, source of major energy drain
Quasar Approach
exploit application error tolerance to reduce communication between producer and server
Two approaches
Minimize resource usage given quality constraints
Maximize quality given resource constraints
Quality-based Data Collection Problem : Quality-based Data Collection Problem Let P = < p[1], p[2], …, p[n] > be a sequence of environmental measurements (time series) generated by the producer, where n = now
Let S = be the server side representation of the sequence
A within- quality data collection protocol guarantees that
for all i error(p[i], s[i]) <
is derived from application quality tolerance Sensor time series
…p[n], p[n-1], …, p[1]
Simple Data Collection Protocol : Simple Data Collection Protocol sensor Logic (at time step n)
Let p’ = last value sent to server
if error(p[n], p’) >
send p[n] to server
server logic (at time step n)
If new update p[n] received at step n
s[n] = p[n]
Else
s[n] = last update sent by sensor
guarantees maximum error at server less than equal to Sensor time series
…p[n], p[n-1], …, p[1]
Exploiting Prediction Models : Exploiting Prediction Models Producer and server agree upon a prediction model (M, )
Let spred[i] be the predicted value at time i based on (M, )
sensor Logic (at time step n)
if error(p[n], spred[n] ) >
send p[n] to server
server logic (at time step n)
If new update p[n] received at step n
s[n] = p[n]
Else
s[n] = spred[n] based on model (M, )
Challenges in Prediction : Challenges in Prediction Simple versus complex models?
Complex and more accurate models require more parameters (that will need to be transmitted).
Goal is to minimize communication not necessarily best prediction
How is a model M generated?
static -- one out of a fixed set of models
dynamic -- dynamically learn a model from data
When should a model M or parameters be changed?
immediately on model violation:
too aggressive -- violation may be a temporary phenomena
never changed:
too conservative -- data rarely follows a single model
Challenges in Prediction (cont.) : Challenges in Prediction (cont.) who does the model update?
Server
Long-haul prediction models possible, since server maintains history
might not predict recent behavior well since server does not know exact S sequence; server has only samples
extra communication to inform the producer
Producer
better knowledge of recent history
long haul models not feasible since producer does not have history
producers share computation load
Both
server looks for new models, sensor performs parameter fitting given existing models.
Archiving Sensor Data : Archiving Sensor Data Often sensor-based applications are built with only the real-time utility of time series data.
Values at time instants <
Problem Formulation : Problem Formulation Let P = < p[1], p[2], …, p[n] > be the sensor time series
Let S = < s[1], s[2], …, s[n] > be the server side representation
A within archive quality data archival protocol guarantees that
error(p[i], s[i]) < archive
Trivial Solution: modify collection protocol to collect data at quality guarantee of min(archive , collect)
then prediction model by itself will provide a archive quality data stream that can be archived.
Better solutions possible since
archived data not needed for immediate access by real-time or forecasting applications (such as monitoring, tracking)
compression can be used to reduce data transfer
Data Archival Protocol : Data Archival Protocol Sensors compresses observed time series p[1:n] and sends a lossy compression to the server
At time n :
p[1:n-nlag] is at the server in compressed form s’ [1:n-nlag] within-archive
s[n-nlag+1:n] is estimated via a predictive model (M, )
collection protocol guarantees that this remains within- collect
s[n+1:] can be predicted but its quality is not guaranteed (because it is in the future and thus the sensor has not observed these values) …p[n], p[n-1], .. compress Sensor memory buffer Sensor updates for data collection Compressed representation for archiving processing at sensor exploited to reduce communication cost and hence battery drain
Slide16 : Piecewise Constant Approximation (PCA) Given a time series Sn = s[1:n] a piecewise constant approximation of it is a sequence
PCA(Sn) = < (ci, ei) >
that allows us to estimate s[j] as:
scapt [j] = ci if j in [ei-1+1, ei]
= c1 if j
Slide17 : Online Compression using PCA Goal: Given stream of sensor values, generate a within-archive PCA representation of a time series
Approach (PMC-midrange)
Maintain m, M as the minimum/maximum values of observed samples since last segment
On processing p[n], update m and M if needed
if M - m > 2archive , output a segment ((m+M )/2, n)
Time Value Example: archive = 1.5 1 2 3 4 5
Slide18 : Online Compression using PCA PMC-MR …
guarantees that each segment compresses the corresponding time series segment to within-archive
requires O(1) storage
is instance optimal
no other PCA representation with fewer segments can meet the within-archive constraint
Variant of PMC-MR
PMC-MEAN, which takes the mean of the samples seen thus far instead of mid range.
Slide19 : Improving PMC using Prediction Observation: Prediction models guarantee a within- collect version of the time series at server even before the compressed time series arrives from the producer.
Can the prediction model be exploited to reduce the overhead of compression.
If archive> collect no additional effort is required for archival --> simply archive the predicted model.
Approach:
Define an error time series E[i] = p[i]-spred[i]
Compress E[1:n] to within-archive instead of compressing p[1:n]
The archive contains the prediction parameters and the compressed error time series
Within-archive of E[I] + (M, ) can be used to reconstruct a within- archive version of p
Slide20 : Combing Compression and Prediction (Example)
Slide21 : Estimating Time Series Values Historical samples (before n-nlag) is maintained at the server within-archive
Recent samples (between n-nlag+1 and n) is maintained by the sensor and predicted at the server.
If an application requires q precision, then:
if q collect then it must wait for time in case a parameter refresh is en route
if q archive but q < collect then it may probe the sensor or wait for a compressed segment
Otherwise only probing meets precision
For future samples (after n) immediate probing not available as an option
Slide22 : Experiments Data sets:
Synthetic Random-Walk
x[1] = 0 and x[i]=x[i-1]+sn where sn drawn uniformly from [-1,1]
Oceanographic Buoy Data
Environmental attributes (temperature, salinity, wind-speed, etc.) sampled at 10min intervals from a buoy in the Pacific Ocean (Tropical Atmosphere Ocean Project, Pacific Marine Environment Laboratory)
GPS data collected using IPAQs
Experiments to test:
Compression Performance of PMC
Benefits of Model Selection
Query Accuracy over Compressed Data
Benefits of Prediction/Compression Combination
Slide23 : Compression Performance K/n ratio: number of segments/number of samples
Slide24 : Query Performance Over Compressed Data “How many sensors have values >v?” (Mean selectivity = 50)
Slide25 : Impact of Model Selection K/n ratio: number of segments/number of samples. pred is the localization tolerance in meters Objects moved at approximately constant speed (+ measurement noise)
Three models used:
loc[n] = c
loc[n] = c+vt
loc[n] = c+vt+0.5at2
Parameters v, a were estimated at sensor over moving-window of 5 samples
Slide26 : Combining Prediction with Compression K/n ratio: number of segments/number of samples
Slide27 : QUASAR Client
Time Series GPS Mobility Data from Mobile Clients (iPAQs) Latitude Time Series: 1800 samples Compressed Time Series (PMC-MR, ICDE 2003)
Accuracy of ~100 m
130 segments
Query Processing in Quasar : Query Processing in Quasar Problem Definition
Given
sensor time series with quality-guarantees captured at the server
A query with a specified quality-tolerance
Return
query results incurring least cost
Techniques depend upon
nature of queries
Cost measures
resource consumption -- energy, communication, I/O
query response time
Aggregate Queries : Aggregate Queries S minQ = 2
maxQ = 7
countQ = 3
sumQ = 2+7+6 = 15
avgQ = 15/3 = 5
Processing Aggregate Queries (minimize producer probe) : Processing Aggregate Queries (minimize producer probe) MIN Query
c = minj(si.high)
b = c - query
Probe all sensors where sj.low < b
only s1 and s3 will be probed
Sum Query
select a minimal subset S’ S such that
si in S’ (jpred) >= si in S(jpred)- query
If query = 15, only s1 will be probed Let S = be set of sensors that meet the query criteria
si.high = sipred[t] + jpred
sj.low = sipred[t] - jpred a b c s1 s2 sn s3 s1 s2 s5 s3 s4 10 5 2 5 3
Minimizing Cost at Server : Minimizing Cost at Server Error tolerance of queries can be exploited to reduce processing at server.
Key Idea
Use a multi-resolution index structure (MRA-tree) for processing aggregate queries at server.
An MRA-Tree is a modified multi-dimensional index trees (R-Tree, quadtree, Hybrid tree, etc.)
A non-leaf node contains (for each of its subtrees) four aggregates {MIN,MAX,COUNT,SUM}
A leaf node contains the actual data points (sensor models)
MRA Tree Data Structure : MRA Tree Data Structure Spatial View A B C D E F G A B C D E F G Tree Structure View
Slide33 : Non-Leaf Node Disk Page Pointers
(each costs 1 I/O) Leaf Node Probe “Pointers”
(each costs 2 messages) MRA-Tree Node Structure
Slide34 : Two sets of nodes:
NP (partial contribution to the query)
NC (complete contribution) Node Classification
Aggregate Queries using MRA Tree : Aggregate Queries using MRA Tree Initialize NP with the root
At each iteration: Remove one node N from NP and for each Nchild of its children
discard, if Nchild disjoint with Q
insert into NP if Q is contained or partially overlaps with Nchild
“insert” into NC if Q contains Nchild (we only need to maintain aggNC)
compute the best estimate based on contents of NP and NC Q N
MIN (and MAX) : MIN (and MAX) 3 9 4 5 Interval
minNC = min { 4, 5 } = 4
minNP = min { 3, 9 } = 3
L = min {minNC, minNP} = 3
H = minNC = 4
hence, I = [3, 4]
Estimate
Lower bound:
E(minQ) = L = 3
Slide37 : MRA Tree Traversal A B C D E F G Progressive answer refinement until NP is exhausted
Greedy priority-based local decision for next node to be explored based on:
Cost (1 I/O or 2 messages)
Benefit (Expected Reduction in answer uncertainty)
Adaptive Tracking of mobile objects in sensor networks : Adaptive Tracking of mobile objects in sensor networks Tracking Architecture
A network of wireless acoustic sensors arranged as a grid transmitting via a base station to server
A track of the mobile object generated at the base station or server
Objective
Track a mobile object at the server such that the track deviates from the real trajectory within a user defined error threshold track with minimum communication overhead.
Sensor Model : Sensor Model Wireless sensors : battery operated, energy constrained
Operate on the received acoustic waveforms
Signal attenuation of target object given by : Is(t) = P /4 r2
P : source object power
r= distance of object from sensor
Is(t) = intensity reading at time t at ith sensor
Ith : Intensity threshold at ith sensor
Sensor States : Sensor States S0 : Monitor ( processor on, sensor on, radio off )
shift to S1 if intensity above threshold
S1 : Active state ( processor on, sensor on, radio on)
send intensity readings to base station.
On receiving message from BS containing error tolerance shift to S2
S2 : Quasi-active (processor on, sensor on, radio intermittent)
send intensity reading to BS if error from previous reading exceeds error threshold
Quasar Collection approach used in Quasi-active state
Server side protocol : Server side protocol Server maintains:
list of sensors in the active/ quasi-active state
history of their intensity readings over a period of time
Server Side Protocol
convert track quality to a relative intensity error at sensors
Send relative intensity error to sensor when sensor state = S1( quasi- active state)
Triangulate using n sensor readings at discrete time intervals.
Basic Triangulation Algorithm (using 3 sensor readings) : Basic Triangulation Algorithm (using 3 sensor readings) P: source object power, Ii = intensity reading at ith sensor
(x-x1)2 + (y- y1)2 = P/4 I1
(x-x2)2 + (y- y2)2 = P/4 I2
(x-x3)2 + (y- y3)2 = P/4 I3
Solving we get (x, y)=f(x1,x2,x3,y1,y2,y3, P,I1, I2 , I3, ) (x, y) More complex approaches to amalgamate more than three sensor readings possible
Based on numerical methods -- do not provide a closed form equation between sensor reading and tracking location !
Server can use simple triangulation to convert track quality to sensor intensity quality tolerances and a more complex approach to track.
Adaptive Tracking : Mapping track quality to sensor reading : Adaptive Tracking : Mapping track quality to sensor reading Intensity
( I2 ) time I2 Intensity
( I3 ) time I3 t i t( i+1 ) t i t( i+1 ) t i t( i+1 ) X (m) Y (m) Claim 1 (power constant)
Let Ii be the intensity value of sensor
If then, track quality is guaranteed to be within track
where and C is a constant derived from the known locations of the sensors and the power of the object.
Claim 2 (power varies between [Pmin , Pmax ])
If then
track quality is guaranteed to be within track where C’ = C/ P2 and is a constant .
The above constraint is a conservative estimate. Better bounds possible track
Adaptive Tracking: prediction to improve performance : Communication overhead further reduced by exploiting the predictability of the object being tracked
Static Prediction : sensor & server agree on a set of prediction models
only 2 models used: stationary & constant velocity
Who Predicts: sensor based mobility prediction protocol
Every sensor by default follows a stationary model
Based on its history readings may change to constant velocity model (number of readings limited by sensor memory size)
informs server of model switch Adaptive Tracking: prediction to improve performance
Actual Track versus track on Adaptive Tracking (error tolerance 20m) : Actual Track versus track on Adaptive Tracking (error tolerance 20m) A restricted random motion : the object starts at (0,d) and moves from one node to another randomly chosen node until it walks out of the grid.
Energy Savings due to Adaptive Tracking : Energy Savings due to Adaptive Tracking total energy consumption over all sensor nodes for random mobility model with varying track or track error.
significant energy savings using adaptive precision protocol over non adaptive tracking ( constant line in graph)
for a random model, prediction does not work well !
Energy consumption with Distance from BS : Energy consumption with Distance from BS total energy consumption over all sensor nodes for random mobility model with varying base station distance from sensor grid.
As base station moves away, one can expect energy consumption to increase since transmission cost varies as d n ( n =2 )
adaptive precision algorithm gives us better results with increasing base station distance
Challenges & Ongoing Work : Challenges & Ongoing Work Ongoing Work:
Supporting a larger class of SQL queries
Supporting continuous monitoring queries
Larger class of sensors (e.g., video sensors)
Better approaches to model fitting/switching in prediction
In the future:
distributed Quasar architecture
optimizing quality given resource constraints
supporting applications with real-time constraints
dealing with failures
The DAS Project** : The DAS Project** Goals:
Support Database as a Service on the Internet
Collaboration:
IBM (Dr. Bala Iyer)
UCI (Gene Tsudik)
** Supported in part by NSF ITR grant entitled “Privacy in Database as a Service” and by the IBM Corporation
Software as a Service : Software as a Service
Get …
what you need
when you need
Pay …
what you use
Don’t worry …
how to deploy, implement, maintain, upgrade
Software As a Service: Why? : Software As a Service: Why? Advantages
reduced cost to client
pay for what you use and not for hardware, software infrastructure or personnel to deploy, maintain, upgrade…
reduced overall cost
cost amortization across users
Better service
leveraging experts across organizations Driving Forces
Faster, cheaper, more accessible networks
Virtualization in server and storage technologies
Established e-business infrastructures
Already in Market
ERP and CRM (many examples)
More horizontal storage services, disaster recovery services, e-mail services, rent-a-spreadsheet services etc.
Sun ONE, Oracle Online Services, Microsoft .NET My Services etc Better Service for Cheaper
Database As a Service : Database As a Service Why?
Most organizations need DBMSs
DBMSs extremely complex to deploy, setup, maintain
require skilled DBAs with high cost
What do we want to do? : What do we want to do? Database as a Service (DAS) Model
DB management transferred to service provider for
backup, administration, restoration, space management, upgrades etc.
use the database “as a service” provided by an ASP
use SW, HW, human resources of ASP, instead of your own
Application Service Provider (ASP) Server BUT….
Challenges : Challenges Economic/business model?
How to charge for service, what kind of service guarantees can be offered, costing of guarantees, liability of service provider.
Powerful interfaces to support complete application development environment
User Interface for SQL, support for embedded SQL programming, support for user defined interfaces, etc.
Scalability in the web environment
overheads due to network latency (data proxies?)
Privacy and Security
Protecting data at service providers from intruders and attacks.
Protecting clients from misuse of data by service providers
Ensuring result integrity
Data privacy from service provider : Data privacy from service provider Encrypted User Database User Data
The problem is we do not trust “the service provider” for sensitive information!
Fact 1: Theft of intellectual property due to database vulnerabilities costs American businesses $103 billion annually
Fact 2: 45% of those attacks are conducted by insiders! (CSI/FBI Computer Crime and Security Survey, 2001)
encrypt the data and store it
but still be able to run queries over the encrypted data
do most of the work at the server Server Application Service Provider Untrusted Server Site
System Architecture : System Architecture
Server Site Original Query Server Side Query Encrypted Results Actual Results Service Provider Client Site Client Side Query ? ? ?
NetDB2 Service : NetDB2 Service
Developed in collaboration with IBM
Deployed on the Internet about 2 years ago
Been used by 15 universities and more than 2500 students to help teaching database classes
Currently offered through IBM Scholars Program
MARS Project** : MARS Project** Goals: integration of similarity retrieval and query refinement over structured and semi-structured databases to help interactive data analysis/mining
**Supported in part by NSF CAREER award, NSF grant entitled “learning digital behavior” and a KDD grant entitled “Mining events and entities over large spatio-temporal data sets”
Similarity Search in Databases (SR) : Similarity Search in Databases (SR) Alice Honda sedan,
inexpensive,
after 1994,
around LA
Query Refinement (QR) : Refined Results Query Refinement (QR)
Why are SR and QR important? : Why are SR and QR important? Most queries are similarity searches
Specially in exploratory data analysis tasks (e.g., catalog search)
Users have only a partial idea of their information need
Existing Search technologies (text retrieval, SQL) do not provide appropriate support for SR and (almost) no support for QR.
Users must artificially convert similarity queries to keyword-searches or exact-match queries
Good mappings difficult or not feasible
Lack of good knowledge of the underlying data or its structure
Exact-match may be meaningless for certain data types (e.g., images, text, multimedia)
Slide62 : Similarity Access and Interactive Mining Architecture Search Client Query Session
Manager Query Log
Manager/Miner Similarity
Query
Processor Feedback-based
Refinement
Method ORDBMS Similarity
Operators Types Feedback
Table Database Query
Log Answer
Table Scores
Table Refinement
Manager History-based
Refinement
Method Query/Feedback Ranked Results Initial
Query Feedback Ranked
Results Query Results Schemes Ranking
Rules Legend:
--- logging
__ Process
MARS Challenges... : MARS Challenges... Learning queries from
user interactions
user profiles
past history of other users
Efficient implementation of
similarity queries
refined queries
Role of similarity queries in
OLAP
interactive data mining
Slide64 : Query-Session Manager -parse the query
- check query validity
-generate schema for support tables
- maintain sessions registry Similarity Query Processor -executes query on ORDBMS
- ranks results (e.g. can exclude already
see tuples, etc)
- logs query(query or Top-k) Refinement Manager - maintains a registry of query refinement
policies (content/collaborative)
- generates the scores table
- identifies and invokes intra-predicate refiners. Query Log Manager/Miner - maintains query log
. Initial-Final pair
. Top-K results
. Complete trajectory
- Query-query similarity (can have multiple policies)
- Query clustering
Text Search Technologies (Altavista, Verity, Vality, Infoseek) : Text Search Technologies (Altavista, Verity, Vality, Infoseek) Strengths
support ranked retrieval
can handle missing data, synonyms, data entry errors Approach
convert enterprise structured data into a searchable text index. Limitations
cannot capture semantics of relationships in data
cannot capture semantics of non-text fields (e.g., multimedia)
limited support for refinement or preferences in current systems
cannot express similarity queries over structured or semi-structured data (e.g., price, location) Movies Actors Directors Al Pacino acted in a movie directed by Francis Ford Coppola
SQL-based Search TechnologiesOracle, Informix, DB2, Mercado : SQL-based Search Technologies Oracle, Informix, DB2, Mercado Approach
translate similarity query into exact SQL query. Strengths
support structured as well as semi-structured data
support for arbitrary data types
Scalable attribute-based lookup Limitations
translation is difficult or not possible
difficult to guess right ranges
causes near misses
not feasible for non-numeric fields
cannot rank answers based on relevance
does not account for user preference or query refinement select * from user_car_catalog where model = Honda Accord and 1993 £ year £1995 and dist(90210) £ 50 and price < 5000