Talk Maha

Uploaded from authorPOINTLite
Views:
 
Category: Entertainment
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Large-scale Distributed Databases: State of the Art & Open Issues: 

Large-scale Distributed Databases: State of the Art & Open Issues Maha Abdallah LIP6, Paris VI

Why a DBMS ?: 

Why a DBMS ? Simplified Data Management & Access Data Definition Language (DDL) Data Manipulation Language (DML) Query Language (SQL) Advanced DBMS Functionalities Query Engine Query Optimizer Transaction Management Concurrency control Fault-tolerance

DBMS : A Brief History: 

DBMS : A Brief History Client Database DB Server DBMS Connections (ODBC, JDBC) Client Client A Client-Server Architecture

Relational Data Representation: 

Relational Data Representation CREATE TABLE Accounts ( AccountID varchar(30) PRIMARY KEY, CustomerID varchar(30) NOT NULL, Branch varchar(10) NOT NULL, Balance Decimal(10, 2), …) Balance Branch CustomerID AccountID Accounts table INSERT INTO Accounts VALUES (‘200’, ‘5’, ‘New York’, 3500.84) 3500.84 New York 5 200 200 SELECT Balance FROM Accounts WHERE AccountID = ‘400’; . . . New York San Francisco Secondary Index

Relational Data Representation: 

Relational Data Representation Accounts table Balance Branch CustomerID AccountID 3500.84 New York 5 200 333.00 New York 4 456 500.00 San Francisco 6 400 340.14 San Francisco 3 350 23.17 New York 1 345 200.00 San Francisco 2 324 Phone # Address Name CustomerID 1238357982 Main St, NY JONES 1 1243658790 Market St, SF KHAN 6 1276890445 41th St, NY ONO 5 3658932698 5th. Av, NY GREEN 4 6790365138 Cole St, SF SMITH 3 7693195680 30th Av, SF GRAY 2 Customers table SELECT Phone # FROM Accounts, Customers WHERE AccountID = ‘350’ AND Accounts.CustomerID = Customers.CustomerID ; Join Query

Distributed DBMS Design: 

Distributed DBMS Design Top-down Approach Global schema definition Data fragmentation Local sites for fragments storage Transparent

Distributed DBMS Design: 

Distributed DBMS Design Horizontal Fragmentation Vertical Fragmentation (Accounts) (Accounts) branch = ‘New York’ accountID, balance 333.00 New York 4 456 500.00 San Francisco 6 400 340.14 San Francisco 3 350 23.17 New York 1 345 200.00 San Francisco 2 324 3500.84 New York 5 200 Balance Branch CustomerID AccountID

Distributed DBMS Design: 

Distributed DBMS Design Bottom-up Approach Integration of pre-existing legacy systems Mediated schema approach Others: Ontology-based approach, … Global As View (GAV) Global schema as view over Source ones query(G) = query(f(S1,…, Sn)) Local As View (LAV) Sources as views over global schema query(G) = query(f1-1 (S1),…, fn-1 (Sn)) Integrated Schema User view Integration Mediator OODBMS ORDBMS Heterogeneous Systems RDBMS Adapter Adapter Adapter Local TX

Distributed DBMSs…: 

Distributed DBMSs… Assessment in large-scale environments Static approaches Centralized Mediation Schema Centralized / Replicated Global Index Current DB technologies DO NOT SCALE P2P: a key paradigm for scalable solutions Can we Bring P2P systems’ scalability to the DB world ?

Why P2P ?: 

Why P2P ? A distributed system architecture Typically many nodes, unreliable and heterogeneous No centralized control Nodes are symmetric in function Harness distributed, shared resources (bandwidth, CPU, storage) on peer nodes Fault-tolerant, self-organizing Dynamic environment with frequent join and leave

P2P Challenge : Locating Content: 

P2P Challenge : Locating Content Unstructured Systems Napster : Centralized global index Gnutella : No global index Flooding who has file F ? 1 Node 1 Napster central server Napster Gnutella

P2P Challenge : Locating Content: 

P2P Challenge : Locating Content Structured Systems Distributed Hash Tables Efficient : hash-based indexing Scalable : distributed global index ... x y z hash table hash bucket insert (key, data) lookup (key) → data 0 1 2 ... node insert (key, data) lookup (key) → data N-1 Hash Tables Associate data with keys Key is hashed to find bucket in hash table DHTs Associate data with keys Key is hashed to find peer node in the system

Internet Scale DHTs : A Case Study : 

Internet Scale DHTs : A Case Study Content Addressable Networks (CAN) Interface - insert(key,value) - value = retrieve(key)

DHTs : Dealing with System Dynamicity: 

DHTs : Dealing with System Dynamicity Consistent Hashing Fixed hash space shared by all peer nodes 2-dimensional space 1 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

CAN : Sharing Hash Space 1 2 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

CAN : Sharing Hash Space 1 2 3 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

1 2 3 4 CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

(1) a = hx(K) CAN: Example x = a node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

(1) a = hx(K) b = hy(K) CAN: Example x = a y = b node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

(1) a = hx(K) b = hy(K) CAN: Example (2) route(K,V) -> (a,b) node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example (2) route(K,V) -> (a,b) (3) (a,b) stores (K,V) (K,V) node I::insert(K,V) I (1) a = hx(K) b = hy(K) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example (2) route “retrieve(K)” to (a,b) (K,V) (1) a = hx(K) b = hy(K) node J::retrieve(K) J By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: routing table: 

CAN: routing table A node only maintains state for its immediate neighboring nodes By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: routing: 

CAN: routing (a,b) (x,y) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion Bootstrap node Discover some node “I” already in CAN new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion I new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion 2) pick random point in space I (p,q) new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion (p,q) I routes to (p,q), discovers node J I J new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion new J Split J’s zone in half… new owns one half By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

DHTs…: 

DHTs… Assessment Coarse-grained indexing File level data sharing Exact-match lookups Random graph construction w.r.t data semantics Read-only data sharing Key Issues for Wider Applicability Fine-grained data indexing Content-based / keyword lookups Approximate / Range queries Semantic awareness Read/Write data access Impoverished query language

Bridging the Gap…: 

Bridging the Gap… Can we bring P2P technologies to the DBMS world ? DBMS Technologies P2P Technologies Scalable Fine-grained Sharing of Semantically Rich Data Targeted Applications : Scientific data sharing among research communities

Fine-Grained Indexing: Existing Work: 

Fine-Grained Indexing: Existing Work PeerDB project Unstructured network NO globally shared schema SQL-like query language Mobile agents for remote query execution No indexing mechanisms Flooding-based data search

Fine-Grained Indexing: Existing Work: 

Fine-Grained Indexing: Existing Work PIER project DHT-based tuple/value level indexing SQL-like query language Standard shared schema Inefficient for data-intensive applications No range queries

Fine-Grained Indexing: Our Approach: 

Fine-Grained Indexing: Our Approach DHT-based Relation/attribute level indexing NO globally shared schema SQL-like query language No direct value access No range queries

Relational Data Indexing: 

Relational Data Indexing Peers support relational data Heterogeneous local schemas HOW? : Meta-Data for semantic/structural content description Relations set of keywords Attributes set of keywords SELECT treatment FROM Sickness WHERE Name = “Cancer” General query Q

Slide39: 

Relational Data Indexing

Slide40: 

Relational Data Indexing Relation & Attribute indexing (key, value) hash index, ∀ key ∈ {relation name, relation keywords} value = node IP@ Sample hash index : Relation entries h(sickness) Peer3 h(Illness) Peer2

Relational Data Indexing: 

Relational Data Indexing Sample hash index : Attributes entries h(sickness||name) Peer3 h(sickness||treatment) Peer1 Peer1 issues query Q : SELECT treatment SELECT Attribute FROM Sickness FROM Relation WHERE sickName = “Cancer” WHERE Predicate h(sickness||name) Peer3 h(Illness||name) Peer2 h(sickness||treatment) Peer1 h(Illness||treatment) Peer4 Send request Q to Peers 1, 2 & 3 only

Slide42: 

Fine-Grained Data Indexing Assessment No globally shared index No globally shared schema Content-based look-up and access Highly expressive query language Extensions : Store value intervals in index What about range queries ? SELECT Attribute1 FROM Relation WHERE Attribute2 BETWEEN 350 AND 500

Slide43: 

Range Queries : Existing Work Gaiia project [UC, Santa Barbara] P2P distributed cache system DHT-based Assumes centralized DB server Server Clients P2P Cache Server Clients

Slide44: 

Range Queries : Gaiia (UCSB) Each attribute is mapped to a 2-d CAN space Space bounded by attribute domain Every queried range <x,y> is mapped to point (x,y) Super ranges in upper-left region x y 20 80 80 20 x 20 80 80 20 Attribute domain = [20, 80]

Slide45: 

Range Queries : Gaiia (UCSB) A

Slide46: 

Range Queries : Gaiia (UCSB) A C requires <50,55> P

Range Queries…: 

Range Queries… Gaiia : Assessment Enables range queries BUT… …assumes a central server ! Key perspectives Tree-like ordered indices where… Keys stored at leaf nodes Leaf nodes form a chained list Nodes mapped/hashed to peers

Semantic Clustering & Indexing: 

Semantic Clustering & Indexing Meta-Data for Content Description General core meta-data schema Theme-specialized schema Hierarchical semantic representation Semantic/content characterization Documents : keyword vectors Document indexing Index placement Peers : semantic vectors/matrices Semantic node clustering Semantic Groups/Subgroups

Slide49: 

Semantic Clustering & Indexing Vector construction from meta-data Vector values represent relevance degree for content description Semantic Proximity Node vectors <theme1, theme2, theme3,…> <sub-theme1, sub-theme2, sub-theme3,…>, etc. Semantic Distance d, where d = cosine of angle between vectors Neighbors chosen wrt d

Slide50: 

K V K V K V K V K V K V K V K V K V K V K V Semantic Clustering & Indexing

Slide51: 

Semantic Clustering & Indexing Direct Issues Group specification - Completeness, Validity Group management - Evolution ? Semantic distance between groups Beyond Semantic Grouping… Multi criteria grouping Confidence work spheres Other Issues Schema mapping Query reformulation Data / replica consistency …