MAHA Talk

Uploaded from authorPOINTLite
Views:
 
     
 

Presentation Description

No description available.

Comments

Presentation Transcript

Data Indexing and Sharing in Large-scale Distributed Systems: 

Data Indexing and Sharing in Large-scale Distributed Systems Maha Abdallah LIP6, Paris VI

DBMS : A Brief History: 

DBMS : A Brief History Application Files DB Server DBMS Connections (ODBC, JDBC) Application Application A Client-Server Architecture

Why a DBMS ?: 

Why a DBMS ? Simplified Data Management & Access Data Definition Language (DDL) Data Manipulation Language (DML) Specific DBMS Functionalities Query Engine Query Optimizer Transaction Management CREATE TABLE Accounts ( AccountID varchar(30) PRIMARY KEY, CustomerID varchar(30) NOT NULL, Branch varchar(10) NOT NULL, Balance Decimal(10, 2), …) INSERT INTO Accounts VALUES (‘200’, ‘JONES’, ‘New York’, 3500.84)

Relational Data Representation: 

Relational Data Representation 333.00 New York KHAN 456 500.00 San Francisco ONO 400 340.14 San Francisco GREEN 350 23.17 New York SMITH 345 200.00 San Francisco GRAY 324 1000.00 New York JONES 200 Balance Branch CustomerID AccountID . . . New York San Francisco Primary Index Index

Distributed DBMS Design: 

Distributed DBMS Design Top-down Approach Global schema definition Data fragmentation Local sites for fragments storage Transparent

Distributed DBMS Design: 

Distributed DBMS Design Horizontal Fragmentation (Account) branch = ‘New York’ Vertical Fragmentation (Account) accountID, balance

Distributed DBMS Design: 

Distributed DBMS Design Bottom-up Approaches Integration of preexisting legacy systems Mediated schema approach Others: Ontology-based approach, … Global As View (GAV) Global schema as view over Source ones query(G) = query(f(S1,…, Sn)) Local As View (LAV) Sources as views over global schema query(G) = query(f1-1 (S1),…, fn-1 (Sn)) Integrated Schema User view Integration Mediator OODBMS ORDBMS Heterogeneous Systems RDBMS Adapter Adapter Adapter

Distributed DBMSs…: 

Distributed DBMSs… Assessment in large-scale environments Central Mediation Schema Centralized / Replicated Global Index Issue : Current DB technologies DO NOT SCALE Can we Bring P2P systems’ scalability to the DB world ?

Why P2P ?: 

Why P2P ? A distributed system architecture Typically many nodes, unreliable and heterogeneous No centralized control Nodes are symmetric in function Harness distributed, shared resources (bandwidth, CPU, storage) on peer nodes Fault-tolerant, self-organizing Dynamic environment with frequent join and leave

P2P Challenge : Locating Content: 

P2P Challenge : Locating Content Unstructured Systems Napster : Centralized global index Gnutella : No global index Flooding who has file F ? 1 Node 1 Napster central server Napster Gnutella

P2P Challenge : Locating Content: 

P2P Challenge : Locating Content Structured Systems Distributed Hash Tables Efficient : hash-based indexing Scalable : distributed global index ... x y z hash table hash bucket insert (key, data) lookup (key) → data 0 1 2 ... node insert (key, data) lookup (key) → data N-1 Hash Tables Associate data with keys Key is hashed to find bucket in hash table DHTs Associate data with keys Key is hashed to find peer node in the system

Internet Scale DHTs : A Case Study : 

Internet Scale DHTs : A Case Study Content Addressable Networks (CAN) Interface - insert(key,value) - value = retrieve(key)

DHTs : Dealing with System Dynamicity: 

DHTs : Dealing with System Dynamicity Consistent Hashing Fixed hash space shared by all peer nodes 2-dimensional space 1 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

CAN : Sharing Hash Space 1 2 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

CAN : Sharing Hash Space 1 2 3 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

1 2 3 4 CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN : Sharing Hash Space: 

CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

(1) a = hx(K) CAN: Example x = a node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

(1) a = hx(K) b = hy(K) CAN: Example x = a y = b node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

(1) a = hx(K) b = hy(K) CAN: Example (2) route(K,V) -> (a,b) node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example (2) route(K,V) -> (a,b) (3) (a,b) stores (K,V) (K,V) node I::insert(K,V) I (1) a = hx(K) b = hy(K) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: Example: 

CAN: Example (2) route “retrieve(K)” to (a,b) (K,V) (1) a = hx(K) b = hy(K) node J::retrieve(K) J By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: routing table: 

CAN: routing table A node only maintains state for its immediate neighboring nodes By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: routing: 

CAN: routing (a,b) (x,y) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion Bootstrap node Discover some node “I” already in CAN new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion I new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion 2) pick random point in space I (p,q) new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion (p,q) I routes to (p,q), discovers node J I J new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

CAN: node insertion: 

CAN: node insertion new J Split J’s zone in half… new owns one half By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe Networks

DHTs…: 

DHTs… Assessment in large-scale environments Coarse-grained indexing by document ID File level Content and semantic unawareness Random graph construction No “practical” relation between neighbors, as… - Semantic proximity - Geographical proximity, … Read-only data sharing Problem: do not adapt to complex data sharing fine-grained data semantically rich data Read/Write data sharing

Bridging the Gap…: 

Bridging the Gap… Can we bring DBMSs’ technologies to the P2P world ? DBMS Technologies P2P Technologies scalable Fine-grained Sharing of Semantically Rich Data Targeted Applications : Scientific data sharing among research communities

Fine-Grained Indexing : Basic Idea: 

Fine-Grained Indexing : Basic Idea DHT-based approach Indexing granularity Relations, attributes Key-words Content-based look-up & access Strong query language expressiveness Relational algebra operators

Relational Data Indexing: 

Relational Data Indexing Peers support relational data No central mediated schema Global index distributed over peers Meta-Data for semantic/structural content description Relations set of keywords Attributes set of keywords SELECT treatment FROM Sickness WHERE Name = “Cancer” Typical query Q

Slide36: 

Relational Data Indexing

Slide37: 

Relational Data Indexing Relation & Attribute indexing (key, value) hash index, ∀ key ∈ {relation name, relation keywords} value = node IP@ Sample hash index : Relations entries h(sickness) Peer3 h(Illness) Peer2

Relational Data Indexing: 

Relational Data Indexing Sample hash index : Attributes entries h(sickness||name) Peer3 h(sickness||treatment) Peer1 Peer1 issues query Q SELECT treatment SELECT Attribute FROM Sickness FROM Relation WHERE name = “Cancer” WHERE Predicate h(sickness||name) Peer3 & h(Illness||name) Peer2 h(sickness||treatment) Peer1 & h(Illness||treatment) Peer4 Send request Q to Peers 1, 2 & 3 only

Slide39: 

Relational Data Indexing Assessment Scalability No global mediation schema Attribute-level distributed indexing, look-up and access Highly expressive query languages Current Extensions Store value intervals in index Support for range queries SELECT Attribute1 FROM Relation WHERE Attribute2 BETWEEN 350 AND 500 Still a lot to do…

Semantic Clustering & Indexing: 

Semantic Clustering & Indexing Meta-Data for Content Description General core meta-data schema Theme-specialized schema Hierarchical semantic representation Semantic/content characterization Documents : keyword vectors Document indexing Index placement Peers : semantic vectors/matrices Overlay construction Semantic node clustering Semantic Groups/Subgroups

Slide41: 

Semantic Clustering & Indexing Vector construction from meta-data Vector values represent relevance degree for content description Semantic Proximity Node vectors <theme1, theme2, theme3,…> <sub-theme1, sub-theme2, sub-theme3,…>, etc. Semantic Distance d, where d = cosine of angle between vectors Neighbors chosen wrt d

Slide42: 

K V K V K V K V K V K V K V K V K V K V K V Semantic Clustering & Indexing Document vectors : <keyword1, keyword2, …, keywordN> Document indexing : keywords hashing

Slide43: 

Semantic Clustering & Indexing Direct Issues Group specification - Completeness - Validity - Evolution Group management Semantic distance between groups Index placement wrt group semantics Other Issues Beyond semantic grouping … Confidence work spheres Multi criteria grouping And still… Data consistency and updates Schema mappings