logging in or signing up MAHA Talk VolteMort Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 42 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 27, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Data Indexing and Sharing in Large-scale Distributed Systems: Data Indexing and Sharing in Large-scale Distributed Systems Maha Abdallah LIP6, Paris VIDBMS : A Brief History: DBMS : A Brief History Application Files DB Server DBMS Connections (ODBC, JDBC) Application Application A Client-Server ArchitectureWhy a DBMS ?: Why a DBMS ? Simplified Data Management & Access Data Definition Language (DDL) Data Manipulation Language (DML) Specific DBMS Functionalities Query Engine Query Optimizer Transaction Management CREATE TABLE Accounts ( AccountID varchar(30) PRIMARY KEY, CustomerID varchar(30) NOT NULL, Branch varchar(10) NOT NULL, Balance Decimal(10, 2), …) INSERT INTO Accounts VALUES (‘200’, ‘JONES’, ‘New York’, 3500.84)Relational Data Representation: Relational Data Representation 333.00 New York KHAN 456 500.00 San Francisco ONO 400 340.14 San Francisco GREEN 350 23.17 New York SMITH 345 200.00 San Francisco GRAY 324 1000.00 New York JONES 200 Balance Branch CustomerID AccountID . . . New York San Francisco Primary Index IndexDistributed DBMS Design: Distributed DBMS Design Top-down Approach Global schema definition Data fragmentation Local sites for fragments storage Transparent Distributed DBMS Design: Distributed DBMS Design Horizontal Fragmentation (Account) branch = ‘New York’ Vertical Fragmentation (Account) accountID, balance Distributed DBMS Design: Distributed DBMS Design Bottom-up Approaches Integration of preexisting legacy systems Mediated schema approach Others: Ontology-based approach, … Global As View (GAV) Global schema as view over Source ones query(G) = query(f(S1,…, Sn)) Local As View (LAV) Sources as views over global schema query(G) = query(f1-1 (S1),…, fn-1 (Sn)) Integrated Schema User view Integration Mediator OODBMS ORDBMS Heterogeneous Systems RDBMS Adapter Adapter AdapterDistributed DBMSs…: Distributed DBMSs… Assessment in large-scale environments Central Mediation Schema Centralized / Replicated Global Index Issue : Current DB technologies DO NOT SCALE Can we Bring P2P systems’ scalability to the DB world ? Why P2P ?: Why P2P ? A distributed system architecture Typically many nodes, unreliable and heterogeneous No centralized control Nodes are symmetric in function Harness distributed, shared resources (bandwidth, CPU, storage) on peer nodes Fault-tolerant, self-organizing Dynamic environment with frequent join and leaveP2P Challenge : Locating Content: P2P Challenge : Locating Content Unstructured Systems Napster : Centralized global index Gnutella : No global index Flooding who has file F ? 1 Node 1 Napster central server Napster GnutellaP2P Challenge : Locating Content: P2P Challenge : Locating Content Structured Systems Distributed Hash Tables Efficient : hash-based indexing Scalable : distributed global index ... x y z hash table hash bucket insert (key, data) lookup (key) → data 0 1 2 ... node insert (key, data) lookup (key) → data N-1 Hash Tables Associate data with keys Key is hashed to find bucket in hash table DHTs Associate data with keys Key is hashed to find peer node in the system Internet Scale DHTs : A Case Study : Internet Scale DHTs : A Case Study Content Addressable Networks (CAN) Interface - insert(key,value) - value = retrieve(key) DHTs : Dealing with System Dynamicity: DHTs : Dealing with System Dynamicity Consistent Hashing Fixed hash space shared by all peer nodes 2-dimensional space 1 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: CAN : Sharing Hash Space 1 2 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: CAN : Sharing Hash Space 1 2 3 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: 1 2 3 4 CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: (1) a = hx(K) CAN: Example x = a node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: (1) a = hx(K) b = hy(K) CAN: Example x = a y = b node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: (1) a = hx(K) b = hy(K) CAN: Example (2) route(K,V) -> (a,b) node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example (2) route(K,V) -> (a,b) (3) (a,b) stores (K,V) (K,V) node I::insert(K,V) I (1) a = hx(K) b = hy(K) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example (2) route “retrieve(K)” to (a,b) (K,V) (1) a = hx(K) b = hy(K) node J::retrieve(K) J By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: routing table: CAN: routing table A node only maintains state for its immediate neighboring nodes By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: routing: CAN: routing (a,b) (x,y) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion Bootstrap node Discover some node “I” already in CAN new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion I new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion 2) pick random point in space I (p,q) new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion (p,q) I routes to (p,q), discovers node J I J new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion new J Split J’s zone in half… new owns one half By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksDHTs…: DHTs… Assessment in large-scale environments Coarse-grained indexing by document ID File level Content and semantic unawareness Random graph construction No “practical” relation between neighbors, as… - Semantic proximity - Geographical proximity, … Read-only data sharing Problem: do not adapt to complex data sharing fine-grained data semantically rich data Read/Write data sharing Bridging the Gap…: Bridging the Gap… Can we bring DBMSs’ technologies to the P2P world ? DBMS Technologies P2P Technologies scalable Fine-grained Sharing of Semantically Rich Data Targeted Applications : Scientific data sharing among research communities Fine-Grained Indexing : Basic Idea: Fine-Grained Indexing : Basic Idea DHT-based approach Indexing granularity Relations, attributes Key-words Content-based look-up & access Strong query language expressiveness Relational algebra operators Relational Data Indexing: Relational Data Indexing Peers support relational data No central mediated schema Global index distributed over peers Meta-Data for semantic/structural content description Relations set of keywords Attributes set of keywords SELECT treatment FROM Sickness WHERE Name = “Cancer” Typical query QSlide36: Relational Data IndexingSlide37: Relational Data Indexing Relation & Attribute indexing (key, value) hash index, ∀ key ∈ {relation name, relation keywords} value = node IP@ Sample hash index : Relations entries h(sickness) Peer3 h(Illness) Peer2 Relational Data Indexing: Relational Data Indexing Sample hash index : Attributes entries h(sickness||name) Peer3 h(sickness||treatment) Peer1 Peer1 issues query Q SELECT treatment SELECT Attribute FROM Sickness FROM Relation WHERE name = “Cancer” WHERE Predicate h(sickness||name) Peer3 & h(Illness||name) Peer2 h(sickness||treatment) Peer1 & h(Illness||treatment) Peer4 Send request Q to Peers 1, 2 & 3 only Slide39: Relational Data Indexing Assessment Scalability No global mediation schema Attribute-level distributed indexing, look-up and access Highly expressive query languages Current Extensions Store value intervals in index Support for range queries SELECT Attribute1 FROM Relation WHERE Attribute2 BETWEEN 350 AND 500 Still a lot to do…Semantic Clustering & Indexing: Semantic Clustering & Indexing Meta-Data for Content Description General core meta-data schema Theme-specialized schema Hierarchical semantic representation Semantic/content characterization Documents : keyword vectors Document indexing Index placement Peers : semantic vectors/matrices Overlay construction Semantic node clustering Semantic Groups/Subgroups Slide41: Semantic Clustering & Indexing Vector construction from meta-data Vector values represent relevance degree for content description Semantic Proximity Node vectors <theme1, theme2, theme3,…> <sub-theme1, sub-theme2, sub-theme3,…>, etc. Semantic Distance d, where d = cosine of angle between vectors Neighbors chosen wrt d Slide42: K V K V K V K V K V K V K V K V K V K V K V Semantic Clustering & Indexing Document vectors : <keyword1, keyword2, …, keywordN> Document indexing : keywords hashing Slide43: Semantic Clustering & Indexing Direct Issues Group specification - Completeness - Validity - Evolution Group management Semantic distance between groups Index placement wrt group semantics Other Issues Beyond semantic grouping … Confidence work spheres Multi criteria grouping And still… Data consistency and updates Schema mappings You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
MAHA Talk VolteMort Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 42 Category: News & Reports.. License: All Rights Reserved Like it (0) Dislike it (0) Added: September 27, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Data Indexing and Sharing in Large-scale Distributed Systems: Data Indexing and Sharing in Large-scale Distributed Systems Maha Abdallah LIP6, Paris VIDBMS : A Brief History: DBMS : A Brief History Application Files DB Server DBMS Connections (ODBC, JDBC) Application Application A Client-Server ArchitectureWhy a DBMS ?: Why a DBMS ? Simplified Data Management & Access Data Definition Language (DDL) Data Manipulation Language (DML) Specific DBMS Functionalities Query Engine Query Optimizer Transaction Management CREATE TABLE Accounts ( AccountID varchar(30) PRIMARY KEY, CustomerID varchar(30) NOT NULL, Branch varchar(10) NOT NULL, Balance Decimal(10, 2), …) INSERT INTO Accounts VALUES (‘200’, ‘JONES’, ‘New York’, 3500.84)Relational Data Representation: Relational Data Representation 333.00 New York KHAN 456 500.00 San Francisco ONO 400 340.14 San Francisco GREEN 350 23.17 New York SMITH 345 200.00 San Francisco GRAY 324 1000.00 New York JONES 200 Balance Branch CustomerID AccountID . . . New York San Francisco Primary Index IndexDistributed DBMS Design: Distributed DBMS Design Top-down Approach Global schema definition Data fragmentation Local sites for fragments storage Transparent Distributed DBMS Design: Distributed DBMS Design Horizontal Fragmentation (Account) branch = ‘New York’ Vertical Fragmentation (Account) accountID, balance Distributed DBMS Design: Distributed DBMS Design Bottom-up Approaches Integration of preexisting legacy systems Mediated schema approach Others: Ontology-based approach, … Global As View (GAV) Global schema as view over Source ones query(G) = query(f(S1,…, Sn)) Local As View (LAV) Sources as views over global schema query(G) = query(f1-1 (S1),…, fn-1 (Sn)) Integrated Schema User view Integration Mediator OODBMS ORDBMS Heterogeneous Systems RDBMS Adapter Adapter AdapterDistributed DBMSs…: Distributed DBMSs… Assessment in large-scale environments Central Mediation Schema Centralized / Replicated Global Index Issue : Current DB technologies DO NOT SCALE Can we Bring P2P systems’ scalability to the DB world ? Why P2P ?: Why P2P ? A distributed system architecture Typically many nodes, unreliable and heterogeneous No centralized control Nodes are symmetric in function Harness distributed, shared resources (bandwidth, CPU, storage) on peer nodes Fault-tolerant, self-organizing Dynamic environment with frequent join and leaveP2P Challenge : Locating Content: P2P Challenge : Locating Content Unstructured Systems Napster : Centralized global index Gnutella : No global index Flooding who has file F ? 1 Node 1 Napster central server Napster GnutellaP2P Challenge : Locating Content: P2P Challenge : Locating Content Structured Systems Distributed Hash Tables Efficient : hash-based indexing Scalable : distributed global index ... x y z hash table hash bucket insert (key, data) lookup (key) → data 0 1 2 ... node insert (key, data) lookup (key) → data N-1 Hash Tables Associate data with keys Key is hashed to find bucket in hash table DHTs Associate data with keys Key is hashed to find peer node in the system Internet Scale DHTs : A Case Study : Internet Scale DHTs : A Case Study Content Addressable Networks (CAN) Interface - insert(key,value) - value = retrieve(key) DHTs : Dealing with System Dynamicity: DHTs : Dealing with System Dynamicity Consistent Hashing Fixed hash space shared by all peer nodes 2-dimensional space 1 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: CAN : Sharing Hash Space 1 2 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: CAN : Sharing Hash Space 1 2 3 By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: 1 2 3 4 CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN : Sharing Hash Space: CAN : Sharing Hash Space By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: (1) a = hx(K) CAN: Example x = a node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: (1) a = hx(K) b = hy(K) CAN: Example x = a y = b node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: (1) a = hx(K) b = hy(K) CAN: Example (2) route(K,V) -> (a,b) node I::insert(K,V) I By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example (2) route(K,V) -> (a,b) (3) (a,b) stores (K,V) (K,V) node I::insert(K,V) I (1) a = hx(K) b = hy(K) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: Example: CAN: Example (2) route “retrieve(K)” to (a,b) (K,V) (1) a = hx(K) b = hy(K) node J::retrieve(K) J By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: routing table: CAN: routing table A node only maintains state for its immediate neighboring nodes By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: routing: CAN: routing (a,b) (x,y) By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion Bootstrap node Discover some node “I” already in CAN new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion I new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion 2) pick random point in space I (p,q) new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion (p,q) I routes to (p,q), discovers node J I J new node By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksCAN: node insertion: CAN: node insertion new J Split J’s zone in half… new owns one half By S.Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker ICIR, U.C.Berkeley, Tahoe NetworksDHTs…: DHTs… Assessment in large-scale environments Coarse-grained indexing by document ID File level Content and semantic unawareness Random graph construction No “practical” relation between neighbors, as… - Semantic proximity - Geographical proximity, … Read-only data sharing Problem: do not adapt to complex data sharing fine-grained data semantically rich data Read/Write data sharing Bridging the Gap…: Bridging the Gap… Can we bring DBMSs’ technologies to the P2P world ? DBMS Technologies P2P Technologies scalable Fine-grained Sharing of Semantically Rich Data Targeted Applications : Scientific data sharing among research communities Fine-Grained Indexing : Basic Idea: Fine-Grained Indexing : Basic Idea DHT-based approach Indexing granularity Relations, attributes Key-words Content-based look-up & access Strong query language expressiveness Relational algebra operators Relational Data Indexing: Relational Data Indexing Peers support relational data No central mediated schema Global index distributed over peers Meta-Data for semantic/structural content description Relations set of keywords Attributes set of keywords SELECT treatment FROM Sickness WHERE Name = “Cancer” Typical query QSlide36: Relational Data IndexingSlide37: Relational Data Indexing Relation & Attribute indexing (key, value) hash index, ∀ key ∈ {relation name, relation keywords} value = node IP@ Sample hash index : Relations entries h(sickness) Peer3 h(Illness) Peer2 Relational Data Indexing: Relational Data Indexing Sample hash index : Attributes entries h(sickness||name) Peer3 h(sickness||treatment) Peer1 Peer1 issues query Q SELECT treatment SELECT Attribute FROM Sickness FROM Relation WHERE name = “Cancer” WHERE Predicate h(sickness||name) Peer3 & h(Illness||name) Peer2 h(sickness||treatment) Peer1 & h(Illness||treatment) Peer4 Send request Q to Peers 1, 2 & 3 only Slide39: Relational Data Indexing Assessment Scalability No global mediation schema Attribute-level distributed indexing, look-up and access Highly expressive query languages Current Extensions Store value intervals in index Support for range queries SELECT Attribute1 FROM Relation WHERE Attribute2 BETWEEN 350 AND 500 Still a lot to do…Semantic Clustering & Indexing: Semantic Clustering & Indexing Meta-Data for Content Description General core meta-data schema Theme-specialized schema Hierarchical semantic representation Semantic/content characterization Documents : keyword vectors Document indexing Index placement Peers : semantic vectors/matrices Overlay construction Semantic node clustering Semantic Groups/Subgroups Slide41: Semantic Clustering & Indexing Vector construction from meta-data Vector values represent relevance degree for content description Semantic Proximity Node vectors <theme1, theme2, theme3,…> <sub-theme1, sub-theme2, sub-theme3,…>, etc. Semantic Distance d, where d = cosine of angle between vectors Neighbors chosen wrt d Slide42: K V K V K V K V K V K V K V K V K V K V K V Semantic Clustering & Indexing Document vectors : <keyword1, keyword2, …, keywordN> Document indexing : keywords hashing Slide43: Semantic Clustering & Indexing Direct Issues Group specification - Completeness - Validity - Evolution Group management Semantic distance between groups Index placement wrt group semantics Other Issues Beyond semantic grouping … Confidence work spheres Multi criteria grouping And still… Data consistency and updates Schema mappings