BARC 97 12 NorCal Saverio
Added: March 20, 2008. Category: Education. License: All Rights Reserved.

Presentation Transcript

BARC
  Microsoft Bay Area Research Center
  Tom Barclay
  Gordon Bell
  Joe Barrera
  Jim Gemmell
  Jim Gray
  Erik Riedel (CMU)
  Eve Schooler (Cal Tech)
  Don Slutz
  Catherine Van Ingen
  http://www.research.Microsoft.com/barc/

Telepresence
  The next killer app
  Space shifting: Reduce travel
  Time shifting:
    Retrospective
    Offer condensations
    Just in time meetings.
  Example: ACM 97 NetShow and Web site. More web visitors than attendees
  People-to-People communication

Working with NorCal: An Experiment in Presence
  Is being there, then better than being somewhere else at some other time?
  December 11, 1997

Telework = work + telepresence “being there while being here”
  The teleworkplace is just an office with limited:
    Communication, computer, and network support!
    Team interactions for work! Until we understand in situ collaboration, CSCW is a “rat hole”!
    Serendipitous social interaction in hallway, office, coffee place, meeting room, etc.
    Administrative support for helping, filing, sending, etc.
    Telepresentations and communication
    Computing environment … being always connected and operational, administrivia, help in managing phones and messages, information (especially paper) management
  SOHOs & COMOHOs are a high growth market

IP Multicast
  Is pruned broadcast to a multicast address
  Unreliable
  Reliable would require Ack/Nack
  State or Nack implosion problem
  (Figure legend: sender, receiver, not interested)

What We Are Doing
  Scalable Reliable Multicast (SRM)
    used by WB (white board) of Mbone
    Nack suppression (backoff)
    N^2 message traffic to set up
  Error Correcting SRM (EC SRM)
    Do not resend lost packets.
    Send Error Correction in addition to regular packets, or send Error Correction in response to NACK
    One EC packet repairs any of k lost packets
    Improved scalability (millions of subscribers).

(n,k) encoding
  (Figure: original packets)

ECSRM
  Combine suppression & erasure correction
  Assign each packet to an EC group of size k
  NACK: (group, # missing)
  NACK of (g,c) suppresses all (g, x≤c).
  Don’t re-send originals; send EC packets using (n,k) encoding
  Below, 1 NACK and one EC packet fixes all errors.
  (Figure: packets 1 2 3 4 5 6 7 and EC)

Telepresence Prototypes
  PowerCast: multicast PowerPoint
    Streaming: pre-sends next anticipated slide
    Send slides and voice rather than talking head and voice
    Uses ECSRM for reliable multicast
    1000’s of receivers can join and leave any time.
    No server needed; no pre-load of slides.
    Cooperating with NetShow
  FileCast: multicast file transfer.
    Erasure encodes all packets
    Receivers only need to receive as many bytes as the length of the file
    Multicast IE to solve Midnight-Madness problem
  NT SRM: reliable IP multicast library for NT

RAGS: RAndom SQL test Generator
  Microsoft spends a LOT of money on testing.
  Idea: test SQL by
    generating random correct queries
    executing queries against database
    compare results with SQL 6.5, DB2, Oracle
  Being used in SQL 7.0 testing.
  185 unique bugs found (since 2/97)
  Very productive test tool

Sample RAGS Generated Statement
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notes FROM titles T0, roysched T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" 
))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY ( SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS ( SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ
) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ
) ) ) --EOQ
) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ
GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ
ORDER BY 1, 6, 5 )

  This Statement yields an error:
    SQLState=37000, Error=8623
    Internal Query Processor Error: Query processor could not produce a query plan.
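
The RAGS idea on the slides above (generate random correct queries, execute them, compare results across products) can be sketched roughly as follows. This is only an illustration, not the RAGS implementation: it uses Python's built-in sqlite3 module, and two identical in-memory databases stand in for the different products being compared. The table `sales(qty, price)` and the helper names `make_db`, `random_numeric_expr`, `random_query`, and `compare` are all hypothetical.

```python
import random
import sqlite3


def make_db():
    """Create a tiny in-memory table; a stand-in for a real test database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (qty INTEGER, price REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(3, 9.5), (10, 2.0), (7, 4.25)],
    )
    return conn


def random_numeric_expr(rng, depth=0):
    """Recursively build a random but syntactically valid numeric expression."""
    if depth >= 3 or rng.random() < 0.3:
        # Leaf: a column name or a small integer literal.
        return rng.choice(["qty", "price", str(rng.randint(1, 9))])
    op = rng.choice(["+", "-", "*"])
    left = random_numeric_expr(rng, depth + 1)
    right = random_numeric_expr(rng, depth + 1)
    return f"({left} {op} {right})"


def random_query(rng):
    """Assemble a random SELECT with a random projection and predicate."""
    select = random_numeric_expr(rng)
    where = f"{random_numeric_expr(rng)} >= {random_numeric_expr(rng)}"
    # ORDER BY 1 makes the row order deterministic so results are comparable.
    return f"SELECT {select} FROM sales WHERE {where} ORDER BY 1"


def compare(query, conn_a, conn_b):
    """Run one query on both systems and report whether the outcomes agree."""
    outcomes = []
    for conn in (conn_a, conn_b):
        try:
            outcomes.append(("rows", conn.execute(query).fetchall()))
        except sqlite3.Error as exc:
            outcomes.append(("error", type(exc).__name__))
    return outcomes[0] == outcomes[1]


if __name__ == "__main__":
    rng = random.Random()
    db_a, db_b = make_db(), make_db()
    for _ in range(5):
        q = random_query(rng)
        print("agree" if compare(q, db_a, db_b) else "DIFFER", q)
```

In a real comparison the two connections would point at different database products, and any query reported as differing (or failing on only one side) would become a candidate bug report.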
Reduced Statement Causes Same Error
SELECT roysched.royalty FROM titles, roysched WHERE EXISTS ( SELECT DISTINCT TOP 1 titles.advance FROM sales ORDER BY 1)

  Next steps:
    Auto-Simplify failure cases
    Compare outputs with other products
    Extend to other parts of SQL
    Patents

Scaleup - Big Database
  Build a 1 TB SQL Server database
  Show off Windows NT and SQL Server scalability
  Stress test the product
  Data must be
    1 TB
    Unencumbered
    Interesting to everyone everywhere
    And not offensive to anyone anywhere
  Loaded
    1.1 M place names from Encarta World Atlas
    1 M Sq Km from USGS (1 meter resolution)
    2 M Sq Km from Russian Space agency (2 m)
  Will be on web (world’s largest atlas)
  Sell images with commerce server.
  USGS CRDA: 3 TB more coming.

The System
  DEC Alpha + 8400
  324 StorageWorks Drives (2.8 TB)
  SQL Server 7.0
  USGS 1-meter data (30% of US)
  Russian Space data 1.6 meter resolution images

Demo
  Http://t2b2c

Technical Challenge: Key idea
  Problem: Geo-Spatial Search without geo-spatial access methods.
    (just standard SQL Server)
  Solution: Geo-spatial search key:
    Divide earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
    Z-transform X & Y into single Z value, build B-tree on Z
    Adjacent images stored next to each other
  Search Method:
    Latitude and Longitude => X, Y, then Z
    Select on matching Z value

Live on the internet in 98H1 (Tied to Sphinx Beta 2 RTM) For 18 Months
  New Since S-Day:
    More data:
      4.8 TB USGS DOQ
      .5 TB Russian
    Bigger Server: Alpha 8400, 8 proc, 8 GB RAM, 2.8 TB Disk
    Improved Application
      Better UI
      Uses ASP
      Commerce App
  Cut images and Load Dec & Jan
  Built Commerce App for USGS & Spin-2
  Release on Internet with Sphinx B2
  Launch on Internet in Spring

NT Clusters (Wolfpack)
  Scale DOWN to PDA: WindowsCE
  Scale UP an SMP: TerraServer
  Scale OUT with a cluster of machines
  Single-system image
    Naming
    Protection/security
    Management/load balance
  Fault tolerance “Wolfpack”
  Hot pluggable hardware & software

Symmetric Virtual Server Failover Example
  (Figure: Server 1 and Server 2, each shown with Web site, Database, Web site files, Database files)

Clusters & BackOffice
  Research: Instant & Transparent failover
  Making BackOffice PlugNPlay on Wolfpack
    Automatic install & configure
  Virtual Server concept makes it easy
    simpler management concept
    simpler context/state migration
    transparent to applications
  SQL 6.5E & 7.0 Failover
  MSMQ (queues), MTS (transactions).

1.2 B tpd
  1 B tpd ran for 24 hrs.
  Out-of-the-box software
  Off-the-shelf hardware
  AMAZING!
  Sized for 30 days
  Linear growth
  5 micro-dollars per transaction

Storage Latency: How Far Away is the Data?
  (Figure: relative access times)
    Registers: 1
    On Chip Cache: 2
    On Board Cache: 10
    Memory: 100
    Disk: 10^6
    Tape / Optical Robot: 10^9

The Memory Hierarchy
  Measuring & Modeling Sequential IO
  Where is the bottleneck?
  How does it scale with SMP, RAID, new interconnects
  Goals:
    balanced bottlenecks
    Low overhead
    Scale many processors (10s)
    Scale many disks (100s)
  (Figure labels: App address space, File cache, Mem bus, Memory, PCI, Adapter, SCSI, Controller)

PAP (peak advertised performance) vs RAP (real application performance)
  Goal: RAP = PAP / 2 (the half-power point)

The Best Case: Temp File, NO IO
  Temp file Read / Write File System Cache
  Program uses small (in cpu cache) buffer.
  So, write/read time is bus move time (3x better than copy)
  Paradox: fastest way to move data is to write then read it.
  This hardware is limited to 150 MBps per processor

Out of the Box Disk File Performance
  One NTFS disk
  Buffered read
  NTFS does 64 KB read-ahead
    if you ask FILE_FLAG_SEQUENTIAL_SCAN
    or if it thinks you are sequential
  NTFS does 64 KB write behind
    under same conditions
    aggregates many small IO to few big IO.

Synchronous Buffered Read/Write
  Read throughput is GREAT!
  Write throughput is 40% of read
  WCE (write cache enabled) is fast but dangerous
  Net: default out of the box performance is good.
  20 ms/MB ~ 2 instructions/byte!
  CPU will saturate at 50 MBps

Bottleneck Analysis
  Drawn to linear scale
  Theoretical Bus Bandwidth: 422 MBps = 66 MHz x 64 bits
  Memory Read/Write: ~150 MBps
  MemCopy: ~50 MBps
  Disk R/W: ~9 MBps

Parallel Access To Data?
  At 10 MB/s: 1.2 days to scan 1 Terabyte
  1,000 x parallel: 100 second SCAN.
  Parallelism: divide a big problem into many smaller ones to be solved in parallel.
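
The scan-time figures on the Parallel Access To Data slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly linear speedup across disks, which is an idealization; the function name `scan_seconds` is hypothetical.

```python
TERABYTE = 10**12        # bytes; decimal terabyte (assumption)
DISK_RATE = 10 * 10**6   # bytes per second; the slide's 10 MB/s per disk


def scan_seconds(data_bytes, rate_bytes_per_s, disks=1):
    """Time to scan the data, assuming ideal linear speedup across disks."""
    return data_bytes / (rate_bytes_per_s * disks)


serial = scan_seconds(TERABYTE, DISK_RATE)
parallel = scan_seconds(TERABYTE, DISK_RATE, disks=1000)

print(f"1 disk:      {serial:.0f} s, about {serial / 86400:.1f} days")
print(f"1,000 disks: {parallel:.0f} s")
print(f"aggregate bandwidth: {1000 * DISK_RATE / 10**9:.0f} GB/s")
```

One disk needs 100,000 seconds, which rounds to the slide's 1.2 days; a thousand disks scanning in parallel need 100 seconds and together deliver 10 GB/s.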
  (Figure label: BANDWIDTH 10 GB/s)

PAP vs RAP
  Reads are easy, writes are hard
  Async write can match WCE.

Bottleneck Analysis
  NTFS Read/Write, 9 disk, 2 SCSI bus, 1 PCI
    ~65 MBps Unbuffered read
    ~43 MBps Unbuffered write
    ~40 MBps Buffered read
    ~35 MBps Buffered write
  (Figure labels: Memory Read/Write ~150 MBps; PCI ~70 MBps; Adapter ~30 MBps; Adapter 70 MBps)

NT Memory Broker
  Some servers absorb memory and circumvent NT memory management.
  Complicated by Wolfpack failover, large memory support, interactions among servers.
  Prototype Memory Broker service augments NT memory management:
    Separates memory needs and desires
    Dynamic expand & reclaim memory footprint
    Monitor memory usage, paging, shared buffers
    Cross-server arbitration
  Clients: SQL Server, Exchange, Oracle, ...
  Working with NT-Team (Lou Perazzoli)

Public Service
  Gordon Bell
    Computer Museum
    Vanguard Group
    Edits column in CACM
  Jim Gray
    National Research Council Computer Science and Telecommunications Board
    Presidential Advisory Committee on NGI-IT-HPPC
    Edit Journals & Conferences.
  Tom Barclay
    USGS and Russian cooperative research

BARC
  Microsoft Bay Area Research Center
  Tom Barclay
  Gordon Bell
  Joe Barrera
  Jim Gemmell
  Jim Gray
  Don Slutz
  Catherine Van Ingen
  http://www.research.Microsoft.com/barc/