Computer Technology Forecast
Jim Gray, Microsoft Research
Gray@Microsoft.com
http://research.Microsoft.com/~Gray
Added: October 29, 2007

Reality Check:
Good news:
In the limit, processing, storage, and network are free.
Processing and network are infinitely fast.
Bad news:
Most of us live in the present.
People are getting more expensive. Management/programming cost exceeds hardware cost.
The speed of light is not improving.
WAN prices have not changed much in the last 8 years.

Interesting Topics:
I’ll talk about server-side hardware.
What about client hardware? Displays, cameras, speech, ...
What about software? Databases, data mining, PDB, OODB; objects / class libraries; visualization; the Open Source movement.

How Much Information Is There?:
Soon everything can be recorded and indexed.
Most data will never be seen by humans.
Precious resource: human attention.
Auto-summarization and auto-search are key technology.
www.lesk.com/mlesk/ksg97/ksg.html
Scale (figure): Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta. Markers on the scale: A Book; A Movie; All LoC books (words); All Books MultiMedia; Everything!
Recorded; A Photo.
Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli.

Moore’s Law:
Performance/price doubles every 18 months.
100x per decade.
Progress in the next 18 months = ALL previous progress.
New storage = sum of all old storage (ever).
New processing = sum of all old processing.
E. coli doubles every 20 minutes!
(Figure label: 15 years ago.)

Trends: ops/s/$ Had Three Growth Phases:
1890-1945: mechanical, relay; 7-year doubling.
1945-1985: tube, transistor, ...; 2.3-year doubling.
1985-2000: microprocessor; 1.0-year doubling.

What’s a Balanced System?:
(Figure: System Bus, PCI Bus, PCI Bus.)

Storage capacity beating Moore’s law:
5 k$/TB today (raw disk).

Cheap Storage:
Disks are getting cheap: 7 k$/TB disks (25 x 40 GB disks @ 230$ each).

Cheap Storage or Balanced System:
Low-cost storage (2 x 1.5 k$ servers): 7 k$/TB. 2x (1 k$ system + 8 x 60 GB disks + 100 Mb Ethernet).
Balanced server (7 k$ per 0.5 TB): 2 x 800 MHz (2 k$); 256 MB (400$); 8 x 60 GB drives (3 k$); Gbps Ethernet + switch (1.5 k$). 14 k$/TB, 28 k$ per RAIDed TB.

The “Absurd” Disk:
1 TB, 100 MB/s, 200 Kaps.
2.5 hr scan time (poor sequential access).
1 aps / 5 GB (VERY cold data).
It’s a tape!

Hot Swap Drives for Archive or Data Interchange:
25 MBps write (so can write N x 60 GB in 40 minutes).
60 GB/overnite = ~N x 2 MB/second @ 19.95$/nite.
(Figure price labels: 17$, 260$.)

240 GB, 2k$ (now); 300 GB by year end:
4 x 60 GB IDE (2 hot pluggable) (1,100$).
SCSI-IDE bridge (200k$).
Box: 500 MHz CPU, 256 MB SRAM; fan, power, Enet (700$).
Or 8 disks/box: 600 GB for ~3 k$ (or 300 GB RAID).

Hot Swap Drives for Archive or Data Interchange:
25 MBps write (so can write N x 74 GB in 3 hours).
74 GB/overnite = ~N x 2 MB/second @ 19.95$/nite.

It’s Hard to Archive a Petabyte. It takes a LONG time to restore it:
At 1 GBps it takes 12 days!
Store it in two (or more) places online (on disk?): a geo-plex.
Scrub it continuously (look for errors).
On failure, use the other copy until the failure is repaired; refresh the lost copy from the safe copy.
Can organize the two copies differently (e.g., one by time, one by space).

Disk vs Tape:
Disk: 60 GB; 30 MBps; 5 ms seek time; 3 ms rotate latency; 7$/GB for drive; 3$/GB for ctlrs/cabinet; 4 TB/rack; 1 hour scan.
Tape: 40 GB; 10 MBps; 10 sec pick time; 30-120 second seek time; 2$/GB for media; 8$/GB for drive+library; 10 TB/rack; 1 week scan.
The price advantage of tape is narrowing, and the performance advantage of disk is growing.
At 10 k$/TB, disk is competitive with nearline tape.
Guesstimates: CERN: 200 TB; 3480 tapes; 2 col = 50 GB; rack = 1 TB = 20 drives.

Trends: Gilder’s Law: 3x bandwidth/year for 25 more years:
Today: 10 Gbps per channel; 4 channels per fiber: 40 Gbps; 32 fibers/bundle = 1.2 Tbps/bundle.
In lab: 3 Tbps/fiber (400 x WDM).
In theory: 25 Tbps per fiber.
1 Tbps = USA 1996 WAN bisection bandwidth.
Aggregate bandwidth doubles every 8 months!
(Figure label: 1 fiber = 25 Tbps.)

Sense of scale:
How fat is your pipe?
Fattest pipe on MS campus is the WAN!
300 MBps: OC48 = G2, or memcpy().
90 MBps: PCI.
20 MBps: disk / ATM / OC3.
94 MBps: Coast to Coast.

Slide19:
Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA.
5626 km, 10 hops.
Information Sciences Institute; Microsoft; Qwest; University of Washington; Pacific Northwest Gigapop; HSCC (high speed connectivity consortium); DARPA.

The Path:
DC -> SEA
C:\tracert -d 131.107.151.194
Tracing route to 131.107.151.194 over a maximum of 30 hops
 0                                           ------- DELL 4400 Win2K WKS     Arlington Virginia, ISI Alteon GbE
 1   16 ms  <10 ms  <10 ms  140.173.170.65   ------- Juniper M40 GbE         Arlington Virginia, ISI Interface ISIe
 2  <10 ms  <10 ms  <10 ms  205.171.40.61    ------- Cisco GSR OC48          Arlington Virginia, Qwest DC Edge
 3  <10 ms  <10 ms  <10 ms  205.171.24.85    ------- Cisco GSR OC48          Arlington Virginia, Qwest DC Core
 4  <10 ms  <10 ms   16 ms  205.171.5.233    ------- Cisco GSR OC48          New York, New York, Qwest NYC Core
 5   62 ms   63 ms   62 ms  205.171.5.115    ------- Cisco GSR OC48          San Francisco, CA, Qwest SF Core
 6   78 ms   78 ms   78 ms  205.171.5.108    ------- Cisco GSR OC48          Seattle, Washington, Qwest Sea Core
 7   78 ms   78 ms   94 ms  205.171.26.42    ------- Juniper M40 OC48        Seattle, Washington, Qwest Sea Edge
 8   78 ms   79 ms   78 ms  208.46.239.90    ------- Juniper M40 OC48        Seattle, Washington, PNW Gigapop
 9   78 ms   78 ms   94 ms  198.48.91.30     ------- Cisco GSR OC48          Redmond Washington, Microsoft
10   78 ms   78 ms   94 ms  131.107.151.194  ------- Compaq SP750 Win2K WKS  Redmond Washington, Microsoft SysKonnect GbE

“PetaBumps”:
751 Mbps for 300 seconds = (~28 GB).
Single-thread, single-stream tcp/ip, desktop-to-desktop, out-of-the-box performance*.
5626 km x 751 Mbps = ~4.2e15 bit meter / second ~ 4.2 Peta bmps.
Multi-stream is 952 Mbps ~5.2 Peta bmps.
4470-byte MTUs were enabled on all routers.
20 MB window size.

The Promise of SAN/VIA: 10x in 2 years (http://www.ViArch.org/):
Yesterday: 10 MBps (100 Mbps Ethernet); ~20 MBps tcp/ip saturates 2 cpus; round-trip latency ~250 µs.
Now: wires are 10x faster (Myrinet, Gbps Ethernet, ServerNet, ...); fast user-level communication; tcp/ip ~100 MBps at 10% cpu; round-trip latency is 15 µs; 1.6 Gbps demoed on a WAN.

Pointers:
The single-stream submission: http://research.microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Single_Stream_mail).htm
The multi-stream submission: http://research.Microsoft.com/~gray/papers/Windows2000_I2_land_Speed_Contest_Entry_(Multi_Stream_mail).htm
The code: http://research.Microsoft.com/~gray/papers/speedy.htm (speedy.h, speedy.c)
And a PowerPoint presentation about it: http://research.Microsoft.com/~gray/papers/Windows2000_WAN_Speed_Record.ppt

Networking:
WANs are getting faster than LANs. G8 = OC192 = 8 Gbps is “standard”.
Link bandwidth improves 4x per 3 years.
Speed of light (60 ms round trip in US).
Software stacks have always been the problem.
Time = SenderCPU + ReceiverCPU + bytes/bandwidth
(Figure annotation: This has been the problem.)

Rules of Thumb in Data Engineering:
Moore’s law -> an address bit per 18 months.
Storage grows 100x/decade (except 1000x last decade!).
Disk data of 10 years ago now fits in RAM (iso-price).
Device bandwidth grows 10x/decade, so need parallelism.
RAM:disk:tape price is 1:10:30, going to 1:10:10.
Amdahl’s speedup law: maximum speedup is (S+P)/S for serial part S and parallel part P.
Amdahl’s IO law: a bit of IO per instruction/second (TBps/10 TOP! 50,000 disks/10 teraOP: 100 M$).
Amdahl’s memory law: a byte per instruction/second (going to 10) (1 TB RAM per TOP: 1 TeraDollars).
PetaOps anyone?
Gilder’s law: aggregate bandwidth doubles every 8 months.
5 Minute rule: cache disk data that is reused in 5 minutes.
Web rule: cache everything!
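A few of the growth and speedup rules above reduce to one-line arithmetic. The sketch below is only an illustration: the function names are invented here, and the Amdahl formulas are the standard textbook forms (maximum speedup (S+P)/S, and (S+P)/(S+P/n) on n processors), which is an assumption about what the slide intends.

```python
def growth_factor(doubling_months: float, period_months: float) -> float:
    """Growth over period_months for something that doubles every doubling_months."""
    return 2.0 ** (period_months / doubling_months)


def amdahl_max_speedup(serial: float, parallel: float) -> float:
    """Upper bound on speedup with unlimited processors: (S + P) / S."""
    return (serial + parallel) / serial


def amdahl_speedup(serial: float, parallel: float, n: int) -> float:
    """Speedup on n processors when only the parallel part is divided: (S + P) / (S + P / n)."""
    return (serial + parallel) / (serial + parallel / n)


if __name__ == "__main__":
    # Moore's law as quoted: 18-month doubling, compared with "100x per decade".
    print(f"18-month doubling over 10 years: {growth_factor(18, 120):.1f}x")
    # Gilder's law as quoted: 8-month doubling, compared with "3x bandwidth/year".
    print(f"8-month doubling over 1 year:    {growth_factor(8, 12):.2f}x")
    # A job that is 10% serial can never run more than 10x faster.
    print(f"max speedup, S=1, P=9:           {amdahl_max_speedup(1, 9):.1f}x")
    print(f"speedup on 9 processors:         {amdahl_speedup(1, 9, 9):.1f}x")
```

The point of the second pair of calls is the one the deck goes on to make: device speeds grow slowly, so large gains have to come from parallelism, and the serial part of the work caps what parallelism can deliver.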
http://research.Microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc

Dealing With TeraBytes (Petabytes): Requires Parallelism:
Parallelism: use many little devices in parallel.

Parallelism Must Be Automatic:
There are thousands of MPI programmers.
There are hundreds of millions of people using parallel database search.
Parallel programming is HARD!
Find design patterns and automate them.
Data search/mining has parallel design patterns.

Scalability: Up and Out:

Everyone scales out. What’s the Brick?:
1 M$/slice: IBM S390? Sun E 10,000?
100 K$/slice: HPUX/AIX/Solaris/IRIX/EMC.
10 K$/slice: Utel / Wintel 4x.
1 K$/slice: Beowulf / Wintel 1x.

Terminology for scalability:
Farms of servers:
Clones: identical; scalability + availability.
Partitions: scalability.
Packs: partition availability via fail-over.
GeoPlex for disaster tolerance.

Unpredictable Growth:
The TerraServer story:
We expected 5 M hits per day.
We got 50 M hits on day 1.
We peak at 15-20 M hpd on a “hot” day.
Average 5 M hpd after 1 year.
Most of us cannot predict demand:
Must be able to deal with NO demand.
Must be able to deal with HUGE demand.

An Architecture for Internet Services?:
Need to be able to add capacity: new processing; new storage; new networking.
Need continuous service: online change of all components (hardware and software); multiple service sites; multiple network providers.
Need great development tools: change the application several times per year; add new services several times per year.

Premise: Each Site is a Farm:
Buy computing by the slice (brick): rack of servers + disks.
Grow by adding slices.
Spread data and computation to new slices.
Two styles:
Clones: anonymous servers.
Parts+Packs: partitions fail over within a pack.
In both cases, remote farm for disaster recovery.

Clones: Availability+Scalability:
Some applications are: read-mostly; low consistency requirements; modest storage requirement (less than 1 TB).
Examples: HTML web servers (IP sprayer/sieve + replication); LDAP servers (replication via gossip).
Replicate app at all nodes (clones).
Spray requests across nodes.
Grow by adding clones.
Fault tolerance: stop sending to that clone.
Growth: add a clone.

Two Clone Geometries:
Shared-Nothing: exact replicas.
Shared-Disk (state stored in server).

Facilities Clones Need:
Automatic replication: applications (and system software); data.
Automatic request routing: spray or sieve.
Management: Who is up? Update management & propagation. Application monitoring.
Clones are very easy to manage. Rule of thumb: 100’s of clones per admin.

Partitions for Scalability:
Clones are not appropriate for some apps: stateful apps do not replicate well; high update rates do not replicate well.
Examples: email / chat / ...; databases.
Partition state among servers.
Scalability (online): partition split/merge.
Partitioning must be transparent to client.

Partitioned/Clustered Apps:
Mail servers: perfectly partitionable.
Business object servers: partition by set of objects.
Parallel databases: transparent access to partitioned tables; parallel query.

Packs for Availability:
Each partition may fail (independent of others).
Partitions migrate to new node via fail-over; fail-over in seconds.
Pack: the nodes supporting a partition (VMS Cluster; Tandem Process Pair; SP2 HACMP; Sysplex™; WinNT MSCS (wolfpack)).
Cluster In A Box now commodity.
Partitions typically grow in packs.
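The two routing styles described in the slides above, spraying requests across interchangeable clones and sending each request to the pack that owns its partition, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design of any product named in the talk: the class names `CloneSprayer` and `PartitionRouter` are hypothetical, round-robin stands in for the "sprayer", and a CRC32 hash modulo the partition count stands in for whatever partitioning function a real system would use.

```python
import zlib


class CloneSprayer:
    """Round-robin request sprayer over interchangeable clones (hypothetical helper)."""

    def __init__(self, clones):
        self.clones = list(clones)
        self.down = set()
        self._next = 0

    def add_clone(self, clone):
        # Growth: add a clone.
        self.clones.append(clone)

    def mark_down(self, clone):
        # Fault tolerance: stop sending to that clone.
        self.down.add(clone)

    def mark_up(self, clone):
        self.down.discard(clone)

    def route(self):
        # Try each clone at most once, skipping those marked down.
        for _ in range(len(self.clones)):
            clone = self.clones[self._next % len(self.clones)]
            self._next += 1
            if clone not in self.down:
                return clone
        raise RuntimeError("no clone is up")


class PartitionRouter:
    """Routes a key to its partition; each partition is served by a pack of nodes."""

    def __init__(self, packs):
        # packs[i] lists the nodes of partition i, preferred node first.
        self.packs = [list(nodes) for nodes in packs]
        self.down = set()

    def partition_of(self, key):
        # Assumption: hash partitioning; the slides do not name a scheme.
        return zlib.crc32(key.encode("utf-8")) % len(self.packs)

    def mark_down(self, node):
        self.down.add(node)

    def route(self, key):
        # Fail over to the next live node within the same pack.
        for node in self.packs[self.partition_of(key)]:
            if node not in self.down:
                return node
        raise RuntimeError("every node in the pack is down")


if __name__ == "__main__":
    web = CloneSprayer(["web1", "web2", "web3"])
    web.mark_down("web2")
    print([web.route() for _ in range(4)])

    mail = PartitionRouter([["mail0-a", "mail0-b"], ["mail1-a", "mail1-b"]])
    print(mail.route("alice@example.com"))
```

The client sees only `route()`, which is the sense in which partitioning stays transparent. A fixed hash-modulo map does not support the online split/merge the slides call for; a range map or directory lookup would be the natural replacement for `partition_of`.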
What Parts+Packs Need:
Automatic partitioning (in dbms, mail, files, ...): location transparent; partition split/merge; grow without limits (100 x 10 TB).
Simple failover model: partition migration is transparent; MSCS-like model for services.
Application-centric request routing.
Management: Who is up? Automatic partition management (split/merge). Application monitoring.

Partitions and Packs:
Packs for availability.

GeoPlex: Farm pairs:
Two farms.
Changes from one sent to other.
When one farm fails, the other provides service.
Masks: hardware/software faults; operations tasks (reorganize, upgrade, move); environmental faults (power fail).

Services on Clones & Partitions:
Application provides a set of services.
If cloned: services are on a subset of clones.
If partitioned: services run at each partition.
System load balancing routes request to: any clone; correct partition.
Routes around failures.

Cluster Scenarios: 3-tier systems:
A simple web site (figure): Front End; Web File Store; SQL Temp State; SQL Database.

Cluster Scale Out Scenarios:
(Figure: SQL Temp State; Web File StoreA; Cloned Front Ends (firewall, sprayer, web server); The FARM: Clones and Packs of Partitions; Web Clients; Load Balance.)

Terminology:
Terminology for scalability.
Farms of servers:
Clones: identical; scalability + availability.
Partitions: scalability.
Packs: partition availability via fail-over.
GeoPlex for disaster tolerance.

What we have been doing with SDSS:
Helping move the data to SQL: database design; data loading.
Experimenting with queries on a 4 M object DB:
20 questions like “find gravitational lens candidates”.
Queries use parallelism; most run in a few seconds (auto parallel).
Some run in hours (neighbors within 1 arcsec).
EASY to ask questions.
Helping with an “outreach” website: SkyServer.
Personal goal: try datamining techniques to “re-discover” astronomy.

References (.doc or .pdf):
Technology forecast: http://research.microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.doc
Gbps experiments: http://research.microsoft.com/~gray/
Disk experiments (10K$ TB): http://research.microsoft.com/~gray/papers/Win2K_IO_MSTR_2000_55.doc
Scalability terminology: http://research.microsoft.com/~gray/papers/MS_TR_99_85_Scalability_Terminology.doc
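Several headline numbers in the deck can be re-derived from quantities the slides state themselves. The sketch below is a rough check, not part of the talk: it assumes decimal units (1 GB = 10^9 bytes), takes the 78 ms round-trip time from the last hop of the traceroute, and the helper names are invented here. The bandwidth-delay product is an added calculation that the slides do not give; it is included only to show why a large TCP window was needed.

```python
MB, GB, PB = 1e6, 1e9, 1e15  # decimal units (assumption)


def transfer_seconds(size_bytes: float, bytes_per_second: float) -> float:
    """Time to move size_bytes at a sustained rate of bytes_per_second."""
    return size_bytes / bytes_per_second


def bit_meters_per_second(distance_m: float, bits_per_second: float) -> float:
    """The 'PetaBumps' metric: link distance times sustained bit rate."""
    return distance_m * bits_per_second


def bandwidth_delay_bytes(bits_per_second: float, rtt_seconds: float) -> float:
    """Bytes in flight needed to keep a path of the given RTT full."""
    return bits_per_second * rtt_seconds / 8


if __name__ == "__main__":
    # Restoring a petabyte at 1 GBps ("12 days" on the slide).
    print(f"petabyte restore: {transfer_seconds(1 * PB, 1 * GB) / 86400:.1f} days")
    # Writing one 60 GB hot-swap drive at 25 MBps ("40 minutes").
    print(f"60 GB at 25 MBps: {transfer_seconds(60 * GB, 25 * MB) / 60:.0f} minutes")
    # Single-stream run: 751 Mbps for 300 s over 5626 km.
    print(f"data moved:       {751e6 * 300 / 8 / GB:.1f} GB")
    print(f"bit-meters/s:     {bit_meters_per_second(5626e3, 751e6):.2e}")
    # Window needed at 751 Mbps with a 78 ms round trip.
    print(f"bandwidth-delay:  {bandwidth_delay_bytes(751e6, 0.078) / MB:.1f} MB")
```

Under these assumptions the computed window is well below the 20 MB window size the PetaBumps slide reports, so the configured window left headroom over what the path strictly required.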