logging in or signing up pdb95 Alohomora Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 252 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 15, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Parallel Database Systems 101Jim Gray & Gordon BellMicrosoft Corporationpresented at VLDB 95, Zurich Switzerland, Sept 1995: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Database Systems 101 Jim Gray & Gordon Bell Microsoft Corporation presented at VLDB 95, Zurich Switzerland, Sept 1995 Detailed notes available from Gray@Microsoft.com this presentation is 120 of the 174 slides (time limit) Notes in PowerPoint7 and Word7Outline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata. Tandem, Oracle, Informix, Sybase, DB2, ‘RedBrickKinds Of Information Processing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Kinds Of Information Processing Point-to-Point Broadcast Immediate Time Shifted conversation money lecture concert mail book newspaper Net work Data Base Its ALL going electronic Immediate is being stored for analysis (so ALL database) Analysis & Automatic Processing are being added Why Put Everything in Cyberspace?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Why Put Everything in Cyberspace? Low rent min $/byte Shrinks time now or later Shrinks space here or there Automate processing knowbots Point-to-Point OR Broadcast Immediate OR Time Delayed Network Data Base Locate Process Analyze Summarize Databases: Information At Your Fingertips™ Information Network™Knowledge Navigator™: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Databases: Information At Your Fingertips™ Information Network™ Knowledge Navigator™ All information will be in an online database (somewhere) You might record everything you read: 10MB/day, 400 GB/lifetime (two tapes) hear: 400MB/day, 16 TB/lifetime (a tape per decade) see: 1MB/s, 40GB/day, 1.6 PB/lifetime (maybe someday) Data storage, organization, and analysis is a challenge. That is what databases are about DBs do a good job on “records” Now working on text, spatial, image, and sound.Database Store ALL Data Types: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Database Store ALL Data Types The New World: Billions of objects Big objects (1MB) Objects have behavior (methods) The Old World: Millions of objects 100-byte objects Mike Won David NY Berk Austin People Name Address Mike Won David NY Berk Austin Paperless office Library of congress online All information online entertainment publishing business Information Network, Knowledge Navigator, Information at your fingertips Name Address Papers Picture Voice PeopleMagnetic Storage Cheaper than Paper: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Magnetic Storage Cheaper than Paper File Cabinet: cabinet (4 drawer) 250$ paper (24,000 sheets) 250$ space (2x3 @ 10$/ft2) 180$ total 700$ 3 ¢/sheet Disk: disk (8 GB =) 2,000$ ASCII: 4 m pages 0.05 ¢/sheet (60x cheaper) Image: 200 k pages 1 ¢/sheet (3x cheaper than paper) Store everything on diskBillions of Clients : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Billions of Clients Every device will be “intelligent” Doors, rooms, cars, ... Computing will be ubiquitous Billions of Clients Need Millions of Servers: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Billions of Clients Need Millions of Servers mobile clients fixed clients server super server Clients Servers Super Servers Large Databases High Traffic shared data All clients are networked to servers may be nomadic or on-demand Fast clients want faster servers Servers provide data, control, coordination communicationMoore’s Law: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Moore’s Law 128KB 128MB 2000 8KB 1MB 8MB 1GB 1970 1980 1990 1M 16M bits: 1K 4K 16K 64K 256K 4M 64M 256M 1 chip memory size ( 2 MB to 32 MB) XXX doubles every 18 months 60% increase per year Micro Processor speeds chip density Magnetic disk density Communications bandwidth WAN bandwidth approaching LANs Exponential Growth: The past does not matter 10x here, 10x there, soon you're talking REAL change. PC costs decline faster than any other platform Volume & learning curves PCs will be the building bricks of all future systemsMoore's Law for Memory: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Moore's Law for Memory 128M 128K 2000 8K 1M 8M 1G 8G 1970 1980 1990 512 64 8 1 4K 32K $50 $400 $3k $25k $200k $1.6m $6 1/8th chip 1Kbit 640K DOS limit 4K 16K 64K 256K 1M 4M 16M 64M 256M Memory Price @ $50/chip Number of chips 32MB 128MB 8MB 1GB 4GB Capacity with 64Mb DRAMsMicroProcessor Speeds Went Up: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey MicroProcessor Speeds Went Up Clock rates went from 10Khz to 300Mhz Processors now 4x issue SPECInt92 fits in Cache, it tracks cpu speed Peak Advertised Performance (PAP) is 1.2 BIPS Real Application Performance (RAP) is 60 MIPS Similar curves for DEC VAX & Alpha HP/PA IBM R6000/ PowerPC MIPS & SGI SUNSystem SPECint vs. Price: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey System SPECint vs. Price 486@66 PCs NCR 3555 Tricord ES 5K HP 9000 SUN 1000 SGI L SGI XL Price (K$) to 16 cpu. Pentium Compaq PCs good performance Best price-performance 30$ / SPECint! Proprietary UNIX poor price/performance HP- 9000, IBM SP/2 are above 1K$ / SPECint! Use PC’s for CyberBricks NCR 3525 NCR 3600 AP SUN 2000In The Limit: The Pico Processor: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey In The Limit: The Pico Processor 1 M SPECmarks, 1TFLOP 106 clocks to bulk ram Event-horizon on chip. VM reincarnated Multi-program cache On-Chip SMP Terror Bytes!Disc & Tape Storage: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Disc & Tape Storage $/byte got 104 better $/access got 103 better capacity grew 103 Latency down 10x Bandwidth up 10x 1e 0 1e 4 Tape (a/hr) Disk (a/min) RAM (a/s) 1e 8 1e 7 1e 6 1e 5 1e 4 1e 3 Disk RAM Tape B/$Disc Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Disc Trends Discs are getting smaller ( 1GB/unit) Discs are getting standard (SCSI) Discs are getting faster: 1MB/s -> 10MB/s 25 IO/s -> 75 IO/sDisc Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Disc Trends The 100GB disc card An array of discs Can be used as 100 discs 1 striped disc 10 Fault Tolerant discs ....etc. LOTS of accesses/second bandwidth 14"Tape Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tape Trends 1950-1980 Reel: 160 MB 1.0 MB/s 30 k$/drive 1980-1993: 3480: 300MB 3.0 MB/s 30 k$/drive 1985-1993: 8MM: 4GB 0.4 MB/s 3 k$/drive 1993-1994: 4MM: 1GB 0.2 MB/s 300 $/drive 1993- : DLT: 20GB 2.5 MB/s 3 K$/drive Mainframe silos: 250K$ and up for thousands of tapes 8MM silos: 5K$ and up for tens of tapesToday’s Storage Hierarchy : Speed & Capacity vs Cost Tradeoffs: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Today’s Storage Hierarchy : Speed & Capacity vs Cost Tradeoffs 1 Size vs Speed Access Time (seconds) 10 -9 10 -6 10 -3 10 0 10 3 Cache Main Secondary Disc Nearline Tape Offline Tape Online Tape Price vs Speed Access Time (seconds) 10 -9 10 -6 10 -3 10 0 10 3 Cache Main Secondary Disc Nearline Tape Offline Tape Online Tape Size(B) $/MBTape & Optical: Beware of the Media Myth: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tape & Optical: Beware of the Media Myth Optical is cheap: 200 $/platter 2 GB/platter => 100$/GB (5x cheaper than disc) Tape is cheap: 30 $/tape 20 GB/tape => 1.5 $/GB (700x cheaper than disc). Tape & Optical Reality: Media is 10% of System Cost: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tape & Optical Reality: Media is 10% of System Cost Tape needs a robot (10 k$ ... 3 m$ ) 10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB (5x...50x cheaper than disc) Optical needs a robot (100 k$ ) 100 platters = 200GB ( TODAY ) => 550 $/GB ( same price as disc ) Robots have poor access times Not good for Library of Congress (25TB) Data motel: data checks in but it never checks out!The Access Time Myth: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Access Time Myth Myth: seek or pick time dominates Reality: (1) Queuing dominates (2) Transfer dominates BLOBs (3) Disk seeks often short Implications: many cheap servers better than one fast expensive server shorter queues parallel transfer lower cost/access and cost/byte This is now obvious for disk arrays This will be obvious for tape arrays Seek Rotate Transfer Wait Seek Rotate TransferWhat's a Terabyte? (250 K$ of Disk @ .25$/MB): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What's a Terabyte? (250 K$ of Disk @ .25$/MB) 1 Terabyte 1,000,000,000 business letters 100,000,000 book pages 50,000,000 FAX images 10,000,000 TV pictures (mpeg) 4,000 LandSat images Library of Congress (in ASCII) is 25 TB 1980: 200 M$ of disc 10,000 discs 5 M$ of tape silo 10,000 tapes 1995: 250 K$ of magnetic disc 70 discs 500 K$ of optical disc robot 250 platters 50 K$ of tape silo 50 tapes Terror Byte !! 150 miles of bookshelf 15 miles of bookshelf 7 miles of bookshelf 10 days of video Standard Storage Metrics: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Standard Storage Metrics Capacity: RAM: MB and $/MB: today at 10Mb & 100$/MB Disk: GB and $/GB: today at 5GB and 500$/GB Tape: TB and $/TB: today at .1TB and 50k$/TB (nearline) Access time (latency) RAM: 100 ns Disk: 10 ms Tape: 30 second pick, 30 second position Transfer rate RAM: 1 GB/s Disk: 5 MB/s - - - Arrays can go to 1GB/s Tape: 5 MB/s - - - Arrays can go 100 MB/sNew Storage Metrics: KOXs, MOXs, GOXs, SCANs?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey New Storage Metrics: KOXs, MOXs, GOXs, SCANs? KOX: How many kilobyte objects served per second the file server, transaction processing metric MOX: How many megabyte objects served per second the Mosaic metric GOX: How many gigabyte objects served per hour the video & EOSDIS metric SCANS: How many scans of all the data per day the data mining and utility metricTertiary Storage Tape Farms: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tertiary Storage Tape Farms Scan in 10 hours. many independent tape robots (like a disc farm) 10K$ robot 10 tapes 200 GB 6 MB/s 50$/GB 30 MOX 15 GOX 100 robots 20TB 50$/GB 3K MOX 1.5K GOX 2.5 Scans 1M$ NearLine StorageDisc Array and Tape Farms Win: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey NearLine Storage Disc Array and Tape Farms Win Disc Farm Optical /Tape Robot Tape Farm TB/M$ MOX GOX SCANS/Day Nearline Storage Metrics: Disk and Tape Farms Win : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Nearline Storage Metrics: Disk and Tape Farms Win Data Motel: Data checks in, but it never checks out 0.01 0.1 1 10 100 1 , 000 10 , 000 100 , 000 1 , 000 , 000 1000 x D i sc Farm STC Tape Robot 6,000 tapes, 8 readers 100x DLT Tape Farm GB/K$ MOX GOX SCANS/Day K OXAccess/$ (3-year life): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Access/$ (3-year life) 0.1 1 10 100 100,000 120 2 1000 x Disc Farm STC Tape Robot 6,000 tapes, 16 readers 100x DLT Tape Farm KOX/$ MOX/$ GOX/$ SCANS/k$ 500K 540 ,000 67 ,000 68 7 7 4.3 1.5 0.2 23 100 Summary (of storage): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Summary (of storage) Capacity and cost are improving fast (100x per decade) Accesses are getting larger (MOX, GOX, SCANS) BUT Latencies and bandwidth are not improving much (3x per decade) How to deal with this??? Bandwidth: Use partitioned parallel access (disk & tape farms) Latency Pipeline data up storage hierarchy (next section)Interesting Storage Ratios: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Interesting Storage Ratios Disk is back to 100x cheaper than RAM Nearline tape is only 10x cheaper than disk and the gap is closing! 100:1 10:1 1:1 1960 1970 1980 1990 2000 RAM $/MB Disk $/MB 30:1 ? Disk $/MB Nearline Tape ??? Why bother with Tape Disk & DRAM look goodPerformance =Storage Accesses not Instructions Executed: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Performance =Storage Accesses not Instructions Executed In the “old days” we counted instructions and IO’s Now we count memory references Processors wait most of the time Where the time goes: clock ticks used by AlphaSort Components 70 MIPS “real” apps have worse Icache misses so run at 60 MIPS if well tuned, 20 MIPS if notStorage Latency: How Far Away is the Data?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Storage Latency: How Far Away is the Data?Network Speeds: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Network Speeds Network speeds grow 60% / year WAN speeds limited by politics if voice is X$/minute, how much is video? Switched 100Mb Ethernet 1,000x more bandwidth ATM is a scaleable net: 1 Gb/s to desktop & wall plug commodity: same for LAN, WAN 1Tb/s fibers in laboratory 1960 1970 1980 1990 2000 Processors (i/s) Year Comm Speedups LANs & WANs (b/s)Network Trends & Challenge: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Network Trends & Challenge Bandwidth UP 104 Price DOWN Speed-of-light unchanged Software got worse Standard Fast Nets ATM PCI Myrinet Tnet HOPE: Commodity Net Good software Then clusters become a SNAP! commodity: 10k$/sliceThe Seven Price Tiers: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Seven Price Tiers 10$: wrist watch computers 100$: pocket/ palm computers 1,000$: portable computers 10,000$: personal computers (desktop) 100,000$: departmental computers (closet) 1,000,000$: site computers (glass house) 10,000,000$: regional computers (glass castle) SuperServer: Costs more than 100,000 $ “Mainframe” Costs more than 1M$ Must be an array of processors, disks, tapes comm portsIf Hardware is Free, Where Will The Money Go?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey If Hardware is Free, Where Will The Money Go? All clients and servers will be based on PC technology economies of scale give lowest price. Traditional budget: 40% vendor, 60% staff If hardware_price = software_price = 0 then what? Money will go to CONTENT (databases) NEW APPLICATIONS AUTOMATION analogy to 1920 telephone operators Systems programmer per MIPS DBA per 10GBThe New Computer Industry: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The New Computer Industry Horizontal integration is new structure Each layer picks best from lower layer. Desktop (C/S) market 1991: 50% 1995: 75% Constant Dollars vs Constant Work: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Constant Dollars vs Constant Work Constant Work: One SuperServer can do all the world’s computations. Constant Dollars: The world spends 10% on information processing Computers are moving from 5% penetration to 50% 300 B$ to 3T$ We have the patent on the byte and algorithmSoftware Economics: Bill’s Law: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Software Economics: Bill’s Law Bill Joy’s law (Sun): Don’t write software for less than 100,000 platforms. @10M$ engineering expense, 1,000$ price Bill Gate’s law: Don’t write software for less than 1,000,000 platforms. @10M$ engineering expense, 100$ price Examples: UNIX vs NT: 3,500$ vs 500$ UNIX-Oracle vs SQL-Server: 100,000$ vs 1,000$ No Spreadsheet or Presentation pack on UNIX/VMS/... Commoditization of base Software & HardwareWhat comes next: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What comes next MANY new clients Applications to enable clients & servers super-serversOpportunities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Opportunities Bad News: Big players will dominate cpu / storage / network (Intel, Seagate, ?) OS (Microsoft, Novell) DB-TP-CS...(Oracle, Sybase, Informix, IBM, Novell, Microsoft) Good News: Applications are up for grabs! Value added is in applications & Content Service, & Support Advice: Create new sub-spaces in Cyberspace. Create new clients (e.g., cellular phones,...) Examples: SAP, Adabase, Lotus, Peoplesoft, Netscape, Doom..., Lotus Notes, NetscapeOutline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Cyberspace pep talk: PCs are bricks, Nets are mortar, DBs are land (content) 4B machines Commodity Processor-Disk-Tape-Net Plus commodity base software (POSIX or NT or..) Build 4T SuperServers from arrays of 4B machines Challenge: Software to automate operations & programming Parallel DBMSs do this Opportunities: new kinds of clients (e.g., intelligent universe) new applications (new subspaces in Cyberspace)ThesisMany Little will Win over Few Big: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Thesis Many Little will Win over Few Big 1 M$ 100 K$ 10 K$ Mainframe Mini Micro Nano 14" 9" 5.25" 3.5" 2.5" 1.8" Year 2000 4B Machine: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Year 2000 4B Machine The Year 2000 commodity PC (3K$) Billion Instructions/Sec Billion Bytes RAM Billion Bits/s Net 10 B Bytes Disk Billion Pixel display 3000 x 3000 x 24 pixel 10 B byte Disk .1 B byte RAM 1 Bips Processor 1 B bits/sec LAN/WAN4 B PC’s: The Bricks of Cyberspace: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey 4 B PC’s: The Bricks of Cyberspace Cost 3,000 $ Come with OS (NT, POSIX,..) DBMS High speed Net System management GUI / OOUI Tools Compatible with everyone else CyberBricksImplications of Hardware Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Implications of Hardware Trends Large Disc Farms will be inexpensive ( 100$/GB) Large RAM databases will be inexpensive (1,000$/GB) Processors will be inexpensive So The building block will be a processor with large RAM lots of Disc 1k SPECint CPU 50 GB Disc 5 GB RAMImplication of Hardware Trends: Clusters: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Implication of Hardware Trends: Clusters Future Servers are CLUSTERS of processors, discs Distributed Database techniques make clusters work CPU 50 GB Disc 5 GB RAMFuture SuperServer4T Machine: High Speed Network ( 10 Gb/s) Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Future SuperServer 4T Machine Array of 1,000 4B machines processors, disks, tapes comm lines A few MegaBucks Challenge: Manageability Programmability Security Availability Scaleability Affordability As easy as a single system 1,000 discs = 10 Terrorbytes 100 Tape Transports = 1,000 tapes = 1 PetaByte 100 Nodes 1 TipsGreat Debate: Shared What?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Great Debate: Shared What? Shared Memory (SMP) Shared Disk Shared Nothing (network) Easy to program Difficult to build Difficult to scaleup Hard to program Easy to build Easy to scaleup Sequent, SGI, Sun VMScluster, Sysplex Tandem, Teradata, SP2 Winner will be a synthesis of these ideas Distributed shared memory (DASH, Encore) blurs distinction between Network and Bus (locality still important) But gives Shared memory message cost.Scaleables: Uneconomic So Far: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Scaleables: Uneconomic So Far A Slice is a processor, memory, and a few disks. Slice Price of Scaleables so far is 5x to 10x markup Teradata: 70K$ for a Intel 486 + 32MB + 4 disk. Tandem: 100k$ for a MipsCo R4000 + 64MB + 4 disk Intel: 75k$ for an I860 +32MB + 2 disk TMC: 75k$ for a SPARC 3 + 32MB + 2 disk. IBM/SP2: 100k$ for a R6000 + 64MB + 8 disk Compaq Slice Price is less than 10k$ What is the problem? Proprietary interconnect Proprietary packaging Proprietary software (vendorIX)Summary: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Summary Storage trends force pipeline & partition parallelism Lots of bytes & bandwidth per dollar Lots of latency Processor trends force pipeline & partition Lots of MIPS per dollar Lots of processors Putting it together Scaleable Networks and Platforms) Build clusters of commodity processors & storage Commodity interconnect is key (S of PMS) Traditional interconnects give 100k$/slice. Commodity Cluster Operating System is key Fault isolation and tolerance is key Automatic Parallel Programming is keyThe Hardware is in Place and Then A Miracle Occurs: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Hardware is in Place and Then A Miracle Occurs Enables Parallel ApplicationsThe Software Challenge: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Software Challenge • Automatic data placement (partition: random or organized) • Automatic parallel programming (process placement) • Parallel concepts, algorithms & tools • Parallel Query Optimization • Execution Techniques load balance, checkpoint/restart, pacing, multi-programmingKinds of Parallel Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Kinds of Parallel Execution Pipeline Partition outputs split N ways inputs merge M ways Any Sequential Program Any Sequential Program Sequential Sequential Sequential Sequential Any Sequential Program Any Sequential Program Why Parallel Access To Data?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Why Parallel Access To Data? At 10 MB/s 1.2 days to scan 1,000 x parallel 1.5 minute SCAN. Parallelism: divide a big problem into many smaller ones to be solved in parallel. BandwidthDataFlow ProgrammingPrefetch & Postwrite Hide Latency : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DataFlow Programming Prefetch & Postwrite Hide Latency Can't wait for the data to arrive (2,000 years!) Need a memory that gets the data in advance ( 100MB/S) Solution: Pipeline from source (tape, disc, ram...) to cpu cache Pipeline results to destination LatencyWhy are Relational OperatorsSo Successful for Parallelism?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Why are Relational Operators So Successful for Parallelism? Relational data model uniform operators on uniform data stream Closed under composition Each operator consumes 1 or 2 input streams Each stream is a uniform collection of data Sequential data in and out: Pure dataflow partitioning some operators (e.g. aggregates, non-equi-join, sort,..) requires innovation AUTOMATIC PARALLELISMDatabase Systems “Hide” Parallelism : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Database Systems “Hide” Parallelism Automate system management via tools data placement data organization (indexing) periodic tasks (dump / recover / reorganize) Automatic fault tolerance duplex & failover transactions Automatic parallelism among transactions (locking) within a transaction (parallel execution)Automatic Parallel OR DB: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Automatic Parallel OR DB Select image from landsat where date between 1970 and 1990 and overlaps(location, :Rockies) and snow_cover(image) >.7; Temporal Spatial Image date loc image Landsat 1/2/72 . . . . . .. . . 4/8/95 33N 120W . . . . . . . 34N 120W Assign one process per processor/disk: find images with right data & location analyze image, if 70% snow, return it image Answer date, location, & image testsOutline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata. Tandem, Oracle, Informix, Sybase, DB2, RedBrickParallelism: Speedup & Scaleup: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallelism: Speedup & Scaleup Speedup: Same Job, More Hardware Less time Scaleup: Bigger Job, More Hardware Same time Transaction Scaleup: more clients/servers Same response timeThe New Law of Computing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The New Law of Computing Grosch's Law: Parallel Law: Needs Linear Speedup and Linear Scaleup Not always possibleParallelism: Performance is the Goal: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallelism: Performance is the Goal Goal is to get 'good' performance. Law 1: parallel system should be faster than serial system Law 2: parallel system should give near-linear scaleup or near-linear speedup or both. The New Performance Metrics: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The New Performance Metrics Transaction Processing Performance Council: TPC-A: simple transaction TPC-B: server only, about 3x lighter than TPC-A Both obsoleted by TPC-C (no new results after 6/7/95) TPC-C (revision 3) Transactions Per Minute tpm-C Mix of 5 transactions: query, update, minibatch Terminal price eliminated about 5x heavier than tpcA (so 3.5 ktpcA 20 ktpmC) TPC-D approved in March 1995 - Transactions Per Hour Scaleable database (30 GB, 100GB, 300GB,... ) 17 complex SQL queries (no rewrites, no hints without permission) 2 load/purge queries No official results yet, many “customer” results.TPC-C Results 12/94: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey TPC-C Results 12/94 Courtesy of Charles Levine of Tandem (of course)Success Stories: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Success Stories Online Transaction Processing many little jobs SQL systems support 3700 tps-A (24 cpu, 240 disk) SQL systems support 21,000 tpm-C (112 cpu,670 disks) Batch (decision support and Utility) few big jobs, parallelism inside Scan data at 100 MB/s Linear Scaleup to 500 processors transactions / sec hardware recs/ sec hardwareThe Perils of Parallelism: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Perils of Parallelism Startup: Creating processes Opening files Optimization Interference: Device (cpu, disc, bus) logical (lock, hotspot, server, log,...) Skew: If tasks get very small, variance > service time Benchmark Buyer's Guide: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmark Buyer's Guide Things to ask When does it stop scaling? Throughput numbers, Not ratios. Standard benchmarks allow Comparison to others Comparison to sequential Ratios and non-standard benchmarks are red flags.Performance 101: Scan Rate: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Agg Count Performance 101: Scan Rate Disk is 3MB/s to 10MB/s Record is 100B to 200B (TPC-D 110...160, Wisconsin 204) So should be able to read 10kr/s to 100kr/s Simple test: Time this on a 1M record table SELECT count(*) FROM T WHERE x < :infinity; (table on one disk, turn off parallelism) Typical problems: disk or controller is an antique no read-ahead in operating system or DB small page reads (2kb) data not clustered on disk big cpu overhead in record movement Parallelism is not the cure for these problems ScanParallel Scan Rate: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Scan Rate Agg Count Scan Agg Count Scan Agg Count Scan Agg Count Scan Agg Sum Simplest parallel test: Scaleup previous test: 4 disks, 4 controllers, 4 processors 4 times as many records partitioned 4 ways. Same query Should have same elapsed time. Some systems do.Parallel Update Rate : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Update Rate UPDATE Test: UPDATE T SET x = x + :one; Test for million row T on 1 disk Test for four million row T on 4 disks Look for bottlenecks. After each call, execute ROLLBACK WORK See if UNDO runs at the DO speed See if UNDO is parallel (scales up)Parallel Insert Rate: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Insert Rate INSERT INTO T2 SELECT * FROM T1; First 1 scan-> insert, then 4 scan->insert. See if log becomes bottleneck. After each run, call ROLLBACK WORK See if rollback: runs as fast as forward processing runs faster in parallel. If not, think about the implications of 100x parallelism Few systems pass these basic sanity tests. The records/$/second Metric: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The records/$/second Metric parallel database systems scan data An interesting metric (100 byte record): Record Scan Rate / System Cost Typical scan rates: 1k records/s to 30k records/s Each Scaleable system has a “slice price” guess: Gateway: 15k$ (P5 + ATM + 2 disks +NT + SQLserver or Informix or Oracle) Teradata: 75k$ Sequent: 75k$ (P5+2 disks+Dynix+Informix) Tandem: 100k$ IBM SP2: 130k$ (RS6000+2 disks, AIX, DB2) You can compute slice price for systems later in presentation BAD: 0.1 records/s/$ (there is one of these) GOOD: 0.33 records/s/$ (there is one of these) Super! 1.00 records/s/$ (there is one of these) We should aim at 10 records/s/$ with P6.Embarrassing Questions to Ask Your PDB Vendor: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Embarrassing Questions to Ask Your PDB Vendor How are constraints checked? ask about unique secondary indices ask about deferred constraints ask about referential integrity How does parallelism interact with triggers Stored procedures OO extensions How can I change my 10 TB database design in an hour? add index add constraint reorganize / repartition These are hard problems.Outline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata. Tandem, Oracle, Informix, Sybase, DB2, RedBrickAutomatic Data Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Automatic Data Partitioning Split a SQL table to subset of nodes & disks Partition within set: Range Hash Round Robin Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning Good for equijoins, range queries group-by Good for equijoins Good to spread loadIndex Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Index Partitioning Hash indices partition by hash B-tree indices partition as a forest of trees. One tree per range Primary index clusters data Secondary Index Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Secondary Index Partitioning In shared nothing, secondary indices are Problematic Partition by base table key ranges Insert: completely local (but what about unique?) Lookup: examines ALL trees (see figure) Unique index involves lookup on insert. Partition by secondary key ranges Insert: two nodes (base and index) Lookup: two nodes (index -> base) Uniqueness is easy Teradata solution Base Table A..Z Base Table A..Z A..Z A..Z A..ZKinds of Parallel Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Kinds of Parallel Execution Pipeline Partition outputs split N ways inputs merge M ways Any Sequential Program Any Sequential Program Any Sequential Any Sequential Program ProgramData Rivers Split + Merge Streams: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Rivers Split + Merge Streams River M Consumers N producers Producers add records to the river, Consumers consume records from the river Purely sequential programming. River does flow control and buffering does partition and merge of data records River = Split/Merge in Gamma = Exchange operator in Volcano. N X M Data StreamsPartitioned Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Partitioned Execution Spreads computation and IO among processors Partitioned data gives NATURAL parallelismN x M way Parallelism: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey N x M way Parallelism N inputs, M outputs, no bottlenecks. Partitioned Data Partitioned and Pipelined Data FlowsPicking Data Ranges: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Picking Data Ranges Disk Partitioning For range partitioning, sample load on disks. Cool hot disks by making range smaller For hash partitioning, Cool hot disks by mapping some buckets to others River Partitioning Use hashing and assume uniform If range partitioning, sample data and use histogram to level the bulk Teradata, Tandem, Oracle use these tricks Blocking Operators = Short Pipelines: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Blocking Operators = Short Pipelines An operator is blocking, if it does not produce any output, until it has consumed all its input Examples: Sort, Aggregates, Hash-Join (reads all of one operand) Blocking operators kill pipeline parallelism Make partition parallelism all the more important. Sort Runs Scan Sort Runs Sort Runs Sort Runs Tape File SQL Table Process Merge Runs Merge Runs Merge Runs Merge Runs Table Insert Index Insert Index Insert Index Insert SQL Table Index 1 Index 2 Index 3 Database Load Template has three blocked phasesSimple Aggregates (sort or hash?): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Simple Aggregates (sort or hash?) Simple aggregates (count, min, max, ...) can use indices More compact Sometimes have aggregate info. GROUP BY aggregates scan in category order if possible (use indices) Else If categories fit in RAM use RAM category hash table Else make temp of <category, item> sort by category, do math in merge step. Parallel Aggregates: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Aggregates For aggregate function, need a decomposition strategy: count(S) = S count(s(i)), ditto for sum() avg(S) = (S sum(s(i))) / S count(s(i)) and so on... For groups, sub-aggregate groups close to the source drop sub-aggregates into a hash river. Sort: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sort Used for loading and reorganization (sort makes them sequential) build B-trees reports non-equijoins Rarely used for aggregates or equi-joins (if hash available Sort Runs Input Data Sorted Data MergeParallel Sort: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Sort M input N output Sort design Disk and merge not needed if sort fits in memory Scales linearly because Sort is benchmark from hell for shared nothing machines net traffic = disk bandwidth, no data filtering at the sourceSIGMOD Sort Award: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey SIGMOD Sort Award Datamation Sort: 1M records (100 B recs) 1000 seconds 1986 60 seconds 1990 7 seconds 1994 3.5 seconds 1995 (SGI challenge) micros finally beat the mainframe! finally! a UNIX system that does IO SIGMOD MinuteSort 1.1GB, Nyberg, 1994 Alpha 3cpu 1.6GB, Nyberg, 1995 SGI Challenge (12 cpu) no SIGMOD PennySort recordNested Loops Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Nested Loops Join Outer Table Inner Table If inner table indexed on join cols (b-tree or hash) then sequential scan outer (from start key) For each outer record probe inner table for matching recs Works best if inner is in RAM (=> small innerMerge Join (and sort-merge join): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Merge Join (and sort-merge join) Left Table Right Table NxM case Cartesian product Partitions well: partition smaller to larger partition. Works for all joins (outer, non-equijoins, Cartesian, exclusion,...) If tables sorted on join cols (b-tree or hash) then sequential scan each (from start key) left < right left=right left > right advance left match advance right Nice sequential scan of data (disk speed) (MxN case may cause backwards rescan) Sort-merge join sorts before doing the mergeHash Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Hash Join Hash smaller table into N buckets (hope N=1) If N=1 read larger table, hash to smaller Else, hash outer to disk then bucket-by-bucket hash join. Purely sequential data behavior Always beats sort-merge and nested unless data is clustered. Good for equi, outer, exclusion join Lots of papers, products just appearing (what went wrong?) Hash reduces skew Right Table Left Table Hash BucketsHash Join Variants: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Hash Join Variants Hybrid: Keep one bucket in memory. Bitmap: If disk-based, keep bitmap of first table join vals Filter second table with first. (useless if a 1-N join (which is typical)) Bucket Tuning: use many small buckets Aggregate them based on memory pressureParallel Hash Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Hash Join ICL implemented hash join with bitmaps in CAFS machine (1976)! Kitsuregawa pointed out the parallelism benefits of hash join in early 1980’s (it partitions beautifully) We ignored them! (why?) But now, Everybody's doing it. (or promises to do it). Hashing minimizes skew, requires little thinking for redistribution Hashing uses massive main memory Index-only Scan and Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Index-only Scan and Join Index has secondary key and primary key or RID. Index is like a table (but compact and clustered) Think of it as a replica Can scan it rather than base table (called a semi-join when used for joins) Can scan it, select primary keys, Sort primary keys and Scan base table Index Sec Key(s) Pri Keys (called a hybrid-join when used for joins) Exotic Joins (Exclusion, Cartesian, Outer, ...): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Exotic Joins (Exclusion, Cartesian, Outer, ...) Exclusion used for NOT IN and DIFFERENCE queries Outer is “lossless” join, {left, right} appears with null sibling if matching sibling not found. Cartesian is often a mistake (missing where clause) but also important for small-table large-table optimization. Small table used as a pick list. Each small table represents a mapping: name -> code set membership (e.g. holidays) Best plan: Restrict small tables, Form Cartesian product of all small Send it to each partition of large table for hash join. Observations: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Observations It is easy to build a fast parallel execution environment (no one has done it, but it is just programming) It is hard to write a robust and world-class query optimizer. There are many tricks One quickly hits the complexity barrier Common approach: Pick best sequential plan Pick degree of parallelism based on bottleneck analysis Bind operators to processWhat’s Wrong With That?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What’s Wrong With That? Why isn’t the best serial plan, the best parallel plan? Counter example: Table partitioned with local secondary index at two nodes Range query selects all of node 1 and 1% of node 2. Node 1 should do a scan of its partition. Node 2 should use secondary index. SELECT * FROM telephone_book WHERE name < “NoGood”; Sybase Navigator & DB2 PE should get this right. We need theorems here (practitioners do not have them) N..Z Table Scan A..M Index ScanGreat Debate: Shared What?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Great Debate: Shared What? Shared Memory (SMP) Shared Disk Shared Nothing (network) Easy to program Difficult to build Difficult to scaleup Hard to program Easy to build Easy to scaleup Sequent, SGI, Sun VMScluster, Sysplex Tandem, Teradata, SP2 Winner will be a synthesis of these ideas Distributed shared memory (DASH, Encore) blurs distinction between Network and Bus (locality still important) But gives Shared memory message cost.What Systems Work This Way: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What Systems Work This Way Shared Nothing Teradata: 400 nodes Tandem: 110 nodes IBM / SP2 / DB2: 128 nodes Informix/SP2 48 nodes ATT & Sybase 8x14 nodes Shared Disk Oracle 170 nodes Rdb 24 nodes Shared Memory Informix 9 nodes RedBrick ? nodes Outline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata - Oracle -DB2 Tandem - Informix -RedBrick - Sybase System Survey Ground Rules: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey System Survey Ground Rules Premise: The world does not need yet another PDB survey It would be nice to have a survey of “real” systems Visited each parallel DB vendor I could (time limited) Asked not to be given confidential info. Asked for public manuals and benchmarks Asked that my notes be reviewed I say only nice things (I am a PDB booster) Acknowledgments: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Acknowledgments Teradata Todd Walter and Carrie Ballinger Tandem Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong Oracle Gary Hallmark, Bill Widdington Informix Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay Navigator Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman , Bob McDonald, Jan Graveson Ron Chung Hu, Stuart Thompto DB2 Chaitan Baru, Gilles Fecteau, James Hamilton, Hamid Pirahesh Redbrick Phil Fernandez, Donovan Schneider Teradata : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Teradata • Ship 1984, now an ATT GIS brand name • Parallel DB server for decision support SQL in, tables out • Support Heterogeneous data (convert to client format) Data hash partitioned among AMPs with fallback (mirror) hash. Applications run on clients Biggest installation: 476 nodes, 2.4 TB Ported to UNIX baseParsing Engines: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parsing Engines Interface to IBM or Ethernet or... Accept SQL, return records and status. Support SQL 89, moving to SQL92 Parse, Plan & authorize SQL cost based optimizer Issue requests to AMPs Merge AMP results to requester. Some global load control based on client priority (adaptive and GREAT!) Access Modules Almost all work done in AMPs A shared nothing SQL engine scans, inserts, joins, log, lock,.... Manages up to 4 disks (as one logical volume) Easy design, manage, grow (just add disk)Data Layout: Hash Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Layout: Hash Partitioning All data declustered to all nodes Each table has a hash key (may be compound) Key maps to one of 4,000 buckets Buckets map to one of the AMPs Non-Unique secondary index partitioned by table criterion Fallback bucket maps to second AMP in cluster. Typical cluster is 6 nodes (2 is mirroring). Cluster limits failure scope: 2 failures only cause data outage if both in same cluster. Within a node, each hash to cylinder then hash to “page” Page is a heap with a sorted directory Teradata Optimization & Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Teradata Optimization & Execution Sophisticated query optimizer (many tricks) Great emphasis on Joins & Aggregates. Nested, merge, product, bitmap join (no hash join) Automatic load balancing from hashing & load control Excellent utilities for data loading, reorganize Move > 1TB database from old to new in 6 days, in background while old system running Old hardware, 3.8B row table (1TB), >300 AMPs typical scan, sort, join averages 30 minutes Query Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution Protocol PE requests work AMP responds OK (or pushback) AMP works (if all OK) AMP declares finished When all finished, PE does 2PC and starts pull Simple scan: PE broadcasts scan to each AMP Each AMP scans produces answer spool file PE pulls spool file from AMPs via Ynet If scan were ordered, sort “catcher” would be forked at each AMP pipelined to scans Ynet and PE would do merge of merges from AMPs Aggregates, Updates: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Aggregates, Updates Aggregate of Scan: Scan’s produce local sub-aggregates Hash sub-aggregates to Ynet Each AMP “catches” its sub-aggregate hash buckets Consolidate sub-aggregates. PE pulls aggregates from AMPs via Ynet. Note: fully scaleable design Insert / Update / Delete at a AMP node generates insert / update /delete messages to unique-secondary indices fallback bucket of base table. messages saved in spool if node is downQuery Execution: Joins: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution: Joins Great emphasis on Joins. Includes small-table large-table optimization cheapest triple, then cheapest in triple. If equi-partitioned, do locally If not equi-partitioned, May replicate small table to large partition (Ynet shines) May repartition one if other is already partitioned on join May repartition both (in parallel) Join algorithm within node is Product Nested Sort-merge Hash bit map of secondary indices, intersected. Utilities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Utilities Bulk Data Load, Fast Data Load, Multi-load, Blast 32KB of data to an AMP Multiple sessions by multiple clients can drive 200x parallel Double buffer AMP unpacks, and puts “upsert”onto Ynet One record can generate multiple upserts (transaction-> inventory, store-sales, ...) Catcher on Ynet, grabs relevant “upserts” to temp file. Sorts and then batches inserts (survives restarts). Online and restartable. Customers cite this as Teradata strength. Fast Export (similar to bulk data load)Utilities II: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Utilities II Backup / Restore: Rarely needed because of fallback. Cluster is unit of recovery Backup is online, Restore is offline Reorganize: Rarely needed, add disk is just restart Add node: rehash all buckets that go to that node: (Ynet has old and new bucket map) Fully parallel and fault tolerant, takes minutes Port To UNIX: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Port To UNIX New design (3700 series) described in VLDB 93 Ported to UNIX platforms (3600 AP, PE, AMP) Moved Teradata to Software Ynet on SMPs Based on Bullet-Proof UNIX with TOS layer atop. message system communications stacks raw disk & virtual processors virtual partitions (buckets go to virtual partitions) removes many TOS limits Result is 10x to 60x faster than an AMP Compiled expression evaluation (gives 50x speedup on scans) Large main memory helps UNIX 5.4 (SMP, RAS, virtual Ynet) UNIX PDE: TOS adapter Teradata SQL (AMP logic) Parsing engine (parallelism) Applications SQL HARDWARECustomer Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Customer Benchmarks Standard Benchmarks Only old Boral/DeWitt Wisconsin numbers. Nothing public. Moving > 1TB database from one old to new in 6 days, in background while old system running So: unload-load rate > 2MB/s sustained Background task (speed limited by host speed/space) Old hardware, 3.8B row table, >300 AMPs typical scan, sort, join averages 30 minutes rates (rec size not cited): krec/s/AMP k rec/s scan: 9 2.7 mr/s !!!!!! clustered join: 2 600 kr/s insert-select: .39 120 kr/s Hash index build: 3.3 100 kr/sUNIX/SMP Port of Teradata: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey UNIX/SMP Port of Teradata op rows seconds k r/s MB/s scan 50000000 737 67.8 11.0 copy 5000000 1136 4.4 0.7 aggregate 50000000 788 63.5 10.3 Join 50x2M (clustered) 52000000 768 67.7 11.0 Join 5x5 (unclustered) 10000000 237 42.2 6.8 Join 50Mx.1K 50000100 1916 26.1 4.2 Times to process a Teradata Test DB on a 8 Pentium, 3650. These numbers are 10 to 150x better than a single AMP Compiled expression handling more memory Teradata Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Teradata Good Things Scaleable to large (multi-terabyte) databases Available TODAY! It is VERY real: in production in many large sites Robust and complete set of utilities Automatic management. Integrates with the IBM mainframe OLTP world Heterogeneous data support is good data warehouse Tandem: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Message-based OS (Guardian): (1) location transparency (2) fault isolation (failover to other nodes). Expand software 255 Systems WAN Classic shared-nothing system (like Teradata except applications run inside DB machine. 4 node System 8 x1M B/S 30MB/S 1-16 MIPS R4400 cpus dual port controllers, dual 30MB/s LAN 224 PROCESSORS 1974-1985: Encompass: Fault-tolerant Distributed OLTP 1986: NonStopSQL: First distributed and high-performance SQL (200 tps) 1989: Parallel NonStopSQL: Parallel query optimizer/executor 1994: Parallel and Online SQL (utilities, DDL, recovery, ....) 1995: Moving to ServerNet: shared disk modelTandem Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Data Layout Each table or index range partitioned to a set of disks (anywhere in network) Index is B-tree per partition clustering index is B+ tree Table fragments are files (extent based). Descriptors for all local files live in local catalog (node autonomy) Tables can be distributed in network (lan or wan) Duplexed disks and disk processes for failover Partition Block Extents may be added File= {parts}Tandem Software (Process Structure): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Software (Process Structure) Disk Server Pair Data partition C/COBOL/.. Application SQL SQL engine Joins, Sorts global aggs triggers index maintenance views security Query Compiler Utilities Transactions Helper Processes GUI Selects Update, Delete Record/Set insert Aggregates Assertions Locking Logging buffer pool Disk Pair or Array Hardware & OS move data at 4MB/s with >1 ins/byteOLTP Features: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey OLTP Features Insert / Update / Delete index in parallel with base table If 5 indices, 5x faster response time. Record and key-value range locking, SQL92 isolation levels Undo scanner per log: double-buffers undo to each server 21 k tpc-C (WOW!!) with 110 node server (800GB db) Can mix OLTP and batch. Priority serving to avoid priority inversion problem Buffer management prevents sequential buffer pollutionTandem Query Plan & Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Query Plan & Execution Simple selects & aggregates done in disk servers Parallelism chosen: scan: table fragmentation hash: # processors or Outer table fragments Sorts: redistribution, sort in executors (N-M) Joins done in executors (nest, sort-merge, hash). Redistribution is always a hash (minimize skew) Pipeline as deep as possible (use lots of processes) Multiple logs & parallel UNDO avoid bottlenecks Can mix OLTP and batch. Priority serving to avoid priority inversion problem Buffer management prevents sequential buffer pollutionParallel Operators: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Operators Initially just inserted rivers between sequential operators Parallel query optimizer Created executors at all clustering nodes or at all nodes, repartitioned via hash to them Gave parallel select, insert, update, delete join, sort, aggregates,... correlated subqueries are blocking Got linear speedup/scaleup on Wisconsin. Marketing never noticed, product slept from 1989-1993 Developers added: Hash Join aggregates in disk process SQL92 features parallel utilities online everything converted to MIPSco fixed bugs Join Strategies: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Join Strategies Nested loop Sort merge Both can work off index-only access Replicate small to all partitions (when one small) Small-table Cartesian product large-table optimization Now hybrid-hash join uses many small buckets tuned to memory demand tuned to sequential disk performance no bitmaps because (1) parallel hash (2) equijoins usually do not benefit When both large, and unclustered (rare case) N+M scanners, 16 catchers: sortmerge or hybrid hashAdministration (Parallel & Online everything): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Administration (Parallel & Online everything) All utilities are online (claim to reduce outages by 40%): Add table, column,... Add index: builds index from stale copy uses log for catchup in final minute, gets lock, completes index. Reorg B-tree while it is accessed Add / split/ merge/ reorg partition Backup Recover page, partition, file. Add, alter logs, disks, processors, ... You need this: Terabyte operations take a long time! Parallel Utilities: load (M to N) index build (M scanners, N inserters, in background) recovery:Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks No official DSS benchmark reports Unofficial results 1 to 16 R4400 class processors, 64MB each (Himalayas) 3 disks, 3 ctlrs each Sequential 16x Parallel rec/s MB/s rec/s MB/s speedup Load Wisc 1.6 kr/s 321 Kb/s 28 kr/s 5.4 MB/s 16 Parallel Index build 1.5 kr/s 15Kb/s 24 kr/s 240 KB/s 16 SCAN 28 kr/s 5.8 MB/s 470 kr/s 94 MB/s 16 !!!!!!! Aggregate (1 col) 25 kr/s 4.9 MB/s 400 kr/s 58 MB/s 16 Aggregate (6 col) 18 kr/s 3.6 MB/s 300 kr/s 60 MB/s 16 2-Way hash Join 13 kr/s 2.6 MB/s 214 kr/s 42 MB/s 16 3-Way hash Join ? kr/s ? Mb/s ? kr/s ? MB/s ? 1x and 16x rates are best I’ve seen anywhere.Tandem Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Good Things 21 K TPM-C (WOW!) It is available TODAY! Online everything Fault tolerant, distributed, high availability Mix OLTP and batch Great Hash Join Algorithm Probably the best peak performance availableOracle: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Parallel Server (V7): Multiple threads in a server Multiple servers in a cluster Client/server, OLTP & clusters (TP lite) Parallel Query (V7.1) Parallel SELECT (and sub-selects) Parallel Recovery: (V7.1) @ restart, one log scanner, multiple redoers Beta in 1993, Ship 6/94. More Parallel (create table): V7.2, 6/95 Shared disk implementation ported to most platforms Parallel SELECT (no parallel INSERT, UPDATE, DELETE, DDL) except for sub-selects inside these verbs.Oracle Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Data Layout Homogenous: one table (index) per segment extents picked from a TableSpace Files may be raw disk Segments are B-trees or heaps. data -> disk map is automatic No range / hash / round-robin partitioning ROWID can be used as scan partitioning on base tables. Guiding principal: If its not organized, it can’t get disorganized, and doesn’t need to be reorganized.Oracle Parallel Query Product Concept: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Parallel Query Product Concept Convert serial SELECT plan to parallel plan If Table scan or HINT then consider parallel plan Table has default degree of parallelism (explicitly set) Overridden by system limits and hints. Use max degree of all participating tables. Intermediate results are hash partitioned Nested Loop Join and Merge Join User hints can (must?) specify join order, join strategy, index, degree of parallelism,... Query Planning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Planning Query Coordinator starts with Oracle Cost-Based plan If plan requests Table scan or HINT then consider parallel plan Table has default degree of parallelism (explicitly set) Overridden by system limits and hints. Use max degree of all participating tables. Shared disk makes temp space allocation easy Planner picks degree of parallelism and river partitioning. Proud of their OR optimization. Query Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution Coordinator does extra work to merge the outputs of several sorts subsorts pushed to servers aggregate the outputs of several aggregates aggregates pushed to servers Parallel function invocation is potentially a big win. SELECT COUNT ( f(a,b,c,...)) FROM T; Invokes function f on each element of T, 100x parallel. Join Strategies: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Join Strategies Oracle has (1) Nested Loop Join (2) Merge Join Replicate inner to outer partition automatic in shared disk (looks like partition outer). Has small-table large-table optimization (Cartesian product join) User hints can specify join order, join strategy, index degree of parallelism,... Transactions & Recovery: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Transactions & Recovery Transactions and transaction save points (linear nest). ReadOnly snapshots for decision support. SQL92 isolation levels (ACID = Snapshot isolation) Database has multiple rollback segments UNDO log, Transaction has one commit/REDO log so may be a bottleneck Parallel recovery at restart: One log scanner, DEGREE REDO streams, typically one per disk INSTANCE REDO streams, typically two-deep per disk Oracle Utilities : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Utilities User can write parallel load / unload utility Index build, Constraints, are separate steps Not incremental or online or restartable. Update Statistics (Analyze) is not parallel Index build is a N-1 parallel: N scanner/sorter, 1 inserter. Parallel recovery at restart: One log scanner, DEGREE REDO streams, typically one per disk INSTANCE REDO streams, typically two-deep per disk Administration Not much special: Limit degree of parallelism at a server Set default parallelism of a table Query can only lower these limits No special tools, meters, monitors,... Just ordinary Parallel Server Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks Sequent 20x 50MHz 486, .5GB RAM, 20 disk Sequential 20x Parallel rec/s krecr/s KB/s rec/s MB/s speedup Load 5M Wisc .5 kr/s 113 KB/s 8.8 kr/s 1.8 MB/s 16 Parallel Index load 2.2 kr/s 18 Kb/s 29 kr/s 235 KB/s 13 SCAN 1.7 kr/s 364 KB/s 26 kr/s 5.3 MB/s 15 Agg MJ 3.3 kr/s 660 KB/s 45 kr/s 9.3 MB/s 14 Agg NJ 1.4 kr/s 290 KB/s 26 kr/s 5.4 MB/s 19 Same benchmark on 16x SP1 (a shared nothing machine), got similar results. 168x N-cube ( 16MB/node), 4 lock nodes, 64 disk nodes got good scaleup Oracle has published details on all these benchmarks. 20 Pentium, 40 disk system, SCAN at 44 MB/s 55% cpu Sept 1994 news:Oracle Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Good Things Available now! Parallel Everywhere (on everybody’s box) A HIGH FUNCTION SQL No restrictions (triggers, indices,...) Very easy to use (almost no knobs or options) Parallel invocation of stored procedures Near-linear scaleup and speedup of SELECTs. Respectable performance on SequentInformix: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix DSA (Dynamic Scaleable Architecture) describes redesign to thread-based, server-based system. V6 - 1993 - : DSA -- rearchitecture (threads, OLTP focus) V7 - 1994 - : PDQ -- Parallel Data Query (SMP) V8 - 1995 - : XMP -- Cluster parallelism (shared disk/nothing). Parallelism is a MAJOR focus now that SQL92 under control Other major focus is TOOLS (ODBC, DRDA, NewEra 4GL). Informix is a UNIX SQL system: AIX (IBM), HP/UX (HP), OSF/1 (DEC, HP), SCO/UNIX, Sequent/DYNIX, SUN (SunOS, Solaris) Today shared nothing parallelism on IBM SP2, ATT3650, ICL, (beta) Informix Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Data Layout DBspace Block Chunks may be added File Table or index maps to homogeneous set of DB spaces contains “chunks” (extents) Partition by: range, round robin expression hash (V8) Access via B+Tree, B* tree, and hash (V8) Built an extent-based file system on raw disks or files High speed sequential, clustering, async IO,...Informix Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Execution Completely parallel DML, some parallel DDL Parallel SELECT, UPDATE, DELETE Executor per partition in all cases. Parallel sort, joins (nest, merge, hash) aggregates, union Whenever an operator has input and a free output buffer, it can work to fill the output buffer. Natural flow control Blocking operators (sort, hash join, aggregates, correlated subqueries) Spool to a buffer (if small), else spool to disk. Shared buffer pool minimizes data copies.Parallel Plans: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Plans Query plan is parallelized by scanner per table partition (does select, project) sub-aggregates per partition (hash or sort) If clustered join (nested loop or merge) then operator per outer or per partition If hash-join, parallel scan smaller first, build bitmap and hash buckets then scan larger and: join to smaller if it fits in memory else filter via bitmap and build larger buckets then join bucket by bucket Hybrid hash join with bitmaps and bucket tuning. Parallel Operators: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Operators Parallel SELECT, UPDATE, DELETE Executor per partition in all cases. Parallel sort, joins, aggregates, union Only correlated subqueries are blocking Completely parallel DML, some parallel DDL Transactions & Recovery: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Transactions & Recovery SQL 2 isolation levels allow DSS to run in background Transaction save points Separate logical and physical logs. Bulk updates could bottleneck on single log. Recovery unit is data partition (DBspace) Parallel recovery: thread per DBspace If DB fragment unavailable, DSS readers can skip itInformix Administration: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Administration Can assign % of processors, memory, IO to DSS (parallel query) Sum of all parallel queries live within this quota Each query can specify the % of the total that it wishes. (0 means sequential execution) Parallel Data load (SMP only) Parallel Index Build (N - M) Parallel recovery Online backup / restore UtilitiesBenchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks Sequent system: 9 Pentium processors 1 GB main memory Base tables on 16 disk (FWD SCSI) Indices on 10 discs Temp space on 10 disks Sequential Parallel rec/s MB/s rec/s MB/s speedup Load 300M Wisc 3kr/s 600Kb/s Parallel Index load 48kr/s 1MB/s SCAN 17kr/s 3.5MB/s 147kr/s 30MB/s 8.3 Aggregate 11kr/s 2.3MB/s 113kr/s 23MB/s 10.1 2-Way hash Join 18kr/s 3.2MB/s 242kr/s 31MB/s 9.7 3-Way hash Join 25kr/s 3.5Mb/s 239kr/s 33MB/s 9.5 Informix Shared Nothing Benchmark : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Shared Nothing Benchmark IBM SP2 - : TPC-D-like database 48 SP2 Processors Customer Benchmark, Not audited benchmark. Load 60 GB in 40 minutes, 250 GB in 140 min about 100 GB/hr ! 2GB/node/hr Scan & Aggregate (#6) 60 GB in 7 min = 140 MB/s = 3 MB/s/node = 30 kr/s 260 GB in 24 min = 180 MB/s = 4 MB/s/node = 40 kr/s Power Test (17 complex queries and 2 load/purge ops) 60 GB in 5 hrs 260 GB in 18 hrs Multiuser Test: 1 user, 12 queries: 10 hrs, 4 users, 3 queries: 10 hrs Informix Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Good Things A full function SQL Available today on Sequent Beautiful manuals Linear speedup and scaleup Best published performance on UNIX systems Probably best price performance. (but things are changing fast!) Some mechanisms to mix OLTP and batch.Sybase Navigator Product concept: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sybase Navigator Product concept Two layer software architecture: (1) Navigator drives array of shared-nothing SQL engines. (2) Array of SQL engines, each unaware of others. similar to Tandem disk processes SQL engine is COTS. Goal: linear scaleup and speedup, plus good OLTP support Emphasize WHOLE LIFECYCLE Configurator: tools to design a parallel system Administrator: tools to manage a parallel system (install/upgrade, start/stop, backup/restore, monitor/tune) Optimizer: execute requests in parallel.Configurator: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Configurator Fully graphical design tool Given ER model and dataflow model of the application workload characteristics response time requirements, hardware components (heavy into circles and arrows) Recommends hardware configuration/ Table definitions (SQL) table partitioningAdministrator: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Administrator Made HUGE investments in this area. Truly industry leading graphical tools make MPP configuration “doable”. GUI interface to manage: startup / shutdown of cluster backup / restore / manage logs configure (install, add nodes, configure and tune servers) Manage / consolidate system event logs System stored procedures (global operations) (e.g. aggregate statistics from local to global cat) Monitor SQL Server events Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Layout Pure shared nothing Navigator partitions data among SQL servers • map to a subset of the servers • range partition or hash partition. Secondary indices are partitioned with base table No Unique secondary indices Only shorthand views, no protection views Schema server stores global data definition for all nodes. Each partition server has schema for its partition data for its partition. log for its partition Sybase SQL Server Backgrounder: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sybase SQL Server Backgrounder Recently became SQL89 compliant (cursors, nulls, etc) Stored procedures, multi-threaded, internationalized, B*-tree centric (clustering index is B+tree) Use nested loops, sort-merge join (sort is index build). Page locking, 2K disk IO, ... other little-endian design decisions. Respectable TPC-C results (AIX RS/6000). UNIX raw disks or files are base (also on OS/2, NetWare,...). table->disk mapping CREATE DATABASE name ON {device...} LOG ON {device...} SP_ADDSEGMENT segment, device CREATE TABLE name(cols) [ ON segment] Microsoft has a copy of the code, deep ported to NT Navigator Extension Mechanisms: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Navigator Extension Mechanisms Navigator extended Sybase TDS by Adding stored procedures to do things Extending the syntax (e.g. see data placement syntax below) Sybase TDS and OpenServer design are great for this All “front ends based on OpenServer and threads”Process Structure - Pure Shared Nothing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Process Structure - Pure Shared Nothing DBA Server does everything: SQL compilation System management Catalog management SQL server restart (in 2nd node) DBA fallback detects deadlock does DBA takeover on fail Control server at each node manages SQL servers there (security, request caching, 2PC, final merge /aggregate,... parallel stored procedures (SMID) ) Split server manages re-partitioning of data SQL Server is unit of query parallelism, (one per cpu per node) Simple Request Processing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Simple Request Processing Control (1/node) Client SQL Split DBA server schema server Client connects to Navigator (a Control Server) using standard Sybase TDS protocol. SQL request flows to DBA server that compiles it sends stored procedures (plans) to all control servers plans to all relevant SQL servers Control server executes plan. Pass to SQL server, returns results. Plan cached on second call, DBA server not invoked. Good for OLTP Parallel Request Processing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Request Processing Control Split Control Client SQL Split DBA server schema server Control Split If query involves multiple nodes, then command sent to each one (diagram shows secondary index lookup) Query sent to SQL servers that may have relevant data. If data needs to be redistributed or aggregated, split servers issue queries and inserts (that is their only role) split servers have no persistent storage. Data Manipulation: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Manipulation SQL server is unit of parallelism "Parallelized EVERYTHING in the T-SQL language" Includes SIMD execution of T-SQL procedures, plus N-M data move operations. Two-level optimization: DBA Server has optimizer (BIG investment, all new code, NOT the infamous Sybase optimizer) Each SQL server has Sybase optimizer If extreme skew, different servers have different plans DBA optimizer shares code with SQL server (so they do not play chess with one another). Very proud of their optimizer.Query Execution : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution Classic Sellinger cost-based optimizer. SELECT, UPDATE, DELETE N-to-M parallel Bulk and async INSERT interface. N-M Parallel sort Aggregate (hash/sort) select and join can do index-only access if data is there. eliminate correlated subqueries (convert to join). (Gansky&Wong. SIGMOD87 extended) Join: nested-loop, sort-merge, index only Sybase often dynamically builds index to support nested loop (fake sort-merge) Typically left-deep sequence of binary joins.Join and Partition Strategy: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Join and Partition Strategy Partition strategies If already partitioned on join, then no splitting Else Move subset of T1 to T2 partitions. or Replicate T1 to all T2 partitions or repartition both T1 and T2 to width of home nodes or target. No hash join, but all (re) partitioning is range or hash based. Not aggressive parallelism/pipelining: 2 op at a time. Pipeline to disk via split server (not local to disk and then split). Split servers fake subtables for SQL engines. Top level aggregates merged by control, others done by split. Utilities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Utilities Bulk data load (N-M) async calls GUI manages Backup all SQL serves in parallel Reorg via CREATE TABLE <new> , INSERT INTO <new> SELECT * FROM <old> Utilities are mostly offline (as per Sybase) Nice EXPLAIN utility Futures: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Futures Hash join within split servers Shared memory optimizations Full support for unique secondary indices Full trigger support (cross-server triggers) Full security and view support.Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks Preliminary: 8x8 3600 - Ynet. node: 8 x (50MHz 486 256k local cache) 512MB main memory, 2 x 10 disk arrays, @ 2GB 4 MB/s per disk. 6 x Sybase servers Scaleup & speedup tests of 1, 4, and 8 nodes. Numbers (except loading) reported as ratios of elapsed times S&S tests show a >7x speedup of 8-way over 1-way Tests cover insert, select, update, delete, join, aggregate, load Reference Account: Chase Manahattan Bank 14x8 P5 ATT 3600 cluster: (112 processors) 56 SQL servers, 10GB each = 560 GB 100x faster than DB2/MVS (minutes vs days) Linearity is great. Navigator Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Navigator Good Things Concern for lifecycle design, install, manage, operate, use Good optimization techniques Fully parallel, including stored procedures! Scaleup and Speedup are near linear.Sybase IQ: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sybase IQ Sybase bought Expressway Expressway evolved from Model 204 bitmap technology: index duplicates with bitmap compress bitmap. Can give 10x or 100x speedup. Can save space and IO bandwidth Currently, two products (Sybase and IQ) not integratedDB2: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2 DB2/VM: = SQL/DS: System R gone public DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...) Parallel and async IO into one process (on mainframe) Parallel execution in next release (late next year?) MVS PQS now withdrawn? DB2/AS400: Home grown DB2-2-PE: OS2/DM grown large. First moved to AIX Being extended parallelism Parallelism based on SP/2 -- shared nothing done right. Benchmarks today - Beta everywhere DB2++: separate code path has OO extensions, good TPC-C Ported to HP/UX, Solaris, NT in beta DB2/2 Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Data Layout DATABASE: a collection of nodes (up to 128 SP2s so far) NODEGROUP: a collection of logical nodes (a 4k hash map LOGICAL NODE: A DB2 instance (segments, log, locks...) PHYSICAL NODE: A box. Logical Node: Segments of 4 k pages Segments allocated in units (64K default) Tables stripe across all segments Table created in NodeGroup: Hash (partition key) across all members of group Cluster has single system Image Segments Nodes: Group 1 Group 2DB2/2 Query Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Query Execution Each node maintains pool of AIX server processes Query optimizer does query decomposition to node plans (like R* distributed query decomposition) Parallel Optimization is 1Ø (not like Wai Hong’s work) Sends sub-plans to nodes to be executed by servers Node binds plan to server process Intermediate results hashed Proud that Optimizer does not need hints. “Standard” join strategies (except no hash join).DB2/2 Utilities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Utilities 4 loaders: import raw-insert (fabricates raw blocks, no checks) insert bulk insert Reorganize hash map, add / drop nodes, add devices Table unavailable during these operations Online & Incremental backup Fault tolerance via HACMPDB2/2 Performance: Good performance Great Scaling: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Performance: Good performance Great Scaling Wisconsin scaleups big = 4.8 M rec = 1 GB small = 1.2 M rec = 256MB scan rate ~12 kr/s/node raw load: 2.5 kr/s/node see notes for more data DB2/2 Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Good Things Scaleable to 128 nodes (or more) From IBM Good performance Complete SQL (update, insert,...) Will converge with DB2/3 (OO and TPC-C stuff) Will be available off AIX someday (aix is slow and SP2 is very expensive)RedBrick: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey RedBrick Read-only (LOAD then SELECT only) Database system Load is incremental and sophisticated Precompute indices to make small-large joins run fast Indices use compression techniques. Only join via indices Many aggregate functions to make DSS reports easy Parallelism: Pipeline IO Typically a thread per processor (works on index partition) Piggyback many queries on one scan Parallel utilities (index in parallel, etc) SP2 implementation uses shared disk model.Summary: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Summary There is a LOT of activity (many products coming to market) Query optimization is near the complexity barrier Needs a new approach? All have good speedup & scaleup if they can find a plan Managing huge processor / disk / tape arrays is hard. I am working on commoditizing these ideas: low $/record/sec (scaleup PC technology) low Admin $/node (automate, automate, automate,...) Continuous availability (online & fault tolerant) You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
pdb95 Alohomora Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 252 Category: Entertainment License: All Rights Reserved Like it (0) Dislike it (0) Added: October 15, 2007 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript Parallel Database Systems 101Jim Gray & Gordon BellMicrosoft Corporationpresented at VLDB 95, Zurich Switzerland, Sept 1995: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Database Systems 101 Jim Gray & Gordon Bell Microsoft Corporation presented at VLDB 95, Zurich Switzerland, Sept 1995 Detailed notes available from Gray@Microsoft.com this presentation is 120 of the 174 slides (time limit) Notes in PowerPoint7 and Word7Outline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata. Tandem, Oracle, Informix, Sybase, DB2, ‘RedBrickKinds Of Information Processing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Kinds Of Information Processing Point-to-Point Broadcast Immediate Time Shifted conversation money lecture concert mail book newspaper Net work Data Base Its ALL going electronic Immediate is being stored for analysis (so ALL database) Analysis & Automatic Processing are being added Why Put Everything in Cyberspace?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Why Put Everything in Cyberspace? Low rent min $/byte Shrinks time now or later Shrinks space here or there Automate processing knowbots Point-to-Point OR Broadcast Immediate OR Time Delayed Network Data Base Locate Process Analyze Summarize Databases: Information At Your Fingertips™ Information Network™Knowledge Navigator™: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Databases: Information At Your Fingertips™ Information Network™ Knowledge Navigator™ All information will be in an online database (somewhere) You might record everything you read: 10MB/day, 400 GB/lifetime (two tapes) hear: 400MB/day, 16 TB/lifetime (a tape per decade) see: 1MB/s, 40GB/day, 1.6 PB/lifetime (maybe someday) Data storage, organization, and analysis is a challenge. That is what databases are about DBs do a good job on “records” Now working on text, spatial, image, and sound.Database Store ALL Data Types: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Database Store ALL Data Types The New World: Billions of objects Big objects (1MB) Objects have behavior (methods) The Old World: Millions of objects 100-byte objects Mike Won David NY Berk Austin People Name Address Mike Won David NY Berk Austin Paperless office Library of congress online All information online entertainment publishing business Information Network, Knowledge Navigator, Information at your fingertips Name Address Papers Picture Voice PeopleMagnetic Storage Cheaper than Paper: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Magnetic Storage Cheaper than Paper File Cabinet: cabinet (4 drawer) 250$ paper (24,000 sheets) 250$ space (2x3 @ 10$/ft2) 180$ total 700$ 3 ¢/sheet Disk: disk (8 GB =) 2,000$ ASCII: 4 m pages 0.05 ¢/sheet (60x cheaper) Image: 200 k pages 1 ¢/sheet (3x cheaper than paper) Store everything on diskBillions of Clients : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Billions of Clients Every device will be “intelligent” Doors, rooms, cars, ... Computing will be ubiquitous Billions of Clients Need Millions of Servers: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Billions of Clients Need Millions of Servers mobile clients fixed clients server super server Clients Servers Super Servers Large Databases High Traffic shared data All clients are networked to servers may be nomadic or on-demand Fast clients want faster servers Servers provide data, control, coordination communicationMoore’s Law: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Moore’s Law 128KB 128MB 2000 8KB 1MB 8MB 1GB 1970 1980 1990 1M 16M bits: 1K 4K 16K 64K 256K 4M 64M 256M 1 chip memory size ( 2 MB to 32 MB) XXX doubles every 18 months 60% increase per year Micro Processor speeds chip density Magnetic disk density Communications bandwidth WAN bandwidth approaching LANs Exponential Growth: The past does not matter 10x here, 10x there, soon you're talking REAL change. PC costs decline faster than any other platform Volume & learning curves PCs will be the building bricks of all future systemsMoore's Law for Memory: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Moore's Law for Memory 128M 128K 2000 8K 1M 8M 1G 8G 1970 1980 1990 512 64 8 1 4K 32K $50 $400 $3k $25k $200k $1.6m $6 1/8th chip 1Kbit 640K DOS limit 4K 16K 64K 256K 1M 4M 16M 64M 256M Memory Price @ $50/chip Number of chips 32MB 128MB 8MB 1GB 4GB Capacity with 64Mb DRAMsMicroProcessor Speeds Went Up: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey MicroProcessor Speeds Went Up Clock rates went from 10Khz to 300Mhz Processors now 4x issue SPECInt92 fits in Cache, it tracks cpu speed Peak Advertised Performance (PAP) is 1.2 BIPS Real Application Performance (RAP) is 60 MIPS Similar curves for DEC VAX & Alpha HP/PA IBM R6000/ PowerPC MIPS & SGI SUNSystem SPECint vs. Price: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey System SPECint vs. Price 486@66 PCs NCR 3555 Tricord ES 5K HP 9000 SUN 1000 SGI L SGI XL Price (K$) to 16 cpu. Pentium Compaq PCs good performance Best price-performance 30$ / SPECint! Proprietary UNIX poor price/performance HP- 9000, IBM SP/2 are above 1K$ / SPECint! Use PC’s for CyberBricks NCR 3525 NCR 3600 AP SUN 2000In The Limit: The Pico Processor: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey In The Limit: The Pico Processor 1 M SPECmarks, 1TFLOP 106 clocks to bulk ram Event-horizon on chip. VM reincarnated Multi-program cache On-Chip SMP Terror Bytes!Disc & Tape Storage: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Disc & Tape Storage $/byte got 104 better $/access got 103 better capacity grew 103 Latency down 10x Bandwidth up 10x 1e 0 1e 4 Tape (a/hr) Disk (a/min) RAM (a/s) 1e 8 1e 7 1e 6 1e 5 1e 4 1e 3 Disk RAM Tape B/$Disc Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Disc Trends Discs are getting smaller ( 1GB/unit) Discs are getting standard (SCSI) Discs are getting faster: 1MB/s -> 10MB/s 25 IO/s -> 75 IO/sDisc Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Disc Trends The 100GB disc card An array of discs Can be used as 100 discs 1 striped disc 10 Fault Tolerant discs ....etc. LOTS of accesses/second bandwidth 14"Tape Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tape Trends 1950-1980 Reel: 160 MB 1.0 MB/s 30 k$/drive 1980-1993: 3480: 300MB 3.0 MB/s 30 k$/drive 1985-1993: 8MM: 4GB 0.4 MB/s 3 k$/drive 1993-1994: 4MM: 1GB 0.2 MB/s 300 $/drive 1993- : DLT: 20GB 2.5 MB/s 3 K$/drive Mainframe silos: 250K$ and up for thousands of tapes 8MM silos: 5K$ and up for tens of tapesToday’s Storage Hierarchy : Speed & Capacity vs Cost Tradeoffs: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Today’s Storage Hierarchy : Speed & Capacity vs Cost Tradeoffs 1 Size vs Speed Access Time (seconds) 10 -9 10 -6 10 -3 10 0 10 3 Cache Main Secondary Disc Nearline Tape Offline Tape Online Tape Price vs Speed Access Time (seconds) 10 -9 10 -6 10 -3 10 0 10 3 Cache Main Secondary Disc Nearline Tape Offline Tape Online Tape Size(B) $/MBTape & Optical: Beware of the Media Myth: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tape & Optical: Beware of the Media Myth Optical is cheap: 200 $/platter 2 GB/platter => 100$/GB (5x cheaper than disc) Tape is cheap: 30 $/tape 20 GB/tape => 1.5 $/GB (700x cheaper than disc). Tape & Optical Reality: Media is 10% of System Cost: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tape & Optical Reality: Media is 10% of System Cost Tape needs a robot (10 k$ ... 3 m$ ) 10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB (5x...50x cheaper than disc) Optical needs a robot (100 k$ ) 100 platters = 200GB ( TODAY ) => 550 $/GB ( same price as disc ) Robots have poor access times Not good for Library of Congress (25TB) Data motel: data checks in but it never checks out!The Access Time Myth: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Access Time Myth Myth: seek or pick time dominates Reality: (1) Queuing dominates (2) Transfer dominates BLOBs (3) Disk seeks often short Implications: many cheap servers better than one fast expensive server shorter queues parallel transfer lower cost/access and cost/byte This is now obvious for disk arrays This will be obvious for tape arrays Seek Rotate Transfer Wait Seek Rotate TransferWhat's a Terabyte? (250 K$ of Disk @ .25$/MB): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What's a Terabyte? (250 K$ of Disk @ .25$/MB) 1 Terabyte 1,000,000,000 business letters 100,000,000 book pages 50,000,000 FAX images 10,000,000 TV pictures (mpeg) 4,000 LandSat images Library of Congress (in ASCII) is 25 TB 1980: 200 M$ of disc 10,000 discs 5 M$ of tape silo 10,000 tapes 1995: 250 K$ of magnetic disc 70 discs 500 K$ of optical disc robot 250 platters 50 K$ of tape silo 50 tapes Terror Byte !! 150 miles of bookshelf 15 miles of bookshelf 7 miles of bookshelf 10 days of video Standard Storage Metrics: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Standard Storage Metrics Capacity: RAM: MB and $/MB: today at 10Mb & 100$/MB Disk: GB and $/GB: today at 5GB and 500$/GB Tape: TB and $/TB: today at .1TB and 50k$/TB (nearline) Access time (latency) RAM: 100 ns Disk: 10 ms Tape: 30 second pick, 30 second position Transfer rate RAM: 1 GB/s Disk: 5 MB/s - - - Arrays can go to 1GB/s Tape: 5 MB/s - - - Arrays can go 100 MB/sNew Storage Metrics: KOXs, MOXs, GOXs, SCANs?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey New Storage Metrics: KOXs, MOXs, GOXs, SCANs? KOX: How many kilobyte objects served per second the file server, transaction processing metric MOX: How many megabyte objects served per second the Mosaic metric GOX: How many gigabyte objects served per hour the video & EOSDIS metric SCANS: How many scans of all the data per day the data mining and utility metricTertiary Storage Tape Farms: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tertiary Storage Tape Farms Scan in 10 hours. many independent tape robots (like a disc farm) 10K$ robot 10 tapes 200 GB 6 MB/s 50$/GB 30 MOX 15 GOX 100 robots 20TB 50$/GB 3K MOX 1.5K GOX 2.5 Scans 1M$ NearLine StorageDisc Array and Tape Farms Win: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey NearLine Storage Disc Array and Tape Farms Win Disc Farm Optical /Tape Robot Tape Farm TB/M$ MOX GOX SCANS/Day Nearline Storage Metrics: Disk and Tape Farms Win : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Nearline Storage Metrics: Disk and Tape Farms Win Data Motel: Data checks in, but it never checks out 0.01 0.1 1 10 100 1 , 000 10 , 000 100 , 000 1 , 000 , 000 1000 x D i sc Farm STC Tape Robot 6,000 tapes, 8 readers 100x DLT Tape Farm GB/K$ MOX GOX SCANS/Day K OXAccess/$ (3-year life): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Access/$ (3-year life) 0.1 1 10 100 100,000 120 2 1000 x Disc Farm STC Tape Robot 6,000 tapes, 16 readers 100x DLT Tape Farm KOX/$ MOX/$ GOX/$ SCANS/k$ 500K 540 ,000 67 ,000 68 7 7 4.3 1.5 0.2 23 100 Summary (of storage): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Summary (of storage) Capacity and cost are improving fast (100x per decade) Accesses are getting larger (MOX, GOX, SCANS) BUT Latencies and bandwidth are not improving much (3x per decade) How to deal with this??? Bandwidth: Use partitioned parallel access (disk & tape farms) Latency Pipeline data up storage hierarchy (next section)Interesting Storage Ratios: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Interesting Storage Ratios Disk is back to 100x cheaper than RAM Nearline tape is only 10x cheaper than disk and the gap is closing! 100:1 10:1 1:1 1960 1970 1980 1990 2000 RAM $/MB Disk $/MB 30:1 ? Disk $/MB Nearline Tape ??? Why bother with Tape Disk & DRAM look goodPerformance =Storage Accesses not Instructions Executed: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Performance =Storage Accesses not Instructions Executed In the “old days” we counted instructions and IO’s Now we count memory references Processors wait most of the time Where the time goes: clock ticks used by AlphaSort Components 70 MIPS “real” apps have worse Icache misses so run at 60 MIPS if well tuned, 20 MIPS if notStorage Latency: How Far Away is the Data?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Storage Latency: How Far Away is the Data?Network Speeds: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Network Speeds Network speeds grow 60% / year WAN speeds limited by politics if voice is X$/minute, how much is video? Switched 100Mb Ethernet 1,000x more bandwidth ATM is a scaleable net: 1 Gb/s to desktop & wall plug commodity: same for LAN, WAN 1Tb/s fibers in laboratory 1960 1970 1980 1990 2000 Processors (i/s) Year Comm Speedups LANs & WANs (b/s)Network Trends & Challenge: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Network Trends & Challenge Bandwidth UP 104 Price DOWN Speed-of-light unchanged Software got worse Standard Fast Nets ATM PCI Myrinet Tnet HOPE: Commodity Net Good software Then clusters become a SNAP! commodity: 10k$/sliceThe Seven Price Tiers: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Seven Price Tiers 10$: wrist watch computers 100$: pocket/ palm computers 1,000$: portable computers 10,000$: personal computers (desktop) 100,000$: departmental computers (closet) 1,000,000$: site computers (glass house) 10,000,000$: regional computers (glass castle) SuperServer: Costs more than 100,000 $ “Mainframe” Costs more than 1M$ Must be an array of processors, disks, tapes comm portsIf Hardware is Free, Where Will The Money Go?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey If Hardware is Free, Where Will The Money Go? All clients and servers will be based on PC technology economies of scale give lowest price. Traditional budget: 40% vendor, 60% staff If hardware_price = software_price = 0 then what? Money will go to CONTENT (databases) NEW APPLICATIONS AUTOMATION analogy to 1920 telephone operators Systems programmer per MIPS DBA per 10GBThe New Computer Industry: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The New Computer Industry Horizontal integration is new structure Each layer picks best from lower layer. Desktop (C/S) market 1991: 50% 1995: 75% Constant Dollars vs Constant Work: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Constant Dollars vs Constant Work Constant Work: One SuperServer can do all the world’s computations. Constant Dollars: The world spends 10% on information processing Computers are moving from 5% penetration to 50% 300 B$ to 3T$ We have the patent on the byte and algorithmSoftware Economics: Bill’s Law: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Software Economics: Bill’s Law Bill Joy’s law (Sun): Don’t write software for less than 100,000 platforms. @10M$ engineering expense, 1,000$ price Bill Gate’s law: Don’t write software for less than 1,000,000 platforms. @10M$ engineering expense, 100$ price Examples: UNIX vs NT: 3,500$ vs 500$ UNIX-Oracle vs SQL-Server: 100,000$ vs 1,000$ No Spreadsheet or Presentation pack on UNIX/VMS/... Commoditization of base Software & HardwareWhat comes next: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What comes next MANY new clients Applications to enable clients & servers super-serversOpportunities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Opportunities Bad News: Big players will dominate cpu / storage / network (Intel, Seagate, ?) OS (Microsoft, Novell) DB-TP-CS...(Oracle, Sybase, Informix, IBM, Novell, Microsoft) Good News: Applications are up for grabs! Value added is in applications & Content Service, & Support Advice: Create new sub-spaces in Cyberspace. Create new clients (e.g., cellular phones,...) Examples: SAP, Adabase, Lotus, Peoplesoft, Netscape, Doom..., Lotus Notes, NetscapeOutline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Cyberspace pep talk: PCs are bricks, Nets are mortar, DBs are land (content) 4B machines Commodity Processor-Disk-Tape-Net Plus commodity base software (POSIX or NT or..) Build 4T SuperServers from arrays of 4B machines Challenge: Software to automate operations & programming Parallel DBMSs do this Opportunities: new kinds of clients (e.g., intelligent universe) new applications (new subspaces in Cyberspace)ThesisMany Little will Win over Few Big: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Thesis Many Little will Win over Few Big 1 M$ 100 K$ 10 K$ Mainframe Mini Micro Nano 14" 9" 5.25" 3.5" 2.5" 1.8" Year 2000 4B Machine: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Year 2000 4B Machine The Year 2000 commodity PC (3K$) Billion Instructions/Sec Billion Bytes RAM Billion Bits/s Net 10 B Bytes Disk Billion Pixel display 3000 x 3000 x 24 pixel 10 B byte Disk .1 B byte RAM 1 Bips Processor 1 B bits/sec LAN/WAN4 B PC’s: The Bricks of Cyberspace: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey 4 B PC’s: The Bricks of Cyberspace Cost 3,000 $ Come with OS (NT, POSIX,..) DBMS High speed Net System management GUI / OOUI Tools Compatible with everyone else CyberBricksImplications of Hardware Trends: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Implications of Hardware Trends Large Disc Farms will be inexpensive ( 100$/GB) Large RAM databases will be inexpensive (1,000$/GB) Processors will be inexpensive So The building block will be a processor with large RAM lots of Disc 1k SPECint CPU 50 GB Disc 5 GB RAMImplication of Hardware Trends: Clusters: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Implication of Hardware Trends: Clusters Future Servers are CLUSTERS of processors, discs Distributed Database techniques make clusters work CPU 50 GB Disc 5 GB RAMFuture SuperServer4T Machine: High Speed Network ( 10 Gb/s) Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Future SuperServer 4T Machine Array of 1,000 4B machines processors, disks, tapes comm lines A few MegaBucks Challenge: Manageability Programmability Security Availability Scaleability Affordability As easy as a single system 1,000 discs = 10 Terrorbytes 100 Tape Transports = 1,000 tapes = 1 PetaByte 100 Nodes 1 TipsGreat Debate: Shared What?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Great Debate: Shared What? Shared Memory (SMP) Shared Disk Shared Nothing (network) Easy to program Difficult to build Difficult to scaleup Hard to program Easy to build Easy to scaleup Sequent, SGI, Sun VMScluster, Sysplex Tandem, Teradata, SP2 Winner will be a synthesis of these ideas Distributed shared memory (DASH, Encore) blurs distinction between Network and Bus (locality still important) But gives Shared memory message cost.Scaleables: Uneconomic So Far: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Scaleables: Uneconomic So Far A Slice is a processor, memory, and a few disks. Slice Price of Scaleables so far is 5x to 10x markup Teradata: 70K$ for a Intel 486 + 32MB + 4 disk. Tandem: 100k$ for a MipsCo R4000 + 64MB + 4 disk Intel: 75k$ for an I860 +32MB + 2 disk TMC: 75k$ for a SPARC 3 + 32MB + 2 disk. IBM/SP2: 100k$ for a R6000 + 64MB + 8 disk Compaq Slice Price is less than 10k$ What is the problem? Proprietary interconnect Proprietary packaging Proprietary software (vendorIX)Summary: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Summary Storage trends force pipeline & partition parallelism Lots of bytes & bandwidth per dollar Lots of latency Processor trends force pipeline & partition Lots of MIPS per dollar Lots of processors Putting it together Scaleable Networks and Platforms) Build clusters of commodity processors & storage Commodity interconnect is key (S of PMS) Traditional interconnects give 100k$/slice. Commodity Cluster Operating System is key Fault isolation and tolerance is key Automatic Parallel Programming is keyThe Hardware is in Place and Then A Miracle Occurs: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Hardware is in Place and Then A Miracle Occurs Enables Parallel ApplicationsThe Software Challenge: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Software Challenge • Automatic data placement (partition: random or organized) • Automatic parallel programming (process placement) • Parallel concepts, algorithms & tools • Parallel Query Optimization • Execution Techniques load balance, checkpoint/restart, pacing, multi-programmingKinds of Parallel Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Kinds of Parallel Execution Pipeline Partition outputs split N ways inputs merge M ways Any Sequential Program Any Sequential Program Sequential Sequential Sequential Sequential Any Sequential Program Any Sequential Program Why Parallel Access To Data?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Why Parallel Access To Data? At 10 MB/s 1.2 days to scan 1,000 x parallel 1.5 minute SCAN. Parallelism: divide a big problem into many smaller ones to be solved in parallel. BandwidthDataFlow ProgrammingPrefetch & Postwrite Hide Latency : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DataFlow Programming Prefetch & Postwrite Hide Latency Can't wait for the data to arrive (2,000 years!) Need a memory that gets the data in advance ( 100MB/S) Solution: Pipeline from source (tape, disc, ram...) to cpu cache Pipeline results to destination LatencyWhy are Relational OperatorsSo Successful for Parallelism?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Why are Relational Operators So Successful for Parallelism? Relational data model uniform operators on uniform data stream Closed under composition Each operator consumes 1 or 2 input streams Each stream is a uniform collection of data Sequential data in and out: Pure dataflow partitioning some operators (e.g. aggregates, non-equi-join, sort,..) requires innovation AUTOMATIC PARALLELISMDatabase Systems “Hide” Parallelism : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Database Systems “Hide” Parallelism Automate system management via tools data placement data organization (indexing) periodic tasks (dump / recover / reorganize) Automatic fault tolerance duplex & failover transactions Automatic parallelism among transactions (locking) within a transaction (parallel execution)Automatic Parallel OR DB: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Automatic Parallel OR DB Select image from landsat where date between 1970 and 1990 and overlaps(location, :Rockies) and snow_cover(image) >.7; Temporal Spatial Image date loc image Landsat 1/2/72 . . . . . .. . . 4/8/95 33N 120W . . . . . . . 34N 120W Assign one process per processor/disk: find images with right data & location analyze image, if 70% snow, return it image Answer date, location, & image testsOutline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata. Tandem, Oracle, Informix, Sybase, DB2, RedBrickParallelism: Speedup & Scaleup: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallelism: Speedup & Scaleup Speedup: Same Job, More Hardware Less time Scaleup: Bigger Job, More Hardware Same time Transaction Scaleup: more clients/servers Same response timeThe New Law of Computing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The New Law of Computing Grosch's Law: Parallel Law: Needs Linear Speedup and Linear Scaleup Not always possibleParallelism: Performance is the Goal: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallelism: Performance is the Goal Goal is to get 'good' performance. Law 1: parallel system should be faster than serial system Law 2: parallel system should give near-linear scaleup or near-linear speedup or both. The New Performance Metrics: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The New Performance Metrics Transaction Processing Performance Council: TPC-A: simple transaction TPC-B: server only, about 3x lighter than TPC-A Both obsoleted by TPC-C (no new results after 6/7/95) TPC-C (revision 3) Transactions Per Minute tpm-C Mix of 5 transactions: query, update, minibatch Terminal price eliminated about 5x heavier than tpcA (so 3.5 ktpcA 20 ktpmC) TPC-D approved in March 1995 - Transactions Per Hour Scaleable database (30 GB, 100GB, 300GB,... ) 17 complex SQL queries (no rewrites, no hints without permission) 2 load/purge queries No official results yet, many “customer” results.TPC-C Results 12/94: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey TPC-C Results 12/94 Courtesy of Charles Levine of Tandem (of course)Success Stories: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Success Stories Online Transaction Processing many little jobs SQL systems support 3700 tps-A (24 cpu, 240 disk) SQL systems support 21,000 tpm-C (112 cpu,670 disks) Batch (decision support and Utility) few big jobs, parallelism inside Scan data at 100 MB/s Linear Scaleup to 500 processors transactions / sec hardware recs/ sec hardwareThe Perils of Parallelism: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The Perils of Parallelism Startup: Creating processes Opening files Optimization Interference: Device (cpu, disc, bus) logical (lock, hotspot, server, log,...) Skew: If tasks get very small, variance > service time Benchmark Buyer's Guide: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmark Buyer's Guide Things to ask When does it stop scaling? Throughput numbers, Not ratios. Standard benchmarks allow Comparison to others Comparison to sequential Ratios and non-standard benchmarks are red flags.Performance 101: Scan Rate: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Agg Count Performance 101: Scan Rate Disk is 3MB/s to 10MB/s Record is 100B to 200B (TPC-D 110...160, Wisconsin 204) So should be able to read 10kr/s to 100kr/s Simple test: Time this on a 1M record table SELECT count(*) FROM T WHERE x < :infinity; (table on one disk, turn off parallelism) Typical problems: disk or controller is an antique no read-ahead in operating system or DB small page reads (2kb) data not clustered on disk big cpu overhead in record movement Parallelism is not the cure for these problems ScanParallel Scan Rate: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Scan Rate Agg Count Scan Agg Count Scan Agg Count Scan Agg Count Scan Agg Sum Simplest parallel test: Scaleup previous test: 4 disks, 4 controllers, 4 processors 4 times as many records partitioned 4 ways. Same query Should have same elapsed time. Some systems do.Parallel Update Rate : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Update Rate UPDATE Test: UPDATE T SET x = x + :one; Test for million row T on 1 disk Test for four million row T on 4 disks Look for bottlenecks. After each call, execute ROLLBACK WORK See if UNDO runs at the DO speed See if UNDO is parallel (scales up)Parallel Insert Rate: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Insert Rate INSERT INTO T2 SELECT * FROM T1; First 1 scan-> insert, then 4 scan->insert. See if log becomes bottleneck. After each run, call ROLLBACK WORK See if rollback: runs as fast as forward processing runs faster in parallel. If not, think about the implications of 100x parallelism Few systems pass these basic sanity tests. The records/$/second Metric: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey The records/$/second Metric parallel database systems scan data An interesting metric (100 byte record): Record Scan Rate / System Cost Typical scan rates: 1k records/s to 30k records/s Each Scaleable system has a “slice price” guess: Gateway: 15k$ (P5 + ATM + 2 disks +NT + SQLserver or Informix or Oracle) Teradata: 75k$ Sequent: 75k$ (P5+2 disks+Dynix+Informix) Tandem: 100k$ IBM SP2: 130k$ (RS6000+2 disks, AIX, DB2) You can compute slice price for systems later in presentation BAD: 0.1 records/s/$ (there is one of these) GOOD: 0.33 records/s/$ (there is one of these) Super! 1.00 records/s/$ (there is one of these) We should aim at 10 records/s/$ with P6.Embarrassing Questions to Ask Your PDB Vendor: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Embarrassing Questions to Ask Your PDB Vendor How are constraints checked? ask about unique secondary indices ask about deferred constraints ask about referential integrity How does parallelism interact with triggers Stored procedures OO extensions How can I change my 10 TB database design in an hour? add index add constraint reorganize / repartition These are hard problems.Outline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata. Tandem, Oracle, Informix, Sybase, DB2, RedBrickAutomatic Data Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Automatic Data Partitioning Split a SQL table to subset of nodes & disks Partition within set: Range Hash Round Robin Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning Good for equijoins, range queries group-by Good for equijoins Good to spread loadIndex Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Index Partitioning Hash indices partition by hash B-tree indices partition as a forest of trees. One tree per range Primary index clusters data Secondary Index Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Secondary Index Partitioning In shared nothing, secondary indices are Problematic Partition by base table key ranges Insert: completely local (but what about unique?) Lookup: examines ALL trees (see figure) Unique index involves lookup on insert. Partition by secondary key ranges Insert: two nodes (base and index) Lookup: two nodes (index -> base) Uniqueness is easy Teradata solution Base Table A..Z Base Table A..Z A..Z A..Z A..ZKinds of Parallel Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Kinds of Parallel Execution Pipeline Partition outputs split N ways inputs merge M ways Any Sequential Program Any Sequential Program Any Sequential Any Sequential Program ProgramData Rivers Split + Merge Streams: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Rivers Split + Merge Streams River M Consumers N producers Producers add records to the river, Consumers consume records from the river Purely sequential programming. River does flow control and buffering does partition and merge of data records River = Split/Merge in Gamma = Exchange operator in Volcano. N X M Data StreamsPartitioned Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Partitioned Execution Spreads computation and IO among processors Partitioned data gives NATURAL parallelismN x M way Parallelism: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey N x M way Parallelism N inputs, M outputs, no bottlenecks. Partitioned Data Partitioned and Pipelined Data FlowsPicking Data Ranges: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Picking Data Ranges Disk Partitioning For range partitioning, sample load on disks. Cool hot disks by making range smaller For hash partitioning, Cool hot disks by mapping some buckets to others River Partitioning Use hashing and assume uniform If range partitioning, sample data and use histogram to level the bulk Teradata, Tandem, Oracle use these tricks Blocking Operators = Short Pipelines: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Blocking Operators = Short Pipelines An operator is blocking, if it does not produce any output, until it has consumed all its input Examples: Sort, Aggregates, Hash-Join (reads all of one operand) Blocking operators kill pipeline parallelism Make partition parallelism all the more important. Sort Runs Scan Sort Runs Sort Runs Sort Runs Tape File SQL Table Process Merge Runs Merge Runs Merge Runs Merge Runs Table Insert Index Insert Index Insert Index Insert SQL Table Index 1 Index 2 Index 3 Database Load Template has three blocked phasesSimple Aggregates (sort or hash?): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Simple Aggregates (sort or hash?) Simple aggregates (count, min, max, ...) can use indices More compact Sometimes have aggregate info. GROUP BY aggregates scan in category order if possible (use indices) Else If categories fit in RAM use RAM category hash table Else make temp of <category, item> sort by category, do math in merge step. Parallel Aggregates: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Aggregates For aggregate function, need a decomposition strategy: count(S) = S count(s(i)), ditto for sum() avg(S) = (S sum(s(i))) / S count(s(i)) and so on... For groups, sub-aggregate groups close to the source drop sub-aggregates into a hash river. Sort: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sort Used for loading and reorganization (sort makes them sequential) build B-trees reports non-equijoins Rarely used for aggregates or equi-joins (if hash available Sort Runs Input Data Sorted Data MergeParallel Sort: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Sort M input N output Sort design Disk and merge not needed if sort fits in memory Scales linearly because Sort is benchmark from hell for shared nothing machines net traffic = disk bandwidth, no data filtering at the sourceSIGMOD Sort Award: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey SIGMOD Sort Award Datamation Sort: 1M records (100 B recs) 1000 seconds 1986 60 seconds 1990 7 seconds 1994 3.5 seconds 1995 (SGI challenge) micros finally beat the mainframe! finally! a UNIX system that does IO SIGMOD MinuteSort 1.1GB, Nyberg, 1994 Alpha 3cpu 1.6GB, Nyberg, 1995 SGI Challenge (12 cpu) no SIGMOD PennySort recordNested Loops Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Nested Loops Join Outer Table Inner Table If inner table indexed on join cols (b-tree or hash) then sequential scan outer (from start key) For each outer record probe inner table for matching recs Works best if inner is in RAM (=> small innerMerge Join (and sort-merge join): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Merge Join (and sort-merge join) Left Table Right Table NxM case Cartesian product Partitions well: partition smaller to larger partition. Works for all joins (outer, non-equijoins, Cartesian, exclusion,...) If tables sorted on join cols (b-tree or hash) then sequential scan each (from start key) left < right left=right left > right advance left match advance right Nice sequential scan of data (disk speed) (MxN case may cause backwards rescan) Sort-merge join sorts before doing the mergeHash Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Hash Join Hash smaller table into N buckets (hope N=1) If N=1 read larger table, hash to smaller Else, hash outer to disk then bucket-by-bucket hash join. Purely sequential data behavior Always beats sort-merge and nested unless data is clustered. Good for equi, outer, exclusion join Lots of papers, products just appearing (what went wrong?) Hash reduces skew Right Table Left Table Hash BucketsHash Join Variants: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Hash Join Variants Hybrid: Keep one bucket in memory. Bitmap: If disk-based, keep bitmap of first table join vals Filter second table with first. (useless if a 1-N join (which is typical)) Bucket Tuning: use many small buckets Aggregate them based on memory pressureParallel Hash Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Hash Join ICL implemented hash join with bitmaps in CAFS machine (1976)! Kitsuregawa pointed out the parallelism benefits of hash join in early 1980’s (it partitions beautifully) We ignored them! (why?) But now, Everybody's doing it. (or promises to do it). Hashing minimizes skew, requires little thinking for redistribution Hashing uses massive main memory Index-only Scan and Join: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Index-only Scan and Join Index has secondary key and primary key or RID. Index is like a table (but compact and clustered) Think of it as a replica Can scan it rather than base table (called a semi-join when used for joins) Can scan it, select primary keys, Sort primary keys and Scan base table Index Sec Key(s) Pri Keys (called a hybrid-join when used for joins) Exotic Joins (Exclusion, Cartesian, Outer, ...): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Exotic Joins (Exclusion, Cartesian, Outer, ...) Exclusion used for NOT IN and DIFFERENCE queries Outer is “lossless” join, {left, right} appears with null sibling if matching sibling not found. Cartesian is often a mistake (missing where clause) but also important for small-table large-table optimization. Small table used as a pick list. Each small table represents a mapping: name -> code set membership (e.g. holidays) Best plan: Restrict small tables, Form Cartesian product of all small Send it to each partition of large table for hash join. Observations: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Observations It is easy to build a fast parallel execution environment (no one has done it, but it is just programming) It is hard to write a robust and world-class query optimizer. There are many tricks One quickly hits the complexity barrier Common approach: Pick best sequential plan Pick degree of parallelism based on bottleneck analysis Bind operators to processWhat’s Wrong With That?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What’s Wrong With That? Why isn’t the best serial plan, the best parallel plan? Counter example: Table partitioned with local secondary index at two nodes Range query selects all of node 1 and 1% of node 2. Node 1 should do a scan of its partition. Node 2 should use secondary index. SELECT * FROM telephone_book WHERE name < “NoGood”; Sybase Navigator & DB2 PE should get this right. We need theorems here (practitioners do not have them) N..Z Table Scan A..M Index ScanGreat Debate: Shared What?: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Great Debate: Shared What? Shared Memory (SMP) Shared Disk Shared Nothing (network) Easy to program Difficult to build Difficult to scaleup Hard to program Easy to build Easy to scaleup Sequent, SGI, Sun VMScluster, Sysplex Tandem, Teradata, SP2 Winner will be a synthesis of these ideas Distributed shared memory (DASH, Encore) blurs distinction between Network and Bus (locality still important) But gives Shared memory message cost.What Systems Work This Way: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey What Systems Work This Way Shared Nothing Teradata: 400 nodes Tandem: 110 nodes IBM / SP2 / DB2: 128 nodes Informix/SP2 48 nodes ATT & Sybase 8x14 nodes Shared Disk Oracle 170 nodes Rdb 24 nodes Shared Memory Informix 9 nodes RedBrick ? nodes Outline: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Outline Why Parallelism: technology push application pull Benchmark Buyer’s Guide metrics simple tests Parallel Database Techniques partitioned data partitioned and pipelined execution parallel relational operators Parallel Database Systems Teradata - Oracle -DB2 Tandem - Informix -RedBrick - Sybase System Survey Ground Rules: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey System Survey Ground Rules Premise: The world does not need yet another PDB survey It would be nice to have a survey of “real” systems Visited each parallel DB vendor I could (time limited) Asked not to be given confidential info. Asked for public manuals and benchmarks Asked that my notes be reviewed I say only nice things (I am a PDB booster) Acknowledgments: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Acknowledgments Teradata Todd Walter and Carrie Ballinger Tandem Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong Oracle Gary Hallmark, Bill Widdington Informix Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay Navigator Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman , Bob McDonald, Jan Graveson Ron Chung Hu, Stuart Thompto DB2 Chaitan Baru, Gilles Fecteau, James Hamilton, Hamid Pirahesh Redbrick Phil Fernandez, Donovan Schneider Teradata : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Teradata • Ship 1984, now an ATT GIS brand name • Parallel DB server for decision support SQL in, tables out • Support Heterogeneous data (convert to client format) Data hash partitioned among AMPs with fallback (mirror) hash. Applications run on clients Biggest installation: 476 nodes, 2.4 TB Ported to UNIX baseParsing Engines: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parsing Engines Interface to IBM or Ethernet or... Accept SQL, return records and status. Support SQL 89, moving to SQL92 Parse, Plan & authorize SQL cost based optimizer Issue requests to AMPs Merge AMP results to requester. Some global load control based on client priority (adaptive and GREAT!) Access Modules Almost all work done in AMPs A shared nothing SQL engine scans, inserts, joins, log, lock,.... Manages up to 4 disks (as one logical volume) Easy design, manage, grow (just add disk)Data Layout: Hash Partitioning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Layout: Hash Partitioning All data declustered to all nodes Each table has a hash key (may be compound) Key maps to one of 4,000 buckets Buckets map to one of the AMPs Non-Unique secondary index partitioned by table criterion Fallback bucket maps to second AMP in cluster. Typical cluster is 6 nodes (2 is mirroring). Cluster limits failure scope: 2 failures only cause data outage if both in same cluster. Within a node, each hash to cylinder then hash to “page” Page is a heap with a sorted directory Teradata Optimization & Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Teradata Optimization & Execution Sophisticated query optimizer (many tricks) Great emphasis on Joins & Aggregates. Nested, merge, product, bitmap join (no hash join) Automatic load balancing from hashing & load control Excellent utilities for data loading, reorganize Move > 1TB database from old to new in 6 days, in background while old system running Old hardware, 3.8B row table (1TB), >300 AMPs typical scan, sort, join averages 30 minutes Query Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution Protocol PE requests work AMP responds OK (or pushback) AMP works (if all OK) AMP declares finished When all finished, PE does 2PC and starts pull Simple scan: PE broadcasts scan to each AMP Each AMP scans produces answer spool file PE pulls spool file from AMPs via Ynet If scan were ordered, sort “catcher” would be forked at each AMP pipelined to scans Ynet and PE would do merge of merges from AMPs Aggregates, Updates: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Aggregates, Updates Aggregate of Scan: Scan’s produce local sub-aggregates Hash sub-aggregates to Ynet Each AMP “catches” its sub-aggregate hash buckets Consolidate sub-aggregates. PE pulls aggregates from AMPs via Ynet. Note: fully scaleable design Insert / Update / Delete at a AMP node generates insert / update /delete messages to unique-secondary indices fallback bucket of base table. messages saved in spool if node is downQuery Execution: Joins: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution: Joins Great emphasis on Joins. Includes small-table large-table optimization cheapest triple, then cheapest in triple. If equi-partitioned, do locally If not equi-partitioned, May replicate small table to large partition (Ynet shines) May repartition one if other is already partitioned on join May repartition both (in parallel) Join algorithm within node is Product Nested Sort-merge Hash bit map of secondary indices, intersected. Utilities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Utilities Bulk Data Load, Fast Data Load, Multi-load, Blast 32KB of data to an AMP Multiple sessions by multiple clients can drive 200x parallel Double buffer AMP unpacks, and puts “upsert”onto Ynet One record can generate multiple upserts (transaction-> inventory, store-sales, ...) Catcher on Ynet, grabs relevant “upserts” to temp file. Sorts and then batches inserts (survives restarts). Online and restartable. Customers cite this as Teradata strength. Fast Export (similar to bulk data load)Utilities II: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Utilities II Backup / Restore: Rarely needed because of fallback. Cluster is unit of recovery Backup is online, Restore is offline Reorganize: Rarely needed, add disk is just restart Add node: rehash all buckets that go to that node: (Ynet has old and new bucket map) Fully parallel and fault tolerant, takes minutes Port To UNIX: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Port To UNIX New design (3700 series) described in VLDB 93 Ported to UNIX platforms (3600 AP, PE, AMP) Moved Teradata to Software Ynet on SMPs Based on Bullet-Proof UNIX with TOS layer atop. message system communications stacks raw disk & virtual processors virtual partitions (buckets go to virtual partitions) removes many TOS limits Result is 10x to 60x faster than an AMP Compiled expression evaluation (gives 50x speedup on scans) Large main memory helps UNIX 5.4 (SMP, RAS, virtual Ynet) UNIX PDE: TOS adapter Teradata SQL (AMP logic) Parsing engine (parallelism) Applications SQL HARDWARECustomer Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Customer Benchmarks Standard Benchmarks Only old Boral/DeWitt Wisconsin numbers. Nothing public. Moving > 1TB database from one old to new in 6 days, in background while old system running So: unload-load rate > 2MB/s sustained Background task (speed limited by host speed/space) Old hardware, 3.8B row table, >300 AMPs typical scan, sort, join averages 30 minutes rates (rec size not cited): krec/s/AMP k rec/s scan: 9 2.7 mr/s !!!!!! clustered join: 2 600 kr/s insert-select: .39 120 kr/s Hash index build: 3.3 100 kr/sUNIX/SMP Port of Teradata: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey UNIX/SMP Port of Teradata op rows seconds k r/s MB/s scan 50000000 737 67.8 11.0 copy 5000000 1136 4.4 0.7 aggregate 50000000 788 63.5 10.3 Join 50x2M (clustered) 52000000 768 67.7 11.0 Join 5x5 (unclustered) 10000000 237 42.2 6.8 Join 50Mx.1K 50000100 1916 26.1 4.2 Times to process a Teradata Test DB on a 8 Pentium, 3650. These numbers are 10 to 150x better than a single AMP Compiled expression handling more memory Teradata Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Teradata Good Things Scaleable to large (multi-terabyte) databases Available TODAY! It is VERY real: in production in many large sites Robust and complete set of utilities Automatic management. Integrates with the IBM mainframe OLTP world Heterogeneous data support is good data warehouse Tandem: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Message-based OS (Guardian): (1) location transparency (2) fault isolation (failover to other nodes). Expand software 255 Systems WAN Classic shared-nothing system (like Teradata except applications run inside DB machine. 4 node System 8 x1M B/S 30MB/S 1-16 MIPS R4400 cpus dual port controllers, dual 30MB/s LAN 224 PROCESSORS 1974-1985: Encompass: Fault-tolerant Distributed OLTP 1986: NonStopSQL: First distributed and high-performance SQL (200 tps) 1989: Parallel NonStopSQL: Parallel query optimizer/executor 1994: Parallel and Online SQL (utilities, DDL, recovery, ....) 1995: Moving to ServerNet: shared disk modelTandem Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Data Layout Each table or index range partitioned to a set of disks (anywhere in network) Index is B-tree per partition clustering index is B+ tree Table fragments are files (extent based). Descriptors for all local files live in local catalog (node autonomy) Tables can be distributed in network (lan or wan) Duplexed disks and disk processes for failover Partition Block Extents may be added File= {parts}Tandem Software (Process Structure): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Software (Process Structure) Disk Server Pair Data partition C/COBOL/.. Application SQL SQL engine Joins, Sorts global aggs triggers index maintenance views security Query Compiler Utilities Transactions Helper Processes GUI Selects Update, Delete Record/Set insert Aggregates Assertions Locking Logging buffer pool Disk Pair or Array Hardware & OS move data at 4MB/s with >1 ins/byteOLTP Features: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey OLTP Features Insert / Update / Delete index in parallel with base table If 5 indices, 5x faster response time. Record and key-value range locking, SQL92 isolation levels Undo scanner per log: double-buffers undo to each server 21 k tpc-C (WOW!!) with 110 node server (800GB db) Can mix OLTP and batch. Priority serving to avoid priority inversion problem Buffer management prevents sequential buffer pollutionTandem Query Plan & Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Query Plan & Execution Simple selects & aggregates done in disk servers Parallelism chosen: scan: table fragmentation hash: # processors or Outer table fragments Sorts: redistribution, sort in executors (N-M) Joins done in executors (nest, sort-merge, hash). Redistribution is always a hash (minimize skew) Pipeline as deep as possible (use lots of processes) Multiple logs & parallel UNDO avoid bottlenecks Can mix OLTP and batch. Priority serving to avoid priority inversion problem Buffer management prevents sequential buffer pollutionParallel Operators: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Operators Initially just inserted rivers between sequential operators Parallel query optimizer Created executors at all clustering nodes or at all nodes, repartitioned via hash to them Gave parallel select, insert, update, delete join, sort, aggregates,... correlated subqueries are blocking Got linear speedup/scaleup on Wisconsin. Marketing never noticed, product slept from 1989-1993 Developers added: Hash Join aggregates in disk process SQL92 features parallel utilities online everything converted to MIPSco fixed bugs Join Strategies: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Join Strategies Nested loop Sort merge Both can work off index-only access Replicate small to all partitions (when one small) Small-table Cartesian product large-table optimization Now hybrid-hash join uses many small buckets tuned to memory demand tuned to sequential disk performance no bitmaps because (1) parallel hash (2) equijoins usually do not benefit When both large, and unclustered (rare case) N+M scanners, 16 catchers: sortmerge or hybrid hashAdministration (Parallel & Online everything): Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Administration (Parallel & Online everything) All utilities are online (claim to reduce outages by 40%): Add table, column,... Add index: builds index from stale copy uses log for catchup in final minute, gets lock, completes index. Reorg B-tree while it is accessed Add / split/ merge/ reorg partition Backup Recover page, partition, file. Add, alter logs, disks, processors, ... You need this: Terabyte operations take a long time! Parallel Utilities: load (M to N) index build (M scanners, N inserters, in background) recovery:Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks No official DSS benchmark reports Unofficial results 1 to 16 R4400 class processors, 64MB each (Himalayas) 3 disks, 3 ctlrs each Sequential 16x Parallel rec/s MB/s rec/s MB/s speedup Load Wisc 1.6 kr/s 321 Kb/s 28 kr/s 5.4 MB/s 16 Parallel Index build 1.5 kr/s 15Kb/s 24 kr/s 240 KB/s 16 SCAN 28 kr/s 5.8 MB/s 470 kr/s 94 MB/s 16 !!!!!!! Aggregate (1 col) 25 kr/s 4.9 MB/s 400 kr/s 58 MB/s 16 Aggregate (6 col) 18 kr/s 3.6 MB/s 300 kr/s 60 MB/s 16 2-Way hash Join 13 kr/s 2.6 MB/s 214 kr/s 42 MB/s 16 3-Way hash Join ? kr/s ? Mb/s ? kr/s ? MB/s ? 1x and 16x rates are best I’ve seen anywhere.Tandem Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Tandem Good Things 21 K TPM-C (WOW!) It is available TODAY! Online everything Fault tolerant, distributed, high availability Mix OLTP and batch Great Hash Join Algorithm Probably the best peak performance availableOracle: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Parallel Server (V7): Multiple threads in a server Multiple servers in a cluster Client/server, OLTP & clusters (TP lite) Parallel Query (V7.1) Parallel SELECT (and sub-selects) Parallel Recovery: (V7.1) @ restart, one log scanner, multiple redoers Beta in 1993, Ship 6/94. More Parallel (create table): V7.2, 6/95 Shared disk implementation ported to most platforms Parallel SELECT (no parallel INSERT, UPDATE, DELETE, DDL) except for sub-selects inside these verbs.Oracle Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Data Layout Homogenous: one table (index) per segment extents picked from a TableSpace Files may be raw disk Segments are B-trees or heaps. data -> disk map is automatic No range / hash / round-robin partitioning ROWID can be used as scan partitioning on base tables. Guiding principal: If its not organized, it can’t get disorganized, and doesn’t need to be reorganized.Oracle Parallel Query Product Concept: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Parallel Query Product Concept Convert serial SELECT plan to parallel plan If Table scan or HINT then consider parallel plan Table has default degree of parallelism (explicitly set) Overridden by system limits and hints. Use max degree of all participating tables. Intermediate results are hash partitioned Nested Loop Join and Merge Join User hints can (must?) specify join order, join strategy, index, degree of parallelism,... Query Planning: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Planning Query Coordinator starts with Oracle Cost-Based plan If plan requests Table scan or HINT then consider parallel plan Table has default degree of parallelism (explicitly set) Overridden by system limits and hints. Use max degree of all participating tables. Shared disk makes temp space allocation easy Planner picks degree of parallelism and river partitioning. Proud of their OR optimization. Query Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution Coordinator does extra work to merge the outputs of several sorts subsorts pushed to servers aggregate the outputs of several aggregates aggregates pushed to servers Parallel function invocation is potentially a big win. SELECT COUNT ( f(a,b,c,...)) FROM T; Invokes function f on each element of T, 100x parallel. Join Strategies: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Join Strategies Oracle has (1) Nested Loop Join (2) Merge Join Replicate inner to outer partition automatic in shared disk (looks like partition outer). Has small-table large-table optimization (Cartesian product join) User hints can specify join order, join strategy, index degree of parallelism,... Transactions & Recovery: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Transactions & Recovery Transactions and transaction save points (linear nest). ReadOnly snapshots for decision support. SQL92 isolation levels (ACID = Snapshot isolation) Database has multiple rollback segments UNDO log, Transaction has one commit/REDO log so may be a bottleneck Parallel recovery at restart: One log scanner, DEGREE REDO streams, typically one per disk INSTANCE REDO streams, typically two-deep per disk Oracle Utilities : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Utilities User can write parallel load / unload utility Index build, Constraints, are separate steps Not incremental or online or restartable. Update Statistics (Analyze) is not parallel Index build is a N-1 parallel: N scanner/sorter, 1 inserter. Parallel recovery at restart: One log scanner, DEGREE REDO streams, typically one per disk INSTANCE REDO streams, typically two-deep per disk Administration Not much special: Limit degree of parallelism at a server Set default parallelism of a table Query can only lower these limits No special tools, meters, monitors,... Just ordinary Parallel Server Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks Sequent 20x 50MHz 486, .5GB RAM, 20 disk Sequential 20x Parallel rec/s krecr/s KB/s rec/s MB/s speedup Load 5M Wisc .5 kr/s 113 KB/s 8.8 kr/s 1.8 MB/s 16 Parallel Index load 2.2 kr/s 18 Kb/s 29 kr/s 235 KB/s 13 SCAN 1.7 kr/s 364 KB/s 26 kr/s 5.3 MB/s 15 Agg MJ 3.3 kr/s 660 KB/s 45 kr/s 9.3 MB/s 14 Agg NJ 1.4 kr/s 290 KB/s 26 kr/s 5.4 MB/s 19 Same benchmark on 16x SP1 (a shared nothing machine), got similar results. 168x N-cube ( 16MB/node), 4 lock nodes, 64 disk nodes got good scaleup Oracle has published details on all these benchmarks. 20 Pentium, 40 disk system, SCAN at 44 MB/s 55% cpu Sept 1994 news:Oracle Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Oracle Good Things Available now! Parallel Everywhere (on everybody’s box) A HIGH FUNCTION SQL No restrictions (triggers, indices,...) Very easy to use (almost no knobs or options) Parallel invocation of stored procedures Near-linear scaleup and speedup of SELECTs. Respectable performance on SequentInformix: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix DSA (Dynamic Scaleable Architecture) describes redesign to thread-based, server-based system. V6 - 1993 - : DSA -- rearchitecture (threads, OLTP focus) V7 - 1994 - : PDQ -- Parallel Data Query (SMP) V8 - 1995 - : XMP -- Cluster parallelism (shared disk/nothing). Parallelism is a MAJOR focus now that SQL92 under control Other major focus is TOOLS (ODBC, DRDA, NewEra 4GL). Informix is a UNIX SQL system: AIX (IBM), HP/UX (HP), OSF/1 (DEC, HP), SCO/UNIX, Sequent/DYNIX, SUN (SunOS, Solaris) Today shared nothing parallelism on IBM SP2, ATT3650, ICL, (beta) Informix Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Data Layout DBspace Block Chunks may be added File Table or index maps to homogeneous set of DB spaces contains “chunks” (extents) Partition by: range, round robin expression hash (V8) Access via B+Tree, B* tree, and hash (V8) Built an extent-based file system on raw disks or files High speed sequential, clustering, async IO,...Informix Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Execution Completely parallel DML, some parallel DDL Parallel SELECT, UPDATE, DELETE Executor per partition in all cases. Parallel sort, joins (nest, merge, hash) aggregates, union Whenever an operator has input and a free output buffer, it can work to fill the output buffer. Natural flow control Blocking operators (sort, hash join, aggregates, correlated subqueries) Spool to a buffer (if small), else spool to disk. Shared buffer pool minimizes data copies.Parallel Plans: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Plans Query plan is parallelized by scanner per table partition (does select, project) sub-aggregates per partition (hash or sort) If clustered join (nested loop or merge) then operator per outer or per partition If hash-join, parallel scan smaller first, build bitmap and hash buckets then scan larger and: join to smaller if it fits in memory else filter via bitmap and build larger buckets then join bucket by bucket Hybrid hash join with bitmaps and bucket tuning. Parallel Operators: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Operators Parallel SELECT, UPDATE, DELETE Executor per partition in all cases. Parallel sort, joins, aggregates, union Only correlated subqueries are blocking Completely parallel DML, some parallel DDL Transactions & Recovery: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Transactions & Recovery SQL 2 isolation levels allow DSS to run in background Transaction save points Separate logical and physical logs. Bulk updates could bottleneck on single log. Recovery unit is data partition (DBspace) Parallel recovery: thread per DBspace If DB fragment unavailable, DSS readers can skip itInformix Administration: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Administration Can assign % of processors, memory, IO to DSS (parallel query) Sum of all parallel queries live within this quota Each query can specify the % of the total that it wishes. (0 means sequential execution) Parallel Data load (SMP only) Parallel Index Build (N - M) Parallel recovery Online backup / restore UtilitiesBenchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks Sequent system: 9 Pentium processors 1 GB main memory Base tables on 16 disk (FWD SCSI) Indices on 10 discs Temp space on 10 disks Sequential Parallel rec/s MB/s rec/s MB/s speedup Load 300M Wisc 3kr/s 600Kb/s Parallel Index load 48kr/s 1MB/s SCAN 17kr/s 3.5MB/s 147kr/s 30MB/s 8.3 Aggregate 11kr/s 2.3MB/s 113kr/s 23MB/s 10.1 2-Way hash Join 18kr/s 3.2MB/s 242kr/s 31MB/s 9.7 3-Way hash Join 25kr/s 3.5Mb/s 239kr/s 33MB/s 9.5 Informix Shared Nothing Benchmark : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Shared Nothing Benchmark IBM SP2 - : TPC-D-like database 48 SP2 Processors Customer Benchmark, Not audited benchmark. Load 60 GB in 40 minutes, 250 GB in 140 min about 100 GB/hr ! 2GB/node/hr Scan & Aggregate (#6) 60 GB in 7 min = 140 MB/s = 3 MB/s/node = 30 kr/s 260 GB in 24 min = 180 MB/s = 4 MB/s/node = 40 kr/s Power Test (17 complex queries and 2 load/purge ops) 60 GB in 5 hrs 260 GB in 18 hrs Multiuser Test: 1 user, 12 queries: 10 hrs, 4 users, 3 queries: 10 hrs Informix Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Informix Good Things A full function SQL Available today on Sequent Beautiful manuals Linear speedup and scaleup Best published performance on UNIX systems Probably best price performance. (but things are changing fast!) Some mechanisms to mix OLTP and batch.Sybase Navigator Product concept: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sybase Navigator Product concept Two layer software architecture: (1) Navigator drives array of shared-nothing SQL engines. (2) Array of SQL engines, each unaware of others. similar to Tandem disk processes SQL engine is COTS. Goal: linear scaleup and speedup, plus good OLTP support Emphasize WHOLE LIFECYCLE Configurator: tools to design a parallel system Administrator: tools to manage a parallel system (install/upgrade, start/stop, backup/restore, monitor/tune) Optimizer: execute requests in parallel.Configurator: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Configurator Fully graphical design tool Given ER model and dataflow model of the application workload characteristics response time requirements, hardware components (heavy into circles and arrows) Recommends hardware configuration/ Table definitions (SQL) table partitioningAdministrator: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Administrator Made HUGE investments in this area. Truly industry leading graphical tools make MPP configuration “doable”. GUI interface to manage: startup / shutdown of cluster backup / restore / manage logs configure (install, add nodes, configure and tune servers) Manage / consolidate system event logs System stored procedures (global operations) (e.g. aggregate statistics from local to global cat) Monitor SQL Server events Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Layout Pure shared nothing Navigator partitions data among SQL servers • map to a subset of the servers • range partition or hash partition. Secondary indices are partitioned with base table No Unique secondary indices Only shorthand views, no protection views Schema server stores global data definition for all nodes. Each partition server has schema for its partition data for its partition. log for its partition Sybase SQL Server Backgrounder: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sybase SQL Server Backgrounder Recently became SQL89 compliant (cursors, nulls, etc) Stored procedures, multi-threaded, internationalized, B*-tree centric (clustering index is B+tree) Use nested loops, sort-merge join (sort is index build). Page locking, 2K disk IO, ... other little-endian design decisions. Respectable TPC-C results (AIX RS/6000). UNIX raw disks or files are base (also on OS/2, NetWare,...). table->disk mapping CREATE DATABASE name ON {device...} LOG ON {device...} SP_ADDSEGMENT segment, device CREATE TABLE name(cols) [ ON segment] Microsoft has a copy of the code, deep ported to NT Navigator Extension Mechanisms: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Navigator Extension Mechanisms Navigator extended Sybase TDS by Adding stored procedures to do things Extending the syntax (e.g. see data placement syntax below) Sybase TDS and OpenServer design are great for this All “front ends based on OpenServer and threads”Process Structure - Pure Shared Nothing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Process Structure - Pure Shared Nothing DBA Server does everything: SQL compilation System management Catalog management SQL server restart (in 2nd node) DBA fallback detects deadlock does DBA takeover on fail Control server at each node manages SQL servers there (security, request caching, 2PC, final merge /aggregate,... parallel stored procedures (SMID) ) Split server manages re-partitioning of data SQL Server is unit of query parallelism, (one per cpu per node) Simple Request Processing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Simple Request Processing Control (1/node) Client SQL Split DBA server schema server Client connects to Navigator (a Control Server) using standard Sybase TDS protocol. SQL request flows to DBA server that compiles it sends stored procedures (plans) to all control servers plans to all relevant SQL servers Control server executes plan. Pass to SQL server, returns results. Plan cached on second call, DBA server not invoked. Good for OLTP Parallel Request Processing: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Parallel Request Processing Control Split Control Client SQL Split DBA server schema server Control Split If query involves multiple nodes, then command sent to each one (diagram shows secondary index lookup) Query sent to SQL servers that may have relevant data. If data needs to be redistributed or aggregated, split servers issue queries and inserts (that is their only role) split servers have no persistent storage. Data Manipulation: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Data Manipulation SQL server is unit of parallelism "Parallelized EVERYTHING in the T-SQL language" Includes SIMD execution of T-SQL procedures, plus N-M data move operations. Two-level optimization: DBA Server has optimizer (BIG investment, all new code, NOT the infamous Sybase optimizer) Each SQL server has Sybase optimizer If extreme skew, different servers have different plans DBA optimizer shares code with SQL server (so they do not play chess with one another). Very proud of their optimizer.Query Execution : Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Query Execution Classic Sellinger cost-based optimizer. SELECT, UPDATE, DELETE N-to-M parallel Bulk and async INSERT interface. N-M Parallel sort Aggregate (hash/sort) select and join can do index-only access if data is there. eliminate correlated subqueries (convert to join). (Gansky&Wong. SIGMOD87 extended) Join: nested-loop, sort-merge, index only Sybase often dynamically builds index to support nested loop (fake sort-merge) Typically left-deep sequence of binary joins.Join and Partition Strategy: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Join and Partition Strategy Partition strategies If already partitioned on join, then no splitting Else Move subset of T1 to T2 partitions. or Replicate T1 to all T2 partitions or repartition both T1 and T2 to width of home nodes or target. No hash join, but all (re) partitioning is range or hash based. Not aggressive parallelism/pipelining: 2 op at a time. Pipeline to disk via split server (not local to disk and then split). Split servers fake subtables for SQL engines. Top level aggregates merged by control, others done by split. Utilities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Utilities Bulk data load (N-M) async calls GUI manages Backup all SQL serves in parallel Reorg via CREATE TABLE <new> , INSERT INTO <new> SELECT * FROM <old> Utilities are mostly offline (as per Sybase) Nice EXPLAIN utility Futures: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Futures Hash join within split servers Shared memory optimizations Full support for unique secondary indices Full trigger support (cross-server triggers) Full security and view support.Benchmarks: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Benchmarks Preliminary: 8x8 3600 - Ynet. node: 8 x (50MHz 486 256k local cache) 512MB main memory, 2 x 10 disk arrays, @ 2GB 4 MB/s per disk. 6 x Sybase servers Scaleup & speedup tests of 1, 4, and 8 nodes. Numbers (except loading) reported as ratios of elapsed times S&S tests show a >7x speedup of 8-way over 1-way Tests cover insert, select, update, delete, join, aggregate, load Reference Account: Chase Manahattan Bank 14x8 P5 ATT 3600 cluster: (112 processors) 56 SQL servers, 10GB each = 560 GB 100x faster than DB2/MVS (minutes vs days) Linearity is great. Navigator Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Navigator Good Things Concern for lifecycle design, install, manage, operate, use Good optimization techniques Fully parallel, including stored procedures! Scaleup and Speedup are near linear.Sybase IQ: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Sybase IQ Sybase bought Expressway Expressway evolved from Model 204 bitmap technology: index duplicates with bitmap compress bitmap. Can give 10x or 100x speedup. Can save space and IO bandwidth Currently, two products (Sybase and IQ) not integratedDB2: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2 DB2/VM: = SQL/DS: System R gone public DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...) Parallel and async IO into one process (on mainframe) Parallel execution in next release (late next year?) MVS PQS now withdrawn? DB2/AS400: Home grown DB2-2-PE: OS2/DM grown large. First moved to AIX Being extended parallelism Parallelism based on SP/2 -- shared nothing done right. Benchmarks today - Beta everywhere DB2++: separate code path has OO extensions, good TPC-C Ported to HP/UX, Solaris, NT in beta DB2/2 Data Layout: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Data Layout DATABASE: a collection of nodes (up to 128 SP2s so far) NODEGROUP: a collection of logical nodes (a 4k hash map LOGICAL NODE: A DB2 instance (segments, log, locks...) PHYSICAL NODE: A box. Logical Node: Segments of 4 k pages Segments allocated in units (64K default) Tables stripe across all segments Table created in NodeGroup: Hash (partition key) across all members of group Cluster has single system Image Segments Nodes: Group 1 Group 2DB2/2 Query Execution: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Query Execution Each node maintains pool of AIX server processes Query optimizer does query decomposition to node plans (like R* distributed query decomposition) Parallel Optimization is 1Ø (not like Wai Hong’s work) Sends sub-plans to nodes to be executed by servers Node binds plan to server process Intermediate results hashed Proud that Optimizer does not need hints. “Standard” join strategies (except no hash join).DB2/2 Utilities: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Utilities 4 loaders: import raw-insert (fabricates raw blocks, no checks) insert bulk insert Reorganize hash map, add / drop nodes, add devices Table unavailable during these operations Online & Incremental backup Fault tolerance via HACMPDB2/2 Performance: Good performance Great Scaling: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Performance: Good performance Great Scaling Wisconsin scaleups big = 4.8 M rec = 1 GB small = 1.2 M rec = 256MB scan rate ~12 kr/s/node raw load: 2.5 kr/s/node see notes for more data DB2/2 Good Things: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey DB2/2 Good Things Scaleable to 128 nodes (or more) From IBM Good performance Complete SQL (update, insert,...) Will converge with DB2/3 (OO and TPC-C stuff) Will be available off AIX someday (aix is slow and SP2 is very expensive)RedBrick: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey RedBrick Read-only (LOAD then SELECT only) Database system Load is incremental and sophisticated Precompute indices to make small-large joins run fast Indices use compression techniques. Only join via indices Many aggregate functions to make DSS reports easy Parallelism: Pipeline IO Typically a thread per processor (works on index partition) Piggyback many queries on one scan Parallel utilities (index in parallel, etc) SP2 implementation uses shared disk model.Summary: Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey Summary There is a LOT of activity (many products coming to market) Query optimization is near the complexity barrier Needs a new approach? All have good speedup & scaleup if they can find a plan Managing huge processor / disk / tape arrays is hard. I am working on commoditizing these ideas: low $/record/sec (scaleup PC technology) low Admin $/node (automate, automate, automate,...) Continuous availability (online & fault tolerant)