Capacity Planning for the Newer Workloads: Capacity Planning for the Newer Workloads Linwood Merritt
Capital One Services, Inc.
linwood.merritt@capitalone.com
Disclaimer: Disclaimer These generic issues are addressed by this presentation:
Vendor capacity ratings
e-Commerce
Continuous availability
Data warehousing
Growth rates
This presentation contains no specific business-related information.
Introduction: Environment: Introduction: Environment Capital One
5th largest card issuer in the United States
Capital One added to the S&P 500 in 1998
Fortune 500 company (#260)
Managed loans at $48.6 billion as of Q1 2002
Accounts at 46.6 million as of Q1 2002
Fortune 100 “Best Places to Work in America”
CIO 100 Award “Master of the Customer Connection”
Information Week “Innovation 100” Award Winner
ComputerWorld “Top 100 places to work in IT”
Outline of Approach: Outline of Approach Understand behavior and issues around workloads, hardware, and data
Create projections and build recommendations.
Report the findings.
Outline of Presentation: Outline of Presentation Discussion of workload types and capacity projection approaches
Overall summary of issues and approaches
Examples
What Workloads?: What Workloads? E-Commerce
Relational database systems
Mainframe-class UNIX
Multiple platforms
New characteristics
e-Commerce Workloads Direct to Client (business-to-business): e-Commerce Workloads Direct to Client (business-to-business) Access
Internet
Leased line
Services
Point of Care / Point of Sale
Value-added analysis
e-Commerce Workloads Direct to Customer: e-Commerce Workloads Direct to Customer Access
Internet
Dial-in
Services
Marketing
Account query
e-Commerce Workloads How to Predict: e-Commerce Workloads How to Predict Take business projections of volumes or users (include fudge factor)
Estimate transaction volumes and CPU/transaction
Convert to normalized unit such as MIPS
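The three prediction steps above can be sketched in Python. This is only an illustration: the function name, the default fudge factor, and every sample figure (transaction volume, CPU seconds per transaction, MIPS per engine) are assumptions and do not come from this presentation.

```python
def projected_mips(peak_tx_per_hour, cpu_sec_per_tx, mips_per_cpu, fudge_factor=1.25):
    """Convert a business volume projection into a normalized MIPS demand.

    All inputs are hypothetical planning figures:
      peak_tx_per_hour: projected transactions in the peak hour
      cpu_sec_per_tx:   measured or estimated CPU seconds per transaction
      mips_per_cpu:     rating of the engine on which cpu_sec_per_tx was measured
      fudge_factor:     allowance for uncertainty in the business projection
    """
    tx_per_hour = peak_tx_per_hour * fudge_factor   # step 1: business volume plus fudge factor
    cpu_busy_sec = tx_per_hour * cpu_sec_per_tx     # step 2: CPU seconds needed in the peak hour
    engines_busy = cpu_busy_sec / 3600.0            # average number of engines kept busy
    return engines_busy * mips_per_cpu              # step 3: normalize to MIPS


demand = projected_mips(peak_tx_per_hour=72_000, cpu_sec_per_tx=0.05, mips_per_cpu=200)
print(f"Projected peak-hour demand: {demand:.0f} MIPS")
```

The same arithmetic works with users instead of transactions if a transactions-per-user rate is estimated first.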
Relational Databases: Relational Databases Sub-second (OLTP), decision support / data mining
Distributed gateways
Database machines
Redundant data with extracts
How to predict: estimate a factor over current database demand or take usage estimates
Mainframe-Class Unix: Mainframe-Class Unix Types: Mainframe USS or Linux, Future UNIX vendor offerings
Candidate applications
Web server
Vendor-ported applications
User-ported / new applications
How to predict:
Estimate by timeframe
Add factor to growth rates
Multiple Platforms: Multiple Platforms Mainframe: plan like existing applications (#users, transactions * CPU/transaction, application look-alikes, sizing tools)
Distributed: use vendor sizing, modeling tools, existing applications
Network: use network simulation tools, rules-of-thumb, bandwidth calculations
New Characteristics: New Characteristics External users
Continuous availability
New user interfaces
Cross-platform
External Users: External Users Drive need for continuous availability
Different access patterns (e.g., doctor’s office vs. call center)
Service level measurement - harder to put agent on external workstations
Continuous Availability: Continuous Availability Driven by external users
24x7 schedule
Application redesign
Data Sharing: CPU overhead
Coupling Facility
Expansion of “prime shift”
99.999% “up time”
Redundancy, overhead
Availability reporting
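To put the 99.999% figure in perspective, the following minimal Python sketch converts an availability percentage into allowed downtime per year. The function name is hypothetical, and a 365-day year of continuous (24x7) scheduled service is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 24x7 schedule


def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_pct / 100.0)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% up time allows {allowed_downtime_minutes(pct):.1f} minutes/year of downtime")
```

At 99.999% the budget is a little over five minutes per year, which is why redundancy and its overhead enter the capacity plan.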
User Interfaces: User Interfaces TCP/IP - no “definite response” (end-to-end response time measurement)
Multiple internal transactions per “mouse click”
Response time measurement:
Agent on workstations
Scripting from “robots”
Cross Platform Applications: Cross Platform Applications Only unified view: simulation package
Each platform (“silo”) can be analyzed separately.
Different application development groups
May be able to cross-validate user numbers
Types of Implementation (1): Types of Implementation (1) Standalone / “shrink-wrap”
Layered onto legacy applications
New mainframe application code
GUI front-end
Browser
Middle-tier (Unix or NT)
MQSeries - can add middle-tier and new mainframe applications
Types of Implementation (2): Types of Implementation (2) Legacy extracts
Re-engineered legacy applications
Convergence of business rules / applications
Re-usable components
Redundant access
Salvage investment, fix Band-Aids
Simplify logic, reduce platform complexity
What Are We Analyzing? (Mainframe): What Are We Analyzing? (Mainframe) MIPS - growth, latent demand, software cost
Memory - track and watch 2 GB limit on central storage (goes away with 64-bit)
I/O - channels, gigabytes of disk, tape
Coupling Facility - Parallel Sysplex, Shared Data, continuous availability
Vendor upgrade paths
New partitions
What Are We Analyzing? (Distributed): What Are We Analyzing? (Distributed) Number and types of platforms
CPU, memory, disk space
Bandwidth
Location of applications / processes
Platform limitations (CPU, memory)
Software pricing considerations
Porting opportunities
Measurement of New Workloads: Measurement of New Workloads Summarize by platform:
Workload rules (process or user names)
Processes by descending CPU%
Resources: CPU, memory, disk space, Coupling Facility, network traffic
Growth:
Resources/user/application
Number of users + application changes
Distributed Approach: Distributed Approach Consider tiers of service (not currently at Capital One)
Address service level measurement issue
Implement reporting
Add to Capacity Plan
“Silo” vs. “Application”
Tiers of Service “Platinum”: Tiers of Service “Platinum”
Most expensive
Modeling product
Install in one server for each major application, use collection product for other servers
Tiers of Service “Gold”: Tiers of Service “Gold”
Collection product
Capacity planning with Rules of Thumb
Tiers of Service “Brass”: Tiers of Service “Brass”
Least expensive (man-hours only)
“Native”
Unix scripts
NT PerfMon
Service Level Measurement: Service Level Measurement API call at workstation - “Application Response Measurement” (ARM) or Windows 2000 trace API calls
Agents: software tracing of Windows API calls - can be installed in a subset of end-user base (sampling)
Scripting (“robots”)
Stop watch sampling and logging
Distributed Reporting: Distributed Reporting
Add to Capacity Plan: Add to Capacity Plan
Scope of Analysis: Scope of Analysis Silos
Look at each hardware/application environment independently.
Applications
Look at each application as a whole.
Application instrumentation
Inference: put platform silos together.
Analyzing the Data Growth Rates: Analyzing the Data Growth Rates General list of business plans
List of technical scenarios
Timeline
Estimate median and maximum likely MIPS/CPU/users/business units
Derive scenario growth rates
Analyzing the Data Additional Resources: Analyzing the Data Additional Resources Parallel Sysplex (Coupling Facility): important for continuous availability, level set functionality
Disk / channels / tape: disk megabytes, channel maximum, tape connectivity
Communications connectivity: new partitions for availability
Memory: 2 GB constraint, 64-bit
Growth: Growth “Baseline” growth
“Scenario” growth
Independent events (merger/acquisition, potential major project)
Example 1: Mainframe Upgrade: Example 1: Mainframe Upgrade Task force, led by Capacity Planner
Driven by expiring three-year lease (CPU replacement, three-year planning horizon)
“Vendor parade” - presentations and dialogues
Upgrade paths
Technology / service differences
References / site visits
Capacity sizing: MIPS charts, LSPR / sizing tools
Mainframe Upgrade Deliverables: Mainframe Upgrade Deliverables Document
Business drivers and technical scenarios
Growth forecasts
Vendor options and growth paths
Coupling Facility / Parallel Sysplex
Evaluation
Difference thresholds: MIPS claims, price/MIPS, ICF
Differentiators
Business and Technical: Business and Technical Business Drivers
Cost management
External business
Improved data access
Business expansion
Technical Scenarios
Consolidation of distributed servers
Continuous availability
Significant external business
Data Warehousing
Acquisition/merger
Projections: Projections Make educated guess by timeframe for each scenario
Add to “baseline” growth
Convert to growth rate
Use both “baseline” and “scenario growth”
Compare maximum scenario growth to maximum for platform family
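The projection steps above (estimate each scenario by timeframe, add to the baseline, convert to a growth rate) can be sketched as follows. The function names and all sample MIPS figures are illustrative assumptions, not data from this study.

```python
def compound_growth_rate(start, end, years):
    """Annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


def scenario_growth_rate(current_mips, baseline_rate, scenario_mips, years):
    """Add educated-guess scenario demand to baseline growth and
    convert the combined end point back into a single annual growth rate.

    scenario_mips: iterable of extra MIPS expected from each scenario
                   by the end of the planning horizon (hypothetical values)
    """
    baseline_end = current_mips * (1.0 + baseline_rate) ** years
    total_end = baseline_end + sum(scenario_mips)
    return compound_growth_rate(current_mips, total_end, years)


# Hypothetical inputs: 1,000 MIPS today, 25%/year baseline, three-year horizon.
extra = {"continuous availability": 150, "data warehousing": 400, "external business": 250}
rate = scenario_growth_rate(1000, 0.25, extra.values(), years=3)
print(f"Baseline growth: 25%/yr, scenario growth: {rate:.0%}/yr")
```

Carrying both the baseline rate and the scenario rate forward gives the lower and upper brackets used in the rest of the analysis.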
Impact Analysis: Impact Analysis
Scenario Timeline: Scenario Timeline
Vendor Upgrade Paths Detail: Vendor Upgrade Paths Detail Use logarithms:
Start * CAGR^x = Threshold
x years = log(Threshold / Start) / log(CAGR)
Model    MIPS  MSU  +40%/Yr  +25%/Yr
GS2068E   952  160  Aug-00   Sep-00
GS2074E  1013  171  Oct-00   Dec-00
GS2084E  1141  193  Apr-01   Jul-01
GS2094E  1260  213  Sep-01   Dec-01
GS2104E  1378  234  Nov-01   May-02
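The logarithm formula can be sketched in Python. Here the slide's CAGR is treated as a growth factor (1.40 for +40%/Yr), which is an assumption drawn from how the formula is written. The function name is hypothetical, and the starting demand behind the table is not given in the slide, so the example only measures the gap between two table rows.

```python
import math


def years_to_threshold(start, threshold, annual_growth):
    """Solve start * (1 + annual_growth)**x = threshold for x (in years)."""
    return math.log(threshold / start) / math.log(1.0 + annual_growth)


# Time for demand to grow from the GS2068E rating to the GS2074E rating.
for growth in (0.40, 0.25):
    months = 12 * years_to_threshold(952, 1013, growth)
    print(f"+{growth:.0%}/yr: {months:.1f} months from 952 to 1013 MIPS")
```

The results, roughly two months at 40% and roughly three months at 25%, agree with the spacing of the dates in the first two table rows.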
Vendor Upgrade Paths Summary: Vendor Upgrade Paths Summary
Upgrade Document: Upgrade Document
Example 2: UNIX Modeling: Example 2: UNIX Modeling Modeling product installed on MQSeries server
Application running with a known number of users
Projected rollout schedule used to drive model
Mainframe side: CICS application, IMS load
UNIX Platform Workloads: UNIX Platform Workloads Two primary workloads:
MQSeries userids (mqm*) - memory intensive
Messaging application processes (MDA*) - “CPU intensive”
Workload Modeling Methodology: Workload Modeling Methodology MQSeries - Calculate relative workload intensity, enter model ratio.
Messaging application processes - Keep constant until application is removed from platform (“design loop” - always uses 1 CPU). Must adjust across CPU upgrade to continue using 1 CPU.
Track Across Upgrade: Track Across Upgrade
Model Spreadsheet: Model Spreadsheet
Model Presentation: Model Presentation Timeframe: April 2000
#Users: 180, 100
Ratios: 1.27, 1.00
Config: F50/02,2GB
Comment: Add Event1 Users
Validation - Tracking Users (on mainframe): Validation - Tracking Users (on mainframe)
//ECLUSRS EXEC SASV8,REGION=0M
//ECLD1 DD DSN=XYZ.PRD.A.AAAPRD.I.VOLFIL,DISP=SHR
//ECLDPDB DD DSN=CAPLAN.PRD.ECLDPDB,DISP=OLD
//SYSIN DD *,DLM=@@
/* Read the input file allocated to DD ECLD1 */
data ecld1;
format date date.;
format dt datetime.;
INFILE ECLD1 MISSOVER;
INPUT @1 RECNUM $CHAR5.
@6 RECTYPE $CHAR8.
@14 USERCT $CHAR5.
@19 USERMAX $CHAR5.;
/* Keep only records whose number starts with 99999 and type with TCSCONFG */
if recnum =: '99999' and rectype =: 'TCSCONFG';
/* Stamp the observation with the current date and hour */
dt = datetime();
date = datepart(dt);
hour = hour(dt);
run;
/* Add or refresh the date/hour entry in the permanent users table */
data ecldpdb.users;
update ecldpdb.users ecld1;
by date hour;
run;
proc print;
title 'Ecloud1 Users';
run;
Example 3: Server Replacement: Example 3: Server Replacement Project: replace “old” NT servers
Application: Imaging servers
Capacity sizing data:
Rules-of-thumb analysis by vendor, using projected claims/minute and processor clock speeds
Benchmark information
Server Replacement Process: Server Replacement Process Multiple servers: each server is a workload, must be sized separately.
Enumerate and measure servers.
Apply growth rates and determine processing power requirements for the replacements.
Research available configurations and order appropriate server configurations.
Track CPU utilization across the upgrades.
Update relative capacity specs for next upgrade.
Server Sizing: Server Sizing Find (or derive) benchmark capacity ratings for starting and replacement configurations.
Apply an estimate of current CPU utilization, a growth percentage, and a “peak/average” and performance buffer (+100% for this study).
Output: estimated percentages of a standard configuration. The number of estimated CPUs needed (23) came very close to the vendor’s original number of 24.
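The sizing arithmetic described above can be sketched as follows. The function name, the benchmark ratings, the utilizations, and the growth percentage are hypothetical. Only the +100% buffer comes from the study.

```python
def required_fraction(old_rating, new_rating, cpu_util, growth, buffer=1.0):
    """Fraction of one standard replacement configuration needed by one server.

    old_rating, new_rating: benchmark capacity ratings (same units)
    cpu_util: current CPU utilization of the old server (0.0 to 1.0)
    growth:   expected growth over the planning period (0.5 = +50%)
    buffer:   "peak/average" and performance buffer (1.0 = +100%)
    """
    demand = old_rating * cpu_util   # benchmark units in use today
    demand *= 1.0 + growth           # apply growth
    demand *= 1.0 + buffer           # apply peak/average and performance buffer
    return demand / new_rating


# Each existing server is its own workload and is sized separately.
servers = {"img01": (100, 0.50), "img02": (100, 0.35), "img03": (150, 0.60)}  # (rating, utilization)
for name, (rating, util) in servers.items():
    frac = required_fraction(rating, new_rating=400, cpu_util=util, growth=0.5)
    print(f"{name}: {frac:.0%} of a standard replacement configuration")
```

Multiplying each fraction by the CPU count of the standard configuration and summing gives an estimated CPU total that can be compared with the vendor's number.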
Sizing Spreadsheet: Sizing Spreadsheet
Example 4: Hundreds of Servers: Example 4: Hundreds of Servers Data capture
Reporting
Business drivers
Data Capture: Data Capture Time-based scheduling product
Script-based data “pull”
Issue: data loss, time to find and rebuild
Potential fixes:
Product
Data “push” from servers
Data Reporting, Analysis: Data Reporting, Analysis Color-based “health index” (Concord NetHealth metric).
Statistical Analysis (over two standard deviations from mean)
Thumbnail drilldown graphs
Automatic generation of HTML
“Treemap” graphs
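The statistical analysis item (flagging values more than two standard deviations from the mean) can be sketched with the Python standard library. The function name and the sample utilization history are illustrative assumptions. The reporting described in this presentation used other tooling.

```python
import statistics


def exceeds_control_limit(history, latest, n_sigma=2.0):
    """True when `latest` lies more than n_sigma sample standard
    deviations away from the mean of the historical measurements."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(latest - mean) > n_sigma * sd


# Hypothetical daily peak CPU% for one server, followed by today's reading.
history = [40, 42, 38, 41, 39, 40]
for today in (41, 45):
    status = "EXCEPTION" if exceeds_control_limit(history, today) else "normal"
    print(f"CPU {today}%: {status}")
```

With hundreds of servers, only the exceptions need to be surfaced in the drilldown pages.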
Health Index: Health Index (Concord NetHealth metric)
Statistical Process Control: Statistical Process Control
Thumbnail HTML: Thumbnail HTML
Automatic Generation of HTML: Automatic Generation of HTML Driven by “matrix”
Originally spreadsheet
Converted to relational database
Ultimate capacity planning solution: information by server, application, platform, business driver
SAS code - builds web pages and hyperlinks
Treemap: Treemap Paper by Ben Shneiderman, University of Maryland, http://www.cs.umd.edu/hcil/treemaps
Business Drivers: Business Drivers Capacity Councils - business units responsible for capacity planning of “demand” side
Capacity Planners - build projections based on business drivers and historical trending
Business Driver Based Forecasts: Business Driver Based Forecasts [Diagram labels: Server, Application, Business Driver, Projections]
Regression Analysis: Regression Analysis Input = CPU and business drivers (Widgets, Gadgets, Customers) by month
Output = coefficients (f1, f2, f3)
CPU by month:
projection = Widgets*f1 + Gadgets*f2 + Customers*f3;
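One way to obtain the coefficients is an ordinary least-squares fit with no intercept term, sketched below in plain Python. This is an assumption about the method: the presentation does not say which regression procedure was used, and the function names and monthly figures are hypothetical.

```python
def fit_coefficients(drivers, cpu):
    """Least-squares fit of cpu ~ sum(f_j * driver_j), no intercept.

    drivers: list of monthly rows, e.g. [widgets, gadgets, customers]
    cpu:     list of monthly CPU measurements, same length as drivers
    Assumes the driver columns are not linearly dependent.
    """
    n = len(drivers[0])
    # Build the normal equations (X'X) f = X'y as an augmented matrix.
    a = [[sum(row[i] * row[j] for row in drivers) for j in range(n)]
         + [sum(row[i] * y for row, y in zip(drivers, cpu))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def project(row, coefficients):
    """projection = Widgets*f1 + Gadgets*f2 + Customers*f3"""
    return sum(x * f for x, f in zip(row, coefficients))


# Hypothetical monthly history: [widgets, gadgets, customers] and CPU-hours.
months = [[100, 50, 1000], [120, 55, 1100], [110, 70, 1250], [150, 60, 1300], [160, 80, 1500]]
cpu_hours = [31, 34, 38, 39, 47]
f = fit_coefficients(months, cpu_hours)
print("Coefficients:", [round(x, 4) for x in f])
print("Projection for next month:", round(project([170, 85, 1600], f), 1))
```

The business units supply the forecast driver values, and the fitted coefficients turn them into a CPU projection.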
Graphical Output: Graphical Output [Graph series: Widgets, Gadgets, Customers]
Enterprise “Capacity at a Glance”: Enterprise “Capacity at a Glance”
Summary Issues: Summary Issues Access patterns and schedules
Platforms (more types and numbers)
Resources (what to track)
Levels of capacity management
Reporting of utilization and service levels, for large numbers of platforms
Higher availability (redundancy, reporting)
Deriving and reporting projections
Summary Deriving Projections: Summary Deriving Projections Basic capacity planning:
Growth rates
Upgrade thresholds
Aggressive estimate of “scenario” demand
Bracket growth:
Lower end: “baseline”
Upper end: “scenarios”
Summary Types of Projections: Summary Types of Projections Number of transactions
Number of users
Number of platforms
Application sizing input
Application complexity
Fraction of an existing workload
Growth rate
Summary Capacity Planning: Summary Capacity Planning Projections based on application and platform
Levels of capacity planning service
Report on all enterprise resources
Organize data with “matrix” database