Rewriting The Rules for Enterprise IT

Presentation Transcript

Rewriting The Rules For Enterprise IT Enterprise Grid Orchestrator: 

Rewriting The Rules For Enterprise IT
Enterprise Grid Orchestrator
Christof Westhues, SE Manager EMEA, Platform Computing
2007/03/01, National Grid Meeting, Ankara

Platform Enterprise Grid Orchestrator boosting EU-Grid Technology exploitation: 

Platform Enterprise Grid Orchestrator boosting EU-Grid Technology exploitation
Agenda:
- Increasing the industrial impact of the EU Grid Technologies Programme
- About Platform Computing
- Understanding industry requirements
- Unified Grid resource layer
- Integrate your Grid solution with Platform EGO
- Platform collaborations – EGEE, DEISA etc.
- Conclusion – open for new ideas

Platform Enterprise Grid Orchestrator boosting EU-Grid Technology exploitation: 

Platform Enterprise Grid Orchestrator boosting EU-Grid Technology exploitation
Increasing the industrial impact of the EU Grid Technologies Programme with Platform Enterprise Grid Orchestrator. The EU Grid Technologies Programme targets the logical next step: “From Vision to Impacts in Industry and Society”. How do we make this real? Platform Computing holds what is probably the largest commercially productive install base of Grid infrastructure in industry worldwide, and is now introducing the Enterprise Grid Orchestrator (EGO), the first large-scale rolled-out Grid SOI (Service Oriented Infrastructure) for technical as well as business computing. Platform Computing EGO invites all Grid technology solutions to integrate with its unified Grid resource layer.

Slide4: 

Platform Computing

Platform Computing: 

Platform Computing: the leading systems infrastructure software company, accelerating applications and delivering IT agility to high performance data centers.
- 14 years of grid computing experience
- Global network of offices, resellers & partners
- 7x24 worldwide support and consulting
- Gartner Group 2006 “Cool Vendor” award in IT Operations Management

Over 2,000 leading Global Customers: 

Over 2,000 leading Global Customers

Our Customers: from all verticals: 

Our Customers: from all verticals
Electronics: AMD, ARM, ATI, Broadcom, Cadence, HP, IBM, Motorola, NVIDIA, Qualcomm, Samsung, ST Micro, Synopsys, Texas Instruments, Toshiba
Financial services: Fidelity Investments, HSBC, JP Morgan Chase, Mass Mutual, Royal Bank of Canada, Sal. Oppenheim, Société Générale, Lehman Brothers
Manufacturing: BMW, Boeing, Bombardier, Airbus, DaimlerChrysler, GE, GM, Lockheed Martin, Pratt & Whitney, Toyota, Volkswagen
Life sciences: AstraZeneca, Bristol-Myers Squibb, Celera, DuPont, GSK, Johnson & Johnson, Merck, Novartis, Pfizer, Wellcome Trust Sanger Institute, Wyeth
Government & research: ASCI, CERN, DoD (US), DoE (US), ENEA, Fleet Numeric, Max Planck, SSC (China), TACC, Univ Tokyo
Telecom & other: Bell Canada, Cablevision, eBay, Starwood Hotels, Telecom Italia, Telefonica, Sprint, GE, IRI, Cadbury Schweppes

Slide8: 

Understanding Industry requirements

Understanding Industry requirements: 

Understanding Industry requirements
- Grid value: shared resources & shared usage. Unify many different users AND multiple different workload types.
- Avoid building “Grid silos”: don’t become part of the problem.
- The primary target is “agility” – speed & ease of change – driven by business process and business change needs.
- Handling all workload in the Grid – orchestration, scaling, acceleration – is what produces that agility.
- Let’s have a look at the users. Industry, generically: professional users aiming to create results (€, $, ₤) using the tool “Grid”. Call them customers (a change of perspective).

Understanding Industry requirements: 

Understanding Industry requirements
Quality requirements:
- Reliability: self-healing, recovery from incidents, policy-driven proactive problem containment; no job loss during operation or in error conditions, including reconfiguration and failover.
- Performance: tens of millions of jobs per day throughput, with 90% job-slot utilization based on 15-minute job runtime; max 5 minutes for failover.
- Scalability: thousands of users and hosts, millions of jobs in one logical cluster at any time, tens of millions of jobs per day throughput, thousands-way-parallel jobs.
A back-of-the-envelope sizing of these numbers follows.
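
To see what these targets imply in cluster size, here is a minimal back-of-the-envelope sketch (illustrative assumptions, not a vendor sizing guide): uniform 15-minute jobs at steady state.

```python
# Back-of-the-envelope sizing for the stated throughput target.
# Assumptions (illustrative): uniform 15-minute jobs, steady state.

jobs_per_day = 10_000_000      # "n*10 millions jobs per day"
job_runtime_h = 0.25           # 15-minute average job runtime
slot_utilization = 0.90        # 90% job-slot utilization

slot_hours_per_day = jobs_per_day * job_runtime_h    # 2,500,000 slot-hours/day
busy_slots = slot_hours_per_day / 24                 # ~104,167 slots busy at once
provisioned_slots = busy_slots / slot_utilization    # ~115,741 slots provisioned

print(f"{provisioned_slots:,.0f} job slots needed")  # -> 115,741 job slots needed
```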

LSF Roadmap : 

LSF Roadmap The LSF product roadmap is based on feedback from and interviews with 75+ customers, including: Agilent, Airbus, AMD, ARM, Apple, ATI, BASF, BMS, Boehringer Ingelheim, Boeing, Broadcom, Caltech, CEA, Cineca, Cinesite, Conexant, Daimler Chrysler, DEISA, Devon Energy, Disney, DoD (ARL, ASC, ERDC, MHPCC, NAVO), DoE (LANL, LLNL, Sandia), Dreamworks, Emulex, Engineous, Ferrari, Fleet Numerics, Ford, Freescale, GE, GM, Halliburton/Landmark, Harvard, Hilti, HP, IDT, Intel, J&J, LandRoverJaguar, Lockheed Martin, LSILogic, Magma, Merck, Motorola, MSC, MTU, NCAR, NCSA, Nissan, NOAA, Novartis, NovoNordisk, NVidia, Philips, Pratt & Whitney, Pfizer, PSA, Qlogic, Qualcomm, RBC, Renesas, Samsung, Sandisk, Seagate, Shell, Skyworks, Synopsys, TACC, TenorNetworks, TI, Toshiba and Volvo

Understanding Industry requirements: 

Understanding Industry requirements
Quality requirements – why scaling counts:
- Performance and scalability translate into reliability.
- Reliability can be measured as “MTBF” – Mean Transactions (= Jobs) Between Failure.
- Platform technology meets this requirement – technology leader.
- Support 24/7 around the globe.
Non-technical quality requirements:
- Focus on Grid technology – commitment.
- A reliable partner: experienced, stable, profitable.

Slide13: 

Unified Grid resource layer

Enterprise Grid Problem: workload characteristics: 

Enterprise Grid Problem: workload characteristics. Result: under-provisioning or over-provisioning.

IT Architectures Are Still Statically Coupled and Silo’d : 

IT Architectures Are Still Statically Coupled and Silo’d
- Core applications in the data center: unpredictable, effectively infinite demand. With multiple engineering groups collaborating on multiple designs, core and business applications can consume vast amounts of computing resources.
- Finite computing resources: applications are “siloed”, often procured out of different budgets, at different times, for different purposes.

Results of statically Coupled and Silo’d Infrastructure : 

Results of Statically Coupled and Silo’d Infrastructure
The need is for variable resources to meet variable business demand. Data center business “pain points”:
- Underutilized resources: some server silos have insufficient capacity while there is excess capacity in others.
- Difficulty meeting SLAs: it is difficult to meet application SLAs because resources may not be available when required.
- Costly IT environment: with application silos underutilized, excess capacity, cooling, space and power are required.
- Complex: coordination of resources is complex, time-consuming and error-prone.
- Unpredictable: hardware failures, outages or insufficient capacity make the environment unpredictable.

Model architecture: 

Model architecture: core applications in the data center generate unpredictable, effectively infinite demand, while computing resources are finite.

Open & Decoupled Architecture Platform Enterprise Grid Orchestrator: 

Open & Decoupled Architecture – Platform Enterprise Grid Orchestrator [diagram: SOA application layer decoupled from the underlying SOI resource layer]

Example: Dynamic Resource Allocations – Live SOI: 

Example: Dynamic Resource Allocations – Live SOI
Platform EGO responds to requests from consumers and allocates supply according to policy – Service Oriented Infrastructure.
- Resource allocation: min, max, conditions, resource requirements.
- Dynamic response: resource re-allocation based on policies (=> SLAs) – “lend & borrow”.
- Dynamic response: acquisition of additional resources.
A sketch of this allocation model follows.
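
To make the min/max allocation model concrete, here is a minimal Python sketch of policy-based supply allocation. All names (`Consumer`, `allocate`) are hypothetical illustrations of the idea, not the EGO API.

```python
# Hypothetical sketch of policy-based allocation (not the EGO API):
# each consumer states a guaranteed minimum and a desired maximum;
# spare supply is then handed out up to each consumer's maximum.
from dataclasses import dataclass

@dataclass
class Consumer:
    name: str
    min_slots: int   # guaranteed by policy (SLA floor)
    max_slots: int   # ceiling the consumer may grow to
    demand: int      # slots currently requested

def allocate(total_slots: int, consumers: list) -> dict:
    # Pass 1: everyone gets their guaranteed minimum (capped by demand).
    alloc = {c.name: min(c.min_slots, c.demand) for c in consumers}
    spare = total_slots - sum(alloc.values())
    # Pass 2: distribute spare supply to unmet demand, up to each max.
    for c in sorted(consumers, key=lambda c: c.demand - alloc[c.name],
                    reverse=True):
        extra = max(min(spare, c.demand - alloc[c.name],
                        c.max_slots - alloc[c.name]), 0)
        alloc[c.name] += extra
        spare -= extra
    return alloc

print(allocate(100, [Consumer("risk", 20, 80, 70),
                     Consumer("eda", 30, 90, 50)]))
# -> {'risk': 70, 'eda': 30}: risk borrows spare capacity above its minimum
```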

Slide20: 

3rd party Middleware integration Integrate your Grid solution with Platform EGO

Integrate your Grid solution with Platform EGO: 

Integrate your Grid solution with Platform EGO
Meet industrial quality requirements AND deploy innovative technologies and methods. Specific and targeted solutions as well as general-purpose workload adapters can join one unified resource Grid.
- Reliability: self-healing, recovery from incidents, policy-driven proactive problem containment.
- Dynamic resource allocation – peak power on demand.
- Scalability & performance.

Integrate your Grid solution with Platform EGO: 

Integrate your Grid solution with Platform EGO
Platform EGO offers, via an open API/SDK, policy-based access to all resources in the Grid. Access the same resource Grid from and for all workload types and Grid solutions – no Grid silos!
Access to resources on EGO includes dynamic allocations within SLA guarantees: “breathing” resource allocations (SLA: minimum, maximum – lend & borrow). This may well replace traditional static advance reservations, which were building up “virtual silos” – a virtual, grid-based flavor of the silo’d infrastructure that Grid technology was supposed to make redundant. No Grid silos – not even virtual! A sketch of such a breathing allocation follows.
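
As an illustration of “lend & borrow”, the following hedged sketch extends the allocator above: a consumer rising back toward its SLA minimum reclaims slots from consumers that borrowed beyond their own minimum. The `rebalance` function is hypothetical, not part of any Platform API.

```python
# Hypothetical 'lend & borrow' rebalancing sketch (not a Platform API):
# when a consumer's demand rises while it sits below its SLA minimum,
# slots are reclaimed from lenders, never pushing a lender below its
# own guaranteed minimum.

def rebalance(alloc: dict, minimums: dict, claimant: str, needed: int) -> dict:
    alloc = dict(alloc)
    # Only reclaim up to the claimant's guaranteed SLA floor.
    shortfall = min(needed, minimums[claimant] - alloc[claimant])
    for lender, slots in alloc.items():
        if shortfall <= 0 or lender == claimant:
            continue
        lendable = max(slots - minimums[lender], 0)  # borrowed headroom only
        take = min(lendable, shortfall)
        alloc[lender] -= take
        alloc[claimant] += take
        shortfall -= take
    return alloc

print(rebalance({"risk": 70, "eda": 30}, {"risk": 20, "eda": 40}, "eda", 10))
# -> {'risk': 60, 'eda': 40}: risk hands back 10 borrowed slots
```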

Slide23: 

Platform Collaborations

Platform Engagements and Collaborations: 

Platform Engagements and Collaborations
Currently, Platform Computing is engaged in: QosCosGrid, DEISA, EGEE, …

Slide25: 

Platform Collaborations - QOSCOS

What is QosCosGrid?: 

What is QosCosGrid?
A quasi-opportunistic Grid research project – a research project proposal to the European Commission. 9 academic partners and Platform Computing SARL form a consortium.
IST proposal: Specific Targeted Research Project (STREP), IST Call 5, FP6-2005-IST-5 – “Quasi-Opportunistic Supercomputing for Complex Systems in Grid Environments (QosCosGrid)”.

What is QosCosGrid?: 

What is QosCosGrid?
Target & definition, from the proposal paper: “… Whereas supercomputing resources are more or less dependable, the grid approach is characterized by an opportunistic sharing of resources as they become available. This distributed quasi-opportunistic supercomputing, while not offering the quality of service of a supercomputer, will be to some degree better than the pure opportunistic grid approach. Furthermore it will enable users to develop applications with supercomputing requirements without the need to deploy supercomputers themselves. … QosCosGrid is, therefore, an effort to use the best from two worlds: the opportunistic approach of the grid technology to sharing and using resources whenever they become available, and the reliant or dependable approach of the supercomputing. By developing an infrastructure for quasi-opportunistic supercomputing, QosCosGrid aims at providing a reliable, effortless and cost-effective access to the enormous computational and storage resources required across a wide range of CS research areas and application domains and industrial sectors.” – Prof. Dr. Werner Dubitzky, University of Ulster

What is QosCosGrid?: 

What is QosCosGrid?
The proposal to the EU Commission. Why Platform Computing?
- Researchers from the initiating University of Ulster remembered Platform Computing from D-Grid (German e-science initiative) working groups and asked for Platform’s participation.
- EU Commission funding rule: each research project must have a commercial partner.
- Platform is invited to enter the academic IT research scene in Europe and thereby increase success in a currently under-developed market.
- Platform was offered a package of 45 person-months with a total of over €400,000 in funding.

QosCosGrid Project Plan: 

QosCosGrid Project Plan [Gantt chart: Platform (PCC) tasks marked; 30 months runtime]

QosCosGrid Technology Stack & LSF: 

QosCosGrid Technology Stack & LSF
QosCosGrid research and development efforts will be based on existing grid technology (such as GT4 [i], gLite [ii] and LSF [iii] from PCC), and will focus on three additional layers, as depicted in the figure (not reproduced here). To achieve that, one of the first activities in the project will be the roll-out of a world-spanning Platform LSF MultiCluster grid – from Ireland across Europe, Israel and Australia.
[i] GT4: www.globus.org/toolkit
[ii] gLite: glite.web.cern.ch/Glite
[iii] LSF: www.platform.com/Products

Slide31: 

Platform Collaborations - DEISA

Heterogeneous job submissions and Co-Allocation capability: 

Heterogeneous job submissions and Co-Allocation capability
A virtualized infrastructure spanning heterogeneous schedulers – OpenPBS/PBSPro, IBM LoadLeveler, Platform LSF, and optionally NEC NQS. Develop and extend the heterogeneous job submission capability (UNIVERSUS).

Heterogeneous job submissions and Co-Allocation capability: 

Heterogeneous job submissions and Co-Allocation capability
The same virtualized infrastructure – OpenPBS/PBSPro, IBM LoadLeveler, Platform LSF, optionally NEC NQS, with the UNIVERSUS heterogeneous job submission capability – extended with Co-Allocation: heterogeneous multi-site resource allocation. Example: “Give me 200 CPUs on Site1 and 300 CPUs on Site2 at the same time.” A sketch of the all-or-nothing semantics follows.
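
The defining property of co-allocation is that the multi-site request is atomic: either every site can deliver its share for the same time window, or nothing is reserved. Below is a minimal Python sketch of that all-or-nothing check; the site names and the `reserve`/`release` functions are hypothetical illustrations.

```python
# Hypothetical all-or-nothing co-allocation sketch (illustrative only):
# each site either reserves its requested CPUs, or the whole
# multi-site request is rolled back.

class Site:
    def __init__(self, name: str, free_cpus: int):
        self.name, self.free_cpus = name, free_cpus

    def reserve(self, cpus: int) -> bool:
        if self.free_cpus >= cpus:
            self.free_cpus -= cpus
            return True
        return False

    def release(self, cpus: int) -> None:
        self.free_cpus += cpus

def co_allocate(request: dict, sites: dict) -> bool:
    """Reserve CPUs on all sites at once; roll back on any failure."""
    done = []
    for name, cpus in request.items():
        if sites[name].reserve(cpus):
            done.append((name, cpus))
        else:                          # one site failed:
            for n, c in done:          # undo everything reserved so far
                sites[n].release(c)
            return False
    return True

sites = {"Site1": Site("Site1", 250), "Site2": Site("Site2", 280)}
print(co_allocate({"Site1": 200, "Site2": 300}, sites))  # False: Site2 short
print(sites["Site1"].free_cpus)                          # 250: rolled back
```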

Slide34: 

Platform Collaborations - EGEE

Platform Computing - EGEE-Business-Associate: 

Platform Computing – EGEE Business Associate
The collaboration “Plan”:
Step 1 – immediate improvements for the EGEE users and resource providers (technology boost):
- SLA scheduling
- Parallel job control and accounting
- Resource-aware scheduling – double compute efficiency
What’s next?
Step 2 – mid-term target: a production Grid unifying all resources AND all users:
- Enable & integrate with new user groups and their resources
- All kinds of applications: commercial code; complex systems
Long-term target: SOA/SOI for Service Oriented Science:
- “IT agility” for scientific computing
- Introduce novelties faster; respond to changing requests in time

EGEE & Platform: the “Plan”: 

EGEE & Platform: the “Plan”
The collaboration “Plan”, Step 1: 4 actions.
1st Action: improve the LSF–gLite integration.
- Platform LSF is one of the supported batch systems of gLite; currently, about 45% of all CPUs in EGEE run on LSF.
- May include version maintenance as well as performance improvements.
- Will include improved documentation and communication.
- Leads to a better understanding of the capabilities of LSF, in order to build complex algorithms that may benefit from information passing and use all the features of LSF.

EGEE & Platform: the “Plan”: 

EGEE & Platform: the “Plan”
The collaboration “Plan”, Step 1: 4 actions.
2nd Action: SLA scheduling – exploit LSF and gLite features to enhance user and resource provider capabilities.
SLA scheduling helps both sides:
- For the user, it provides guaranteed result delivery – in time or in throughput.
- For the resource provider, it translates to “least-impact scheduling”: serving the SLA user while there is still room left to host other requests. In other words: handling different service levels, working with different customers, at the same time.
Expected results: resource providers will offer more resources to EGEE users under well-defined SLAs, and users perceive predictable result delivery and predictable behaviour of the Grid.

EGEE & Platform: the “Plan”: 

EGEE & Platform: the “Plan”
The collaboration “Plan”, Step 1: 4 actions.
3rd Action: parallel application support.
- gLite today supports sequential jobs and provides basic support for parallel jobs based on MPICH.
- Exploit LSF-HPC features: LSF-HPC allows control of MPI parallel jobs down to task level, provides a signalling layer for management or workflow control signals, delivers accounting that includes all children of a parallel application, and supports multiple MPI types in one cluster.
Is parallel application support in EGEE easy? No. LSF-HPC might be the best choice to start with. We may identify topics worth a research project / support action, e.g. parallel application checkpoint/restart.

EGEE & Platform: the “Plan”: 

EGEE & Platform: the “Plan”
The collaboration “Plan”, Step 1: 4 actions.
4th Action: resource-aware scheduling – double compute efficiency.
- Exploit LSF features: LSF supports a generic resource concept, thus data is a resource, too; all resources can be used for scheduling decisions.
- The scheduling paradigm “job-follows-data” results in up to 50% gain in compute power.
Is resource-aware scheduling in EGEE easy? No. EGEE supports co-location of data and computation based on sites, but not for computation scheduling within a site.
- Major topics in the operations model.
- Medium topics for the compute resources: re-think, re-build, re-budget.
- Maybe switch to the mid-term horizon…

SLA Scheduling for EGEE: 

SLA Scheduling for EGEE
LSF service-level agreement (SLA) scheduling:
- Is a goal-oriented “just-in-time” scheduling policy that lets the user focus on the “what and when” of a project instead of “how” the resources need to be allocated to satisfy the workload.
- Defines an agreement between LSF administrators and users.
- Helps configure workload so that jobs complete on time, reducing the risk of missed deadlines.
Three different types of service-level goals: deadline, velocity, and throughput – or a combination of these goals. A sketch of how such goals translate into dispatch decisions follows.
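
To illustrate how deadline and throughput goals drive “just-in-time” dispatching, here is a hedged Python sketch of the underlying reasoning: given remaining work, job runtime and the goal, compute how many slots must be busy now. This illustrates the policy’s logic, not LSF’s actual implementation.

```python
# Illustrative sketch of SLA goal reasoning (not LSF's implementation).
import math

def slots_for_deadline(jobs_remaining: int, job_runtime_h: float,
                       hours_to_deadline: float) -> int:
    """Minimum concurrently running slots so all jobs finish in time."""
    sequential_per_slot = max(hours_to_deadline // job_runtime_h, 1)
    return math.ceil(jobs_remaining / sequential_per_slot)

def slots_for_throughput(results_per_hour: float, job_runtime_h: float) -> int:
    """Slots needed to sustain a throughput goal (finished jobs/hour)."""
    return math.ceil(results_per_hour * job_runtime_h)

# Deadline goal: 960 jobs of 15 minutes each, 12 hours left -> 20 slots now.
print(slots_for_deadline(960, 0.25, 12.0))   # -> 20
# Throughput goal: 4 results/hr with 15-minute jobs -> one slot suffices.
print(slots_for_throughput(4, 0.25))         # -> 1
```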

SLA Scheduling for EGEE: SLA “Deadline”: 

SLA Scheduling for EGEE: SLA “Deadline” [timeline figure; captions: “now”, “I need to work now!”, “Early enough for me”]

SLA Scheduling for EGEE: SLA “Throughput”: 

SLA Scheduling for EGEE: SLA “Throughput”
[Figure: cluster utilization over time; SLA 2 consumes 25% of the cluster now, delivering a steady 4 results/hr, leaving room for more EGEE users.] “I am a scientist, I need just as many results as I can process per time interval.”

EGEE High Performance Parallel Computing: 

EGEE High Performance Parallel Computing
Distributed computation is “imperfectly parallel” in the real world: inter-task runtime communication is often implemented using MPI, the Message Passing Interface (MPI: Many Possible Implementations). Different communication patterns:
- “Neighbour” tasks (defined by the problem decomposition topology)
- “All to all”, “some to many” (= N-to-M)
- Central instance to tasks (commercial code, …)

LSF-HPC – LSF for High Performance Computing: 

LSF-HPC – LSF for High Performance Computing
LSF-HPC is LSF plus additional functionality:
- Topology-aware scheduling for large SMPs and large clusters
- Task-granular control for parallel computation
- Generic and vendor-specific MPI integrations
- Signal forwarding to all tasks
- Resource usage accounting for all tasks
- Limit enforcement: time, memory, threads, …
- Scalability: 8000+ in LSF 6.2 / 16000+ in LSF 7.0

Platform LSF/HPC – Generic integration: 

Platform LSF/HPC – Generic integration
Architecture: running a parallel job using a non-integrated PJL (parallel job launcher). Without the generic PJL framework, the PJL starts tasks directly on each host and manages the job itself. Even if the MPI job was submitted through LSF, LSF never receives information about the individual tasks, so it cannot track job resource usage or provide job control. If you simply replace PAM with a parallel job launcher that is not integrated with LSF, LSF loses control of the process.

Platform LSF/HPC – Generic integration: 

Platform LSF/HPC – Generic integration
Architecture: using the generic PJL framework. PAM is the resource manager for the job. The key step in the integration is to place TS (TaskStarter) in the job startup hierarchy, just before the task starts: TS must be the parent process of each task in order to collect the task process ID (PID) and pass it to PAM. A sketch of this wrapper pattern follows.
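
The essence of the TaskStarter pattern is a thin wrapper that becomes the parent of each task, reports the task’s PID upward, and waits on it so exit status and resource usage stay visible to the resource manager. A hedged, self-contained Python sketch; the reporting channel and output format are illustrative, not Platform’s actual TS/PAM protocol.

```python
# Illustrative TaskStarter-style wrapper (not Platform's actual TS/PAM
# protocol): spawn the real task as our child, report its PID upward,
# then wait so the manager can see exit status and resource usage.
import os          # Unix assumed (the 'resource' module is Unix-only)
import resource
import subprocess
import sys

def task_starter(task_argv: list) -> int:
    task = subprocess.Popen(task_argv)        # we are the task's parent
    # A real integration would report over a socket back to PAM;
    # here we just print a parseable line.
    print(f"TS: started pid={task.pid} cmd={task_argv[0]}", flush=True)
    rc = task.wait()                          # stay alive until the task ends
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    print(f"TS: pid={task.pid} exit={rc} cpu_s={usage.ru_utime:.2f}")
    return rc

if __name__ == "__main__":
    sys.exit(task_starter(sys.argv[1:] or ["sleep", "1"]))
```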

LSF-HPC – LSF for High Performance Computing: 

LSF-HPC – LSF for High Performance Computing
Advantages for EGEE, users and resource providers:
- Freedom to integrate and use all MPI types and all compute architectures
- May implement optional automated MPI selection, dependent on actual availability – best possible choice
- Full application control, ready to implement optional parallel preemption – important to guarantee service levels: suspend/resume, checkpoint/migrate/restart

Resource aware scheduling for EGEE: 

Resource aware scheduling for EGEE
(Recap of the 4th Action: resource-aware scheduling – double compute efficiency.)
- Exploit LSF features: LSF supports a generic resource concept, thus data is a resource, too; all resources can be used for scheduling decisions.
- The scheduling paradigm “job-follows-data” results in up to 50% gain in compute power.
Is resource-aware scheduling in EGEE easy? No. EGEE supports co-location of data and computation based on sites, but not for computation scheduling within a site: major topics in the operations model, medium topics for the compute resources (re-think, re-build, re-budget), maybe switch to the mid-term horizon…

EGEE: data handling in the resource center: 

EGEE: data handling in the resource center
EGEE example operations model:
1. A job arrives and is started on a compute node.
2. The requested data is ordered from the storage robot.
3. The tape is mounted and the “data set” content is provided to the compute node via NFS – allocating 2 nodes for 1 job.

Resource aware scheduling – up to double compute efficiency: 

Resource aware scheduling – up to double compute efficiency
Resource-aware scheduling (resource: data; value: “identifier”):
1. A job arrives and is queued, with a resource requirement such as “data=#4711”.
2. The requested data set “#4711” is ordered from the storage robot by LSF.
3. The tape is mounted.
4. The LSF resource “data” is updated to “data=#4711”.
5. As soon as the resource requirements are satisfied, the job is dispatched to the right host – the one holding the right data locally.
A sketch of this matching logic follows.
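
A minimal Python sketch of the “job-follows-data” matching step: jobs declare a required dataset, hosts advertise which dataset they currently hold, and the scheduler dispatches only when they line up, otherwise it triggers staging. Names and structure are illustrative, not LSF’s scheduler internals.

```python
# Illustrative 'job-follows-data' matching (not LSF scheduler internals).

def schedule(jobs: list, hosts: dict) -> list:
    """jobs: [{'id': ..., 'data': '#4711'}]; hosts: host -> dataset held."""
    plan = []
    for job in jobs:
        # Prefer a host that already holds the dataset: no staging, and
        # no second node tied up serving the data over NFS.
        local = [h for h, ds in hosts.items() if ds == job["data"]]
        if local:
            plan.append((job["id"], local[0], "dispatch"))
            hosts[local[0]] = None     # slot now busy (simplified)
        else:
            plan.append((job["id"], None, f"stage {job['data']}"))
    return plan

hosts = {"node1": "#4711", "node2": None}
jobs = [{"id": "j1", "data": "#4711"}, {"id": "j2", "data": "#0815"}]
print(schedule(jobs, hosts))
# -> [('j1', 'node1', 'dispatch'), ('j2', None, 'stage #0815')]
```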

Slide51: 

Conclusion

Conclusion: Increasing the industrial impact of Grid: 

Conclusion: Increasing the industrial impact of Grid
Increasing the industrial impact of the EU Grid Technologies Programme with Platform Enterprise Grid Orchestrator:
- Platform Computing invites all Grid technology solutions to integrate with its unified Grid resource layer, the Enterprise Grid Orchestrator (EGO).
- Platform Computing is open to partner with academia, research and industry to push forward the adoption and “impact” of Grid technology.
Contact: Christof Westhues, SE Manager EMEA, Platform Computing GmbH, cwesthue@platform.com
Proline Bilişim A.Ş. – Tel: +90 212 236 8070, Fax: +90 212 236 7740

Slide53: 

Thank you