Cognitive Behavior Analysis for Fault Prediction in Cloud Computing

Presentation Description

Reza FARRAHI MOGHADDAM, Fereydoun FARRAHI MOGHADDAM, Vahid ASGHARI, Mohamed CHERIET. Presented at the Network of the Future conference (NoF’12), Nov 21st-23rd, 2012, Tunis, Tunisia.

Presentation Transcript

Cognitive Behavior Analysis framework for Fault Prediction in Cloud Computing (NoF’12, Nov 21st-23rd, 2012, Tunis, Tunisia):

Reza FARRAHI MOGHADDAM, Fereydoun FARRAHI MOGHADDAM, Vahid ASGHARI, Mohamed CHERIET
Synchromedia Lab, ETS, University of Quebec, Montreal, Quebec, Canada

Outline:

- Motivation for Behavior Analysis (BA) and Failure Prediction
- Proposed BA framework
  - Probabilistic Behavior Analysis
  - Simulated Probabilistic Behavior Analysis
  - Behavior-Time Profile Modeling and Analysis
- Scalability of the Proposed BA framework
- Conclusions and Future Prospects

NoF’12, 11/23/2012

Why Behavior Analysis (BA)?:

- Benefits of BA for Failure Prediction
  - Preventing Service-Layer or System-Level failures
  - Enabling operation in “unallowable” states to save energy and cost, and also to reduce footprint
- Profiling the Actors
  - Profiling end users, service providers, and other actors in a computing business (for example, a telecom business)
  - The ensemble of these actors resembles an ecosystem more than a system
  - Profiling helps in:
    - Smart management of resources
    - Building reputations and trust for actors
    - Identifying and isolating wrong-acting actors and threats

Why Failure Prediction?:

A new failure source: Cyclic ElastoPlastic Operation (CEPO)

Cyclic elastoplastic operation (CEPO): in Civil and Mechanical Engineering:

- Safe operation in plastic mode
- Repeatable transitions between elastic and plastic modes
- Cyclic operation is the key

[Figure labels: elastic regime, plastic regime, plastic collapse point]

Cyclic elastoplastic operation (CEPO): its counterparts in Computing Systems:

Carbon Enabling Effect and Green Push: Doing more with less

1. PUE of data centers
   - Increasing inlet air flow temperature (2-4% energy saving per 1 °C increase)
     - For example: PUE = 1.5, 20% saving (5 °C increase) → PUE = 1.2
   - Reducing or eliminating fans
   - Failure at component level (servers) increases with temperature (ASHRAE TC 9.9, 2011)
   - Failure Prediction and Behavior Analysis can isolate component-level failures (even before their occurrence) in order to prevent system-level failures (which may violate SLO constraints)
   - Again, cyclic operation is the key to success
2. Can it be applied to bandwidth too?

[Figure: stress on system versus inlet temperature, showing the elastic mode, the plastic mode, the allowable elastic range, and the bearable stress level. Uncertainty increases with the length of stay in the plastic mode.]
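The PUE example on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the per-degree savings are added linearly (5 °C at 4% per degree gives 20%), the saving is applied to total facility energy while IT energy stays unchanged, and the function name `pue_after_saving` is hypothetical.

```python
def pue_after_saving(pue, saving_per_degree, degrees_raised):
    """Estimate PUE after raising the inlet temperature.

    Assumptions (not from the slides): savings add linearly per degree,
    and they reduce total facility energy while IT energy is unchanged.
    """
    total_saving = saving_per_degree * degrees_raised
    if not 0.0 <= total_saving < 1.0:
        raise ValueError("total saving must be a fraction in [0, 1)")
    return pue * (1.0 - total_saving)


# Slide example: PUE = 1.5, 4% saving per degree, 5 degrees warmer.
print(round(pue_after_saving(1.5, 0.04, 5), 2))  # 1.2
# Lower end of the quoted 2-4% range.
print(round(pue_after_saving(1.5, 0.02, 5), 2))  # 1.35
```

A real facility cannot drop below PUE = 1.0, so the linear model is only meaningful for modest temperature increases.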

The Proposed BA framework:

- An Ensemble-of-Experts approach: the sub-paradigms
  - Probabilistic Behavior Analysis
  - Simulated Probabilistic Behavior Analysis
  - Behavior-Time Profile Modeling and Analysis
- Two different pictures:
  - Systemic picture
  - Ecosystemic picture

BA Framework: Systemic picture:

[Figure: systemic picture of the BA framework]

BA Framework: Ecosystemic picture:

[Figure: ecosystemic picture of the BA framework]

Multiple layers in BA framework:

- Layers vs (physical and non-physical) location: Toward Location Intelligence in Computing systems
- Various layers:
  - Hardware (Compute/Network)
  - Hardware Drivers/Software
  - Middleware/Protocols
  - Virtualware
  - Virtualware Drivers/Software
  - Applications (Software)

Sub-paradigm 1: Probabilistic Behavior Analysis:

- Each layer of the system is considered as a graph
- Sub-graphs constitute super-components of higher levels (vertical scaling)
- The behavior is modeled as the Probability of Availability (PoA)
- The PoA is related to the CDF of failure: [equation not captured in the transcript]
- The Differential Density Function (DDF): [equation not captured in the transcript]

Sub-paradigm 1: Probabilistic Behavior Analysis:

An example of a 2-component system: [figure not captured in the transcript]
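Since the slide's own equations and 2-component figure did not survive in the transcript, the following is only a hedged sketch of how PoA values might be combined for two components. Every modeling choice here is an assumption, not the authors' formulation: PoA(t) is taken as 1 - F(t), where F is the failure CDF; the two components fail independently; and an exponential CDF stands in for the Tanh distribution, whose form is not given in the transcript. All function names are hypothetical.

```python
import math


def exp_cdf(t, rate):
    """Stand-in failure CDF F(t) = 1 - exp(-rate * t) (not the Tanh distribution)."""
    return 1.0 - math.exp(-rate * t)


def poa(cdf_value):
    """Assumed relation: probability of availability = 1 - CDF of failure."""
    return 1.0 - cdf_value


def series_poa(poa_a, poa_b):
    """Series pair: both independent components must be available."""
    return poa_a * poa_b


def parallel_poa(poa_a, poa_b):
    """Parallel pair: at least one independent component must be available."""
    return 1.0 - (1.0 - poa_a) * (1.0 - poa_b)


t = 1000.0                       # hours (illustrative)
p_a = poa(exp_cdf(t, 1 / 2000))  # component A
p_b = poa(exp_cdf(t, 1 / 4000))  # component B
print(series_poa(p_a, p_b), parallel_poa(p_a, p_b))
```

Under these assumptions the series PoA is never higher than the weaker component's PoA, and the parallel PoA is never lower than the stronger one's, which is the usual motivation for redundancy.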

Sub-paradigm 1: Tanh distribution:

[Figure panels: Tanh CDFs; Tanh DDFs]

Sub-paradigm 1: Probabilistic Behavior Analysis:

Lanl05 database
- Duration: 9 years
- Retrieved from the Failure Trace Archive (FTA)

Lanl05 database statistics
- Availability statistics:
  - 19874 records
  - mean = 1777.99 hrs
  - std = 3462.33
  - skewness = 3.09
  - goodness-of-fit (GoF) p-value (Tanh) = 0.500
  - GoF p-value (Weibull) = 0.416
- Unavailability statistics:
  - mean = 5.88 hrs
  - std = 78.39
  - skewness = 43.96

Sub-paradigm 2: Simulated Probabilistic Behavior Analysis:

- For highly complex system topologies, the CDFs of high-level sub-graphs and components are estimated by simulation, based on the CDFs of the basic components
- It can also be used to validate the calculations of the first sub-paradigm
- A Monte Carlo strategy is used:
  - In each run, the fault time of each basic component is drawn randomly according to its CDF
  - The cumulative behavior of the high-level sub-graph over all runs is used to estimate its CDF
  - 1000-run simulations have been used
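The Monte Carlo procedure described on this slide can be sketched as follows. This is a minimal illustration rather than the authors' simulator: the exponential CDF is a stand-in for the component distributions, the topology (two components in parallel, in series with a third) is hypothetical, and the failure rates are made up. Only the 1000-run count comes from the slide.

```python
import bisect
import math
import random


def sample_fault_time(rate, rng):
    """Inverse-transform sample from the stand-in CDF F(t) = 1 - exp(-rate * t)."""
    u = rng.random()  # uniform in [0, 1)
    return -math.log(1.0 - u) / rate


def simulate_subgraph(rates, runs=1000, seed=0):
    """Sorted fault times of a hypothetical sub-graph: (c1 parallel c2) in series with c3."""
    rng = random.Random(seed)
    fault_times = []
    for _ in range(runs):
        t1, t2, t3 = (sample_fault_time(r, rng) for r in rates)
        # The parallel pair fails when both have failed; the series chain
        # fails as soon as either the pair or c3 fails.
        fault_times.append(min(max(t1, t2), t3))
    return sorted(fault_times)


def empirical_cdf(sorted_times, t):
    """Fraction of runs in which the sub-graph has failed by time t."""
    return bisect.bisect_right(sorted_times, t) / len(sorted_times)


def analytic_cdf(rates, t):
    """Closed-form CDF of the same topology, used to validate the simulation."""
    f1, f2, f3 = (1.0 - math.exp(-r * t) for r in rates)
    return 1.0 - (1.0 - f1 * f2) * (1.0 - f3)


rates = (1 / 2000.0, 1 / 2000.0, 1 / 8000.0)  # illustrative failures per hour
times = simulate_subgraph(rates)
for t in (500.0, 2000.0, 8000.0):
    print(t, round(empirical_cdf(times, t), 3), round(analytic_cdf(rates, t), 3))
```

Comparing the empirical and closed-form values mirrors the slide's point that simulation can validate the calculations of the first sub-paradigm; for topologies with no convenient closed form, only the simulated estimate would be available.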

Sub-paradigm 2: Simulated Probabilistic Behavior Analysis:

[Figure panels: MC simulation of G_1,1; MC simulation of G_2,1]

Sub-paradigm 2: Simulated Probabilistic Behavior Analysis:

[Figure panels: MC simulation CDFs; MC simulation DDFs]

Sub-paradigm 3: Behavior-Time Profile Modeling and Analysis:

- Time-profile of component characteristics, collected by opportunistic agents across the system (or ecosystem)
- Time-profile of state transitions in components and in higher-level sub-graphs at various layers, collected or injected by BSU
- Machine learning methods are used to match the state transitions with the characteristics:
  - Support Vector Machine (SVM)
  - Bayesian networks
  - Agent-based data mining
  - Fuzzy logic
  - ...

Sub-paradigm 3: Behavior-Time Profile Modeling and Analysis:

Four motivations for behavior-time profile analysis:
1. Spontaneous faults have been reduced significantly compared to cause-and-effect faults
2. Fewer purely hardware-caused faults compared to interaction-caused faults
3. Patterns and cycles in fault occurrence and, in general, in behavior
4. Handling of faulty systems that do not have any faulty components
   - Context-sensitive diagnosis [Lamperti2011]
   - Handling of gradual events

Sub-paradigm 3: Behavior-Time Profile Modeling and Analysis:

A simple example: [figure not captured in the transcript]

SLA and Service Grading:

- Even without considering the elastoplastic use case, BA can help in upgrading a service (for example, to telco grade)
- Probability of Availability (PoA): lease-based business models
  - Predicting, isolating and resolving failure events at component or sub-system levels before they reach the Service Layer
- Probability of Completion (PoC): task-based business models
- Countermeasure options:
  - Taking high-risk components out of service (maintenance tickets)
  - Temporal redundancy
- But all this depends on the ability to predict high risk or failure
- An example:
  - No BA: major fault mode with MTBF = 10 weeks, MTTR = 10 minutes → 52:09 minutes downtime a year < 52:33 → 4 nines
  - With BA: 90% of faults are detected 15 minutes before system failure → 5:13 minutes downtime a year < 5:15 → 5 nines
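The downtime figures in this example can be checked with a short calculation. The sketch below rests on assumptions that reproduce the slide's numbers but are not stated in it: a 365-day year, one MTTR of downtime per fault, faults spaced exactly one MTBF apart, and faults detected in advance causing no service-level downtime. Function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (365-day year assumed)


def yearly_downtime_minutes(mtbf_minutes, mttr_minutes, predicted_fraction=0.0):
    """Expected downtime per year; predicted faults are assumed to cause none."""
    faults_per_year = MINUTES_PER_YEAR / mtbf_minutes
    unpredicted_faults = faults_per_year * (1.0 - predicted_fraction)
    return unpredicted_faults * mttr_minutes


def downtime_budget_minutes(nines):
    """Yearly downtime allowed by an availability of N nines (e.g. 4 -> 99.99%)."""
    return MINUTES_PER_YEAR * 10 ** (-nines)


def as_min_sec(minutes):
    """Format fractional minutes as m:ss, rounded to the nearest second."""
    total_seconds = round(minutes * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


mtbf = 10 * 7 * 24 * 60  # 10 weeks in minutes
no_ba = yearly_downtime_minutes(mtbf, 10)
with_ba = yearly_downtime_minutes(mtbf, 10, predicted_fraction=0.9)

print(as_min_sec(no_ba), no_ba < downtime_budget_minutes(4))      # 52:09 True
print(as_min_sec(with_ba), with_ba < downtime_budget_minutes(5))  # 5:13 True
```

The budgets work out to about 52.56 and 5.26 minutes a year, which the slide quotes as 52:33 and 5:15, so each scenario sits just inside its grade.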

Countermeasures and cost savings:

An example:
- Full system
- Two alternative modes to save both energy (cost) and life expectancy of components

[Figure not captured in the transcript]

Scalability:

- Horizontal and vertical scaling
- Federated scaling

Conclusions and Future Prospects:

- A multi-paradigm, multi-layer, multi-level cognitive behavior analysis framework is introduced
- Three sub-paradigms (cross-cover):
  - Statistical inference
  - Statistical inference by means of simulation
  - Time-profile modeling and analysis
- Multiple granularity analysis and scalability: horizontal, vertical and hierarchical scaling
- Including other layers in the analysis: virtualware and middleware
- Estimation of PoA to improve system dependability and its service grade
- A new distribution is introduced: the Tanh distribution
  - Validated on a real database: the lanl05 database

Future Prospects:
- Large-scale operation of each sub-paradigm
- Cognitive Response: Multi-Expert Decision Making, Cognitive Models
- Integration of the framework with real computing systems: OpenStack, Open GSN
- Machine learning techniques for the time-profile modeling sub-paradigm
- Development of more sophisticated distributions

Thank you. Any questions?:

- Reza FARRAHI MOGHADDAM, Eng., Ph.D., MIEEE, Research Associate
- Fereydoun FARRAHI MOGHADDAM, Eng., M.Sc., MIEEE, PhD Student
- Vahid ASGHARI, Eng., Ph.D., MIEEE, Postdoctoral Fellow
- Mohamed CHERIET, Eng., Ph.D., SMIEEE, Director of Synchromedia Lab

http://www.synchromedia.ca/

BATG, NSERC
