Research Challenges in Autonomic Computing : Research Challenges in Autonomic Computing Jeff Kephart
IBM Research
kephart@us.ibm.com
www.research.ibm.com/autonomic
Outline : Outline Background and Motivation
Autonomic Computing Research at IBM
Architecture
Overview of Research Program
Autonomic Computing Research Challenges
Conclusions
Background and Motivation (Kephart) : Background and Motivation (Kephart) My role in autonomic computing
My group does research on agents and multi-agent systems
Architecture, Communication, Negotiation, Machine learning
AC Research strategy; joint program manager
University relations; faculty awards, equipment grants
Chair, Autonomic Computing Advisory Board
What I hope to achieve here
Stir up interest in autonomic computing research
Explore collaborations with IBM Research
Learn from you: new viewpoints, new approaches
Complex heterogeneous infrastructures are a reality! : Complex heterogeneous infrastructures are a reality!
Autonomic Computing: Motivation : Autonomic Computing: Motivation Individual system elements increasingly difficult to maintain and operate
100s of config, tuning parameters for commercial databases, servers, storage
Heterogeneous systems are becoming increasingly connected
Integration becoming ever more difficult
Architects can't intricately plan component interactions
Increasingly dynamic; more frequently with unanticipated components
This places greater burden on system administrators, but
they are already overtaxed
they are already a major source of cost (6:1 for storage) and error
We need self-managing computing systems
Behavior specified by sys admins via high-level policies
System and its components figure out how to carry out policies
Facets of Self-Management : Facets of Self-Management
Evolving towards Autonomic Computing Systems : Evolving towards Autonomic Computing Systems
Levels run from Manual (Level 1) to Autonomic (Level 5)
Level 1
Characteristics: Multiple sources of system generated data
Skills: Extensive, highly skilled IT staff
Benefits: Basic requirements met
Level 2
Characteristics: Data & actions consolidated through mgt tools
Skills: IT staff analyzes & takes actions
Benefits: Greater system awareness; improved productivity
Level 3
Characteristics: Sys monitors, correlates & recommends actions
Skills: IT staff approves & initiates actions
Benefits: Less need for deep skills; faster/better decision making
Level 4
Characteristics: Sys monitors, correlates & takes action
Skills: IT staff manages performance against SLAs
Benefits: Human/system interaction; IT agility & resiliency
Level 5
Characteristics: Components dynamically respond to business policies
Skills: IT staff focuses on enabling business needs
Benefits: Business policy drives IT mgt; business agility and resiliency
Outline : Outline Background and Motivation
Autonomic Computing Research at IBM
Architecture
Overview of Research Program
AI Research Challenges
Conclusions
Autonomic Computing Architecture The Autonomic Element : Autonomic Computing Architecture The Autonomic Element AEs are the basic atoms of autonomic systems
An AE contains
Exactly one autonomic manager
Zero or more managed element(s)
AE is responsible for
Managing own behavior in accordance with policies
Interacting with other autonomic elements to provide or consume computational services
[Figure: An Autonomic Element, made up of an Autonomic Manager and a Managed Element]
E.g. Database, storage, server, software app, workload mgr, sentinel, arbiter, OGSA infrastructure elements
Service-oriented architecture
Software agents
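The containment rule above (exactly one autonomic manager, zero or more managed elements) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `policies` list and the `cap_connections` policy are hypothetical and are not taken from the IBM toolkit.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ManagedElement:
    """A resource under management (hypothetical stand-in for a database, server, ...)."""
    name: str
    settings: Dict[str, float] = field(default_factory=dict)


@dataclass
class AutonomicManager:
    """Applies policies to the managed elements it is responsible for."""
    # Assumption: a policy is a function that inspects one element and may adjust it.
    policies: List[Callable[[ManagedElement], None]] = field(default_factory=list)

    def manage(self, elements: List[ManagedElement]) -> None:
        for element in elements:
            for policy in self.policies:
                policy(element)


@dataclass
class AutonomicElement:
    """Exactly one autonomic manager plus zero or more managed elements."""
    manager: AutonomicManager
    managed: List[ManagedElement] = field(default_factory=list)

    def step(self) -> None:
        # Manage own behavior in accordance with policies.
        self.manager.manage(self.managed)


def cap_connections(element: ManagedElement) -> None:
    """Hypothetical policy: never allow more than 100 connections."""
    if element.settings.get("max_connections", 0) > 100:
        element.settings["max_connections"] = 100


db = ManagedElement("database", {"max_connections": 250})
ae = AutonomicElement(AutonomicManager([cap_connections]), [db])
ae.step()

# An element with no managed resources (e.g. an arbiter) is also valid.
arbiter = AutonomicElement(AutonomicManager())
arbiter.step()
```

Interactions with other autonomic elements are left out of this sketch; only the manager/managed-element relationship is shown.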
Autonomic Computing Architecture Element interactions : Autonomic Computing Architecture Element interactions System self-* properties, behavior arise from interactions among autonomic managers
Interactions are
Dynamic, ephemeral
Formed by (negotiated) agreement
Flexible in pattern; determined by policies
Based on OGSA and specific AC extensions
Required messages
Optional but standard
Application-specific
For advanced interactions: conversation support
'Choreography' defines structure of multi-step interactions
A multi-agent system!
Overview of IBM’s Autonomic Computing Research Program : Overview of IBM’s Autonomic Computing Research Program Over 150 researchers working on various aspects of Autonomic Computing
Some projects predate AC initiative; now trying to realign them with AC architecture
Technologies for specific autonomic elements
Database, storage, server, client…
Generic element technologies for autonomic elements
Autonomic Manager Toolset integrates many element-level technologies
Modeling, analysis, forecasting, optimization, planning, feedback control, etc.
Uses Open Grid Services Architecture standards for inter-element communication
Available (with ETTK v1.1) on www.alphaworks.ibm.com; open source later
Generic system-level technologies
Dependency management, problem determination and remediation, workload management, provisioning, …
System scenarios and prototypes
Small- to medium-scale autonomic systems
Demonstrate self-* arising from AC architecture + technology
Identify gaps, necessary modifications
LEarning Optimizer for DB2 (LEO) G. Lohman, Almaden : LEarning Optimizer for DB2 (LEO) G. Lohman, Almaden
[Figure: LEO loop around SQL Compilation and the Query; labeled steps 1. Monitor, 3. Feedback, 4. Exploit]
IBM IceCube Server R. Freitas, Almaden : IBM IceCube Server R. Freitas, Almaden
[Figure: a prototype 'Brick' and the full IceCube system]
'Coupler': 10 Gbit/s capacitive, (6) per brick
'Thermal Bus Array'
Prototype Brick (6"):
- (12) 2.5" disks
- 8-port Switch
- Linux on fast CPU
Full IceCube System
blue: Storage Bricks
yellow: Compute Bricks
3D mesh @ 10 Gb/s per link
No connectors, wires, fibers, lasers or fans
Lego-like Collection of 'Intelligent Bricks'
Fail-in-place policy: bad bricks are left in place
7 x smaller than equivalent standard systems
Fast, power-hungry components (CPU etc) ok
Includes resource allocation software
First Application : Petabyte-class Storage Server
intended to be managed by one person
SLEDS (SLA-based management of storage performance) D. Chambliss, Almaden : SLEDS (SLA-based management of storage performance) D. Chambliss, Almaden Storage customers establish SLAs w/ storage system
Storage system throttles optimally in accord w/ SLAs
[Figure: Storage Customers (each with a Cust Policy), SAN Fabric, Storage Server, SLA Server Manager]
Personal software configuration D. Bantz & D. Frank, Watson : Personal software configuration D. Bantz & D. Frank, Watson Automate SW maint & migration on personal devices
'Upgrade all my applications'
'Make my new laptop work like the old one'
'Migrate most valuable Palm apps to my PC'
Autonomic Manager Toolkit W. Arnold et al., Watson : Autonomic Manager Toolkit W. Arnold et al., Watson
Facilitates autonomic mgr construction
In accordance w/ AC architecture
Catcher for generic AM technologies
OGSA messaging
Policy tools
Monitoring technologies
AI tools for knowledge representation, reasoning
Math libraries for modeling, analysis, planning
Feedback control
V1.0 available as part of Emerging Technologies Toolkit v 1.1 on IBM alphaWorks (www.alphaworks.ibm.com)
Considering open source
Policies and Autonomic Computing D. Verma and D. Kandlur, Watson : Policies and Autonomic Computing D. Verma and D. Kandlur, Watson
Policy: Set of guidelines or directives provided to autonomic element to influence its behavior.
Key Challenge:
Move away from low level controls
Move towards high level directives (policies) over autonomic decisions
Developing scenarios, standards and technologies to support policies for autonomic computing
Mathematical Modeling and Optimization M. Squillante, Watson : Mathematical Modeling and Optimization M. Squillante, Watson
Develop and implement sophisticated mathematical methods and algorithms to support AC systems
Modeling
Statistical Analysis
Stochastic Models
Forecasting
Optimization
Discrete
Stochastic
Nonlinear
Control
Control Theory
Dynamical Systems
Chaos
Generic Adaptive Control J. Hellerstein, Watson : Generic Adaptive Control J. Hellerstein, Watson
[Figure: feedback loop around an Apache Server handling web service requests; a Controller adjusts the effectors KeepAlive and MaxClients so that the sensed CPU and Mem track the targets CPU* and Mem*]
Feedback control to tune effectors
Based on high-level behavioral specs
Multiple goals
Multiple effectors
Time-varying demand
Various database and server applications
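A minimal sketch of feedback control tuning one effector toward one behavioral target. It is deliberately simpler than the multi-goal, multi-effector setting listed above. The linear plant model `simulated_cpu`, the gain value and the function names are assumptions made for illustration and do not describe the actual controller from this project.

```python
def simulated_cpu(max_clients: float) -> float:
    """Hypothetical plant model: CPU utilization grows linearly with MaxClients."""
    return min(1.0, 0.004 * max_clients)


def tune_max_clients(target_cpu: float, gain: float, max_clients: float, steps: int):
    """Integral control: move the effector in proportion to the observed error."""
    trace = []
    for _ in range(steps):
        measured = simulated_cpu(max_clients)          # sensor reading
        error = target_cpu - measured                  # e(t) = target - measured
        max_clients = max(1.0, max_clients + gain * error)  # effector update
        trace.append((measured, max_clients))
    return max_clients, trace


final_clients, trace = tune_max_clients(target_cpu=0.6, gain=100.0,
                                        max_clients=50.0, steps=40)
```

With this assumed plant the error shrinks by a constant factor each step, so the loop settles where the simulated CPU utilization equals the target. A real system would need a measured model and a gain chosen for stability.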
Utility Functions and Autonomic Computing W. Walsh, Watson : Utility Functions and Autonomic Computing W. Walsh, Watson
Utility functions can guide autonomic decision making within an element
Self-optimization: natural and flexible way to express optimization criteria based on business objectives
Avoids hard-coded preferences, special-purpose algorithms
Basis for translating business-level objectives into resource allocation objectives
Algorithms based on modeling and optimization
[Figure: utility function V(RT) plotted against response time RT]
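To make the idea of translating a utility function V(RT) into a resource allocation concrete, here is a small sketch. Everything specific in it is assumed for illustration: the piecewise-linear shape of V(RT), the performance model in `response_time`, and the two service classes with their loads, targets and values.

```python
def utility(rt: float, target: float, value: float) -> float:
    """Assumed V(RT): full value up to the target, falling linearly to zero at 2x target."""
    if rt <= target:
        return value
    if rt >= 2 * target:
        return 0.0
    return value * (2 * target - rt) / target


def response_time(load: float, servers: int) -> float:
    """Hypothetical performance model: response time inversely proportional to servers."""
    return float("inf") if servers == 0 else load / servers


def best_allocation(total_servers: int, classes):
    """Exhaustively split servers between two classes given as (load, target_rt, value)."""
    (l1, t1, v1), (l2, t2, v2) = classes
    best = None
    for n1 in range(total_servers + 1):
        n2 = total_servers - n1
        total = (utility(response_time(l1, n1), t1, v1)
                 + utility(response_time(l2, n2), t2, v2))
        if best is None or total > best[0]:
            best = (total, n1, n2)  # keep the first split that reaches the maximum
    return best


# Two hypothetical service classes: (load, target response time, value).
gold = (12.0, 2.0, 100.0)
silver = (8.0, 4.0, 40.0)
total_utility, gold_servers, silver_servers = best_allocation(10, [gold, silver])
```

The business-level preference lives entirely in `utility`; changing it changes the allocation without touching the search, which is the flexibility the bullets above describe.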
Dependency Mgt & Self-Healing G. Kar, Watson and H. Lee & S. Ma, Watson : Dependency Mgt & Self-Healing G. Kar, Watson and H. Lee & S. Ma, Watson Determine functional dependencies among elements
Mine design docs, system config metadata, log files
Actively probe running system
Use dependency information for system management
Localize problem (real-time active inference & learning)
[Figure: Dependency Matrix and Probe Analysis & Control over Router, Web Server, App Server and DB Server (HWS, HAS, HDBS)]
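One simple way to use a dependency matrix for problem localization is sketched below. It assumes a single faulty component and deterministic probes, which is much cruder than the real-time active inference and learning mentioned above. The probe names and the dependency sets are hypothetical.

```python
# Hypothetical dependency matrix: which components each probe depends on.
DEPENDS = {
    "probe_router": {"Router"},
    "probe_web": {"Router", "Web Server"},
    "probe_app": {"Router", "Web Server", "App Server"},
    "probe_db": {"Router", "Web Server", "App Server", "DB Server"},
}


def localize(results):
    """Return suspect components given probe results (probe name -> True if it passed).

    Assumes one fault: the culprit must appear in every failed probe
    and in no passed probe.
    """
    failed = [DEPENDS[p] for p, ok in results.items() if not ok]
    if not failed:
        return set()
    suspects = set.intersection(*failed)  # builds a new set; DEPENDS is not modified
    for probe, ok in results.items():
        if ok:
            suspects -= DEPENDS[probe]
    return suspects


suspects = localize({
    "probe_router": True,
    "probe_web": True,
    "probe_app": False,
    "probe_db": False,
})
```

Here the passing router and web probes clear those two components, and the DB server is cleared because it is not on the path of the failing application probe.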
Human Interaction with Autonomic Systems P. Maglio, Almaden : Human Interaction with Autonomic Systems P. Maglio, Almaden
Basic questions
What do middleware administrators do?
How can we better support the problems and practices they have?
Learn answers to these questions via ethnographic studies
Use insights to develop new ways to interact with complex computing systems
… but we thought that was the return port! We had it wrong. Our assumption of how it worked was incorrect. We start with looking at the proxy server log files, then the web server log files, then the application server admin log files then the application log files.
Enterprise Workload Management D. Dillenberger : Enterprise Workload Management D. Dillenberger
Large, distributed, heterogeneous system
Achieves end-to-end performance via adaptive algorithms
Administrator defines policy
Desired response times for various classes of users, apps
eWLM managers on each resource cooperate to adaptively tune parameters
OS, network, storage, virtual server knobs
JVM heap size, # garbage collection threads
Workload balancing, routing parameters
Example scenario: Autonomic Data Center : Example scenario: Autonomic Data Center
[Figure: Autonomic Data Center with Clients 1-1, 1-2, 2-1 and 2-2; resource-level utility and service-level utility]
Outline : Outline Background and Motivation
Autonomic Computing Research at IBM
Architecture
Overview of Research Program
Scenarios
Autonomic Computing Research Challenges
Systems and Software
Architecture, software engineering & tools, testing/validation
Prototyping a large-scale self-* system
Human-Computer Interaction
Policies, Interfaces
Artificial Intelligence
Learning, Negotiation, Self-healing, Emergent Behavior
Conclusions
Challenge: Architecture : Challenge: Architecture AE: How to coordinate multiple threads of activity?
AEs live in complex environments
Multiple task instances and types
concurrent, asynchronous
Multiple interacting expert modules
AE: How to detect/resolve conflicts arising from
Internal decisions by independent expert modules
External directives (possibly asynchronous)
Internal policies vs. external directives
System-level: Enable more flexible, service-oriented patterns of interaction
As opposed to traditional top-down, hierarchical systems management
Multi-agent architecture
Communication
Representing and reasoning about needs, capabilities, dependencies
Define set of fundamental architectural principles from which self-* emerges
Challenge: Software engineering and programming tools : Challenge: Software engineering and programming tools Develop appropriate software engineering concepts and programming tools for composing autonomic elements and systems; support for
Monitoring, analysis, planning and execution
Expressing and understanding policies
Interactions with other elements
Negotiation
Monitoring and enforcing agreements
Challenge: Testing and Verification : Challenge: Testing and Verification Develop methods for testing and verifying behavior of autonomic elements
testbeds and simulation environments
in situ mechanisms that permit new versions of software to run alongside old versions until they have established their trustworthiness
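The in situ idea of running a new version alongside the old one until it has earned trust can be sketched as a wrapper. The trust criterion used here, a run of consecutive matching results, is an assumption chosen for illustration, and `ShadowRunner` is a hypothetical name.

```python
from typing import Any, Callable


class ShadowRunner:
    """Serve results from the trusted version while a candidate runs alongside it."""

    def __init__(self, old: Callable[[Any], Any], new: Callable[[Any], Any],
                 required_agreements: int):
        self.old = old
        self.new = new
        self.required = required_agreements
        self.streak = 0          # consecutive requests on which both versions agreed
        self.promoted = False

    def __call__(self, request: Any) -> Any:
        if self.promoted:
            return self.new(request)
        trusted = self.old(request)
        try:
            agrees = self.new(request) == trusted
        except Exception:
            agrees = False       # a crash in the candidate counts as a disagreement
        self.streak = self.streak + 1 if agrees else 0
        if self.streak >= self.required:
            self.promoted = True
        return trusted           # callers only ever see the trusted result until promotion


good = ShadowRunner(lambda x: x * 2, lambda x: x + x, required_agreements=3)
buggy = ShadowRunner(lambda x: x * 2, lambda x: x * 2 if x < 2 else 0, required_agreements=3)
good_results = [good(i) for i in range(3)]
buggy_results = [buggy(i) for i in range(3)]
```

Output equality is only one possible notion of trustworthiness; performance or resource use could be compared in the same way.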
Challenge: Policy : Challenge: Policy
Policy: 'Set of guidelines or directives provided to autonomic element to influence its behavior'
Human interface
Authoring and understanding policies
Avoiding or ameliorating specification errors
Developing a universal representation and grammar
Many different application domains, disciplines
Many different flavors of policy
Covers service agreements too?
Algorithms that operate upon policies (and agreements?)
Automated derivation of actions (e.g. planning, optimization)
Automated derivation of lower-level policies from high-level policies
E.g. 'Maximize profit from this set of service contracts'
Conflict resolution
Both design time and run time
Need to establish protocols, interfaces, algorithms
Three flavors of (policy = “decision-making guide”) : Three flavors of (policy = 'decision-making guide') Action rule
If (S) then do a2
Results implicitly in desired state s2
Goal
Achieve a most desired state s2
Compute a2 most likely to result in s2
Assumes that most desired state can be determined a priori
Utility function
Achieve state s with maximal net value V(s) – C(a: S → s)
Benefit and burden of being explicit about value
States have intrinsic value; value of policy is a derived quantity
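The three flavors can be compared side by side in a toy model. All numbers and names below (the outcome probabilities, VALUE, COST, action a1 and a3, state s1 and s3) are hypothetical, and the use of an expected value over uncertain outcomes is an assumption that goes slightly beyond the deterministic form on the slide.

```python
# Hypothetical model: probability that each action leads from current state S to each state.
OUTCOMES = {
    "a1": {"s1": 0.9, "s2": 0.1},
    "a2": {"s1": 0.2, "s2": 0.8},
    "a3": {"s2": 0.3, "s3": 0.7},
}
VALUE = {"s1": 1.0, "s2": 5.0, "s3": 6.0}   # V(s), assumed
COST = {"a1": 0.5, "a2": 1.0, "a3": 4.0}    # C(a), assumed


def action_rule(state: str):
    """Action rule: 'If (S) then do a2'. The desired state s2 is only implicit."""
    return "a2" if state == "S" else None


def goal_policy(goal_state: str) -> str:
    """Goal: compute the action most likely to result in the desired state."""
    return max(OUTCOMES, key=lambda a: OUTCOMES[a].get(goal_state, 0.0))


def utility_policy(cost) -> str:
    """Utility function: pick the action with maximal expected V(s) minus C(a)."""
    def net(action: str) -> float:
        expected_value = sum(p * VALUE[s] for s, p in OUTCOMES[action].items())
        return expected_value - cost[action]
    return max(OUTCOMES, key=net)


rule_choice = action_rule("S")
goal_choice = goal_policy("s2")
utility_choice = utility_policy(COST)
# If a3 becomes cheap, only the utility policy changes its answer.
cheaper = dict(COST, a3=1.0)
utility_choice_cheap_a3 = utility_policy(cheaper)
```

With the original costs all three policies pick a2. When the cost of a3 drops, the utility policy switches without anyone rewriting a rule or a goal, which illustrates the benefit of being explicit about value.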
Policies: Theory meets Reality : Policies: Theory meets Reality We can’t specify the full state of the world
Policy conflicts can arise from incomplete descriptions of state
E.g. different action-rule antecedents can apply to same state, but have conflicting consequents
Goal-type policies can conflict too (sets of acceptable and feasible states don’t intersect)
It’s hard to elicit a full specification of desired behavior from people
Preference elicitation is difficult when there are many attributes
But people are good at noticing when the system isn’t behaving as they like
'Complaint-based tuning' (Ganger, CMU)
Can a universal representation and calculus handle such a broad range?
Storage, network, database, server, etc.
Temporal conditions; correlations
Access control
Classification
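The action-rule conflict described earlier on this slide, where different antecedents apply to the same state but the consequents disagree, can be detected at run time with a check like the following. The rules, state attributes and thresholds are invented for illustration.

```python
from itertools import combinations

# Hypothetical rules: (name, antecedent over the state, attribute it sets, value it sets).
RULES = [
    ("save_power", lambda s: s["load"] < 0.3, "servers", "decrease"),
    ("protect_gold", lambda s: s["gold_rt"] > 2.0, "servers", "increase"),
    ("archive_logs", lambda s: s["disk_used"] > 0.9, "logs", "archive"),
]


def conflicts(state):
    """Return pairs of rule names that both fire in `state` with conflicting consequents."""
    fired = [rule for rule in RULES if rule[1](state)]
    return [
        (a[0], b[0])
        for a, b in combinations(fired, 2)
        if a[2] == b[2] and a[3] != b[3]   # same attribute, different value
    ]


# Low load but slow gold-class response time: two rules pull in opposite directions.
clash = conflicts({"load": 0.2, "gold_rt": 3.0, "disk_used": 0.95})
calm = conflicts({"load": 0.2, "gold_rt": 1.0, "disk_used": 0.5})
```

Neither rule author described the full state, so neither anticipated the overlap. Detecting the clash is the easy part; deciding which rule wins is the open question raised above.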
Challenge: Human-System Interface : Challenge: Human-System Interface Develop new languages, metaphors and translation technologies that enable humans to monitor, visualize, and control AC systems
Specify goals and objectives to AC systems, and visualize their potential effect
Techniques must be
Sufficiently expressive of preferences regarding cost vs. performance, security, risk and reliability
Sufficiently structured and/or naturally suited to human psychology and cognition to keep specification errors to an absolute minimum
Robust to specification errors
Challenge: Learning : Challenge: Learning Single element level
AE needs to learn a model of itself and environment quickly; environment is noisy, and dynamic in both state and structure
On-line, so exploration of the space can be costly and/or harmful
May be several hundreds of tunable parameters!
Maybe only a few dozen are relevant, but which ones?
Some of them can only be changed upon reboot – is it worthwhile?
System level
Multi-agent system: several interacting learners
What are good learning algorithms for cooperative, competitive systems?
What are conditions for stability?
What is sensitivity to perturbations?
Opportunities for layered learning
Establish theoretical foundation for understanding and performing learning and optimization in multi-agent systems.
Challenge: Negotiation : Challenge: Negotiation Develop and analyze
Methods for expressing or computing preferences
Negotiation protocols
Negotiation algorithms
Establish theoretical foundation for negotiation
Explore conditions under which to apply
Bilateral
Multi-lateral (mediated, or not)
Supply-chain
Study how system behavior depends on mixture of negotiation algorithms in AE population
Challenge: Self-Healing Systems : Challenge: Self-Healing Systems
[Figure: GUI; Inference & Learning Engs.; Probe Driver; Real-time Event Mgr; Diagnos. State; Dep. Info, Config; Problem Diagnosis/Localization Mgr; Simulator & Action Mgr; Remediator]
Develop robust, scalable approaches to monitoring/controlling health, security and performance of autonomic systems
Automated capture of human expert knowledge about problem diagnosis and recovery
Predictive, adaptive diagnosis/recovery
Data mining to learn correlated event patterns for diagnosis
Automated learning and execution of appropriate recovery plan
Construction and learning of adaptive statistical models of large networked systems
And do it all without being too invasive!
Challenge: Control and Harness Emergent Behavior : Challenge: Control and Harness Emergent Behavior Understand, control, and exploit emergent behavior in autonomic systems
How do self-*, stability, etc. depend on
Behaviors and goals of the autonomic elements
Pattern and type of interactions among AEs
External influences and demands on system
Invert relationship to attain desired global behavior
How?
Are there fundamental limits?
Develop theory of interacting feedback loops
Hierarchical
Distributed
Outline : Outline Background and Motivation
Autonomic Computing Research at IBM
Architecture
Scenarios
Overview of Research Program
AI Research Challenges
Conclusions
Conclusions : Conclusions
Autonomic Computing is a grand challenge, requiring advances in several fields of science and technology
Policy, planning, learning, knowledge representation, multi-agent systems, negotiation, emergent behavior
Human-system interfaces
Integrating these technologies to support self-management in complex, realistic environments is a research challenge in itself
What are the best architectures and design patterns? Role of (multi-)agent systems?
Building system prototypes is key to developing and validating AC technology and architecture
What to do if you’re interested in working on these problems
Just go do it and publish your results
Find an IBM Researcher who is interested in collaborating with you (I can help)
Get them to help you pursue a faculty award or equipment grant
How can we establish a research community around autonomic computing?
International Conference on Autonomic Computing, May 17-18, 2004, New York City
Co-located with WWW 2004
Co-chair: Manish Parashar
What about defining challenge problems?
We have developed several realistic industry scenarios that could serve as a basis
Additional Information : Additional Information
A Vision of Autonomic Computing
IEEE Computer, January 2003
IBM Systems Journal special issue on Autonomic Computing
http://www.research.ibm.com/journal/sj42-1.html
Web site
www.research.ibm.com/autonomic
International Conference on Autonomic Computing
www.autonomic-conference.org
May 17-18, New York City
Submission deadline: January 12, 2003
Backup Slides : Backup Slides
Other Autonomic Computing Workshops and Conferences : Other Autonomic Computing Workshops and Conferences
First Workshop on Algorithms and Architectures for Self-Managing Systems (at FCRC ’03)
June 11, 2003 in San Diego, CA
5th Annual International Conference on Active Middleware Services: Autonomic Computing Workshop
June 25, 2003 in Seattle, WA
IJCAI-03 AI and Autonomic Computing: Developing a Research Agenda for Self Managing Computer Systems
August 10, 2003 in Acapulco, Mexico
First International Workshop Autonomic Computing Systems at 14th International Conference on Database and Expert Systems Applications (DEXA'2003)
1-5 September, 2003 in Prague, Czech Republic
14th IFIP/IEEE International Workshop on Distributed Systems: Operations & Management (DSOM-03)
October 20-22, 2003 in Heidelberg, Germany
Challenge: Putting it all together into a self-managing system : Challenge: Putting it all together into a self-managing system Autonomic Thermostat scenario
Locus of high-level policy optimization
Authority over thermostats in domain
Local knowledge of environment
Direct control of cooling mechanism
Varying degrees of sophistication
Scenario: Autonomic Thermostat : Scenario: Autonomic Thermostat How much would you pay to get temperature T?
How costly is it to attain temperature T?
Controller Policy: Choose temperature that maximizes U(Temperature) = Value(Temperature) – Cost(Temperature)
Scenario: Autonomic Thermostat : Scenario: Autonomic Thermostat V1(T) – C1(T) ?
Scenario: Autonomic Thermostat Conflict Resolution : Scenario: Autonomic Thermostat Conflict Resolution
Temp. goal = T* +/- d*
Action Policies
1. If (in cooling mode && Tcurr < T* - d*) then turn AC off
2. If (in cooling mode && Tcurr > T* + d*) then turn AC on
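The two action policies derived from the temperature goal T* +/- d* can be written as one function. One detail is an assumption not stated on the slide: when the temperature is inside the band, or the thermostat is not in cooling mode, neither rule fires and the AC keeps its current state.

```python
def ac_action(cooling_mode: bool, t_curr: float, t_goal: float, d: float,
              ac_on: bool) -> bool:
    """Return the new AC on/off state under the two action policies."""
    if cooling_mode and t_curr < t_goal - d:
        return False   # policy 1: colder than T* - d*, turn AC off
    if cooling_mode and t_curr > t_goal + d:
        return True    # policy 2: warmer than T* + d*, turn AC on
    return ac_on       # assumed default: no rule fires, keep the current state


too_warm = ac_action(True, 25.0, 22.0, 1.0, ac_on=False)
too_cold = ac_action(True, 20.0, 22.0, 1.0, ac_on=True)
in_band = ac_action(True, 22.5, 22.0, 1.0, ac_on=True)
```

Because the two antecedents cover disjoint temperature ranges for any d* >= 0, the rules cannot fire together, so this particular pair is conflict-free by construction.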