Abbeel Quigley Ng Uimirl ICML 2006

Uploaded from authorPOINT
Download as
 PPT
Presentation Description 

No description available

authorSTREAM Premium Service
What's up on authorSTREAM?
Views: 33
Like it  ( Likes) Dislike it  ( Dislikes)
Added: September 25, 2007 This Presentation is Public 
Presentation Category : Science & Technology All Rights Reserved
Presentation Transcript

Using Inaccurate Models in Reinforcement Learning: Using Inaccurate Models in Reinforcement Learning Pieter Abbeel, Morgan Quigley and Andrew Y. Ng Stanford University


Overview: Overview Reinforcement learning in high-dimensional continuous state-spaces. Model-based RL: Difficult to build an accurate model. Model-free RL: Often requires large numbers of real-life trials. We present a hybrid algorithm, which requires only an approximate model, a small number of real-life trials. Resulting policy is (locally) near-optimal. Experiments on flight simulator and real RC car.


Reinforcement learning formalism: Markov Decision Process (MDP) M = (S, A, T , H, s0, R ). S = n (continuous state space) Time varying, deterministic dynamics T = { ft : S x A ! S, t = 0,…,H}. Goal: find policy  : S ! A, that maximizes U() = E [ R (st) |  ]. Focus: task of trajectory following. Reinforcement learning formalism H t=0


Motivating Example: Motivating Example Student-driver learning to make a 90 degree right turn Only a few trials needed. No accurate model. Student-driver has access to: Real-life trial. Crude model. Result: good policy gradient estimate.


Algorithm Idea: Input to algorithm: approximate model. Start by computing the optimal policy according to the model. Algorithm Idea Real-life trajectory Target trajectory The policy is optimal according to the model, so no improvement is possible based on the model.


Algorithm Idea (2): Algorithm Idea (2) Update the model such that it becomes exact for the current policy.


Algorithm Idea (2): Algorithm Idea (2) Update the model such that it becomes exact for the current policy.


Algorithm Idea (2): Algorithm Idea (2) The updated model perfectly predicts the state sequence obtained under the current policy. We can use the updated model to find an improved policy.


Algorithm: Algorithm Find the (locally) optimal policy  for the model. Execute the current policy  and record the state trajectory. Update the model such that the new model is exact for the current policy . Use the new model to compute the policy gradient  and update the policy:  :=  +  . Go back to Step 2. Notes: The step-size parameter  is determined by a line search. Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used. In our experiments we used differential dynamic programming.


Performance Guarantees: Intuition: Performance Guarantees: Intuition Exact policy gradient: Model based policy gradient: Evaluation of derivatives along wrong trajectory Derivative of approximate transition function Our algorithm eliminates one (of two) sources of error.


Performance Guarantees: Performance Guarantees Let the local policy improvement algorithm be policy gradient. Notes: These assumptions are insufficient to give the same performance guarantees for model-based RL. The constant K depends only on the dimensionality of the state, action, and policy (), the horizon H and an upper bound on the 1st and 2nd derivatives of the transition model, the policy and the reward function.


Experiments: Experiments We use differential dynamic programming (DDP) to find control policies in the model. Two Systems: Flight Simulator RC Car


Flight Simulator Setup: Flight Simulator Setup Flight simulator model has 43 parameters (mass, inertia, drag coefficients, lift coefficients etc.). We generated 'approximate models' by randomly perturbing the parameters. All 4 standard fixed-wing control actions: throttle, ailerons, elevators and rudder. Our reward function quadratically penalizes for deviation from the desired trajectory.


Flight Simulator Movie: Flight Simulator Movie


Flight Simulator Results: Flight Simulator Results desired trajectory model-based controller our algorithm 76% utility improvement over model-based approach


RC Car Setup: RC Car Setup Control actions: throttle and steering. Low-speed dynamics model with state variables: Position, velocity, heading, heading rate. Model estimated from 30 minutes of data.


RC Car: Open-Loop Turn: RC Car: Open-Loop Turn


RC Car: Circle: RC Car: Circle


RC Car: Figure-8 Maneuver: RC Car: Figure-8 Maneuver


Related Work: Related Work Iterative Learning Control: Uchiyama (1978), Longman et al. (1992), Moore (1993), Horowitz (1993), Bien et al. (1991), Owens et al. (1995), Chen et al. (1997), … Successful robot control with limited number of trials: Atkeson and Schaal (1997), Morimoto and Doya (2001). Robust control theory: Zhou et al. (1995), Dullerud and Paganini (2000), … Bagnell et al. (2001), Morimoto and Atkeson (2002), …


Conclusion: Conclusion We presented an algorithm that uses a crude model and a small number of real-life trials to find a policy that works well in real-life. Our theoretical results show that----assuming a deterministic setting and assuming a reasonable model----our algorithm returns a policy that is (locally) near-optimal. Our experiments show that our algorithm can significantly improve on purely model-based RL by using only a small number of real-life trials, even when the true system is not deterministic.


Slide22:


Motivating Example: Motivating Example Student-driver learning to make a 90 degree right turn Only a few trials needed. No accurate model. Key aspects Real-life trial: shows whether turn is wide or short. Crude model: turning steering wheel more to the right results in sharper turn, turning steering wheel more to the left results in wider turn. Result: good policy gradient estimate.