logging in or signing up VRP15 Alohomora Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 63 Category: Education License: All Rights Reserved Like it (0) Dislike it (0) Added: February 26, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles: A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles Scott Proper Oregon State University Prasad Tadepalli Hong Tang Rasaratnam Logendran Vehicle Routing & Product Delivery: Vehicle Routing & Product DeliveryContributions of our Research: Contributions of our Research Multiple vehicle product delivery is a well-studied problem in operations research We have formulated this problem as an average reward reinforcement learning (RL) problem We have combined inventory control with vehicle routing We have scaled RL methods to work with large state spacesMarkov Decision Processes: Markov Decision Processes Action a Actions are stochastic: Pi,j(a) Actions have costs or rewards: ri(a) Move Unload UnloadAverage Reward Reinforcement Learning: Average Reward Reinforcement Learning Goal: Maximize average reward/time step Minimize stockout penalty + movement penalty Policy: states → actions Value function: states → real values expected long-term reward from a state, relative to other states, when following the optimal policyH-Learning: H-Learning The value function satisfies the Bellman equation: The optimal action a* maximizes the immediate reward + expected value of the next state H-Learning is a real-time algorithm for solving the value functionH-Learning: an example 1: H-Learning: an example 1 -.1, 1/1 0, 9/9 0, 0/9 A E D C BH-Learning: an example 2: H-Learning: an example 2 Stockout penalty: -20 A E D C B -.1, 1/1 0, 9/10 -20, 1/10H-Learning: an example 3: H-Learning: an example 3 A E D C B -.1, 1/1 0, 9/10 -20, 1/10H-Learning: an example 4: H-Learning: an example 4 Move penalty: -.1 A E D C B -.1, 2/2 0, 9/10 -20, 1/10On-line Product Delivery: On-line Product Delivery Deliver 1 product 9 truck actions: 4 levels of unload 4 move directions wait P(Inventory decrease | shop) Stockout penalty: -20 Movement penalty: -.1 5 Shops DepotThe problem of state-space explosion: The problem of state-space explosion The loads of trucks and shop inventories are discretized into 5 levels States grow exponentially in shops and trucks 10 locations, 5 shops, 2 trucks = (102)(55)(52) = 7,812,500 states 5 trucks = 976,592,500,000 states Table-based methods take too much time and spacePiecewise Linear Function Approximation: Piecewise Linear Function Approximation We use a different linear function for each possible 5-tuple of locations l1,…, l5 of trucks Each function is linear in truck loads and shop inventories Every function represents 10 million states million-fold reduction of learnable parametersPiecewise linear function approximation vs. table-based: Piecewise linear function approximation vs. table-basedStoring and using the action models: Storing and using the action models Problem: exponential time to determine the expected value of the next state: - Each shop’s consumption is independent - Value function is piecewise linear ? ? ? ?Ignoring Truck Identity: Ignoring Truck Identity m = number of locations (10) k = number of trucks (2-5) 5 trucks: 105 functions Learnable parameters: 1.1 million 2002 functions Learnable parameters: 22,022 mkThe problem of action-space explosion: The problem of action-space explosion Every action a is a vector of individual “truck actions” a = (a1, a2,…,an) Actions grow exponentially in the number of trucks 9 “truck actions” For 2 trucks: 92 = 81 total actions For 5 trucks: 95 = 59,049 total actionsHill Climbing Search: Hill Climbing Search We initialize the vector of truck actions a to all “wait” actions We use hill climbing to reach a local optimum Randomly perturb a truck action, repeat This results in an order-of-magnitude improvement in search timeHill climbing vs. exhaustive search for 4 and 5 trucks: Hill climbing vs. exhaustive search for 4 and 5 trucksConclusion: Conclusion Average-reward RL and Piecewise linear function approximation are promising approaches for real-time product delivery Hill climbing shows great potential for speeding up search in domains with a large action space Problems of scaling are surmountable Future Work: Future Work Scaling! More trucks, more locations, more shops, more depots, and more items Allowing trucks to move with non-uniform speeds (event-based model needed) Real-valued shop inventory and truck load levels You do not have the permission to view this presentation. In order to view it, please contact the author of the presentation.
VRP15 Alohomora Download Post to : URL : Related Presentations : Share Add to Flag Embed Email Send to Blogs and Networks Add to Channel Uploaded from authorPOINTLite Insert YouTube videos in PowerPont slides with aS Desktop Copy embed code: (To copy code, click on the text box) Embed: URL: Thumbnail: WordPress Embed Customize Embed The presentation is successfully added In Your Favorites. Views: 63 Category: Education License: All Rights Reserved Like it (0) Dislike it (0) Added: February 26, 2008 This Presentation is Public Favorites: 0 Presentation Description No description available. Comments Posting comment... Premium member Presentation Transcript A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles: A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles Scott Proper Oregon State University Prasad Tadepalli Hong Tang Rasaratnam Logendran Vehicle Routing & Product Delivery: Vehicle Routing & Product DeliveryContributions of our Research: Contributions of our Research Multiple vehicle product delivery is a well-studied problem in operations research We have formulated this problem as an average reward reinforcement learning (RL) problem We have combined inventory control with vehicle routing We have scaled RL methods to work with large state spacesMarkov Decision Processes: Markov Decision Processes Action a Actions are stochastic: Pi,j(a) Actions have costs or rewards: ri(a) Move Unload UnloadAverage Reward Reinforcement Learning: Average Reward Reinforcement Learning Goal: Maximize average reward/time step Minimize stockout penalty + movement penalty Policy: states → actions Value function: states → real values expected long-term reward from a state, relative to other states, when following the optimal policyH-Learning: H-Learning The value function satisfies the Bellman equation: The optimal action a* maximizes the immediate reward + expected value of the next state H-Learning is a real-time algorithm for solving the value functionH-Learning: an example 1: H-Learning: an example 1 -.1, 1/1 0, 9/9 0, 0/9 A E D C BH-Learning: an example 2: H-Learning: an example 2 Stockout penalty: -20 A E D C B -.1, 1/1 0, 9/10 -20, 1/10H-Learning: an example 3: H-Learning: an example 3 A E D C B -.1, 1/1 0, 9/10 -20, 1/10H-Learning: an example 4: H-Learning: an example 4 Move penalty: -.1 A E D C B -.1, 2/2 0, 9/10 -20, 1/10On-line Product Delivery: On-line Product Delivery Deliver 1 product 9 truck actions: 4 levels of unload 4 move directions wait P(Inventory decrease | shop) Stockout penalty: -20 Movement penalty: -.1 5 Shops DepotThe problem of state-space explosion: The problem of state-space explosion The loads of trucks and shop inventories are discretized into 5 levels States grow exponentially in shops and trucks 10 locations, 5 shops, 2 trucks = (102)(55)(52) = 7,812,500 states 5 trucks = 976,592,500,000 states Table-based methods take too much time and spacePiecewise Linear Function Approximation: Piecewise Linear Function Approximation We use a different linear function for each possible 5-tuple of locations l1,…, l5 of trucks Each function is linear in truck loads and shop inventories Every function represents 10 million states million-fold reduction of learnable parametersPiecewise linear function approximation vs. table-based: Piecewise linear function approximation vs. table-basedStoring and using the action models: Storing and using the action models Problem: exponential time to determine the expected value of the next state: - Each shop’s consumption is independent - Value function is piecewise linear ? ? ? ?Ignoring Truck Identity: Ignoring Truck Identity m = number of locations (10) k = number of trucks (2-5) 5 trucks: 105 functions Learnable parameters: 1.1 million 2002 functions Learnable parameters: 22,022 mkThe problem of action-space explosion: The problem of action-space explosion Every action a is a vector of individual “truck actions” a = (a1, a2,…,an) Actions grow exponentially in the number of trucks 9 “truck actions” For 2 trucks: 92 = 81 total actions For 5 trucks: 95 = 59,049 total actionsHill Climbing Search: Hill Climbing Search We initialize the vector of truck actions a to all “wait” actions We use hill climbing to reach a local optimum Randomly perturb a truck action, repeat This results in an order-of-magnitude improvement in search timeHill climbing vs. exhaustive search for 4 and 5 trucks: Hill climbing vs. exhaustive search for 4 and 5 trucksConclusion: Conclusion Average-reward RL and Piecewise linear function approximation are promising approaches for real-time product delivery Hill climbing shows great potential for speeding up search in domains with a large action space Problems of scaling are surmountable Future Work: Future Work Scaling! More trucks, more locations, more shops, more depots, and more items Allowing trucks to move with non-uniform speeds (event-based model needed) Real-valued shop inventory and truck load levels