Title: Avoiding failure states during reinforcement learning
Author: Van Diepen, M.D.M.
Contributor: Jonker, P.P. (mentor); Schuitema, E. (mentor)
Faculty: Mechanical, Maritime and Materials Engineering
Department: BioMechanical Engineering
Programme: Biorobotics
Date: 2011-02-24

Abstract:
The Delft Biorobotics Laboratory develops bipedal humanoid robots. One of these robots, called LEO, is designed to learn to walk using reinforcement learning. During learning, LEO will make mistakes and fall. These mistakes can cause serious damage to the system, but they are an integral part of the learning process. An obvious remedy is to punish the robot more severely for falling; however, punishing the robot too rigorously can lead to a robot that is too cautious to take a step. In this research, three methods were tested that reduce the number of falls during learning without restricting the possible solutions or increasing the learning time.

We introduce Threshold Restricted Learning (TRL), a new action-selection method that, during exploration, makes the probability of choosing an action dependent on the expected reward for taking that action. Actions with expected rewards below a set threshold have a significantly reduced probability of being chosen (a minimal sketch of this selection rule is given below the record). The concept of TRL developed from the desire to optimize the use of a pre-learned solution; hence, TRL was tested after pre-learning in simulation. The largest reduction in the number of falls achieved by TRL in this research was 50%, and TRL did not increase the learning time. Without pre-learning, the largest reduction was 7%.

Softmax action selection is a well-known but underused selection method. Combined with pre-learning, it was able to reduce the number of falls during learning by approximately 80%, and it did not increase the learning time either. Without pre-learning, Softmax could still achieve a reduction of approximately 20%.

Sarsa2Q, another new method, stores expected rewards at different levels of generalization. The level with little generalization is used to store the expected total positive reward, while the level with more generalization is used to store the expected total negative reward (a sketch of this two-level value store follows the TRL sketch below). Sarsa2Q learned to avoid the failure states of the inverted pendulum problem faster than learning with a single level of generalization. This method does not use pre-learning. The highest achieved reduction in the number of falls was approximately 20%. Sarsa2Q can have a broader use than just avoiding failure states: rather than using the coarse generalization only for negative rewards, it can be used for all rewards that need less precision.

Subject: reinforcement learning
To reference this document use: http://resolver.tudelft.nl/uuid:1f03c580-9fd5-4807-87b5-d70890e05ff6
Embargo date: 2011-02-25
Part of collection: Student theses
Document type: master thesis
Rights: (c) 2011 Van Diepen, M.D.M.
Files: Thesis_M_van_Diepen.pdf (PDF, 5.64 MB)
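The abstract describes TRL only at a high level. The following is a minimal Python sketch of one way such a selection rule could work in a discrete-action setting: actions whose Q-value falls below the threshold keep only a small residual probability, and the remaining probability mass is spread over the admissible actions via a softmax. The function name, the parameters (threshold, temperature, low_prob), and the use of a softmax over admissible actions are assumptions for illustration, not the thesis's actual algorithm.

```python
import numpy as np

def trl_select(q_values, threshold, temperature=1.0, low_prob=0.01, rng=None):
    """Threshold Restricted Learning action selection (illustrative sketch).

    Actions with Q-values below `threshold` keep only the residual
    probability `low_prob`; the rest of the probability mass goes to
    the above-threshold actions, weighted by a softmax. Assumes
    low_prob * (number of excluded actions) < 1.
    """
    rng = rng or np.random.default_rng()
    q = np.asarray(q_values, dtype=float)
    above = q >= threshold

    if not above.any():
        # No action clears the threshold: fall back to a plain softmax.
        above = np.ones_like(q, dtype=bool)

    probs = np.full(q.shape, low_prob)
    # Softmax over admissible actions (subtracting the max for stability).
    z = np.exp((q[above] - q[above].max()) / temperature)
    probs[above] = (1.0 - low_prob * (~above).sum()) * z / z.sum()

    return rng.choice(len(q), p=probs)
```

For example, with `q_values = [1.0, -5.0, 0.5]` and `threshold = 0.0`, the second action (e.g., one expected to lead to a fall) is chosen only with probability `low_prob`, while exploration continues normally among the other two.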
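The abstract's description of Sarsa2Q suggests two value stores at different resolutions, with the coarse one generalizing punishments (such as falls) over a wide region of the state space. The sketch below illustrates that idea for a one-dimensional state normalized to [0, 1); the grid sizes, the split TD targets, and the class layout are assumptions for illustration, not the thesis's implementation.

```python
import numpy as np

class TwoLevelQ:
    """Two-level value store in the spirit of Sarsa2Q (illustrative sketch).

    Expected total positive reward lives on a fine grid (little
    generalization); expected total negative reward lives on a coarse
    grid (much generalization), so a single fall affects a wide
    region of the state space.
    """

    def __init__(self, n_actions, fine_bins=100, coarse_bins=10, alpha=0.1):
        self.q_pos = np.zeros((fine_bins, n_actions))    # little generalization
        self.q_neg = np.zeros((coarse_bins, n_actions))  # much generalization
        self.fine_bins = fine_bins
        self.coarse_bins = coarse_bins
        self.alpha = alpha

    def _bins(self, s):
        # State s is assumed normalized to [0, 1).
        return int(s * self.fine_bins), int(s * self.coarse_bins)

    def value(self, s, a):
        # Acting uses the sum of both components.
        f, c = self._bins(s)
        return self.q_pos[f, a] + self.q_neg[c, a]

    def update(self, s, a, target_pos, target_neg):
        # Separate TD targets for the positive and negative return components.
        f, c = self._bins(s)
        self.q_pos[f, a] += self.alpha * (target_pos - self.q_pos[f, a])
        self.q_neg[c, a] += self.alpha * (target_neg - self.q_neg[c, a])
```

The design point the abstract makes is that the coarse level need not be reserved for negative rewards; any reward component that needs less precision could be stored there in the same way.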