Controlling the estimation bias in deep reinforcement learning problems with sparse rewards: Towards robust robotic object manipulation learning

Varga, Roland

Title

Controlling the estimation bias in deep reinforcement learning problems with sparse rewards: Towards robust robotic object manipulation learning

Author

Varga, Roland (TU Delft Mechanical, Maritime and Materials Engineering; TU Delft Delft Center for Systems and Control)

Contributor

Boskos, D. (mentor)
Plooij, M. (mentor)
Kober, J. (graduation committee)
Kok, M. (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Mechanical Engineering | Systems and Control

Date

2023-01-27

Abstract

Many recent robot learning problems, both real and simulated, have been addressed using deep reinforcement learning. The resulting policies can handle high-dimensional, continuous state and action spaces and can also incorporate machine-generated or human demonstration data. Many of these methods, especially those in the actor-critic framework, depend on state-action value estimates. Deriving unbiased estimates for these values is still an open research question, largely because the connection between accurate value estimates and system performance is not yet well understood. This thesis makes three main research contributions. First, it analyzes the connection between value estimates and performance for the TD3 algorithm. Second, it derives theoretical bounds on the true value function for environments where a reward is given only for successful completion of a task (sparse/binary reward). Third, it adds a deliberate underestimation objective to the TD3 algorithm, together with the theoretical bounds, to improve system performance when using human demonstration data that covers only a specific part of the state and action space. All algorithms are tested and evaluated on simulated robot manipulation tasks in the robosuite environment, where the robot is first trained on the demonstration data and can then gather more experience in the simulation. Results show that deliberate underestimation combined with the value bounds enables the robot to learn from human demonstrations, which was not possible with standard TD3. Additionally, applying only the value bounds speeds up learning when using machine-generated datasets.
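As a rough illustration of how value bounds can enter a TD3-style target, the sketch below clips the bootstrapped value to a generic range. It is not the thesis's derivation: it only assumes that every reward lies in [0, 1], so that any discounted return lies in [0, 1/(1 - gamma)], and it uses the standard TD3 minimum over two target critics. The function names, arguments, and the bound itself are illustrative assumptions; the deliberate underestimation objective is not reproduced here.

```python
def value_bounds(gamma, r_min=0.0, r_max=1.0):
    """Generic bounds on a discounted return when each reward is in [r_min, r_max].

    Assumption for illustration only; the thesis derives its own bounds.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return r_min / (1.0 - gamma), r_max / (1.0 - gamma)


def clipped_td_target(reward, done, q1_next, q2_next, gamma):
    """TD3-style target with the bootstrapped value clipped to the bounds."""
    lower, upper = value_bounds(gamma)
    # Standard TD3: take the minimum of the two target critics.
    next_value = min(q1_next, q2_next)
    # Hypothetical extra step: keep the estimate inside the feasible range.
    next_value = max(lower, min(upper, next_value))
    # No bootstrapping past a terminal transition.
    return reward + (0.0 if done else gamma * next_value)


# With gamma = 0.5 the upper bound is 2.0, so an overestimate of 3.0 is clipped.
print(clipped_td_target(0.0, False, 5.0, 3.0, 0.5))  # → 1.0
```

Clipping of this kind only removes estimates that are impossible under the assumed reward range; how tight the bounds are, and how they interact with demonstration data, depends on the derivation in the thesis itself.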

Subject

Deep Reinforcement Learning
Estimation Bias
Object Manipulation
Robot Learning
TD3
Robosuite

To reference this document use:

http://resolver.tudelft.nl/uuid:9572a11d-5664-4fbb-91d1-c32f6aa49102

Part of collection

Student theses

Document type

master thesis

Rights

Files

PDF

MScThesis_EstimationBias.pdf

7.97 MB
