Title: Deep Exploration by Planning With Uncertainty in Deep Model-Based Reinforcement Learning
Author: Oren, Yaniv (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributors: Böhmer, Wendelin (mentor); Spaan, M.T.J. (mentor)
Degree granting institution: Delft University of Technology
Programme: Computer Science
Date: 2022-07-22
Abstract: Deep, model-based reinforcement learning has achieved state-of-the-art, human-exceeding performance in many challenging domains. However, low sample efficiency and limited exploration remain leading obstacles in the field. In this work, we incorporate epistemic uncertainty into planning for better exploration. We develop a low-cost framework for estimating this uncertainty and computing how it propagates through planning with a learned model. We propose a new method, "planning for exploration", that uses the propagated uncertainty to infer, in real time, the best action for exploration. The resulting exploration is informed, sequential over multiple time steps, and acts with respect to uncertainty in decisions that lie multiple steps in the future (deep exploration). To evaluate our method with the state-of-the-art algorithm MuZero, we incorporate different uncertainty estimation mechanisms, modify the Monte-Carlo tree search planning used by MuZero to incorporate our framework, and overcome the challenges associated with learning from off-policy, exploratory trajectories with an algorithm that learns from on-policy targets.
Our results demonstrate that planning for exploration achieves effective deep exploration even when deployed with an algorithm that learns from on-policy targets and uses standard, scalable uncertainty estimation mechanisms. We further provide an ablation study illustrating that the methodology we propose for generating on-policy targets from exploratory trajectories is effective at alleviating the adverse effects of training with trajectories that were not sampled from an exploitative policy. We provide full access to our implementation and our algorithmic contributions through GitHub.
Subjects: Reinforcement Learning; Exploration; Model-Based; Uncertainty; Planning
To reference this document use: http://resolver.tudelft.nl/uuid:f0bc9065-daa8-4da2-adf9-d78affdb7b99
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Yaniv Oren
Files: Yaniv_Oren_MSc_Thesis.pdf (PDF, 1.24 MB)