Title: Deep Exploration by Planning With Uncertainty in Deep Model-Based Reinforcement Learning
Author: Oren, Yaniv (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributors: Böhmer, Wendelin (mentor); Spaan, M.T.J. (mentor)
Degree granting institution: Delft University of Technology
Programme: Computer Science
Date: 2022-07-22
Abstract: Deep, model-based reinforcement learning has achieved state-of-the-art, human-exceeding performance in many challenging domains. However, low sample efficiency and limited exploration remain leading obstacles in the field. In this work, we incorporate epistemic uncertainty into planning for better exploration. We develop a low-cost framework for estimating this uncertainty and computing how it propagates through planning with a learned model. We propose a new method, "planning for exploration", that uses the propagated uncertainty to infer, in real time, the best action for exploration. The resulting exploration is informed, sequential over multiple time steps, and acts with respect to uncertainty in decisions that lie multiple steps in the future (deep exploration). To evaluate our method with the state-of-the-art algorithm MuZero, we incorporate different uncertainty estimation mechanisms, modify the Monte-Carlo tree search planning used by MuZero to incorporate our framework, and overcome the challenges associated with learning from off-policy, exploratory trajectories with an algorithm that learns from on-policy targets.
Our results demonstrate that planning for exploration achieves effective deep exploration even when deployed with an algorithm that learns from on-policy targets and uses standard, scalable uncertainty estimation mechanisms. We further provide an ablation study illustrating that the methodology we propose for generating on-policy targets from exploratory trajectories is effective at alleviating the adverse effects of training with trajectories that were not sampled from an exploitative policy. We provide full access to our implementation and our algorithmic contributions through GitHub.
Subjects: Reinforcement Learning; Exploration; Model-Based; Uncertainty; Planning
To reference this document use: http://resolver.tudelft.nl/uuid:f0bc9065-daa8-4da2-adf9-d78affdb7b99
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Yaniv Oren
Files: Yaniv_Oren_MSc_Thesis.pdf (PDF, 1.24 MB)