DMQL: Deep Maximum Q-Learning

Title: DMQL: Deep Maximum Q-Learning: Combatting Relative Overgeneralisation in Deep Independent Learners using Optimism and Similarity
Author: Dam, Erwin (TU Delft Electrical Engineering, Mathematics and Computer Science; TU Delft Algorithmics)
Contributors: Böhmer, Wendelin (mentor); Spaan, M.T.J. (mentor); Oliehoek, F.A. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science | Artificial Intelligence
Date: 2022-08-29

Abstract: Various pathologies can occur when independent learners are used in cooperative Multi-Agent Reinforcement Learning. One such pathology is Relative Overgeneralisation, which manifests when a suboptimal Nash Equilibrium in the joint action space of a problem is preferred over an optimal Equilibrium. Approaches exist to combat relative overgeneralisation in Q-Learning problems, yet many do not scale well with the state space or joint action space, are hard to adapt or configure, or are not applicable in partially observable environments.

In this work, we introduce Deep Maximum Q-Learning (DMQL), a methodology combining Deep Recurrent Q-Networks [Hausknecht & Stone, 2015] with the optimistic assumption found in Distributed Q-Learning [Lauer & Riedmiller, 2000]. DMQL is a maximum-based learning technique that can be scheduled to transition to an average-based learner (or any other type of learner) and that works with independent learners without communication. DMQL is designed to be relatively intuitive, easy to adapt and configure, and able to utilise notions of similarity to provide solutions in large and continuous state spaces. DMQL clusters similar histories by mapping them to the same hash based on a subset of the information contained within them, such as the current observation, or on other related available information sources, such as state information.
Using these hashes, DMQL constructs a hash-action pseudo-maximum Q-value estimation dictionary which is updated at every gradient update step. A dictionary value degradation technique ensures stability: it prevents overestimations from being retained in the dictionary by decaying them after they have been encountered. This way, optimism is introduced and relative overgeneralisation is prevented without relying on true maximums of past Q-value estimates, as these are not guaranteed to be indicative of the real optimal Q-values. In contrast to similar deep learning methodologies [Palmer et al., 2017], DMQL augments Deep Q-Network targets through value replacement instead of value discarding, potentially leading to improved efficiency. In addition, DMQL can be adapted for use as a maximisation-based step in the greater learning process of other deep learning algorithms.

Our experimental results indicate that DMQL is a successful extension of Distributed Q-Learning, which can be used in small environments even without the use of similarity. Using similarity, however, allows learning in increasingly large and complex environments. Interestingly, several problems arise when developing a suitable way to incorporate similarity into the hashes. We speculate on how these problems can be prevented or circumvented, and our experiments validate our circumvention methods. Lastly, our experiments show that DMQL can also be successfully applied to combat relative overgeneralisation in partially observable environments.
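The mechanism the abstract describes — clustering histories by hash, tracking decayed pseudo-maximum Q-value estimates per hash-action pair, and replacing (not discarding) DQN targets — could be sketched as below. This is an illustrative reading of the abstract only, not the thesis implementation: the names (`PseudoMaxTable`, `history_hash`), the subtractive decay scheme, and the discretisation used for hashing are all assumptions.

```python
from typing import Dict, Hashable, List, Tuple


def history_hash(observation: List[float], n_bins: int = 10) -> Tuple[int, ...]:
    """Cluster similar histories by hashing a discretised view of the current
    observation (one possible choice of the 'subset of information')."""
    return tuple(round(x * n_bins) for x in observation)


class PseudoMaxTable:
    """Sketch of a hash-action pseudo-maximum Q-value dictionary with decay.
    The exact degradation rule in DMQL may differ; this one is an assumption."""

    def __init__(self, decay: float = 0.05):
        self.decay = decay  # amount subtracted each time an entry is revisited
        self.table: Dict[Tuple[Hashable, int], float] = {}

    def update(self, h: Hashable, action: int, q_estimate: float) -> float:
        """Track a decayed maximum of Q-value estimates for (hash, action).
        Decaying stored values keeps overestimations from persisting forever."""
        key = (h, action)
        if key in self.table:
            # degrade the stored pseudo-maximum, then keep the larger value
            self.table[key] = max(self.table[key] - self.decay, q_estimate)
        else:
            self.table[key] = q_estimate
        return self.table[key]

    def replace_target(self, h: Hashable, action: int, dqn_target: float) -> float:
        """Optimistic target replacement: use the stored pseudo-maximum when it
        exceeds the ordinary DQN target (value replacement, not discarding)."""
        return max(dqn_target, self.table.get((h, action), dqn_target))
```

In this reading, `update` would be called at every gradient update step with the network's current Q-value estimates, and `replace_target` when forming the DQN regression targets; as the decay accumulates, the pseudo-maximums fade and learning can be scheduled to behave like an ordinary average-based learner.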
Subject: Optimism; Relative overgeneralisation; Similarity; Deep Learning; Deep Q-Learning; Deep Q-Network; Partial Observability; Recurrent Neural Network
To reference this document use: http://resolver.tudelft.nl/uuid:9c93c641-80bf-4684-a6cd-ed730e45f259
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Erwin Dam
Files: Erwin_Dam_MSc_Thesis_Report.pdf (PDF, 2.57 MB)