DMQL: Deep Maximum Q-Learning

Title: DMQL: Deep Maximum Q-Learning: Combatting Relative Overgeneralisation in Deep Independent Learners using Optimism and Similarity
Author: Dam, Erwin (TU Delft Electrical Engineering, Mathematics and Computer Science; TU Delft Algorithmics)
Contributors: Böhmer, Wendelin (mentor); Spaan, M.T.J. (mentor); Oliehoek, F.A. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science | Artificial Intelligence
Date: 2022-08-29

Abstract: Various pathologies can occur when independent learners are used in cooperative Multi-Agent Reinforcement Learning. One such pathology is Relative Overgeneralisation, which manifests when a suboptimal Nash Equilibrium in the joint action space of a problem is preferred over an optimal Equilibrium. Approaches exist to combat relative overgeneralisation in Q-Learning problems, yet many do not scale well with the state space or joint action space, are hard to adapt or configure, or are not applicable in partially observable environments.

In this work, we introduce Deep Maximum Q-Learning (DMQL), a methodology combining Deep Recurrent Q-Networks [Hausknecht & Stone, 2015] with the optimistic assumption found in Distributed Q-Learning [Lauer & Riedmiller, 2000]. DMQL is a maximum-based learning technique that can be scheduled to transition to an average-based learner (or any other type of learner) and that works with independent learners without communication. DMQL is designed to be relatively intuitive, easy to adapt and configure, and able to utilise notions of similarity to provide solutions in large and continuous state spaces. DMQL clusters similar histories by mapping them to the same hash based on a subset of the information contained within them, such as the current observation, or on other related available information sources, such as state information.
Using these hashes, DMQL constructs a hash-action pseudo-maximum Q-value estimation dictionary which is updated at every gradient update step. A dictionary value degradation technique ensures stability: it prevents overestimations from being retained in the dictionary by decaying them after they have been encountered. This way, optimism is introduced and relative overgeneralisation is prevented without relying on true maximums of past Q-value estimates, as these are not guaranteed to be indicative of the real optimal Q-values. In contrast to similar deep learning methodologies [Palmer et al., 2017], DMQL augments Deep Q-Network targets through value replacement instead of value discarding, potentially leading to improved efficiency. In addition, DMQL can be adapted for use as a maximisation-based step in the greater learning process of other deep learning algorithms.

Our experimental results indicate that DMQL is a successful extension of Distributed Q-Learning, which can be used in small environments even without the use of similarity. Using similarity, however, allows learning in increasingly large and complex environments. Interestingly, several problems arise when developing a suitable way to incorporate similarity into the hashes. We speculate on how these problems can be prevented or circumvented, and our experiments validate our circumvention methods. Lastly, our experiments show that DMQL can also be successfully applied to combat relative overgeneralisation in partially observable environments.
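The mechanism the abstract describes — clustering histories by hash, tracking decayed pseudo-maximum Q-value estimates per hash-action pair, and replacing (not discarding) DQN targets — could be sketched as below. This is an illustrative reading of the abstract only, not the thesis implementation: the names (`PseudoMaxTable`, `history_hash`), the subtractive decay scheme, and the discretisation used for hashing are all assumptions.

```python
from typing import Dict, Hashable, List, Tuple


def history_hash(observation: List[float], n_bins: int = 10) -> Tuple[int, ...]:
    """Cluster similar histories by hashing a discretised view of the current
    observation (one possible choice of the 'subset of information')."""
    return tuple(round(x * n_bins) for x in observation)


class PseudoMaxTable:
    """Sketch of a hash-action pseudo-maximum Q-value dictionary with decay.
    The exact degradation rule in DMQL may differ; this one is an assumption."""

    def __init__(self, decay: float = 0.05):
        self.decay = decay  # amount subtracted each time an entry is revisited
        self.table: Dict[Tuple[Hashable, int], float] = {}

    def update(self, h: Hashable, action: int, q_estimate: float) -> float:
        """Track a decayed maximum of Q-value estimates for (hash, action).
        Decaying stored values keeps overestimations from persisting forever."""
        key = (h, action)
        if key in self.table:
            # degrade the stored pseudo-maximum, then keep the larger value
            self.table[key] = max(self.table[key] - self.decay, q_estimate)
        else:
            self.table[key] = q_estimate
        return self.table[key]

    def replace_target(self, h: Hashable, action: int, dqn_target: float) -> float:
        """Optimistic target replacement: use the stored pseudo-maximum when it
        exceeds the ordinary DQN target (value replacement, not discarding)."""
        return max(dqn_target, self.table.get((h, action), dqn_target))
```

In this reading, `update` would be called at every gradient update step with the network's current Q-value estimates, and `replace_target` when forming the DQN regression targets; as the decay accumulates, the pseudo-maximums fade and learning can be scheduled to behave like an ordinary average-based learner.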
Subject: Optimism; Relative overgeneralisation; Similarity; Deep Learning; Deep Q-Learning; Deep Q-Network; Partial Observability; Recurrent Neural Network
To reference this document use: http://resolver.tudelft.nl/uuid:9c93c641-80bf-4684-a6cd-ed730e45f259
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Erwin Dam
Files: Erwin_Dam_MSc_Thesis_Report.pdf (PDF, 2.57 MB)