Deep reinforcement learning has undergone unprecedented development over the last decade, producing agents that defeat world-champion human players in complex board games such as Go and chess. With few exceptions, deep reinforcement learning research focuses on fully observable environments; partially observable environments have received comparatively less attention. Yet they are highly relevant in real-world applications, where a physical system's sensors can read only a limited subset of the features required for decision making. In many problems, the agent needs long-term memory, possibly spanning tens or hundreds of timesteps, to make optimal decisions. In supervised deep learning, particularly in the NLP domain, sequential inputs are processed with specialized network architectures such as variants of recurrent neural networks (RNNs) and attention-based networks.
In this work, we investigate the potential of these advanced sequence-processing architectures in the context of deep reinforcement learning for partially observable environments. Additionally, since partial observability is widespread in the physical world, we take a safety-oriented approach that aims to limit exploration costs and damage to the agent and its environment. First, we augment the soft actor-critic (SAC) method with a constraint on the episodic cost, resulting in an objective function with two Lagrange multipliers: an entropy temperature and a safety temperature.
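One way to write the resulting constrained objective (a sketch with assumed notation: $r$ and $c$ denote the reward and cost functions, $d$ the episodic cost limit, $\bar{\mathcal{H}}$ the target entropy, and $\alpha$, $\beta$ the entropy and safety temperatures):

\[
\max_{\pi}\; \min_{\alpha \ge 0,\, \beta \ge 0}\;
\mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t} r(s_t, a_t)\Big]
+ \alpha\big(\mathcal{H}(\pi) - \bar{\mathcal{H}}\big)
- \beta\Big(\mathbb{E}_{\pi}\!\Big[\textstyle\sum_{t} c(s_t, a_t)\Big] - d\Big)
\]

In practice, both temperatures are typically adapted by gradient steps on their respective dual losses, analogous to the automatic entropy tuning of standard SAC.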
Then, to support long-horizon, partially observable environments, we use gated recurrent (LSTM, GRU) and self-attention based neural networks for the policy and the Q-function estimators. We also study how the design choices and hyperparameters of the self-attention based method affect performance.
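The following is a minimal PyTorch sketch of this idea, not the exact architecture used in this work: an observation-history encoder that can be swapped between a GRU and a self-attention core, whose latent output feeds a policy or Q-function head (all names and dimensions are illustrative).

import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    """Encodes a sequence of observations into a single latent vector."""
    def __init__(self, obs_dim: int, hidden_dim: int = 128, kind: str = "gru"):
        super().__init__()
        self.proj = nn.Linear(obs_dim, hidden_dim)
        if kind == "gru":
            self.core = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        else:  # self-attention; positional encodings omitted for brevity
            layer = nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=4, batch_first=True)
            self.core = nn.TransformerEncoder(layer, num_layers=2)
        self.kind = kind

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, obs_dim)
        x = self.proj(obs_seq)
        if self.kind == "gru":
            out, _ = self.core(x)
        else:
            out = self.core(x)
        return out[:, -1]  # latent summarizing the whole history

# Example: a Q-head on top of the encoded history (act_dim = 2 assumed).
encoder = HistoryEncoder(obs_dim=8, kind="attention")
q_head = nn.Linear(128 + 2, 1)
obs_hist = torch.randn(4, 50, 8)  # batch of 50-step observation histories
action = torch.randn(4, 2)
q_value = q_head(torch.cat([encoder(obs_hist), action], dim=-1))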
To evaluate the methods in safety-constrained environments with long-term temporal dependencies, we develop a new benchmark suite of four parameterizable, partially observable simulations. The environments expose, among other parameters, the length of the history that contains decision-relevant features; hence, we can observe how different network architectures handle the same problems at varying time horizons. Additionally, we introduce a practical framework for the reproducible evaluation of the methods.
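A hypothetical sketch of what such a parameterizable benchmark interface might look like, assuming a Gym-style API; the environment, its name, and its parameters are illustrative, not the actual benchmarks introduced in this work.

import numpy as np

class DelayedCueEnv:
    """Partially observable toy task: a cue shown at t=0 must be recalled
    after `history_len` steps; unsafe actions incur a cost."""
    def __init__(self, history_len: int = 50, cost_limit: float = 5.0):
        self.history_len = history_len
        self.cost_limit = cost_limit

    def reset(self):
        self.t = 0
        self.cue = np.random.randint(2)  # relevant feature, shown only once
        return np.array([self.cue, 0.0], dtype=np.float32)

    def step(self, action: int):
        self.t += 1
        obs = np.array([0.0, self.t / self.history_len], dtype=np.float32)
        done = self.t >= self.history_len
        reward = float(done and action == self.cue)  # recall the initial cue
        cost = float(action == 1)  # e.g. action 1 is the risky one
        return obs, reward, done, {"cost": cost}

# Sweeping history_len compares architectures across time horizons.
env = DelayedCueEnv(history_len=100)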
We conclude that both the recurrent and the self-attention based architectures have high application potential in the introduced domains. We confirm that the feedforward-network baseline agent performs well on problems where only a few, or at most tens of, timesteps have to be processed sequentially. The recurrent and self-attention based architectures show their advantage in environments with longer horizons, where the sequence of events plays an important role and looking back to a fixed position is not sufficient.
Maintaining safety proves problematic when the reward and cost functions are correlated. Additionally, strict cost limits usually lead to a poor policy and suppressed exploration, contributing to higher long-term costs in some environments.
We propose further research into architectural changes that scale the method up to more complex environments, and into analyzing the method on discrete-action environments.