Safe reinforcement learning in long-horizon partially observable environments