Reinforcement learning has been applied to several challenging real-world problems, from robotics to data center cooling. Similarly, the adaptation of reinforcement learning to multi-agent systems has enabled applications such as optimal multi-robot control and the analysis of social dilemmas. In this dissertation, we show that multi-agent reinforcement learning algorithms suffer from several stability issues, including multi-scenario learning, unstable training in dual-reward settings, overestimation bias, and value function collapse, and we provide a solution to each of these problems.

Several contributions of this dissertation are formalized within the framework of the defensive escort team problem, a scenario in which a team of learning robots is deployed as a defensive escort to protect a high-value payload. The goal is to automatically learn an optimal formation around the payload that minimizes potential physical threats from nearby bystanders. We first formalize the defensive escort team problem as a security game based on game-theoretic principles. Motivated by this problem, we present a distributed multi-agent reinforcement learning solution to large-scale security games and show that an optimal formation can be learned automatically through self-play alone.

The multi-scenario instability problem arises when reinforcement learning agents are deployed in environments other than the ones they were trained on. To tackle this issue, we propose Multi-Agent Universal Policy Gradients: a novel multi-agent reinforcement learning algorithm, inspired by universal value function approximators, that generalizes over a set of scenarios. Our results show that the proposed solution outperforms scenario-dependent policies.

Instability in training also arises when reinforcement learning agents try to maximize an entangled multi-objective reward function.
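The core idea behind a universal, scenario-conditioned policy can be sketched as follows. This is a minimal illustration in the spirit of universal value function approximators, not the dissertation's actual architecture: the network shape, one-hot scenario encoding, and dimensions are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SCENARIOS = 3   # illustrative: number of training scenarios
OBS_DIM = 4       # illustrative: size of an agent's observation
N_ACTIONS = 5     # illustrative: size of the discrete action space

# One linear policy over the concatenation of observation and a one-hot
# scenario embedding: a single set of weights is shared across scenarios,
# so the policy can generalize instead of being scenario-dependent.
W = rng.normal(scale=0.1, size=(OBS_DIM + N_SCENARIOS, N_ACTIONS))

def universal_policy(obs, scenario_id):
    """Return action probabilities conditioned on the scenario."""
    one_hot = np.zeros(N_SCENARIOS)
    one_hot[scenario_id] = 1.0
    logits = np.concatenate([obs, one_hot]) @ W
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

obs = rng.normal(size=OBS_DIM)
probs_a = universal_policy(obs, scenario_id=0)
probs_b = universal_policy(obs, scenario_id=1)
```

The same observation yields different action distributions under different scenario conditioning, which is what lets one policy cover a set of scenarios.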
This is a challenging task for current state-of-the-art multi-agent reinforcement learning algorithms, which are designed to maximize either the global reward of the team or the individual local rewards. The problem is exacerbated when either reward is sparse, leading to unstable learning. To address this problem, we present the Decomposed Multi-Agent Deep Deterministic Policy Gradient (DE-MADDPG): a novel cooperative multi-agent reinforcement learning framework that simultaneously learns to maximize the global and local rewards.

Overestimation bias is one of the major issues in reinforcement learning and contributes to learning sub-optimal policies. Several techniques have been proposed that use an ensemble of neural networks to address overestimation bias. However, the neural networks in the ensemble collapse to the same representation space, invalidating the use of the ensemble for this purpose. To mitigate this issue, we propose five regularization techniques that maximize the representation diversity in the ensemble of neural networks.

Although several contributions of this dissertation are formalized within the framework of the defensive escort team problem, the techniques developed here apply directly to scenarios where groups of robots and humans must navigate a shared space, such as robotic guide dogs for visually impaired humans. The contributions of the last chapter are more generic and can be applied to any reinforcement learning algorithm that uses an ensemble of neural networks.
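To illustrate the representation-diversity idea in the ensemble setting, the sketch below penalizes pairwise cosine similarity between ensemble members' feature vectors: when members collapse to the same representation, the penalty is maximal. This is one generic example of such a regularizer, written as an assumption for illustration; it is not claimed to be one of the five techniques proposed in the dissertation.

```python
import numpy as np

def diversity_penalty(features):
    """Mean pairwise cosine similarity between ensemble feature vectors.

    features: array of shape (n_members, feature_dim), one row per
    ensemble member. Adding this penalty to the training loss pushes
    members toward dissimilar representations, counteracting collapse.
    """
    # Normalize each member's feature vector to unit length.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # cosine similarity matrix
    n = features.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]   # drop self-similarity terms
    return off_diag.mean()

rng = np.random.default_rng(0)
collapsed = np.tile(rng.normal(size=(1, 8)), (4, 1))  # identical members
diverse = rng.normal(size=(4, 8))                     # independent members

p_collapsed = diversity_penalty(collapsed)
p_diverse = diversity_penalty(diverse)
```

A collapsed ensemble scores the maximum penalty of 1.0, while independently initialized members score lower, so minimizing the training loss plus this term discourages collapse.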

Graduation Date

Boloni, Ladislau


Doctor of Philosophy (Ph.D.)


College of Engineering and Computer Science


Computer Science

Degree Program

Computer Science


Release Date

December 2023

Length of Campus-only Access

3 years

Access Status

Doctoral Dissertation (Campus-only Access)