Reinforcement Learning-Based Techniques for the Optimal Control of Complex Systems

Massenio, Paolo Roberto

doi:10.60576/poliba/iris/massenio-paolo-roberto_phd2021

This doctoral thesis presents the results of three years of activities carried out during the XXXIII cycle of the Ph.D. program in Electrical and Information Engineering of the Polytechnic University of Bari, Bari, Italy. The topic of this thesis is the optimal control of complex systems using Reinforcement Learning (RL) based techniques. Optimal control theory aims at finding control policies that minimize a predefined closed-loop performance criterion, namely the utility function. While optimal control for linear systems is a well-established framework, several issues arise when nonlinearities come into the picture. Feedback optimal control policies for nonlinear systems are found by solving the Hamilton-Jacobi-Bellman (HJB) equation, which is in general analytically intractable. Starting from the 1980s, the research community has made considerable efforts to overcome this intractability. This resulted in the development of new approaches based on RL that find approximate solutions of the HJB equation using Neural Networks (NNs). RL is an important branch of Machine Learning. It is inspired by the animal world, where living beings improve their behaviors by interacting with an unknown environment, evaluating the effect of their actions and modifying them accordingly. The combination of RL paradigms, NNs, and optimal control results in the Adaptive Dynamic Programming (ADP) approach. ADP algorithms find optimal control laws by means of different learning strategies. This approach demonstrates the increasing penetration of Artificial Intelligence (AI) in the field of complex control systems. The main purpose of this thesis is to show the effectiveness of ADP-based control systems in real-world scenarios. In fact, although most of the ADP theory has been developed since the second half of the 2000s, experimental tests of real-world ADP-based controllers have only been published more recently.
This thesis begins by reviewing the main ADP algorithms that solve optimal control problems for nonlinear systems, covering the two main learning strategies: the Policy Iteration (PI) algorithm with on-policy learning and the PI algorithm with off-policy learning. The mathematical details of these approaches are presented, and their main properties are discussed along with their pros and cons. Then, the powerful features of the ADP algorithm with off-policy learning are exploited to provide novel control strategies for two different complex systems. It will be shown how the versatility and power of ADP-based techniques make it possible to solve control problems with different contexts and objectives in an innovative way. As a first case study, the optimal control of mechatronic devices based on dielectric elastomer membranes, namely Dielectric Elastomer Actuators (DEAs), is considered. A DEA typically consists of a flexible polymeric membrane that deforms when excited by an electrical voltage. DEAs have recently received significant interest due to their high energy density, high deformation ranges, and low production costs. They have also proven to be quite attractive in several applications, ranging from micro-positioning systems to soft-robotic structures. However, the appeal of DEAs is offset by their strongly nonlinear behavior and sensitivity to environmental conditions, which limit their penetration in the industrial sector. The strong nonlinearities due to the underlying physical behavior have encouraged the development of advanced control strategies. Nevertheless, energy-efficient controllers have never been developed for this class of actuators. In this thesis, a novel minimum-energy control strategy for DEAs is developed. The objective is to minimize the electrical energy required during a positioning task.
In principle, the DEA dynamics can be described by an energy-consistent model, which also captures the losses that occur in the actuator during any positioning task. An optimal feedback control strategy can be employed to minimize those losses, by formulating the energy-minimization problem as an optimal control problem. However, due to the nonlinearities involved, an analytic solution of the HJB equation does not exist. In this thesis, an ADP algorithm with off-policy learning is employed to deal with the optimal energy control problem. In particular, the ADP approach is used as a tool to solve the HJB equation offline, deriving energy-efficient control laws for a given set of target displacement values. Finally, experimental tests validate for the first time an energy-consistent model of the DEA as well as the energy-efficient controllers. Substantial improvements in terms of energy saving arise when comparing the proposed approach with other traditional control methods, such as Proportional-Integral or feed-forward schemes. The second complex system to which ADP is applied is a DC microgrid featuring power buffers. Due to the increasing penetration of DC sources and loads, such as photovoltaic generators or electric vehicles, DC microgrids have recently gained significant attention. DC distribution systems are more efficient and reliable than AC microgrids, where redundant conversion stages are present. Moreover, DC microgrids do not suffer from many AC-related issues, such as frequency synchronization or reactive power flows. However, due to a lack of damping inertia, DC systems can face instability issues when volatile sources and loads are considered. A possible solution is represented by power buffers, which can be used as damping elements in the DC microgrid. A power buffer is a power converter with a large storage element that can be exploited to decouple the distribution grid from the final load.
In fact, when abrupt load changes occur, the energy stored in the buffer compensates for the transient mismatch. The input impedance seen by the network can be actively controlled by the power buffer during transients, so that the stability properties of the DC system are improved. By introducing a communication network on top of the physical grid, distributed control policies for such buffers are enabled. Their effective range of action is thus extended to the neighboring power buffers. In this way, power buffers can assist each other during abrupt load changes. This thesis investigates the cooperative distributed control of power buffers. The cooperative assistive control objective is formulated as an optimal control objective, where a single utility function is shared among all the buffers. In contrast with the existing literature, the nonlinear dynamics are considered. Thus, ADP represents the key tool in designing such optimal policies. Clearly, when dealing with distributed control schemes, the communication topology plays a crucial role. Based on the configuration of the communication network, two different control approaches are presented in this thesis. In the first, the communication topology is fixed and inspired by the physical vicinity of the buffers. A set of optimal control policies able to provide assistance during abrupt load changes is learned offline, using ADP with off-policy learning. Such policies are then interpolated in a real-time control scheme. The proposed approach overcomes the issues of the existing distributed solutions for power buffers. For example, a feedback controller is directly provided, instead of open-loop policies that require additional control loops to be implemented. By considering the fully nonlinear dynamics, the proposed approach does not rely on small-signal approximations. Thus, performance and stability will be guaranteed also for large-signal variations.
Experimental validations conducted in a Controller/Hardware-in-the-Loop (CHIL) environment will assess the effectiveness of the proposed approach. A second approach treats the communication topology as a free parameter subject to optimization. In fact, there is no guarantee that a communication topology inspired by the physical vicinity is optimal with regard to the control objectives. Moreover, the energy availability of each power buffer is limited, thus the co-optimization of control performance and communication topology is important when distributed solutions are considered. A sparsity-promoting optimal controller optimizes a closed-loop utility function, while minimizing at the same time the number of interactions between different control loops. Clearly, DC systems can benefit from sparse communication structures, minimizing computational and communication costs with a limited impact on the resulting closed-loop performance. However, the existing linear formulations of sparsity-promoting optimal control are not practical for nonlinear systems such as the DC microgrid with power buffers. This thesis presents the first attempt at solving sparsity-promoting optimal control problems for nonlinear systems. The versatility of the ADP algorithm with off-policy learning is exploited to provide such a solution, without requiring exact knowledge of the system dynamics. In fact, a single set of learning data is repeatedly used to find optimal controllers for different communication topologies. The proposed data-driven algorithm employs Domain-of-Attraction estimation methods to check the stability of each distributed controller, while a Tabu Search procedure optimizes the combinatorial problem. The obtained sparsity-promoting controllers are then employed in the DC microgrid. The validity of the proposed approach will be assessed through exhaustive CHIL experiments.
Quantitative and qualitative comparisons will show how the proposed methodology significantly outperforms existing approaches.
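
As an illustration of the Policy Iteration idea mentioned in the abstract, the following minimal sketch applies it to the simplest possible case: a scalar linear system with a quadratic utility function, where the HJB equation reduces to a Riccati equation with a known closed-form solution. This is not the thesis's algorithm: it is model-based and uses no Neural Networks or off-policy data. The function names, the system dx/dt = a*x + b*u, and the numerical values are all assumptions chosen for illustration.

```python
import math


def policy_iteration_scalar_lqr(a, b, q, r, k0, tol=1e-12, max_iter=50):
    """Policy iteration for dx/dt = a*x + b*u with cost integral of q*x^2 + r*u^2.

    Hypothetical illustrative helper. The policy is u = -k*x and the value
    function is V(x) = p*x^2. The initial gain k0 must be stabilizing.
    """
    if a - b * k0 >= 0:
        raise ValueError("initial gain k0 must stabilize the closed loop")
    k = k0
    p = None
    for _ in range(max_iter):
        # Policy evaluation: solve 2*(a - b*k)*p + q + r*k^2 = 0 for p.
        p_new = (q + r * k ** 2) / (2.0 * (b * k - a))
        # Policy improvement: minimize the Hamiltonian with respect to u.
        k = b * p_new / r
        converged = p is not None and abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p, k


def riccati_scalar(a, b, q, r):
    """Closed-form positive root of 2*a*p - (b^2/r)*p^2 + q = 0."""
    return r * (a + math.sqrt(a ** 2 + b ** 2 * q / r)) / b ** 2


if __name__ == "__main__":
    # Assumed example values: an unstable plant (a > 0) and a stabilizing k0.
    p, k = policy_iteration_scalar_lqr(a=1.0, b=1.0, q=1.0, r=1.0, k0=3.0)
    print(p, k, riccati_scalar(1.0, 1.0, 1.0, 1.0))
```

In this linear special case the iteration alternates between evaluating the current policy and improving it, and the value coefficient p approaches the Riccati solution. For nonlinear systems the evaluation step has no closed form, which is where the NN-based approximations described in the abstract come in.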

Reinforcement Learning-Based Techniques for the Optimal Control of Complex Systems / Massenio, Paolo Roberto. - ELETTRONICO. - (2021). [10.60576/poliba/iris/massenio-paolo-roberto_phd2021]