Abstract

Autonomous vehicle control approaches are rapidly being developed for everyday street-driving scenarios. This article considers autonomous vehicle control in a less common, albeit important, situation: a vehicle stuck in a ditch. In this scenario, a solution is typically obtained either by using a tow-truck or by humans rocking the vehicle to build momentum and push the vehicle out. However, it would be much safer and more convenient if the vehicle were able to exit the ditch autonomously without human intervention. In exploration of this idea, this article derives the governing equations for a vehicle moving along an arbitrary ditch profile with torques applied to the front and rear wheels and with consideration of four regions of wheel-slip. A reward function was designed to minimize wheel-slip, and the model was used to train control agents using Probabilistic Inference for Learning COntrol (PILCO) and deep deterministic policy gradient (DDPG) reinforcement learning (RL) algorithms. Both rear-wheel-drive (RWD) and all-wheel-drive (AWD) results were compared, showing the capability of the agents to achieve escape from a ditch while minimizing wheel-slip for several ditch profiles. The policies obtained by applying RL to this problem intuitively increased the momentum of the vehicle and applied “braking” to the wheels when slip was detected so as to achieve a safe exit from the ditch. The conclusions show a pathway for applying aspects of this article to specific vehicles.

1 Introduction

Autonomous vehicles are a technology that is poised to change transportation. Many prominent companies have allocated significant resources to develop autonomous vehicle technology to ensure safety and reduce traffic issues. However, the investigation of autonomous vehicle control has primarily been concerned with the control of vehicles for on-road, everyday-driving applications [1]. This article seeks to explore the possibility of controlling a vehicle in a less common, albeit important, driving situation—a vehicle stuck in a ditch.

This article presents a unique dynamic model of an idealized vehicle moving on an arbitrary ditch profile, the switching conditions for four regions of possible wheel-slip behavior, and a comparison of multiple reinforcement learning (RL) techniques to train the vehicle to get unstuck from the ditch while minimizing wheel-slip for both rear-wheel-drive (RWD) and all-wheel-drive (AWD) vehicle models. This is a different problem from the RL “mountain-car” scenario [2] as the dynamics model includes significantly more complexity, such as rigid-body vehicle dynamics and wheel-slip, so as to better emulate solving this problem for a real-world scenario (see Sec. 2). In contrast, the “mountain-car” problem relies on a point-mass assumption and a continuous dynamics model. Reward function design and challenges in training an agent to avoid wheel-slip using a discontinuous dynamics model are significantly different from previous approaches [35]. It is likely that the suspension, tires, drive-train, and perhaps other mechanisms influence the performance of a vehicle stuck in a ditch. There are many different suspension designs, drive-trains, and tire models, and these can vary significantly from vehicle to vehicle. Thus, this article focuses on the dominant effects of rigid body dynamics, wheel-slip, and ditch shape, but does not include the dynamic effects of a specific vehicle, such as the compliance of a specific vehicle suspension or tires, in an effort to provide a basis of comparison for future studies. In our previous work in Ref. [6], a vehicle model was developed that did not consider any wheel-slip and the control problem was considered using human behavioral forcing instead of RL.

Many drivers have found themselves stuck in a ditch at one time or another. The severity of this situation can be compounded by issues, such as lack of cell reception (inability to call a towing service), visibility issues (such as at night), inclement weather conditions, and low-traffic roads (less likely that someone would stop and help). In Ref. [7], the correlation between ditches and car accidents was considered, with the finding that 90% of ditch accidents occur in rural areas. Thus, having a vehicle stuck in a ditch is both a safety concern and a great inconvenience.

When the assistance of a tow vehicle is unavailable, getting a vehicle unstuck is often accomplished by the assistance of human force, with companions pushing behind the vehicle as the driver applies the gas pedal. However, the combination of static human force and torque applied to the wheels is generally insufficient to achieve the desired goal. Instead of applying a static force, a dynamic force is applied rhythmically to the vehicle (similar to pushing a child on a swing), so that the vehicle builds up momentum and achieves escape from the ditch without requiring the substantial applied force of a tow-truck. For the increased safety of occupants and a greater possibility of achieving escape from the ditch, it is desirable that a vehicle would be able to autonomously escape the ditch without human intervention.

Many different types of vehicle dynamics models exist in the literature. For example, some models seek to understand complex problems such as steering, tire deformation, suspension, and braking and utilize many degrees of freedom (DOF). A comprehensive survey of different vehicle dynamics applications was presented in Ref. [8]. This survey focused primarily on automotive suspension systems, worst-case maneuvering, minimum-time maneuvering, and driver modeling, while citing 185 references. However, in the minimum-time maneuvering problem, the applications were focused on minimum track time for racing, whereas the present article considers escaping from a ditch while minimizing wheel-slip. Reference [9] summarized advancements in the study of vehicle dynamics across a range of vehicle, tire, and driver models while also noting the need to further develop nonlinear dynamic models for vehicles. A vehicle dynamics prediction module for public-road maneuvering was presented in Ref. [10], with a primary emphasis on highway-speed maneuvering. In Ref. [11], several benchmarks for vehicle dynamics problems were considered for both rail and road vehicles, with a particular focus on studying wheel-slip and lateral dynamics.

As mentioned previously, the demand for vehicle automation and innovative optimal control solutions has been a strong motivation for further understanding of vehicle dynamics. Some researchers sought to develop vehicle control strategies that perform well in hazardous scenarios, which is a goal similar to the problem investigated in this article. In Ref. [12], a modified fixed-point control allocation scheme was implemented in a Simulink CarSim simulation to test braking during high-speed double lane changing on slippery roads and hard braking with an actuator failure. In Ref. [13], a coordinated control system involving electronic stability control, active roll control, and engine torque control was used to maximize driver comfort. A linearized 2 DOF dynamics model was used in Ref. [14] to develop an adaptive optimization-based second-order sliding-mode controller. For modeling the controller, the authors assumed the vehicle velocity while turning was pseudo-constant and the steering and side-slip angles were small. In Ref. [15], the authors proposed a three-dimensional state, including steering and tire force as well as longitudinal vehicle dynamics (position and velocity), as key inputs to a control design that synthesizes control approaches using a proportional controller.

Some research has been performed in the area of avoiding hazardous terrain autonomously, such as Ref. [16], which focused on path planning to avoid discrete obstacles. Reference [17] proposed the use of LiDAR to detect hazardous terrain, such as ditches, with the intention of avoidance for autonomous land vehicles. In Ref. [18], navigation of an autonomous vehicle through hazardous terrain is considered using imitation learning from expert demonstration of a task. While these articles are useful for off-road applications, the current article is concerned with the safe exit from a ditch, rather than the avoidance of it altogether.

Since most of the vehicle dynamics models in the literature have focused primarily on everyday driving situations, they have also assumed a flat surface profile, which is typical for most roads. Since the purpose of this research is to address the situation of a vehicle stuck in a ditch, an arbitrary surface profile was assumed. A single-track vehicle model moving on a smooth surface was presented in Refs. [19,20], but without a mathematical derivation or validation of the model through simulation or experiment. Single-track vehicle dynamics were considered in Ref. [21] as well, where the authors derived the dynamics of a cart that is being excited by a moving base using Lagrange’s method. They included results from an earthquake response simulation. A similar dynamics problem of a ball rolling on a two-dimensional potential surface was shown in Ref. [22], with a resulting dynamic model that appears similar in form to the dynamics model presented in this article.

This article derives a dynamic model for a vehicle moving on an unknown surface profile (which allows the possibility of simulating vehicle behavior on any continuous ditch shape) and will consider four different cases of wheel-slip for the vehicle: (1) no wheels are slipping, (2) both rear and front wheels are slipping, (3) the rear wheels are slipping and the front wheels are not slipping, and (4) the rear wheels are not slipping and the front wheels are slipping. In addition, this article derives the terminal conditions for these four slip cases and describes a simulation method to accurately switch between these cases. To develop a control policy for achieving escape from a ditch while minimizing wheel-slip, two different RL methods are used and their results compared.

2 Relevant Reinforcement Learning Background

A more recent control approach that will be applied in this article is RL, and a brief description is included here. The core RL algorithm is composed of two primary functions: the environment and the agent (see Fig. 1). The environment provides the state and corresponding reward achieved based on a given action. The agent uses a control policy π to determine the action based on the state and reward observed from the environment. RL seeks to answer the question: “What action should be taken to maximize the expected long-term reward?” Typically, the reward function is designed in such a way that the algorithm will make decisions that direct the environment toward a desired goal.
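As a minimal sketch of this action and observation loop (and not the specific implementation used in this article), the following Python snippet rolls out a single episode; the env and agent objects and their reset, step, act, and observe methods are hypothetical placeholders.

```python
# Minimal sketch of the agent-environment loop in Fig. 1 (illustrative only;
# the env/agent interfaces below are hypothetical placeholders).
def run_episode(env, agent, max_steps=200):
    """Roll out one episode and return the accumulated reward."""
    state = env.reset()                          # environment provides an initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                # policy pi maps state -> action
        state, reward, done = env.step(action)   # environment returns next state and reward
        agent.observe(state, reward)             # agent updates its policy estimate
        total_reward += reward
        if done:
            break
    return total_reward
```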

Fig. 1
A diagram of a reinforcement learning (RL) action and observation loop. The environment achieves a state and reward based on an action, which is determined by the agent’s control policy π.

The vehicle-ditch problem was considered well suited for RL control for two reasons. First, RL can achieve good results while not needing to know the exact model of a complex dynamic system. This is particularly useful when the system dynamics are difficult to model analytically or when a control approach is data driven, instead of based on a model. The discontinuous vehicle dynamics model with four different regions of wheel-slip behavior fits this category well. Some complex control examples, such as controlling a nonlinear turbo-generator system in Ref. [23] and an optimal tracking control problem in Ref. [24], are solved using RL without prior knowledge of the system dynamics.

The second reason RL is well suited for the vehicle-ditch scenario is that it has the ability to explore many combinations of states and actions and can achieve good control performance even for systems with control constraints. A practical example of a control-constrained system that is similar to the vehicle-ditch scenario is that of a parent pushing a child on a swing. The parent may not be able to exert enough effort to push the child as high as they may want to swing in one push. However, by timing repeated pushes in such a way as to build the child’s momentum, the desired height for swinging can be obtained. Classic control methods, such as proportional-integral-derivative (PID) and linear-quadratic-regulator (LQR), encounter many difficulties when trying to control control-constrained dynamic systems, and this will be discussed further in Sec. 5. The ability of RL to effectively control control-constrained systems is particularly useful for the vehicle-ditch problem, since in a real-world scenario, humans have to effectively time their pushing of a vehicle in combination with the driver applying the gas pedal to achieve escape from the ditch. Hence, the vehicle is often control constrained for these real-life scenarios as well. In this article, we will apply two different RL techniques, Probabilistic Inference for Learning COntrol (PILCO) and deep deterministic policy gradient (DDPG), to control the discontinuous vehicle dynamics model to achieve escape from a ditch. Each of these algorithms will be explained briefly.

Probabilistic inference for learning control is an RL algorithm presented in Ref. [25] that uses a Gaussian process (GP) to create a surrogate model of the dynamics of a system. This algorithm attempts to learn an effective policy while reducing the number of trial episodes necessary to do so. An example of the application of PILCO to a control-constrained system can be found in Ref. [26], where PILCO was applied to a double-pendulum-cart system to achieve successful swing-up. In this article, we chose to use a matlab implementation of PILCO as one method for controlling the vehicle-ditch scenario. The application of this algorithm and its limitations will be presented in more detail in Sec. 5.

Deep deterministic policy gradient is a part of a specific category of RL called deep RL [27], where deep neural networks are trained to approximate any one of the following: a value function (which ties long-term reward to actions), the control policy (which ties states to actions), or the system model (which updates the states and rewards for the system). This deep learning is particularly useful when the system is complex, and thus multiple layers of neural networks are needed to achieve an accurate approximation of one or more components of the RL structure. Deep learning techniques have been used to solve difficult control problems. For example, a DDPG RL technique was used in Ref. [28] to control a bicycle effectively. Double Q-learning was used in Ref. [29] to achieve autonomous driving that feels similar to a human driver.

Deep learning has also been applied extensively to control autonomous vehicles. In Ref. [30], a survey was presented of current deep RL methods for solving typical autonomous vehicle control problems, such as motion and path planning for roadway driving. A classic RL benchmark problem called “mountain-car” [35] is somewhat similar to the problem considered in this article—getting a vehicle unstuck from a ditch. However, “mountain-car” uses a simple control-constrained point-mass (the car) that is unable to reach the top of a mountain without applying RL. While “mountain-car” has been used as a good benchmark problem with which to test RL methods, it has not been considered as a control problem for real-world use. To solve the vehicle-ditch problem, we include rigid body dynamics, an arbitrary ditch profile, and the potential for slip to occur with either front or rear wheels using both RWD and AWD models. Our purpose is to provide insight into autonomously controlling a vehicle in such a hazardous scenario.

A detailed explanation of DDPG is beyond the scope of this background [27], but we chose to apply this RL algorithm since it has the ability to implement a continuous action space, which is most applicable to typical analog-signal control scenarios and because it is capable of controlling complicated systems due to its deep neural network structure. We implemented a neural network structure as defined in Ref. [31], since it showed good results across a variety of complex systems.

The remainder of this article will present the derivation of the discontinuous analytical model, simulation methods, and results from applying various RL techniques.

3 Derivation of Analytical Model

3.1 Dynamic System Description.

To understand the behavior of a vehicle moving on an arbitrary surface, the equations of motion (EOM) for the system must first be derived. This is done using Newtonian mechanics. A diagram of the system is shown in Fig. 2. To derive the EOM for the system represented in Fig. 2, we begin by defining the position vector for rigid body K as rK (where K represents either wheel A, rigid body M, or wheel B). Vector components are further defined as rK(i^)=rKI^ and rK(j^)=rKJ^, where I^ and J^ are coordinate vectors shown in Figs. 2–4. In addition, the rotational angle corresponding to rigid body K is defined as θK. The derivation of the analytical expressions for these position vectors, as well as their corresponding velocity and acceleration vectors (r˙K, θ˙K, r¨K, and θ¨K), is included in Appendix A for completeness. The position coordinate for this system is x, y(x) is the function describing the shape of the ditch surface, and yK,x is the derivative with respect to x of y(x) evaluated at the contact point of wheel K with the surface. For the key dimensions of the vehicle, R is the radius of the wheels, l is the length of M, and xc and yc describe the position of the center of mass of M with respect to the left-hand lower corner of body mass M shown in Fig. 2.

Fig. 2
A vehicle comprised of rigid bodies A, B, and M moving on a ditch profile y(x)
Fig. 3
Free-body diagrams of wheels A and B: (a) a free-body diagram of wheel A and (b) a free-body diagram of wheel B
Fig. 4
A free-body diagram of rigid body M

This derivation will develop the EOM for this vehicle and provide the state-space dynamics for four possible cases of traction the vehicle experiences with the surface: (1) neither wheels A or B are slipping, (2) both wheels A and B are slipping, (3) wheel A is slipping and wheel B is not slipping, and (4) wheel A is not slipping and wheel B is slipping. The subscript 1 will be used to denote the first case, the subscript 2 will be used to denote the second case, and so on. The subscript n will be used to denote any of the four cases.

When the vehicle is in case 1, wheels A and B are assumed to be in perfect traction with surface y(x). Thus, θA,n and θB,n are functions of xn, and thus, there is a single DOF that describes the behavior of the vehicle in this condition—xn. If wheel A loses traction, the vehicle transitions to case 3, where there is no direct relationship between θA,n and xn, and thus, an additional DOF, θA,n, is introduced to the system due to a spinning or sliding wheel A. Similarly, if both wheels A and B lose traction, the vehicle is in case 2, where there is no direct relationship between θA,n and xn or θB,n and xn, and thus, two additional DOFs, θA,n and θB,n, are introduced to the system.

Since the DOFs change depending on the case n, a state-space model of this discontinuous dynamic system will also change in size. It is necessary to have a state-space model that does not change size depending on n to make switching between cases possible during numerical integration. Uniformity in the size of the state-space model across all four cases is accomplished by setting the state-space size to the maximum it would be for any of the four cases and augmenting the smaller state-spaces to include the maximum DOFs. For instance, with case 1, the state-space model would only depend on xn and x˙n. In case 2, the state-space model would include xn, x˙n, θA,n, θ˙A,n, θB,n, and θ˙B,n. Since case 2 includes all possible DOFs for this vehicle model, the state-spaces for the other cases are augmented to include these states as well. This is explained further in the following sections.
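A minimal sketch of this fixed-size state convention is shown below, assuming a six-element vector z = [xn, x˙n, θA,n, θ˙A,n, θB,n, θ˙B,n]; the helper function is purely illustrative and not taken from the article's code.

```python
import numpy as np

# Illustrative sketch (not the article's code): the state vector keeps the same
# six slots in every slip case so the integrator can switch cases mid-simulation.
# z = [x, x_dot, theta_A, theta_A_dot, theta_B, theta_B_dot]
def make_state(x, x_dot, theta_A, theta_A_dot, theta_B, theta_B_dot):
    return np.array([x, x_dot, theta_A, theta_A_dot, theta_B, theta_B_dot], dtype=float)

# In case 1 the wheel angles are kinematically tied to x, but their slots are
# still carried (updated via Eqs. (14) and (15)) so a transition to cases 2-4
# needs no resizing of the state.
z0 = make_state(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
```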

3.2 Governing Equations.

Using Newton’s second law and the depiction of forces acting on the rigid bodies in Figs. 3(a), 3(b), and 4, three equations are obtained that describe the motion of each rigid body, for a total of nine equations. These are obtained from the sum of forces FK and torques TK on each rigid body K: FK=mKr¨K and TK=IKθ¨K, where IK and mK are the moment of inertia and mass of rigid body K, respectively. These forces and torques are portrayed in Figs. 3(a), 3(b), and 4, where FF,K is the friction force, FN,K is the normal force, Fg,K is the gravitational force, and τK is a rotational torque acting on rigid body K. Also, αK is the angle from the horizontal of wheel K and ΦK describes the curvature of the surface y(x) at the contact point with wheel K (see Eq. (11) in Appendix A). Finally, internal forces on wheel K are represented by Kx and Ky. The equations describing the motion of these rigid bodies are as follows:
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)

The gravitational forces in Eqs. (1)–(9) are defined as Fg,K = mKg, where g is the gravitational constant. Angle αK is related to the slope of y(x) at the contact point of wheel K with the surface by tan αK = yK,x, and thus, cos αK = 1/√(1 + yK,x²) and sin αK = yK,x/√(1 + yK,x²).
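The snippet below illustrates these geometric relations: given the surface slope yK,x at a wheel contact point, cos αK and sin αK can be evaluated directly without computing αK itself. The function name and example slope value are illustrative only.

```python
import numpy as np

# Sketch of the relations below Eqs. (1)-(9): with the local slope y_Kx = dy/dx
# at the wheel contact point, tan(alpha_K) = y_Kx, so cos(alpha_K) and
# sin(alpha_K) follow directly without evaluating alpha_K itself.
def contact_angle_terms(y_Kx):
    denom = np.sqrt(1.0 + y_Kx ** 2)
    return 1.0 / denom, y_Kx / denom   # (cos(alpha_K), sin(alpha_K))

cos_a, sin_a = contact_angle_terms(0.5)   # e.g., a local slope of 0.5 (about 26.6 deg)
```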

The complex analytical expressions for r¨K and θ¨K,n (included in Appendix A) must be substituted into Eqs. (1)–(9) to solve. Given the complexity of this dynamic system for each of the four cases presented, maple software was implemented to obtain the exact state-space form for each case. The maple files used to derive this system can be found in the data repository for this article. In Secs. 3.3–3.6, the method will be presented for deriving each case and the general form for each state-space will be provided. For the simplest case, case 1 (see Sec. 3.3), a more complete derivation will be presented to demonstrate the steps needed to derive the other state-spaces.

3.3 Case 1: Neither Wheels A or B Are Slipping.

For this case, there is one DOF—x1. Hence, Eqs. (1)–(9) can be reduced to a single equation dependent on x1. Since there is no wheel-slip occurring, θA,1 and θB,1 are dependent on x1. Thus, instead of using Eqs. (3) and (6) to solve for θ¨A,1 and θ¨B,1, the wheel friction force FF,K is treated as the dependent variable. This relationship is used to reduce Eqs. (1)–(9) to a single equation dependent on x1. The result is expressed as follows:
(10)
Equation (10) can be reduced to the following form after substitution of relevant expressions from Appendix  A:
(11)
where H1, J1, TA,1, and TB,1 are all nonlinear functions of x1 specific to case 1, and all parameters of this form used later in this article are also nonlinear functions of xn pertaining to a given case n. The superscript position for any parameters Hn, Jn, TA,n, or TB,n does not denote the exponential operation, but either a normal force parameter (i.e., HnN) or an angular acceleration parameter (i.e., Hnθ).
While Eq. (11) is sufficient to provide a state-space consisting of states x1 and x˙1, the extra DOFs that will arise from different cases must be included as well, as mentioned previously. This means that the state-space for case 1 must be artificially augmented to include θA,n, θ˙A,n, θB,n, and θ˙B,n. Solutions for θ¨A,1 and θ¨B,1 for the case where a given wheel is not slipping (see Eqs. (13) and (20) in Appendix  A, respectively) are included here for clarity in the state-space derivation:
(12)
(13)
Since both Eqs. (12) and (13) depend on x¨1, the complete state-space for case 1 can be obtained by simply solving for x¨1 using Eq. (11) and substituting into Eqs. (12) and (13) to obtain:
(14)
(15)
The resulting state-space from Eqs. (11), (14), and (15) is expressed as follows:
(16)
where states z1z6 are the states xn, x˙n, θA,n, θ˙A,n, θB,n, and θ˙B,n, respectively. This collection of states will be referred to as z in future sections.
Now that the state-space for case 1 has been defined, it must be determined what conditions cause the vehicle to transition from this case to any of the other three cases. Case 1 requires perfect traction between both wheels A and B and the surface. Thus, the conditions that would make either wheel-slip would be the terminal conditions for this state-space. Either wheel would slip when the friction force needed to maintain traction with the surface exceeds the product of the static friction coefficient, μs, and the normal force acting on the wheel. These conditions are as follows:
(17) |FF,A,1| > μsFN,A,1
(18) |FF,B,1| > μsFN,B,1

Note that if either of the events in Eqs. (17) or (18) occurs, the system transitions to a different case. If Eq. (17) occurs, the system transitions to case 3; if Eq. (18) occurs, the system transitions to case 4; and if both Eqs. (17) and (18) occur, the system transitions to case 2.
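A sketch of this case-1 switching logic is shown below, assuming the friction and normal forces have already been evaluated (e.g., via Eqs. (19)–(24)); the function and argument names are placeholders rather than the article's implementation.

```python
# Illustrative sketch of the case-1 switching logic in Eqs. (17) and (18).
# F_FA, F_FB are friction forces, F_NA, F_NB normal forces at wheels A and B,
# and mu_s is the static friction coefficient.
def next_case_from_case1(F_FA, F_FB, F_NA, F_NB, mu_s):
    slip_A = abs(F_FA) > mu_s * F_NA   # Eq. (17): wheel A would begin to slip
    slip_B = abs(F_FB) > mu_s * F_NB   # Eq. (18): wheel B would begin to slip
    if slip_A and slip_B:
        return 2                       # both wheels slipping
    if slip_A:
        return 3                       # only wheel A slipping
    if slip_B:
        return 4                       # only wheel B slipping
    return 1                           # traction maintained at both wheels
```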

The friction forces acting on wheels A and B can be obtained using Eqs. (3) and (6) together with the solutions for θ¨A,1 and θ¨B,1 in Eqs. (12) and (13), respectively, to obtain
(19)
(20)
Similar to solving for Eq. (10), the normal forces at wheels A and B can be obtained to evaluate the terminal conditions listed in Eqs. (17) and (18). These forces are obtained by solving Eqs. (1)–(9) for FN,A,1 and FN,B,1, resulting in
(21)
(22)
Equations (21) and (22) can be simplified to obtain a form similar to Eq. (11). This is done by substitution of relevant expressions from Appendix  A and using the solution for x¨1 from Eq. (11) to obtain
(23)
(24)

Now, the state-space and terminal conditions for case 1 have been developed. Since the complete EOMs and equations for normal forces for the other three cases are lengthy, we chose only to derive these by example for the first case. For the rest of the cases, a briefer derivation will be provided.

3.4 Case 2: Both Wheels A and B Are Slipping.

In case 2, the direct relationships between θA,n and xn or θB,n and xn that existed for case 1 are no longer valid. The friction force at wheel K is modeled as FF,K,2 = μKFN,K,2, where μK is the dynamic friction coefficient between wheel K and the surface. In this case, there are three DOFs—xn, θA,n, and θB,n, so there is no need to augment the state-space for this case to include any additional states. The following EOM dependent on x2 is derived from Eqs. (1)–(9):
(25)
Similarly, EOMs can be obtained that are dependent on the other two DOFs, θA,2, and θB,2:
(26)
(27)
The result from Eqs. (25)–(27) is the following state-space:
(28)
Now that the state-space has been defined for case 2, it must be determined what conditions cause the vehicle to transition from this case to any of the other three cases. Case 2 requires that both wheels A and B be slipping against the surface. Thus, the conditions that would make either wheel stop slipping would be the terminal conditions for this state-space. Either wheel would stop slipping when the relative velocity between the wheel and the surface, vr,K, becomes zero. These conditions are obtained from Eqs. (12) and (18) and are as follows:
(29) vr,A = 0
(30) vr,B = 0
If either one of these conditions, Eqs. (29) or (30), is satisfied, that is sufficient to transition to a different case. If Eq. (29) occurs, the system transitions to case 4, and if Eq. (30) occurs, the system transitions to case 3, and if both Eqs. (29) and (30) occur, the system transitions to case 1.

3.5 Case 3: Wheel A Is Slipping and Wheel B Is Not Slipping.

In case 3, the direct relationship between θA,n and xn that existed for case 1 is no longer valid. As with case 2, the friction force at wheel A is modeled as FF,A,3 = μAFN,A,3. However, since wheel B is not slipping, the relationship between θB,n and xn described in Eq. (13) is valid. This information and Eqs. (1)–(9) are used to get EOMs for x3 and θA,3:
(31)
(32)
An expression for θ¨B,3 is obtained similar in form to Eq. (15):
(33)
The result from Eqs. (31)–(33) is the state-space for case 3:
(34)
To leave case 3, one of the two possible events must occur. Either wheel A must stop slipping or wheel B must start slipping. For wheel A to stop slipping, the relative velocity at wheel A, vr,A, must be zero. For wheel B to start slipping, the friction force acting on wheel B must exceed the product of the static friction coefficient and the normal force at wheel B. These two conditions are shown as follows:
(35) vr,A = 0
(36) |FF,B,3| > μsFN,B,3
If Eq. (35) occurs, the system transitions to case 1, and if Eq. (36) occurs, the system transitions to case 2, and if both Eqs. (35) and (36) occur, the system transitions to case 4. Similar to Eqs. (20) and (24), the friction force and normal force for case 3 are defined as follows:
(37)
(38)

3.6 Case 4: Wheel A Is Not Slipping and Wheel B Is Slipping.

In case 4, the direct relationship between θB,n and xn that existed for cases 1 and 3 is no longer valid. The friction force at wheel B is modeled as FF,B,4 = μBFN,B,4. However, since wheel A is not slipping, the relationship between θA,n and xn described in Eq. (12) is valid. This information and Eqs. (1)–(9) are used to get EOMs for x4 and θB,4:
(39)
(40)
An expression can be obtained for θ¨A,4 similar in form to Eq. (15):
(41)
The resulting state-space for case 4, from Eqs. (39) to (41), is expressed as follows:
(42)
To leave case 4, one of the two possible events must occur. Either wheel A must start slipping or wheel B must stop slipping. For wheel A to start slipping, the friction force acting on wheel A must exceed the product of the static friction coefficient and the normal force at wheel A. For wheel B to stop slipping, the relative velocity at wheel B, vr,B, must be zero. These two conditions are shown as follows:
(43) |FF,A,4| > μsFN,A,4
(44) vr,B = 0
If Eq. (43) occurs, the system transitions to case 2, and if Eq. (44) occurs, the system transitions to case 1, and if both Eqs. (43) and (44) occur, the system transitions to case 3. Similar to Eqs. (19) and (23), the friction force and normal force for case 4 are defined as follows:
(45)
(46)

This concludes the derivation of the state-spaces for the four possible cases for surface contact between the planar vehicle model and the ditch surface profile. The terminal conditions for each state-space have been described as well as the transitions to different cases. In Sec. 4, a method for numerically simulating this discontinuous dynamic system and intelligently switching between each of the four cases will be presented.

4 Simulation

In simulating the derived model, there are numerous issues to consider. First, the spatial dimensions for the surface profile and physical properties of the vehicle must be selected. For the surface profile, an inverted Gaussian shape was chosen to represent a ditch large enough to accommodate the planar dimensions of the vehicle. The form of this shape is expressed as follows:
(47)
where a = 3, b = 16/225, and c = 1/1225. In practice, this function y(x) can be any smooth function that has a radius of curvature greater than wheel radius R and does not approach ±∞ in the simulation region of interest. For the physical dimensions and properties of the vehicle, real data for a 1998 F-150 pickup truck were obtained from the National Highway Traffic Safety Administration Vehicle Research and Test Center and used in the analytical model [32]. This vehicle was chosen because it was one of the few vehicles with relevant moment-of-inertia and center-of-mass data readily available. Typical tire and wheel sizes for this truck were used to derive mass and moment-of-inertia parameters for the wheels. The scale of this scenario is shown in Fig. 5, and a list of the physical vehicle parameters is found in Table 1.
Fig. 5
The vehicle and Gaussian-shaped ditch scenario. The filled dot denotes the center of mass of rigid body M and dashed line denotes the target to achieve escape from the ditch.
Table 1

Vehicle simulation parameters

Parameter    Value
mA, mB       51 kg
mM           2039.25 kg
IA, IB       3.3702 kg m²
IM           5091 kg m²
R            0.3675 m
l            3.517 m
xc           2.0520 m
yc           0.3335 m

The primary challenge in simulating this discontinuous dynamic model is the transition between the four state-spaces shown in Eqs. (16), (28), (34), and (42). Typically, simulating a system of continuous ordinary differential equations is straightforward using either a Runge-Kutta method or other numerical integration tool, such as matlab’s ode45 function. However, for this model, in addition to integrating the state-space for each case, the exact moment at which the terminal condition for the state-space occurs must be solved for to accurately switch to a new state-space at the correct moment. For instance, if the vehicle starts out operating in case 1, it will continue in case 1 until either of the terminal conditions for case 1, Eq. (17) or (18), occurs. If Eq. (17) occurs, the vehicle will transition to case 3. Thus, the state-space must be switched from case 1 to case 3 and integration continued until either of the terminal conditions for case 3 occur and the case changes again, and so on.

Initially, matlab’s ode45 function was used in conjunction with a custom event function to solve the state-space up until the exact moment the terminal event occurred. However, there was an issue with this quasi-black-box approach. Solving for the time a terminal condition is reached is accomplished by checking an event function for a zero-crossing, and then by iterating using some numerical root-finding method to solve for the exact moment the terminal condition occurs. matlab’s ode45 event location feature does not have the ability to stop integration after a certain number of calls to the event function have been made in an attempt to locate the terminal condition. This was discovered to be an issue after noticing that the ode45 event location feature occasionally would find a zero’s approximate location, but instead of homing in on its exact location, it continued to cross the zero-point back and forth indefinitely. To better simulate the dynamic model, a Newton–Raphson routine was created to solve for the terminal event for a state-space, with a provision that, if convergence was not achieved within a designated number of iterations, the simulation was terminated. It was feasible to use a Newton–Raphson method instead of a secant method since analytical expressions for the time derivatives of the terminal conditions were available. Pseudo-code outlining the process for evaluating this dynamic system is shown in Algorithm 1.

Algorithm 1: Single time-step integration

   Input: n, zm, τA, τB, tm, Δt, μs
   Output: n, zm+1, tm+1

 1  t = tm, z = zm;
 2  FF,A, FF,B, FN,A, FN,B ← getforces(n, z, τA, τB);
 3  n ← checkstartingcase(FF,A, FF,B, FN,A, FN,B, μs);
 4  while t < tm + Δt do
 5      tr, zr ← integrateode(n, z, τA, τB, t, tm + Δt);
 6      e1, e2 ← getevents(n, zr, τA, τB);
 7      if zerocrossing(e1, e2) then
 8          tnr ← guess(e1, e2);
 9          t, z, n ← newtonraphson(n, zr, tnr);
10      else
11          tm+1 ← tr(end);
12          zm+1 ← zr(end);
13          break
14      end
15  end

Algorithm 1 shows the process for integrating the discontinuous dynamic model from Sec. 3 over a single time-step, accounting for switching between four slipping cases. The inputs are as follows: case n, initial conditions zm, torque controls applied during the time-step τA and τB, starting time tm, step-size Δt, and static friction coefficient μs. The outputs are as follows: the slipping case at the end of the time-step n, the states of the system at the end of the time-step zm+1, and the ending time tm+1. These outputs then become the initial conditions, case condition, and starting time for the beginning of the next time-step of integration.

Lines 1–3 perform some initialization steps for the integration process. In particular, Line 2 calculates the friction and normal forces acting on both wheels given the current torque actions. Since either of the torques could cause wheel-slip at the start of the time-step, line 3 checks to see if this occurs and if so, changes the slipping case to the correct one (see Algorithm 2 in Appendix  B). Lines 4–15 are a while-loop that continues, while the simulation time t is less than the ending time tm + Δt. Inside the while-loop, line 5 integrates the state-space for case n and outputs a refined mesh of times tr and states zr over the entire time-step. In line 6, the terminal conditions e1 and e2 for the current case n are calculated. Line 7 checks to see if there was a zero-crossing in e1 or e2. If there was a zero-crossing, a terminal event occurred, and it is necessary to solve for the exact moment the event occurred. In line 8, e1 and e2 are used to provide an initial guess for the event moment tnr. The Newton–Raphson calculation on line 9 seeks to find the event, and if it does, it outputs the time t and states z at the terminal event. The simulation then returns to line 5 with an updated time t, initial conditions z, and the new slipping case n, and the loop continues. If the Newton–Raphson method does not converge, the simulation is considered to have failed and the simulation ends. If there was not a zero-crossing, Lines 12–13 output the updated time tm+1 and states zm+1 at the end of the time-step and the while-loop breaks. This algorithm allows repeatable and accurate simulation of the discontinuous dynamic model.
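A hedged sketch of this single-time-step structure is given below using SciPy's built-in event handling in place of the custom Newton–Raphson event locator described above; the dynamics, events, and next_case callables are placeholders for the case-n state-spaces (Eqs. (16), (28), (34), and (42)), their terminal conditions, and the case-transition rules.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hedged sketch of the structure of Algorithm 1, using SciPy's event handling
# in place of the article's custom Newton-Raphson event locator.  The callables
# dynamics(n, z, tau_A, tau_B), events(n) -> (e1, e2), and next_case(n, fired)
# are placeholders for the case-n state-spaces, their terminal conditions, and
# the case-transition rules.
def step_with_switching(n, z_m, tau_A, tau_B, t_m, dt, dynamics, events, next_case):
    t, z = t_m, np.asarray(z_m, dtype=float)
    while t < t_m + dt:
        e1, e2 = events(n)                      # terminal-condition functions e(t, z)
        e1.terminal, e2.terminal = True, True   # stop integration at a zero-crossing
        sol = solve_ivp(lambda tt, zz: dynamics(n, zz, tau_A, tau_B),
                        (t, t_m + dt), z, events=(e1, e2), max_step=dt / 20)
        t, z = sol.t[-1], sol.y[:, -1]
        if sol.status == 1:                     # a terminal event fired: switch cases
            fired = [len(te) > 0 for te in sol.t_events]
            n = next_case(n, fired)             # map (current case, fired events) -> new case
        else:
            break                               # reached t_m + dt with no event
    return n, z, t
```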

In addition, a continuous friction model from Ref. [33] was used for this simulation. This can be seen in Fig. 6, which shows μK as a function of vr,K, where K represents either wheel A or wheel B. This allows different friction coefficients to be applied to either front or rear wheels as a function of relative velocity. To assist the convergence of the Newton–Raphson method, this function incorporates a hyperbolic tangent function to smooth the discontinuity at vr,K = 0.
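The exact friction curve from Ref. [33] is not reproduced here; the snippet below shows one plausible tanh-smoothed friction law of the kind described, with placeholder coefficients chosen only for illustration.

```python
import numpy as np

# Illustrative tanh-smoothed friction law (the coefficients are placeholders,
# not the values from Ref. [33]): mu_K varies smoothly through v_r = 0 and
# approaches a kinetic value at larger relative sliding speeds.
def friction_coefficient(v_r, mu_k=0.7, v_scale=0.1):
    return mu_k * np.tanh(v_r / v_scale)

print(friction_coefficient(np.array([-2.0, -0.05, 0.0, 0.05, 2.0])))
```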

Fig. 6
Friction coefficient μK as a function of relative velocity vr,K

5 Reinforcement Learning Control

As has been mentioned in Sec. 1, RL can be an effective tool for controlling complex dynamic systems even when they are control constrained. The vehicle model from Sec. 3 was intentionally control constrained by limiting the maximum applied torque to 700 N m. Thus, the simulated vehicle is not capable of simply applying unlimited positive torque and exiting the ditch. For all RL training, the parameters describing the ditch shape defined in Eq. (47) were set to the values described in Sec. 4. Three control scenarios are considered in this section: RWD with no wheel-slip, RWD with wheel-slip, and AWD with wheel-slip. In addition, the robustness of each of the resulting control policies will be examined at the end of this section.

Full-state feedback was allowed. This was deemed feasible since in the real world θA,n and θB,n could be measured using potentiometers. In addition, since θM is a function of xn and could be measured using a gyroscope, xn is considered at least partially observable. By using available sensing technologies for autonomous vehicles (such as LiDAR), y(x) could be observed and inform a controller on what best control approach to use to get unstuck from the ditch.

It is useful to explain why classic control methods, such as PID and LQR, are incapable of controlling control-constrained systems, and in particular, the vehicle-ditch problem. These classic methods rely on measuring the error between a desired state and a measured state and computing a desired control effort that will seek to minimize this error. The fundamental issue with these methods is that they rely on assumptions of linearity in the system. When the available control effort is not enough to reach the desired state (in a control-constrained system) when applied in a linear relationship to the state error, the best possible control solution either PID or LQR can achieve is to saturate the control in the direction of the desired state. In Fig. 7, a saturated control of the maximum torque is applied to wheel A in the direction of the goal. However, the vehicle just continues to rock back and forth in the ditch with this constant control effort applied without making any real progress toward the target state. While this is the best solution classic control methods can achieve, it is not useful for this problem due to its extremely poor performance.

Fig. 7
Vehicle trajectory from applying a saturated control policy. The dashed line denotes the target state, and the solid line is the position of the vehicle in the ditch. This figure illustrates the poor performance of classic control methods and the need to use an RL policy.

5.1 Applying Reinforcement Learning Assuming a Rear-Wheel-Drive Model With No Wheel-Slip.

First, PILCO was applied to control the vehicle to achieve escape from the ditch using only RWD (torque is only applied to wheel A). One of the fundamental weaknesses of this algorithm is that it relies on a continuous dynamics model for simulation training. In addition, since PILCO uses a GP to build a surrogate model of the system dynamics, it cannot account for multiple different regions of behavior (i.e., cases 1–4) with a single GP model without nontrivial alterations to the core algorithm. Thus, to successfully implement PILCO, it was necessary to assume that the vehicle did not slip with either front or rear wheels, and thus never left case 1. This control algorithm was used to emphasize the importance of considering wheel-slip in controlling the vehicle. The reward function used by PILCO was a positive Gaussian-shaped reward in the vicinity of the target state. The results after 14 training episodes (308 s of training experience) are shown in Fig. 8. In Fig. 8(a), the blue line illustrates the simulated response when assuming that the vehicle cannot slip. The vehicle in this case successfully reached the target (dashed line) state in approximately 20 s. The control torque profile generated using PILCO is shown in Fig. 8(b). However, when the PILCO control policy was applied to the complete dynamics model, the vehicle failed to achieve escape and fell back into the ditch, as shown by the red line. Since torque was not applied to wheel B, wheel B never slipped, so the red line switches only between cases 1 and 3; wheel A slip (case 3) is denoted by the gray shaded regions.

Fig. 8
Vehicle trajectories from applying a PILCO-generated policy to a Rear-Wheel-Drive (RWD) dynamics model: (a) The trajectories from applying a PILCO-generated policy, where the dark line assumes no wheel-slip and the light line incorporates wheel-slip. The dashed line denotes the target state and (b) The torque profile generated by PILCO. The shaded regions denote the region when wheel A was slipping when wheel-slip was considered. This figure illustrates the need to incorporate wheel-slip into the dynamics model for the training of an RL policy.

A DDPG algorithm was applied to the same scenario as the PILCO implementation for comparison. The neural network structure was the same as the one implemented in Ref. [31] and training was implemented in matlab using the RL Toolbox. A positive reward function was structured in such a way as to “incentivize” successful achievement of the target state. Often, reward functions that are designed to achieve a target state penalize the system when it is far away from the target state, but when the target state is reached, the penalty is zero. Here, a reward function was chosen that was zero when the system was far away from the target state and was shaped so that as the system approached the target state, it achieved greater and greater rewards. It was not desirable to numerically penalize the vehicle for being far away from the goal, since the vehicle must build momentum by moving in the opposite direction of the goal at times. The reward function was shaped so that it increased as distance to the target (position error ex) decreased. The reward was also dependent on velocity error ex˙ so that it increased as the vehicle slowed down near the target and provided a slight increase for building momentum at the bottom of the ditch. This reward shape rs is shown in Fig. 9 and was designed to incentivize the vehicle to build enough momentum to exit the ditch and additionally to achieve a controlled stop at the target state. While ensuring a controlled stop was a more complex control objective, it is a reasonable safety concern since in the real world it is not desirable that a vehicle exit the ditch in an uncontrolled manner and possibly cause an accident by heading into traffic.
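The exact reward surface in Fig. 9 is not reproduced here, but a plausible shaping consistent with this description is sketched below: the reward is essentially zero far from the target and grows as both the position error ex and velocity error ex˙ shrink. The widths sigma_x and sigma_v are illustrative assumptions.

```python
import numpy as np

# One plausible shaping of the reward r_s described above (the article's exact
# function is not reproduced here): the reward is near zero far from the target
# and grows as the position error e_x and velocity error e_xdot both shrink,
# so a controlled stop at the target earns the most.  The widths are assumptions.
def shaped_reward(e_x, e_xdot, sigma_x=1.0, sigma_v=1.0):
    return np.exp(-(e_x / sigma_x) ** 2 - (e_xdot / sigma_v) ** 2)

print(shaped_reward(0.0, 0.0))   # maximum reward at the target state
print(shaped_reward(5.0, 2.0))   # essentially zero far from it
```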

Fig. 9
Reward function used in training a Deep Deterministic Policy Gradient (DDPG) agent. The function is structured in such a way as to maximize reward when position and velocity error are minimized (i.e., near the target state).

Within 1600 training episodes, the agent was effectively trained to achieve escape with results quite similar to those achieved with PILCO (see Fig. 10). It should be noted that the significant disparity in number of training episodes needed between PILCO and DDPG is due to the fact that PILCO’s use of a surrogate dynamics model allows training to be achieved with significantly fewer training episodes than needed for deep neural network approaches. In Fig. 10(a), similar to Fig. 8(a), a comparison of the vehicle trajectories when assuming no wheel-slip (blue line) versus allowing wheel-slip (red line) can be seen. Figure 10(b) shows the applied control torque τA and the gray regions denote regions when wheel A slipped when wheel-slip was allowed.

Fig. 10
Vehicle trajectories from applying a DDPG-generated policy to a RWD dynamics model: (a) The trajectories from applying a DDPG-generated policy, where the dark line assumes no wheel-slip and the light line incorporates wheel-slip. The dashed line denotes the target state and (b) The torque profile generated by DDPG. The shaded regions denote the region when wheel A was slipping when wheel-slip was considered. This figure illustrates the need to incorporate wheel-slip into the dynamics model for the training of an RL policy.

On comparing Figs. 8 and 10, it is apparent that there is some similar behavior between the PILCO and DDPG control policies. The torque profiles have a similar shape as a result of intelligently building the vehicle’s momentum to achieve escape from the ditch. Both policies performed well, with PILCO achieving escape in 20 s and DDPG performing slightly better by achieving escape in 17 s. In addition, when these policies were applied to the complete dynamics model allowing for wheel-slip, the vehicle did not achieve escape due to significant wheel-slip, as shown by the gray regions of Figs. 8 and 10.

It is useful to consider what effect the starting position of the vehicle has on completing the objective for this control problem. The results shown in Figs. 8 and 10 show a starting position x0 in the ditch of 0 m. To examine the potential effect of different starting positions, the DDPG agent was trained with random starting positions between −3 and 3 m. The time to achieve the target state tg can be seen as a function of x0 in Fig. 11. Figure 11 shows a discontinuity in tg at x0 ≈ 0.4 m. This is the result of the trained agent requiring one fewer oscillation in the ditch to achieve the target state for x0 > 0.4 m, and thus achieving the goal much faster (tg < 13 s).

Fig. 11
Time to reach the target state tg shown to be significantly dependent on starting position x0

5.2 Applying Reinforcement Learning Assuming a Rear-Wheel-Drive Model With Wheel-Slip.

It is desirable for an RL policy to perform well even when wheel-slip is present, and so a DDPG policy was trained using a RWD dynamics model that allowed for wheel-slip. Since torque was not applied to wheel B, wheel B did not slip in this control scenario, but wheel A could slip since torque was applied to it.

For the vehicle-ditch problem, losing traction with the surface is not desirable. Not only is this a safety hazard, but also, if the surface is not rigid (which is more typical of a real-world scenario) and high-rate wheel-spin occurs, the surface can actually be worn away and the wheels can bury themselves in the surface. So, it was desired that the DDPG policy avoid wheel-slip and high-rate wheel-spin while achieving the overall control objective of escape from the ditch. To incentivize this performance, a reward function was needed that includes additional features beyond rs in Fig. 9. It was desired to penalize high relative velocities for wheel A, since that is effectively high-rate wheel-spin, and to penalize the condition of slipping. A total reward function was designed such that rt = rs − 0.001|vr,A| − rc (see Eq. (35)), where
(48)
Equation (48) is designed to penalize the system for slipping, and cases 2 and 4 were not included because wheel B could not slip. The observation states that were used in training the DDPG agent were xn, x˙n, vr,A,n, and n. Figure 12 shows the performance of the trained DDPG agent after 2250 training episodes.
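A sketch of this total reward is shown below; the magnitude of the slip penalty rc in Eq. (48) is not reproduced in the text, so the value used here is a placeholder.

```python
# Sketch of the RWD total reward r_t = r_s - 0.001*|v_rA| - r_c.  The slip
# penalty magnitude in Eq. (48) is a placeholder here; only case 3 (rear-wheel
# slip) is penalized because wheel B cannot slip in this scenario.
def total_reward_rwd(r_s, v_rA, case, slip_penalty=0.1):
    r_c = slip_penalty if case == 3 else 0.0
    return r_s - 0.001 * abs(v_rA) - r_c
```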
Fig. 12
Vehicle trajectory from applying a DDPG-generated policy to a RWD dynamics model with wheel-slip: (a) The vehicle trajectory (solid line) and target state (dashed line) and (b) The applied torque generated by DDPG. The shaded regions denote the region when wheel A was slipping. This figure illustrates the success of incorporating wheel-slip in the dynamics model for the training of an RL policy.

It is apparent that the control policy shown in Fig. 12 is significantly different in nature than Fig. 8 or 10. There are few gray regions on the plot, which indicate when wheel A was slipping. By considering the three narrow slip regions around t ≈ 2 s on Fig. 12, it can be seen that the agent learned to reverse torque direction rapidly to stop slipping. This happened again at t ≈ 4 s, t ≈ 7 s, and t ≈ 11 s. While switching torque directions this quickly is not physically possible on a vehicle, this was effectively “braking” the vehicle to regain traction with the surface. In practice, this control to avoid slipping could be applied via the vehicle brakes. Even though slip was incorporated in training this RWD control policy, the vehicle still achieved escape in about 17 s, which is comparable with the performance achieved when DDPG was applied while ignoring slip. The utility of RL is apparent here, as the DDPG agent has been successfully trained so as to accomplish all desired control objectives: exit the ditch in a controlled manner while avoiding wheel-slip.

5.3 Applying Reinforcement Learning Assuming an All-Wheel-Drive Model With Wheel-Slip.

This section describes how DDPG was applied to control an AWD dynamics model where control torques were applied to both wheels A and B and the system could be in any of the four wheel-slip cases discussed in Sec. 3. This was the most challenging and computationally intensive result to achieve. For this AWD scenario, it was desired that the vehicle exit the ditch while minimizing wheel-slip. Again, a modified reward function was needed to achieve this objective. For this scenario, a total reward function was designed such that rt = rs − 0.001|vr,A| − 0.001|vr,B| − rc, where
(49)

Equation (49) was designed to heavily penalize the system for both wheels slipping (as in case 2), to penalize less for a single wheel slipping (as in cases 3 and 4), and to not penalize the system at all when no wheels slip (as in case 1). Training a DDPG agent for this complex system was computationally intensive and took nearly 12,000 training episodes (several weeks of computing) to achieve an effective policy that accomplished the control objectives. The performance of this policy is shown in Fig. 13.
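A sketch of this AWD reward with an Eq. (49)-style tiered case penalty is shown below; the penalty magnitudes are placeholders, since the exact values are not reproduced in the text.

```python
# Sketch of the AWD total reward r_t = r_s - 0.001*|v_rA| - 0.001*|v_rB| - r_c
# with an Eq. (49)-style tiered case penalty (magnitudes are placeholders):
# both wheels slipping is penalized most, single-wheel slip less, no slip not at all.
def total_reward_awd(r_s, v_rA, v_rB, case,
                     both_slip_penalty=0.2, single_slip_penalty=0.1):
    if case == 2:
        r_c = both_slip_penalty
    elif case in (3, 4):
        r_c = single_slip_penalty
    else:
        r_c = 0.0
    return r_s - 0.001 * abs(v_rA) - 0.001 * abs(v_rB) - r_c
```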

Fig. 13
Vehicle trajectory from applying a DDPG-generated policy to an AWD dynamics model with wheel-slip: (a) The vehicle trajectory (solid line) and target state (dashed line) and (b) The applied torques: τA (solid line) and τB (dot-dashed line). Cases 2–4 are denoted by medium, light, and dark shaded regions, respectively. This figure illustrates the success of incorporating the complete dynamics model in the training of an RL policy.

In Fig. 13, the yellow shaded region shows where wheel A was slipping, the pink shows where wheel B was slipping, and the green shows where both wheels A and B were slipping. Figure 13(a) shows the vehicle trajectory and Fig. 13(b) shows τA with a black line and τB with a dot-dashed red line. Similar to the results shown in Fig. 12, the control policy intelligently sought to avoid slipping and achieve escape from the ditch. At t ≈ 0.25 s, the vehicle momentarily entered case 4 when wheel B slipped, but the policy immediately corrected by reversing torque directions. At t ≈ 2 s, wheel A was slipping, and τA adjusted successfully to make the vehicle stop slipping. There is only one green region in Fig. 13, which means that the vehicle only once lost traction with both wheels A and B, and it is clear that the policy sought to correct that by reversing torque directions until control of both wheels was regained at t ≈ 6 s. Once escape from the ditch was achieved, τA and τB continued to be applied to maintain the vehicle position. Since the vehicle was in case 1 upon exiting the ditch, it was effectively a single DOF system at that point, which is why τB was maintained constant and τA was adjusting to maintain the vehicle position. The AWD vehicle achieved escape nearly 6 s faster than the RWD vehicle, highlighting the benefit of AWD for hazardous vehicle scenarios. From these results, it is clear that the DDPG policy effectively achieves escape from the ditch for the AWD scenario while minimizing wheel-slip.

5.4 Control Policy Robustness.

The control policies trained using RL were trained with a single ditch shape, with a, b, and c held constant (see Eq. (47)). Due to the computational cost of training these policies, it was infeasible to train many different policies for numerous ditch shapes. It is therefore useful to examine how robust the trained policies are for ditch shapes other than the one they were trained with. To test the policies on various ditch shapes, parameters a and b were varied from 2 to 4 and from 0.05 to 0.1, respectively, and the control policies were applied for each combination of a and b. Parameters a and b influence the shape of the ditch such that as they increase, the ditch becomes narrower and steeper. The performance of the DDPG control policies for the RWD and AWD models with wheel-slip is shown in Figs. 14 and 15.
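A minimal sketch of this robustness sweep is given below. The grid resolution and the simulate_episode interface are assumptions made for illustration; simulate_episode stands in for the simulation and policy-evaluation code used in this work and is assumed to return whether the vehicle escaped, the escape time, and the fraction of the trajectory spent in a wheel-slip case.

import numpy as np

def robustness_sweep(policy, simulate_episode,
                     a_values=np.linspace(2.0, 4.0, 9),
                     b_values=np.linspace(0.05, 0.10, 9)):
    """Evaluate a trained policy over a grid of ditch-shape parameters a and b."""
    escape_time = np.full((len(a_values), len(b_values)), np.nan)
    slip_frac = np.full((len(a_values), len(b_values)), np.nan)
    for i, a in enumerate(a_values):
        for j, b in enumerate(b_values):
            escaped, t_escape, f_slip = simulate_episode(policy, a, b)
            if escaped:  # failed escapes are left as NaN (white boxes in Figs. 14 and 15)
                escape_time[i, j] = t_escape
                slip_frac[i, j] = f_slip
    return escape_time, slip_frac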

Fig. 14
Robustness of the control policy for the RWD model with wheel-slip: (a) The time to escape the ditch for different values of a and b and (b) The percent of the trajectory with rear-wheel-slip (case 3). White boxes indicate values for a and b where the control policy failed to achieve escape from the ditch. Shaded boxes have values indicated by the color bar for each plot. The X denotes the values for a and b that were used for training the control policy.
Fig. 15
Robustness of the control policy for the AWD model with wheel-slip: (a) The time to escape the ditch for different values of a and b, (b) The percent of the trajectory with both front and rear wheels slipping (case 2), (c) The percent of the trajectory with rear wheels slipping (case 3) and (d) The percent of the trajectory with front wheels slipping (case 4). White boxes indicate values for a and b where the control policy failed to achieve escape from the ditch. Shaded boxes have values indicated by the color bar for each plot. The X denotes the values for a and b that were used for training the control policy.

Figure 14(a) shows that the control policy for the RWD model with wheel-slip succeeded in achieving escape from the ditch for a range of values of a and b. As expected, large values of either a or b tend to result in poor policy performance, since these correspond to more challenging ditch shapes; the white boxes indicate ditch shapes where the control policy failed to achieve escape from the ditch. For the least challenging ditch shapes, in the lower left-hand corner of Figs. 14(a) and 14(b), the policy achieved escape most rapidly and without any slip. For the more challenging ditch shapes, escape time was longer and the likelihood of wheel-slip increased.

In Fig. 15, the robustness of the control policy for the AWD model with wheel-slip is demonstrated. Figure 15(a) shows the time to escape from the ditch for different values of a and b. The control policy performed well except for large values of a, which correspond to steeper ditch profiles. There was a significant reduction in escape time for values of a ≥ 3, because the vehicle did not need to move backwards to build momentum to achieve escape for those ditch shapes. In Figs. 15(b)–15(d), the percentages of the trajectory spent in wheel-slip cases 2–4 are shown, respectively. In these figures, it is useful to note the performance at the location marked with the red X, which corresponds to the values of a and b used to train the control policy. At the red X locations, the control policy performed well at avoiding slip for all of the wheels. However, the control policy struggled to avoid slip for some of the other ditch shapes, as shown by the color variation in the lower half of Fig. 15(c) and the upper right-hand side of Fig. 15(d).

6 Conclusions

This article presented a discontinuous dynamic model for an idealized vehicle moving on an arbitrarily shaped ditch profile. This model allowed simulation of a vehicle on any continuous ditch shape and also accounted for four regions of wheel-slip. The complexity of simulating this dynamic system (switching among four state-space representations) was addressed through the use of a Newton–Raphson solver.

To achieve escape from the ditch, RL was explored as a means of generating an effective control policy for this discontinuous, control-constrained system. First, PILCO and DDPG were implemented on an RWD dynamic model while ignoring the possibility of wheel-slip. The resulting policies were not capable of achieving the control objective when applied while allowing wheel-slip, illustrating the need to incorporate this dynamic feature when training an RL agent. Second, DDPG was implemented on an RWD dynamic model with wheel-slip. The result was a policy that intelligently applied “braking” to stop the rear wheels from slipping. This policy successfully achieved escape from the ditch while minimizing wheel-slip. Finally, DDPG was implemented on the full AWD dynamic model with wheel-slip. This scenario was by far the most complex, as it required two control torques and had four possible regions of dynamic behavior. After 12,000 training episodes, the trained agent provided a policy that performed well, both by achieving escape from the ditch and by minimizing wheel-slip for both front and rear wheels. In addition, reward functions were designed for each of these three control scenarios in such a way as to achieve the desired outcome.

This article has sought to address a challenging hazardous vehicle scenario—a vehicle stuck in a ditch. While there has been great progress in vehicle automation for everyday driving, this article addresses a unique problem in vehicle automation by including rigid body dynamics, an arbitrary ditch profile, and the potential for slip to occur at either the front or rear wheels using both RWD and AWD models. RL policies were successfully trained to control the discontinuous dynamics model in several configurations, and the results were compared. For this RL application, DDPG shows more promise due to its ability to implement a continuous action space as well as to control different regions of dynamic behavior in a discontinuous model. In addition, the control policies generated using DDPG were demonstrated to be robust at achieving escape from the ditch for a wide range of ditch shapes.

Future work in applying RL to this problem should seek to develop an experimental implementation and additional simulation training on different vehicles. Additional modeling for particular vehicle components, such as suspension and tires, may be necessary for increased model accuracy and transferring a control policy from simulation to experiment. The data repository for this project is available and easily adaptable in order to foster additional study in the area of vehicle automation.

Acknowledgment

Partial support from ARO W911NF-21-2-0117 is gratefully acknowledged.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The data and information that support the findings of this article are freely available.2

Appendix A: Derivation of Position, Velocity, and Acceleration Vectors

This derivation begins with the position vectors for wheel A, wheel B, and the body mass M. These position vectors depend on θM, for which an expression will be derived later. The position vectors follow from the geometry shown in Fig. 2, where
(A1)
(A2)
(A3)
The velocity and acceleration vectors are the time derivatives of Eqs. (A1)–(A3), as follows:
(A4)
(A5)
(A6)
(A7)
(A8)
(A9)
where
(A10)
When the wheels are not slipping, the angles of wheels A and B are functions of xn, and thus analytical expressions for θA,n and θB,n are
(A11)
(A12)
where
(A13)
The derivation of θ˙A,n and θ¨A,n follows by taking the time derivatives of Eq. (A11):
(A14)
(A15)
It is now shown how to derive the expressions for θ˙B,n and θ¨B,n when wheel B is in traction with the surface. Throughout this article, the spatial coordinate x has been used to describe the contact point of wheel A with the surface y(x). However, to define θ˙B,n and θ¨B,n, it is useful to define a temporary spatial coordinate xB, which describes the contact point of wheel B with the surface y(x). By examining Eqs. (A4) and (A5), we can restate Eq. (A5) by a direct comparison to Eq. (A4) as follows:
(A16)
By setting Eqs. (A5) and (A16) equal to each other, two expressions for x˙B can be obtained:
(A17)
(A18)
It was not possible to show analytically that Eqs. (A17) and (A18) are the same; however, this was checked numerically, and indeed they are the same. It is convenient to use the simpler of the two expressions, Eq. (A17), as the solution for x˙B. An expression for θ˙B,n can be obtained by direct comparison to Eq. (A14):
(A19)
which, using Eq. (A17), can be simplified to
(A20)
where
(A21)
By taking the time derivative of Eq. (A20), we get
(A22)
An expression for θM can be obtained by modifying Eq. (A2) to include a parameter Δl, which describes the horizontal distance between the contact points of the two wheels with the surface; the result is
(A23)
The parameter Δl can be found by solving the transcendental equation |rB − rA| − l = 0, using the expression for rB shown in Eq. (A23). This constraint enforces that the rigid body M is kept at its fixed length l. We can then solve for θM by evaluating Eqs. (A23) and (A1) at values of x, which yields
(A24)
It is clear that θM varies spatially with x. Thus, the angular velocity and acceleration of rigid body M can be computed as follows:
(A25)
(A26)
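Because the constraint |rB − rA| − l = 0 is transcendental in Δl, a numerical root-finding step is needed at each value of x. The following Python sketch shows one way this could be done with a Newton–Raphson iteration, consistent with the solver mentioned in the conclusions. For simplicity, the sketch treats rA and rB as the surface contact points and neglects the wheel-radius offsets that appear in Eqs. (A1) and (A2); the parabolic profile in the usage example is a placeholder, not the ditch profile of Eq. (47).

import math

def solve_delta_l(x, y, dy_dx, l, tol=1e-10, max_iter=50):
    """Newton-Raphson solution of g(dl) = |r_B - r_A| - l = 0 for the
    horizontal separation dl between the two wheel contact points."""
    dl = l  # initial guess: flat ground
    for _ in range(max_iter):
        dy_ = y(x + dl) - y(x)
        dist = math.hypot(dl, dy_)
        g = dist - l
        if abs(g) < tol:
            break
        dg = (dl + dy_ * dy_dx(x + dl)) / dist  # dg/d(dl)
        dl -= g / dg
    return dl

def body_angle(x, y, dy_dx, l):
    """Body pitch angle theta_M implied by the two contact points."""
    dl = solve_delta_l(x, y, dy_dx, l)
    return math.atan2(y(x + dl) - y(x), dl)

# Usage with a placeholder parabolic ditch profile:
y_profile = lambda x: 0.1 * x**2
dy_profile = lambda x: 0.2 * x
theta_M = body_angle(-2.0, y_profile, dy_profile, l=2.5)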

Appendix B: Algorithm for Checking Starting Case

Algorithm 2

Input: FF,A, FF,B, FN,A, FN,B, μs
Output: n

switch n do
   case 1
      if |FF,A| > μs FN,A then n ← 3 end
      if |FF,B| > μs FN,B then n ← 4 end
      if |FF,A| > μs FN,A and |FF,B| > μs FN,B then n ← 2 end
   end
   case 3
      if |FF,B| > μs FN,B then n ← 2 end
   end
   case 4
      if |FF,A| > μs FN,A then n ← 2 end
   end
end
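As a complement to Algorithm 2, a minimal Python sketch of this case-checking logic is given below. It expresses only the transition conditions of the algorithm and assumes that the friction forces FF,A, FF,B and normal forces FN,A, FN,B have already been computed from the current case's state-space model.

def next_case(n, F_F_A, F_F_B, F_N_A, F_N_B, mu_s):
    """Return the wheel-slip case implied by Algorithm 2.

    Cases: 1 = no slip, 2 = both wheels slip,
           3 = wheel A slips, 4 = wheel B slips."""
    a_slips = abs(F_F_A) > mu_s * F_N_A  # friction demand at wheel A exceeds the static limit
    b_slips = abs(F_F_B) > mu_s * F_N_B  # friction demand at wheel B exceeds the static limit
    if n == 1:
        if a_slips and b_slips:
            return 2
        if a_slips:
            return 3
        if b_slips:
            return 4
    elif n == 3 and b_slips:
        return 2
    elif n == 4 and a_slips:
        return 2
    return n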
