Reinforcement Learning (and Q Learning in Twin Delayed DDPG)
This article gives some background on Reinforcement Learning and Q Learning in Twin Delayed DDPG (TD3). Reinforcement Learning (RL) is the principle of learning by trial and error: the agent tries to obtain the maximum cumulative reward over a certain number of iterations (or a certain amount of time).
Q Learning illustrates the reinforcement learning process. The Q stands for quality: it represents the value of taking a given action in a given state as the agent takes each step in the maze. In the drawing below, the Q value increases as the agent approaches the checkered flag and decreases as it approaches the fire.
An important concept in Q learning is the temporal difference: TD(s, a) = R(s, a) + gamma * max_a' Q(s', a') - Q(s, a). The first two terms form the target Q value; the last term is the predicted Q value.
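To make this concrete, here is a minimal sketch of the temporal difference update for a tabular Q table, written in Python. The maze size, learning rate, discount factor, and rewards are illustrative assumptions, not values from the text.

```python
import numpy as np

# Illustrative tabular Q-learning setup (assumed 5x5 maze, 4 actions).
n_states, n_actions = 25, 4
Q = np.zeros((n_states, n_actions))

gamma = 0.99   # discount factor (assumed)
alpha = 0.1    # learning rate (assumed)

def td_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the target R(s, a) + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])   # first two terms: the target Q value
    td_error = target - Q[s, a]              # temporal difference
    Q[s, a] += alpha * td_error              # nudge the prediction toward the target

# Example: reaching the checkered flag (reward +1) raises Q; stepping into fire (reward -1) lowers it.
td_update(s=0, a=1, r=+1.0, s_next=24)
td_update(s=5, a=2, r=-1.0, s_next=6)
```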
Deep Q learning is a method to approximate the expected return. It is only applicable in a discrete action space. It works by predicting a Q value as close as possible to the target Q value, R(s, a) + gamma * max_a' Q(s', a'). In other words, it seeks to minimise the loss between the prediction and the target. The loss is reduced by back-propagating it through the neural network and updating the weights via stochastic gradient descent (SGD).
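As a rough sketch of that loss minimisation (assuming PyTorch, with the network size, hyperparameters, and the dqn_step helper chosen only for illustration):

```python
import torch
import torch.nn as nn

# Minimal deep Q-learning loss sketch (assumed state dimension 4, 2 discrete actions).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_step(s, a, r, s_next, done):
    # Predicted Q value for the action that was actually taken.
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # Target Q value: R(s, a) + gamma * max_a' Q(s', a'), with no gradient through the target.
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    # Minimise the loss between prediction and target, back-propagated via SGD.
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call on a random batch of 32 transitions.
s, s_next = torch.randn(32, 4), torch.randn(32, 4)
a, r, done = torch.randint(0, 2, (32,)), torch.randn(32), torch.zeros(32)
dqn_step(s, a, r, s_next, done)
```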
Policy Gradient is a method to maximise the expected return by directly updating the weights of the policy neural network. The gradient of the expected return is computed with respect to the policy parameter phi, and phi is then updated through gradient ascent.
The policy parameter phi is updated using the gradient and the learning rate alpha: phi <- phi + alpha * grad_phi J(phi).
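A minimal sketch of that gradient-ascent step, assuming PyTorch and treating the expected return J(phi) as a placeholder differentiable function; the parameter shape and learning rate are illustrative assumptions:

```python
import torch

# Gradient ascent on the policy parameters phi (illustrative sketch).
phi = torch.randn(8, requires_grad=True)   # policy parameters (assumed shape)
alpha = 1e-2                               # learning rate

def expected_return(phi):
    # Placeholder for J(phi); in practice it is estimated from sampled trajectories.
    return -(phi ** 2).sum()

J = expected_return(phi)
J.backward()                               # gradient of the expected return w.r.t. phi
with torch.no_grad():
    phi += alpha * phi.grad                # phi <- phi + alpha * grad_phi J(phi)  (ascent)
    phi.grad.zero_()
```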
In the Actor Critic model, the policy parameter of the Actor is updated through gradient ascent. The Critic outputs a Q value that is driven closer to the target Q value, so it approximates the expected return. This expected return is then used to perform the gradient ascent that updates the Actor's policy parameter. The TD3 algorithm described below is proposed in the academic paper “Addressing Function Approximation Error in Actor-Critic Methods”.
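The sketch below shows one way such an Actor update could look in the deterministic DDPG/TD3 style, assuming PyTorch; the network sizes and the actor_update helper are illustrative assumptions, not code from the paper:

```python
import torch
import torch.nn as nn

# Sketch of the Actor update in an Actor-Critic model (assumed dims: state 3, action 1).
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_update(states):
    # The Critic's Q value approximates the expected return;
    # gradient ascent on it is implemented by minimising its negative.
    actions = actor(states)
    q_values = critic(torch.cat([states, actions], dim=1))
    actor_loss = -q_values.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Example call on a random batch of 32 states.
actor_update(torch.randn(32, 3))
```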
The Twin Delayed DDPG (TD3) model works in a continuous action space. DDPG stands for deep deterministic policy gradient: “deep” because it uses deep neural networks for both the Actor and the Critic. It combines ideas from policy gradient and deep Q learning. It uses two Critics, each with a target network, and the target networks add stability to the training process.
In the TD3 model, the Actor target outputs an action a'. A small clipped Gaussian noise is added to this action, and the noisy action together with the next state s' is fed into the two Critic targets, which output two Q values. We take the minimum of the two Q values, scale it by gamma and add the reward to obtain the target Q value. The target Q value is compared to the two Q values from the Critic models, which gives two loss values.
For Q learning in the TD3 model, we use back-propagation to minimise these loss values.
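Putting the two previous paragraphs together, here is a hedged sketch of the TD3 Critic update, assuming PyTorch; the dimensions, noise parameters, and the critic_update helper are illustrative assumptions rather than the paper's reference implementation:

```python
import copy
import torch
import torch.nn as nn

# Sketch of the TD3 critic update (assumed dims: state 3, action 1; hyperparameters illustrative).
state_dim, action_dim = 3, 1

def make_critic():
    return nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic1, critic2 = make_critic(), make_critic()
actor_target = copy.deepcopy(actor)
critic1_target, critic2_target = copy.deepcopy(critic1), copy.deepcopy(critic2)
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=1e-3)

gamma, noise_std, noise_clip, max_action = 0.99, 0.2, 0.5, 1.0

def critic_update(s, a, r, s_next, done):
    with torch.no_grad():
        # Actor target proposes the next action; clipped Gaussian noise smooths the target policy.
        a_next = actor_target(s_next)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-max_action, max_action)
        # Both Critic targets score (s', a'); taking the minimum curbs over-estimation.
        q1_t = critic1_target(torch.cat([s_next, a_next], dim=1))
        q2_t = critic2_target(torch.cat([s_next, a_next], dim=1))
        q_target = r + gamma * (1 - done) * torch.min(q1_t, q2_t)
    # Compare the target with both Critic predictions: two losses, minimised by back-propagation.
    q1 = critic1(torch.cat([s, a], dim=1))
    q2 = critic2(torch.cat([s, a], dim=1))
    loss = nn.functional.mse_loss(q1, q_target) + nn.functional.mse_loss(q2, q_target)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()

# Example call on a random batch of 32 transitions.
critic_update(torch.randn(32, 3), torch.randn(32, 1), torch.randn(32, 1),
              torch.randn(32, 3), torch.zeros(32, 1))
```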