TD Reinforcement Learning and Deep Q-Learning

Hyounjun Park

Background

Monte Carlo reinforcement learning requires that the returns for an entire episode be computed before any value estimates are available for use. The disadvantage of that approach is that the full set of returns is required before state values or action values can be estimated. However, there are algorithms which can compute these estimates in fewer steps, as few as one; these are known as temporal difference (TD) learning and Q-learning.

In this project, we will use temporal difference reinforcement learning and Deep Q-Learning to solve a robot navigation problem by finding optimal paths to a goal in a simplified warehouse environment.

The interaction between a reinforcement learning agent and the environment is illustrated in the figure below. Notice that the only feedback the agent receives from the environment is the reward and the state.


**Reinforcement Learning Agent and Environment**

Scenario : Warehouse robot navigation

The configuration of the warehouse environment is illustrated in the figure below.


**Grid World for Factory Navigation Example**

The goal is for the robot to deliver some material to position (state) 12, shown in blue. Since there is a goal state, or terminal state, this is an episodic task.

There are some barriers comprised of the states $\{ 6, 7, 8 \}$ and $\{ 16, 17, 18 \}$, shown with hash marks. In a real warehouse, these positions might be occupied by shelving or equipment. We do not want the robot to hit these barriers. Thus, we say that transitioning to these barrier states is taboo.

Of course, we do not want the robot to hit the edges of the grid world, which represent the outer walls of the warehouse.

Representation

As with many such problems, the starting place is creating the representation. In the cell below is our representation for the possible action-state transitions. From each state there are 4 possible actions:

  • up, u
  • down, d
  • left, l
  • right, r

There are a few special cases we need to consider:

  • Any action that would move the robot off the grid or into a barrier keeps the state unchanged.
  • Any action in the goal state keeps the state unchanged.
  • Any transition within the taboo (barrier) states keeps the state unchanged.
In [1]:
## import numpy for later use
import numpy as np
import numpy.random as nr
In [2]:
neighbors = {0:{'u':0, 'd':5, 'l':0, 'r':1},
          1:{'u':1, 'd':1, 'l':0, 'r':2},
          2:{'u':2, 'd':2, 'l':1, 'r':3},
          3:{'u':3, 'd':3, 'l':2, 'r':4},
          4:{'u':4, 'd':9, 'l':3, 'r':4},
          
          5:{'u':0, 'd':10, 'l':5, 'r':5},
          6:{'u':6, 'd':6, 'l':6, 'r':6},###barrier
          7:{'u':7, 'd':7, 'l':7, 'r':7},###barrier
          8:{'u':8, 'd':8, 'l':8, 'r':8},###barrier
          9:{'u':4, 'd':14, 'l':9, 'r':9},
          
          10:{'u':5, 'd':15, 'l':10, 'r':11},
          11:{'u':11, 'd':11, 'l':10, 'r':12},
          12:{'u':12, 'd':12, 'l':12, 'r':12},#goal
          13:{'u':13, 'd':13, 'l':12, 'r':14},
          14:{'u':9, 'd':19, 'l':13, 'r':14},
          
          15:{'u':10, 'd':20, 'l':15, 'r':15},
          16:{'u':16, 'd':16, 'l':16, 'r':16},###barrier
          17:{'u':17, 'd':17, 'l':17, 'r':17},###barrier
          18:{'u':18, 'd':18, 'l':18, 'r':18},###barrier
          19:{'u':14, 'd':24, 'l':19, 'r':19},
          
          20:{'u':15, 'd':20, 'l':20, 'r':21},
          21:{'u':21, 'd':21, 'l':20, 'r':22},
          22:{'u':22, 'd':22, 'l':21, 'r':23},
          23:{'u':23, 'd':23, 'l':22, 'r':24},
          24:{'u':19, 'd':24, 'l':23, 'r':24}}

Here are the initial transition probabilities for the Markov process. Let's set the probabilities for each transition to a uniform distribution, leading to random actions by the robot.

Note: As these are just starting values, the exact values of the transition probabilities are not actually all that important in terms of solving the RL problem. Also, notice that it does not matter how the taboo state transitions are encoded. The point of the RL algorithm is to learn the policy.

In [3]:
policy = {0:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          1:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          2:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          3:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          4:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          5:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          6:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          7:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          8:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          9:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          10:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          11:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          12:{'u':0, 'd':0, 'l':0, 'r':0},#goal
          13:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          14:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          15:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          16:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          17:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          18:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          19:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          
          20:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          21:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          22:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          23:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
          24:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}}
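
As a quick sanity check (not part of the original notebook), we can confirm that the action probabilities for each state form a valid distribution; a minimal sketch assuming the policy dictionary just defined:

## Sanity check: the action probabilities for each state should sum to 1.0,
## except the barrier and goal states, which were encoded with all-zero probabilities.
for s in policy.keys():
    p_sum = sum(policy[s].values())
    if p_sum not in (0.0, 1.0):
        print('State', s, 'probabilities sum to', p_sum)
print('Probability check complete')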

The robot receives the following rewards:

  • 10 for entering the goal, position 12.
  • -1 for attempting to leave the grid or to move into a barrier. In other words, we penalize the robot for hitting the edges of the grid or the barriers.
  • -0.1 for all other state transitions, which is the cost for the robot to move from one state to another. If we did not have this penalty, the robot could follow any random plan to the goal which did not hit the edges.

This reward structure is unknown to the RL agent. The agent must learn the rewards by sampling the environment.

Our representation of this reward structure will be used in the simulated environment.

In [4]:
reward = {0:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          1:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          2:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          3:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          4:{'u':-1, 'd':-0.1, 'l':-1, 'r':-0.1},
          
          5:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          6:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          7:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          8:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          9:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          
          10:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-0.1},
          11:{'u':-1, 'd':-1, 'l':-0.1, 'r':10},
          12:{'u':0, 'd':0, 'l':0, 'r':0},#goal
          13:{'u':-1, 'd':-1, 'l':10, 'r':-0.1},
          14:{'u':-0.1, 'd':-0.1, 'l':-0.1, 'r':-1},
          
          15:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          16:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          17:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          18:{'u':0, 'd':0, 'l':0, 'r':0},###barrier
          19:{'u':-0.1, 'd':-0.1, 'l':-1, 'r':-1},
          
          20:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-0.1},
          21:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          22:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          23:{'u':-1, 'd':-1, 'l':-0.1, 'r':-0.1},
          24:{'u':-0.1, 'd':-1, 'l':-0.1, 'r':-1}}

The list of taboo states.

In [5]:
taboo = [6,7,8,16,17,18]

TD Prediction Model

Definition of a general update model:

$$NewValue = OldValue + LearningRate * ErrorTerm$$

Following this general formulation, the update for the state value using the full return $G_t$ can be written as:

$$V(S_t) = V(S_t) + \alpha \big[ G_t - V(S_t) \big]$$

Here,
$\alpha = $ the learning rate,
$\big[ G_t - V(S_t) \big] = $ the error term, the difference between the return $G_t$ and the current value estimate $V(S_t)$.

Using the same general formulation, we can create a single time step update model known as the one-step temporal difference or TD(0) algorithm.

$$V(S_t) = V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]$$

Where,
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) = $ the TD error,
$R_{t+1} = $ the reward received at the next time step,
$\gamma = $ the discount factor,
$V(S_t) = $ the value estimate at time step $t$.

Like dynamic programming algorithms, the TD algorithm bootstraps: the estimated value of the next state, $V(S_{t+1})$, comes from previous samples. However, unlike MC RL, which does not bootstrap, the TD(0) algorithm produces an estimate of $V(S_t)$ after only one time step, rather than waiting to reach the terminal state at the end of the episode. This can be seen by examining the backup diagram shown below:


**Backup Diagram of TD(0)**
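
To make the bootstrap concrete, here is a small worked example with illustrative numbers (not taken from the results below). Suppose $\alpha = 0.2$, $\gamma = 0.9$, the current estimates are $V(S_t) = 0$ and $V(S_{t+1}) = 2$, and the transition yields reward $R_{t+1} = -0.1$. A single TD(0) update gives:

$$V(S_t) \leftarrow 0 + 0.2 \big[ -0.1 + 0.9 \times 2 - 0 \big] = 0.2 \times 1.7 = 0.34$$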

Environment

RL agents sample state values and state action values. The RL agent does not know how many states there are and which states are terminal. The RL agent learns a policy as it proceeds with the learning process. The probabilities of a particular state transition are determined by the current policy.

The code in the cell below simulates the environment. Given the current state and an action the function returns the next state and reward. Thus, this function simulates the interaction the agent will have with the environment.

In [6]:
def state_values(s, action, neighbors = neighbors, rewards = reward):
    """
    Function simulates the environment
    returns s_prime and reward given s and action
    """
    s_prime = neighbors[s][action]
    reward = rewards[s][action]
    return (s_prime,reward)

## Test the function
for a in ['u', 'd', 'r', 'l']:
    print(state_values(11, a))
    
(11, -1)
(11, -1)
(12, 10)
(10, -0.1)

We can see that for each action from state 11 the function returns the expected successor state and reward.

TD(0) Policy Evaluation

With our representations defined, we can now create and test a function to perform TD(0) policy evaluation for the grid world. The code in the cell below implements the TD(0) algorithm and applies it to the uniform random policy defined above.

Note: The algorithm makes calls to the aforementioned state_values function to find information on successor state and rewards for a transition given a current state and action.

In [7]:
def td_0_state_values(policy, n_samps, goal, alpha = 0.2, gamma = 0.9):
    """
    Function for TD(0) policy evaluation.
    Performs n_samps single-step TD(0) updates from randomly sampled states.
    """
    ## Initialize the state list and state values
    states = list(policy.keys())
    v = [0]*len(states)
    action_index = list(range(len(list(policy[0].keys()))))
    for _ in range(n_samps):
        ## Sample a state at random
        s = nr.choice(states, size =1)[0]
        probs = list(policy[s].values())
        if(s not in taboo+[goal]):
            ## Choose an action following the policy probabilities
            a = list(policy[s].keys())[nr.choice(action_index, size = 1, p = probs)[0]]
        else:
            a = list(policy[s].keys())[nr.choice(action_index, size = 1)[0]]
        ## One-step TD(0) update using the simulated transition
        transition = state_values(s, a)
        v[s] = v[s] + alpha * (transition[1] + gamma * v[transition[0]] - v[s])
    return(v)
    
   

We execute the function for 1,000 samples and examine the results.

In [8]:
nr.seed(345)    
np.round(np.array(td_0_state_values(policy, n_samps = 1000, goal = 12)).reshape((5,5)), 4) 
Out[8]:
array([[-2.1482, -2.349 , -2.6565, -2.6894, -2.4085],
       [-2.3095,  0.    ,  0.    ,  0.    , -2.6015],
       [ 2.0263,  4.7749,  0.    ,  0.9428, -1.7944],
       [-1.2251,  0.    ,  0.    ,  0.    , -2.3868],
       [-1.9846, -2.4509, -2.8161, -2.8375, -2.7372]])

We can check three traits of the estimated state values to confirm the algorithm operates correctly:

  1. The values of the goal and taboo states are 0.
  2. The highest value is adjacent to the terminal state, although only partially as expected: state 13, which is also adjacent to the goal, has a lower value than state 10, suggesting the estimates have not fully converged.
  3. The values of the states decrease as the distance from the terminal state increases.

The figure below illustrates the optimal path to the goal based on the policy the algorithm discovers.


On-Policy (SARSA) vs. Off-Policy (Q-Learning) Algorithms

In this project we will explore examples of two broad categories of RL algorithms known as on policy and off policy methods.

On policy methods evaluate and improve a single policy. They are known to have good convergence properties, converging quickly and often to a good solution. In general, exploration is performed using $\epsilon$-greedy methods. The TD(0) and MC algorithms we have examined are examples of on policy methods.

In contrast, off policy methods use two policies. The policy the agent is following is called the behavior policy, denoted $b(A_t | S_t)$. The policy being improved is known as the target policy, denoted $\pi (A_t | S_t)$. The agent obtains samples of the environment while following the behavior policy. These samples are used to improve the target policy. An advantage of off policy methods is that a deterministic behavior policy can be used while a better target policy is developed.

Q-learning has the following advantages and disadvantages compared to SARSA:

  • Q-learning directly learns the optimal policy, whilst SARSA learns a near-optimal policy whilst exploring. If we want to learn an optimal policy using SARSA, then we will need to decide on a strategy to decay ϵ in ϵ-greedy action choice, which may become a fiddly hyperparameter to tune.
  • Q-learning (and off-policy learning in general) has higher per-sample variance than SARSA, and may suffer from problems converging as a result. This turns up as a problem when training neural networks via Q-learning.
  • SARSA will approach convergence allowing for possible penalties from exploratory moves, whilst Q-learning will ignore them. That makes SARSA more conservative - if there is risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it when the exploration parameters are reduced. The classic toy problem that demonstrates this effect is called cliff walking.

Additionally, the critical point in practice is that the last item can make a big difference if mistakes are costly, e.g., when we are training a robot not in simulation but in the real world. We may prefer a more conservative learning algorithm that avoids high risk if there is real time and money at stake should the robot be damaged.

If our goal is to train an optimal agent in simulation, or in a low-cost and fast-iterating environment, then Q-learning is a good choice, due to the first point (learning optimal policy directly). If our agent learns online, and we care about rewards gained whilst learning, then SARSA may be a better choice.
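
Since $\epsilon$-greedy exploration comes up repeatedly in this discussion, here is a minimal sketch of an $\epsilon$-greedy action selector for this grid world. It is illustrative only and is not used by the functions below, which instead sample actions from the stochastic policy; it assumes a Q table laid out as (action, state), as in the code that follows.

def epsilon_greedy_action(Q, s, epsilon = 0.1):
    """
    Illustrative epsilon-greedy selection for state s, assuming a Q table
    laid out as (action, state). With probability epsilon pick a random
    action index, otherwise pick a greedy (maximum value) action, breaking
    ties at random.
    """
    n_actions = Q.shape[0]
    if(nr.uniform() < epsilon):
        return nr.choice(list(range(n_actions)), size = 1)[0]
    best = np.where(Q[:, s] == max(Q[:, s]))[0]
    return nr.choice(best, size = 1)[0]

## Example usage with a small random action value table
example_Q = nr.uniform(size = (4, 25))
print(epsilon_greedy_action(example_Q, s = 0))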

One Step SARSA Algorithm

Now that we have examined the one step TD(0) algorithm for policy (value) evaluation, we need to define a one step algorithm for control or policy improvement.

The state action reward state action or SARSA algorithm is an on policy method which uses temporal differencing to evaluate action values. As the name suggests, the algorithm starts with an on policy action from the current state, which results in a state transition and a reward. The backup diagram for one step SARSA (SARSA(0)) is shown in the figure below.


**Backup Diagram for SARSA(0)**

Comparing the backup diagram for SARSA(0) to the one for TD(0), notice that the order of state and action is reversed between the two algorithms. This is a good way to understand the difference.

The update equation for SARSA(0) is as follows:

$$Q(S_t,A_t) = Q(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t) \big]$$

Where,
$\delta_t = R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t) = $ the TD error,
$Q(S_t,A_t) = $ the action value for taking action $A_t$ in state $S_t$,
$R_{t+1} = $ the reward received at the next time step,
$\alpha = $ the learning rate,
$\gamma = $ the discount factor.
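
For a concrete example from this grid world, consider state 11 with action r, which enters the goal and earns reward 10. With $\alpha = 0.2$, $\gamma = 0.9$, all action values initialized to 0, and $Q(S_{t+1}, A_{t+1}) = 0$ at the terminal state, the first SARSA(0) update is:

$$Q(11, r) \leftarrow 0 + 0.2 \big[ 10 + 0.9 \times 0 - 0 \big] = 2.0$$

Repeated visits move this estimate toward 10, which is why the action values adjacent to the goal dominate the tables below.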

SARSA(0) Policy Improvement

Now we will perform policy improvement using the SARSA(0) algorithm.

In [9]:
import copy

def select_a_prime(s_prime, policy, action_index, greedy, goal):
    ## Select the next action (a prime) for state s_prime
    ## For non-terminal states, sample the action following the policy probabilities
    if(s_prime != goal and greedy): 
        probs = list(policy[s_prime].values())
        a_prime_index = nr.choice(action_index, size = 1, p = probs)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]
    else: ## Don't probability weight for the terminal state or non-greedy selection
        a_prime_index = nr.choice(action_index, size = 1)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]   
    return(a_prime_index, a_prime)


def SARSA_0(policy, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform SARSA(0) action value estimation.
    Note: exploration comes from sampling actions according to the stochastic
    policy; the epsilon argument is not used directly in this implementation.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(action_index),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample a state at random ensuring it is not terminal state
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+[goal]): s = nr.choice(states, size = 1)[0]
        ## Now choose action given policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = float('inf') # Value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state 
            ## The next state given the action
            s_prime, reward = state_values(s, a)
            a_prime_index, a_prime = select_a_prime(s_prime, current_policy, action_index, True, goal)
     
            ## Update the action values
            Q[a_index,s] = Q[a_index,s] + alpha * (reward + gamma * Q[a_prime_index,s_prime] - Q[a_index,s])
            
            ## Set action and state for next iteration
            a = a_prime
            a_index = a_prime_index
            s = s_prime

    return(Q)

Execute the code for 1,000 episodes with $\alpha = 0.2$ and $\epsilon = 0.1$.

In [10]:
Q = SARSA_0(policy, 1000, goal = 12, alpha = 0.2, epsilon = 0.1)

for i in range(4):
    print(np.round(Q[i,:].reshape((5,5)), 4))
[[-4.2282 -5.2765 -5.6288 -5.4639 -5.7089]
 [-3.586   0.      0.      0.     -4.5257]
 [-2.6974  2.564   0.      0.7909 -3.1976]
 [-0.5514  0.      0.      0.     -1.7296]
 [-2.0087 -5.3606 -5.1868 -5.1598 -3.2489]]
[[-2.4912 -4.8844 -5.3963 -5.4337 -3.73  ]
 [-0.9637  0.      0.      0.     -1.911 ]
 [-2.4687  0.3427  0.     -0.9032 -3.3185]
 [-3.3496  0.      0.      0.     -3.869 ]
 [-4.4689 -4.5704 -5.2045 -5.2181 -4.821 ]]
[[-4.2811 -3.464  -4.14   -4.6461 -5.5983]
 [-2.9792  0.      0.      0.     -4.5057]
 [-0.2172  0.3087  0.     10.      1.4675]
 [-3.6716  0.      0.      0.     -4.386 ]
 [-3.0332 -3.1384 -3.794  -4.4014 -4.4273]]
[[-4.4285 -4.2415 -4.4059 -4.3376 -4.2544]
 [-3.2421  0.      0.      0.     -4.5583]
 [ 4.698  10.      0.     -1.9334 -3.2374]
 [-3.9916  0.      0.      0.     -3.1249]
 [-4.0677 -4.3701 -4.5946 -3.9306 -4.4064]]

Checkpoint: the action values are 0 for the goal and taboo states, and the action values increase as the states get closer to the goal.

Generalized Policy Improvement

RL uses the idea of generalized policy improvement. GPI divides policy evaluation and policy improvement into two interacting processes and iterates between them. This can be done in quite a granular way, even a single episode, or even a single state, at a time. The figure below illustrates the concept of GPI.


**Concept of Generalized Policy Improvement**

The code in the cell below uses the GPI method to improve the policy for the grid world. The outer loop, or cycle, performs one iteration of GPI, which has two steps:

  1. At the start of the loop the action values for the current policy are evaluated, in this case using SARSA(0) over several episodes. Keep in mind that the agent's walk is determined by the policy, but the rewards accumulated come from the environment.
  2. Next, the policy is updated using the new action values. The policy is improved by increasing the transition probabilities for actions with higher value. To ensure that the algorithm continues to explore, transition probabilities are never set to 0 but rather to a small minimum value $\epsilon$, as sketched in the example after this list.
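
For example, with four actions and $\epsilon = 0.1$ as in the code below, if a single action has the largest action value the updated policy assigns it probability

$$p_{greedy} = 1 - (4 - 1)\,\epsilon = 1 - 3 \times 0.1 = 0.7, \qquad p_{other} = \epsilon = 0.1,$$

so the probabilities still sum to 1. This matches the 0.7/0.1 pattern visible in the learned policies printed later.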

GPI with SARSA(0)

The code in the cell below performs general policy improvement (GPI) using SARSA(0) to evaluate action values. Several cycles of SARSA(0) and policy improvement are performed with this code. Notice that policy improvement is $\epsilon$-greedy. The probability of any state transition will never go to zero, allowing continued exploration. Additional details on this algorithm can be seen by reading the code comments.

In [11]:
def SARSA_0_GPI(policy, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate the current (improved) policy with SARSA(0)
        Q = SARSA_0(current_policy, episodes = episodes, goal = goal, alpha = alpha, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transitions actually add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
In [12]:
SARSA_0_Policy = SARSA_0_GPI(policy, goal = 12, cycles = 10, episodes = 100)
SARSA_0_Policy
Out[12]:
{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 2: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 3: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 4: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 5: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 6: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 7: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 8: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 9: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 10: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 11: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 12: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 13: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 14: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 15: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 16: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 17: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 18: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 19: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 20: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 21: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 22: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 23: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 24: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7}}

Verification: from state 22 the most probable actions trace a shortest path to the goal (22, 23, 24, 19, 14, 13, 12). At state 2 the most probable action is 'r'; the left and right routes from state 2 are the same length, so either direction is consistent with a shortest path, although neighboring state 3 still prefers 'l', suggesting the policy is not fully converged there.

Q-Learning

As we have just seen, the SARSA algorithm is an on policy TD action value estimation method. The Q-learning algorithm is an off policy TD action value estimation method.

The update formula for single step Q-learning or Q-learning(0) is:

$$Q(S_t,A_t) = Q(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t) \big]$$

Where,
$\delta_t = R_{t+1} + \gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t) = $ the TD error,
$\max_a = $ the maximum operator applied over all possible actions in state $S_{t+1}$,
$Q(S_t,A_t) = $ the action value for taking action $A_t$ in state $S_t$,
$R_{t+1} = $ the reward received at the next time step,
$\alpha = $ the learning rate,
$\gamma = $ the discount factor.

The use of the $\max_a$ operator makes Q-learning greedy. But why does using this operator result in an off-policy algorithm? To answer this question, examine the backup diagram shown below.


**Backup Diagram for one-step Q-Learning**

The $\max_a$ operator greedily picks the action with the greatest value, regardless of the policy being followed. Therefore, Q-learning is an off-policy algorithm.
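
As an illustration with made-up numbers, the snippet below contrasts the bootstrap values the two algorithms would use for the same successor state. SARSA(0) bootstraps from whichever action the behavior policy actually samples next, whereas Q-learning always bootstraps from the maximum, regardless of which action is taken.

## Illustrative only: contrast the SARSA and Q-learning bootstrap values
## for one successor state, using made-up action values.
q_s_prime = np.array([-1.0, 2.0, 0.5, -0.3])
sampled_action = 2                               ## whatever the behavior policy happened to pick
sarsa_bootstrap = q_s_prime[sampled_action]      ## 0.5, depends on the behavior policy
q_learning_bootstrap = np.max(q_s_prime)         ## 2.0, independent of the behavior policy
print(sarsa_bootstrap, q_learning_bootstrap)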

In [13]:
def Q_learning_0(policy, neighbors, rewards, episodes, goal, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    possible_actions = list(rewards[0].keys())
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(possible_actions),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample an initial state at random but make sure it is not the goal
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+ [goal]): s = nr.choice(states, size = 1)[0]
        ## Now choose action following policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = n_states + 1 # Dummy value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state   
            ## Get s_prime given s and a
            s_prime = neighbors[s][a]
            
            ## Find the index or indices of maximum action values for s_prime
            ## Break any tie with multiple max values by random selection
            action_values = Q[:,s_prime]
            a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
            a_prime = possible_actions[a_prime_index]
            
            ## Lookup the reward 
            reward = rewards[s][a]
            
            ## Update the action values
            Q[a_index,s] = Q[a_index,s] + alpha * (reward + gamma * Q[a_prime_index,s_prime] - Q[a_index,s])
            
            ## Set action and state for next iteration
            a = a_prime
            a_index = a_prime_index
            s = s_prime

    return(Q)
In [14]:
QQ = Q_learning_0(policy, neighbors, reward, episodes=500, goal = 12)

for i in range(4):
    print(np.round(QQ[i,:].reshape((5,5)), 4))
[[4.6576 2.3519 3.0003 0.8702 1.6673]
 [4.5709 0.     0.     0.     4.5546]
 [3.0681 6.6243 0.     5.9028 5.0752]
 [7.91   0.     0.     0.     7.91  ]
 [7.019  1.6248 1.6534 3.5173 7.019 ]]
[[7.019  2.9334 2.5275 2.3373 7.0189]
 [7.91   0.     0.     0.     7.91  ]
 [3.5445 6.6524 0.     6.8731 5.1443]
 [4.6978 0.     0.     0.     3.6438]
 [4.5848 3.1666 2.3795 3.3294 4.7006]]
[[ 3.0979  6.2171  5.4916  1.2128  1.3888]
 [ 3.9946  0.      0.      0.      4.9742]
 [ 6.2168  4.455   0.     10.      8.9   ]
 [ 0.9325  0.      0.      0.      5.2453]
 [ 4.3729  6.2166  5.4281  1.0831  3.9313]]
[[ 4.2212  4.2177  3.1193  6.1009  5.4883]
 [ 5.2951  0.      0.      0.      4.4764]
 [ 8.9    10.      0.      5.2973  5.1199]
 [ 4.745   0.      0.      0.      5.2005]
 [ 3.3908  2.5138  2.1374  6.2125  1.3775]]

Double Q-Learning

The Q-learning algorithm just presented has a significant bias. To understand why, consider the following thought experiment. In most cases the sampled action values are inaccurate: some action values will have a positive error and some will have a negative error. If the errors are on the order of the values themselves, the $\max_a$ operator has a reasonable chance of selecting an action whose value appears largest only because of this error. However, the $\max_a$ operator will never select an action with a low value simply because of the errors. The net result is a bias toward action values with the largest positive errors.
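
The following small simulation (not part of the original notebook) illustrates this maximization bias: the true value of every action is 0, yet the maximum over noisy sample means is positive on average.

## Demonstrate maximization bias: the true action values are all zero, but the
## max over noisy estimates is biased upward.
n_actions, n_samples, n_trials = 4, 10, 5000
max_estimates = []
for _ in range(n_trials):
    ## Estimate each action's value by averaging a few noisy samples of a true value of 0
    estimates = nr.normal(loc = 0.0, scale = 1.0, size = (n_actions, n_samples)).mean(axis = 1)
    max_estimates.append(np.max(estimates))
print('Mean of the max over the estimates:', np.round(np.mean(max_estimates), 3))  ## noticeably greater than 0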

What can be done to correct this situation? One relatively simple and effective algorithm is known as double Q-learning. Double Q-learning maintains two tables of action values. The values from one table are used to perform the bootstrap updates of the other table and vice versa. This approach averages out the bias. For two tables, $Q_1$ and $Q_2$ we can express double Q-learning as follows:

$$Q_1(S_t,A_t) = Q_1(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma Q_2\big(S_{t+1}, \mathrm{argmax}_a Q_1(S_{t+1}, a)\big) - Q_1(S_t,A_t) \big] \\ Q_2(S_t,A_t) = Q_2(S_t,A_t) + \alpha \big[ R_{t+1} + \gamma Q_1\big(S_{t+1}, \mathrm{argmax}_a Q_2(S_{t+1}, a)\big) - Q_2(S_t,A_t) \big]$$

With probability 0.5, one or the other of these expressions is used for the TD update at each time step. While double Q-learning requires twice as much memory to maintain the two tables, the computational cost per step is essentially the same as Q-learning.

Apply Double Q-Learning

As a next step, we will apply double Q-learning(0) to the warehouse navigation problem. In the cell below we create and test a function to perform double Q-learning for this problem.

In [15]:
def double_Q_learning_0(policy, neighbors, rewards, episodes, goal, alpha = 0.2, gamma = 0.9):
    """
    Function to perform double Q-learning(0) control policy improvement.
    Note: only the Q1 action value table is returned.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    possible_actions = list(rewards[0].keys())
    action_index = list(range(len(list(policy[0].keys()))))
    Q1 = np.zeros((len(possible_actions),len(states)))
    Q2 = np.zeros((len(possible_actions),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample an initial state at random but make sure it is not the goal
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+[goal]): s = nr.choice(states, size = 1)[0]
        ## Now choose action following policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = n_states + 1 # Dummy value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state   
            ## Get s_prime given s and a
            s_prime = neighbors[s][a]
            
            ## Update one or the other action values at random
            if(nr.uniform() <= 0.5):
                ## Find the index or indices of maximum action values for s_prime
                ## Break any tie with multiple max values by random selection
                action_values = Q1[:,s_prime]
                a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
                a_prime = possible_actions[a_prime_index]
                ## Lookup the reward 
                reward = rewards[s][a]
                ## Update Q1 
                Q1[a_index,s] = Q1[a_index,s] + alpha * (reward + gamma * Q2[a_prime_index,s_prime] - Q1[a_index,s])
            
                ## Set action and state for next iteration
                a = a_prime
                a_index = a_prime_index
                s = s_prime
            
            else:
                ## Find the index or indices of maximum action values for s_prime
                ## Break any tie with multiple max values by random selection
                action_values = Q2[:,s_prime]
                a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
                a_prime = possible_actions[a_prime_index]
                ## Lookup the reward 
                reward = rewards[s][a]
                ## Update Q2
                Q2[a_index,s] = Q2[a_index,s] + alpha * (reward + gamma * Q1[a_prime_index,s_prime] - Q2[a_index,s])
            
                ## Set action and state for next iteration
                a = a_prime
                a_index = a_prime_index
                s = s_prime

    return(Q1)
In [16]:
double_Q = double_Q_learning_0(policy, neighbors, reward, 500, goal = 12)

for i in range(4):
    print(np.round(double_Q[i,:].reshape((5,5)), 4))
[[ 2.9053  1.0291 -0.0575  1.0663  0.7141]
 [ 3.5645  0.      0.      0.      4.1165]
 [ 2.0151  3.8248  0.      1.44    3.9789]
 [ 7.9073  0.      0.      0.      7.91  ]
 [ 6.8824  0.2245  0.3559  0.8978  7.0167]]
[[ 6.9182 -0.4105 -0.532   0.6655  7.0148]
 [ 7.9079  0.      0.      0.      7.9097]
 [ 2.3816  3.904   0.      2.752   4.4786]
 [ 1.1378  0.      0.      0.      3.5121]
 [ 0.5783 -0.726  -0.2829  1.2675  2.5116]]
[[ 3.6394  5.6451  0.8454  2.314   1.4896]
 [ 0.9125  0.      0.      0.      1.7264]
 [ 2.1984  3.5504  0.     10.      8.9   ]
 [ 0.9653  0.      0.      0.      3.265 ]
 [ 0.5424  4.0794  0.3063  2.1038  0.9592]]
[[ 0.4252  0.2685  4.9937  6.1278  2.7775]
 [ 1.7723  0.      0.      0.      3.6496]
 [ 8.9    10.      0.      5.2978  4.0568]
 [ 1.5804  0.      0.      0.      3.5042]
 [ 0.2847  2.0146  5.0325  6.179   3.0988]]

GPI with Double Q Learning

The code in the cell below applies generalized policy improvement (GPI) using double Q-learning(0) to estimate the action values. Additional details on this algorithm can be seen by reading the code comments.

In [17]:
def double_Q_learning_0_GPI(policy, neighbors, reward, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate the current (improved) policy with double Q-learning(0)
        Q = double_Q_learning_0(current_policy, neighbors, reward, episodes = episodes, goal = goal)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transitions actually add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
In [18]:
Double_Q_0_Policy = double_Q_learning_0_GPI(policy, neighbors, reward, cycles = 10, episodes = 500, goal = 12)
Double_Q_0_Policy 
Out[18]:
{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 2: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 3: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 4: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 5: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 6: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 7: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 8: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 9: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 10: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 11: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 12: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 13: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 14: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 15: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 16: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 17: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 18: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 19: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 20: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 21: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 22: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 23: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 24: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7}}

Checking the double Q-learning policy: from states 2 and 22 the most probable action is 'r', the same as in the SARSA(0) result above. From both of these states the left and right routes to the goal are equally long, so 'r' still follows a shortest path; tracing the most probable actions from either state reaches the goal in six moves.
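
To make such comparisons easier, here is a small helper (hypothetical, not in the original notebook) that reads off the most probable action in each state, so the SARSA(0) and double Q-learning policies can be compared side by side.

def greedy_actions(policy_dict):
    """Return the highest-probability action for each state of a policy dictionary."""
    return {s: max(actions, key = actions.get) for s, actions in policy_dict.items()}

## Compare the two learned policies state by state
print(greedy_actions(SARSA_0_Policy))
print(greedy_actions(Double_Q_0_Policy))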

N-Step Temporal Differencing

In the first part of this project we focused on one-step temporal difference (TD) algorithms, which update their estimates at every time step, while Monte Carlo methods wait until the end of an episode before updating. Now we will look at algorithms between these two extremes: n-step TD algorithms.

We can generalize the concept of the one-step bootstrapping of the return starting with TD(0):

$$G_{t:t+1} = R_{t+1} + \gamma V_{t}(S_{t+1})$$

Here, $V_{t}(S_{t+1})$ is being bootstrapped.

We can extend this formulation to two-step TD as follows:

$$G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V_{t+1}(S_{t+2})$$

In the above it is $V_{t+1}(S_{t+2})$ being bootstrapped.

For n-step TD the return is computed by the following bootstrap formulation.

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$$

Regardless of how many steps are used to compute the return, the value update is:

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \big[ G_{t:t+n} - V_{t+n-1}(S_t) \big],\ 0 \leq t < T$$

Where $\delta_t = G_{t:t+n} - V_{t+n-1}(S_t) = $ the n-step TD error.
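
As a small illustration of this formula (illustrative numbers only), the snippet below computes $G_{t:t+n}$ from the first n rewards after time t and a bootstrapped value estimate of the state reached after n steps.

def n_step_return(rewards, v_bootstrap, gamma = 0.9):
    """
    Compute G_{t:t+n} from the first n rewards after time t and the value
    estimate of the state reached after n steps (illustrative helper).
    """
    n = len(rewards)
    G = sum(gamma**i * r for i, r in enumerate(rewards))
    return G + gamma**n * v_bootstrap

## Example: four -0.1 step costs followed by a bootstrapped value of 5.0
print(round(n_step_return([-0.1, -0.1, -0.1, -0.1], 5.0), 4))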

The resulting backup diagrams for general n-step TD are shown in the figure below.


**Backup Diagram for n-step TD**

From left to right we can see how the backup diagrams add more states and actions. On the extreme right the backup diagram for the MC algorithm is shown. Notice that MC RL does not bootstrap, but samples until the terminal state is reached.

N-step TD methods are generally considered to have two advantages over single-step methods:

  1. N-step methods generally have better convergence properties.
  2. N-step methods break the link between the time between actions and the time at which bootstrapping occurs.

Apply N-Step TD Learning

Finally, we will apply N-Step TD learning and N-Step SARSA to the warehouse navigation problem.

In [19]:
def TD_n(policy, episodes, n, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform TD(N) policy evaluation.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    v = [0]*len(list(policy.keys()))
    
    current_policy = copy.deepcopy(policy)
    
    
    ## sample an initial state at random and make sure is not terminal state
    s = nr.choice(states, size = 1)[0]
    while(s in taboo+[goal]):
        s = nr.choice(states, size = 1)[0]  
        
    for _ in range(episodes): # Loop over the episodes
        T = float("inf")
        tau = 0
        reward_list = []
        t = 0
        
        while(tau != T - 1): # Episode ends where get to terminal state 
            if(t < T):
                ## Choose action given policy
                probs = list(policy[s].values())
                a = list(policy[s].keys())[nr.choice(action_index, size = 1, p = probs)[0]]
                ## The next state given the action
                s_prime, reward = state_values(s, a)
                reward_list.append(reward)  # append the reward to the list
                if(s_prime == goal): T = t + 1  # We reached the terminal state
                
            tau = t - n + 1 ## update the time step being updated

            if(tau >= 0): # Check if enough time steps to compute return
                ## Compute the return
                ## The formula for the first index in the loop is different from Sutton and Barto
                ## but seems to be correct at least for Python.
                G = 0.0 
                for i in range(tau, min(tau + n - 1, T)):
                    G = G + gamma**(i-tau) * reward_list[i]   
                ## Deal with case of where we are not at the terminal state
                if(tau + n < T): G = G + gamma**n * v[s_prime]
                ## Update v
                v[s] = v[s] + alpha * (G - v[s])
            
            ## Set state for next iteration
            if(s_prime != goal):
                s = s_prime
            t = t +1
    return(v)
In [20]:
np.round(np.array(TD_n(policy, episodes = 1000, n = 4, goal = 12, alpha = 0.2, gamma = 0.9)).reshape((5,5)), 4)
Out[20]:
array([[-4.119 , -3.8068, -3.6633, -4.7918, -3.8236],
       [-3.5519,  0.    ,  0.    ,  0.    , -3.2064],
       [ 0.2025,  7.4779,  0.    ,  3.8816, -2.7824],
       [-3.1107,  0.    ,  0.    ,  0.    , -4.2211],
       [-3.2287, -4.0398, -3.6634, -3.8334, -4.446 ]])

We can verify that the result we obtained appears correct.

The values of the goal and taboo states are all 0. Additionally, the state values decrease with distance from the goal.
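
The TD_n implementation above updates the value of the current state rather than the state visited at time $\tau$, which is a simplification relative to the Sutton and Barto formulation. For comparison, below is a minimal sketch that is closer to the textbook n-step TD algorithm: it stores the visited states so the update at time $\tau$ applies to $S_\tau$. It assumes the state_values function and the policy and taboo objects defined earlier, and is intended as an illustration rather than a drop-in replacement.

def TD_n_textbook(policy, episodes, n, goal, alpha = 0.2, gamma = 0.9):
    """
    Sketch of n-step TD prediction that stores the visited states so the
    state updated at time tau is the one actually visited at time tau.
    """
    states = list(policy.keys())
    action_index = list(range(len(list(policy[0].keys()))))
    v = [0.0]*len(states)
    for _ in range(episodes):
        ## Start each episode from a random non-terminal state
        s = nr.choice(states, size = 1)[0]
        while(s in taboo + [goal]): s = nr.choice(states, size = 1)[0]
        state_list = [s]
        reward_list = []
        T = float('inf')
        t = 0
        while(True):
            if(t < T):
                ## Sample an action from the policy and take one step
                probs = list(policy[s].values())
                a = list(policy[s].keys())[nr.choice(action_index, size = 1, p = probs)[0]]
                s, reward = state_values(s, a)
                state_list.append(s)
                reward_list.append(reward)
                if(s == goal): T = t + 1
            tau = t - n + 1  ## the time step whose state value is updated
            if(tau >= 0):
                ## n-step return: discounted rewards plus a bootstrapped value
                G = sum(gamma**(i - tau) * reward_list[i] for i in range(tau, min(tau + n, T)))
                if(tau + n < T): G = G + gamma**n * v[state_list[tau + n]]
                v[state_list[tau]] = v[state_list[tau]] + alpha * (G - v[state_list[tau]])
            if(tau == T - 1): break
            t = t + 1
    return(v)

print(np.round(np.array(TD_n_textbook(policy, episodes = 1000, n = 4, goal = 12)).reshape((5,5)), 4))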

N-step SARSA for control

Just as we used SARSA(0) for the control or policy improvement, we can generalize this algorithm to the n-step case. This is the SARSA(n) algorithm.

The n-step return is computed by the following bootstrap formulation.

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}),\ n \geq 1, 0 \leq t < T-n$$

Using this return, the action value update becomes:

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \big[ G_{t:t+n} - Q_{t+n-1}(S_t,A_t) \big],\ 0 \leq t < T$$

Where $\delta_t = G_{t:t+n} - Q_{t+n-1}(S_t,A_t) = $ the n-step error.

In [21]:
def SARSA_n(policy, episodes, n, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform SARSA(N) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(action_index),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample a state at random and make sure is not terminal state
        s = nr.choice(states, size = 1)[0]
        while(s in taboo+[goal]):
            s = nr.choice(states, size = 1)[0]
        ## Now choose action given policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        t = 0 # Initialize the time step count
        T = float("inf")
        tau = 0
        reward_list = []
        while(tau != T - 1): # Episode ends where get to terminal state 
            if(t < T):
                ## The next state given the action
                s_prime, reward = state_values(s, a)
                reward_list.append(reward)  # append the reward to the list
                if(s_prime == goal): T = t + 1  # We reached the terminal state
                else:
                    # Select and store the next action using the policy
                    a_prime_index, a_prime = select_a_prime(s_prime, current_policy, action_index, True, goal)
                
                
            tau = t - n + 1 ## update the time step being updated
  
            if(tau >= 0): # Check if enough time steps to compute return
                ## Compute the return
                ## The formula for the first index in the loop is different from Sutton and Barto
                ## but seems to be correct at least for Python.
                G = 0.0 
                for i in range(tau, min(tau + n, T)):
                    G = G + gamma**(i-tau) * reward_list[i]   
                ## Deal with case of where we are not at the terminal state
                if(tau + n < T): G = G + gamma**n * Q[a_prime_index,s_prime]
                ## Finally, update Q
                Q[a_index,s] = Q[a_index,s] + alpha * (G - Q[a_index,s])
            
            ## Set action and state for next iteration
            if(s_prime != goal):
                s = s_prime   
                a = a_prime 
                a_index = a_prime_index
                
            
            ## increment t
            t = t + 1
    return(Q)
In [22]:
Q = SARSA_n(policy, episodes = 1000, n = 4, goal = 12, alpha = 0.2, gamma = 0.9)

for i in range(4):
    print(np.round(Q[i,:].reshape((5,5)), 4))
[[-4.8314 -6.0364 -6.4035 -6.124  -5.662 ]
 [-4.7546  0.      0.      0.     -5.3932]
 [-4.7726 -1.1951  0.     -2.8841 -4.6767]
 [-3.6235  0.      0.      0.     -4.3611]
 [-4.1346 -5.6899 -5.7401 -6.5775 -4.8216]]
[[-4.7786 -5.5577 -6.1351 -6.0685 -4.5772]
 [-4.5579  0.      0.      0.     -3.6685]
 [-4.3975 -0.9344  0.     -3.2095 -4.6363]
 [-4.4945  0.      0.      0.     -4.4534]
 [-5.3988 -5.149  -5.5325 -6.4859 -6.1119]]
[[-5.3401 -4.6512 -5.4976 -5.9829 -5.5633]
 [-5.0836  0.      0.      0.     -4.7166]
 [-4.2079 -4.1256  0.      8.3106 -1.5408]
 [-5.0304  0.      0.      0.     -5.2128]
 [-4.4834 -4.578  -4.9972 -4.9428 -4.7617]]
[[-4.8867 -5.5297 -5.3266 -4.7102 -4.6549]
 [-4.7365  0.      0.      0.     -5.1647]
 [ 0.2925  8.424   0.     -4.0164 -4.2909]
 [-5.2342  0.      0.      0.     -5.5486]
 [-4.5428 -4.8035 -4.9351 -4.4463 -6.085 ]]

GPI with N-step SARSA

Finally, the code in the cell below executes the GPI algorithm using n-step SARSA to evaluate the action values. Details of the algorithm can be found by reading the code comments.

In [23]:
def SARSA_n_GPI(policy, n, cycles, episodes, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate the current (improved) policy with n-step SARSA
        Q = SARSA_n(current_policy, episodes, n, goal = goal, alpha = alpha, gamma = gamma, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transitions actually add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 
In [24]:
SARSA_N_Policy = SARSA_n_GPI(policy, n = 4, cycles = 5, episodes = 500, goal = 12, alpha = 0.2, epsilon = 0.1)
SARSA_N_Policy
Out[24]:
{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 2: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 3: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 4: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 5: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 6: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 7: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 8: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 9: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 10: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 11: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 12: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 13: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 14: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 15: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 16: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 17: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 18: {'d': 0.25, 'l': 0.25, 'r': 0.25, 'u': 0.25},
 19: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 20: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 21: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 22: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 23: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 24: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7}}

We can examine our results here. The most probable paths to the goal from states 2 and 22 are the shortest possible: the policy chooses left at state 2 and right at state 22, both of which lead along shortest paths to the goal. The helper below traces these paths explicitly.
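
As a final check (a hypothetical helper, not part of the original notebook), we can trace the most probable path from any start state by repeatedly following the highest-probability action. It assumes the neighbors dictionary and the SARSA_N_Policy computed above. With the policy shown above, both traces should reach the goal in six moves.

def trace_path(policy_dict, start, goal, max_steps = 25):
    """Follow the highest-probability action from each state until the goal is reached."""
    path = [start]
    s = start
    while(s != goal and len(path) <= max_steps):
        a = max(policy_dict[s], key = policy_dict[s].get)
        s = neighbors[s][a]
        path.append(s)
    return(path)

print(trace_path(SARSA_N_Policy, 2, 12))
print(trace_path(SARSA_N_Policy, 22, 12))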

Thanks.