[note][reinforcement-learning] (0) using reinforcement learning to solve FrozenLake

Using reinforcement learning to solve FrozenLake

We will use the OpenAI Gym library to solve the FrozenLake problem.

Gym provides an easy way to experiment with agents in grid-world games.

The FrozenLake environment consists of a 4x4 grid of blocks: a start block, a goal block, safe frozen blocks, and dangerous holes.

The objective is to move the agent from the start block to the goal block without falling into a hole.

At each step, the agent can choose to move one block in one of four directions: up, down, left, or right.

There may also be a wind that occasionally blows the agent onto a block it didn't choose.

As such, perfect performance every time is impossible.

But learning to avoid the holes and reach the goal is still doable.

The reward for each step is 0, except for entering the goal block, which gives a reward of 1.

Thus, we will need an algorithm that learns long-term expected rewards.

In its simplest form, Q-learning maintains a table of values, with one row for every state and one column for every action possible in the environment.

In the case of the FrozenLake problem, we have 16 states and 4 actions, giving us a 16x4 table of Q-values.

We start by initializing the table to be uniform (all zeros), and then as we observe the rewards we obtain for various actions, we update the table accordingly.

We update the table using the Bellman equation. It states that the expected long-term reward for a given action is equal to the immediate reward from the current action plus the discounted expected reward from the best future action taken at the following state. In this way, we reuse our own Q-table when estimating how to update our table for future actions! In equation form, the update rule looks like this:

Q(s, a) = r + γ · max_a′ Q(s′, a′)

This says that the Q-value for a given state (s) and action (a) should represent the current reward (r) plus the maximum discounted (γ) future reward expected according to our own table for the next state (s′) we would end up in. The discount factor lets us decide how important possible future rewards are compared to the present reward. By updating in this way, the table slowly begins to obtain accurate measures of the expected future reward for a given action in a given state.
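Before the full script below, here is a minimal sketch (an addition, not from the original post; the function name and defaults are illustrative) of this update as a standalone function, blending the Bellman target into the current estimate with a learning rate lr:

import numpy as np

def q_update(Q, s, a, r, s1, lr=0.85, gamma=0.99):
    # move Q[s, a] toward the Bellman target r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s1, :])
    Q[s, a] += lr * (target - Q[s, a])
    return Q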

import gym
import numpy as np
env = gym.make('FrozenLake-v0') # Gym ships with the FrozenLake environment, so the name string must match exactly
[2016-08-28 18:52:39,722] Making new env: FrozenLake-v0
# initialize the Q-table with zeros: one row per state (16), one column per action (4)
Q = np.zeros((env.observation_space.n, env.action_space.n))
# learning parameters

lr = 0.85            # learning rate
y = 0.99             # discount factor (gamma)

num_episodes = 2000  # number of training episodes

rList = []
for ii in range(num_episodes):
    j = 0
    # reset environment, get first observation
    s = env.reset()
    rAll = 0
    d = False
    while j < 99:
        j += 1
        # choose the best action from the Q-table, plus random noise that decays over episodes (exploration)
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * 1.0 / (ii + 1))
        s1, r, d, _ = env.step(a)
        # update the Q-table, moving Q[s, a] toward the Bellman target r + y * max(Q[s1, :])
        Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        rAll += r
        s = s1
        if d:
            break
    rList.append(rAll)
print("Score over time: " +  str(sum(rList)/num_steps))
print("Final Q-Table Values")
print(Q)
Score over time: 0.35
Final Q-Table Values
[[  6.13590716e-03   6.14249158e-03   5.29860236e-01   1.56022837e-02]
 [  1.91698554e-04   2.95870014e-04   2.10001305e-04   3.33685432e-01]
 [  2.24586661e-03   2.56720079e-03   3.65946835e-03   2.80146626e-01]
 [  1.45247907e-03   1.82229915e-04   1.21586370e-03   2.62976533e-01]
 [  7.62102747e-01   9.49716764e-05   1.21329204e-03   1.12768247e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.64798004e-01   4.95678153e-05   7.58364558e-04   8.00482449e-10]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.99071293e-03   1.31990383e-05   2.12142655e-03   5.37417612e-01]
 [  9.48052771e-04   8.17525930e-01   9.00640853e-04   1.09892593e-03]
 [  8.95540010e-01   4.88582407e-04   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  7.92931304e-04   0.00000000e+00   9.71283575e-01   2.33591697e-03]
 [  0.00000000e+00   0.00000000e+00   9.99524630e-01   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]]
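The raw table is hard to read; as a quick sketch (an addition, not from the original), we can recover the greedy policy it encodes, i.e. the best action per state (in FrozenLake-v0: 0 = left, 1 = down, 2 = right, 3 = up):

policy = np.argmax(Q, axis=1)  # best action for each of the 16 states
print(policy.reshape(4, 4))    # view the policy on the 4x4 grid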

A table works for 16 states and 4 actions, but it cannot scale to environments with far larger state spaces. We instead need some way to take a description of our state and produce Q-values for actions without a table:

that is where neural networks come in.

By acting as a function approximator, a neural network can take any number of possible states that can be represented as a vector and learn to map them to Q-values.

In the case of the FrozenLake example, we will be using a one-layer network which takes the state encoded as a one-hot vector (1x16) and produces a vector of 4 Q-values, one for each action.
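Concretely, a state index s can be one-hot encoded with np.identity(16)[s:s+1], exactly as the training loop below does; here is a minimal numpy sketch (an illustration only, not the actual network) of the resulting forward pass:

s = 5                                    # an example state index
one_hot = np.identity(16)[s:s+1]         # shape (1, 16): a single 1 at column s
W = np.random.uniform(0, 0.01, (16, 4))  # stands in for the network's weights
Qout = one_hot.dot(W)                    # shape (1, 4): one Q-value per action
a = np.argmax(Qout, axis=1)              # greedy action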

Such a simple network acts kind of like a glorified table, with the network weights serving as the old table cells.

The key difference is that we can easily expand the TensorFlow network with added layers, activation functions, and different input types, whereas all of that is impossible with a regular table.
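For instance, here is a hedged sketch (not part of the original post; the hidden size of 32 is an arbitrary assumption) of how the one-layer network defined below could be expanded with a hidden layer and a ReLU activation, where inputs1 is the same (1, 16) placeholder used in the code that follows:

W1 = tf.Variable(tf.random_uniform((16, 32), 0, 0.01))  # hidden-layer weights
hidden = tf.nn.relu(tf.matmul(inputs1, W1))             # ReLU activation
W2 = tf.Variable(tf.random_uniform((32, 4), 0, 0.01))   # output-layer weights
Qout = tf.matmul(hidden, W2)                            # still 4 Q-values, one per action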

The method of updating is a little different as well.

Instead of directly updating our table, with a network we will be using backpropagation and a loss function.

Our loss function will be sum-of-squares loss, where the difference between the current predicted Q-values and the "target" value is computed and the gradients passed through the network.

In this case, our Q-target for the chosen action is equivalent to the Q-value computed in the equation above.

Loss = Σ (Q_target − Q)²

import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('FrozenLake-v0')
[2016-08-28 22:40:31,086] Making new env: FrozenLake-v0
tf.reset_default_graph()
#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=(1,16),dtype=tf.float32)
W = tf.Variable(tf.random_uniform((16,4),0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=(1,4),dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)
init = tf.initialize_all_variables()

# Set learning parameters
y = .99              # discount factor (gamma)
e = 0.1              # epsilon for epsilon-greedy exploration
num_episodes = 2000  # number of training episodes
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            j+=1
            #Choose an action by greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            #Get new state and reward from environment
            s1,r,d,_ = env.step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values
            _,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
            rAll += r
            s = s1
            if d:
                # reduce the chance of a random action as we train the model
                e = 1. / ((i // 50) + 10)
                break
        jList.append(j)
        rList.append(rAll)
print "Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%"
Percent of succesful episodes: 0.442%
plt.plot(rList)