[note][reinforcement-learning] (0) using reinforcement learning to solve FrozenLake

Using reinforcement learning to solve FrozenLake

We will use the OpenAI Gym library to solve the FrozenLake problem.

Gym provides an easy way to experiment with agents in grid-world games.

The FrozenLake environment consists of a 4x4 grid of blocks: a start block, a goal block, safe frozen blocks, and dangerous holes.

The objective is to move the agent from the start block to the goal block without falling into a hole.

At each step, the agent can choose to move one block in one of four directions: up, down, left, or right.

There may also be a wind that occasionally blows the agent onto a block it didn't choose.

As such, perfect performance every time is impossible.

But learning to avoid the holes and reach the goal is still doable.

The reward for each step is 0, except for entering the goal block, which gives a reward of 1.

Thus, we will need an algorithm that learns long-term expected rewards.

In its simplest form, Q-learning maintains a table of values, with one row for every state and one column for every action possible in the environment.

In the case of the FrozenLake problem, we have 16 states and 4 actions, giving us a 16x4 table of Q-values.

We start by initializing the table to be uniform (all zeros), and then as we observe the rewards we obtain for various actions, we update the table accordingly.

We update the table using the Bellman equation. It states that the expected long-term reward for a given action is equal to the immediate reward from the current action plus the discounted expected reward from the best future action taken at the following state. In this way, we reuse our own Q-table when estimating how to update our table for future actions! In equation form, the update rule looks like this:

Q(s, a) = r + γ · max_a′ Q(s′, a′)

This says that the Q-value for a given state (s) and action (a) should represent the current reward (r) plus the maximum discounted (γ) future reward expected according to our own table for the next state (s′) we would end up in. The discount factor lets us decide how important possible future rewards are compared to the present reward. By updating in this way, the table slowly begins to obtain accurate measures of the expected future reward for a given action in a given state.
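Before the full script below, here is a minimal sketch (an addition, not from the original post; the function name and defaults are illustrative) of this update as a standalone function, blending the Bellman target into the current estimate with a learning rate lr:

import numpy as np

def q_update(Q, s, a, r, s1, lr=0.85, gamma=0.99):
    # move Q[s, a] toward the Bellman target r + gamma * max_a' Q(s', a')
    target = r + gamma * np.max(Q[s1, :])
    Q[s, a] += lr * (target - Q[s, a])
    return Q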

import gym
import numpy as np
env = gym.make('FrozenLake-v0') # Gym ships with the FrozenLake environment, so the name string must match exactly
[2016-08-28 18:52:39,722] Making new env: FrozenLake-v0
# initialize the Q-table with zeros: one row per state (16), one column per action (4)
Q = np.zeros((env.observation_space.n, env.action_space.n))
# learning parameters

lr = 0.85            # learning rate
y = 0.99             # discount factor (gamma)

num_episodes = 2000  # number of training episodes

rList = []
for ii in range(num_episodes):
    j = 0
    # reset environment, get first observation
    s = env.reset()
    rAll = 0
    d = False
    while j < 99:
        j += 1
        # choose the best action from the Q-table, plus random noise that decays over episodes (exploration)
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * 1.0 / (ii + 1))
        s1, r, d, _ = env.step(a)
        # update the Q-table, moving Q[s, a] toward the Bellman target r + y * max(Q[s1, :])
        Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        rAll += r
        s = s1
        if d:
            break
    rList.append(rAll)
print("Score over time: " +  str(sum(rList)/num_steps))
print("Final Q-Table Values")
print(Q)
Score over time: 0.35
Final Q-Table Values
[[  6.13590716e-03   6.14249158e-03   5.29860236e-01   1.56022837e-02]
 [  1.91698554e-04   2.95870014e-04   2.10001305e-04   3.33685432e-01]
 [  2.24586661e-03   2.56720079e-03   3.65946835e-03   2.80146626e-01]
 [  1.45247907e-03   1.82229915e-04   1.21586370e-03   2.62976533e-01]
 [  7.62102747e-01   9.49716764e-05   1.21329204e-03   1.12768247e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.64798004e-01   4.95678153e-05   7.58364558e-04   8.00482449e-10]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  3.99071293e-03   1.31990383e-05   2.12142655e-03   5.37417612e-01]
 [  9.48052771e-04   8.17525930e-01   9.00640853e-04   1.09892593e-03]
 [  8.95540010e-01   4.88582407e-04   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  7.92931304e-04   0.00000000e+00   9.71283575e-01   2.33591697e-03]
 [  0.00000000e+00   0.00000000e+00   9.99524630e-01   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]]
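The raw table is hard to read; as a quick sketch (an addition, not from the original), we can recover the greedy policy it encodes, i.e. the best action per state (in FrozenLake-v0: 0 = left, 1 = down, 2 = right, 3 = up):

policy = np.argmax(Q, axis=1)  # best action for each of the 16 states
print(policy.reshape(4, 4))    # view the policy on the 4x4 grid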

A table works for 16 states and 4 actions, but it cannot scale to environments with far larger state spaces. We instead need some way to take a description of our state and produce Q-values for actions without a table:

that is where neural networks come in.

By acting as a function approximator, a neural network can take any number of possible states that can be represented as a vector and learn to map them to Q-values.

In the case of the FrozenLake example, we will be using a one-layer network which takes the state encoded as a one-hot vector (1x16) and produces a vector of 4 Q-values, one for each action.
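Concretely, a state index s can be one-hot encoded with np.identity(16)[s:s+1], exactly as the training loop below does; here is a minimal numpy sketch (an illustration only, not the actual network) of the resulting forward pass:

s = 5                                    # an example state index
one_hot = np.identity(16)[s:s+1]         # shape (1, 16): a single 1 at column s
W = np.random.uniform(0, 0.01, (16, 4))  # stands in for the network's weights
Qout = one_hot.dot(W)                    # shape (1, 4): one Q-value per action
a = np.argmax(Qout, axis=1)              # greedy action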

Such a simple network acts kind of like a glorified table, with the network weights serving as the old table cells.

The key difference is that we can easily expand the TensorFlow network with added layers, activation functions, and different input types, whereas all of that is impossible with a regular table.
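For instance, here is a hedged sketch (not part of the original post; the hidden size of 32 is an arbitrary assumption) of how the one-layer network defined below could be expanded with a hidden layer and a ReLU activation, where inputs1 is the same (1, 16) placeholder used in the code that follows:

W1 = tf.Variable(tf.random_uniform((16, 32), 0, 0.01))  # hidden-layer weights
hidden = tf.nn.relu(tf.matmul(inputs1, W1))             # ReLU activation
W2 = tf.Variable(tf.random_uniform((32, 4), 0, 0.01))   # output-layer weights
Qout = tf.matmul(hidden, W2)                            # still 4 Q-values, one per action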

The method of updating is a little different as well.

Instead of directly updating our table, with a network we will be using backpropagation and a loss function.

Our loss function will be sum-of-squares loss, where the difference between the current predicted Q-values and the "target" value is computed and the gradients passed through the network.

In this case, our Q-target for the chosen action is equivalent to the Q-value computed in the equation above.

Loss = Σ (Q_target − Q)²

import gym
import numpy as np
import random
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
env = gym.make('FrozenLake-v0')
[2016-08-28 22:40:31,086] Making new env: FrozenLake-v0
tf.reset_default_graph()
#These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=(1,16),dtype=tf.float32)
W = tf.Variable(tf.random_uniform((16,4),0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

#Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=(1,4),dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
updateModel = trainer.minimize(loss)
init = tf.initialize_all_variables()

# Set learning parameters
y = .99              # discount factor (gamma)
e = 0.1              # epsilon for epsilon-greedy exploration
num_episodes = 2000  # number of training episodes
#create lists to contain total rewards and steps per episode
jList = []
rList = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        #Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        d = False
        j = 0
        #The Q-Network
        while j < 99:
            j+=1
            #Choose an action by greedily (with e chance of random action) from the Q-network
            a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            #Get new state and reward from environment
            s1,r,d,_ = env.step(a[0])
            #Obtain the Q' values by feeding the new state through our network
            Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
            #Obtain maxQ' and set our target value for chosen action.
            maxQ1 = np.max(Q1)
            targetQ = allQ
            targetQ[0,a[0]] = r + y*maxQ1
            #Train our network using target and predicted Q values
            _,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
            rAll += r
            s = s1
            if d:
                # reduce the chance of a random action as we train the model
                e = 1. / ((i // 50) + 10)
                break
        jList.append(j)
        rList.append(rAll)
print "Percent of succesful episodes: " + str(sum(rList)/num_episodes) + "%"
Percent of succesful episodes: 0.442%
plt.plot(rList)