Reinforcement Learning in Practice: Using the Dyna-Q Algorithm to Examine How Different Planning Steps Affect Learning and Policy in a Robot Problem (Very Detailed, Source Code Included)

Summary: Reinforcement learning in practice: using the Dyna-Q algorithm to examine how different planning steps affect learning and policy in a robot problem (very detailed, source code included).

If you need the source code, please like, follow, and bookmark, then leave your QQ in the comments and send me a private message~~~

I. Models, Learning, and Planning: An Overview

1: Models

An agent can use a model to predict the environment and respond accordingly. The model referred to here is usually a simulation model: given a state and an action, the model predicts the next state and the reward.

Models generally fall into two types: distribution models and sample models.

Distribution model: generates all possible outcomes together with their probability distribution.

Sample model: produces a single outcome sampled from all the possibilities.

Functionally, a model is used to simulate the environment and generate simulated experience. Compared with a sample model, a distribution model contains more information, but in real tasks it is usually hard to obtain all the state-transition probabilities.
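To make the distinction concrete, here is a minimal sketch (hypothetical toy transitions, not part of this article's code) of the same (state, action) pair viewed through a distribution model and through a sample model:

import random

# Distribution model: enumerates every possible (next_state, reward) outcome
# of the pair ("s0", "right") together with its probability (values are made up).
distribution_model = {
    ("s0", "right"): [(("s1", 0.0), 0.8),
                      (("s0", -1.0), 0.2)],
}

# Sample model: returns just one outcome, sampled according to those probabilities.
def sample_model(state, action):
    outcomes, probs = zip(*distribution_model[(state, action)])
    return random.choices(outcomes, weights=probs, k=1)[0]

print(sample_model("s0", "right"))  # e.g. ('s1', 0.0)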

2: Learning

Learning works on real experience generated by the environment. Depending on how that experience is used, learning can be divided into direct reinforcement learning and model learning.

Direct reinforcement learning: real experience is collected in the real environment and used directly to update the value function or the policy, so it is not affected by model bias.

Model learning: real experience collected in the real environment is used to build and improve the simulation model, raising its accuracy so that it approximates the real environment more closely.
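A minimal sketch of tabular model learning, in the same spirit as the make_model method of the Agent class later in this article (the coordinates below are only an illustration): the model simply memorises the last observed (reward, next_state) for each (state, action) pair.

# Deterministic tabular model: memorise the last observed outcome per (state, action).
model = {}

def learn_model(state, action, reward, next_state):
    model[(state, action)] = (reward, next_state)

def simulate(state, action):
    # Replay the remembered outcome; assumes the pair has been visited before.
    return model[(state, action)]

learn_model((0, 4), 3, 0.0, (1, 4))
print(simulate((0, 4), 3))  # (0.0, (1, 4))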

3: Planning

Planning works on a simulated environment or a learned model: value functions are updated from simulated experience in order to improve the policy. Both learning and planning evaluate value functions through backup operations; the difference is that during planning the agent does not interact with the real environment.

Planning is usually divided into state-space planning and plan-space planning. State-space planning searches for an optimal policy in the state space, and all value-function computations are based on states; this kind of planning is usually viewed as a search method. Its basic idea is as follows (a minimal planning-backup sketch is given after these two points):

1: All planning algorithms compute a value function as the key intermediate step of policy improvement.

2: All planning algorithms can compute the value function from simulated experience generated by the model.
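As a rough illustration of such a planning backup (a sketch assuming a deterministic tabular model and Q-learning-style backups, which is what the q_planning routine later in this article implements), one planning step replays a remembered transition and updates Q from it without touching the real environment:

import random

def planning_step(Q, model, alpha=0.1, gamma=0.9, n_actions=4):
    # Sample a previously observed (state, action) pair from the model
    # and apply a Q-learning backup to its simulated outcome.
    (s, a), (r, s_next) = random.choice(list(model.items()))
    best_next = max(Q.get((s_next, b), 0.0) for b in range(n_actions))
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)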

II. The Dyna-Q Architecture and Algorithm

The Dyna-Q architecture contains the main functions an online planning agent needs. It combines learning and planning organically and is a fusion of model-based and model-free methods. Its data come both from real experience sampled in the real environment and from simulated experience sampled from the simulation model, and the value function or policy is updated through direct reinforcement learning or indirect (model-based) reinforcement learning.

The architecture diagram is shown below.
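The loop the figure describes can also be summarised in code. Below is a simplified sketch of tabular Dyna-Q (the env interface and the fixed epsilon used here are assumptions for illustration, not the exact classes from the listings later in this article):

import random

def dyna_q(env, episodes, n_planning, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q, model = {}, {}
    n_actions = env.action_space.n
    q = lambda s, a: Q.get((s, a), 0.0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection from the current Q table
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: q(s, b))
            s2, r, done, _ = env.step(a)                      # real experience
            target = r if done else r + gamma * max(q(s2, b) for b in range(n_actions))
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))  # direct RL
            model[(s, a)] = (r, s2, done)                     # model learning
            for _ in range(n_planning):                       # planning from simulated experience
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * max(q(ps2, b) for b in range(n_actions))
                Q[(ps, pa)] = q(ps, pa) + alpha * (ptarget - q(ps, pa))
            s = s2
    return Q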

III. Effect of Different Planning Steps on the Number of Learning Steps in Dyna-Q

For the construction and background of the robot environment, see the following link:

Robot environment

Here we compare the effect of different numbers of planning steps on the experiment. The robot receives a reward of -10 when it leaves the boundary or hits an obstacle, +1 when it reaches the charging station, and 0 otherwise. Different planning-step settings require different numbers of episodes; the visualization results are shown below.

The code is as follows.

import gym
from gym import spaces
from gym.utils import seeding
from random import random, choice
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
class Grid(object):
    def __init__(self, x:int = None,
                       y:int = None,
                       type:int = 0,
                       reward:float = 0.0):
        self.x = x                          # x coordinate
        self.y = y                          # y coordinate
        self.type = type                    # cell type (0: empty; 1: obstacle or boundary)
        self.reward = reward                # immediate reward of this cell
        self.name = None                    # name of this cell
        self._update_name()
    def _update_name(self):
        self.name = "X{0}-Y{1}".format(self.x, self.y)
    def __str__(self):
        return "name:{3}, x:{0}, y:{1}, type:{2}".format(self.x,
                                                                    self.y,
                                                                    self.type,
                                                                    self.name
                                                                    )
class GridMatrix(object):
    def __init__(self, n_width:int,                 # number of cells horizontally
                       n_height:int,                # number of cells vertically
                       default_type:int = 0,        # default cell type
                       default_reward:float = 0.0,  # default immediate reward
                       ):
        self.grids = None
        self.n_height = n_height
        self.n_width = n_width
        self.len = n_width * n_height
        self.default_reward = default_reward
        self.default_type = default_type
        self.reset()
    def reset(self):
        self.grids = []
        # iterate rows (y) then columns (x) so the list index equals y * n_width + x,
        # matching the index computed in get_grid
        for y in range(self.n_height):
            for x in range(self.n_width):
                self.grids.append(Grid(x,
                                       y,
                                       self.default_type,
                                       self.default_reward))
    def get_grid(self, x, y=None):
        '''Get the information of one cell.
        args: coordinates given either as x, y or as a single tuple x
        return: Grid object
        '''
        xx, yy = None, None
        if isinstance(x, int):
            xx, yy = x, y
        elif isinstance(x, tuple):
            xx, yy = x[0], x[1]
        assert(xx >= 0 and yy >= 0 and xx < self.n_width and yy < self.n_height), "coordinates must be within the grid"
        index = yy * self.n_width + xx
        return self.grids[index]
    def set_reward(self, x, y, reward):
        grid = self.get_grid(x, y)
        if grid is not None:
            grid.reward = reward
        else:
            raise("grid doesn't exist")
    def set_type(self, x, y, type):
        grid = self.get_grid(x, y)
        if grid is not None:
            grid.type = type
        else:
            raise("grid doesn't exist")
    def get_reward(self, x, y):
        grid = self.get_grid(x, y)
        if grid is None:
            return None
        return grid.reward
    def get_type(self, x, y):
        grid = self.get_grid(x, y)
        if grid is None:
            return None
        return grid.type
# Grid world environment
class GridWorldEnv(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 30
    }
    def __init__(self, n_width: int=5,
                       n_height: int = 5,
                       u_size=40,
                       default_reward: float = 0.0,
                       default_type=0):
        self.u_size = u_size                        # size (in pixels) of each cell when rendering
        self.n_width = n_width                      # width of the grid world (in cells)
        self.n_height = n_height                    # height of the grid world (in cells)
        self.width = u_size * n_width               # screen width
        self.height = u_size * n_height             # screen height
        self.default_reward = default_reward
        self.default_type = default_type
        self.grids = GridMatrix(n_width=self.n_width,
                                n_height=self.n_height,
                                default_reward=self.default_reward,
                                default_type=self.default_type)
        self.reward = 0                             # for rendering
        self.action = None                          # for rendering
        # 0,1,2,3 represent up, down, left, right
        self.action_space = spaces.Discrete(4)
        # the observation space is the set of discrete cell indices
        self.observation_space = spaces.Discrete(self.n_height * self.n_width)
        self.ends = [(0, 0)]  # coordinates of terminal cells; there can be more than one
        self.start = (0, 4)  # coordinate of the start cell; there is only one
        self.types = [(2, 2, 1)]
        self.rewards = []
        self.refresh_setting()
        self.viewer = None  # rendering viewer object
        self.seed()  # create a random seed
        self.reset()
    def seed(self, seed=None):
        # create the seed needed for randomization and return an np_random object for later random operations
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
    def step(self, action):
        assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))
        self.action = action                        # action for rendering
        old_x, old_y = self._state_to_xy(self.state)
        new_x, new_y = old_x, old_y
        if action == 0: new_y += 1  # up
        elif action == 1: new_y -= 1  # down
        elif action == 2: new_x -= 1  # left
        elif action == 3: new_x += 1  # right
        # boundary effect
        if new_x < 0: new_x = 0
        if new_x >= self.n_width: new_x = self.n_width - 1
        if new_y < 0: new_y = 0
        if new_y >= self.n_height: new_y = self.n_height - 1
        # wall effect:
        # cells of type 1 are obstacles and cannot be entered
        if self.grids.get_type(new_x, new_y) == 1:
            new_x, new_y = old_x, old_y
        self.reward = self.grids.get_reward(new_x, new_y)
        done = self._is_end_state(new_x, new_y)
        self.state = self._xy_to_state(new_x, new_y)
        # all grid-world information is provided in info
        info = {"x": new_x, "y": new_y, "grids": self.grids}
        return self.state, self.reward, done, info
    # convert a state index to x, y coordinates
    def _state_to_xy(self, s):
        x = s % self.n_width
        y = int((s - x) / self.n_width)
        return x, y
    def _xy_to_state(self, x, y=None):
        if isinstance(x, int):
            assert (isinstance(y, int)), "incomplete Position info"
            return x + self.n_width * y
        elif isinstance(x, tuple):
            return x[0] + self.n_width * x[1]
        return -1  # unknown state
    def refresh_setting(self):
        '''After creating the grid world, the user may change the type or reward of some
        cells; call this method afterwards to make those settings take effect.
        '''
        for x, y, r in self.rewards:
            self.grids.set_reward(x, y, r)
        for x, y, t in self.types:
            self.grids.set_type(x, y, t)
    def reset(self):
        self.state = self._xy_to_state(self.start)
        return self.state
    # check whether a state is terminal
    def _is_end_state(self, x, y=None):
        if y is not None:
            xx, yy = x, y
        elif isinstance(x, int):
            xx, yy = self._state_to_xy(x)
        else:
            assert (isinstance(x, tuple)), "incomplete coordinate data"
            xx, yy = x[0], x[1]
        for end in self.ends:
            if xx == end[0] and yy == end[1]:
                return True
        return False
    # graphical rendering
    def render(self, mode='human', close=False):
        if close:
            if self.viewer is not None:
                self.viewer.close()
                self.viewer = None
            return
        zero = (0, 0)
        u_size = self.u_size
        m = 2                                       # gap between cells
        # if the viewer has not been created yet, initialize all screen elements
        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(self.width, self.height)
            # draw the cells
            for x in range(self.n_width):
                for y in range(self.n_height):
                    v = [(x * u_size + m, y * u_size + m),
                         ((x + 1) * u_size - m, y * u_size + m),
                         ((x + 1) * u_size - m, (y + 1) * u_size - m),
                         (x * u_size + m, (y + 1) * u_size - m)]
                    rect = rendering.FilledPolygon(v)
                    r = self.grids.get_reward(x, y) / 10
                    if r < 0:
                        rect.set_color(0.9 - r, 0.9 + r, 0.9 + r)
                    elif r > 0:
                        rect.set_color(0.3, 0.5 + r, 0.3)
                    else:
                        rect.set_color(0.9, 0.9, 0.9)
                    self.viewer.add_geom(rect)
                    # draw the cell outline
                    v_outline = [(x * u_size + m, y * u_size + m),
                                 ((x + 1) * u_size - m, y * u_size + m),
                                 ((x + 1) * u_size - m, (y + 1) * u_size - m),
                                 (x * u_size + m, (y + 1) * u_size - m)]
                    outline = rendering.make_polygon(v_outline, False)
                    outline.set_linewidth(3)
                    if self._is_end_state(x, y):
                        # add a golden outline to terminal cells
                        outline.set_color(0.9, 0.9, 0)
                        self.viewer.add_geom(outline)
                    if self.start[0] == x and self.start[1] == y:
                        outline.set_color(0.5, 0.5, 0.8)
                        self.viewer.add_geom(outline)
                    if self.grids.get_type(x, y) == 1:  # obstacle cells are shown in dark gray
                        rect.set_color(0.3, 0.3, 0.3)
                    else:
                        pass
            # draw the agent
            self.agent = rendering.make_circle(u_size / 4, 30, True)
            self.agent.set_color(1.0, 1.0, 0.0)
            self.viewer.add_geom(self.agent)
            self.agent_trans = rendering.Transform()
            self.agent.add_attr(self.agent_trans)
            # update the agent position
        x, y = self._state_to_xy(self.state)
        self.agent_trans.set_translation((x + 0.5) * u_size, (y + 0.5) * u_size)
        return self.viewer.render(return_rgb_array= mode == 'rgb_array')
# Dyna-Q agent
class Agent():
    def __init__(self, env):
        self.episode = 1
        self.Q = {}
        self.actions = [0, 1, 2, 3]
        self.position = env.start
        self.model = {}
    # build the model (record observed transitions)
    def make_model(self, pos, act, reward, next_state):
        self.model["{0},{1}".format(pos, act)] = "{0},{1}".format(reward, next_state)
    # planning from the model: replay n transitions sampled from the learned model
    def q_planning(self, n):
        for i in range(0, n):
            keys = [k for k in self.model.keys()]
            done = False
            if keys != []:
                key = choice(keys)
                pos = key.split(",")[0] + "," + key.split(",")[1]
                act = int(key.split(",")[2])
                reward = float(self.model[key].split(",")[0])
                next_state = self.model[key].split(",")[1] + "," + self.model[key].split(",")[2]
                # str() of a tuple contains a space, e.g. "(0, 0)", so compare against that form
                if next_state == "(0, 0)" or next_state == "(4, 3)":
                    done = True
                self.updateQ(pos, act, next_state, reward, done)
    # ensure a Q entry exists for (pos, act); initialize it with a tiny random value if absent
    def chaxunQ(self, pos, act):
        judge = False
        for i in self.Q:
            if i == "{0},{1}".format(pos, act):
                judge = True
                break
        if judge == True:
            return True
        else:
            self.Q["{0},{1}".format(pos, act)] = float(format(random()/10000, '.3f'))
            return
    # update the state-action value: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    def updateQ(self, pos, action, next_pos, reward, done):
        if done == False:
            self.chaxunQ(pos, action)
            old_q = self.Q["{0},{1}".format(pos, action)]
            action1 = self.performmax(next_pos)
            # self.chaxunQ(next_pos, action1)
            new_q = self.Q["{0},{1}".format(next_pos, action1)]
            old_q = old_q + 0.1 * (reward+0.9 * new_q - old_q)
            self.Q["{0},{1}".format(pos, action)] = float(format(old_q, '.3f'))
        else:
            self.chaxunQ(pos, action)
            self.Q["{0},{1}".format(pos, action)] = float(format(reward, '.3f'))
            # print(pos, action,reward)
    # action-selection policy: epsilon-greedy with exploration decaying as 1/episode
    def perform(self, pos):
        epsilon = random()
        self.chaxunQ(pos, choice([0, 1, 2, 3]))
        if epsilon > 1/self.episode:
            maxq = -1000
            act = ""
            for i in self.Q:
                list = i.split(",")
                state = list[0] + "," + list[1]
                if state == str(pos):
                    if self.Q[i] > maxq:
                        maxq = self.Q[i]
                        act = list[2]
            return int(act)
        else:
            return choice([0, 1, 2, 3])
    # argmaxQ
    def performmax(self, pos):
        maxq = -1000
        str1 = ""
        self.chaxunQ(pos,choice([0,1,2,3]))
        for i in self.Q:
            list = i.split(",")
            state = list[0]+","+list[1]
            if state == str(pos):
                if self.Q[i] > maxq:
                    maxq = self.Q[i]
                    str1 = list[2]
        return int(str1)
def run(n):
    agent = Agent(env)
    total_j = 0
    total_r = 0
    a = []
    b = []
    env.refresh_setting()
    for i in range(0, 300):
        done = False
        env.reset()
        r = 0
        j = 0
        while done == False:
            state = env._state_to_xy(env.state)
            action = agent.perform(state)
            next_state, reward, done, info = env.step(action)
            next_state = env._state_to_xy(next_state)
            # update the Q value from the real transition (direct RL)
            agent.updateQ(state, action, next_state, reward, done)
            # update the model with the real transition
            agent.make_model(state, action, reward, next_state)
            r += reward
            # planning from the model
            agent.q_planning(n)
            j = j + 1
        total_r += r
        a.append(total_r)
        agent.episode += 1
        # print(agent.Q)
        total_j += j
        if i != 0:
            b.append(j)
        print("回合={0},步数={1},奖赏={2}".format(i, j, '%.3f' % r))
    # return (np.array(a)/np.array(total_j)).tolist()
    # return a
    return b
if __name__ == "__main__":
    n_width = 5
    n_height = 5
    default_reward = 0
    env = GridWorldEnv(n_width, n_height, default_reward=default_reward)
    env.types = [(2, 2, 1)]
    env.rewards = [(0, 0, 1), (4, 3, 1)]  # reward settings
    env.start = (0, 4)
    env.ends = [(0, 0), (4, 3)]
    env.refresh_setting()
    x = range(1, 300)
    ln1, = plt.plot(x, run(0), label=u"n=0")
    ln2, = plt.plot(x, run(5), label=u"n=5")
    ln3, = plt.plot(x, run(10), label=u"n=10")
    ln4, = plt.plot(x, run(30), label=u"n=30")
    font1 = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=15)
    plt.legend()
    plt.xlabel(u'Episodes', fontproperties=font1)
    plt.ylabel(u'Steps', fontproperties=font1)
    plt.show()

IV. Effect of the Dyna-Q Algorithm on the Policy

Similarly, the policies the agent obtains over the episodes differ with the value of n: with n = 50 learning is clearly faster and the resulting policy covers more of the state space.
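Before the full listing, a small helper sketch (illustrative only, assuming the string-keyed Q table of the Agent class below, with keys of the form "(x, y),action") that extracts the greedy action per state so the learned policies for different n can be compared:

def greedy_policy(Q):
    # Q keys look like "(x, y),a"; keep, for each state, the action with the largest Q value.
    best = {}
    for key, value in Q.items():
        state, action = key.rsplit(",", 1)
        if state not in best or value > best[state][1]:
            best[state] = (int(action), value)
    return {state: act for state, (act, _) in best.items()}

# Hypothetical usage:
# greedy_policy({"(4, 0),0": 0.1, "(4, 0),3": 0.4})  ->  {'(4, 0)': 3}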

The code is as follows.

from random import random, choice
import gym
from gym import spaces
from gym.utils import seeding
class Grid(object):
    def __init__(self, x:int = None,
                       y:int = None,
                       type:int = 0,
                       reward:float = 0.0):
        self.x = x                          # x coordinate
        self.y = y                          # y coordinate
        self.type = type                    # cell type (0: empty; 1: obstacle or boundary)
        self.reward = reward                # immediate reward of this cell
        self.name = None                    # name of this cell
        self._update_name()
    def _update_name(self):
        self.name = "X{0}-Y{1}".format(self.x, self.y)
    def __str__(self):
        return "name:{3}, x:{0}, y:{1}, type:{2}".format(self.x,
                                                                    self.y,
                                                                    self.type,
                                                                    self.name
                                                                    )
class GridMatrix(object):
    def __init__(self, n_width:int,                 # number of cells horizontally
                       n_height:int,                # number of cells vertically
                       default_type:int = 0,        # default cell type
                       default_reward:float = 0.0,  # default immediate reward
                       ):
        self.grids = None
        self.n_height = n_height
        self.n_width = n_width
        self.len = n_width * n_height
        self.default_reward = default_reward
        self.default_type = default_type
        self.reset()
    def reset(self):
        self.grids = []
        # iterate rows (y) then columns (x) so the list index equals y * n_width + x,
        # matching the index computed in get_grid
        for y in range(self.n_height):
            for x in range(self.n_width):
                self.grids.append(Grid(x,
                                       y,
                                       self.default_type,
                                       self.default_reward))
    def get_grid(self, x, y=None):
        '''Get the information of one cell.
        args: coordinates given either as x, y or as a single tuple x
        return: Grid object
        '''
        xx, yy = None, None
        if isinstance(x, int):
            xx, yy = x, y
        elif isinstance(x, tuple):
            xx, yy = x[0], x[1]
        assert(xx >= 0 and yy >= 0 and xx < self.n_width and yy < self.n_height), "coordinates must be within the grid"
        index = yy * self.n_width + xx
        return self.grids[index]
    def set_reward(self, x, y, reward):
        grid = self.get_grid(x, y)
        if grid is not None:
            grid.reward = reward
        else:
            raise("grid doesn't exist")
    def set_type(self, x, y, type):
        grid = self.get_grid(x, y)
        if grid is not None:
            grid.type = type
        else:
            raise("grid doesn't exist")
    def get_reward(self, x, y):
        grid = self.get_grid(x, y)
        if grid is None:
            return None
        return grid.reward
    def get_type(self, x, y):
        grid = self.get_grid(x, y)
        if grid is None:
            return None
        return grid.type
class GridWorldEnv(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 30
    }
    def __init__(self, n_width: int=5,
                       n_height: int = 5,
                       u_size=40,
                       default_reward: float = 0.0,
                       default_type=0):
        self.u_size = u_size                        # size (in pixels) of each cell when rendering
        self.n_width = n_width                      # width of the grid world (in cells)
        self.n_height = n_height                    # height of the grid world (in cells)
        self.width = u_size * n_width               # screen width
        self.height = u_size * n_height             # screen height
        self.default_reward = default_reward
        self.default_type = default_type
        self.grids = GridMatrix(n_width=self.n_width,
                                n_height=self.n_height,
                                default_reward=self.default_reward,
                                default_type=self.default_type)
        self.reward = 0                             # for rendering
        self.action = None                          # for rendering
        # 0,1,2,3 represent up, down, left, right
        self.action_space = spaces.Discrete(4)
        # the observation space is the set of discrete cell indices
        self.observation_space = spaces.Discrete(self.n_height * self.n_width)
        self.ends = [(0, 0)]  # coordinates of terminal cells; there can be more than one
        self.start = (0, 4)  # coordinate of the start cell; there is only one
        self.types = [(2, 2, 1)]
        self.rewards = []
        self.refresh_setting()
        self.viewer = None  # rendering viewer object
        self.seed()  # create a random seed
        self.reset()
    def seed(self, seed=None):
        # create the seed needed for randomization and return an np_random object for later random operations
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
    def step(self, action):
        assert self.action_space.contains(action), "%r (%s) invalid" % (action, type(action))
        self.action = action                        # action for rendering
        old_x, old_y = self._state_to_xy(self.state)
        new_x, new_y = old_x, old_y
        if action == 0: new_y += 1  # up
        elif action == 1: new_y -= 1  # down
        elif action == 2: new_x -= 1  # left
        elif action == 3: new_x += 1  # right
        # boundary effect
        if new_x < 0: new_x = 0
        if new_x >= self.n_width: new_x = self.n_width - 1
        if new_y < 0: new_y = 0
        if new_y >= self.n_height: new_y = self.n_height - 1
        # wall effect:
        # cells of type 1 are obstacles and cannot be entered
        if self.grids.get_type(new_x, new_y) == 1:
            new_x, new_y = old_x, old_y
        self.reward = self.grids.get_reward(new_x, new_y)
        done = self._is_end_state(new_x, new_y)
        self.state = self._xy_to_state(new_x, new_y)
        # all grid-world information is provided in info
        info = {"x": new_x, "y": new_y, "grids": self.grids}
        return self.state, self.reward, done, info
    # convert a state index to x, y coordinates
    def _state_to_xy(self, s):
        x = s % self.n_width
        y = int((s - x) / self.n_width)
        return x, y
    def _xy_to_state(self, x, y=None):
        if isinstance(x, int):
            assert (isinstance(y, int)), "incomplete Position info"
            return x + self.n_width * y
        elif isinstance(x, tuple):
            return x[0] + self.n_width * x[1]
        return -1  # unknown state
    def refresh_setting(self):
        '''After creating the grid world, the user may change the type or reward of some
        cells; call this method afterwards to make those settings take effect.
        '''
        for x, y, r in self.rewards:
            self.grids.set_reward(x, y, r)
        for x, y, t in self.types:
            self.grids.set_type(x, y, t)
    def reset(self):
        self.state = self._xy_to_state(self.start)
        return self.state
    # check whether a state is terminal
    def _is_end_state(self, x, y=None):
        if y is not None:
            xx, yy = x, y
        elif isinstance(x, int):
            xx, yy = self._state_to_xy(x)
        else:
            assert (isinstance(x, tuple)), "incomplete coordinate data"
            xx, yy = x[0], x[1]
        for end in self.ends:
            if xx == end[0] and yy == end[1]:
                return True
        return False
    # graphical rendering
    def render(self, mode='human', close=False):
        if close:
            if self.viewer is not None:
                self.viewer.close()
                self.viewer = None
            return
        zero = (0, 0)
        u_size = self.u_size
        m = 2                                       # gap between cells
        # if the viewer has not been created yet, initialize all screen elements
        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(self.width, self.height)
            # draw the cells
            for x in range(self.n_width):
                for y in range(self.n_height):
                    v = [(x * u_size + m, y * u_size + m),
                         ((x + 1) * u_size - m, y * u_size + m),
                         ((x + 1) * u_size - m, (y + 1) * u_size - m),
                         (x * u_size + m, (y + 1) * u_size - m)]
                    rect = rendering.FilledPolygon(v)
                    r = self.grids.get_reward(x, y) / 10
                    if r < 0:
                        rect.set_color(0.9 - r, 0.9 + r, 0.9 + r)
                    elif r > 0:
                        rect.set_color(0.3, 0.5 + r, 0.3)
                    else:
                        rect.set_color(0.9, 0.9, 0.9)
                    self.viewer.add_geom(rect)
                    # draw the cell outline
                    v_outline = [(x * u_size + m, y * u_size + m),
                                 ((x + 1) * u_size - m, y * u_size + m),
                                 ((x + 1) * u_size - m, (y + 1) * u_size - m),
                                 (x * u_size + m, (y + 1) * u_size - m)]
                    outline = rendering.make_polygon(v_outline, False)
                    outline.set_linewidth(3)
                    if self._is_end_state(x, y):
                        # add a golden outline to terminal cells
                        outline.set_color(0.9, 0.9, 0)
                        self.viewer.add_geom(outline)
                    if self.start[0] == x and self.start[1] == y:
                        outline.set_color(0.5, 0.5, 0.8)
                        self.viewer.add_geom(outline)
                    if self.grids.get_type(x, y) == 1:  # obstacle cells are shown in dark gray
                        rect.set_color(0.3, 0.3, 0.3)
                    else:
                        pass
            # draw the agent
            self.agent = rendering.make_circle(u_size / 4, 30, True)
            self.agent.set_color(1.0, 1.0, 0.0)
            self.viewer.add_geom(self.agent)
            self.agent_trans = rendering.Transform()
            self.agent.add_attr(self.agent_trans)
            # update the agent position
        x, y = self._state_to_xy(self.state)
        self.agent_trans.set_translation((x + 0.5) * u_size, (y + 0.5) * u_size)
        return self.viewer.render(return_rgb_array= mode == 'rgb_array')
# Dyna-Q agent
class Agent():
    def __init__(self, env):
        self.episode = 1
        self.Q = {}
        self.actions = [0, 1, 2, 3]
        self.position = env.start
        self.model = {}
    def make_model(self, pos, act, reward, next_state):
        self.model["{0},{1}".format(pos, act)] = "{0},{1}".format(reward, next_state)
    # planning from the model: replay n transitions sampled from the learned model
    def q_planning(self, n):
        for i in range(0, n):
            keys = [k for k in self.model.keys()]
            done = False
            if keys != []:
                key = choice(keys)
                pos = key.split(",")[0] + "," + key.split(",")[1]
                act = int(key.split(",")[2])
                reward = float(self.model[key].split(",")[0])
                next_state = self.model[key].split(",")[1] + "," + self.model[key].split(",")[2]
                # str() of a tuple contains a space, e.g. "(8, 5)", so compare against that form
                if next_state == "(8, 5)" or next_state == "(1, 6)":
                    done = True
                self.updateQ(pos, act, next_state, reward, done)
    # ensure a Q entry exists for (pos, act); initialize it with a tiny random value if absent
    def chaxunQ(self, pos, act):
        judge = False
        for i in self.Q:
            if i == "{0},{1}".format(pos, act):
                judge = True
                break
        if judge == True:
            return True
        else:
            self.Q["{0},{1}".format(pos, act)] = float(format(random()/10000, '.3f'))
            return
    # update the state-action value: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    def updateQ(self, pos,action,next_pos,reward,done):
        if done == False:
            self.chaxunQ(pos, action)
            old_q = self.Q["{0},{1}".format(pos, action)]
            action1 = self.performmax(next_pos)
            new_q = self.Q["{0},{1}".format(next_pos, action1)]
            old_q = old_q + 0.1 * (reward+0.9 * new_q - old_q)
            self.Q["{0},{1}".format(pos, action)] = float(format(old_q, '.3f'))
        else:
            self.chaxunQ(pos, action)
            self.Q["{0},{1}".format(pos, action)] = float(format(reward, '.3f'))
    # action-selection policy: epsilon-greedy with exploration decaying as 1/episode
    def perform(self, pos):
        epsilon = random()
        self.chaxunQ(pos, choice([0, 1, 2, 3]))
        if epsilon > 1/self.episode:
            maxq = -1000
            act = ""
            for i in self.Q:
                list = i.split(",")
                state = list[0] + "," + list[1]
                if state == str(pos):
                    if self.Q[i] > maxq:
                        maxq = self.Q[i]
                        act = list[2]
            return int(act)
        else:
            return choice([0, 1, 2, 3])
    # argmaxQ
    def performmax(self, pos):
        maxq = -1000
        str1 = ""
        self.chaxunQ(pos,choice([0,1,2,3]))
        for i in self.Q:
            list = i.split(",")
            state = list[0]+","+list[1]
            if state == str(pos):
                if self.Q[i] > maxq:
                    maxq = self.Q[i]
                    str1 = list[2]
        return int(str1)
def run(n):
    agent = Agent(env)
    total_j = 0
    total_r = 0
    a = []
    b = []
    env.refresh_setting()
    for i in range(0, 300):
        done = False
        env.reset()
        r = 0
        j = 0
        while done == False and j < 50:
            j = j + 1
            state = env._state_to_xy(env.state)
            action = agent.perform(state)
            next_state, reward, done, info = env.step(action)
            next_state = env._state_to_xy(next_state)
            # update the model with the real transition
            agent.make_model(state, action, reward, next_state)
            # update the Q value from the real transition (direct RL)
            agent.updateQ(state, action, next_state, reward, done)
            r += reward
            # planning from the model
            agent.q_planning(n)
        if i >= 2:
            total_r += r
            a.append(total_r)
        agent.episode += 1
        action_Q = {}
        for i in sorted(list(agent.Q.keys()), key=lambda x: [x[1], x[4], x[5], x[-1]]):
            action_Q[i] = agent.Q[i]
        print("n={0}:Q值{1}".format(n, action_Q))
        # P_A: the policy in each state (probability over the four actions)
        P_A = {}
        Q_keys = list(action_Q.keys())
        for i in range(0, len(Q_keys)-4, 4):
            temp_action_list = []
            max_a_value = max(action_Q[Q_keys[j]] for j in range(i, i+4))
            # temp_num: marks which of the four actions attain the maximum value
            temp_num = [0, 0, 0, 0]
            # PA: probabilities assigned to the four actions
            PA = [0, 0, 0, 0]
            for k in range(i, i+4):
                if action_Q[Q_keys[k]] == max_a_value:
                    temp_action_list.append(Q_keys[k])
                    temp_num[k-i] = 1
            valid_action_p = round(1/len(temp_action_list), 2)
            for m in range(4):
                if temp_num[m] == 1:
                    PA[m] = valid_action_p
            P_A[Q_keys[i][0:-2]] = PA
        print("Q_A: ", P_A)
        total_j += j
        b.append(j)
    return a
if __name__ == "__main__":
    n_width = 11
    n_height = 7
    default_reward = -0.1
    env = GridWorldEnv(n_width, n_height, default_reward=default_reward)
    env.types = [(1, 2, 1), (2, 2, 1), (3, 2, 1), (4, 2, 1), (5, 2, 1), (6, 2, 1), (7, 2, 1), (8, 2, 1), (9, 2, 1),
                 (10, 2, 1)]
    env.rewards = [(8, 5, 1), (1, 6, 1)]  # reward settings
    env.start = (4, 0)  # starting coordinate of the robot
    env.ends = [(8, 5), (1, 6)]  # coordinates of the absorbing (terminal) states
    env.refresh_setting()
    for i in range(0, 2):
        print("episode=", i+1)
        run(0)
    for i in range(0, 2):
        print("episode=", i+1)
        run(50)

Creating this took a lot of effort; if you found it helpful, please like, follow, and bookmark~~~
