1 Introduction
In a search and rescue scenario, a robot must navigate through a disaster area, identifying survivors to be rescued while preserving its own safety. Many tradeoffs must be weighed, such as whether to exploit a known area or explore a new one. Acquired knowledge about the safety levels of certain areas, the likelihood of finding survivors, and the urgency of attention each survivor needs are all factors that must be accounted for in the robot's route planning.
This applied scenario is essentially a foraging task in which beliefs about foraging sites and the safety of areas can be modeled with Bayesian inference methods. In this project, I simulate small environments with different arrangements and configurations of foraging sites and observe how explorative and exploitative agents behave as they search the environment.
Methods
The project was completed in two stages. The first stage focuses on getting the agent to effectively learn the changing danger levels in an environment. The second stage adds the element of foraging.
Experiment 1: Inferring the safety levels of different regions of the map
Figure 1
As seen in Figure 1, the environment is simulated as a square 2D grid. Gray squares indicate off-limits areas, while every other square is part of the valid path. Squares of increasing redness indicate the danger level of the area, a value that ranges from 1 to 100. A random destination, marked by a star, is set for the agent, marked by a dot, which starts out in the home square, marked by a dark outline and star pattern. The left side of the figure represents the true state of the world, whereas the right side represents the agent's beliefs about the world. As the initial maps show, the agent starts out with no knowledge of the dangers on the map, but it is already aware of the entire physical layout and able to calculate multiple routes to reach the assigned destination.
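For concreteness, such a grid can be encoded as a numeric matrix in which NaN marks an off-limits square and every other entry is that square's danger rating; this mirrors the real_grid and belief_grid variables in the partial code of Section 2 (the specific values below are taken from that code).

%Illustrative 5x5 environment: NaN = off-limits, other entries = danger rating
real_grid   = [ 0   0  50   0   0;
                0  NaN NaN NaN  0;
                0   0   0   0  20;
                0  NaN NaN NaN  0;
               50   0   0   0  30];

%The agent's belief grid starts with no knowledge of any danger
belief_grid = real_grid;
belief_grid(~isnan(belief_grid)) = 0;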
Each step in the simulation has a defined sequence of events. First, all possible routes from the current location to the destination are calculated. Then, the accumulated danger of each route is found by summing the danger values of every square along the route. The agent takes a weighted sum of the distance and accumulated danger of each route, chooses the preferred route, and takes one step along it. In the case of this small map, the danger rating is multiplied by 0.1 to achieve a reasonable effect, such that a path of length 3 containing a danger spot of rating 40 would amount to a route desirability score of 7 (lower scores are preferred). In the new location, the agent samples the area for danger and updates its belief about the danger rating of that square; it does the same for all neighboring squares. Once the destination is reached, a new random destination is assigned and the process starts over. The sequence of random destination points allows the agent to roam the entire environment for our observation.
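A minimal sketch of this scoring step, assuming candidate routes are returned as a cell array of [row col] coordinate lists (as the calculate_routes helper in the Section 2 code suggests):

%Score each candidate route as distance plus weighted accumulated danger,
%then step onto the second square of the best-scoring route.
danger_wt = 0.1;                  %danger weight used in this example (the Section 2 code uses 0.5)
scores = zeros(1, numel(routes));
for i = 1:numel(routes)
    route = routes{i};            %k-by-2 list of [row col] squares
    idx = sub2ind(size(belief_grid), route(:,1), route(:,2));
    scores(i) = size(route,1) + danger_wt * sum(belief_grid(idx));
end
[~, best] = min(scores);          %lower score = shorter and/or believed safer
curr_loc = routes{best}(2,:);     %the first entry of a route is the current square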
For each square, the agent holds a belief about the danger level of that square, which can range from 1 to 100. This belief is represented as a normal probability distribution over the danger level, with mean parameter μ.
Belief about μ is updated by Bayes' rule using the maximum a posteriori (MAP) estimate. The prior is represented as a normal probability distribution,

p(μ) = N(μ | μ0, σ0²) ∝ exp(−(μ − μ0)² / (2σ0²)).

Initially μ0 is set to 0 and σ0 to 1.
The likelihood is also represented as a normal distribution, with x as a new observation of the danger rating:

p(x | μ) = N(x | μ, 1) ∝ exp(−(x − μ)² / 2).
Hypotheses about μ are updated using Bayes' rule:

p(μ | x) ∝ p(x | μ) p(μ).
To calculate the maximum a posteriori estimate, we note that the likelihood is a N(μ, 1) distribution while the prior is N(μ0, σ0²). The product of the two normal densities is proportional to another normal density whose parameters, after n observations x1, …, xn, are

μ_post = (μ0/σ0² + Σ xi) / (1/σ0² + n),    σ_post² = 1 / (1/σ0² + n).
Thus, the mode of such a distribution is simply μ_post, and we can plug in values to determine the updated belief. Setting σ0 to 1 for simplicity and n to 1 (a memoryless strategy), the update reduces to a simple average of the old belief and the new observation of danger:

μ_post = (μ0 + x) / 2.
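This averaging update is what the partial code in Section 2 performs when the agent samples its current square (and, likewise, each neighboring square); a condensed sketch, assuming real_grid, belief_grid, and curr_loc as defined there:

%Sample the true danger with unit-variance Gaussian noise and average it
%with the previous belief (MAP update with sigma0 = 1 and n = 1).
observation = randn + real_grid(curr_loc(1), curr_loc(2));
old_belief  = belief_grid(curr_loc(1), curr_loc(2));
belief_grid(curr_loc(1), curr_loc(2)) = (old_belief + observation) / 2;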
Figure 2
With this simple model of averaging, the agent is able to effectively learn the danger ratings of each area. Figure 2 shows the exploring agent at the initial stages of exploration. It has learned the danger ratings of a few areas.
Figure 3
After more time has passed, the entire map is learned.
Figure 4
The agent can adapt to new dangers in the environment. The addition of a dangerous spot in the center is immediately detected, and will eventually be completely learned.
Figure 5
The agent also strategizes its route properly. In Figure 5, there are two reasonably short routes from the current location to the destination. After the agent has properly learned the danger mapping, it takes the slightly longer route in order to avoid crossing the highly dangerous red strip. We can see that it effectively balances distance against safety.
Experiment 2: Collecting resources on the map
Figure 6
The next step is to add a foraging component to the search and rescue task. The stars on the map now represent resources, with darker shades of blue representing higher reward values in a range of 1 to 100. As with the safety levels, the reward level is sampled when the agent crosses a resource on its path.
Figure 7
The agent's new goal is to pick its own destinations, whether that means going to a new foraging site or returning home. The agent now has a health value that is decremented by 1 per step of distance and further decremented by a sample of the danger in the current area. If health drops below zero, the agent dies and starts over at the home square with no recollection of past observations. The agent can only restore health by returning to the home square, but must also strive to maximize foraged rewards. An exploration parameter is also added. This parameter directly influences three things: the prior belief that resources have high reward, the minimum health buffer the agent maintains before heading home, and the selection of the destination. An explorative agent will simply pick a random resource to visit, whereas an exploitative agent will pick the one it has seen provide the greatest reward.
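The partial code in Section 2 leaves the health update as a comment stub, so the following is only a sketch of the rule described above; the variable names follow that code, and the reset values are assumptions.

%Health drops by 1 per step plus a noisy sample of the local danger.
danger_sample  = randn + real_grid(curr_loc(1), curr_loc(2));
health_counter = health_counter - 1 - danger_sample;

if health_counter < 0
    %Death: restart at home and forget all learned beliefs.
    curr_loc = home;
    health_counter = 1000;                   %assumed starting health, matching the Section 2 code
    belief_grid(~isnan(belief_grid)) = 0;
elseif isequal(curr_loc, home)
    health_counter = 1000;                   %health is restored only at the home square
end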
Rather than a normal distribution, the rewards are sampled from an exponential distribution to ensure that all values are positive. The goal of the agent is then to learn the decay rate λ of the exponential distribution for each square. For ease of computation, the conjugate prior, a gamma distribution, is used to represent the prior belief about λ:

p(λ) = Gamma(λ | α, β) ∝ λ^(α−1) e^(−βλ).
The likelihood is represented by an exponential distribution:

p(x | λ) = λ e^(−λx).
Hypotheses about λ are updated using Bayes' rule:

p(λ | x) ∝ p(x | λ) p(λ).
Noting that our prior has a distribution of Gamma(α, β) and our likelihood has distribution Exponential(λ), this conjugate pair results in a posterior that is also a Gamma distribution, with parameters calculated from n observations x1, …, xn as

α_post = α + n,    β_post = β + Σ xi.
The mode of a Gamma(α, β) distribution (for α ≥ 1) is

mode = (α − 1) / β.
Therefore, plugging the α and β of the posterior into the mode equation gives the MAP estimate of λ:

λ_MAP = (α + n − 1) / (β + Σ xi).
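A minimal sketch of this update for a single observed reward x, assuming per-square hyperparameters stored in hypothetical arrays alpha_grid and beta_grid (these names do not appear in the provided code):

%Observe one reward x at the current square and update the Gamma
%hyperparameters of that square's exponential rate lambda.
r = curr_loc(1);  c = curr_loc(2);
alpha_grid(r,c) = alpha_grid(r,c) + 1;   %alpha_post = alpha + n, with n = 1
beta_grid(r,c)  = beta_grid(r,c) + x;    %beta_post = beta + sum of observed rewards

%MAP estimate of the rate lambda for this square (valid when alpha > 1).
lambda_map = (alpha_grid(r,c) - 1) / beta_grid(r,c);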
Exploration is a value in the range (0, 1) that affects three parts of the decision-making process:
1) The initial value of α in the belief. A highly explorative agent has an α that shapes the distribution so that its mean is very high; the agent thinks all foraging locations are very promising.
2) The safety-net value of health that the agent maintains before deciding to head home.
3) Picking a destination. A random number is drawn from the range (0, 1). If it is less than the exploration value, the agent picks a random foraging site to head towards rather than a site it already knows to be good, as sketched below.
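A sketch of this destination rule, assuming a hypothetical list resource_sites of candidate coordinates and a vector site_reward_estimate of the agent's current reward estimates (neither name appears in the provided partial code):

%Epsilon-greedy style destination choice driven by the exploration parameter.
if rand < exploration
    %Explore: visit a uniformly random foraging site.
    pick = randi(size(resource_sites, 1));
else
    %Exploit: visit the site currently believed to give the greatest reward.
    [~, pick] = max(site_reward_estimate);
end
dest_loc = resource_sites(pick, :);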
Results
For each map, 5 settings of the exploration parameter are tested for a fixed and equal amount of time (50 steps). Then, for each map, the number of deaths that occurred and the reward earned in each life are recorded.
Map 1 is designed for the benefit of a more explorative agent. Foraging spots of very high value are hidden within dangerous areas that an explorative agent would be more likely to reach. Map 2 is designed more to the benefit of an exploitative agent, as the hidden rewards are of very low value. Map 3 is somewhere in between 1 and 2, with some hidden foraging spots worth a lot and some worth a little.
Map 1: Environment beneficial for exploration
Map 2: Environment beneficial for exploitation
2 Partial Code
%Route Planning
%Iteration 1: Only learns the safety level of each location
%Runs route planning simulation
%Parameters: sigma, safety_wt, exploration?, exploitation?
%Variables: real grid, belief grid, location, destination
%0:neutral, >0:safe, <0:unsafe, NaN:not a valid location
function bayesianNavigation()
    clear all;

    %ui functions and variables
    speed = .2;
    quit = false;

    function set_real_grid(points)
        points = round(points);
        real_grid(points(1,2),points(1,1)) = str2double(get(danger_edit,'String'));
    end

    function set_speed(sp)
        speed = sp;
    end

    function qt()
        quit = true;
    end

    %Set up figure/UI
    f = figure('Position', [100, 100, 1200, 500]);
    danger_edit = uicontrol('Parent',f,'Style','edit','Position',[60,475,40,23],'String','0');
    danger_text = uicontrol('Parent',f,'Style','text','Position',[10,475,40,23],'String','Danger Rating:');
    qt_ui = uicontrol('Parent',f,'Style','pushbutton','Position',[110,475,40,23],'String','Quit','Callback',@(~,~)qt());
    spd_ui = uicontrol('Parent',f,'Style','slider','Position',[170,473,200,23],'Min',0,'Max',10,'SliderStep',[1 1]./10,...
        'Value',5,'Callback',@(h,~)set_speed(get(h,'Value')));
    health_text = uicontrol('Parent',f,'Style','text','Position',[425,475,40,23],'String','Health: 1000');
    reward_text = uicontrol('Parent',f,'Style','text','Position',[475,475,40,23],'String','Reward: 30');

    %Set parameters
    sigma = 1;
    danger_wt = .5;

    %Set real and belief grid
    %real_grid = [0 0 50; 0 NaN 0; 0 0 0];
    %belief_grid = [0 0 0; 0 NaN 0; 0 0 0];
    real_grid = [0 0 50 0 0; 0 NaN NaN NaN 0; 0 0 0 0 20; 0 NaN NaN NaN 0; 50 0 0 0 30];
    belief_grid = [0 0 0 0 0; 0 NaN NaN NaN 0; 0 0 0 0 0; 0 NaN NaN NaN 0; 0 0 0 0 0];

    %Set home, real resources, belief resources
    home = [1 1];
    real_resources = [1 1 0];
    belief_resources = [1 1 0];

    %Set health counter
    health_counter = 1000;

    %Set reward counter
    reward_counter = 0;

    %Set current location and destination
    %curr_loc = [3 3];
    %dest_loc = [1 1];
    curr_loc = [1 1];
    dest_loc = [5 5];

    %Loop through time until whenever
    while true
        if quit
            break
        end

        %plot and wait
        s1 = subplot(1,2,1);
        imgsc = plot_grid(real_grid,curr_loc,dest_loc,home,real_resources);
        ax = imgca;
        set(imgsc,'ButtonDownFcn',@(~,~)set_real_grid(get(ax,'CurrentPoint')),'HitTest','on')
        s2 = subplot(1,2,2);
        plot_grid(belief_grid, curr_loc, dest_loc, home, belief_resources);
        pause(1-speed/11);

        %check whether goal has been reached
        if isequal(curr_loc,dest_loc)
            %compare journey with optimal journey and record results
            %set new dest
            dest_loc = new_dest(real_grid);
        else
            %make safety observation and update belief grid
            observation = randn + real_grid(curr_loc(1),curr_loc(2));
            curr_belief = belief_grid(curr_loc(1),curr_loc(2));
            belief_grid(curr_loc(1),curr_loc(2)) = (curr_belief+observation)/2; %assume sigma = 1

            %make safety observation and update belief grid for neighbors
            [ neighbors ] = find_neighbors(curr_loc, belief_grid);
            for i = 1:size(neighbors,1)
                observation = randn + real_grid(neighbors(i,1),neighbors(i,2));
                curr_belief = belief_grid(neighbors(i,1),neighbors(i,2));
                belief_grid(neighbors(i,1),neighbors(i,2)) = (curr_belief+observation)/2; %assume sigma = 1
            end

            %take resource and update belief_resources
            %update health counter

            %calculate routes
            routes = calculate_routes(curr_loc,dest_loc,belief_grid);

            %calculate safety belief of each route
            danger_ratings = zeros(1,size(routes,1));
            for i = 1:size(routes,1)
                route = routes{i};
                idx = sub2ind(size(belief_grid), route(:,1), route(:,2));
                danger_ratings(i) = sum(belief_grid(idx));
            end

            %calculate distance of each route
            distances = zeros(1,size(routes,1));
            for i = 1:size(routes,1)
                route = routes{i};
                distances(i) = size(route,1);
            end

            %calculate desired route based on safety and distance
            desirability = danger_wt .* danger_ratings + distances;
            [b,ind] = sort(desirability,'ascend');

            %take a step in the desired direction
            curr_loc = routes{ind(1)}(2,:);
        end
    end

    %Plot difference in safety/efficiency of actual vs optimal route taken as a function of time
end
3 Simulation Results