A unified view of reinforcement learning methods that require a model of the environment and methods that can be used without one
model-based reinforcement learning method : planning
model-free reinforcement learning method : learning
In this chapter, the goal is a similar integration of model-based and model-free methods, i.e. to intermix them
model : anything that an agent can use to predict how the environment will respond to its action
distribution model : produce a description of all possibilities and their probabilities
sample model : produce just one of the possibilities
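The two kinds of model can be contrasted with a minimal sketch. The two-state transition table, the state and action names, and both function names are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical dynamics: from state "s0" with action "go", the environment
# moves to "s1" with reward 1.0 (probability 0.8) or stays in "s0" with
# reward 0.0 (probability 0.2).
DISTRIBUTION = {
    ("s0", "go"): [(("s1", 1.0), 0.8), (("s0", 0.0), 0.2)],
}

def distribution_model(state, action):
    """Return every (next_state, reward) outcome together with its probability."""
    return DISTRIBUTION[(state, action)]

def sample_model(state, action, rng=random):
    """Return one (next_state, reward) outcome drawn according to the probabilities."""
    outcomes, probs = zip(*DISTRIBUTION[(state, action)])
    return rng.choices(outcomes, weights=probs, k=1)[0]
```

A distribution model can always be used to produce samples, as `sample_model` does here, but a sample model alone does not give the probabilities.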
planning : any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment
state-space planning methods share a common structure
model →simulated experience →(backups) →values →policy
various state-space planning methods differ only in the kinds of updates they do
the difference between planning and learning
planning uses simulated experience generated by a model
learning methods use real experience generated by the environment
→ the common structure means that many ideas and algorithms can be transferred between planning and learning
learning methods require only experience as input and in many cases they can be applied to simulated experience just as well as to real experience
e.g. random-sample one-step tabular Q-planning
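Random-sample one-step tabular Q-planning can be sketched as follows: pick a state-action pair at random, ask a sample model for the reward and next state, and apply a one-step Q-learning update to the simulated transition. The function signature, the `(reward, next_state, done)` return convention, and the toy chain model are assumptions made for this sketch.

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates,
               alpha=0.1, gamma=0.95, seed=0):
    """Random-sample one-step tabular Q-planning on a sample model."""
    rng = random.Random(seed)
    q = defaultdict(float)  # Q[(state, action)], initialised to zero
    for _ in range(n_updates):
        # 1. Select a state and an action at random.
        s = rng.choice(states)
        a = rng.choice(actions)
        # 2. Query the sample model for a simulated transition.
        r, s_next, done = sample_model(s, a)
        # 3. Apply the one-step tabular Q-learning update.
        target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])
    return q

def chain_model(s, a):
    """Hypothetical deterministic chain 0 -> 1 -> 2, where state 2 is terminal."""
    s_next = s + 1 if a == "right" else max(s - 1, 0)
    done = s_next == 2
    return (1.0 if done else 0.0), s_next, done

q = q_planning(chain_model, states=[0, 1], actions=["left", "right"], n_updates=5000)
```

No real experience is used here: every update comes from the model, which is what makes this planning rather than learning.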
In addition, the second theme of this chapter is the benefit of planning in small, incremental steps
Dyna-Q : a simple architecture integrating the major functions needed in an online planning agent
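Within a planning agent, there are at least two roles for real experience
1) model learning: improve the model so that planning can use it (indirect reinforcement learning), 2) direct reinforcement learning: improve the value function and policy directly from real experience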
Indirect methods - make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions
direct methods - are much simpler and are not affected by biases in the design of the model
planning (indirect reinforcement learning) : random-sample one-step tabular Q-planning
direct reinforcement learning : one-step tabular Q-learning
model-learning method is also table-based and assumes the environment is deterministic
search control - the process that selects the starting states and actions for the simulated experiences generated by the model
Typically, as in Dyna-Q, the same reinforcement learning method is used both for learning from real experience and for planning from simulated experience
The reinforcement learning method is thus the final common path for both learning and planning - differing only in the source of their experience
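A minimal sketch of tabular Dyna-Q shows how the pieces fit together: one shared Q-learning update is applied to real transitions (direct RL) and to transitions replayed from a deterministic table model (planning). The `env_step` interface, the corridor environment, and all parameter names are assumptions made for illustration.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start, actions, n_planning, episodes,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)
    model = {}  # (s, a) -> (r, s_next, done); assumes a deterministic environment
    steps_per_episode = []

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s_next, done):
        # Shared one-step Q-learning update: the "final common path".
        target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done, steps = start, False, 0
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s_next, done = env_step(s, a)    # real experience
            update(s, a, r, s_next, done)       # direct reinforcement learning
            model[(s, a)] = (r, s_next, done)   # model learning
            for _ in range(n_planning):         # planning
                ps, pa = rng.choice(list(model))  # search control: a previously seen pair
                update(ps, pa, *model[(ps, pa)])
            s, steps = s_next, steps + 1
        steps_per_episode.append(steps)
    return q, steps_per_episode

def corridor_step(s, a, length=10):
    """Hypothetical corridor: states 0..length-1, reward 1.0 on reaching the last state."""
    s_next = min(max(s + a, 0), length - 1)
    done = s_next == length - 1
    return (1.0 if done else 0.0), s_next, done

q, steps = dyna_q(corridor_step, start=0, actions=[-1, 1], n_planning=10, episodes=20)
```

Setting `n_planning=0` reduces this sketch to plain one-step tabular Q-learning, which makes it easy to compare the two on the same environment.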
**[Example 8.1] Dyna Maze**
more planning steps per real step reach the optimal policy faster than 0 planning steps (direct reinforcement learning only)
I think more planning steps help when the model is accurate because each real step then propagates the reward back to the values of states far from the terminal state
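Model may be incorrect because the environment is stochastic and only a limited number of samples have been observed, because the model was learned with function approximation that generalized imperfectly, or because the environment has changed and its new behavior has not yet been observed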
when the model is incorrect, the planning process is likely to compute a sub-optimal policy
In some cases, the sub-optimal policy computed by planning quickly leads to the discovery and correction of the modeling error
Great difficulties arise when the environment changes to become better than it was before, and yet the formerly correct policy does not reveal the improvement
this general problem is another version of the conflict between exploration and exploitation
Dyna-Q+ addresses this problem with a heuristic - to encourage behavior that tests long-untried actions, a special "bonus reward" is given on simulated experiences involving these actions
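if the modeled reward for a transition is $r$ and the transition has not been tried in $\tau$ time steps, then for some small $\kappa$
the reward used in planning updates would be $r + \kappa\sqrt{\tau}$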
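The Dyna-Q+ bonus can be sketched as a small helper, assuming the standard form in which a transition with modeled reward r that has gone untried for tau time steps is treated during planning as if its reward were r + kappa * sqrt(tau). The function name, argument names, and default `kappa` are hypothetical.

```python
import math

def bonus_reward(modeled_reward, current_step, last_tried_step, kappa=1e-3):
    """Reward used in a planning update: r + kappa * sqrt(tau).

    tau is the number of time steps since the state-action pair was last
    tried in the real environment (assumed bookkeeping, kept by the agent).
    """
    tau = current_step - last_tried_step
    return modeled_reward + kappa * math.sqrt(tau)

# A pair untried for 100 steps gets a bonus of kappa * 10.
r_planning = bonus_reward(0.0, current_step=100, last_tried_step=0, kappa=0.01)
```

Because the bonus grows only with the square root of the elapsed time, long-untried actions become gradually more attractive to the planner without overwhelming the modeled rewards.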
**[Example 8.2] Blocking Maze**
**[Example 8.3] Shortcut Maze**