Handbook for Data Science and Artificial Intelligence

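4.4.4. Reinforcement Learning

Reinforcement learning is learning what to do (that is, how to map the current situation to an action) so as to maximize a numerical reward signal.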

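  • [Trial and error] The learner is not told which actions to take; it must discover, by trying them, which actions yield the largest cumulative reward.
  • [Delayed reward] An action often affects not only the immediate reward but also the next situation, and through it the subsequent rewards.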

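Terminology: reward (收益) is the immediate signal; return, or gain (回报), is the accumulated reward; an episode (幕) is one run of interaction from a start state to a terminal state.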

4.4.4. Some Key Concepts

4.4.4. Discounted Weighted Average

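$$ (1-\alpha)^n+\sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}=1 $$

The weights sum to 1. With a constant step size $\alpha$, $(1-\alpha)^n$ is the weight on the initial estimate $Q_1$ and $\alpha(1-\alpha)^{n-i}$ is the weight on the $i$-th reward $R_i$, so older rewards are discounted exponentially.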
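
The weighting can be checked numerically. The sketch below assumes the usual constant-step-size update Q ← Q + α(R − Q); the function names `recency_weights`, `incremental_estimate`, and `weighted_estimate` are illustrative and not part of the handbook.

```python
def recency_weights(alpha, n):
    """Return (weight on the initial estimate, weights on rewards 1..n)."""
    w_init = (1 - alpha) ** n
    w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_init, w_rewards


def incremental_estimate(q1, rewards, alpha):
    """Apply Q <- Q + alpha * (R - Q) once per reward, starting from q1."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q


def weighted_estimate(q1, rewards, alpha):
    """Compute the same estimate in closed form as a weighted average."""
    w_init, w_rewards = recency_weights(alpha, len(rewards))
    return w_init * q1 + sum(w * r for w, r in zip(w_rewards, rewards))
```

Both routes give the same number for any reward sequence, which is why the incremental rule can be read as a weighted average whose weights sum to 1.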

4.4.1.2. Bootstrapping

Bootstrapping means updating the value estimate of a state from the value estimates of its successor states.
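
As a minimal sketch of bootstrapping, the one-step TD(0) update below moves the estimate of a state toward a target built from the estimate of its successor. The dictionary `V`, the function name `td0_update`, and the default `alpha` and `gamma` are assumptions chosen for illustration.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    V is a dict mapping states to value estimates. When the transition ends
    the episode, the successor contributes no value to the target.
    """
    bootstrap = 0.0 if terminal else V[next_state]
    target = reward + gamma * bootstrap  # built from an estimate, not a true value
    V[state] += alpha * (target - V[state])
    return V[state]
```

For example, with `V = {"A": 0.0, "B": 1.0}`, calling `td0_update(V, "A", 0.0, "B", alpha=0.5, gamma=1.0)` moves `V["A"]` halfway toward the current estimate of `V["B"]`.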

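4.4.4. Maximization Bias

Cause: the maximum of the estimated values is used as an estimate of the maximum of the true values, which tends to overestimate.

Solution: double Q-learning, which keeps two estimates. One Q selects the maximizing action and the other Q supplies the value estimate for that action.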

$$ Q_1(S,A) \leftarrow Q_1(S,A) + \alpha\left[R + \gamma Q_2\big(S', \mathop{\mathrm{argmax}}\limits_{a} Q_1(S',a)\big) - Q_1(S,A)\right] $$
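
The update above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the tables are dictionaries keyed by `(state, action)`, and the names `double_q_update`, `q_select`, and `q_eval` are hypothetical. In the usual formulation the caller flips a fair coin on each step to decide which table plays which role; that coin flip is left to the caller here so the function stays deterministic.

```python
from collections import defaultdict


def double_q_update(q_select, q_eval, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.99, done=False):
    """Update q_select[(s, a)] using q_eval to score the selected action.

    q_select picks the maximizing action in s_next; q_eval supplies the
    value of that action, which decouples selection from evaluation.
    """
    if done:
        target = r
    else:
        best = max(actions, key=lambda b: q_select[(s_next, b)])
        target = r + gamma * q_eval[(s_next, best)]
    q_select[(s, a)] += alpha * (target - q_select[(s, a)])
    return q_select[(s, a)]


# Two independent tables; unseen entries default to 0.0.
Q1, Q2 = defaultdict(float), defaultdict(float)
```

To update the other table, call the function with the arguments swapped, for example `double_q_update(Q2, Q1, ...)`.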

4.4.4. Comparison

4.4.1.4.1. DP vs MC vs TD

  • DP: model-based
  • MC: model-free, relying exclusively on actual rewards and complete returns.
  • TD: model-free. TD learning methods form their update targets from existing estimates, an approach known as bootstrapping.

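DP uses bootstrapping to perform expected updates. MC and TD both perform sample updates. Compared with MC, however, a single TD sample does not have to run to the end of the episode: TD bootstraps, using the current estimate in place of the true value. Compared with DP, a TD update is computed from one sampled successor rather than from the complete distribution over all possible successors.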
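
The difference shows up in how each method builds its update target. The sketch below is illustrative only; the function names and the `(probability, reward, next_state)` transition format are assumptions made for this example.

```python
def dp_target(transitions, V, gamma):
    """Expected update with bootstrapping: average over all successors.

    transitions is a list of (probability, reward, next_state) tuples taken
    from a known model of the environment.
    """
    return sum(p * (r + gamma * V[s_next]) for p, r, s_next in transitions)


def mc_targets(rewards, gamma):
    """Sample update without bootstrapping: the full return from each step.

    Needs the rewards of a complete episode before any target is available.
    """
    g, returns = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def td_target(reward, next_value, gamma):
    """Sample update with bootstrapping: one sampled step plus an estimate."""
    return reward + gamma * next_value
```

`dp_target` needs the model, `mc_targets` needs the whole episode, and `td_target` needs only one sampled transition and the current estimate of the next state.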

4.4.4. The Exploration-Exploitation Dilemma

  • $\epsilon$-greedy
  • UCB (upper confidence bound): quantify the uncertainty and treat it as part of the value estimate.

$$ A_t = \mathop{\mathrm{argmax}}\limits_{a} \left[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\right] $$

  • softmax

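When $\epsilon$-greedy takes a non-greedy action it chooses blindly. It is better to choose according to each action's potential (that is, its possible reward). Indicators of potential include:

  • The maximum of the estimates $\rightarrow$ leads to maximization bias
  • Uncertainty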
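
The three selection rules can be sketched for a multi-armed bandit as follows. This is a minimal illustration under assumptions: `Q` is a list of value estimates indexed by action, `N` is a list of how many times each action has been selected, `t` is the current time step, and the temperature `tau` in the softmax is a parameter introduced here, not one the text defines.

```python
import math
import random


def epsilon_greedy(Q, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else pick greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])


def ucb(Q, N, t, c=2.0):
    """Pick the action maximizing Q(a) + c * sqrt(ln t / N(a)).

    An action that has never been tried is treated as maximally uncertain
    and is selected first.
    """
    for a, count in enumerate(N):
        if count == 0:
            return a
    return max(range(len(Q)),
               key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))


def softmax_probs(Q, tau=1.0):
    """Selection probabilities proportional to exp(Q(a) / tau)."""
    m = max(Q)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in Q]
    z = sum(exps)
    return [e / z for e in exps]
```

Unlike $\epsilon$-greedy, the UCB bonus shrinks as an action is tried more often, so exploration is directed toward actions whose estimates are still uncertain.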

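4.4.2.1. Off-policy

  • Target policy: the policy being learned
  • Behavior policy: the policy that generates the action samples

Compared with on-policy methods, off-policy methods often have higher variance and converge more slowly.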

4.4.4. Environments

4.4.4.1. Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. [Official website]

4.4.4.1.1. A New Gym Env

4.4.4.1.1.1. Create an env class
The main API methods that users of this class need to know are:
        step
        reset
        render
        close
        seed
And set the following attributes:
        action_space: The Space object corresponding to valid actions
        observation_space: The Space object corresponding to valid observations
        reward_range: A tuple corresponding to the min and max possible rewards

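Refer to the full documentation and code for the details.

Every step function should return four values (this is the classic Gym API; Gym 0.26 and later return five values, splitting done into terminated and truncated):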

observation (object): agent's observation of the current environment
reward (float): amount of reward returned after previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

The reset function should return the initial observation.
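
The interface can be illustrated with a toy environment. The sketch below is a hypothetical example, not code from Gym: `CoinFlipEnv` and `run_episode` are made-up names, and the class is written as a plain Python class so it runs without Gym installed. A real environment would subclass `gym.Env` and set `action_space` and `observation_space` to `gym.spaces` objects; here `n_actions` is only a stand-in for the action space.

```python
import random


class CoinFlipEnv:
    """Toy environment with the classic four-value step interface.

    The agent guesses the outcome of a coin flip (action 0 or 1) and
    receives reward 1.0 for a correct guess. An episode lasts max_steps.
    """

    reward_range = (0.0, 1.0)

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.n_actions = 2  # stand-in for a discrete action space of size 2
        self._rng = random.Random()
        self._steps = 0
        self._last_flip = 0

    def seed(self, seed=None):
        self._rng = random.Random(seed)
        return [seed]

    def reset(self):
        self._steps = 0
        self._last_flip = 0
        return self._last_flip  # initial observation

    def step(self, action):
        flip = self._rng.randrange(2)
        reward = 1.0 if action == flip else 0.0
        self._steps += 1
        self._last_flip = flip
        done = self._steps >= self.max_steps
        info = {"steps": self._steps}
        return flip, reward, done, info

    def render(self, mode="human"):
        print(f"step={self._steps} last_flip={self._last_flip}")

    def close(self):
        pass


def run_episode(env, policy):
    """Run one episode; policy maps an observation to an action."""
    obs = env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total += reward
        steps += 1
    env.close()
    return total, steps
```

The loop in `run_episode` is the usual pattern for the four-value API: call `reset` once, then call `step` until `done` is true.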

4.4.4.1.1.2. Register the new environment

Use the following code to register the new Gym environment. Once registered, it can be created with gym.make('VirtualTB-v0').

from gym.envs.registration import register

register(
    id='VirtualTB-v0',
    entry_point='virtualTB.envs:VirtualTB',
)