Mathematical Foundations of Reinforcement Learning (强化学习的数学原理)

Chapter 1 Basic Concepts

1.1 State and Action

states, state space, actions, action space

1.2 State transition

When taking an action, the agent may move from one state to another. Such a process is called state transition.

1.3 Policy

A policy tells the agent which actions to take at every state.

1.4 Reward

After executing an action at a state, the agent obtains a reward, denoted as r, as feedback from the environment.

The reward is a function of the state s and action a. Hence, it is also denoted as r(s, a).

1.5 Trajectories, returns, and episodes

1.5.1 Trajectory

A trajectory is a state-action-reward chain.

1.5.2 Returns

Returns are also called total rewards or cumulative rewards.

Returns can be used to evaluate policies

A return consists of an immediate reward and future rewards.

The discounted return is the sum of the discounted rewards along a trajectory:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ⋯

Here γ is the discount rate, which lies between 0 and 1.

The discounted return serves two purposes:

First, it removes the stop criterion and allows for infinitely long trajectories. Second, the discount rate can be used to adjust the emphasis placed on near- or far-future rewards.
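A minimal sketch of the second point, using a made-up reward sequence to show how γ shifts the emphasis between near- and far-future rewards:

```python
# Discounted return: G = sum over t of gamma^t * r_{t+1}.
# The reward sequence below is made up purely for illustration.
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):  # fold from the tail: G_t = r + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [0, 0, 0, 1, 1, 1]            # hypothetical trajectory rewards
print(discounted_return(rewards, 0.1))  # small gamma: near-future dominates
print(discounted_return(rewards, 0.9))  # large gamma: far-future rewards still count
```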

1.5.3 Episode

When interacting with the environment by following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).

1.6 Markov decision processes

Markov decision processes (MDPs)

An MDP is a general framework for describing stochastic dynamical systems. The key ingredients of an MDP are listed below.

Sets: State space, Action space, Reward set

Model: State transition probability, Reward probability, p(s′|s, a) and p(r|s, a) for all (s, a) are called the model or dynamics.

Policy: In state s, the probability of choosing action a is π(a|s).

Markov property: The Markov property refers to the memoryless property of a stochastic process.
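In symbols, memorylessness means that the next state and reward depend only on the current state and action, not on the earlier history:

p(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …, s_0, a_0) = p(s_{t+1} | s_t, a_t)

p(r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …, s_0, a_0) = p(r_{t+1} | s_t, a_t)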

Chapter 2 State Values and Bellman Equation

state value, Bellman equation, policy evaluation, action value

2.1 Motivating example 1: Why are returns important?

2.2 Motivating example 2: How to calculate returns?

  1. Calculate each return separately, directly from its definition.
  2. Calculate returns by bootstrapping: the return of one state is obtained from the returns of the states it leads to.

Writing the bootstrapping relations for all states together in matrix-vector form gives v = r + γPv, so the value of v can be calculated easily as v = (I − γP)⁻¹r.

2.3 State values

vπ(s) = E[G_t | S_t = s] is called the state-value function or simply the state value of s.

2.4 Bellman equation

The Bellman equation is a set of linear equations that describes the relationships between the values of all the states:

vπ(s) = Σ_a π(a|s) [ Σ_r p(r|s, a) r + γ Σ_{s′} p(s′|s, a) vπ(s′) ],  for all s ∈ S

2.5 Examples for illustrating the Bellman equation

See Section 2.5 of the book for the worked examples.

2.6 Matrix-vector form of the Bellman equation

In matrix-vector form the Bellman equation reads vπ = rπ + γPπvπ, where vπ and rπ are vectors (one entry per state) and Pπ is the state transition matrix under policy π.

2.7 Solving state values from the Bellman equation

2.7.1 Closed-form solution

Since vπ = rπ + γPπvπ is a simple linear equation, its closed-form solution can be easily obtained as

vπ = (I − γPπ)⁻¹ rπ

i.e., the state values can be solved for directly. (A numerical sketch appears after Section 2.7.2 below.)

2.7.2 Iterative solution

Solve based on the following iteration:

v_{k+1} = rπ + γPπ v_k,    k = 0, 1, 2, …

The sequence {v_k} converges to vπ as k → ∞.
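A minimal numerical sketch of both the closed-form solution (2.7.1) and the iterative solution (2.7.2); the two-state model (Pπ, rπ) below is made up for illustration:

```python
import numpy as np

# Hypothetical model under a fixed policy pi: 2 states.
gamma = 0.9
P_pi = np.array([[0.5, 0.5],      # state transition matrix P_pi
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])       # expected immediate reward per state

# Closed-form solution (2.7.1): v = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative solution (2.7.2): v_{k+1} = r_pi + gamma * P_pi @ v_k
v = np.zeros(2)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed, v)  # the two results agree
```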

2.8 From state value to action value

The action value is defined as qπ(s, a) = E[G_t | S_t = s, A_t = a].

The relationship between action values and state values: the state value is the mean of the action values under the policy,

vπ(s) = Σ_a π(a|s) qπ(s, a),

and conversely the action value consists of the expected immediate reward plus the discounted values of the next states,

qπ(s, a) = Σ_r p(r|s, a) r + γ Σ_{s′} p(s′|s, a) vπ(s′)

2.8.1 The Bellman equation in terms of action values

Substituting (2.13) into (2.14) yields the Bellman equation in terms of action values:

qπ(s, a) = Σ_r p(r|s, a) r + γ Σ_{s′} p(s′|s, a) Σ_{a′} π(a′|s′) qπ(s′, a′)

This equation also has a matrix-vector form (I have not yet worked out how the conversion is derived).

Chapter 3 Optimal State Values and Bellman Optimality Equation

Definition 3.1 (Optimal policy and optimal state value). A policy π∗ is optimal if vπ∗(s) ≥ vπ(s) for all s ∈ S and for any other policy π. The state values of π∗ are the optimal state values.

3.1 Bellman optimality equation

3.1.1 Maximization of the right-hand side of the BOE

In short, maximizing the right-hand side of Equation (3.1) amounts to maximizing q(s, a): since Σ_a π(a|s) q(s, a) is a weighted average of the action values, max_π Σ_a π(a|s) q(s, a) = max_a q(s, a), and the maximum is attained by the policy that puts all its probability on the action with the greatest value.

3.1.2 Matrix-vector form of the BOE
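In matrix-vector form, the BOE reads

v = max_π (rπ + γPπ v)

where the maximization is carried out elementwise over all policies; the right-hand side is denoted f(v).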

3.1.3 Contraction mapping theorem
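The theorem (the Banach fixed-point theorem) states: if f is a contraction mapping, i.e., ‖f(x₁) − f(x₂)‖ ≤ γ‖x₁ − x₂‖ for some γ ∈ (0, 1), then f has a unique fixed point x∗ with f(x∗) = x∗, and the iteration x_{k+1} = f(x_k) converges to x∗ from any initial x₀, at an exponential rate.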

3.1.4 Contraction property of the right-hand side of the BOE

Theorem 3.2 (contraction property): the right-hand side of the BOE, f(v) = max_π (rπ + γPπ v), is a contraction mapping: for any v₁, v₂, ‖f(v₁) − f(v₂)‖∞ ≤ γ‖v₁ − v₂‖∞.

3.2 Solving an optimal policy from the BOE

Greedy: the optimal policy selects, in every state, the action with the greatest optimal action value q∗(s, a).

3.3 Factors that influence optimal policies

Two factors: the discount rate and the reward values.

Invariance of optimal policies: applying an affine transformation to every reward (r → ar + b with a > 0) does not change the optimal policy, since it changes all state values in the same way.

Chapter 4 Value Iteration and Policy Iteration

4.1 Value Iteration

Two steps per iteration:

1. Policy update: for each state, greedily choose the action with the highest action value.

2. Value update: update each state value using the action just selected for that state.

4.1.1 Elementwise form and implementation
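A minimal sketch of the implementation on a generic tabular MDP; the arrays P (with P[s, a, s′] = p(s′|s, a)) and R (with R[s, a] = r(s, a)) are hypothetical placeholders, not the book's grid-world example:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s, a, s'] = p(s'|s, a); R[s, a] = expected immediate reward r(s, a)."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # q(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) v(s')
        q = R + gamma * P @ v          # shape (n_states, n_actions)
        policy = q.argmax(axis=1)      # policy update: greedy action per state
        v_new = q.max(axis=1)          # value update: value of the chosen action
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, policy
        v = v_new
```

Note how the two steps above are fused into one sweep: the greedy argmax is the policy update and the max is the value update.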

4.2 Policy Iteration

Two steps per iteration:

1. Policy evaluation: compute the state values of all states under the current policy.

2. Policy improvement: obtain an improved policy that is greedy with respect to those state values.

How are the state values computed in the evaluation step?

1. Solve the Bellman equation directly (closed form).

2. Solve it iteratively.

Convergence of policy iteration:

The state value sequence generated by the policy iteration algorithm converges to the optimal state value. As a result, the policy sequence converges to an optimal policy.
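A minimal sketch of policy iteration under the same hypothetical tabular setup as the value iteration sketch above (P[s, a, s′] and R[s, a] are placeholders):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)       # start from an arbitrary policy
    while True:
        # Policy evaluation: iteratively solve v = r_pi + gamma * P_pi v.
        v = np.zeros(n_states)
        while True:
            v_new = R[idx, policy] + gamma * (P[idx, policy] @ v)
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        # Policy improvement: act greedily with respect to q.
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # policy stable: optimal
            return v, policy
        policy = new_policy
```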

4.3 Truncated Policy Iteration

If the policy evaluation step performs only a single iteration, the algorithm is exactly the value iteration algorithm; if the evaluation iterates until convergence, it is the policy iteration algorithm; the truncated policy iteration algorithm iterates k times, sitting in between the two.

On the one hand, truncated policy iteration converges faster than value iteration, because its policy evaluation step performs multiple iterations. On the other hand, it converges more slowly than policy iteration, because it performs only a finite number of evaluation iterations.
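A minimal sketch under the same hypothetical P and R as above; the j_truncate parameter (an illustrative name) caps the number of evaluation sweeps:

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5, n_rounds=500):
    """j_truncate = 1 recovers value iteration; a large j_truncate
    approaches policy iteration (P, R as in the sketches above)."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(n_rounds):
        # Truncated policy evaluation: only j_truncate sweeps, not to convergence.
        for _ in range(j_truncate):
            v = R[idx, policy] + gamma * (P[idx, policy] @ v)
        # Policy improvement: greedy with respect to q.
        policy = (R + gamma * P @ v).argmax(axis=1)
    return v, policy
```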

Chapter 5 Monte Carlo Methods

Model-Free

5.1 Motivating example: Mean Estimation

The mean estimation problem: estimate E[X] of a random variable X from i.i.d. samples {x_i}, using the sample average E[X] ≈ (1/n) Σ_{i=1}^{n} x_i; by the law of large numbers, the average converges to E[X] as n → ∞.
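A minimal sketch of the incremental form of the sample average, the pattern that the MC methods below (and the RM algorithm of Chapter 6) build on; the data stream is made up for illustration:

```python
import random

# Incremental mean: w_{k+1} = w_k - (1/k) * (w_k - x_k)
# estimates E[X] without storing all samples.
w, k = 0.0, 0
for _ in range(10000):
    x = random.gauss(5.0, 2.0)   # hypothetical i.i.d. samples, E[X] = 5
    k += 1
    w = w - (1.0 / k) * (w - x)
print(w)  # close to 5.0
```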

5.2 MC Basic: The simplest MC-based algorithm

Replace the policy evaluation step of the policy iteration algorithm with an MC estimate.

Two ways to obtain the action-value function:

model-based: compute qπ(s, a) from the model, i.e., from p(r|s, a), p(s′|s, a), and the state values;

model-free: estimate qπ(s, a) directly as the average return of sufficiently many episodes starting from (s, a).

5.2.2 The MC basic algorithm

Step 1, policy evaluation: compute the action value of every (s, a) pair by collecting sufficiently many episodes starting from (s, a) and using the average return as the estimate of the action-value function.

Step 2, policy improvement: update the policy to be greedy with respect to the estimated action values, as in the sketch below.
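A minimal sketch of MC Basic; `sample_episode` is a hypothetical helper that rolls out the current policy from a given (s, a) pair and returns the discounted return, so the environment interface here is an assumption:

```python
import numpy as np

def mc_basic(sample_episode, n_states, n_actions,
             n_episodes=30, n_iterations=50):
    """sample_episode(s, a, policy) -> discounted return of one episode
    that starts from (s, a) and then follows `policy` (hypothetical API)."""
    policy = np.zeros(n_states, dtype=int)
    for _ in range(n_iterations):
        # Policy evaluation: MC estimate of q(s, a) for every pair.
        q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                returns = [sample_episode(s, a, policy)
                           for _ in range(n_episodes)]
                q[s, a] = np.mean(returns)
        # Policy improvement: greedy with respect to the MC estimates.
        policy = q.argmax(axis=1)
    return policy
```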

5.3 MC Exploring Starts

5.3.1 Utilizing samples more efficiently

Utilize the sampled data more efficiently: in Section 5.2, every estimate requires sampling many episodes just to evaluate a single (s, a) pair, which uses the data inefficiently. Since a single episode visits many (s, a) pairs, we can instead use each sampled episode to update the action-value estimates of all the (s, a) pairs it contains.

5.3.2 Updating policies more efficiently

Update the policy more efficiently: use the return of a single episode to update the corresponding action-value estimates immediately, episode by episode, rather than waiting to collect a whole batch first; this is the same idea as in 5.3.1.

5.3.3 Algorithm description

The exploring starts condition requires sufficiently many episodes starting from every state-action pair. Only if every state-action pair is well explored can we accurately estimate its action value (by the law of large numbers) and hence successfully find the optimal policy. Otherwise, if an action is not well explored, its action value may be estimated inaccurately, and the policy may fail to select that action even if it is in fact the best one. Both MC Basic and MC Exploring Starts require this condition. However, the condition is hard to satisfy in many applications, especially those involving physical interaction with the environment. Can we remove the exploring starts requirement? The answer is yes, as shown in the next section.

5.4 MC ε-Greedy: Learning without exploring starts

When selecting actions, do not always take the greedy (optimal) action; instead, keep a small probability of selecting the other actions at random. An ε-greedy policy takes the greedy action with probability 1 − ε(|A(s)| − 1)/|A(s)| and each of the other actions with probability ε/|A(s)|.
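A minimal sketch of sampling an action from an ε-greedy policy given a row of action values (the names are illustrative):

```python
import random

def epsilon_greedy_action(q_row, epsilon):
    """q_row: action values for one state; returns an action index.
    The greedy action gets probability 1 - eps*(|A|-1)/|A|,
    every other action gets probability eps/|A|."""
    n = len(q_row)
    greedy = max(range(n), key=lambda a: q_row[a])
    if random.random() < epsilon * (n - 1) / n:
        others = [a for a in range(n) if a != greedy]
        return random.choice(others)   # explore: uniform over non-greedy actions
    return greedy                      # exploit: take the greedy action
```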

Chapter 6 Stochastic Approximation

This chapter serves as a bridge between the preceding and following chapters. It introduces two algorithms: the Robbins-Monro (RM) algorithm and stochastic gradient descent (SGD).

6.2 Robbins-Monro algorithm

The task is to find a root of a function. When the expression of the function is known, many numerical-analysis methods can be used; when the expression is unknown, the RM algorithm can be used instead.

Let the function be g(w).

Our goal is to compute the root w∗ such that g(w∗) = 0.

The RM algorithm is

w_{k+1} = w_k − a_k g̃(w_k, η_k),    k = 1, 2, 3, …

where g̃(w_k, η_k) = g(w_k) + η_k is a noisy observation of g(w_k) and a_k > 0 is the step size. As can be seen, the RM algorithm does not require the expression of the function; it only requires its inputs and (noisy) outputs.
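A minimal numerical sketch of the RM algorithm finding the root of a function observed only through noise; the function g and the noise level below are made up for illustration:

```python
import random

# Hidden function observed only through noise: g(w) = 2 * (w - 3), root w* = 3.
def noisy_g(w):
    return 2.0 * (w - 3.0) + random.gauss(0.0, 1.0)

w = 0.0
for k in range(1, 10001):
    a_k = 1.0 / k                 # steps with sum a_k = inf, sum a_k^2 < inf
    w = w - a_k * noisy_g(w)      # w_{k+1} = w_k - a_k * g~(w_k, eta_k)
print(w)  # close to the root 3.0
```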

6.2.1 Convergence proof

6.2.2 Application to the mean estimation problem

6.4 Stochastic gradient descent

Stochastic gradient descent minimizes an objective of the form J(w) = E[f(w, X)] using samples of X: the update w_{k+1} = w_k − α_k ∇_w f(w_k, x_k) replaces the true gradient with a stochastic gradient computed from a single sample. SGD can be viewed as a special case of the RM algorithm.
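A minimal sketch of SGD on the mean estimation problem, minimizing J(w) = E[(w − X)²/2], whose per-sample gradient is simply (w − x); the sample distribution is made up for illustration:

```python
import random

# Minimize J(w) = E[(w - X)^2 / 2]; the minimizer is w* = E[X].
# The stochastic gradient of f(w, x) = (w - x)^2 / 2 is (w - x).
w = 0.0
for k in range(1, 10001):
    x = random.gauss(4.0, 1.0)    # hypothetical i.i.d. samples, E[X] = 4
    alpha_k = 1.0 / k
    w = w - alpha_k * (w - x)     # w_{k+1} = w_k - alpha_k * grad f(w_k, x_k)
print(w)  # close to 4.0
```

With α_k = 1/k this update coincides with the incremental mean estimate from Chapter 5, which is exactly why mean estimation is the motivating example for this chapter.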