This file provides a function `register_forward_hook_for_model` that registers a forward hook on every operator of the model. After registration, during model inference, all tensors generated ...
KL penalty: constrains the KL divergence between the new policy and the reference policy, preventing the policy from drifting too far.

Main differences from PPO:
- PPO uses a value network to estimate the baseline; GRPO uses group-relative rewards as the baseline
- GRPO does not need to train a value network, saving GPU memory and compute
- GRPO introduces a reference model and a KL divergence penalty
"""
import torch
import torch.nn as nn
import torch.optim as optim
from ...
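A minimal sketch of the two ideas above, the group-relative baseline and the KL penalty. It assumes rewards arrive in shape `(num_groups, group_size)` (one group per prompt); the function names and the choice of the k3 KL estimator are illustrative, not this file's actual implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative baseline: normalize each response's reward by the
    # mean/std of its own group (one group = all samples for one prompt),
    # so no learned value network is needed.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def kl_penalty(logp_new: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    # Per-token KL estimate via k3 = exp(ref - new) - (ref - new) - 1,
    # which is non-negative and zero when the policies agree.
    diff = logp_ref - logp_new
    return diff.exp() - diff - 1

# 2 prompts, 4 sampled responses each (rewards are made up for illustration)
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5],
                        [0.2, 0.8, 0.2, 0.8]])
adv = grpo_advantages(rewards)
print(adv.shape)
```

Each row of `adv` has (approximately) zero mean, so within a group better-than-average responses get positive advantages and worse ones get negative advantages.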