https://github.com/iBacklight/PipelineLLM/blob/40278bdcfe9d117b850b8abae0b20f5c9c9882db/Post_train/rlhf/GRPO/grpo_algorithm.py#L362-L381 The computation of the replay, the old policy, and the new policy here also seems to have some problems.