I am trying to reproduce your results on a DDPM-style model (similar to SDXL, but better) rather than a flow matching model. I followed your configuration and did not use CFG during the rollout step.
Because SDXL needs CFG at inference time, the rollout step returns low-quality images. With this training data, after 1200 steps the trained model outputs bad images, like this:
The output images have lost almost all texture, and the background color has turned gray.
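
To be concrete about what I mean by CFG here, this is a minimal sketch of the usual classifier-free guidance combination of the unconditional and conditional noise predictions. The function name `cfg_combine` and the toy arrays are only illustrative assumptions, not code from your repo or from SDXL. In this convention, rolling out without CFG corresponds to a guidance scale of 1, which keeps only the conditional prediction.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by guidance_scale."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions standing in for the model outputs (hypothetical values).
eps_uncond = np.zeros((2, 2))
eps_cond = np.ones((2, 2))

no_cfg = cfg_combine(eps_uncond, eps_cond, 1.0)    # same as the conditional prediction
with_cfg = cfg_combine(eps_uncond, eps_cond, 5.0)  # extrapolates past the conditional prediction
```

So in my rollout the samples come from the `no_cfg` case, while normal SDXL inference uses something like the `with_cfg` case.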
I use only the GenEval reward model. The reward score curve looks like this:
