Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

arXiv: 2607.01763 · License: Apache 2.0 · Code: github.com/Moenupa/SDPO-CL


TL;DR: Dense token-level supervision accelerates specialization but causes more severe forgetting, which hurts continual learning.

Recent work suggests that reinforcement learning naturally mitigates catastrophic forgetting during continual post-training [1, 2, 3]. We revisit this claim by comparing two on-policy methods: Self-Distillation Policy Optimization (SDPO) [2], which uses token-level supervision, and Group Relative Policy Optimization (GRPO) [4], which uses sequence-level rewards.
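To make the contrast in supervision density concrete, here is a minimal sketch in plain Python. The group-normalized advantage follows the GRPO formulation in [4] (population standard deviation is an assumption here). The per-token KL is a generic distillation signal used for illustration and is not necessarily the exact SDPO objective. The function names and toy numbers are hypothetical.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Sequence-level signal: one scalar advantage per sampled response,
    normalized within the group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance (assumption)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def token_level_kl(teacher_probs, student_probs, eps=1e-12):
    """Token-level signal: one KL(teacher || student) value per position.
    Each argument is a list of per-position probability vectors."""
    losses = []
    for p_t, p_s in zip(teacher_probs, student_probs):
        kl = sum(t * math.log((t + eps) / (s + eps))
                 for t, s in zip(p_t, p_s) if t > 0)
        losses.append(kl)
    return losses

# Four sampled responses to one prompt, scored 0/1 by a verifier (toy values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)          # 4 scalars, one per response

# One response of 3 tokens over a toy vocabulary of size 2.
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
per_token = token_level_kl(teacher, student)   # 3 values, one per token
```

With sequence-level rewards, every token in a response shares the same scalar. With token-level distillation, each position receives its own target, so the amount of supervision grows with response length.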

Our findings show that:

  • 🚀 SDPO rapidly specializes to the current domain.
  • 🎓 Self-distillation is highly sensitive to teacher stability and quality.
  • ⚠️ Dense token-level supervision increases model drift and forgetting.
  • 📉 In continual post-training, SDPO can become unstable or even collapse.
  • GRPO learns more conservatively while preserving prior capabilities significantly better.
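The teacher-stability point above can be illustrated with a toy exponential moving average (EMA) update. This is a sketch under stated assumptions: the weights are a single float, the "policy update" is a fixed step, and the decay values are hypothetical rather than settings reported on this page.

```python
def ema_update(teacher, student, decay):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def run(decay, steps=10):
    """Move a toy student by +1.0 per step and let the teacher follow via EMA."""
    teacher = [0.0]
    student = [0.0]
    for _ in range(steps):
        student = [student[0] + 1.0]   # stand-in for one policy update
        teacher = ema_update(teacher, student, decay)
    return teacher[0], student[0]

fast_teacher, student_final = run(decay=0.5)   # rapidly updated: trails the student by about one step
slow_teacher, _ = run(decay=0.99)              # slowly updated: stays near the initial weights
frozen_teacher, _ = run(decay=1.0)             # fixed teacher: never moves
```

A low decay makes the teacher inherit whatever the student just learned, including its mistakes. A decay close to 1 keeps the distillation target close to the starting model.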

Main Findings

| Finding | Summary |
|---|---|
| Teacher Stability | Stable teachers outperform rapidly updated EMA teachers. |
| CoT Distillation | More supervision is not always better; long CoTs often amplify noise. |
| OOD Generalization | SDPO improves source-like tasks but hurts partially related domains. |
| Continual Learning | SDPO accumulates forgetting across sequential domains. |
| Model Drift | Dense supervision causes substantially larger parameter and response drift. |
| Collapse | Self-distillation can amplify formatting artifacts into catastrophic failures. |

SDPO in Continual Learning

GRPO consistently preserves previously learned capabilities, while SDPO forgets rapidly as more domains are introduced.

SDPO Gain and Forgetting Curve

SDPO exhibits an inverted U-shaped curve in OOD generalization across datasets.

Dense supervision induces much larger changes in both parameter space and response space than sequence-level RL.
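Forgetting and parameter drift, as discussed above, can be quantified in several ways, and this page does not state which definitions the paper uses. The sketch below assumes two common choices: average forgetting as the drop from the best earlier accuracy to the final accuracy on each earlier domain, and parameter drift as a relative L2 distance between weight vectors. All names and numbers are hypothetical toy values, not results from the paper.

```python
import math

def average_forgetting(acc):
    """acc[i][j] = accuracy on domain j after training on domain i (j <= i).
    Forgetting of domain j = best accuracy before the final stage minus final accuracy."""
    final = len(acc) - 1
    drops = []
    for j in range(final):
        best = max(acc[i][j] for i in range(j, final))
        drops.append(best - acc[final][j])
    return sum(drops) / len(drops)

def relative_parameter_drift(before, after):
    """L2 distance between two flat weight vectors, relative to the norm of `before`."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    return diff / math.sqrt(sum(b ** 2 for b in before))

# Toy accuracy tables for a three-domain sequence (made-up numbers).
acc_conservative = [[0.70], [0.69, 0.60], [0.68, 0.59, 0.55]]
acc_aggressive = [[0.80], [0.60, 0.75], [0.40, 0.50, 0.70]]

forget_conservative = average_forgetting(acc_conservative)
forget_aggressive = average_forgetting(acc_aggressive)
drift = relative_parameter_drift([3.0, 4.0], [3.0, 4.5])
```

In this toy table the aggressive learner reaches higher accuracy on each new domain but loses far more on earlier ones, which is the trade-off pattern the findings describe.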

Citation

@misc{wang2026denserneqbetterlimits,
      title={Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training}, 
      author={Meng Wang and Haohan Zhao and Wenzhuo Liu and Lu Yang and Geng Liu and Haiyang Guo and Guo-Sen Xie and Gaofeng Meng and Hongbin Liu and Fei Zhu},
      year={2026},
      eprint={2607.01763},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2607.01763}, 
}
  1. https://arxiv.org/abs/2511.08567 “The Path Not Taken: RLVR Provably Learns Off the Principals” 

  2. https://arxiv.org/abs/2601.20802 “Reinforcement Learning via Self-Distillation”

  3. https://arxiv.org/abs/2507.05386 “Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training” 

  4. https://arxiv.org/abs/2402.03300 “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models”