Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
TL;DR: Dense token-level supervision accelerates specialization but causes more severe forgetting, which hurts continual learning.
Recent work suggests that reinforcement learning naturally mitigates catastrophic forgetting during continual post-training [1, 2, 3]. We revisit this claim by comparing two on-policy methods: Self-Distillation Policy Optimization (SDPO) [2], which uses dense token-level supervision, and Group Relative Policy Optimization (GRPO) [4], which uses sequence-level rewards.
Our findings show that:
- 🚀 SDPO rapidly specializes to the current domain.
- 🎓 Self-distillation is highly sensitive to teacher stability and quality.
- ⚠️ Dense token-level supervision increases model drift and forgetting.
- 📉 In continual post-training, SDPO can become unstable or even collapse.
- ✅ GRPO learns more conservatively while preserving prior capabilities significantly better.
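To make the contrast between the two training signals concrete, here is a minimal sketch in plain Python. It assumes GRPO-style advantages are computed by normalizing scalar rewards within a group of sampled responses (population standard deviation is used here for simplicity), and it uses a per-token KL divergence between a teacher and a student distribution as a stand-in for dense supervision. The function names are hypothetical, and the actual SDPO objective may use a different divergence or weighting.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Sequence-level signal: one scalar advantage per sampled response.

    Each reward is normalized against the mean and standard deviation
    of its group, so every token in a response shares the same value.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std for simplicity
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) at one token position.

    Assumes both inputs are probability vectors over the same vocabulary
    and that student_probs is strictly positive wherever teacher_probs is.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def per_token_distill_signal(teacher_seq, student_seq):
    """Token-level signal: one divergence value per position in a response."""
    return [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]


# Four sampled responses yield four scalars in total.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# One response of three tokens over a toy two-word vocabulary
# yields three separate values.
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]
dense_signal = per_token_distill_signal(teacher, student)
```

In this toy setup a response of T tokens receives T separate targets under the dense signal but a single shared scalar under the group-relative one, which is the sense in which token-level supervision is "denser".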
Main Findings
| Finding | Summary |
|---|---|
| Teacher Stability | Stable teachers outperform rapidly updated EMA teachers. |
| CoT Distillation | More supervision is not always better: long CoTs often amplify noise. |
| OOD Generalization | SDPO improves source-like tasks but hurts partially related domains. |
| Continual Learning | SDPO accumulates forgetting across sequential domains. |
| Model Drift | Dense supervision causes substantially larger parameter and response drift. |
| Collapse | Self-distillation can amplify formatting artifacts into catastrophic failures. |
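The teacher-stability and model-drift rows can be illustrated with a small sketch. It assumes the EMA teacher follows the standard exponential-moving-average update and uses the L2 distance between parameter vectors as a drift measure. Both the toy parameter lists and the choice of L2 distance are assumptions for illustration, not necessarily the metrics used in the paper.

```python
import math


def ema_update(teacher, student, decay):
    """Standard EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def l2_drift(params, reference):
    """L2 distance between two parameter vectors of equal length."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(params, reference)))


# Toy run: the student moves steadily away from its initial parameters.
init = [0.0, 0.0]
student = list(init)
fast_teacher = list(init)  # low decay: closely tracks the student
slow_teacher = list(init)  # high decay: stays near the initial model

for _ in range(10):
    student = [w + 0.1 for w in student]
    fast_teacher = ema_update(fast_teacher, student, decay=0.5)
    slow_teacher = ema_update(slow_teacher, student, decay=0.99)

fast_drift = l2_drift(fast_teacher, init)
slow_drift = l2_drift(slow_teacher, init)
```

With a low decay the teacher follows the student almost step for step, so any error the student picks up is quickly reflected in its own targets. A high decay keeps the teacher close to the starting model, which is one way to read "stable" in the table above.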
Citation
@misc{wang2026denserneqbetterlimits,
title={Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training},
author={Meng Wang and Haohan Zhao and Wenzhuo Liu and Lu Yang and Geng Liu and Haiyang Guo and Guo-Sen Xie and Gaofeng Meng and Hongbin Liu and Fei Zhu},
year={2026},
eprint={2607.01763},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2607.01763},
}
References
1. "The Path Not Taken: RLVR Provably Learns Off the Principals". https://arxiv.org/abs/2511.08567
2. "Reinforcement Learning via Self-Distillation". https://arxiv.org/abs/2601.20802
3. "Reinforcement Fine-Tuning Naturally Mitigates Forgetting in Continual Post-Training". https://arxiv.org/abs/2507.05386
4. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models". https://arxiv.org/abs/2402.03300