WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement
arXiv:2603.22352v1 Announce Type: new Abstract: Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improvement of language models, but existing …
Fangyuan Li, Pengfei Li, Shijie Wang, Junqi Gao, Jianxing Liu, Biqing Qi, Yuqiang Li
2 views