
UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

arXiv:2604.05517v1 (Announce Type: new)

Abstract: A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose **UniCreative**, a unified reference-free reinforcement learning framework. We first introduce **AC-GenRM**, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose **ACPO**, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning and ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.

Executive Summary

The article introduces UniCreative, a reinforcement learning framework designed to address the dual challenges of global coherence in long-form narratives and local expressiveness in short-form texts. To resolve the tension between macroscopic planning and spontaneous creativity, the authors propose two core innovations: AC-GenRM, an adaptive constraint-aware reward model that dynamically synthesizes task-specific criteria without reliance on ground-truth references, and ACPO, a policy optimization algorithm that aligns model outputs with human preferences across structural and content dimensions. Empirical validation demonstrates the framework's ability to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, showcasing emergent meta-cognitive capabilities. This approach circumvents the scalability limitations of traditional alignment paradigms, which depend on static rewards and high-cost supervised data, thereby offering a scalable, reference-free solution to creative writing alignment.

Key Points

  • The tension between global coherence (long-form) and local expressiveness (short-form) in creative writing is addressed through a unified reinforcement learning framework, UniCreative.
  • AC-GenRM introduces adaptive, query-specific reward modeling that eliminates the need for ground-truth references, enabling dynamic preference alignment.
  • ACPO facilitates policy optimization by aligning model outputs with human preferences across both content quality and structural paradigms without supervised fine-tuning.
  • Empirical results highlight UniCreative's emergent meta-cognitive ability to autonomously differentiate task requirements, validating the direct alignment approach.
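
To make the reference-free loop described above concrete, the sketch below shows one common group-relative recipe: sample several drafts per prompt, score each with a judge (no gold reference is consulted), and weight each draft by its group-normalized advantage. This mirrors GRPO-style updates in general; the paper does not publish ACPO's exact formulation, so treat this as an illustrative stand-in, not the authors' algorithm.

```python
# Illustrative reference-free RL step: rewards come from a judge, not from
# comparison against a ground-truth text, and advantages are computed
# relative to the sampled group rather than a learned value baseline.
from statistics import mean, pstdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Hypothetical judge scores for 4 sampled drafts of the same prompt.
rewards = [0.9, 0.4, 0.6, 0.5]
advs = group_advantages(rewards)

# The policy-gradient update would reinforce drafts in proportion to their
# advantage; here we just identify the most-reinforced draft.
best = max(range(len(rewards)), key=lambda i: advs[i])
print(best)  # index 0: the highest-scoring draft
```

Because the baseline is the group mean, no reference answer or separate value network is required, which is the property that makes this style of update "reference-free" in the sense the key points use.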

Merits

Innovation in Alignment Paradigms

UniCreative pioneers a reference-free reinforcement learning framework, eliminating dependency on costly supervised data and static reward signals, which addresses a critical bottleneck in scalable alignment for creative writing.

Adaptive Reward Modeling

AC-GenRM's dynamic synthesis of task-specific criteria enables fine-grained, query-aware preference judgments, enhancing alignment precision without ground-truth references.
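
The idea of synthesizing criteria per query can be sketched as follows. The paper does not release AC-GenRM's implementation, so every function here is a hypothetical stand-in: a real system would prompt a generative reward model to derive the criteria and the per-criterion scores, whereas this toy uses surface cues and a length heuristic purely to show the control flow.

```python
# Hedged sketch of adaptive, query-specific reward modeling in the spirit
# of AC-GenRM. All heuristics below are illustrative stubs, not the paper's.

def synthesize_criteria(query: str) -> list[str]:
    """Derive evaluation criteria from the query itself (stubbed heuristic)."""
    criteria = ["fluency", "originality"]
    if any(cue in query.lower() for cue in ("novel", "chapter", "story arc")):
        criteria += ["global coherence", "plot consistency"]  # long-form cues
    else:
        criteria += ["local expressiveness", "imagery"]       # short-form cues
    return criteria

def judge(candidate: str, criteria: list[str]) -> float:
    """Score a candidate against the synthesized criteria (stubbed).

    Stand-in metric: saturating length score per criterion; a real judge
    would return per-criterion ratings from a generative reward model.
    """
    return sum(min(len(candidate) / 100.0, 1.0) for _ in criteria) / len(criteria)

def preference(query: str, a: str, b: str) -> str:
    """Reference-free pairwise preference: no gold answer is consulted."""
    crit = synthesize_criteria(query)
    return "a" if judge(a, crit) >= judge(b, crit) else "b"

winner = preference("Write a chapter outline for a fantasy novel.",
                    "A" * 120, "B" * 40)
print(winner)
```

The key design point the sketch preserves is that the criteria are a function of the query, so long-form and short-form requests are judged against different rubrics without any ground-truth reference text.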

Empirical Validation and Emergent Capabilities

The framework demonstrates significant performance gains across diverse writing tasks and exhibits meta-cognitive differentiation between planning and direct generation, suggesting robustness and adaptability.

Demerits

Dependence on Reward Model Quality

The effectiveness of AC-GenRM hinges on the quality and representativeness of synthesized reward signals, which may introduce biases or inconsistencies if not meticulously calibrated.

Computational Overhead

Reinforcement learning frameworks, particularly those involving dynamic reward modeling and policy optimization, entail significant computational resources, potentially limiting accessibility for smaller research teams or organizations.

Validation in Real-World Scenarios

While empirical results are promising, further validation in real-world creative writing contexts—such as professional storytelling or content creation—is necessary to assess broader applicability and user acceptance.

Expert Commentary

The authors present a compelling and timely advancement in the alignment of AI systems for creative writing, addressing a critical gap in the literature. The decoupling of long-form coherence and short-form expressiveness through a unified reinforcement learning framework is both innovative and pragmatic, particularly given the scalability challenges of traditional alignment paradigms. The introduction of AC-GenRM and ACPO represents a significant theoretical contribution, as it challenges the conventional reliance on static rewards and ground-truth references, instead opting for dynamic, query-specific preference modeling. The emergent meta-cognitive ability observed in the model—its capacity to autonomously differentiate between tasks—is particularly noteworthy, as it suggests a step toward more generalizable and adaptable AI systems. However, the practical deployment of such frameworks will require rigorous testing in diverse, real-world scenarios to ensure that the synthesized rewards and optimization objectives align with human values and expectations. Furthermore, the computational demands of the approach may pose challenges for widespread adoption, highlighting the need for further research into efficiency and accessibility.

Recommendations

  • Future work should explore hybrid approaches that combine UniCreative's adaptive reward modeling with lightweight fine-tuning techniques to mitigate computational overhead.
  • To ensure broader applicability, the framework should be validated across a wider range of creative writing tasks, including collaborative human-AI co-creation scenarios.
  • Ethical frameworks and guidelines should be developed in parallel to assess the societal implications of AI-generated creative content, particularly in terms of originality, attribution, and potential misuse.

Sources

Original: arXiv - cs.AI