Improving Latent Generalization Using Test-time Compute

Arslan Chaudhry, Sridhar Thiagarajan, Andrew Lampinen

arXiv:2604.01430v1 Announce Type: new Abstract: Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to facilitate deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization capabilities. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or 'thinking', specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) to improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failures on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge for which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion, but the generate-and-verify ability of thinking models enables them to get well above chance performance. The brittleness of factual self-verification means thinking models still remain well below the performance of in-context learning for this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.
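The training recipe described in the abstract, RL from correctness feedback, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual implementation: `sample_cot` stands in for a language model sampling a chain-of-thought plus a final answer, and the reward is simply 1 when the answer matches the gold label.

```python
import random

random.seed(0)

def sample_cot(prompt):
    """Stand-in for an LM sampling a chain-of-thought and a final answer.

    Hypothetical: draws a right or wrong answer at random so the
    reward signal below has something to discriminate.
    """
    answer = random.choice(["Paris", "Lyon", "Paris"])
    return f"thinking about: {prompt} ...", answer

def correctness_reward(answer, gold):
    # Binary correctness feedback: no partial credit, no process reward.
    return 1.0 if answer == gold else 0.0

def collect_batch(prompt, gold, n=8):
    """Sample n CoT rollouts and score each with the correctness reward."""
    batch = []
    for _ in range(n):
        cot, answer = sample_cot(prompt)
        batch.append((cot, answer, correctness_reward(answer, gold)))
    return batch

batch = collect_batch("What is the capital of France?", gold="Paris")
# A REINFORCE-style update would upweight the rewarded trajectories;
# here we only report what fraction of sampled CoTs earned the reward.
reward_rate = sum(r for _, _, r in batch) / len(batch)
print(f"{reward_rate:.2f} of sampled CoTs earned the correctness reward")
```

In a real setup the rewarded rollouts would feed a policy-gradient update so the model learns to produce chains-of-thought that reliably reach correct answers; the toy only shows the reward-collection loop.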

Executive Summary

This article proposes an innovative approach to improve latent generalization in language models (LMs) through test-time compute, or 'thinking'. By training models to produce long chains-of-thought (CoTs) using Reinforcement Learning (RL) from correctness feedback, the authors demonstrate significant improvements in latent generalization on in-distribution knowledge. Moreover, the thinking approach generalizes to new knowledge, outperforming augmentation baselines. While the results show promise, the brittleness of factual self-verification limits the performance of thinking models on pure reversal tasks. The study highlights the potential of test-time thinking as a flexible direction for improving LMs' latent generalization capabilities.

Key Points

  • Test-time compute (thinking) improves latent generalization in LMs
  • Reinforcement Learning from correctness feedback enables long chains-of-thought
  • Thinking generalizes to new knowledge and outperforms augmentation baselines
  • Factual self-verification limits the performance of thinking models on reversal tasks

Merits

Improving Latent Generalization

The study provides an innovative approach to address the limitations of in-weights learning, enabling LMs to perform deductive reasoning over internalized knowledge and generalize to new knowledge.

Flexibility and Scalability

Unlike train-time data augmentation, which is task-specific and scales poorly, the thinking approach generalizes to knowledge for which no RL training was performed, making it a more flexible and scalable direction for improving LMs' latent generalization capabilities.

Demerits

Limitations on Reversal Tasks

Thinking does not unlock direct knowledge inversion on pure reversal tasks; models instead rely on generate-and-verify, and the brittleness of factual self-verification keeps their performance well below that of in-context learning, requiring further research to close this gap.
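The generate-and-verify behavior described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's method: the entity names and the `forward_facts` dictionary (standing in for knowledge stored in the model's weights) are invented for the example. Rather than inverting a fact directly, the model proposes candidate answers and checks each one in the forward direction it was trained on.

```python
# Forward facts as learned in the weights; only the parent -> child
# direction was ever seen during training.
forward_facts = {
    "Mary Lee Pfeiffer": "Tom Cruise",
}

def verify_forward(parent, child):
    """Self-verification: does the forward fact 'parent's son is child' hold?

    In a real model this check is itself noisy, which is the brittleness
    the abstract points to.
    """
    return forward_facts.get(parent) == child

def answer_reverse(child, candidate_parents):
    """Answer a reversal query by generate-and-verify.

    Generate: enumerate candidate answers (a thinking model would sample
    these in its chain-of-thought). Verify: accept a candidate only if
    the forward-direction check passes.
    """
    for parent in candidate_parents:
        if verify_forward(parent, child):
            return parent
    return None  # no candidate survived verification

candidates = ["Anne Smith", "Mary Lee Pfeiffer", "Jane Doe"]
print(answer_reverse("Tom Cruise", candidates))  # Mary Lee Pfeiffer
```

Because success depends on the right candidate being generated and the forward check being reliable, this strategy beats chance but, as the article notes, stays well below in-context learning, where both directions of the fact are visible in the prompt.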

Task-Specific Solutions

The evaluation centers on a limited set of latent-generalization tasks, with reversal as the hardest case; whether the thinking approach transfers to other task families requires further research.

Expert Commentary

This article contributes significantly to the field of natural language processing (NLP) by providing an innovative approach to improving latent generalization in LMs. The findings demonstrate the potential of test-time thinking to enable deductive reasoning over in-weights knowledge and generalization to new knowledge. While the results are promising, the brittleness of factual self-verification and the narrow task coverage require further research to overcome. The study's implications for practical applications highlight the need for continued investment in NLP research.

Recommendations

  • Further research should focus on developing more robust and generalizable LMs that can integrate in-weights learning and in-context learning.
  • The thinking approach should be applied to a broader range of tasks and domains to evaluate its generalizability and scalability.

Sources

Original: arXiv - cs.LG