LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
arXiv:2604.03263v1 Abstract: Most current long-context language models still rely on attention to handle both local interaction and long-range state, which leaves relatively little room to test alternative decompositions of sequence modeling. We propose LPC-SM, a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block, and we use Orthogonal Novelty Transport (ONT) to govern slow-memory writes. We evaluate a 158M-parameter model in three stages spanning base language modeling, mathematical continuation, and 4096-token continuation. Removing mHC raises the Stage-A final LM loss from 12.630 to 15.127, while adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787 relative to a matched fixed-ratio continuation. The full route remains stable at sequence length 4096, where Stage C ends with final LM loss 11.582 and improves the delayed-identifier diagnostic from 14.396 to 12.031 in key cross-entropy. Taken together, these results show that long-context autoregressive modeling can be organized around a broader division of labor than attention alone.
Executive Summary
The article LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling presents a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within a single block, with Orthogonal Novelty Transport (ONT) governing writes to the slow memory. The authors evaluate a 158M-parameter model across three stages: base language modeling, mathematical continuation, and 4096-token continuation. The ablations support the design: removing the mHC component raises the Stage-A final LM loss from 12.630 to 15.127, adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787 over a matched fixed-ratio baseline, and the full model remains stable at sequence length 4096 with a Stage-C final LM loss of 11.582. Rather than routing all sequence modeling through attention, the architecture assigns each sub-problem a dedicated mechanism, and the results suggest that long-context autoregressive modeling can be organized around this broader division of labor.
Key Points
- ▸ LPC-SM is a hybrid autoregressive architecture that separates local attention, persistent memory, predictive correction, and run-time control within the same block.
- ▸ Orthogonal Novelty Transport (ONT) governs writes to the slow memory.
- ▸ A 158M-parameter model is evaluated in three stages: base language modeling, mathematical continuation, and 4096-token continuation.
- ▸ Ablations quantify each component: removing mHC raises the Stage-A final LM loss from 12.630 to 15.127, adaptive sparse control improves the Stage-B final LM loss from 12.137 to 10.787, and Stage C improves the delayed-identifier diagnostic from 14.396 to 12.031 in key cross-entropy.
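The abstract does not spell out the block internals, but its "local attention" component can be illustrated with a generic sliding-window causal attention, the standard building block that name points to. The sketch below is an illustrative assumption (function name, single head, and numpy formulation are ours), not the paper's implementation:

```python
import numpy as np

def local_causal_attention(x, window):
    """Single-head causal attention in which each position attends only
    to itself and the previous `window - 1` tokens.

    x: (T, d) array of token representations; returns a (T, d) array.
    (Illustrative sketch; not the LPC-SM block as published.)
    """
    T, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)                       # pairwise dot-product scores
    idx = np.arange(T)
    causal = idx[None, :] <= idx[:, None]                 # no attending to the future
    local = idx[:, None] - idx[None, :] < window          # no attending beyond the window
    scores = np.where(causal & local, scores, -np.inf)    # mask out disallowed pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return w @ x
```

With `window=1` each token attends only to itself, so the layer reduces to the identity; widening the window trades compute for more local context, which is exactly the knob such a hybrid design frees long-range state handling from.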
Merits
Strength in Decomposition
By separating local attention, persistent memory, predictive correction, and run-time control within the same block, the architecture assigns each sub-problem of sequence modeling to a dedicated mechanism rather than routing everything through attention. This also makes each component's contribution testable in isolation, which the ablations exploit.
Effectiveness in Long-Context Modeling
The ablations back the design up: adaptive sparse control lowers the Stage-B final LM loss from 12.137 to 10.787 relative to a matched fixed-ratio continuation, and the full route remains stable at sequence length 4096 with a Stage-C final LM loss of 11.582.
Efficient Use of Memory
Gating slow-memory writes with Orthogonal Novelty Transport (ONT) keeps the persistent memory from being spent on redundant content, and the improvement on the delayed-identifier diagnostic (key cross-entropy from 14.396 to 12.031) suggests that long-range information is actually retained.
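The abstract does not define ONT's mechanics, but the name suggests a write rule keyed to the component of a candidate that is orthogonal to what the memory already spans. A minimal sketch under that assumption (Gram-Schmidt-style novelty gating; the function name, threshold, and slot layout are all hypothetical):

```python
import numpy as np

def ont_write(memory, v, tau=0.1):
    """Write v's novel component into memory only if it is sufficiently
    orthogonal to what is already stored (a Gram-Schmidt-style gate).

    memory: (k, d) array of orthonormal slot vectors (k may be 0)
    v:      (d,) candidate vector
    tau:    minimum novelty norm required to trigger a write
    Returns the (possibly extended) memory.
    (Hypothetical reading of ONT; not the paper's published rule.)
    """
    v = v / np.linalg.norm(v)
    if len(memory) == 0:
        return v[None, :]
    # Component of v orthogonal to the span of the existing slots.
    novelty = v - memory.T @ (memory @ v)
    if np.linalg.norm(novelty) < tau:
        return memory  # nothing new: skip the slow-memory write
    slot = novelty / np.linalg.norm(novelty)
    return np.vstack([memory, slot])
```

Under this reading, a repeated input produces zero novelty and is never written, while a genuinely new direction claims a fresh slot; that would explain why such a gate conserves memory capacity for new information.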
Demerits
Limited Evaluation
The evaluation covers only a single 158M-parameter model, so scaling behavior at larger sizes and sensitivity to configuration choices remain untested.
Dependency on Orthogonal Novelty Transport
The architecture's effectiveness depends on Orthogonal Novelty Transport (ONT) to govern slow-memory writes, and it is not established that this write rule suits all applications or model scales.
Expert Commentary
LPC-SM is a credible test of whether long-context autoregressive modeling can be decomposed more finely than attention alone: local attention, persistent memory, predictive correction, and run-time control each get a dedicated mechanism, and the ablations (mHC removal, adaptive versus fixed-ratio sparse control) show that each part contributes measurably. The main caveats are scale and generality. All results come from one 158M-parameter model, and the design is tightly coupled to ONT-governed memory writes, so it is unclear how the approach behaves at larger scales or under other write rules. Still, the stability at 4096 tokens and the improved delayed-identifier retention make the architecture a serious candidate organization for long-context sequence modeling, with clear relevance to broader NLP tasks if the results hold up at scale.
Recommendations
- ✓ Evaluate the architecture at additional model sizes and configurations to establish its scaling behavior.
- ✓ Investigate alternative memory-write rules in place of ONT to isolate how much of the gain depends on it.
Sources
Original: arXiv - cs.CL