[1706.03762] Attention Is All You Need - arXiv.org
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
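For context, a minimal NumPy sketch of the scaled dot-product attention at the core of the Transformer. The shapes and the d_k scaling follow the paper's formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V; the random inputs and sizes are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the basic attention block of the Transformer."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of values

# Toy example: 3 query positions attending over 4 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)       # (3, 8)
```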
arXiv.org e-Print archive
arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Materials on this site are not peer-reviewed by arXiv.
Submit TeX/LaTeX - arXiv info
TeX Submissions: (La)TeX processing changes — April 2025; Comparison between the Legacy Submission System and the current one; Supported TeX processors; Submissions are automatically processed; Considerations for LaTeX submissions; Considerations for PDFLaTeX submissions; We don't have your style files or macros; Do not submit in double-spaced "referee" mode; Prepare the references carefully; Include …
[2505.17117] From Tokens to Thoughts: How LLMs and Humans …
Humans organize knowledge into compact categories through semantic compression, by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities …
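To make the compression-fidelity trade-off concrete, here is a toy sketch that is not from the paper: instances are grouped into categories and "fidelity" is measured as the average cosine similarity of each instance to its category centroid. The random embeddings, the word list, and the metric are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "embeddings" for a handful of words (purely illustrative).
words = ["robin", "blue jay", "penguin", "salmon", "trout", "shark"]
emb = {w: rng.normal(size=16) for w in words}

def fidelity(grouping):
    """Average cosine similarity of each instance to its category centroid:
    fewer categories = more compression, typically lower fidelity."""
    sims = []
    for members in grouping.values():
        centroid = np.mean([emb[w] for w in members], axis=0)
        for w in members:
            v = emb[w]
            sims.append(v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid)))
    return float(np.mean(sims))

fine   = {"bird": ["robin", "blue jay", "penguin"], "fish": ["salmon", "trout", "shark"]}
coarse = {"animal": words}

print("2 categories:", round(fidelity(fine), 3))    # less compression, usually higher fidelity
print("1 category: ", round(fidelity(coarse), 3))   # more compression, usually lower fidelity
```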
[2006.16668] GShard: Scaling Giant Models with Conditional …
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel …
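GShard's conditional computation is built on sparsely gated Mixture-of-Experts layers; below is a minimal NumPy sketch of top-2 token-to-expert routing in that spirit. The single-matrix "experts", gate weights, and dimensions are made-up placeholders, and real implementations add expert capacity limits and an auxiliary load-balancing loss.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts = 8, 16, 4
tokens = rng.normal(size=(num_tokens, d_model))

# Toy experts: each is one linear map (placeholder for a real feed-forward network).
expert_weights = rng.normal(size=(num_experts, d_model, d_model)) * 0.1
gate_weights = rng.normal(size=(d_model, num_experts)) * 0.1

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Gating: each token picks its top-2 experts and mixes their outputs
# with renormalized gate probabilities; the other experts do no work for it.
gate_probs = softmax(tokens @ gate_weights)          # (tokens, experts)
top2 = np.argsort(gate_probs, axis=-1)[:, -2:]       # indices of the 2 best experts per token

output = np.zeros_like(tokens)
for t in range(num_tokens):
    p = gate_probs[t, top2[t]]
    p = p / p.sum()                                  # renormalize over the chosen experts
    for w, e in zip(p, top2[t]):
        output[t] += w * (tokens[t] @ expert_weights[e])

print(output.shape)  # (8, 16) -- each token was processed by only 2 of the 4 experts
```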
Fourier Position Embedding: Enhancing Attention's Periodic …
Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we …
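For reference, a minimal NumPy sketch of the Rotary Position Embedding (RoPE) that the paper analyzes: paired dimensions of each query/key vector are rotated by angles proportional to the token position, so relative offsets appear as phase differences in the query-key dot product. The base of 10000 and the split-half layout follow common practice; everything else is illustrative.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate paired dimensions of x (split-half layout) by position-dependent
    angles. x has shape (seq_len, dim) with an even dim."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)            # one frequency per dimension pair
    angles = np.outer(np.arange(seq_len), freqs)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# After RoPE, q.k depends mainly on the relative offset between the two positions.
rng = np.random.default_rng(0)
q = apply_rope(rng.normal(size=(16, 8)))
k = apply_rope(rng.normal(size=(16, 8)))
print((q @ k.T).shape)  # (16, 16) attention scores with rotary positions applied
```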
The Entropy Mechanism of Reinforcement Learning for Reasoning …
Abstract: This paper aims to overcome a major obstacle in scaling reinforcement learning (RL) for reasoning with large language models (LLMs), namely the collapse of policy entropy. This phenomenon is consistently observed across vast RL runs without entropy intervention, where the policy entropy dropped sharply at the early training stage, leading to an overly confident policy model. As a …
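Policy entropy here is the Shannon entropy of the model's next-token distribution, averaged over sampled positions; when it collapses toward zero the policy becomes overconfident and stops exploring. A small NumPy sketch of the quantity being tracked (the logits, the sharpening factor, and the entropy-bonus comment are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def policy_entropy(logits):
    """Mean Shannon entropy H(pi) = -sum_a pi(a) log pi(a) over a batch of
    next-token distributions; values near 0 indicate a collapsed, overconfident policy."""
    probs = softmax(logits)
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
vocab = 1000
early = rng.normal(size=(32, vocab))     # small-scale logits -> broad distribution, high entropy
late = early * 10.0                      # sharpened logits -> entropy collapse

print("early policy entropy:", round(policy_entropy(early), 3))
print("late policy entropy: ", round(policy_entropy(late), 3))

# One common (illustrative) countermeasure is an entropy bonus in the objective,
# e.g. loss = pg_loss - beta * policy_entropy(logits), to keep the policy exploring.
```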