R-HORIZON: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Yi Lu, Jianing Wang, Linsen Guo, Wei He, Hongyin Tang,
Tao Gui, Xuanjing Huang, Xuezhi Cao, Wei Wang, Xunliang Cai
Fudan University, Meituan
†Correspondence to: lygwjn@gmail.com, wangwei432@meituan.com

Abstract

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand and respond to complex, long-horizon scenarios. To address this incomplete evaluation of Large Reasoning Models (LRMs), we propose R-HORIZON, a method designed to stimulate long-horizon reasoning behaviors in LRMs through query composition. Based on R-HORIZON, we construct a long-horizon reasoning benchmark comprising complex multi-step reasoning tasks with interdependent problems that span long reasoning horizons. Through comprehensive evaluation of LRMs using the R-HORIZON benchmark, we find that even the most advanced LRMs suffer significant performance degradation. Our analysis reveals that LRMs exhibit limited effective reasoning length and struggle to allocate their thinking budget appropriately across multiple problems. Recognizing these limitations, we use R-HORIZON to construct long-horizon reasoning data for reinforcement learning with verified rewards (RLVR). Compared to training with single-horizon data, RLVR with R-HORIZON not only substantially improves performance on multi-horizon reasoning tasks, but also improves accuracy on standard reasoning tasks (+7.5 on AIME2024). These results position R-HORIZON as a scalable, controllable, and low-cost paradigm for enhancing and evaluating the long-horizon reasoning capabilities of LRMs.

R-HORIZON Concept Illustration

Overview

Recent advances in reasoning-focused language models (e.g., OpenAI o1, DeepSeek-R1) have demonstrated remarkable improvements through test-time scaling and long Chain-of-Thought (CoT). However, existing benchmarks primarily focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to handle complex, long-horizon scenarios.

To address these limitations, we introduce R-HORIZON, which:

  • Transforms isolated problems into complex multi-step reasoning scenarios through query composition.
  • Establishes the R-HORIZON Benchmark comprising 6 representative datasets from mathematics, code generation, and agent applications.
  • Enables reinforcement learning with verified rewards (RLVR) using long-horizon reasoning data.
The diagram below illustrates our method of transforming single problems into complex chained scenarios.

R-HORIZON Method Diagram
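
To make the composition step concrete, here is a minimal sketch of answer-dependent query composition. The data structure, the {prev_answer} placeholder, and the exact wording of the injected dependency are illustrative assumptions on our part, not the precise construction used by R-HORIZON.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    question: str  # statement; may contain the placeholder {prev_answer}
    answer: str    # verifiable ground-truth answer


def compose_queries(problems: list[Problem]) -> Problem:
    """Chain n independent problems into one multi-step query.

    Illustrative assumption: each problem after the first refers to the
    previous problem's answer, so the model must solve the chain in order
    for the later problems to be well-defined.
    """
    parts = []
    for i, p in enumerate(problems, start=1):
        question = p.question
        if i > 1:
            # Make problem i depend on problem i-1 by referencing its answer.
            question = question.replace("{prev_answer}", f"the answer to Problem {i - 1}")
        parts.append(f"Problem {i}: {question}")

    composed = "\n\n".join(parts) + "\n\nSolve the problems in order and report every answer."
    # Intermediate answers stay individually verifiable against the originals;
    # the final gold answer is that of the last problem in the chain.
    return Problem(question=composed, answer=problems[-1].answer)


if __name__ == "__main__":
    p1 = Problem("Compute 3 * 7.", "21")
    p2 = Problem("Let x be {prev_answer}. Compute x + 4.", "25")
    print(compose_queries([p1, p2]).question)
```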

Evaluation Results

We evaluate 20+ state-of-the-art Large Reasoning Models (LRMs) on the R-HORIZON Benchmark, revealing a significant performance degradation as reasoning horizons increase. Key findings from our benchmark evaluation include:

  • Universal Performance Degradation: Even the most powerful models suffer severe drops as problem count increases. For instance, DeepSeek-R1 drops from 87.3% (single problem) to 24.6% (5 problems) on AIME25.
  • Model Size Matters: Larger models exhibit more resilience to multi-horizon challenges.
  • Task-Dependent Degradation: Code generation tasks show steeper performance declines compared to mathematics. Many reasoning models lose their tool-calling abilities in web search scenarios.
R-HORIZON Benchmark Results

Training with R-HORIZON

We use R-HORIZON composed data for Reinforcement Learning with Verified Rewards (RLVR). Training with this long-horizon data yields substantial improvements on both single and multi-horizon reasoning tasks, demonstrating the effectiveness of our method for enhancing LRM capabilities.
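
As a sketch of how a verified reward could be computed for a composed query during RLVR, consider the snippet below. The \boxed{...} answer format, per-sub-problem scoring, and the optional all-or-nothing variant are illustrative assumptions, not the exact reward used in the paper.

```python
import re


def extract_boxed(response: str) -> list[str]:
    """Collect answers written as \\boxed{...} (answer format is an assumption)."""
    return re.findall(r"\\boxed\{([^{}]*)\}", response)


def composed_reward(response: str, gold_answers: list[str],
                    all_or_nothing: bool = False) -> float:
    """Verified reward for an n-problem composed query.

    The i-th extracted answer is checked against the i-th gold answer.
    Returns the fraction of correct sub-answers, or 1.0 only when every
    sub-answer is correct if all_or_nothing is set.
    """
    predicted = extract_boxed(response)
    correct = sum(
        1 for i, gold in enumerate(gold_answers)
        if i < len(predicted) and predicted[i].strip() == gold.strip()
    )
    if all_or_nothing:
        return float(correct == len(gold_answers))
    return correct / len(gold_answers)


# A 2-composed response where only the first sub-answer is correct:
print(composed_reward(r"... \boxed{21} ... \boxed{24}", ["21", "25"]))  # 0.5
```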

Training Results Highlights:

  • Dual Performance Gains: Training with 2-composed problems significantly improves both multi-horizon reasoning (+17.4 points on AIME24 n=2) and single-problem performance (+7.5 points on AIME24 original).
  • Scalable Complexity: Increasing composition complexity (n=4) enhances the model's ability to handle problems requiring more reasoning steps, achieving 50.6% on MATH500 (n=8).

Training Comparison Plot
Results with different numbers of composed queries and reward functions (Origin = original single-problem benchmark; n=k = k composed problems; the Avg. columns average the Origin and composed settings, respectively):

Model            | MATH500        | AIME24         | AIME25         | AMC23          | Avg.
                 | Origin   n=8   | Origin   n=2   | Origin   n=2   | Origin   n=2   | Origin   Multi
R1-Qwen-7B       | 93.6     11.8  | 48.3     16.4  | 33.3     3.5   | 90.2     48.8  | 66.4     20.1
Baseline (n=1)   | 95.6     8.4   | 57.9     16.7  | 47.9     5.1   | 95.9     55.0  | 74.3     21.3
R-HORIZON (n=2)  | 95.4     21.4  | 65.4     34.1  | 49.6     10.0  | 94.1     80.6  | 76.1     36.5
R-HORIZON (n=4)  | 94.6     50.6  | 62.9     34.8  | 45.4     8.1   | 91.9     79.1  | 73.7     43.2

Key Findings

1. Limited Effective Reasoning Length

As the number of interdependent problems increases, Large Reasoning Models (LRMs) struggle to maintain their performance. The gap between actual and theoretical accuracy widens significantly, indicating that models cannot sustain their original performance over longer reasoning horizons.
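
For reference, one natural notion of "theoretical" accuracy for an n-problem chain (our reading, not a definition quoted from the paper) is the accuracy expected if single-problem accuracies simply compounded independently:

```latex
A_{\text{theoretical}}(n) = \prod_{i=1}^{n} p_i,
\qquad \text{e.g. } p_i = 0.9,\; n = 5 \;\Rightarrow\; 0.9^{5} \approx 0.59
```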

  • Model errors concentrate within a certain context range. For instance, the smaller 7B model's errors mostly occur around 4-6k tokens, while the larger 32B model's error range extends to 8-10k tokens, suggesting that larger models possess a longer effective reasoning boundary.

Analysis of accuracy and error position with R1-Qwen-7B and R1-Qwen-32B

2. Localized Reflection Behavior

Models' reflection frequency increases and then converges as the number of problems grows. However, over half of the complex tasks lack any long-range reflection (reflection that spans beyond the current problem), indicating that the reflection mechanism in current LRMs is highly localized and insufficient for long-horizon scenarios.

Reflection analysis on MATH500 dataset
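
To illustrate how localized and long-range reflection might be told apart, here is a rough heuristic sketch. The cue phrases, the assumption that the trace mentions "Problem k" when it turns to sub-problem k, and the back-reference test are all ours for exposition; the paper's actual measurement may differ.

```python
import re

# Illustrative reflection cue phrases; a real analysis would use a richer list.
REFLECTION_CUES = ("wait", "let me double-check", "re-examine", "recheck")


def has_long_range_reflection(trace: str, n_problems: int) -> bool:
    """True if a reflective segment about problem k refers back to an earlier problem.

    Heuristic assumption: the trace mentions 'Problem k' when it first turns to
    sub-problem k, so first mentions act as segment boundaries.
    """
    lowered = trace.lower()
    starts = []
    for k in range(1, n_problems + 1):
        m = re.search(rf"problem {k}\b", lowered)
        if m:
            starts.append((m.start(), k))
    starts.sort()
    for idx, (start, k) in enumerate(starts):
        end = starts[idx + 1][0] if idx + 1 < len(starts) else len(lowered)
        segment = lowered[start:end]
        if any(cue in segment for cue in REFLECTION_CUES):
            # Long-range: the reflective segment names a strictly earlier problem.
            if any(re.search(rf"problem {j}\b", segment) for j in range(1, k)):
                return True
    return False
```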

3. Inefficient Thinking Budget Allocation

Current mainstream LRMs, including state-of-the-art models like DeepSeek-R1, fail to allocate their thinking budget effectively across the reasoning horizon. They tend to over-allocate tokens to the early reasoning stages, leaving too few for the later, equally critical problems.

The thinking budget allocation for different query configurations across models
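
One simple way to inspect this allocation (assuming, purely for illustration, that the response mentions "Problem k" when it moves on to sub-problem k, and approximating tokens by whitespace splitting) is to count how many tokens fall between successive sub-problem markers:

```python
import re


def tokens_per_problem(response: str, n_problems: int) -> list[int]:
    """Approximate token budget spent on each sub-problem of a composed query.

    Illustrative assumptions: the response says 'Problem k' when it moves on to
    sub-problem k, and whitespace tokens stand in for real tokenizer tokens.
    """
    counts = [0] * n_problems
    markers = [(m.start(), int(m.group(1)) - 1)
               for m in re.finditer(r"[Pp]roblem (\d+)", response)
               if 1 <= int(m.group(1)) <= n_problems]
    markers = [(0, 0)] + markers  # text before the first marker counts toward problem 1
    boundaries = markers[1:] + [(len(response), 0)]
    for (start, idx), (end, _) in zip(markers, boundaries):
        counts[idx] += len(response[start:end].split())
    return counts


# Example: a 2-composed response that spends most of its budget on the first problem.
resp = "Problem 1: " + "think " * 300 + "Problem 2: " + "think " * 40
print(tokens_per_problem(resp, 2))  # [302, 42]
```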


4. Impact of R-HORIZON Training

Training models with R-HORIZON composed data (using Reinforcement Learning with Verified Rewards) promotes significantly more efficient and robust reasoning across multiple metrics:

  • Improved Performance: Training significantly improves model performance on composed tasks and shows better generalization to longer horizons, while alleviating the "overthinking" phenomenon (generating shorter, more efficient responses).
  • Better Allocation: Models learn a more reasonable token budget allocation across multi-step problems.
  • Longer Reflection: R-HORIZON training leads models to engage in long-range reflection more frequently, directly improving performance on long-horizon reasoning tasks.

Analysis of reinforcement learning effects with single and composed datasets

BibTeX

@misc{lu2025rhorizonfarlargereasoning,
  title={R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?}, 
  author={Yi Lu and Jianing Wang and Linsen Guo and Wei He and Hongyin Tang and Tao Gui and Xuanjing Huang and Xuezhi Cao and Wei Wang and Xunliang Cai},
  year={2025},
  eprint={2510.08189},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.08189}
}