HiAR-ICL: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Jinyang Wu 1 *    Mingkuan Feng 1 *    Shuai Zhang 1     Feihu Che 2   
Zengqi Wen 2    Chonghua Liao 3    Jianhua Tao 1 2
1 Department of Automation, Tsinghua University; 2 Beijing National Research Center for Information Science and Technology; 3 Institute for Interdisciplinary Information Sciences, Tsinghua University

Figure: Schematic comparison between HiAR-ICL and traditional zero-shot and few-shot in-context learning methods.

We introduce HiAR-ICL, a novel paradigm to enhance the complex reasoning capabilities of large language models. Unlike traditional in-context learning, HiAR-ICL shifts the focus from example-based analogical learning to abstract thought patterns. The primary contributions of HiAR-ICL are as follows:

  1. Novel ICL Insight. We transcend traditional ICL by extending "context" from specific examples to abstract (higher-level) reasoning patterns, advancing the frontier of ICL research.
  2. Automated Adaptive Reasoning Paradigm. We propose an MCTS-powered framework that automatically generates diverse reasoning patterns and adaptively applies them based on problem characteristics, enabling robust reasoning performance and cross-domain generalization.
  3. Remarkable Performance and Efficiency. HiAR-ICL achieves 80.6% accuracy on MATH and 62.5% on AMC with Qwen2.5-7B-Instruct, surpassing GPT-4o (77.2% and 57.5%), while reducing time cost by approximately 10x compared to leading test-time inference methods.

Abstract

In-context learning (ICL) enables large language models (LLMs) to perform downstream tasks through advanced prompting and high-quality demonstrations. However, traditional ICL paradigms encounter significant limitations in complex reasoning tasks, stemming primarily from their dependence on example quality and absence of explicit reasoning guidance. To address these challenges, we introduce HiAR-ICL, a High-level Automated Reasoning paradigm in ICL that shifts focus from specific examples to abstract reasoning patterns, thereby extending the conventional concept of “context” in ICL. Our approach begins by defining five atomic reasoning actions, upon which we employ Monte Carlo Tree Search to systematically construct high-level reasoning patterns. During inference, HiAR-ICL dynamically selects appropriate reasoning patterns based on problem attributes, providing explicit guidance for the model’s reasoning process. Experiments demonstrate HiAR-ICL's effectiveness and efficiency: utilizing only 200 prior samples with Qwen2.5-7B-Instruct, our method achieves 80.6% accuracy on MATH and 62.5% on AMC, exceeding GPT-4o's 77.2% and 57.5%. Our approach enhances performance across models of varying sizes while generalizing effectively across domains. Further analysis reveals that HiAR-ICL can also serve as a plug-and-play inference method compatible with post-training techniques like GRPO. Code and data are available at https://github.com/jinyangwu/HiARICL.

Overview of HiAR-ICL

Our approach begins by defining five atomic reasoning actions, upon which we employ Monte Carlo Tree Search to systematically construct high-level reasoning patterns. During inference, HiAR-ICL dynamically selects appropriate reasoning patterns based on problem attributes, providing explicit guidance for the model's reasoning process. Specifically, HiAR-ICL consists of two main components:

  1. MCTS-powered Thought Card Construction. Leverage MCTS to systematically construct high-level thought cards, which effectively guide subsequent problem-solving.
  2. Adaptive Reasoning and Verification. Dynamically select and execute optimal reasoning patterns based on the problem's cognitive complexity, followed by solution verification. (A minimal code sketch of both stages follows this list.)
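
To make this pipeline concrete, the Python sketch below illustrates both stages under stated assumptions: the five action labels, the depth limit, the rollout_reward stub, and the word-count complexity proxy are illustrative stand-ins rather than the repository's actual implementation. In a real system, rollout_reward would execute the action sequence with an LLM and score the resulting solution, and card selection would use the paper's cognitive-complexity criterion.

    # Minimal, self-contained sketch of HiAR-ICL's two stages (assumed API, not the official code).
    import math
    import random
    from dataclasses import dataclass, field

    # Assumed labels for the five atomic reasoning actions.
    ACTIONS = ["system_analysis", "one_step_thought", "chain_of_thought",
               "divide_and_conquer", "self_reflection"]
    MAX_DEPTH = 4   # maximum length of an action sequence in this sketch
    C_UCT = 1.4     # exploration constant for UCT

    @dataclass
    class Node:
        actions: tuple = ()                        # partial action sequence
        visits: int = 0
        value: float = 0.0
        children: list = field(default_factory=list)

    def uct(parent: Node, child: Node) -> float:
        """Standard UCT score balancing exploitation and exploration."""
        if child.visits == 0:
            return float("inf")
        return child.value / child.visits + C_UCT * math.sqrt(math.log(parent.visits) / child.visits)

    def rollout_reward(problem: str, actions: tuple) -> float:
        """Stand-in scorer: a real system would run the action sequence with
        an LLM and return, e.g., 1.0 if the final answer is correct."""
        random.seed(hash((problem, actions)) & 0xFFFF)  # deterministic per sequence
        return random.random()

    def build_thought_card(problem: str, n_rollouts: int = 64) -> tuple:
        """Stage 1: run MCTS over atomic actions for one seed problem and
        return its best-scoring action sequence (a "thought card")."""
        root = Node()
        for _ in range(n_rollouts):
            # Selection: descend by UCT until reaching an unexpanded node.
            path, node = [root], root
            while node.children:
                parent = node
                node = max(node.children, key=lambda ch: uct(parent, ch))
                path.append(node)
            # Expansion: add one child per atomic action, up to the depth limit.
            if len(node.actions) < MAX_DEPTH:
                node.children = [Node(actions=node.actions + (a,)) for a in ACTIONS]
                node = random.choice(node.children)
                path.append(node)
            # Simulation + backpropagation along the visited path.
            reward = rollout_reward(problem, node.actions)
            for n in path:
                n.visits += 1
                n.value += reward
        # Return the visited sequence with the highest mean reward.
        best, stack = root, [root]
        while stack:
            n = stack.pop()
            stack.extend(n.children)
            if n.visits and n.value / n.visits > best.value / max(best.visits, 1):
                best = n
        return best.actions

    def select_cards(question: str, cards_by_level: dict) -> list:
        """Stage 2 (selection only): pick cards whose bucket matches a crude
        difficulty estimate; the paper's cognitive-complexity criterion is richer."""
        level = "hard" if len(question.split()) > 40 else "easy"
        return cards_by_level.get(level, [])

    if __name__ == "__main__":
        question = "If 3x + 5 = 20, what is x?"
        cards = {"easy": [build_thought_card(question)]}
        print("selected cards:", select_cards(question, cards))

In this sketch, thought cards are simply action sequences distilled offline from a small seed set; at inference time, only the lightweight selection and guided execution remain, which is what keeps the method's per-sample cost low.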

Experimental Setups

  1. Models. HiAR-ICL is a general approach applicable to various LLMs. In our experiments, we evaluate its effectiveness on Llama3-8B-Instruct, Llama-3.1-8B-Instruct, Yi-1.5-6B-Chat, Qwen2-7B-Instruct, Qwen2.5-7B/14B-Instruct, and GPT-4o. By focusing mainly on LLMs with parameter counts generally under 10B, we aim to demonstrate the robustness and efficiency of our method; we expect that applying HiAR-ICL to small language models can yield results comparable to, or exceeding, those of closed-source LLMs.
  2. Datasets. Our evaluation benchmarks encompass: (1) arithmetic reasoning: GSM8K and SVAMP; (2) complex mathematical reasoning: MATH and AMC; (3) multi-hop commonsense reasoning: StrategyQA; and (4) PhD-level scientific reasoning: GPQADiamond.
  3. Baselines. We evaluate HiAR-ICL against three strong categories of baselines: (1) traditional example-based ICL methods, including zero-shot CoT, few-shot CoT, and SC+CoT; (2) tree-based methods, including ToT (Yao et al., 2023), RAP (Hao et al., 2023), ReST-MCTS (Zhang et al., 2024c), LLaMA-Berry (Zhang et al., 2024b), and rStar (Qi et al., 2024); and (3) powerful LLMs, including Llama3.1-405B, GPT-4o, and Claude-3.5.
  4. Metrics. We evaluate our approach using two metrics: accuracy, based on strict matching between the model's final answer and the ground truth, and average time cost per sample, which assesses computational efficiency relative to existing leading search-based approaches.
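
Concretely, the evaluation loop can be sketched as follows; the answer-extraction rule and the solve callable are assumptions for illustration rather than the exact evaluation code, but accuracy is computed by strict matching and efficiency by average wall-clock time per sample, as described above.

    # Sketch of the two metrics: strict-match accuracy and average time per sample.
    import re
    import time

    def extract_final_answer(text: str) -> str:
        """Assumed convention: the final answer follows the phrase 'The answer is'."""
        match = re.search(r"[Tt]he answer is\s*([^\n.]+)", text)
        return (match.group(1) if match else text).strip()

    def evaluate(solve, dataset):
        """solve: callable mapping a question string to a model response.
        dataset: list of (question, gold_answer) pairs."""
        correct, total_time = 0, 0.0
        for question, gold in dataset:
            start = time.perf_counter()
            prediction = solve(question)               # model inference
            total_time += time.perf_counter() - start
            # Strict matching between the extracted final answer and ground truth.
            correct += int(extract_final_answer(prediction) == str(gold).strip())
        n = len(dataset)
        return {"accuracy": correct / n, "avg_time_per_sample": total_time / n}

    if __name__ == "__main__":
        toy = [("What is 2 + 3?", "5")]
        print(evaluate(lambda q: "The answer is 5.", toy))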

Results

Main Results

As shown in Table 1, we evaluate the effectiveness of HiAR-ICL across six mainstream reasoning benchmarks, providing comprehensive comparisons between HiAR-ICL and ICL methods. We have three key findings:

  1. HiAR-ICL consistently outperforms traditional ICL methods across models. For example, Llama3-8B-Instruct's accuracy on MATH improves from 17.8% (few-shot CoT) to 46.6% (HiAR-ICL), a 2.6x gain. Similarly, on AMC, performance increases from 7.5% (few-shot CoT) to 30.0% (HiAR-ICL), demonstrating the substantial potential of our approach.
  2. The consistent gains across reasoning tasks of different domains and difficulty levels further validate the generalizability of our high-level reasoning patterns. HiAR-ICL demonstrates robust improvements in arithmetic reasoning (GSM8K, SVAMP), mathematical reasoning (MATH, AMC), commonsense reasoning (StrategyQA), and PhD-level scientific reasoning (GPQADiamond). This cross-task effectiveness suggests that high-level reasoning patterns transcend specific domain boundaries and difficulty levels.
  3. Our approach yields substantial improvements on small models, with Qwen2-7B-Instruct increasing from 52.9% to 66.8% and Yi-1.5-6B-Chat from 40.5% to 57.4% on MATH. Similar enhancements are observed on PhD-level GPQADiamond and olympiad-level AMC. These results highlight HiAR-ICL's potential for boosting the reasoning capabilities of small models.

Out-of-Distribution Generalization

We also evaluate HiAR-ICL's performance against ICL and supervised fine-tuning (SFT) under out-of-distribution (OOD) scenarios. To ensure a fair comparison, we use the same 200 seed samples for both thought card construction and SFT. As illustrated in Figure 5, while ICL and SFT suffer significant performance degradation, HiAR-ICL demonstrates remarkable resilience, preserving robust performance across multiple models and datasets. These results underscore HiAR-ICL's superior robustness and generalization, positioning it as a more reliable and adaptable solution for diverse reasoning tasks covering both in-distribution (ID) and OOD data.

Plug-and-Play Capability

Similar to ICL methods like CoT, HiAR-ICL operates as a training-free test-time inference framework compatible with post-training techniques. We demonstrate this by applying HiAR-ICL to models that have undergone GRPO training on the MATH training set. Table 5 shows that our framework consistently enhances performance when integrated with these approaches. This synergy suggests that HiAR-ICL captures reasoning patterns complementary to those acquired during post-training, affirming its plug-and-play versatility. Our comprehensive results in Tables 1, 5, and 14 further demonstrate HiAR-ICL's broad applicability across model architectures and training paradigms, including base, instruction-tuned, and reinforcement-learning-optimized models, and underscore the generalizability of high-level thought patterns in complex reasoning tasks. Future work could explore deeper integration with post-training methods to maximize these complementary benefits.
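
As a hedged illustration of this plug-and-play usage, the sketch below wraps a post-trained (e.g., GRPO-finetuned) checkpoint with thought-card-guided generation. Only the Hugging Face loading and generation calls are real APIs; the checkpoint path, the prompt format, and the example card are placeholders, not the paper's exact prompting scheme.

    # Hypothetical plug-and-play sketch: guide a post-trained model with a thought card.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def guided_generate(model, tokenizer, question, action_sequence, max_new_tokens=512):
        """Prepend the selected thought card (an action sequence) as explicit
        reasoning guidance, then decode greedily."""
        guidance = " -> ".join(action_sequence)
        prompt = f"Follow these reasoning steps: {guidance}\n\nQuestion: {question}\nAnswer:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Placeholder path for a GRPO-post-trained checkpoint.
    model = AutoModelForCausalLM.from_pretrained("path/to/grpo-post-trained-model")
    tokenizer = AutoTokenizer.from_pretrained("path/to/grpo-post-trained-model")
    card = ("system_analysis", "chain_of_thought", "self_reflection")  # assumed card
    print(guided_generate(model, tokenizer, "If 3x + 5 = 20, what is x?", card))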

More results and analysis are provided in our paper.

BibTeX

      @article{wu2024beyond,
        title={Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts},
        author={Wu, Jinyang and Feng, Mingkuan and Zhang, Shuai and Che, Feihu and Wen, Zengqi and Tao, Jianhua},
        journal={arXiv preprint arXiv:2411.18478},
        year={2024}
      }