HiAR-ICL: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Jinyang Wu 1 * Mingkuan Feng 1 * Shuai Zhang 1 Feihu Che 2 Zengqi Wen 2 Jianhua Tao 1 2
1 Department of Automation, Tsinghua University 2 Beijing National Research Center for Information Science and Technology

Figure: Schematic comparison between HiAR-ICL and traditional zero-shot and few-shot in-context learning methods.

We introduce HiAR-ICL, a novel paradigm for enhancing the complex reasoning capabilities of large language models. Unlike traditional in-context learning, HiAR-ICL shifts the focus from example-based analogical learning to abstract thinking patterns. It employs Monte Carlo Tree Search (MCTS) to explore reasoning paths and constructs "thought cards" that guide subsequent inference. By dynamically matching test problems with appropriate thought cards through a proposed cognitive complexity framework, HiAR-ICL achieves a remarkable 79.6% accuracy with a 7B model on the challenging MATH benchmark, surpassing both GPT-4o and Claude 3.5. In summary, the primary contributions of this paper are as follows:

  1. Novel ICL Framework. We extend the traditional concept of "context" from specific examples to higher-level cognitive reasoning patterns, advancing the frontier of ICL research.
  2. Automated Reasoning Paradigm. We propose a fully automated reasoning paradigm through MCTS, which eliminates human intervention in demonstration design and aligns with LLMs' intrinsic reasoning capabilities.
  3. Human-Like Reasoning Behavior. We introduce five atomic reasoning actions that emulate human cognitive processes, enabling more effective problem-solving.
  4. Superior Performance. HiAR-ICL significantly outperforms existing methods on complex reasoning benchmarks, achieving 79.6% accuracy on MATH with Qwen2.5-7B-Instruct and surpassing GPT-4o (76.6%).

Abstract

In-context Learning (ICL) enables large language models to tackle downstream tasks through sophisticated prompting and high-quality demonstrations. However, this traditional ICL paradigm shows limitations when facing complex mathematical reasoning tasks, primarily due to its heavy dependence on example quality and the necessity for human intervention in challenging scenarios. To address these limitations, this paper presents HiAR-ICL, a High-level Automated Reasoning paradigm in ICL that shifts focus from specific examples to abstract thinking patterns, extending the conventional concept of context in ICL. HiAR-ICL introduces five atomic reasoning actions as fundamental components for constructing chain-structured patterns. Using Monte Carlo Tree Search, we explore reasoning paths and construct thought cards to guide subsequent inference. We then develop a cognitive complexity framework that dynamically matches problems with appropriate thought cards. Experimental results demonstrate HiAR-ICL's effectiveness, achieving remarkable accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%).

Overview of HiAR-ICL


The core pipeline of HiAR-ICL consists of four main components (illustrative code sketches follow this list):

  1. Define Atomic Reasoning Actions. Define five human-like fundamental reasoning actions as the building blocks of each chain-structured thought card.
  2. Construct Thought Cards via MCTS. Leverage MCTS (Selection, Expansion, Simulation, Backpropagation) to construct thought cards comprehensively.
  3. Select Reasoning Patterns. Identify the optimal three reasoning patterns based on the problem's cognitive complexity level.
  4. Solve and Verify. Perform the reasoning process under the selected patterns, and validate candidate solutions using a process reward model (PRM), an outcome reward model (ORM), or consistency-based verification.
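To make step 2 concrete, the following is a minimal, self-contained sketch of MCTS over chains of atomic actions (selection, expansion, simulation, backpropagation). It is illustrative only: the action identifiers are placeholders rather than the paper's exact naming, and reward_fn is an assumed callback that scores a candidate pattern, e.g., by executing it with an LLM on a seed problem and checking the answer.

import math
import random
from dataclasses import dataclass, field

# Placeholder identifiers for the five atomic reasoning actions
# (illustrative; not necessarily the paper's exact naming).
ACTIONS = ["system_analysis", "one_step_thought", "chain_of_thought",
           "divide_and_conquer", "self_reflection"]

@dataclass
class Node:
    state: tuple                      # chain of actions chosen so far
    parent: "Node" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def uct(node, c=1.4):
    """Upper Confidence bound for Trees, used during selection."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def build_thought_card(problem, reward_fn, max_depth=4, rollouts=200):
    """Search over action chains for one seed problem and return the
    highest-value chain, i.e., one candidate thought card."""
    root = Node(state=())
    for _ in range(rollouts):
        # Selection: descend via UCT until reaching an unexpanded node.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: add one child per atomic action (unless at max depth).
        if len(node.state) < max_depth:
            node.children = [Node(state=node.state + (a,), parent=node)
                             for a in ACTIONS]
            node = random.choice(node.children)
        # Simulation: complete the chain randomly and score it.
        chain = list(node.state) + random.choices(ACTIONS, k=max_depth - len(node.state))
        reward = reward_fn(problem, chain)
        # Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the card by following the most-visited children.
    card, node = [], root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        card.append(node.state[-1])
    return tuple(card)

Steps 3 and 4 can be sketched just as compactly. The complexity proxy and card format below are simplifications assumed for illustration (the paper's cognitive complexity framework is richer), and the verification shown is the consistency-based variant, i.e., majority voting over candidate answers.

from collections import Counter

def match_cards(cards, problem_level, k=3):
    """Step 3: pick the k cards whose complexity level is closest to the
    problem's. Each card is assumed to be an (action_chain, level) pair --
    an illustrative format, not the paper's exact data structure."""
    return sorted(cards, key=lambda card: abs(card[1] - problem_level))[:k]

def consistency_verify(candidate_answers):
    """Step 4 (consistency-based variant): return the most frequent answer
    among the candidates produced under the matched reasoning patterns."""
    counts = Counter(a for a in candidate_answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None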

Experimental Setups

  1. Models. HiAR-ICL is a general approach applicable to various LLMs. In our experiments, we evaluate its effectiveness using powerful open-source models: Llama3-8B-Instruct, Llama-3.1-8B-Instruct, Yi-1.5-6B-Chat, Qwen2-7B-Instruct, and Qwen2.5-7B/14B-Instruct. By focusing on LLMs with parameter counts generally under 10B, we aim to demonstrate the robustness and efficiency of our method. We expect that applying HiAR-ICL to small language models can achieve results comparable to or exceeding those of closed-source LLMs.
  2. Datasets. Our evaluation benchmarks encompass: (1) arithmetic reasoning: GSM8K and SVAMP; (2) complex mathematical reasoning: MATH; (3) multi-hop commonsense reasoning: StrategyQA.
  3. Baselines. We evaluate HiAR-ICL against three strong baseline categories: (1) traditional example-based ICL methods, including zero-shot CoT, few-shot CoT, and SC+CoT; (2) tree-based methods, including ToT (Yao et al., 2023), RAP (Hao et al., 2023), ReST-MCTS (Zhang et al., 2024c), LiteSearch (Wang et al., 2024a), MCTSr (Zhang et al., 2024a), BEATS (Sun et al., 2024), LLaMA-Berry (Zhang et al., 2024b), and rStar (Qi et al., 2024); and (3) powerful closed-source LLMs, including GPT-4, GPT-4o, Claude-3.5, and Gemini-1.5-pro.
  4. Metrics. We evaluate our approach using two key metrics. We report accuracy as the primary metric, where correctness is determined by comparing the model's final answer with the ground truth. To ensure consistent answer extraction, we require the LLM to explicitly state its solution in a predefined format (e.g., "The answer is"), as sketched in the snippet below. Additionally, we measure the average reasoning time to compare our method's computational cost with that of existing search-based approaches.
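A simplified version of this extraction-and-scoring step is sketched below. It assumes answers are plain strings following the "The answer is" marker; a full evaluation harness would additionally normalize numeric and LaTeX answer formats.

import re

# Match the final answer stated after the predefined marker (last occurrence wins).
ANSWER_PATTERN = re.compile(r"The answer is\s*:?\s*(.+?)\s*\.?\s*$", re.MULTILINE)

def extract_answer(response: str) -> str:
    """Return the last stated answer, or an empty string if none is found."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else ""

def accuracy(responses, ground_truths) -> float:
    """Fraction of responses whose extracted answer exactly matches the label."""
    correct = sum(extract_answer(r) == str(g).strip()
                  for r, g in zip(responses, ground_truths))
    return correct / len(ground_truths)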

Results

Main Results

As shown in Table 1, we evaluate the effectiveness of HiAR-ICL across four mainstream reasoning benchmarks and provide comprehensive comparisons with existing ICL methods. We highlight two key findings:

  1. HiAR-ICL consistently outperforms traditional ICL methods across all tasks. For example, Llama3-8B-Instruct's accuracy on the MATH benchmark improves from 17.8% (few-shot CoT) to 43.2% with HiAR-ICL, a roughly 2.4x gain.
  2. Our method yields the most substantial improvements on relatively small language models. For example, Qwen2-7B-Instruct improves from 52.9% to 63.8%, Yi-1.5-6B-Chat from 40.5% to 54.0%, and Llama3-8B-Instruct from 17.8% to 43.2%. These results underscore our approach's potential to efficiently guide smaller language models in generating and selecting optimal solutions.


BibTeX

@misc{wu2024hiaricl,
      title={Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS},
      author={Jinyang Wu and Mingkuan Feng and Shuai Zhang and Feihu Che and Zengqi Wen and Jianhua Tao},
      year={2024},
      eprint={2411.18478},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.18478},
}