Reasoning Is Planning: how RAP changes the way LLMs think

There is a 2023 paper that has aged remarkably well: "Reasoning with Language Model is Planning with World Model" (Hao et al., EMNLP 2023, arXiv:2305.14992).

It is one of those works whose core idea quietly became common knowledge while most people forgot the source. Today, when you hear about "tree-of-thoughts", "agents with search", "LLM-as-world-model", or MCTS applied to reasoning, much of the conceptual foundation is laid out in this paper.

Let me unpack what RAP proposes, why it works, and why it still matters for anyone building with LLMs today.

The problem: Chain-of-Thought is intuition, not deliberation

Large language models got really good at "System 1" reasoning — fast, associative thinking. Chain-of-Thought (CoT) is the canonical example: you ask the model to "think step by step" and it unrolls a token sequence that, with luck, leads to the right answer.

The problem is that CoT is fundamentally autoregressive and greedy. The model:

does not maintain an explicit internal world state;
does not anticipate the consequences of an action;
does not revisit bad choices;
does not compare alternative trajectories.

That breaks down on tasks that demand deliberate planning: Blocksworld, multi-step math, chained logical inference. Tasks where the correct answer requires looking ahead before committing to a step.

In humans, this is "System 2": the deliberate brain that mentally simulates scenarios ("if I do X, the world becomes Y") before acting. To do that we need two things: an internal model of the world and a search mechanism over possible futures.

The pivot: the LLM is already a world model

RAP's insight is elegantly simple:

The LLM implicitly already has a model of the world. You just have to prompt it into that role.

Instead of using the LLM only as the next-action generator (the agent), RAP repurposes the same LLM into two simultaneous roles:

Reasoning agent — proposes the next action given the current state.
World model — predicts the next state given a (state, action) pair.

Formally, this becomes an MDP (Markov Decision Process): states, actions, transitions, rewards. All driven by the same model with different prompts.

The authors show that "state" and "action" materialize naturally per task:

Task	State	Action
Blocksworld	block-stack configuration	"pick up the orange block"
Math (GSM8K)	sub-questions + intermediate answers	next sub-question
Logical inference (PrOntoQA)	set of derived facts	next rule to apply
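
To make the two roles concrete, here is a minimal sketch, assuming a hypothetical `llm(prompt) -> str` callable. The prompt templates and the names `propose_action` and `predict_next_state` are illustrative, not the paper's code, and a canned `toy_llm` stands in for a real model so the block runs as is.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]

# Illustrative templates (assumptions, not the paper's prompts).
AGENT_PROMPT = "Current state:\n{state}\nGoal:\n{goal}\nNext action:"
WORLD_PROMPT = "Current state:\n{state}\nAction:\n{action}\nResulting state:"


def propose_action(llm: LLM, state: str, goal: str) -> str:
    """Agent role: ask the model for the next action given the current state."""
    return llm(AGENT_PROMPT.format(state=state, goal=goal)).strip()


def predict_next_state(llm: LLM, state: str, action: str) -> str:
    """World-model role: ask the same model what the state becomes after the action."""
    return llm(WORLD_PROMPT.format(state=state, action=action)).strip()


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model: canned answers for a one-move Blocksworld example."""
    if prompt.endswith("Next action:"):
        return "pick up the orange block"
    return "orange block is in hand; blue block is on the table"


state = "orange block is on the table; blue block is on the table"
goal = "orange block is on the blue block"
action = propose_action(toy_llm, state, goal)
next_state = predict_next_state(toy_llm, state, action)
print(action)
print(next_state)
```

The only thing that changes between the two calls is the prompt; the model behind `llm` is the same.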

The engine: Monte Carlo Tree Search over reasoning space

Given a world model, RAP applies MCTS, the search algorithm at the core of AlphaGo, over the reasoning tree.

Each node is a reasoning state. Each edge is an action proposed by the LLM. MCTS runs four phases per iteration:

Selection. From the root, walk down the tree using UCT (Upper Confidence bound for Trees), balancing exploration (under-visited paths) and exploitation (high-Q paths).
Expansion. At the leaf, the LLM-as-agent samples d candidate actions. For each, the LLM-as-world-model predicts the next state.
Simulation. Roll out to a terminal (or max depth) using a lightweight policy to estimate future reward.
Backpropagation. The observed reward flows back up the tree, updating Q-values for every (state, action) on the path.
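
For reference, the selection rule used in the first phase is usually written in the standard UCT form below. Here Q(s, a) is the current value estimate for taking action a in state s, N(s) is the visit count of node s, c(s, a) is the child reached by that action, A(s) is the set of candidate actions, and w is an exploration weight whose value is a tuning choice.

```latex
a^{*} = \arg\max_{a \in A(s)} \left[ Q(s, a) + w \sqrt{\frac{\ln N(s)}{N(c(s, a))}} \right]
```

The first term rewards paths that have scored well so far; the second grows for children that have been visited rarely relative to their parent.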

After N iterations, the path with the highest accumulated reward is the final answer.
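
The loop can be sketched in a few dozen lines. This is a toy under stated assumptions, not the paper's implementation: the "task" is reaching a target number from 1 with the actions `+1` and `*2`, plain Python functions stand in for the LLM-as-agent, the LLM-as-world-model, and the reward, and Q is a simple average of backed-up rewards.

```python
import math
import random


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.visits = 0
        self.total = 0.0  # sum of rewards backed up through this node

    @property
    def q(self):
        return self.total / self.visits if self.visits else 0.0


# Toy stand-ins for the two LLM roles and the reward (assumptions for illustration).
TARGET, MAX_DEPTH = 10, 4
ACTIONS = {"+1": lambda x: x + 1, "*2": lambda x: x * 2}

def propose_actions(state):            # agent role
    return list(ACTIONS)

def predict_next_state(state, a):      # world-model role; state is (value, depth)
    value, depth = state
    return (ACTIONS[a](value), depth + 1)

def is_terminal(state):
    return state[0] >= TARGET or state[1] >= MAX_DEPTH

def reward(state):
    return 1.0 if state[0] == TARGET else 0.0


def uct(child, parent_visits, w=1.0):
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    return child.q + w * math.sqrt(math.log(parent_visits) / child.visits)


def mcts(root_state, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend with UCT until reaching a node without children.
        while node.children:
            node = max(node.children, key=lambda c: uct(c, node.visits))
        # 2. Expansion: add one child per proposed action, then pick one to simulate.
        if not is_terminal(node.state):
            for a in propose_actions(node.state):
                node.children.append(Node(predict_next_state(node.state, a), node, a))
            node = rng.choice(node.children)
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not is_terminal(state):
            state = predict_next_state(state, rng.choice(propose_actions(state)))
        r = reward(state)
        # 4. Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total += r
            node = node.parent
    # Read out a plan by following the highest-Q child at each level.
    plan, node = [], root
    while node.children:
        node = max(node.children, key=lambda c: c.q)
        plan.append(node.action)
    return plan, node.state


plan, final_state = mcts((1, 0))
print(plan, final_state)
```

Following the highest-Q child is only one way to read out the answer; picking the most-visited child or the single best-scoring rollout are common alternatives.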

The subtle ingredient: designing reward without ground truth

This is the part that makes RAP work in practice — and the part it is easy to miss on a quick read.

You do not have access to a task oracle at inference time. So reward must be built from the LLM itself. The authors combine, depending on the task:

Action likelihood. Log-prob of the LLM generating the action given the state — in-context confidence.
State confidence. The LLM can vote multiple times on the next state; the agreement fraction becomes a signal.
Self-evaluation. Direct prompt to the LLM: "is this step correct? Yes/No" — using the probability of "Yes".
Task-specific heuristics. In Blocksworld, the number of sub-goals achieved; in GSM8K, the usefulness of a sub-question.

The reward is a weighted aggregation of these signals. Not perfect — but good enough to steer MCTS in the right direction.
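
A minimal sketch of how such signals could be combined. The function names, the weights, and the choice of a weighted arithmetic mean are assumptions for illustration; the log-probs and sampled states are made-up inputs, not outputs of a real model.

```python
import math
from collections import Counter


def action_likelihood(token_logprobs):
    """Length-normalized likelihood: exp of the mean per-token log-prob, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def state_confidence(sampled_states):
    """Fraction of world-model samples that agree with the most common next state."""
    top_count = Counter(sampled_states).most_common(1)[0][1]
    return top_count / len(sampled_states)


def combined_reward(signals, weights):
    """Weighted average of the available signals (hypothetical aggregation rule)."""
    total_weight = sum(weights[name] for name in signals)
    return sum(weights[name] * value for name, value in signals.items()) / total_weight


signals = {
    "action_likelihood": action_likelihood([-0.2, -0.1, -0.3]),
    "state_confidence": state_confidence(["A", "A", "B", "A"]),
    "self_eval": 0.8,  # stand-in for the probability the model assigns to "Yes"
}
weights = {"action_likelihood": 1.0, "state_confidence": 1.0, "self_eval": 2.0}
reward = combined_reward(signals, weights)
print(round(reward, 3))
```

Keeping every signal in the 0 to 1 range before weighting makes the weights easier to reason about, since no single signal dominates purely because of its scale.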

The results: small model + search > giant model + greedy

This is the part that grabs you.

Blocksworld (planning, success rate by minimum plan length, LLaMA-33B vs GPT-4):

Method	2-step	4-step	6-step
CoT (LLaMA-33B)	0.17	0.02	0.00
CoT (GPT-4)	0.50	0.63	0.40
RAP (LLaMA-33B)	1.00	0.86	0.42

In the paper's headline plan-generation comparison, LLaMA-33B with RAP beats GPT-4 with CoT by 33% relative. On 4-step, the gap is jarring: 86% vs 63%.

GSM8K (math): RAP with LLaMA-33B reaches ~48.8%, against ~46.8% for the more expensive CoT + Self-Consistency. Consistent gains with fewer samples.

PrOntoQA (logic): RAP scores 94.2% on full accuracy (proof + answer), versus 89.5% for CoT + SC. On hard cases, the margin widens.

Why it matters in 2026

When the paper came out in May 2023, it felt like a "beautiful but expensive idea". Three years later, three things are clear:

The lineage is huge. Tree-of-Thoughts, Graph-of-Thoughts, planning-aware agents, OpenAI o1/o3, DeepSeek-R1: the entire "test-time compute" paradigm descends from the same intuition, that spending more inference compute on search can pay off better than scaling the model.
LLM-as-world-model is underused. Most agent systems still treat the LLM only as a next-action generator. RAP reminds us it can also simulate consequences — you just have to prompt it.
MCTS survives because it is robust. Newer frameworks (long chain-of-thought RL, process reward models) eventually fall into similar structures: explore, simulate, evaluate, update.

What I take into day-to-day work

Building with agents, three principles from RAP made it into my toolbox:

Split the two roles in the prompt. When your agent also needs to anticipate consequences, separate them explicitly: one prompt to propose an action, another to simulate the resulting state. In my experience, accuracy tends to go up.
Treat confidence as a search signal. Log-probs, majority voting, self-evaluation. Use those signals as heuristics — not as truth — to decide where to spend more compute.
Cheap lookahead beats expensive greedy. Simulating 3 steps ahead with a small model usually beats 1 step greedy with a top model. In code agents, this translates to "before applying the diff, mentally simulate what the test will say".
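
The third principle can be illustrated with a toy in which greedy and lookahead disagree. Everything here is hypothetical: the state names, the scores, and the `propose` / `simulate` / `score` callables, which in a real agent would be the proposal prompt, the simulation prompt, and a confidence signal.

```python
def best_action(state, propose, simulate, score, depth):
    """Pick the action whose best reachable state within `depth` steps scores highest."""
    def value(s, remaining):
        actions = propose(s)
        if remaining == 0 or not actions:
            return score(s)
        return max(value(simulate(s, a), remaining - 1) for a in actions)

    return max(propose(state), key=lambda a: value(simulate(state, a), depth - 1))


# Hypothetical code-agent world: a quick patch looks better now, a refactor pays off later.
GRAPH = {
    "start": {"quick_patch": "tests_half_pass", "refactor": "code_restructured"},
    "tests_half_pass": {"patch_again": "tests_still_half_pass"},
    "code_restructured": {"apply_fix": "tests_all_pass"},
}
SCORE = {
    "start": 0.0,
    "tests_half_pass": 0.5,
    "code_restructured": 0.1,
    "tests_still_half_pass": 0.5,
    "tests_all_pass": 1.0,
}

def propose(s):
    return list(GRAPH.get(s, {}))

def simulate(s, a):
    return GRAPH[s][a]

def score(s):
    return SCORE[s]


greedy = best_action("start", propose, simulate, score, depth=1)
lookahead = best_action("start", propose, simulate, score, depth=2)
print(greedy)     # quick_patch
print(lookahead)  # refactor
```

With `depth=1` the choice is driven by the immediate score; one extra simulated step is enough to flip it.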

Honest limitations

The paper does not hide the costs:

MCTS requires multiple LLM calls per node — expensive in latency and tokens.
World-model quality is the ceiling: if the LLM hallucinates the simulation, MCTS trusts the hallucination.
Reward design is hand-crafted per task — there is no universal reward.

For tasks where the cost of being wrong is high (planning, rigorous math, multi-step agents), the trade-off pays off. For conversational chatbots, it doesn't.
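
A back-of-the-envelope way to see the cost side of that trade-off. The counting model below is an assumption about one possible implementation (one call per sampled action, per state prediction, and per reward query, plus a two-call-per-step rollout); batching, caching, and early termination would all lower the number.

```python
def estimate_llm_calls(iterations, actions_per_node, rollout_depth,
                       state_votes=1, reward_calls_per_action=1):
    """Rough upper bound on LLM calls for one search, under the assumptions above."""
    # Expansion: each candidate action costs one proposal call,
    # `state_votes` world-model calls, and some reward queries.
    per_expansion = actions_per_node * (1 + state_votes + reward_calls_per_action)
    # Rollout: each simulated step costs one action call and one state call.
    per_rollout = rollout_depth * 2
    return iterations * (per_expansion + per_rollout)


calls = estimate_llm_calls(iterations=10, actions_per_node=4, rollout_depth=4)
print(calls)  # 200
```

Against a single call for plain CoT, a budget of this size is easier to justify when a wrong answer is costly than when a user is waiting on a chat reply.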

Closing

If you are building anything with agents in 2026, the original paper is worth re-reading with fresh eyes. Not for the specific architecture — you probably won't implement MCTS by hand — but for the mental shift:

Reasoning is not generating tokens. Reasoning is simulating futures and picking the best one.

And the LLM already knows how to do both. It just needs a prompt that asks for it.

Paper: Reasoning with Language Model is Planning with World Model — Shibo Hao, Yi Gu, Haodi Ma, Joshua J. Hong, Zhen Wang, Daisy Z. Wang, Zhiting Hu. EMNLP 2023. Code: github.com/Ber666/llm-reasoners