Tree of thoughts (ToT)
Exploring multiple reasoning branches.
Definition
Tree of thoughts (ToT) extends CoT by maintaining multiple reasoning branches simultaneously. At each step, the model generates several candidate continuations; a heuristic or a separate evaluation model scores them, and a search algorithm (best-first, beam search, or BFS) decides which branches to expand further.
The key insight is that hard problems — planning, game playing, complex proofs — may require backtracking or exploring alternatives before committing. A single chain-of-thought path has no mechanism to recover from a poor intermediate step; ToT explicitly maintains a frontier of promising branches and prunes unpromising ones, much like classical tree search algorithms (MCTS, A*) applied to language generation.
Use it when a single chain-of-thought path might get stuck (e.g. game moves, multi-step planning) and you can afford multiple LLM calls. It trades compute for better search over the space of solutions. See reasoning patterns for the full set of options.
How it works
Tree expansion and pruning
Search strategies
1. Start from a root (e.g. the question or initial state).
2. Branch: at each step, generate several continuations (e.g. next reasoning steps or moves).
3. Score each branch with a heuristic or a separate model call (e.g. "how promising is this partial solution on a scale of 1–10?").
4. Expand the best node(s) and repeat; prune low-scoring branches to limit cost.

The tree is built incrementally until a solution is found or a depth/budget limit is reached. Branching factor and maximum depth are the key hyperparameters controlling the cost/quality tradeoff.
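The expand–score–prune loop can be sketched as a small best-first search. This is a minimal illustration, not a full implementation: `expand` and `score` are placeholders that a real system would back with LLM calls, and the toy demo uses digit strings so the sketch runs without an API key.

```python
import heapq


def best_first_tot(root, expand, score, beam_width=2, max_depth=3):
    """Best-first Tree-of-Thoughts search.

    expand(state) -> list of candidate next states
    score(state)  -> number; higher means more promising
    At each depth, keeps only the top `beam_width` states (pruning).
    """
    frontier = [root]
    for _ in range(max_depth):
        # Branch: expand every state on the current frontier.
        children = [child for state in frontier for child in expand(state)]
        if not children:
            break
        # Prune: keep only the most promising branches.
        frontier = heapq.nlargest(beam_width, children, key=score)
    return max(frontier, key=score)


# Toy demo: states are digit strings; "expand" appends a digit and
# "score" sums the digits (a stand-in for an LLM-based scorer).
expand = lambda s: [s + d for d in "123"]
score = lambda s: sum(int(c) for c in s)
print(best_first_tot("", expand, score, beam_width=2, max_depth=3))  # → 333
```

Swapping the toy `expand`/`score` for model calls (as in the code examples below) turns this into a working ToT loop; `beam_width` and `max_depth` directly bound the number of LLM calls.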
When to use / When NOT to use
| Scenario | Use ToT? | Why |
|---|---|---|
| Game playing or puzzle solving with many moves | Yes | Exploring branches is essential; plain CoT suffices only for single-path puzzles |
| Complex multi-step planning with backtracking | Yes | ToT can recover from dead ends; simpler tasks don't need backtracking |
| Creative generation with many valid options | Yes | Generate and score multiple drafts; skip it if a single output is acceptable |
| High-volume production inference | No | Multiple LLM calls are expensive; use CoT or direct prompting instead |
| Hard real-time constraints | No | ToT latency is too high for sub-second responses |
Comparisons
| Approach | Paths explored | Scoring | Cost | Best for |
|---|---|---|---|---|
| CoT | 1 | None | Low (1 call) | Linear multi-step tasks |
| Self-consistency | N (parallel) | Majority vote | Medium (N calls) | Tasks with verifiable answers |
| ToT | N (sequential, pruned) | Heuristic / model | High (N+ calls) | Planning, search, creative |
| MCTS (classical) | N (simulation) | Reward signal | Very high | Game AI with clear reward |
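The cost column can be made concrete with a back-of-the-envelope call counter. The accounting here is an assumption for illustration (one call per batch of candidates generated, one call per candidate scored, a beam of k nodes, branching factor b, depth d), not a benchmark:

```python
def llm_calls(approach: str, n: int = 5, b: int = 3, k: int = 2, d: int = 3) -> int:
    """Rough LLM-call count per prompting approach (illustrative)."""
    if approach == "cot":
        return 1                              # one chain, one call
    if approach == "self-consistency":
        return n                              # n samples; the majority vote is free
    if approach == "tot":
        calls = 1 + b                         # root: 1 generation call + b scoring calls
        frontier = min(k, b)                  # beam after pruning the root's children
        for _ in range(d - 1):
            calls += frontier + frontier * b  # generate from each beam node, score children
            frontier = min(k, frontier * b)
        return calls
    raise ValueError(f"unknown approach: {approach}")


print(llm_calls("cot"), llm_calls("self-consistency"), llm_calls("tot"))  # → 1 5 20
```

Even at modest settings (b=3, k=2, d=3), ToT costs roughly 20 calls where CoT costs one, which is why the table reserves it for hard search-like tasks.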
Pros and cons
| Pros | Cons |
|---|---|
| Explores and recovers from dead ends | Very high token and API cost |
| Produces higher-quality solutions on hard tasks | Requires a good scoring/evaluation function |
| Mirrors classical search — principled and adaptable | Complex to implement compared to CoT |
| Branching factor is tunable for cost/quality tradeoff | Not all tasks benefit from multi-path search |
Code examples
```python
from openai import OpenAI

client = OpenAI()


def generate_thoughts(state: str, n: int = 3) -> list[str]:
    """Generate n candidate next steps from the current state."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Current reasoning state:\n{state}\n\n"
                    f"Generate {n} distinct possible next reasoning steps. "
                    "Number each one."
                ),
            }
        ],
    )
    raw = response.choices[0].message.content
    # Simple parse: keep numbered lines ("1. ...", "2. ...", ...)
    return [
        line.strip()
        for line in raw.split("\n")
        if line.strip() and line.strip()[0].isdigit()
    ]


def score_thought(state: str, thought: str) -> float:
    """Score a thought's promise on a 0-1 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    "Rate how promising this reasoning step is for solving "
                    "the task (0 = dead end, 1 = very promising).\n\n"
                    f"State: {state}\nThought: {thought}\n\nScore (0.0–1.0):"
                ),
            }
        ],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.5  # fall back to a neutral score if the reply isn't a bare number


# Single expansion step of best-first ToT (branching factor 3); a full
# search would loop, expanding the best node(s) up to a depth limit.
task = "Plan 3 steps to build a minimal RAG chatbot."
candidates = generate_thoughts(task, n=3)
scored = [(thought, score_thought(task, thought)) for thought in candidates]
best = max(scored, key=lambda x: x[1])
print("Best next step:", best[0])
```

Practical resources
- Tree of Thoughts (Yao et al.) — Original ToT paper with game-of-24 and creative writing benchmarks
- LangChain – Agents and planning — ToT and related planning patterns
- Princeton NLP – ToT repository — Reference implementation from the paper authors
Sources
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023) — Seminal paper introducing multi-path tree search over reasoning steps with benchmarks on Game of 24 and creative writing.
- Large Language Model Guided Tree-of-Thought (Long, 2023) — Concurrent work on tree-structured reasoning with LLM-based pruning.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) — Foundation that ToT extends from single-path to multi-path reasoning.
- Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022) — Related sampling-based approach that ToT generalizes with structured search.