AI Summary Hub

Tree of thoughts (ToT)

Exploring multiple reasoning branches.

Definition

Tree of thoughts (ToT) extends CoT by maintaining multiple reasoning branches simultaneously. At each step, the model generates several candidate continuations; a heuristic or a separate evaluation model scores them, and a search algorithm (best-first, beam search, or BFS) decides which branches to expand further.
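The definition above implies a tree of partial solutions, where each node holds an accumulated reasoning state and an evaluator score. A minimal sketch of such a node (all names are illustrative, and a toy scorer stands in for the evaluation model):

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtNode:
    """One partial solution in the reasoning tree."""
    state: str                      # reasoning accumulated so far
    score: float = 0.0              # evaluator's estimate of promise
    children: list["ThoughtNode"] = field(default_factory=list)

    def expand(self, thoughts: list[str], scorer) -> list["ThoughtNode"]:
        """Attach one scored child per candidate thought."""
        for t in thoughts:
            child = ThoughtNode(state=f"{self.state}\n{t}", score=scorer(t))
            self.children.append(child)
        return self.children

# Toy usage: a hand-written scorer in place of a model call.
root = ThoughtNode(state="Task: reach 24 using 4, 9, 10, 13")
kids = root.expand(["10 + 13 = 23", "9 - 4 = 5"],
                   scorer=lambda t: 1.0 if "5" in t else 0.3)
print(max(kids, key=lambda n: n.score).state.splitlines()[-1])  # → 9 - 4 = 5
```

In a real system `expand` would call the generator model for candidate thoughts and the evaluator model for scores; the tree structure itself stays this simple.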

The key insight is that hard problems — planning, game playing, complex proofs — may require backtracking or exploring alternatives before committing. A single chain-of-thought path has no mechanism to recover from a poor intermediate step; ToT explicitly maintains a frontier of promising branches and prunes unpromising ones, much like classical tree search algorithms (MCTS, A*) applied to language generation.

Use it when a single chain-of-thought path might get stuck (e.g. game moves, multi-step planning) and you can afford multiple LLM calls. It trades compute for better search over the space of solutions. See reasoning patterns for the full set of options.

How it works

1. Start from a root (e.g. the question or initial state).
2. Branch: at each step, generate several continuations (e.g. next reasoning steps or moves).
3. Score each branch with a heuristic or a separate model call (e.g. "how promising is this partial solution on a scale of 1–10?").
4. Expand the best node(s) and repeat; prune low-scoring branches to limit cost.

The tree is built incrementally until a solution is found or a depth/budget limit is reached. Branching factor and maximum depth are the key hyperparameters controlling the cost/quality tradeoff.
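This loop can be sketched end to end. Below is a minimal best-first version in which mock `generate` and `score` functions stand in for LLM calls; the function names, the toy heuristic, and the `beam` parameter are all illustrative:

```python
import heapq

def generate(state: tuple, branching: int = 3) -> list[tuple]:
    """Mock branch generator: candidate continuations of `state`
    (a real system would call the model here)."""
    return [state + (i,) for i in range(branching)]

def score(state: tuple) -> float:
    """Mock heuristic: prefer states whose steps sum high
    (a real system would call an evaluation model here)."""
    return sum(state) / (len(state) or 1)

def best_first_search(root, branching=3, max_depth=3, beam=2):
    # Max-heap via negated scores; each entry is (-score, state).
    frontier = [(-score(root), root)]
    best = root
    while frontier:
        neg, state = heapq.heappop(frontier)
        if -neg >= score(best):
            best = state
        if len(state) >= max_depth:
            continue  # depth/budget limit reached for this branch
        children = generate(state, branching)
        # Prune: keep only the `beam` highest-scoring children.
        for child in sorted(children, key=score, reverse=True)[:beam]:
            heapq.heappush(frontier, (-score(child), child))
    return best

result = best_first_search(root=(), branching=3, max_depth=3)
print(result)  # → (2, 2, 2)
```

Swapping the mocks for model-backed generation and scoring, and a string state for the tuple, turns this skeleton into the LLM version shown in the code examples below.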

When to use / When NOT to use

| Scenario | Use ToT | Don't use ToT |
|---|---|---|
| Game playing or puzzle solving with many moves | Yes — exploring branches is essential | No — CoT sufficient for single-path puzzles |
| Complex multi-step planning with backtracking | Yes — ToT can recover from dead ends | No — simpler tasks don't need backtracking |
| Creative generation with many valid options | Yes — generate and score multiple drafts | No — single creative output doesn't need it |
| High-volume production inference | No — multiple LLM calls are expensive | Yes — use CoT or direct prompting instead |
| Hard real-time constraints | No — ToT latency is high | Yes — not suitable for sub-second responses |

Comparisons

| Approach | Paths explored | Scoring | Cost | Best for |
|---|---|---|---|---|
| CoT | 1 | None | Low (1 call) | Linear multi-step tasks |
| Self-consistency | N (parallel) | Majority vote | Medium (N calls) | Tasks with verifiable answers |
| ToT | N (sequential, pruned) | Heuristic / model | High (N+ calls) | Planning, search, creative |
| MCTS (classical) | N (simulation) | Reward signal | Very high | Game AI with clear reward |
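The cost column can be made concrete with back-of-envelope call counts. Assuming one generation call per expanded node and one scoring call per candidate, with no pruning (the worst case; pruned runs issue fewer calls):

```python
def cot_calls() -> int:
    return 1  # one chain, one call

def self_consistency_calls(n: int) -> int:
    return n  # n independent chains; the majority vote itself is free

def tot_calls(b: int, d: int) -> int:
    """Worst-case ToT calls for branching factor b and depth d:
    one generation call per expanded node plus one scoring call
    per generated candidate."""
    expanded = sum(b**k for k in range(d))           # nodes we expand
    candidates = sum(b**k for k in range(1, d + 1))  # thoughts we score
    return expanded + candidates

print(cot_calls(), self_consistency_calls(5), tot_calls(3, 2))  # → 1 5 16
```

Even a shallow tree (branching factor 3, depth 2) costs an order of magnitude more calls than CoT, which is why the table above reserves ToT for hard search problems.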

Pros and cons

| Pros | Cons |
|---|---|
| Explores and recovers from dead ends | Very high token and API cost |
| Produces higher-quality solutions on hard tasks | Requires a good scoring/evaluation function |
| Mirrors classical search — principled and adaptable | Complex to implement compared to CoT |
| Branching factor is tunable for cost/quality tradeoff | Not all tasks benefit from multi-path search |

Code examples

from openai import OpenAI

client = OpenAI()

def generate_thoughts(state: str, n: int = 3) -> list[str]:
    """Generate N candidate next steps from the current state."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Current reasoning state:\n{state}\n\n"
                    f"Generate {n} distinct possible next reasoning steps. "
                    "Number each one."
                ),
            }
        ],
    )
    raw = response.choices[0].message.content
    # Simple parse: keep numbered lines and drop their "1." / "1)" prefixes
    steps = []
    for line in raw.split("\n"):
        stripped = line.strip()
        if stripped and stripped[0].isdigit():
            steps.append(stripped.lstrip("0123456789.) ").strip())
    return steps

def score_thought(state: str, thought: str) -> float:
    """Score a thought's promise on a 0-1 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    f"Rate how promising this reasoning step is for solving the task "
                    f"(0 = dead end, 1 = very promising).\n\n"
                    f"State: {state}\nThought: {thought}\n\nScore (0.0–1.0):"
                ),
            }
        ],
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.5

# Greedy best-first ToT (depth 2, branching factor 3, beam width 1):
# at each depth, keep only the highest-scoring thought and extend the state.
task = "Plan 3 steps to build a minimal RAG chatbot."
state = task
for depth in range(2):
    candidates = generate_thoughts(state, n=3)
    if not candidates:
        break
    scored = [(thought, score_thought(state, thought)) for thought in candidates]
    best_thought, best_score = max(scored, key=lambda item: item[1])
    state = f"{state}\n{best_thought}"
    print(f"Depth {depth + 1} best step ({best_score:.2f}):", best_thought)
