Atelier.cmd · v0.1

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning.

the path

Read. Master the vocabulary. Fire two hot-takes. Then write the pitch and draw the system. End-state: you speak this like it's native.

  1. 01 · Brief
  2. 02 · Reference
  3. 03 · Vocabulary
  4. 04 · Warm-up
  5. 05 · The drill
01

The brief

LongCoT is a 2,500-problem benchmark measuring frontier LLMs' ability to execute extended chain-of-thought reasoning over tens to hundreds of thousands of tokens across chemistry, math, CS, chess, and logic. Each problem has a short input and verifiable answer but requires navigating a graph of interdependent reasoning steps where individual steps are tractable but coordination breaks down. Current best models score under 10%, exposing a critical capability gap in long-horizon reasoning for autonomous tasks.
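The headline number has a simple arithmetic intuition: if every step must succeed and errors compound, near-perfect per-step accuracy still collapses over hundreds of steps. A minimal sketch, assuming independent step failures with illustrative numbers (not figures from the benchmark):

```python
def chain_accuracy(step_acc: float, n_steps: int) -> float:
    """Probability of completing an n-step chain when every step must
    succeed independently with probability step_acc."""
    return step_acc ** n_steps

# 98% per-step accuracy looks strong, but the chain decays geometrically:
for n in (10, 50, 200):
    print(f"{n:>3} steps -> {chain_accuracy(0.98, n):.4f}")
```

At 200 steps the chain completes under 2% of the time, which is the same order as the sub-10% scores the brief reports; real failures are not independent, but the geometric decay is the core intuition.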

trade-offs
  • 01 · `Individual step tractability masks horizon weakness`: Problems are engineered so each local step is solvable by frontier models, yet the cumulative sequencing still fails, revealing a gap between step-level capability and multi-step planning that no single-step benchmark captures.
  • 02 · `Benchmark specificity vs. generalization`: LongCoT's expert-designed problems in five domains measure precision but may not transfer to unstructured real-world reasoning, where problem graphs are messier and step dependencies less formal.
  • 03 · `Length vs. complexity`: Simply extending sequence length does not isolate reasoning degradation from token-budget effects; it is hard to tell whether failures stem from genuine long-horizon reasoning limits or from OOM/cache pressure in existing architectures.
  • 04 · `Interpretability cost`: Verifiable answers enable scoring but exclude open-ended reasoning domains (writing, design, exploration) where the 'right' path branches and long-horizon trade-offs dominate.
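The "graph of interdependent reasoning steps" from the brief can be pictured as a DAG that must be executed in dependency order. A sketch using Python's stdlib `graphlib`; the chemistry-flavoured step names are invented for illustration, not the benchmark's actual format:

```python
from graphlib import TopologicalSorter

# Hypothetical problem graph: step -> prerequisite steps whose outputs it needs.
deps = {
    "balance_equation": set(),
    "compute_moles": {"balance_equation"},
    "limiting_reagent": {"compute_moles"},
    "final_yield": {"limiting_reagent", "compute_moles"},
}

# static_order() yields steps so every prerequisite comes before its dependents;
# a model must implicitly maintain this ordering AND carry each step's output
# forward, which is where coordination breaks down at scale.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Each step here is trivially "tractable" in isolation; the benchmark's claim is that difficulty lives in the edges of this graph, not the nodes.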
how a founder would frame it

Long-horizon reasoning is the gap between local skill and orchestrated execution—like knowing how to sing and dance separately but falling apart on a 10-minute stage performance.

02

The system

03

Vocabulary gym

term 01

Chain-of-Thought (CoT)

definition

A reasoning pattern where an LLM generates intermediate steps and explanations before arriving at a final answer, making the reasoning process explicit and verifiable.

04

Hot-takes

Two hot-takes. One sentence each. No hedging, no lists — just the sharpest answer you can land. The coach replies in seconds with a score and a tighter rewrite.

Q1

If individual steps in LongCoT problems are tractable for frontier models, what specific mechanism causes failures as the reasoning chain lengthens—is it context window saturation, loss of earlier constraints, or inability to backtrack when a step fails?

Q2

Would a system that explicitly stores and validates intermediate outputs (e.g., proof-checking or symbolic execution at each step) bypass long-horizon reasoning deficits, or does it just move the problem to meta-level planning?
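One way to make the question concrete: a sketch of the store-and-validate loop it describes, where each intermediate output is verified before the chain extends and verified results are cached. The loop structure, names, and toy solver/verifier below are all hypothetical:

```python
def run_with_checkpoints(steps, solve, verify, max_retries=2):
    """Execute steps in order; re-solve a step up to max_retries times if
    its output fails verification. Returns (cache of verified outputs,
    whether the whole chain completed)."""
    cache = {}
    for step in steps:
        for _attempt in range(max_retries + 1):
            out = solve(step, cache)       # may read earlier verified outputs
            if verify(step, out, cache):   # symbolic check / proof checker
                cache[step] = out
                break
        else:
            return cache, False  # unrecoverable step: chain stops here
    return cache, True

# Toy usage: "solve" doubles the step index; the verifier accepts even results.
steps = ["s1", "s2", "s3"]
cache, ok = run_with_checkpoints(
    steps,
    solve=lambda s, c: 2 * int(s[1]),
    verify=lambda s, out, c: out % 2 == 0,
)
print(ok, cache)
```

Note what the sketch does not solve: choosing `steps` in the first place. Verification catches bad outputs, but the meta-level planning problem Q2 raises is exactly the part this loop takes as given.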

05

The drill

prompt

LongCoT reveals that frontier models achieve <10% accuracy on problems where each individual reasoning step is tractable. A founder building autonomous agents must choose: invest in architectural changes (e.g., hierarchical planning, external memory, or checkpoint-based re-planning) to extend reasoning horizon, or build systems that decompose long-horizon tasks into shorter, verifiable sub-tasks that can be solved and cached independently. Write a 400–600 word essay defending one approach. What are the failure modes of the opposite approach? How do you measure whether your chosen strategy actually improves autonomous task reliability in production, not just on LongCoT? What role do problem structure, domain specificity, and human-in-the-loop checkpoints play in your decision?
