Train the architect's voice.
Read a brief. Write the pitch. Draw the system from memory. Get graded on depth, vocabulary, and narrative until you speak like the engineer you already are.
- /01
Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis
LLM reasoning fails in two ways: flaws within steps (logic errors, hallucinations) and flaws across steps (overthinking, underthinking). This work shows that ground-truth labels alone don't fix these failures and instead proposes CRAFT: a framework that builds a Reasoning Knowledge Graph from consensus patterns across multiple candidate reasoning traces, then synthesizes a high-quality trace via topological generation. The method achieves accuracy gains of more than 10% on logical and mathematical reasoning benchmarks.
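The consensus-then-topological-generation idea can be sketched in a few lines. This is a toy illustration, not CRAFT's actual algorithm: the step strings, the `min_support` threshold, and the ordering rule are all invented here.

```python
from collections import Counter
from graphlib import TopologicalSorter

def synthesize_consensus_trace(traces, min_support=2):
    """Keep steps that appear in at least min_support candidate traces,
    then emit them in an order consistent with every trace.
    (Traces that disagree on ordering would raise a CycleError.)"""
    step_counts = Counter(step for trace in traces for step in set(trace))
    kept = {s for s, c in step_counts.items() if c >= min_support}

    # Predecessor edges: A -> B whenever A precedes B in some trace.
    deps = {s: set() for s in kept}
    for trace in traces:
        filtered = [s for s in trace if s in kept]
        for i, later in enumerate(filtered):
            deps[later].update(filtered[:i])

    return list(TopologicalSorter(deps).static_order())

traces = [
    ["parse problem", "set up equation", "solve", "check answer"],
    ["parse problem", "set up equation", "digress", "solve"],
    ["parse problem", "solve", "check answer"],
]
# "digress" appears in only one trace, so consensus drops it.
print(synthesize_consensus_trace(traces))
```

The low-support "digress" step is pruned, and the surviving steps come out in an order every candidate trace agrees with.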
advanced · arxiv · 10 terms · 04 questions
- /02
HiVLA: Hierarchical Vision-Language-Action for Embodied Manipulation
HiVLA decouples Vision-Language-Action models into two tiers: a VLM planner that performs semantic task decomposition and visual grounding, and a flow-matching Diffusion Transformer that executes motor control with cascaded cross-attention. The architecture preserves zero-shot reasoning while allowing independent optimization of planning and execution for long-horizon robotic manipulation in cluttered scenes.
advanced · arxiv · 09 terms · 04 questions
- /03
Formalizing Vibe-Testing: Personalized LLM Evaluation at Scale
Users evaluate LLMs informally by testing them on personally relevant tasks and judging responses against implicit subjective criteria, a process called 'vibe-testing.' This paper formalizes vibe-testing as a two-stage pipeline that personalizes both the prompts (what is tested) and the evaluation rubric (how responses are judged), and demonstrates that personalized evaluation can flip model preference rankings relative to standard benchmarks. The work bridges the gap between reproducible metrics and real-world utility by capturing and systematizing user-centric evaluation signals.
intermediate · arxiv · 07 terms · 04 questions
- /04
Rhetorical Questions in LLM Representations: A Linear Probing Study
This study uses linear probes to investigate how LLMs internally represent rhetorical questions across different social-media datasets. Rhetorical signals emerge in early layers and are most stable in last-token representations, achieving 0.7–0.8 AUROC for binary classification. However, cross-dataset transfer reveals that rhetorical questions are encoded via multiple distinct linear directions rather than a single shared representation, with probes trained on different corpora producing conflicting rankings on the same data.
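A linear probe is just a logistic regression trained on frozen hidden states. The sketch below runs on synthetic "last-token states" (the data, dimensions, and separation strength are all made up for illustration; the paper's setup differs) and scores the probe with rank-based AUROC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic last-token hidden states: positive (rhetorical) examples
# shifted along one direction, mimicking a linearly decodable signal.
d, n = 64, 400
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
states = rng.normal(size=(n, d)) + 1.5 * labels[:, None] * direction

# Linear probe = logistic regression fit by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(states @ w + b)))
    grad = p - labels
    w -= 0.1 * (states.T @ grad) / n
    b -= 0.1 * grad.mean()

# AUROC via the rank-sum (Mann-Whitney U) formulation.
scores = states @ w + b
ranks = scores.argsort().argsort() + 1
n_pos, n_neg = labels.sum(), (1 - labels).sum()
auroc = (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(round(auroc, 3))
```

With this separation strength the probe lands in roughly the AUROC range the paper reports for real models.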
advanced · arxiv · 08 terms · 04 questions
- /05
LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
LongCoT is a 2,500-problem benchmark measuring frontier LLMs' ability to execute extended chain-of-thought reasoning over tens to hundreds of thousands of tokens across chemistry, math, CS, chess, and logic. Each problem has a short input and verifiable answer but requires navigating a graph of interdependent reasoning steps where individual steps are tractable but coordination breaks down. Current best models score under 10%, exposing a critical capability gap in long-horizon reasoning for autonomous tasks.
advanced · arxiv · 08 terms · 04 questions
- /06
SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments
SpatialEvo is a self-evolving framework for 3D spatial reasoning that replaces model consensus with deterministic geometric validation computed directly from point clouds and camera poses. A shared-parameter policy co-evolves across questioner and solver roles, with the questioner generating physically valid spatial questions and the solver deriving answers against verified ground truth. A task-adaptive scheduler concentrates training on weakest categories, achieving state-of-the-art results on nine benchmarks without manual curriculum design.
advanced · arxiv · 09 terms · 04 questions
- /07
Pre-train Space Reinforcement Learning for LLM Reasoning
PreRL applies reward-driven updates directly to the marginal output distribution P(y) rather than the conditional P(y|x), bypassing the base model's inherited output bottleneck. Negative Sample Reinforcement (NSR) within pre-train space rapidly prunes incorrect reasoning paths while amplifying reflection behaviors. Dual Space RL combines NSR-PreRL initialization with standard RL fine-tuning to expand and refine the reasoning policy.
advanced · arxiv · 10 terms · 04 questions
- /08
The Paper Computer: Simulating Computation Without Electronics
A deep dive into how paper-based computational systems (like Conway's Game of Life or physical constraint propagation) can execute algorithms without digital hardware. The post explores the theoretical foundations of computation divorced from electronics, showing how information flow and state transitions work through purely mechanical means. This challenges assumptions about what infrastructure is necessary to perform computation.
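Conway's Game of Life is a good intuition pump here: each generation is determined by a local rule simple enough to apply with pencil and graph paper. A minimal step function (purely illustrative, using a sparse set-of-live-cells representation):

```python
from collections import Counter

def life_step(alive):
    """One Game of Life generation. alive is a set of (x, y) live cells.
    Each cell's fate depends only on its eight neighbours: a rule a
    patient human can execute by hand, no electronics required."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in alive
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in alive)
    }

# A "blinker" oscillates between horizontal and vertical with period 2.
blinker = {(0, 0), (1, 0), (2, 0)}
print(life_step(blinker))                         # the vertical phase
print(life_step(life_step(blinker)) == blinker)   # True: back to start
```

Since Life is Turing-complete, the same pencil-and-paper rule is, in principle, enough to run any algorithm.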
intermediate · hn · 08 terms · 04 questions
- /09
Durable Workflows for Agents (pause / resume / retry)
Agents that run for minutes or hours need the same guarantees backend jobs do: survive crashes, retry transient failures, resume from the last step, and be observable. Durable workflow engines (Temporal, Inngest, Vercel Workflow) turn brittle long-running agents into crash-safe pipelines of idempotent steps.
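The core trick is small enough to sketch. This toy runner is illustrative only, not Temporal's, Inngest's, or Vercel's API: checkpoint each step's result, skip completed steps on restart, and retry transient failures with backoff.

```python
import json
import pathlib
import time

def run_workflow(steps, state_path="workflow_state.json", retries=3):
    """Toy durable runner: each step's result is checkpointed to disk,
    so a restarted process resumes after the last completed step
    instead of starting over. Steps must be idempotent."""
    path = pathlib.Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}

    for name, fn in steps:
        if name in state:                 # already done: resume past it
            continue
        for attempt in range(retries):
            try:
                state[name] = fn(state)
                path.write_text(json.dumps(state))  # durable checkpoint
                break
            except Exception:
                if attempt == retries - 1:
                    raise                 # transient retries exhausted
                time.sleep(2 ** attempt)  # exponential backoff
    return state

result = run_workflow([
    ("fetch",     lambda s: "raw data"),
    ("transform", lambda s: s["fetch"].upper()),
])
print(result)
```

Real engines add the missing pieces: deterministic replay, distributed task queues, timers that survive restarts, and per-step observability.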
advanced · seed · 08 terms · 04 questions
- /10
RAG vs. Long-Context: When to Retrieve vs. Stuff
With 1M+ token context windows, the instinct is to dump everything in and let the model sort it out. In practice, retrieval still wins on cost, latency, freshness, and attention dilution for large corpora, while long-context wins for small, dense, high-coherence tasks.
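The tradeoff reduces to a simple planning heuristic. Every threshold below is invented for illustration and should be tuned per workload:

```python
def plan(corpus_tokens, context_window, top_k=8, chunk_tokens=500,
         fill_ratio=0.5):
    """Illustrative decision sketch: stuff the whole corpus only when it
    fits comfortably in the window; otherwise retrieve top-k chunks,
    which also caps per-query prompt cost."""
    if corpus_tokens <= fill_ratio * context_window:
        # Small, dense corpus: full coherence, no retrieval misses.
        return {"mode": "stuff", "prompt_tokens": corpus_tokens}
    # Large corpus: cost scales with k, not corpus size, and the
    # model attends to less irrelevant text.
    return {"mode": "retrieve", "prompt_tokens": top_k * chunk_tokens}

print(plan(60_000, 1_000_000))     # fits comfortably: stuff it
print(plan(5_000_000, 1_000_000))  # too big: retrieve top-k
```

Freshness pushes the same way: a retrieval index can be updated per document, while a stuffed prompt has to be rebuilt and re-billed on every change.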
intermediate · seed · 08 terms · 04 questions
- /11
Multi-Agent Supervisor Pattern (LangGraph-style)
A supervisor LLM routes tasks to specialized worker agents (researcher, coder, critic), aggregates their outputs, and decides when the task is done. Popularized by LangGraph; now a default pattern for agentic workflows where a single model would otherwise lose focus over long horizons.
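The pattern's control flow fits in a short loop. The names below are illustrative, not LangGraph's API; the stub workers and rule-based router stand in for what would be LLM calls:

```python
def supervisor_loop(task, workers, route, max_turns=6):
    """Supervisor pattern skeleton: a routing function picks the next
    worker or declares the task finished; each worker reads shared
    state and appends its output to it."""
    state = {"task": task, "notes": []}
    for _ in range(max_turns):
        choice = route(state)          # in practice: a supervisor LLM call
        if choice == "FINISH":
            break
        state["notes"].append(workers[choice](state))
    return state

# Stub specialists standing in for LLM-backed agents.
workers = {
    "researcher": lambda s: "facts about " + s["task"],
    "coder":      lambda s: "code using " + s["notes"][-1],
    "critic":     lambda s: "review of " + s["notes"][-1],
}

def route(state):
    order = ["researcher", "coder", "critic"]
    done = len(state["notes"])
    return order[done] if done < len(order) else "FINISH"

final = supervisor_loop("parse logs", workers, route)
print(final["notes"])
```

The `max_turns` cap matters in practice: it bounds cost when the supervisor keeps routing instead of finishing.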
intermediate · seed · 08 terms · 04 questions