Rhetorical Questions in LLM Representations: A Linear Probing Study
The path
Read. Master the vocabulary. Fire two hot-takes. Then write the pitch and draw the system. End-state: you speak this like it's native.
The brief
This study uses linear probes to investigate how LLMs internally represent rhetorical questions across different social-media datasets. Rhetorical signals emerge early in the model and are most stable in last-token representations, achieving 0.7–0.8 AUROC for binary classification. However, cross-dataset transfer reveals that rhetorical questions are encoded via multiple distinct linear directions rather than a single shared representation, with probes trained on different corpora producing conflicting rankings on the same data.
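To make the layer-wise claim concrete ("signals emerge early, strongest at a late readout"), here is a minimal synthetic sketch of per-layer probing. Everything here is fabricated for illustration: `layer_states` is a stand-in for frozen hidden states whose class signal grows with depth, not the paper's actual extraction pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d, n_layers = 400, 64, 6

# Binary "rhetorical vs. literal" labels and a fixed signal direction.
labels = rng.integers(0, 2, n)
signal = rng.normal(size=d)

def layer_states(layer):
    # Hypothetical frozen hidden states: the class signal strengthens with depth.
    strength = 0.02 + 0.10 * layer / (n_layers - 1)
    return rng.normal(size=(n, d)) + strength * np.outer(labels * 2 - 1, signal)

aurocs = []
for layer in range(n_layers):
    X = layer_states(layer)
    X_tr, X_te, y_tr, y_te = X[:300], X[300:], labels[:300], labels[300:]
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aurocs.append(roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```

On this toy setup, deeper "layers" yield higher AUROC, mirroring the shape of the paper's layer-wise analysis without reproducing its numbers.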
- 01 Transferability without shared representation: a probe generalizes across datasets but does not capture a universal encoding, suggesting modularity comes at the cost of consistency.
- 02 Discourse vs. syntax trade-off: models emphasizing rhetorical stance miss syntax-driven interrogatives, while those optimized for surface form overlook deeper argumentative intent.
- 03 Early vs. late stability: rhetorical signals emerge early but are most stably captured at the final token, meaning intermediate layers may be noisier and less actionable for downstream tasks.
- 04 Single-direction simplicity vs. multi-direction fidelity: assuming a single linear direction fails to capture the full richness of rhetorical encoding, complicating interpretability and control.
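The "multiple distinct linear directions" finding can be checked mechanically: train probes on different corpora and compare their weight vectors. Below is a synthetic sketch of that diagnostic, assuming two fabricated datasets whose rhetorical signal lives along different axes; nothing here uses the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d, n = 64, 600

def make_dataset(direction):
    # Hypothetical corpus: labels shift the embedding along `direction`.
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, d)) + 0.4 * np.outer(y * 2 - 1, direction)
    return X, y

dir_a, dir_b = rng.normal(size=d), rng.normal(size=d)  # two distinct "rhetoric" axes

def probe_direction(X, y):
    # Unit-normalized weight vector of a fitted linear probe.
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

w_a = probe_direction(*make_dataset(dir_a))
w_b = probe_direction(*make_dataset(dir_b))
w_a2 = probe_direction(*make_dataset(dir_a))  # second corpus sharing corpus A's axis

cross = abs(w_a @ w_b)    # across distinct axes: near-orthogonal
within = abs(w_a @ w_a2)  # same underlying axis: high cosine similarity
```

High within-axis and low cross-axis cosine similarity is the signature of multiple directions; a single shared representation would make both values high.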
“LLMs encode rhetoric the way humans interpret arguments in different social contexts—the same persuasive move reads differently depending on where you sit in the conversation.”
The system
Vocabulary gym
Linear Probe
A simple linear classifier trained on frozen LLM representations to test whether a semantic property is linearly separable in the embedding space.
Hot-takes
Two hot-takes. One sentence each. No hedging, no lists — just the sharpest answer you can land. The coach replies in seconds with a score and a tighter rewrite.
The paper shows that rhetorical signals stabilize at the final token representation. How would you design a multi-token aggregation strategy—mean pooling, attention-based, or contrastive—and would it improve or degrade the detectability of rhetorical phenomena?
Cross-dataset transfer reaches 0.7–0.8 AUROC but produces conflicting rankings. If you had to deploy this system in production on an unseen corpus, how would you validate that your probe had learned rhetoric rather than spurious social-media artifacts?
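For readers sketching an answer to the aggregation question above, here is a toy comparison of last-token readout vs. mean pooling. The setup is an assumption, not the paper's: per-token states are fabricated so the rhetorical cue marks only the final token, which is the regime where mean pooling dilutes the signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, T, d = 400, 12, 32

# Hypothetical per-token hidden states: the cue lives only at the last position.
y = rng.integers(0, 2, n)
sig = rng.normal(size=d)
states = rng.normal(size=(n, T, d))
states[:, -1] += 0.4 * np.outer(y * 2 - 1, sig)

def probe_auroc(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

auroc_last = probe_auroc(states[:, -1], y)        # last-token readout
auroc_mean = probe_auroc(states.mean(axis=1), y)  # mean pooling spreads the cue over T tokens
```

If the cue were instead distributed across tokens, mean pooling would average noise away and could win; the right aggregation depends on where the signal actually sits.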
The drill
Linear probes reveal that rhetorical questions in LLMs are encoded by multiple distinct linear directions, not a single shared representation. This creates a design dilemma for any system that needs to reliably detect or control rhetorical language: should you (a) train separate specialized probes for each discourse context, accepting the cost of complexity and maintenance; (b) force a single unified representation by regularizing during model training, accepting the loss of nuance and discourse sensitivity; or (c) learn a meta-probe that selects among multiple direction-sets dynamically, trading off computation and latency for adaptability? Defend your choice in light of the paper's finding that overlap between dataset-specific top instances is below 0.2. What downstream task—content moderation, debate analysis, stance detection—would most benefit from one approach, and which would suffer most from it?
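The "overlap below 0.2" statistic in the drill is a simple set comparison: score one pool of instances with two probes and intersect their top-k lists. A sketch with fabricated scores (the `scores_*` arrays stand in for probe outputs; independent scores give an expected overlap of k/n):

```python
import numpy as np

def topk_overlap(scores_a, scores_b, k):
    """Fraction of instances shared between the two probes' top-k rankings."""
    top_a = set(np.argsort(scores_a)[-k:])
    top_b = set(np.argsort(scores_b)[-k:])
    return len(top_a & top_b) / k

rng = np.random.default_rng(4)
n, k = 1000, 100
scores_a = rng.normal(size=n)                       # probe trained on corpus A
scores_b = rng.normal(size=n)                       # unrelated probe: overlap ~ k/n
scores_c = scores_a + 0.3 * rng.normal(size=n)      # probe agreeing with A up to noise

overlap_rand = topk_overlap(scores_a, scores_b, k)  # low, near 0.1 here
overlap_corr = topk_overlap(scores_a, scores_c, k)  # high: rankings mostly agree
```

An observed overlap near the k/n baseline, as the paper reports, is evidence that the probes are ranking by genuinely different criteria rather than noisy copies of one direction.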