SCALE Workshop @ ICML 2026 — 10th July 2026, Seoul, South Korea
Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.
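To make the two metrics in the BenchPreS abstract above concrete, here is a minimal sketch. The abstract does not give formal definitions, so the code assumes that MR is the share of contexts where a preference should have been suppressed but was applied, and AAR is the share of contexts where a preference should have been applied and was. The `Judgment` record and both function names are hypothetical, not part of the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    # Hypothetical gold label: is the preference appropriate in this context?
    should_apply: bool
    # Hypothetical judge label: did the model's response apply the preference?
    was_applied: bool


def misapplication_rate(judgments: List[Judgment]) -> float:
    """Assumed MR: applied preferences among contexts that call for suppression."""
    inappropriate = [j for j in judgments if not j.should_apply]
    if not inappropriate:
        return 0.0
    return sum(j.was_applied for j in inappropriate) / len(inappropriate)


def appropriate_application_rate(judgments: List[Judgment]) -> float:
    """Assumed AAR: applied preferences among contexts where applying is appropriate."""
    appropriate = [j for j in judgments if j.should_apply]
    if not appropriate:
        return 0.0
    return sum(j.was_applied for j in appropriate) / len(appropriate)
```

Under these assumed definitions, a model that applies every stored preference everywhere scores an AAR of 1.0 and an MR of 1.0, which matches the over-application pattern the abstract describes.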
Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, workflows, state dynamics, and recurring failure modes. We introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can accumulate environment-specific experience from multimodal web agent trajectories. LME-V2 contains 451 manually curated questions from customized shopping, forum, admin, and ServiceNow-style environments, with histories ranging from 25M to 115M tokens. Frontier LLMs reach at most 14.1% without trajectory evidence, confirming that LME-V2 requires learned experience beyond parametric knowledge. We evaluate memory under a context-gathering formulation and propose AgentRunbook: AgentRunbook-R is an efficient RAG pipeline over raw states, transitions, and notes, while AgentRunbook-C uses a scaffolded coding agent to gather evidence from trajectory files. AgentRunbook-C achieves the best overall accuracy, reaching 74.9% on LME-V2-Small and 70.1% on LME-V2-Medium, while improving the accuracy and latency trade-off over an off-the-shelf coding agent. We will release the benchmark and memory implementations.
We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons, not after. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback produces substantial gains, raising the normalized score by approximately 11–15 points with an incorporation rate of roughly 35–37\% for both models; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24\% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate.
Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage. The result is CUA-World, a collection of over 10K long-horizon tasks spanning domains from medical science and astronomy to engineering and enterprise systems, each configured with realistic data along with train and test splits. CUA-World also includes CUA-World-Long, a challenging long-horizon benchmark with tasks often requiring over 500 steps, far exceeding existing benchmarks. A 2B vision-language model distilled from successful trajectories in the training split outperforms models 2× its size. We also apply the same auditing principle at test time: a separate VLM reviews completed trajectories and provides feedback on what remains, improving Gemini-3-Flash on CUA-World-Long from 11.5% to 14.0%. We release all code, infrastructure, and benchmark data to facilitate future research in realistic computer-use agents.
LLM-based autonomous agents require external memory to overcome their statelessness and limited context window for long-term interaction and dynamic knowledge reasoning. However, existing memory retrieval methods often lack adaptability and sample efficiency, and struggle to retrieve the right mixture of memories from heterogeneous stores. We propose \textit{Exploratory-Assimilating Reflection (EAR)}, a framework for high initial retrieval performance and sample-efficient adaptation. EAR combines two mechanisms: Exploratory Reflection, which performs iterative search to bootstrap retrieval and collect useful experiences for each query, and Assimilating Reflection, which replays these experiences from an Experience Buffer to refine a global reranker more efficiently than methods relying only on immediate rewards. Experiments show that EAR improves retrieval by up to 17.9% over the baseline retriever on two long-term dialogue benchmarks. We also show that EAR is highly sample-efficient and robust to noisy feedback.
Multi-agent LLM systems increasingly depend on orchestrator agents to coordinate specialist agents, yet orchestration strategies remain informal, chosen by convention rather than analysis. We formalize the orchestrator as a constrained optimization problem over a typed agent registry with capability declarations, cost profiles, and trust boundaries. We define three coordination strategies, sequential delegation, parallel fan-out, and hierarchical decomposition, and derive threshold conditions under which each minimizes a combined cost-latency objective for a given task dependency structure. We implement OrchestRAte, a strategy-switching orchestrator that selects coordination mode dynamically via dependency graph analysis, and evaluate it on three enterprise workflow benchmarks: document processing, compliance review, and incident triage. On these benchmarks, OrchestRAte reduces end-to-end cost by 28–42% relative to static sequential baselines while trading at most 0.7 percentage points of task-completion quality. The cost gains come with a caveat: on highly parallelizable tasks, a static parallel baseline still achieves the lowest raw latency because it avoids the overhead of strategy selection entirely. We identify an empirical orchestrator overhead threshold around 15% of total token budget, beyond which the coordination cost of hierarchical strategies erodes their planning advantage.
Foundation robotics models, most popularly Vision-Language-Action (VLA) models, struggle to perform well on long-horizon tasks due to their reliance on immediate sensory input. This induces compounding errors over inference timesteps, further exacerbated by non-robust backbones. To address this, we introduce \textit{Training-Free Memory Conditioned Action Generation}, a non-parametric retrieval-augmented framework that conditions a frozen VLA on historical expert trajectories. Our approach constructs a memory of expert demonstrations and uses a state-centric retrieval mechanism to guide action generation without any fine-tuning. Through extensive evaluation on 5 datasets over SOTA models, we show relative gains of up to 27% in task completion success. As an additional contribution, we extend the popular CALVIN benchmark to a task horizon of 6 and beyond, showing relative gains of up to 30% while also demonstrating robustness to corrupted observations. Real-world experiments on complex tasks further demonstrate performance gains of up to 2×.
Developmental biology offers two kinds of inspiration for LLM agent design: structural principles (cell differentiation, morphogen gradients, gene regulatory networks) and dynamic processes (evolution, natural selection, population dynamics). We build a morphogenetic harness—a framework that applies both to LLM agents—and test which kind of inspiration actually helps. In Experiment 1 (static harness), we route queries to task-specific "cognitive skills" inspired by cell differentiation, testing on GSM8K, HumanEval, and MMLU across three LLMs. Skill routing significantly improves accuracy over unprompted baselines (Wilcoxon signed-rank $p<0.05$ for Haiku and Grok), with gains concentrated on math and logic. The gain over a generic chain-of-thought prompt is not significant in aggregate; a cross-routing control (applying the wrong skill) instead pinpoints correct task–skill matching as the operative mechanism (40pp drop on math under mis-routing). Ablation reveals that neither structure nor domain knowledge alone suffices—their synergy is key, mirroring the dual role of morphogens in biological patterning. In Experiment 2 (dynamic harness), we run evolutionary LLM colonies with bio-inspired reproduction, lateral inhibition, and fitness tracking. We find that evolution provides no benefit beyond the pass@k effect of running multiple agents. Cross-routing experiments confirm that the value of biological metaphor lies in the blueprint, not the process: static differentiation helps, but runtime evolution does not. We formalize this distinction through Ashby's Law of Requisite Variety and discuss implications for the emerging agent harness paradigm.
Persistent memory makes multimodal agents more capable, but it also creates a new attack surface: once unsupported content is written into memory, later retrieval and consolidation can reuse it as if it were reliable state. We study write-time defense for multimodal agent memory. Our system, SAGE-Mem, separates transient evidence from durable belief: observations may be stored as evidence, but they are promoted to belief only when they are sufficiently supported, independent, and non-conflicting. This targets a gap left by retrieval-time defenses, which act only after poisoned content has already entered memory. We evaluate on LoCoMo-Adv, an adversarial multimodal extension of LoCoMo-10, and on MM-BrowseComp-Adv, a multimodal browsing benchmark covering answer-overwrite, OCR, vision-caption, and visual-prompt attacks. On LoCoMo-Adv, at a conservative operating point, SAGE-Mem eliminates observed write admission and retrieval contamination relative to a retrieval-time baseline, but reduces benign completion under attack (0.460 vs. 0.642). On the canonical browsing overwrite setting, BrowseGuard, a browsing-specific write policy built on the same principle, blocks all 388 direct and paraphrased overwrite attempts while keeping attacked utility near its clean level (0.155 vs. 0.160). On the broader five-attack browsing suite, extending the same guarded write policy across browser, OCR, and caption channels reduces Write ASR from 0.2552 to 0.0369 and Retrieval ASR from 0.5636 to 0.3694. Overall, the results suggest that for memory-bearing agents, robustness should be evaluated not only at retrieval, but also at the point where observations become persistent state.
Quantization is a powerful strategy to build capable and resource-efficient large language models (LLMs) by reducing the bitwidth of the parameters. While quantized LLMs achieve state-of-the-art performance on unperturbed inputs using standard predictive metrics, their performance on perturbed inputs, measured using reliability metrics, remains underexplored, despite its importance for reliable deployment. To address this gap, we conduct a comprehensive reliability evaluation of quantized LLMs consisting of three key components: (1) Uncertainty: We assess the trustworthiness of LLMs quantized to 2, 3, 4, and 8 bits using six different quantization methods, employing established uncertainty metrics. (2) Robustness: We design character-level and word-level input perturbations to evaluate the reliability of quantized models under semantically-preserving variations in the inputs that arise in real-world applications. (3) Reliability scaling trends: We investigate how the reliability scales with the number of model bits. Our study reveals that while the performance scales monotonically with the total number of bits, the reliability scalings are nonlinear. A reliability peak occurs for 4-bit quantized models, indicating that quantizing moderately sized models offers the best reliability-efficiency trade-off. Additionally, our empirical findings reveal that quantization enhances the robustness of LLMs to natural input perturbations.
Agentic visual reasoning has advanced multimodal understanding through multi-step tool use, yet current methods restrict these tools to image manipulation and never reach for external knowledge, leaving open-source models brittle on tasks that require both. We study this setting through geolocalization, a task that inherently couples fine-grained visual inspection with open-world knowledge retrieval. We introduce \textbf{GeoVista}, a 7B agentic model that interleaves image zoom-in and web-search tool calls within a reasoning loop, and, to move beyond the low resolution of prior geolocalization benchmarks, curate \textbf{GeoBench}, a benchmark of high-resolution photos, panoramas, and satellite imagery whose geographic cues are only resolvable through fine-grained visual search. We train GeoVista in two stages: a cold-start SFT on synthesized agentic trajectories to instill the tool-use format, followed by RL with \textbf{CasPO} (Cascaded Policy Optimization). Instead of treating each prediction as a pass-or-fail event, CasPO cascades credit along the country, province, and city hierarchy, yielding a dense, multi-scale learning signal. Beyond accuracy, CasPO reshapes reasoning behavior: fewer tool-call errors, more targeted searches, and emergent self-reflection. GeoVista achieves state-of-the-art among open-source models and matches proprietary systems such as Gemini-2.5-flash and GPT-5 across most metrics, despite being an order of magnitude smaller.
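The idea of cascading credit along the country, province, and city hierarchy in the GeoVista abstract above can be sketched as a simple reward function. This is only an illustration under stated assumptions: the abstract does not specify how CasPO computes credit, so the per-level weights, the stop-at-first-mismatch rule, and the function name `cascaded_reward` are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Coarse-to-fine geographic levels named in the abstract.
LEVELS: Tuple[str, ...] = ("country", "province", "city")


def cascaded_reward(
    pred: Dict[str, Optional[str]],
    gold: Dict[str, str],
    weights: Tuple[float, ...] = (0.2, 0.3, 0.5),  # hypothetical weights
) -> float:
    """Accumulate credit level by level, stopping at the first mismatch.

    Assumption: a finer level only earns credit if every coarser level
    was already correct, so a right city in the wrong country scores 0.
    """
    reward = 0.0
    for level, weight in zip(LEVELS, weights):
        guess = pred.get(level)
        if guess is None or guess != gold.get(level):
            break
        reward += weight
    return reward
```

Compared with a pass-or-fail signal, a prediction that gets the country and province right but misses the city still receives partial credit here, which is one way to obtain the dense, multi-scale signal the abstract mentions.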
Frontier models are transitioning from _multimodal large language models_ (MLLMs) that merely ingest visual information to _unified multimodal models_ (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human _mental imagery_. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, _visual thoughts do not yet benefit model reasoning_. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.
Extracting structured information from visually rich documents at enterprise scale demands both the reasoning capability of large language models and the efficiency of deterministic execution. Current approaches either deploy LLMs as instance-level analysts, incurring per-document inference costs that are prohibitive for repetitive templates, or rely on manually authored vendor-specific prompts that do not scale beyond a handful of high-volume vendors. We present CACHE-ED2, a framework that repositions the LLM as a system-level developer: given a novel document format, a multimodal ReAct-based agent synthesizes a reusable Document Extraction DSL (DocExDSL) program encoding the complete extraction logic (strategies, disambiguation, transforms, and automated reasoning validation), then caches it for deterministic, LLM-free execution on all subsequent documents of the same format. A contrastive Document Format Encoder routes incoming documents to cached programs with perfect hit rates, and a HITL Feedback Analysis Agent incorporates analyst corrections to refine programs over time. On public benchmarks (CORD v2, FATURA) and 5,000 real-world invoices across 5 vendors, CACHE-ED2 achieves near-perfect extraction accuracy (0.99 average) with 1.0 extraction rate and perfect format detection across 50+ templates, matching or exceeding manually authored vendor-specific prompts, while reducing cumulative token consumption by ~2.6x over 1,000 documents.
With the rapid progress of large language models (LLMs), financial information retrieval has become a critical industrial application. Extracting task-relevant information from lengthy financial filings is essential for both operational and analytical decision-making. We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination for document and chunk ranking tasks. Our primary contribution is a systematic empirical study of when each component provides value: prompt engineering delivers consistent performance with minimal overhead, ICL enhances reasoning for complex queries when applied selectively, and multi-agent systems show potential primarily with larger models and careful architectural design. Extensive ablation studies across FinAgentBench, FiQA-2018, and FinanceBench reveal that simpler configurations often outperform complex multi-agent pipelines, providing practical guidance for practitioners. Our best configuration achieves an NDCG@5 of 0.71818 on FinAgentBench, ranking third while being the only training-free approach in the top three. We provide comprehensive feasibility analyses covering latency, token usage, and cost trade-offs to support deployment decisions. The source code will be released upon acceptance.
When videos extend from hours to days, directly processing them end-to-end becomes impractical for current Multi-modal Large Language Models (MLLMs). This ultra-long setting necessitates a two-stage paradigm: query-agnostic memory construction followed by retrieval-based inference. Prior work invests in complex memory construction to pre-model high-level relations in videos, despite not knowing the downstream query at build time. We instead prioritize high-recall retrievability during memory building, and defer query-specific, high-level relation composition to inference time. To this end, we propose MERIT (Multi-key Episodic Retrieval with Inference-time Temporal expansion), a simple yet effective agentic framework for ultra-long video understanding. First, we formulate an episodic multi-key representation that enables precise retrieval of fine-grained memories through a simple key-matching mechanism. Second, we introduce a neighbor filtering mechanism to capture broader semantic context without the massive computational overhead of global memory construction. This is achieved by expanding the temporal scope exclusively around the retrieved segments at inference time. By leveraging simple key-matching with this on-demand temporal expansion, MERIT achieves state-of-the-art performance across three long-video benchmarks: EgoLifeQA, LVBench, and Video-MME (Long).
Multimodal retrieval-augmented generation lacks two fundamental primitives: an evidence-grounding metric that evaluates generated claims against evidence in text, imagery, and structured data, and a convergence criterion for deciding when an iterative system should stop. We formally establish both. First, we propose CMEGS (Cross-Modal Evidence Grounding Score), a claim-level metric with a specifically designed inter-modal agreement component that satisfies boundedness, monotonicity, and graceful degradation under missing modalities. Second, we compute a marginal improvement ratio (MIR) from consecutive CMEGS values and prove that the resulting stopping rule terminates under a finite budget while satisfying scale-relative sensitivity. We embed both primitives into an inference-only architecture with five agents and modality adaptation. On PathVQA, our architecture achieves 90.0% faithfulness with 1.1 mean iterations, reducing hallucination by 80% relative to conventional RAG. On ScienceQA, it performs on par with the best iterative system in grounding but with 2.3× fewer iterations. The metric remains stable under nine different weight settings (variance below 1.3%) and exhibits strong correlation with grounding judgment scores (r = 0.79).
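A stopping rule driven by a marginal improvement ratio, as in the CMEGS abstract above, might look like the following sketch. The abstract does not state the formula, so this assumes MIR is the relative change between consecutive scores and that iteration stops once MIR drops below a threshold or the iteration budget runs out. The names `marginal_improvement_ratio`, `iterate_until_converged`, `tau`, and `max_iters` are hypothetical, and `score_fn` stands in for one retrieve-and-score round.

```python
from typing import Callable, List


def marginal_improvement_ratio(prev: float, curr: float, eps: float = 1e-8) -> float:
    """Assumed MIR: improvement relative to the previous score (scale-relative)."""
    return (curr - prev) / max(abs(prev), eps)


def iterate_until_converged(
    score_fn: Callable[[int], float],  # hypothetical: grounding score after round t
    max_iters: int = 5,                # finite budget, so the loop always terminates
    tau: float = 0.05,                 # hypothetical MIR threshold
) -> List[float]:
    """Run rounds until the relative gain falls below tau or the budget is spent."""
    scores = [score_fn(0)]
    for t in range(1, max_iters):
        scores.append(score_fn(t))
        if marginal_improvement_ratio(scores[-2], scores[-1]) < tau:
            break
    return scores
```

Because the loop is bounded by `max_iters`, termination holds regardless of how the scores behave; the threshold only decides whether the system stops earlier than the budget allows.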
Large Language Models (LLMs) have accelerated the adoption of software development agents, now widely available as Integrated Development Environment (IDE) extensions and standalone applications. While these agents are typically general-purpose, it remains unclear whether specialist agents justify their additional development effort. We investigate this question in the context of business process automation, focusing on the transformation of Business Process Model and Notation (BPMN) diagrams into executable agentic workflows. Since BPMN specifies explicit control-flow semantics, we focus on deterministic workflows in which a fixed process model and inputs uniquely determine the executed path. We introduce a specialist workflow for this task and compare it against generalist agents such as Roo and Cline. Our results show that the specialist solution produces agents that outperform generalist baselines by approximately 9--20 percentage points in tool-use exactness, 2--4$\times$ in penalty-adjusted latency, and 3$\times$ fewer tool-call errors, while reducing generation token cost by over 95\% and eliminating repair iterations. We also find that generalist agents generate code inconsistently in both functionality and quality, limiting their suitability for industrial settings where reliability and maintainability are essential.
Training large transformer models incurs substantial computational costs, with a significant portion arising from hyperparameter search. This research introduces Green Gradients, a dual-phase, calculus-based meta-optimizer designed to reduce the carbon footprint of transformer fine-tuning. Phase 1 replaces traditional expensive Bayesian learning rate search with a single exponential scan of the loss landscape, where $\partial\mathcal{L}/\partial\lambda$ is estimated to evaluate candidates and identify an optimal initial learning rate $\lambda^{*}$ based on the decrease in loss. Phase 2 adapts $\lambda$ throughout training via hypergradient descent by computing $\partial\mathcal{L}/\partial\lambda$ through the chain rule. Evaluated on DistilBERT applied to the CLIMATE-FEVER dataset across 25 paired trials, Green Gradients achieves a 48.87% reduction in both energy consumption and carbon emissions compared to an Optuna + ReduceLROnPlateau baseline ($p = 1.51\times10^{-11}$) with no statistically significant change in test accuracy ($p = 0.106$), demonstrating the viability of Green Gradients in reducing the carbon footprint of transformer training while retaining accuracy.
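The Phase 2 idea in the Green Gradients abstract above, adapting $\lambda$ from $\partial\mathcal{L}/\partial\lambda$ obtained via the chain rule, can be illustrated on a toy problem. The sketch below is not the authors' implementation: it uses the standard hypergradient rule for plain SGD, where $\theta_t = \theta_{t-1} - \lambda \nabla\mathcal{L}(\theta_{t-1})$ gives $\partial\mathcal{L}(\theta_t)/\partial\lambda = -\nabla\mathcal{L}(\theta_t)\cdot\nabla\mathcal{L}(\theta_{t-1})$. The quadratic loss, the `hyper_lr` step size, and all function names are assumptions chosen for illustration.

```python
from typing import List, Tuple


def loss(theta: List[float], curvature: List[float]) -> float:
    """Toy quadratic loss: 0.5 * sum(a_i * theta_i^2)."""
    return 0.5 * sum(a * x * x for a, x in zip(curvature, theta))


def grad(theta: List[float], curvature: List[float]) -> List[float]:
    """Gradient of the toy quadratic loss."""
    return [a * x for a, x in zip(curvature, theta)]


def hypergradient_sgd(
    theta: List[float],
    curvature: List[float],
    lr: float = 0.01,         # initial learning rate (lambda)
    hyper_lr: float = 0.001,  # hypothetical step size for updating lambda
    steps: int = 50,
) -> Tuple[List[float], float]:
    """SGD whose learning rate is itself updated by gradient descent."""
    prev_grad = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta, curvature)
        # Chain rule: dL(theta_t)/d lr = g_t . d(theta_t)/d lr = -g_t . g_{t-1}
        hypergrad = -sum(a * b for a, b in zip(g, prev_grad))
        lr -= hyper_lr * hypergrad
        theta = [x - lr * gi for x, gi in zip(theta, g)]
        prev_grad = g
    return theta, lr
```

When consecutive gradients point in similar directions the dot product is positive, so the learning rate grows; when they oppose each other it shrinks, which is the adaptive behavior that removes the need for a separate scheduler in this toy setting.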
Agentic AI systems increasingly treat retrieval as external memory, making storage footprint, index-refresh cost, and query latency first-class scaling constraints. However, standard dense retrieval pipelines decompress corpora into plain text before chunking, embedding, and indexing, inflating storage and build cost whenever memory must be refreshed under tight budgets. In this paper, we present ***compressed-domain retrieval***, in which the storage, indexing, and retrieval units coincide at the compressed chunk. The design pairs **ZPKG**, a novel Zstandard-based random-access packaging format, with **CEM**, a byte-level Compressed Embedding Model trained under a joint objective to map compressed chunks into the semantic retrieval space while keeping queries in plain text. Decompression is deferred until after candidate selection, preserving compatibility with standard RAG pipelines. On BEIR, the method reduces corpus storage to 30-33\%, cuts indexed vectors by 7.5-10.4$\times$, and accelerates HNSW index construction by 15.1-66.9$\times$. On an MTEB-based suite, CEM remains competitive with strong text encoders and surpasses all compared baselines on WinoGrande and TempReasonL1, positioning compressed-domain retrieval as a practical external-memory substrate for scaling agentic systems under memory, refresh, and latency budgets.
Chart question answering remains challenging because small errors in evidence extraction and visual grounding can easily propagate into incorrect answers. Although recent methods improve chart understanding, most still rely on one-pass answer generation without explicitly verifying whether the extracted evidence is accurate, and verifier feedback rarely persists beyond the current example. We propose $\textbf{BeaconChart}$, a verifier-guided multi-agent framework that improves chart understanding by accumulating verifier feedback as corrective instruction. Rather than directly trusting an initial prediction, BeaconChart first plans the required evidence, performs chart-aware extraction, and verifies whether the predicted answer is correctly grounded in the chart. When verification fails, it converts the feedback into reusable guidance that improves future inference on similar chart failures. Experiments on ChartQA and ChartX show that BeaconChart consistently outperforms both general-purpose multimodal large language models and recent chart-specialized baselines.
Long-term robotic operation has long been plagued by temporal failures during execution, as static task instructions (i.e., language conditions) fail to provide dynamic guidance for complex stages. This results in stage confusion and goal deviation in multi-stage scenarios, despite the strong short-term reaction capabilities of the underlying actuators. In this paper, we demonstrate that achieving robust long-term autonomy does not require retraining the actuators or complex hierarchical architectures, but rather temporal reparameterization of the task-instruction interface. We propose a training-free dual-system reasoning framework, LOGIV (LOgic-Graph Inference with VAL-verification), which decomposes global instructions into dynamic, multi-stage natural language priors. To ensure logical consistency, we introduce a graph-based self-correction mechanism that utilizes formal verification and standardized repair operators to autonomously correct plans generated by large language models. Experimental results on datasets such as DROID, AgiBot, EgoDex, and RoboTwin 2.0 show that our framework significantly improves task success rates without modifying the underlying model weights, with more pronounced effects on more complex tasks. Due to its architecture-agnostic and lightweight nature, LOGIV offers an efficient and plug-and-play solution for bridging the long-term temporal gap in existing world action models.
Modern Vision Language Models (VLMs) have demonstrated remarkable versatility in a wide range of applications, including multimodal reasoning, captioning, and analysis. However, VLMs still struggle with the fundamental tasks of object detection and localization, especially when compared to traditional Convolutional Neural Network (CNN) models. Contemporary VLMs tend to allocate very scarce attention to non-textual inputs such as image tokens and lack the inductive biases present in CNNs, making it difficult for them to identify challenging objects in complex scenes. To address these limitations, we propose a detection-guided dynamic attention steering system that leverages the locality insight from CNNs to efficiently steer a VLM's attention toward more relevant sections of an image. The steering intensity during each inference is dynamically scaled according to the confidence of CNN-generated bounding boxes. Extensive evaluations across multiple state-of-the-art VLMs show substantial improvements on object localization-oriented benchmarks, achieving up to a 4% accuracy gain. Results demonstrate the effectiveness of combining different model architectures to harness their respective strengths for advancing VLM capabilities.
Multi-agent debate (MAD) is a powerful paradigm for combining multiple large language models (LLMs) to achieve robust reasoning, but prior work has largely focused on language-only settings, leaving its multimodal potential underexplored. We present Weighted Iterative Society-of-Experts (WISE), a generalized MAD framework that systematically integrates heterogeneous multimodal LLMs to address challenging vision-and-language tasks in a zero-shot setting. Our key idea is to factor agents into three roles based on their multimodal capabilities: Solvers, which process multimodal inputs and generate candidate solutions; Reflectors, which may or may not access multimodal inputs but evaluate solutions, provide feedback, and assign weights; and an Orchestrator, which operates unimodally to reason over solutions and feedback and produce directives that guide subsequent reasoning. To account for varying agent reliability, we introduce an unsupervised probabilistic aggregation method, termed WISE–Dawid–Skene, which leverages the weighting scheme in WISE-MAD to adaptively combine agent outputs. We evaluate WISE on several challenging mathematical reasoning datasets and show that it consistently outperforms state-of-the-art methods across diverse LLM configurations, demonstrating its effectiveness as a general and scalable multimodal reasoning framework.
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. Yet existing memory benchmarks evaluate updates only for independent entities, without testing whether a system can propagate the consequences of a changed fact through dependent entities or recognize when a previously valid answer has become uncertain. We introduce MEME, a benchmark that organizes memory evaluation along two orthogonal dimensions: entity scope and temporal dynamics. It defines six tasks per quadrant, including two novel ones, Cascade and Absence, that no prior benchmark covers. Evaluating six systems spanning three architectural paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 2.9\%, Absence: 1.4\% average) despite adequate static retrieval performance. Prompt optimization, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 partially closes the gap, but at $\sim$70$\times$ the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code is available at https://anonymous.4open.science/r/MEME-0612 and the dataset at https://huggingface.co/datasets/meme-benchmark/MEME
We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model’s inductive biases across all workers, DEI treats each LLM’s distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round’s population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves +124% higher merged-archive QD-Score (45.90 vs. 20.46) and +28% higher coverage (80.6% vs. 63.0% of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.
Enterprise environments differ fundamentally from the clean settings assumed in LLM research: knowledge is distributed across heterogeneous sources, often incomplete or inconsistent, and key procedural logic is implicitly encoded in artifacts rather than explicitly documented. In such settings, retrieval-based approaches are insufficient, as no single source contains the full workflow. We propose a replication-driven knowledge distillation framework for scalable learning in multimodal agents. The agent learns by reverse-engineering validated artifacts (e.g., Excel workbooks), reconstructing the underlying data pipeline, and distilling the inferred logic into structured knowledge (claims, procedures, and domain patterns). This enables synthesis and validation across noisy sources and supports reuse in future tasks. We evaluate on 120 simulated enterprise environments with multimodal inputs (SQL, spreadsheets, documentation, messaging app, emails, images, PDFs, CSV) and controlled noise. Our method consistently outperforms retrieval-based baselines on both task execution and conceptual understanding, and remains robust under environmental drift.
We present LS-111B, a 111B-parameter hybrid reasoning model for Korean-English enterprise agents under practical memory and serving constraints. The model trains from a fully post-trained enterprise language model rather than a new pretraining run, and uses preamble conditioning to switch between concise non-reasoning behavior and longer tool-oriented reasoning. We study four choices for scaling tool-using agents efficiently: multilingual supervised fine-tuning, reinforcement learning with verifiable rewards for multi-step tool-use tasks, language-consistency rewards for Korean user-facing responses, and 4-bit quantization for single-GPU serving. The adapted model improves mathematical reasoning, function calling, and agentic natural-language-to-SQL (NL2SQL) performance while preserving general Korean and English instruction-following quality. These results provide a practical recipe and failure-mode analysis for adapting post-trained multilingual models to verifiable agentic workflows under memory-constrained deployment.
Vision-language-action models (VLAs) have shown strong promise as general-purpose robot policies. However, adapting them to new tasks typically requires costly task-specific teleoperation data. In contrast, human demonstration videos are easier to collect, making them an attractive source of supervision for task adaptation. In this work, we study one-shot demo-conditioned VLAs, where a robot is conditioned on a single demonstration video of the target task. We find that existing end-to-end approaches often treat the demonstration primarily as a task identifier, rather than as an executable visual plan for spatially precise action generation. To address this limitation, we propose SeeTraceAct, a demo-conditioned VLA framework that learns a visual latent plan from demonstrations via auxiliary future visual trace supervision. In our experiments on RoboCasa benchmarks, SeeTraceAct significantly outperforms strong baselines, with the largest gains on tasks that demand precise spatial interaction, showing that our approach is effective for grounding VLAs on one-shot demonstration videos.
Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-based LLMs often suffers from training instability. A core issue is off-policy optimization, where current training policy updates rely on trajectories generated by stale behavior policies. In MoE models, \textit{router drift} acts as the routing form of this gap because expert activations can change across model updates, causing unstable importance sampling weights in PPO-style algorithms. Routing Replay (R2) mitigates this issue by freezing the replay route within each reasoning trajectory, but it ignores how the router evolves under off-policy updates and thus causes \textit{router staleness}. To address this limitation, we propose \textbf{Predictive Routing Replay (PR2)}, which augments each router with a lightweight evolution predictor that learns to anticipate short-horizon router evolution. During the rollout phase, we use the predictive routing distribution to apply top-$k$ routing, enabling gradients to reach experts that are likely to become active after updates. During the training phase, we replay the resulting predicted route to retain consistency for stable importance estimation. Theoretical analysis and experiments show that PR2 reduces routing mismatch, improves RL stability, and improves benchmark accuracy.
Multimodal agents offer a compelling path to automating complex document-intensive workflows, yet a critical question remains: do these architectures demonstrate genuine strategic reasoning, or simply conduct stochastic trial-and-error search? To address this, we introduce Agentic Document VQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power and reliably differentiate between varying levels of agent capability. To rigorously assess agentic behavior, we introduce a novel evaluation protocol for measuring the accuracy-effort trade-off. Using this framework, we find that humans show strong metacognitive calibration, adapting or abandoning failed strategies, whereas frontier agents often persist in unproductive loops with diminishing returns. We release the dataset, evaluation harness, and leaderboard to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
Mixture-of-experts (MoE)-based large language models (LLMs) scale model capacity without proportional growth in compute, yet adapting them to new domains after pretraining remains challenging. Retraining from scratch is prohibitively expensive, while naive continual learning can cause catastrophic forgetting; i.e., new-domain gradients overwrite previously trained router models in the MoE. To this end, we propose LEGO, a novel expert expansion strategy that adds new experts with target-side routing pressure, steering new-domain tokens toward the expanded capacity, and anchor-side stabilization to prevent weight drift in existing experts relative to the previous checkpoint. The key insight of LEGO is that new-domain learning should be localized in dedicated capacity, leaving existing expert routing undisturbed. Extensive experimental results demonstrate that LEGO consistently improves the trade-off between acquiring new knowledge and preserving prior capabilities.
Multimodal STEM reasoning remains challenging because visual grounding and computation are often entangled in a single inference pass, leading to deterministic failures. We study whether explicitly decomposing these steps at test time can improve performance on multimodal physics and mathematics problems. To this end, we propose PRISM, a multi-agent framework that separates visual grounding, textual enrichment, and program-aided reasoning. Our results on the SeePhys dataset indicate that structured decomposition is most beneficial when visual dependency and reasoning complexity are high. These findings suggest that separating perception from reasoning can be a practical alternative to inference-time scaling for multimodal STEM tasks. Furthermore, we evaluate its generalization on the MATH-Vision benchmark for mathematical reasoning, demonstrating the robustness of our method.
GUI agents, vision-based models that control desktops, web browsers, and mobile devices through their graphical interfaces, promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag-grounding data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag-grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing, and slider manipulation. The data comprises 286K training screenshots and 3.5M training tasks, plus a 2,000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude, Gemini) and open-source (Qwen, Kimi, Holo3) models, as well as a VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.
Adaptive orchestration of heterogeneous agents requires making sequential delegation and aggregation decisions under uncertain and evolving agent behaviour, e.g., coordinating specialised AI models with varying reliability, cost, and response quality. While prior work on agent orchestration focuses on performance or cost, uncertainty in agent reliability and output distributions is typically not modelled explicitly at the orchestration level. In this work, we study the problem of adaptive orchestration of heterogeneous agents under uncertainty, where a meta-controller must decide when to delegate to a single agent and when to aggregate multiple agent outputs, accounting for reliability, cost, and uncertainty. We propose BOT-Orch, a lightweight framework that casts orchestration as a bandit problem over agents, regularized by OT distances between agent output distributions and task-specific reference distributions. We show that the regularised orchestration enjoys $\mathcal{O}(\sqrt{T})$ regret under standard assumptions, and provably induces preference ordering among agents with identical mean rewards but differing distributional alignment. Empirically, we demonstrate that BOT-Orch outperforms standard bandit and heuristic aggregation baselines in synthetic but adversarial task allocation settings with heterogeneous, non-i.i.d. agent behaviour.
3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D large vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present \textbf{AgentGrounder}, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and avoids prompts filled with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +1.5% on Nr3D, with a notable +7.6% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding.
Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. This bottleneck is a central obstacle for efficient multimodal AI systems, where video and world models must maintain long temporal context for interactive simulation, planning, and embodied agents. Methods to quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in the cached partition sum: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1 and HY-WorldPlay at INT2 quantization, our correction improves PSNR by up to 5.9~dB and raises VBench Score by 7.8 points, recovering much of the quality lost to quantization.
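A rough sketch of where such a correction can come from, under assumptions that are ours and not stated in the abstract: each cached key is stored as $\hat{k} = k + \epsilon$, the noise $\epsilon_i$ in coordinate $i$ is independent, zero-mean, and uniform on $[-\Delta_i/2, \Delta_i/2]$ for step size $\Delta_i$, and any $1/\sqrt{d}$ scaling is folded into the query $q$. The paper's exact correction may differ.

$$
\mathbb{E}\left[e^{q^\top \hat{k}}\right]
= e^{q^\top k} \prod_i \mathbb{E}\left[e^{q_i \epsilon_i}\right]
= e^{q^\top k} \prod_i \frac{\sinh(q_i \Delta_i / 2)}{q_i \Delta_i / 2}
\approx e^{q^\top k} \left(1 + \frac{1}{24} \sum_i q_i^2 \Delta_i^2\right)
\geq e^{q^\top k}
$$

$$
\tilde{s} = q^\top \hat{k} - \frac{1}{24} \sum_i q_i^2 \Delta_i^2
$$

The inequality is the Jensen inflation of the cached terms in the partition sum. To second order, subtracting the variance term from each cached score cancels it in expectation, and with a shared step size $\Delta$ the correction reduces to $\|q\|^2 \Delta^2 / 24$, which depends only on the step size and the query norm.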
Reinforcement learning methods such as GRPO have seen great popularity in LLM post-training. In GRPO, models produce completions to a set of prompts, which are rewarded, and the policy is updated towards the relatively high-reward completions. Due to the auto-regressive nature of models, the generation phase of this style of training can be extremely time consuming. As a solution, prior work has sought to distribute the inference step across many nodes working in parallel. These works primarily assume homogeneous models in training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with varying compute resources and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models, i.e., different models collaborating on the same tasks. However, this leads to highly off-policy samples during training, which prior work has shown can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS), a GRPO-style training paradigm that can use off-policy samples to improve each local model's learning. Our framework allows various models to collaborate in the same RL training run while being communication efficient. We extensively evaluate F-TIS in various heterogeneous setups and show that it exhibits identical final model convergence to purely on-policy training. Furthermore, in some setups we observe better generalization on out-of-distribution tasks than on-policy training, increasing the model's performance by up to 12\%.
Mastering long-horizon embodied tasks remains a fundamental challenge for AI, as current methods often fail due to noisy data or intractable reward engineering. We introduce ASH, a fully autonomous agentic system that overcomes these limitations without any human involvement: no reward shaping, no expert annotation, and no domain-specific data curation. When encountering an impasse, ASH uses its own experience to retrieve and learn from relevant internet video. Evaluated in Pokémon Emerald, a complex RPG spanning dozens of hours, ASH dramatically outperforms baselines: while behavioral cloning and general purpose foundation models (Qwen Team, 2026) collapse to near-zero milestone completion within the first few minutes, ASH sustains robust progression across multi-hour gameplay by continuously and autonomously acquiring new skills. This demonstrates that fully autonomous, self-improving agents are a scalable path for open-ended, long-horizon embodied learning.
Agentic planners must decide whether a fixed inference budget should buy more complete samples or a managed search frontier. We give a pre-deployment rule for this choice. The deployed LM, used as its own pruning verifier, is calibrated once to a precision $p$ on locally true-improving partial states. Combining $p$ with the mean horizon $\bar{L}$ yields a frontier-survival score $A_k \approx [1 - (1-p)^k]^{\bar{L}}$, and the smallest beam $k$ whose score clears a target success rate is the recommended width. The oracle labels used to measure $p$ appear only on a small calibration split, never inside the test-time solver. On our 100-maze MazeBench (4×4 to 6×6, generator-controlled), the rule predicts the useful regime at $k=3$ from a single calibration scalar $p=0.816$. Same-LM guided search then solves 98 of 100 mazes in 14.4 seconds per maze on one A100, while SC-10 solves 9 in 17.7 seconds. A Qwen2.5-3B holdout, a Qwen2.5-VL-7B run on rendered visual mazes (40/50 vs 1/50 for SC-10), and a same-LM two-ply Gomoku tournament (100/100 wins) all land on the safe side of the calibrated boundary, yielding a design map from $(p, \bar{L})$ to the minimum admissible beam across model size, modality, and task.
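The beam-selection rule can be illustrated with a minimal Python sketch. The score formula and $p = 0.816$ come from the abstract; the function names, the mean horizon of 10 steps, and the 0.9 target success rate are hypothetical values chosen only for illustration.

```python
def frontier_survival(p: float, k: int, mean_horizon: float) -> float:
    """Approximate score A_k = [1 - (1 - p)^k]^L for beam width k."""
    return (1.0 - (1.0 - p) ** k) ** mean_horizon


def min_beam(p: float, mean_horizon: float, target: float, k_max: int = 64):
    """Return the smallest beam width whose score clears the target, or None."""
    for k in range(1, k_max + 1):
        if frontier_survival(p, k, mean_horizon) >= target:
            return k
    return None


# p is the calibrated verifier precision from the abstract; the horizon (10)
# and target (0.9) are assumed values, not figures reported by the authors.
recommended_k = min_beam(p=0.816, mean_horizon=10, target=0.9)
print(recommended_k)
```

Under these assumed settings the scores for $k = 1, 2, 3$ are roughly 0.13, 0.71, and 0.94, so the loop stops at the first width that clears the target.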
Deploying retrieval-augmented generation (RAG) pipelines on commodity accelerators such as the NVIDIA T4 (16 GB VRAM) exposes a critical failure mode we term the Compression Paradox: neural prompt compressors introduce substantial latency overhead and VRAM contention that frequently exceed the generation time saved. We observe two mechanistically distinct failure modes. First, an Always-Raw pipeline — serving Llama-3-8B-Instruct-AWQ (Lin et al., 2024) via vLLM (Kwon et al., 2023) with no compressor — crashes on 65% of long inputs due to standard KV-cache exhaustion within vLLM's pre-allocated memory pool. Second, an Always-Neural pipeline — adding LLMLingua-2 as a co-resident PyTorch process — triggers fatal cross-process VRAM contention: vLLM's rigid 55% reservation starves the PyTorch encoder's dynamic allocator, producing CUDA OOM on long inputs even when the LLM's KV-cache pool is not itself full. Because Llama-3-8B-Instruct-AWQ alone occupies 9 GB, only ~7 GB remains for the KV cache and any auxiliary model. We propose the Tri-Metric Router, a deterministic, training-free dispatch policy — analytic and auxiliary-model-free at dispatch time, with no learned components and no auxiliary model training — that selects among Raw, Neural (LLMLingua-2), and Lexical (BM25) pipelines using three CPU-bound heuristics: spatial complexity L, syntactic density ρ_key, and Type-Token Ratio (TTR). The adaptation signal is hardware-physical (VRAM headroom, latency crossover) rather than semantic — an underexplored axis of efficient inference on commodity accelerators. Unlike prior adaptive-RAG work (Asai et al., 2024; Jeong et al., 2024) that adapts retrieval decisions, our router addresses a downstream hardware-constraint problem: which compression mechanism can be safely applied post-retrieval on a constrained GPU without OOM. 
Routing thresholds are derived from a 100-document profiling sweep on LongBench (qasper) (Bai et al., 2024) (crossover threshold from N = 21 Long-band documents), isolating a hardware-specific latency crossover at L* ≈ 4,332 words. This threshold is derived from an N = 21 profiling subsample and is specific to our T4 deployment; cross-hardware re-calibration is required. The router achieves a 0% OOM rate on the evaluated workload; deterministic VRAM safety is enforced only on the Long-band branch via BM25 with a hard 4,096-token truncation, while the Raw and Neural branches retain OOM-safety as an empirical claim on in-distribution inputs at B = 1. On out-of-distribution holdouts, the Tri-Metric router achieves 88.5% oracle alignment with the optimal pipeline — the headline operating metric — while the 99.0% figure on the in-distribution calibration set represents an upper bound of the threshold fit, not an operating result. The router further preserves a statistically stable 49.3% Combined F1 — establishing a rigorous practical operating point without auxiliary model training or GPU memory overhead. Scope: Evaluated on a single T4, Llama-3-8B-Instruct-AWQ, LLMLingua-2, batch size B = 1, and LongBench English QA; we position this as a reference calibration for the methodology, not a universal result.
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the hidden representations of VLAs and analyze attention maps. We also design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities.
Long-context LLM inference, increasingly common in agentic and reasoning workloads, is bottlenecked by attention: repeated KV-cache reads make decoding memory-bound. Self-speculative decoding alleviates this by drafting tokens with sparse attention and verifying them with full attention, but existing batched methods remain synchronized: all requests in a batch share a single draft-verify schedule, even though the optimal draft length varies widely across requests and changes dynamically within each request. We propose ASPIRE, a non-synchronized batched self-speculative decoding framework built on three components. First, a unified mixed forward allows drafting and verifying requests to coexist in the same batched forward pass, removing the need for global draft-verify phases. Second, a lightweight online speculation scheduler uses per-request acceptance-rate estimates and a batch-aware cost model to let each request independently choose when to verify. Third, an intra-draft refresh layer performs full attention at a single designated layer during drafting, updating the sparse context at every draft step so that it does not become stale. Across three models and five reasoning and long-context benchmarks, ASPIRE achieves $1.70$-$4.58\times$ decoding throughput over autoregressive baselines and consistently outperforms prior self-speculative methods. To promote reproducibility, we will release our code upon acceptance.
Agentic workflows orchestrate optional operators (verifiers, routers, retries, escalations) but commit to a fixed pipeline per task type. This is structurally suboptimal: different requests, models, and runtime conditions favor different operator subsets. Additionally, no single hand-written workflow is uniformly Pareto-optimal. We present \textbf{Dyserve}, a framework that generates the serving strategy as a \textit{structured subgraph} over a workflow abstraction by solving an integer linear program (ILP). The ILP jointly selects per-node execution options (model, verification, speculative execution, retry) weighted by offline-profiled coefficients for service quality and latency performance. On LiveCodeBench, the generated strategies match the best-verify accuracy with $\sim\!10\times$ latency reduction. Additionally, we describe an event-driven suffix-repair extension that resolves a smaller ILP over the residual workflow when high-impact runtime events fire.
Coding agents offer major efficiency gains for software engineering, but their effects on the downstream functionality and quality of a code base remain understudied. Existing evaluations largely focus on whether agents can resolve individual issues, overlooking how agent-authored code may affect future development by making the code harder to interpret, extend, or build on---the code's \emph{maintainability}. We design a controlled experiment to compare the maintainability of agent-authored versus human-authored code by constructing two-step pull-request chains in which the same downstream task is performed on top of either human- or agent-authored code. We find that agent-authored code consistently leads to more downstream failures than human-authored code, with absolute task resolve rate drops of 1.0 to 13.0 percentage points across Claude 4.5 Sonnet, GPT-5, GLM 4.7, and MiniMax 2.5, as well as larger increases across structural complexity and verbosity metrics. We report additional findings on differences in performance and code quality across task types and common failure modes. Our findings suggest that coding agents contribute to both downstream performance drops and code quality degradation, a critical consideration as agent use becomes more widespread and agent-authored code increasingly becomes the foundation for future development.
The growth of video content has created a strong demand for video summarization that finds key moments from numerous frames. Recent methods mainly use pretrained language models to estimate frame importance based on visual captions, but they often ignore non-visual cues such as speech and audio events. To address this, we propose SMART, a selective multimodal aggregation and temporal refinement framework for visual summarization. SMART introduces visual-guided modality selection to attend to relevant auxiliary cues, as well as progressive window attention to refine timestep-level representations over a broader temporal context. Experiments show that SMART outperforms 13 state-of-the-art methods on the widely-applied TVSum dataset, demonstrating the effectiveness of selective multimodal integration for video summarization.
Announcing an MCP tool in the system prompt cuts agent cost by 56% on Claude Sonnet 4.6 (N=30, cold cache); the default Claude Code deployment omits that announcement and captures only 19% of the gap (-10.9% vs. -56.3%). Same server, same execution, same cache regime; only the system-prompt addition differs. A pure-announcement variant (announcement text, no use-instruction) reaches -45.8%, 77% of the gap, attributing the rest to instruction priming. Practitioner benchmarks reporting 32-100x MCP savings (OnlyCLI 2025; Speakeasy 2025) measure the announced regime; users running default tooling do not. Prior benchmarks do not separate this from primitive type. To isolate it, we decompose six agent memory primitives along three axes (schema availability, schema discoverability, execution locality) and run controlled contrasts. CLI-vs-script holds locality fixed and varies only schema availability; eager-vs-lazy MCP holds the server fixed and varies only announcement. Across three task families (file report, multi-file Python refactor, code Q&A) at N=30 with per-task UUID prefix-cache defeat, CLI and eager MCP overlap at -56% on report and within 9pp on the other families: locality contributes no detectable savings on report or refactor and at most ~9pp on Code Q&A. A UserPromptSubmit hook reaches -80% by pre-executing the work entirely. Task quality is preserved across all primitives (oracle pass rates 1.00/1.00/0.99 on report/refactor/Code Q&A). An exact additive re-parameterization (three axes plus a pre-execution scaling term that captures the hook ceiling) decomposes per-primitive means with moderately stable axis coefficients across families; schema discoverability alone accounts for ~40% of baseline cost. Per-seed cost correlates with agent turn counts (Pearson r=0.92 / 0.88 / 0.71 on report / refactor / Code Q&A), consistent with the three axes acting through agent reasoning volume. 
The ordering replicates under warm cache on Claude Sonnet 4.6 and Opus 4.7, with one notable exception: Claude Haiku 4.5 discovers the lazy-MCP schema unaided (lazy collapses to -71%), so the discoverability axis is mediated by the model's tool-discovery behavior, not a fixed structural property. Benchmark and reference implementations released as supplementary material.
Efficient resource allocation is a critical capability for full autonomy in multi-agent systems. Existing tools typically provide indirect mechanisms for budget control, but do not optimize outcome quality under a fixed resource constraint. We introduce ZEBRA, a zero-shot budgeted resource allocation framework that models phase-level budget assignment as a continuous nonlinear knapsack problem. Rather than allocating budget directly, ZEBRA estimates phase utility curves and then computes the corresponding budget allocations algorithmically at inference time. We study two aggregations of phase utility -- an additive sum and a multiplicative product -- and unify them under a single dual-search solver. Through experiments on a balanced $150$-task APPS interview benchmark, both objectives outperform direct LLM allocation on every aggregate metric, and the gap scales with budget pressure: at $\alpha = 0.5$, ZEBRA retains $94.4\%$ of unconstrained quality versus $88.1\%$ for the LLM-direct baseline. These results suggest that lightweight algorithmic guidance -- without task-specific fine-tuning or reinforcement learning -- can improve the economic behavior of autonomous multi-agent systems.
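For the additive objective, a dual search over a continuous knapsack can be sketched as below. This is an illustration under our own assumptions, not ZEBRA's solver: each phase utility is taken to be the concave curve $u_i(b) = a_i (1 - e^{-b/c_i})$, and the names `allocate`, `a`, and `c` are hypothetical. The bisection looks for the multiplier $\lambda$ at which every funded phase has marginal utility $\lambda$ and the budget is exhausted.

```python
import math


def allocate(a, c, budget, iters=100):
    """Split `budget` across phases to maximize sum_i a_i * (1 - exp(-b_i / c_i)).

    Assumed utility family (not from the source). Setting the marginal utility
    (a_i / c_i) * exp(-b_i / c_i) equal to lam gives b_i = c_i * ln(a_i / (c_i * lam)),
    clipped at zero for phases whose initial marginal utility is below lam.
    """
    def spend(lam):
        return [max(0.0, ci * math.log(ai / (ci * lam))) for ai, ci in zip(a, c)]

    lo = 1e-12                                   # tiny multiplier: spends far too much
    hi = max(ai / ci for ai, ci in zip(a, c))    # largest marginal utility: spends nothing
    for _ in range(iters):
        mid = math.sqrt(lo * hi)                 # bisect on a log scale
        if sum(spend(mid)) > budget:
            lo = mid                             # over budget: raise the price
        else:
            hi = mid                             # feasible: try a lower price
    return spend(hi)


# Hypothetical three-phase example with a total budget of 3.0 units.
shares = allocate(a=[1.0, 2.0, 0.5], c=[1.0, 2.0, 1.0], budget=3.0)
print([round(s, 3) for s in shares], round(sum(shares), 6))
```

Returning the allocation at `hi` keeps the result on the feasible side of the budget, at the cost of leaving a negligible amount unspent.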
We define per-call Tool-Existence Hallucination Rate (TEHR) and run it over BFCL multi-turn against two populations: five Anthropic 4.x versions spanning an eleven-month release window, and the Qwen3-Instruct family at six sizes from 0.6B to 32B (4-bit MLX). The Anthropic side is quiet. Across 2,599 calls under an opaque baseline we log zero TEHR events, with a Clopper--Pearson 95\% upper bound of $\leq 0.115\%$. Qwen3 is not. Its rate is non-monotone in scale, rising from $0\%$ at 0.6B to $1.87\%$ at 14B and falling back to $0\%$ at 32B, while the dominant distractor type slides from near-name (1.7B) to matched-random (4B/8B) to synonym (14B). We propose Registry-Visible Reprompting (RVR), a training-free middleware that checks each proposed call against the runtime registry and, on a miss, returns a single re-prompt wrapping a structured \texttt{tool\_not\_found} envelope. RVR removes every one of the fourteen pooled Qwen3 fabrications (Fisher's exact one-sided, $p = 7.1 \times 10^{-5}$ on $14/973$ vs.\ $0/945$). A content-matched ablation that drops the registry list (C0.7; $0/253$ on Qwen3-8B, matching C1) localizes the effect to the envelope shape rather than to listing the actual tool names. Deployers can therefore ship the cheaper signal without echoing internal registry contents. On the Anthropic zero-event regime RVR also logs zero events, but the strict-pass subset slips at $N = 60$, so we recommend tier-conditional deployment. We argue tool-name hallucination here is a mid-scale Qwen3 phenomenon. A one-line structured reply is enough to remove it on the models where it appears.
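Both reported statistics can be checked with a short standard-library sketch. The function names are hypothetical, and reading the Clopper--Pearson figure as a one-sided bound is our assumption: with zero events in $n$ trials the upper limit has the closed form $1 - \alpha^{1/n}$, and the one-sided $\alpha = 0.05$ value appears to match the reported $0.115\%$. When one group has no events, the one-sided Fisher exact $p$-value reduces to a single hypergeometric term.

```python
from math import comb


def zero_event_upper_bound(n: int, alpha: float = 0.05) -> float:
    """One-sided Clopper-Pearson upper limit when 0 events are seen in n trials."""
    return 1.0 - alpha ** (1.0 / n)


def fisher_one_sided_all_in_first(k: int, n1: int, n2: int) -> float:
    """P(all k events fall in group 1) under the hypergeometric null.

    This equals the one-sided Fisher exact p-value for the table
    [[k, n1 - k], [0, n2]], since no more extreme table exists.
    """
    return comb(n1, k) / comb(n1 + n2, k)


# Counts taken from the abstract: 0/2,599 calls, and 14/973 vs. 0/945.
print(f"upper bound: {100 * zero_event_upper_bound(2599):.3f}%")
print(f"Fisher p: {fisher_one_sided_all_in_first(14, 973, 945):.2e}")
```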
Policy-gradient algorithms from TRPO through PPO and MPO treat explicit $D_{\mathrm{KL}}$-ball trust regions as the bedrock of stable learning under function approximation. We reconsider that design: must every update be explicitly KL-bounded, or can well-regulated entropy dynamics already deliver the stability often ascribed to trust regions? ACE ($\textbf{A}$daptive $\textbf{C}$ontrol of $\textbf{E}$ntropy) augments soft actor-critic with a winsorised twin-critic disagreement signal and an AdaGrad temperature rule for $\alpha_t$ in $\log$-space, using the same configuration on all twenty-one MuJoCo, DeepMind Control, and Gym benchmarks. When an auxiliary trust-region multiplier is included solely as a probe, its dual satisfies $\lambda_t = 0$ at every audited step among $3{,}500$ ablation measurements (seven environments, five seeds, one hundred checkpoints each); removing that term outright does no aggregate harm, whereas freezing $\alpha$ or dropping the disagreement signal breaks exploration-demanding regimes. Theorem~4.1 supplies a closed-form bound showing that sufficiently controlled entropy dynamics implicitly satisfy standard KL budgets, rationalising an inactive trust-region dual. Empirically, ACE achieves the best final return on twelve of twenty-one tasks, with the largest advantages where fixed-entropy baselines plateau.
Mixture-of-Agents (MoA) aggregates responses from several LLMs in a fixed number of layers, spending the same compute on every query. We extend MoA with a self-consistency verifier that decides, per query, when to stop aggregating and which agents to invoke; we call the result Compute-Aware MoA (CA-MoA). On GSM8K with a heterogeneous Llama-3.2-3B + Qwen2.5-3B pool, CA-MoA reaches $85.0\%$ accuracy. That is the best score in the experiment, equal to $4$-sample self-consistency, and $5$ points above fixed-depth MoA-1 at $1.4\times$ MoA-1's tokens. We also run two stress tests where this win is absent: a homogeneous Llama-3.2-3B$\times 3$ pool where MoA-1 already saturates the agents at $92\%$, and a mixed-capability pool where adding a smaller Gemma2-2B agent corrupts aggregation. Across all three settings, CA-MoA's verifier signal still tracks correctness: queries that the gating policy lets terminate after one layer are answered correctly $89$-$100\%$ of the time, while queries that exhaust the budget are answered correctly $0$-$40\%$ of the time. The within-MoA gain over MoA-1 moves from $-8$ to $+13.3$ to $+5$ as the pool becomes more aggregation-appropriate. We pre-register a $\sim$7B similar-capability follow-up at $n=200$.
Language-model agents are increasingly used as black-box optimization policies: they propose a code or configuration edit, evaluate it with a fixed harness, observe a scalar metric, and iterate. Serial agent loops use feedback efficiently but under-utilize parallel hardware, whereas naive parallel loops improve utilization but evaluate many candidates against stale or non-composable baselines. We formulate this tension as a scheduling problem over an asynchronous experiment lineage. We propose ALPS (Adaptive Lineage-Aware Parallel Search), a scheduler that combines a lineage selector, an operator-level bandit, a commit-hazard controller for adaptive parallelism, and a noise-aware promotion gate. ALPS treats the LLM as a candidate proposer and assigns all stateful decisions (dispatch, validation, commit, and rebase) to the scheduler. We evaluate ALPS on two tasks: a small-GPT autoresearch benchmark and a Qwen3 supervised fine-tuning data-mixture search. Across both tasks, we compare three policies (serial, naive parallel, and ALPS) under matched wall-clock budgets. Preliminary results suggest that lineage-aware scheduling can recover cumulative-improvement behavior while retaining the throughput advantages of parallel execution.
Polysemanticity, where a single neuron responds to multiple unrelated concepts, is a central obstacle in mechanistic interpretability, yet the field lacks a principled continuous scalar for it. We introduce the $\textbf{Dirichlet Process Polysemanticity Index}$ (DPMI), a per-neuron score that combines inferred component count and component separation by fitting a non-parametric DPGMM and weighting by mean pairwise Jensen-Shannon divergence. On a controlled toy benchmark, DPMI achieves Spearman $\rho = 0.755$ and AUROC $= 0.877$, outperforming seven baselines. Against an independent Fourier-analytic ground truth from a modular-arithmetic transformer, DPMI remains significant with $\rho = 0.255$ ($p < 10^{-8}$). Across six architectures, we find a robust cross-modal law: language models are significantly more polysemantic than vision models ($d = 0.803$, $p < 10^{-129}$). Ablations show the non-parametric prior is essential (removing it drops $\rho$ by $0.040$--$0.045$), and DPMI-guided SAE budget allocation improves reconstruction $R^2$ for the most polysemantic quartile by $+0.010$ at fixed compute.
A widespread assumption in mechanistic interpretability holds that chain-of-thought (CoT) reasoning unfolds through discrete, recoverable cognitive phases, a prediction that would enable phase-specific circuit analysis and steering interventions. We test this using Switching Linear Dynamical Systems (SLDS) applied to residual-stream activations of DeepSeek-R1-Distill-Llama-8B across 997 MATH-benchmark traces at layer~16, complemented by a boundary diagnostic and a variance-discrimination analysis. Phase boundaries produce statistically significant but metrically weak distributional shifts (PC2: Cohen's $d = -0.293$, $p = 8.5\times10^{-6}$), and PCA directions are statistically independent of phase-discriminative directions (Spearman $\rho = -0.025$, $p = 0.78$), explaining why standard dimensionality reduction systematically discards the phase signal. Across all three experimental conditions and hyperparameter regimes, SLDS fails categorically to recover phase sequences (NMI $\leq 0.005$); inferred states instead capture positional structure ($\chi^2 = 2343$, $p \approx 0$) and syntactic token-type patterns ($\chi^2 = 293$, $p < 10^{-44}$). We conclude that CoT reasoning is a \emph{continuous dynamical process}: discrete-phase interpretability frameworks will systematically underfit residual-stream dynamics, and continuous-trajectory approaches are necessary.
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$--$4.1\%$), with $33$--$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1--5 Likert scores providing theoretically guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p < 10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$--$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
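To make the split conformal construction in the LLM-as-judge abstract above concrete, here is a minimal sketch over 1--5 Likert scores. It assumes an absolute-residual nonconformity score between the judge's score and the human score, which is one common choice and may differ from the score the authors use; the calibration data and function names are invented for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    # Finite-sample corrected quantile: the ceil((n + 1)(1 - alpha))-th smallest
    # calibration score. If that rank exceeds n, the set must cover everything.
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def prediction_set(judge_score, q, labels=(1, 2, 3, 4, 5)):
    # Keep every Likert label whose distance to the judge's score is within q.
    return [k for k in labels if abs(k - judge_score) <= q]


if __name__ == "__main__":
    # Toy calibration split (hypothetical): judge scores and human reference scores.
    cal_judge = [4, 3, 5, 2, 4, 3, 1, 5, 4, 2]
    cal_human = [4, 4, 5, 3, 2, 3, 1, 4, 4, 2]
    residuals = [abs(h - j) for h, j in zip(cal_human, cal_judge)]
    q = conformal_quantile(residuals, alpha=0.2)
    print(q)                     # 1
    print(prediction_set(3, q))  # [2, 3, 4]
    print(prediction_set(5, q))  # [4, 5]
```

With this particular score the set width is the same for every test input up to clipping at the scale ends; an input-dependent width, as the abstract's per-instance indicator requires, needs a score that depends on the judge's per-input uncertainty.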
Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce $\textbf{Latent Phase-Shift Rollback}$ (LPSR): at each generation step, we monitor the residual stream at a critical layer $l_{\mathrm{crit}}$, detect abrupt directional reversals (phase shifts) via a cosine-similarity $+$ entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves $44.0\%$ on MATH-500 with an 8B model versus $28.8\%$ for standard AR ($+15.2$ pp; McNemar $\chi^2 = 66.96$, $p < 10^{-15}$). Critically, prompted self-correction, the most natural inference-time baseline, scores only $19.8\%$, below standard AR; LPSR exceeds it by $+24.2$ pp ($\chi^2 = 89.4$, $p \approx 0$). LPSR also outperforms Best-of-16 ($+7.8$ pp) at $5.4\times$ lower token cost, and surpasses a standard 70B model ($35.2\%$) with $8.75\times$ fewer parameters at ${\sim}3\times$ the token budget. A 32-layer sweep reveals a novel $\textbf{detection-correction dissociation}$: error-detection AUC peaks at layer 14 ($0.718$) but task accuracy peaks at layer 16 ($44.0\%$ vs. $29.2\%$), demonstrating that optimal monitoring depth differs for detection and correction.
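One way to picture the cosine-similarity plus entropy dual gate in the LPSR abstract above is the sketch below. Everything specific in it is an assumption: it compares consecutive residual-stream vectors, fires when their cosine similarity drops below a threshold while next-token entropy is above a threshold, and uses placeholder threshold values. The abstract does not state the gate's exact inputs, direction, or thresholds, and the rollback and steering steps are not shown.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def entropy(probs):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def phase_shift(h_prev, h_curr, next_token_probs, cos_thresh=0.0, ent_thresh=1.0):
    # Hypothetical dual gate: both conditions must hold for the gate to fire.
    reversed_direction = cosine(h_prev, h_curr) < cos_thresh
    uncertain = entropy(next_token_probs) > ent_thresh
    return reversed_direction and uncertain


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]   # entropy ln(4), above the threshold
    peaked = [0.97, 0.01, 0.01, 0.01]    # low entropy, below the threshold
    print(phase_shift([1.0, 0.0], [-1.0, 0.1], uniform))  # True
    print(phase_shift([1.0, 0.0], [1.0, 0.1], uniform))   # False: no reversal
    print(phase_shift([1.0, 0.0], [-1.0, 0.1], peaked))   # False: model is confident
```

Requiring both signals is meant to keep a gate from firing on ordinary topic changes (direction shifts with confident predictions) or on ordinary uncertainty (high entropy without a reversal).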
Multimodal agentic tasks---spanning web operations, software workflows, and tool-augmented reasoning over heterogeneous information sources---increasingly require coordinating multiple specialized agents rather than scaling a single generalist through longer reasoning chains. Designing such multi-agent systems remains combinatorially hard: existing approaches commit to fixed templates and domain-specific training, or deploy autonomous generalists at growing inference cost. We introduce TSAS (Two-Stage Agent Synthesis), a framework that generates task-specific multi-agent systems on demand. Given a task description, a single meta-agent call produces both a pool of specialized agents and a directed acyclic graph defining their coordination, without iterative search or additional training. On the GAIA benchmark, which probes long-horizon reasoning across web search, image, audio, and structured-file modalities, TSAS achieves 47.18% accuracy with GPT-5-mini at $31.58 per run, matching the top autonomous system while reducing cost by 61%, and outperforming agents built on Claude Opus~4 and Claude-4.5 Sonnet by over 16 percentage points. We further show that workflow generation can be distilled: a fine-tuned meta-agent (GraphGen-14B, based on Qwen2.5-Coder) reaches 57.57% on GAIA validation, more than doubling prior fine-tuned baselines. The code is available at https://anonymous.4open.science/r/aa3ADD.
Long-horizon robotic manipulation requires chaining semantically distinct primitives (reaching, grasping, moving, and placing), where per-primitive failures compound into end-to-end task breakdown. Existing hierarchical approaches use VLM planners to decompose instructions but route all primitives through a single generalist low-level controller, creating a planner-executor mismatch that limits reliability. We propose Specialist VLA, which retains a shared frozen TinyVLA-1.3B backbone while activating primitive-specific LoRA adapters selected by a Gemini-based planner, with dynamic re-querying every 30 control steps for mid-execution recovery. On a pick-and-place benchmark in Robosuite, Specialist VLA achieves 90% full-task success with re-querying versus 62% for a generalist baseline, demonstrating that primitive-level specialization and adaptive planning are complementary bottlenecks for long-horizon manipulation.
Compositional text-to-image generation requires satisfying multiple attribute constraints simultaneously, a task on which single-pass generation routinely fails. Two recent lines of work address this from opposite ends: agentic iterative refinement trains only the language planner while freezing the image generator, and unified policy optimization trains language and visual generation components but in a single generation round. We present NEXUS, a work-in-progress framework combining both properties: iterative, multi-round refinement with joint RL training of both the LLM planner (Qwen2.5-7B) and the reference-conditioned diffusion editor (FLUX.1-Kontext-dev) via a shared GRPO reward. A dual-channel Bridge routes continuous LLM hidden states and discrete text instructions to the editor's conditioning mechanism. Empirically, using three-seed means on the full 553-prompt GenEval set scored with our Qwen-VQA compositional score rather than the official GenEval evaluator: (1) zero-shot Gemini with the same iterative refinement loop reaches 0.740 and serves as the matched baseline; (2) NEXUS full reaches 0.734 at step 50 and 0.780 at step 100, surpassing this matched Gemini baseline by +4.0 pp at step 100; and (3) planner-only and no-Bridge variants do not exceed the matched Gemini baseline at step 100, highlighting the importance of the full co-adaptive system.
Vision-language models (VLMs) provide a promising starting point for multi-agent control because they can act before environment-specific training. However, it remains unclear how far this zero-shot advantage extends in partially observable cooperative settings, and how adapted VLM agents compare with specialized multi-agent reinforcement learning (MARL) policies. We study these questions in a cooperative pursuit benchmark with controlled distribution shifts spanning visual appearance, semantic remapping, observation layout, agent counts, and environment scale. We compare zero-shot VLMs, LoRA-adapted VLMs, cold-start MARL, and fully trained IPPO/MAPPO baselines under matched local-observation constraints. Results show that zero-shot VLMs provide useful behavior where cold-start RL is nearly non-functional, and that supervised adaptation with 1k--25k expert demonstrations rapidly produces competitive long-horizon controllers. Adapted VLMs are especially robust to visual and semantic shifts, while trained MARL remains strongest on the in-distribution task and some coordination-heavy variants. Overall, the results reveal complementary scaling behavior: VLMs adapt rapidly from limited expert data and generalize better across visual-semantic shifts, while MARL remains stronger when extensive task-specific interaction is available.